Data Science with Python

Introduction to Python for Data Science

Tutorial configuration

Recently, we published an introduction to data science in R for beginners in programming. This complementary article takes the same approach but focuses on Python, another open source programming language. You will learn how to use Python in a Jupyter notebook to manipulate a data set and visualise the results.

Python has an even larger following than R, so both articles should get the beginner up to speed in the two main languages for doing data science. The differences in the tutorials highlight how R and Python tackle the same task. Often it is your own experience, and the data science task that you want to do, that determines which language to choose.

The tutorial requires some basic knowledge of Linux. Beyond that, we go through every step needed to set up the tutorial.

Install Anaconda for Linux

Anaconda is a Python distribution aimed at individual users that comes with its own package and environment manager, conda. We shall use it for this tutorial, and we recommend it for all your data science projects because it resolves package dependencies for you. It also helps you to reproduce your results by letting you keep a separate Python environment for each of your data science projects.

Instructions for installing Anaconda on Linux can be found at https://docs.anaconda.com/anaconda/install/linux/. There are instructions for each popular Linux distribution.

Open up a fresh Linux terminal. To check that Anaconda has been installed correctly, ensure that you can run the conda command by typing conda info at the terminal. The conda command is the main way to access the functionality in Anaconda. This particular command should show a list of configuration settings, including the version of conda that has been installed.
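If you prefer to run the same check from Python, the following is a minimal sketch rather than an official Anaconda tool. It uses only the standard library, the helper name conda_version is our own, and it calls conda --version, which prints just the version string instead of the full conda info listing.

```python
import shutil
import subprocess


def conda_version(executable="conda"):
    """Return the output of `conda --version`, or None if conda is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Read both streams in case the version is not written to stdout.
    return (result.stdout or result.stderr).strip()


version = conda_version()
print(version if version else "conda was not found on PATH")
```

If the message says that conda was not found, open a new terminal so that the changes made by the installer are picked up, and try again.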

More help on conda can be found in the conda cheat sheet.

Create a conda environment

Anaconda allows you to create a dedicated environment for the tutorial. In this environment, we shall install Python and the packages that are needed.

At the Linux terminal prompt, create an environment called intro by typing

conda create -c conda-forge -n intro python=3.8 jupyter plotnine

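This should install Python version 3.8 and the Python packages jupyter and plotnine, as well as the dependencies of these packages. The -c conda-forge option tells conda to fetch the packages from the conda-forge channel, and -n intro sets the name of the environment. There should be no error messages.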

You can see the list of your conda environments by typing

conda env list

To activate the new environment and gain access to the newly installed Python packages, type at the terminal

conda activate intro

From now on, any conda packages that you install while this environment is active are placed in the intro environment.
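A quick way to confirm that the activation worked is to start python in the same terminal and run a short check such as the sketch below. It assumes that conda sets the CONDA_DEFAULT_ENV environment variable on activation and that the Jupyter Notebook server is importable under the name notebook. The helper missing_packages is a hypothetical name used for illustration only.

```python
import importlib.util
import os
import sys


def missing_packages(names):
    """Return the names from `names` that Python cannot locate for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: conda exports CONDA_DEFAULT_ENV when an environment is active.
print("Active environment:", os.environ.get("CONDA_DEFAULT_ENV", "unknown"))
print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))

# Assumption: `notebook` is the import name of the Jupyter Notebook server.
missing = missing_packages(["notebook", "plotnine"])
print("Missing packages:", ", ".join(missing) if missing else "none")
```

In the intro environment we would expect the environment name to be intro, the Python version to start with 3.8, and no packages to be reported as missing.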

Start Jupyter in a new directory

Create a new directory for the tutorial called pythonintro on your file system and change to that directory using the following commands:

mkdir pythonintro

cd pythonintro

We shall use a Jupyter notebook for the tutorial that is based in this directory. To start Jupyter, type at the terminal

jupyter notebook

This should open your web browser and show you the contents of the pythonintro directory, which is currently empty.
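Once you have created a notebook, you can confirm from its first cell that it is running in the tutorial directory. The sketch below uses only the standard library, and list_directory is an illustrative helper name, not part of Jupyter.

```python
from pathlib import Path


def list_directory(path=None):
    """Return the sorted names of the entries in `path` (default: current directory)."""
    directory = Path(path) if path is not None else Path.cwd()
    return sorted(entry.name for entry in directory.iterdir())


# The last component of the working directory should be the tutorial directory.
print("Working directory:", Path.cwd().name)
print("Contents:", list_directory())
```

The listing should no longer be empty at this point, because it includes the notebook file that you have just created.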

You are now ready to start the training session by going through our Data Science Tutorial For Python.
