Recently, we published an introduction to data science in R for the beginner in programming. This is a complementary article written using the same approach, but this time focusing on Python, which is another open source programming language. You will learn how to use Python in a Jupyter notebook to manipulate a data set and visualise the results.
Python has an even larger following than R, so both articles should get the beginner up to speed in the two main languages for doing data science. The differences in the tutorials highlight how R and Python tackle the same task. Often it is your own experience, and the data science task that you want to do, that determines which language to choose.
The tutorial assumes some basic knowledge of Linux. Beyond that, we shall go through every step needed to set it up.
Install Anaconda for Linux
Anaconda is a package management system for Python aimed at individual users. We shall use this system for this tutorial. We recommend that you use it for all your data science projects because it handles package dependencies reliably. It also helps you reproduce your results by providing a separate Python environment for each of your data science projects.
Instructions for installing Anaconda on Linux can be found at https://docs.anaconda.com/anaconda/install/linux/. There are instructions for each popular Linux distribution.
Open up a fresh Linux terminal. To check that Anaconda has been installed correctly, ensure that you can run the conda command from the Linux terminal by typing conda info. The conda command is the main way to access the functionality in Anaconda. This particular conda subcommand should show a list of configuration settings, including the version of conda that has been installed.
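The same check can also be scripted. The following is a minimal Python sketch, not part of the original tutorial, that looks for conda on the PATH and reads its version from the output of conda info --json. The function name conda_version is our own, and the sketch assumes that the JSON output contains a conda_version field.

```python
import json
import shutil
import subprocess


def conda_version():
    """Return the conda version as a string, or None if it cannot be found."""
    # shutil.which returns None when no conda executable is on the PATH.
    if shutil.which("conda") is None:
        return None
    # Ask conda for its configuration settings in JSON form.
    result = subprocess.run(
        ["conda", "info", "--json"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    try:
        info = json.loads(result.stdout)
    except json.JSONDecodeError:
        return None
    # Assumption: the version is stored under the "conda_version" key.
    return info.get("conda_version")


if __name__ == "__main__":
    version = conda_version()
    if version is None:
        print("conda was not found; check your installation and PATH")
    else:
        print("conda version:", version)
```

If the script reports that conda was not found, the most likely cause is that the installer did not add conda to your PATH, or that the terminal was opened before the installation finished.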
More help on conda can be found in the conda cheat sheet.
Create a conda environment
Anaconda allows you to create a dedicated environment for the tutorial. In this environment, we shall install Python and the packages that are needed.
At the Linux terminal prompt, create an environment called intro by typing
conda create -c conda-forge -n intro python=3.8 jupyter plotnine
This should install Python version 3.8 and the Python packages jupyter and plotnine, as well as the dependencies of these packages. There should be no error messages.
You can see the list of your environments in anaconda by typing
conda env list
To activate the new environment and gain access to the newly installed Python packages, type at the terminal
conda activate intro
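To confirm that the activation worked, you could run a short Python script from the same terminal. This is a minimal sketch of our own, not part of the original tutorial: the function name describe_environment is hypothetical, and it relies on conda setting the CONDA_DEFAULT_ENV environment variable when an environment is activated.

```python
import os
import sys


def describe_environment():
    """Collect a few facts about the Python interpreter that is running."""
    return {
        # conda sets CONDA_DEFAULT_ENV when an environment is activated.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # sys.prefix is the directory of the current Python installation.
        "prefix": sys.prefix,
        "python_version": "{}.{}".format(
            sys.version_info.major, sys.version_info.minor
        ),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(key, "=", value)
```

With the intro environment active, you would expect conda_env to read intro, python_version to read 3.8, and prefix to point to a directory inside your Anaconda installation.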
From now on, any conda packages that you install are placed in the intro environment.
Start Jupyter in a new directory
Create a new directory for the tutorial called pythonintro on your file system and change to that directory using the following commands:
mkdir pythonintro
cd pythonintro
We shall use a Jupyter notebook for the tutorial that is based in this directory. To start Jupyter, type at the terminal
jupyter notebook
This should open up your web browser in Linux and show you the contents of the directory pythonintro, which is currently empty.
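Before starting, it can be worth checking from a first notebook cell that the packages are visible to Jupyter. The following is a minimal sketch, not part of the original tutorial. The helper missing_packages is a hypothetical name of our own, and pandas is included only because it is a dependency of plotnine.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # plotnine was installed with the environment; pandas is a dependency.
    missing = missing_packages(["plotnine", "pandas"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages found")
```

If any package is reported as missing, Jupyter was probably started outside the intro environment. In that case, close Jupyter, run conda activate intro and start it again.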
You are now ready to start the training session by going through our Data Science Tutorial For Python.