Setting up a Python + Postgres environment for Data Science

Victoria Perez Mola
Published in The Startup · Sep 30, 2019

This is an explanation of how I set up my Python environment using Anaconda, but I’ll also include the alternative of not using it. I use elementary OS Juno, which is based on Ubuntu 18.04, and this will be a local environment, so all the commands and instructions are for Ubuntu 18.04.

Components of the environment

  • Python 3.7
  • JupyterLab, NumPy, Pandas, Matplotlib, csvkit, PostgreSQL.

Optional:

  • Anaconda

Anaconda or not

I’ll explain how to set up the environment both with pip, the most widely used package installer, and with Anaconda, a distribution created for Python/R data science purposes.

I tried both ways and so far Anaconda has given me the easiest installation experience. In just a few minutes you have a whole environment ready to go.

Here is a comparison between both alternatives.

With Anaconda

  1. Download the official Anaconda installer for Linux.
  2. Install the system packages Anaconda requires:
sudo apt-get install libgl1-mesa-glx libegl1-mesa libxrandr2 libxss1 libxcursor1 libxcomposite1 libasound2 libxi6 libxtst6

3. Run the downloaded installer. (Check that the filename matches the version you downloaded!)

bash ~/Downloads/Anaconda3-2019.07-Linux-x86_64.sh
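
Optionally, before running the installer, you can check the download against the SHA-256 hash published on Anaconda’s site. The filename below assumes the same 2019.07 release used above:

sha256sum ~/Downloads/Anaconda3-2019.07-Linux-x86_64.sh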

Once the installation finishes, you’ll get the following prompt: “Do you wish the installer to initialize Anaconda3 by running conda init?”. Answer “yes”.

You’ll have to close and reopen the terminal, after which you’ll find conda’s (base) environment activated. To avoid this, you can disable that behavior with the command:

conda config --set auto_activate_base false

To activate the base environment, enter the following command:

conda activate base

Once activated, you can launch Anaconda Navigator, a graphical interface to the conda package manager and environment manager:

anaconda-navigator
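
Rather than working in base, a common next step is to create a dedicated environment for your projects. A minimal sketch, where the environment name ds is just a placeholder I chose:

# create an isolated environment with Python 3.7 and the analysis libraries
conda create -n ds python=3.7 numpy pandas matplotlib scipy scikit-learn
# switch into it; conda installs now land inside "ds"
conda activate ds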

Without Anaconda

  1. Install Python 3.7.

Note that I’m using a third-party PPA, deadsnakes, because it’s the easiest route. On the official website you can find the instructions to install it from source.
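
With the deadsnakes PPA, the installation looks roughly like this:

# add the deadsnakes PPA and install Python 3.7 from it
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update
sudo apt-get install python3.7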

2. Install pip.

Pip is a package management system, and it will be our alternative to Anaconda’s package manager.
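
On Ubuntu 18.04, pip for Python 3 comes from the system repositories. Note that the pip3 command is tied to the system’s default python3; to be sure you’re installing into the Python 3.7 from the previous step, you can invoke it as python3.7 -m pip:

sudo apt-get install python3-pip
pip3 --version              # pip for the system's default python3
python3.7 -m pip --version  # the same pip module run under Python 3.7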

Note: While using pip, it is good practice to use virtualenv/venv. These are utilities for creating isolated Python environments.
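
A minimal sketch of that workflow, assuming the deadsnakes Python from step 1 (the python3.7-venv package provides the venv module, and ~/envs/ds is just a path I chose):

sudo apt-get install python3.7-venv
python3.7 -m venv ~/envs/ds       # create an isolated environment
source ~/envs/ds/bin/activate     # activate it; pip now installs inside it
deactivate                        # leave the environment when you're done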

3. Install PostgreSQL
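
On Ubuntu this is a single apt install; postgresql-contrib adds a set of commonly used extensions. A quick sketch, including a first check that the server is running:

sudo apt-get install postgresql postgresql-contrib
# connect as the default "postgres" superuser and ask for the version
sudo -u postgres psql -c "SELECT version();"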

4. Install Python libraries

We’ll install NumPy, Pandas, Matplotlib, and SciPy for data analysis and visualization, and scikit-learn for machine learning.
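
With pip that’s a single command. I’m also adding csvkit from the components list above and, as an extra assumption on my part, psycopg2-binary so Python can talk to Postgres:

pip3 install numpy pandas matplotlib scipy scikit-learn csvkit
# optional: the driver SQLAlchemy (and through it, pandas) uses for PostgreSQL
pip3 install psycopg2-binary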

5. Install Jupyter Notebook and JupyterLab

sudo pip3 install jupyter jupyterlab
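
Once installed, JupyterLab starts from the terminal and serves a local web interface (by default at http://localhost:8888):

jupyter lab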

Additional tools

Other tools that can be installed alongside this setup are the following:

  1. pgAdmin or SQL Workbench: these let you manage your databases in a more graphical way instead of always depending on the terminal (see the install sketch after this list).
  2. PyCharm: it’s always nice to use an IDE, and it’s certainly more user-friendly than Jupyter for developers changing stacks. I recommend PyCharm, but there are several options you could choose from.
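
As a sketch of the quickest route to a graphical client on Ubuntu 18.04: pgadmin3 is the (dated) version packaged in the standard repositories, while the newer pgAdmin 4 needs the project’s own repository added first.

sudo apt-get install pgadmin3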

How it all works together

In the following graph I briefly explain which tool takes part in which stage of the data science workflow.
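
As a small concrete example of the pieces cooperating, here’s a sketch that uses csvkit to load a CSV straight into Postgres, ready to be queried from a notebook. data.csv and mydb are placeholders, and csvsql names the table after the file:

# infer a schema from the CSV, create the table, and insert the rows
csvsql --db postgresql:///mydb --insert data.csv
# sanity-check the load from psql
psql mydb -c "SELECT COUNT(*) FROM data;"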

More about the components of the workflow can be read in the book “R for Data Science” by Garrett Grolemund and Hadley Wickham.

Side note: Why I chose this setup

Here I’ll link to some useful resources to help you understand a little how the setup came together.

Why Python?

In my case, the selection of Python is mostly because I like it. The first time I used Python was about 5 or 6 years ago, to present it as an alternative to Matlab by replicating a project I had developed during a course. I used Anaconda back then, and I wanted to compare the experience now.

I also added tools that I’ve discovered in courses I’m following and through a lot of reading on the subject.

If you’re here, it’s probably because you have also chosen Python, and you already know that the decision, when it comes to data analysis and data science, mainly comes down to Python or R… or both. If you’re not, here’s an introduction to the topic:

Why PostgreSQL?

After choosing Python, Postgres is a no-brainer: Python and Postgres appear together in most job descriptions, blogs, and courses. Postgres is a very powerful (and open source!) relational database, and it offers a lot of advantages.

Why local?

It’s more comfortable. I set up and maintained a small server for a few months, and now I’ve decided to move local because I don’t see any benefits for my case. Some of the reasons to set up a server would be:

  • Operating system: Even though these tools can be installed and used on Windows, Linux, or Mac, most people recommend the last two. With a remote server you can easily run any Linux distribution and keep using Windows as the OS of your choice.
  • Scalability: With a local environment you have a physical limitation, and if it’s not enough you need a new computer. With a remote server you just add more of whatever you need.
  • Accessibility: The remote server can be accessed from anywhere. You can access your tools from any computer with just your credentials and start working. This is also a great advantage for collaboration if more than one person will be working on the project.
  • Automation: The remote server is always running so if you need to automate any script you won’t need to have your computer on for it to run.
  • Freedom: This is related to the first point. Since this is not your personal OS you have nothing to break (as long as you back up your code). You can install, uninstall and try as many things as you like without fear of breaking your computer. If you find that something is not working anymore you can easily restore the whole server to the initial image.
