Introduction to Python for Data Science

Vin Busquet

Published in

Analytics Vidhya

6 min read · Jan 9, 2020

A collection of Jupyter notebooks that aims to teach you the basics of Python within a few hours.

My computer screen showing the source code of the python library cryptosteganography

What is Python?

Python is an interpreted, high-level and general-purpose programming language. It was created by Guido van Rossum, and released in 1991.

Its design philosophy emphasizes code readability by enforcing code indentation. Its language constructs and object-oriented approach are intended to help programmers write clear and logical code for small and large-scale projects.

The language’s name is a tribute to the British comedy group Monty Python. That humor also shows up in the occasionally playful approach of tutorials and reference materials, such as examples that refer to spam and eggs (from a famous Monty Python sketch) instead of the standard foo and bar.

Why Python?

Being both an interpreted and a dynamically typed language, Python generally runs much slower than compiled, statically typed languages such as C or C++.

Despite being slower than those languages, Python is widely used because:

It is easy to learn

Python has a gentle learning curve and an easy-to-understand syntax, so anyone aspiring to learn the language can pick it up quickly.

Python is more productive

It is a concise and expressive language that requires less time, effort and lines of code to perform the same operations than several other programming languages.
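As a small illustration of that conciseness, the sketch below counts word frequencies in a sentence using only the standard library. The sample text and variable names are made up for this example.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration.
text = "spam eggs spam spam eggs bacon"

# Split on whitespace and count each word in a single expression.
word_counts = Counter(text.split())

# The most common words, as (word, count) pairs.
print(word_counts.most_common(3))
```

The same task in a lower-level language typically needs an explicit hash table, a loop and manual string handling.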

Companies can optimize their employees' time

Execution speed does not matter as much as business speed. If a developer can code a solution several times faster than in another language, the company saves time and resources. And employee time is often the most expensive resource.

Fast innovation improves competitiveness

As it is generally faster to learn Python and to code solutions in it, new libraries and code contributions can be created more quickly, which makes the ecosystem more open to innovation.

Huge community

One of the main reasons for the phenomenal rise of Python is its ecosystem. For example, as Python extends its reach into the data science community, more and more volunteers are creating data science libraries. This, in turn, has led to some of the most modern tools and processing techniques being created in Python.

Extending Python with C or C++ is easy

It is quite easy to add new built-in modules to Python, if you know how to program in C. Such extension modules can do two things that can’t be done directly in Python: they can implement new built-in object types, and they can call C library functions and system calls.

This way, tasks for which execution speed is critical can be coded in C and exposed to Python as a built-in module, to be called inside a Python program as if it were a pure Python module.

In fact, several Python libraries for mathematics, scientific computing, data science and other fields that require fast performance are coded in C/C++ and exposed to Python as modules.

For all those reasons, and possibly more than I have listed here, Python is widely adopted by Fortune 500 companies and the world’s top universities. It is also an extremely popular language within the data science community, which is the primary audience of this article.
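To make this concrete, here is a minimal sketch assuming the standard CPython interpreter, where the math module is implemented in C. The helper pure_python_sqrt is a hypothetical name written only for comparison; it is not part of any library.

```python
import math

def pure_python_sqrt(x):
    """A function written in plain Python, for comparison."""
    return x ** 0.5

# Both functions are called in exactly the same way.
print(math.sqrt(16.0))
print(pure_python_sqrt(16.0))

# In CPython, a function implemented in C reports a different type
# than a function defined with `def` (assumption: CPython is in use).
print(type(math.sqrt).__name__)
print(type(pure_python_sqrt).__name__)
```

From the caller's point of view there is no difference between the two, which is what lets libraries move their hot paths to C without changing how you use them.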

Versions of Python

There are two popular versions of the Python programming language in use at the time of this publication: Python 2 and Python 3.

Support for Python 2 ended on January 1, 2020. The message the Python Software Foundation is trying to make loud and clear is that developers should transition to Python 3 as soon as possible:

"We have decided that January 1, 2020, will be the day that we sunset Python 2. That means that we will not improve it anymore after that day, even if someone finds a security problem in it. You should upgrade to Python 3 as soon as you can."

Python 3 was released at the end of 2008 and, from the very beginning, was meant to break with the past as the only way to fix a number of flaws in Python 2 and move the language forward.
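Two of the best-known changes can be sketched in a few lines. This is only an illustration of Python 3 behavior, not a full list of differences, and the variable names are made up for the example.

```python
# print is a function in Python 3 (it was a statement in Python 2).
print("Hello, Python 3")

# The / operator always performs true division in Python 3;
# in Python 2, dividing two integers truncated the result.
true_division = 7 / 2

# The // operator performs floor division in both versions.
floor_division = 7 // 2

print(true_division, floor_division)
```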

The Jupyter Notebook

Instead of installing and managing Python in your local environment, we will follow the lessons in Jupyter Notebook, an interactive environment.

Jupyter is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Notebook documents are both human-readable documents containing the analysis description and the results (figures, tables, etc.) and executable documents that can be run to perform data analysis.

Jupyter supports over 40 programming languages, one at a time (it doesn’t allow multiple runtimes in the same document). You can read more about it on the official website: https://jupyter.org/

The collection of lessons provided in this article was written in Jupyter and is intended to run in the cloud via Google Colaboratory (Colab).

Google Colab

Colaboratory is a Google research project created to help disseminate machine learning education and research. It’s a Jupyter notebook environment that requires no setup to use and runs entirely in the cloud. One of its main advantages is that it provides support for GPU runtimes, which is very helpful for machine learning.

Another of its most interesting features is the GitHub integration, which allows both loading notebooks from public GitHub repositories and saving notebooks to GitHub. That is the main reason I am using it to deploy the notebooks. Opening a notebook creates a copy of it from the public GitHub repository in your Google Drive account, allowing you to change, interact with and save your own version of the notebook.

Notebooks

The collection of Jupyter notebooks is intended to provide an introduction to the Python programming language.

Although this collection is aimed at beginner data science students, I find it very useful for any beginner in Python programming.

All notebooks were developed and released by IBM Cognitive Class, with some changes, code updates and other customizations made by me.

The notebooks are divided into the following topics, each containing lessons with an estimated completion time.

Python Basics

This section covers the Python basics: print, import, types, expressions and strings.

Your first program — 10 min
Types — 10 min
Expressions and Variables — 10 min
String Operations — 15 min

Total estimated time needed: 45 min
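To give a flavor of these topics, here is a minimal sketch of my own (not taken from the notebooks) that touches each one. All values and variable names are made up for illustration.

```python
import math  # import makes a standard-library module available

print("Hello, Python!")  # the classic first program

# Types: type() reports the type of any value.
print(type(11), type(2.5), type("spam"))

# Expressions and variables.
radius = 2
area = math.pi * radius ** 2
total_minutes = 10 + 10 + 10 + 15

# String operations: slicing and methods.
name = "Monty Python"
first_word = name[0:5]
shouted = name.upper()
replaced = name.replace("Monty", "Flying")

print(area, total_minutes, first_word, shouted, replaced)
```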

Python Data Structures

This section covers the main Python data structures.

Tuples — 15 min
Lists — 15 min
Dictionaries — 20 min
Sets — 20 min

Total estimated time needed: 70 min
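As a quick preview, the four structures can be sketched as follows. This is an illustrative example of my own with made-up data, not code from the notebooks.

```python
# Tuple: an ordered, immutable sequence.
ratings = (10, 9, 6)

# List: an ordered, mutable sequence.
breakfast = ["spam", "eggs"]
breakfast.append("bacon")

# Dictionary: maps unique keys to values (hypothetical prices).
menu_prices = {"spam": 2.5, "eggs": 1.75}
menu_prices["bacon"] = 3.0

# Set: an unordered collection of unique items.
unique_items = set(["spam", "eggs", "spam"])

print(ratings[0], breakfast, menu_prices["eggs"], len(unique_items))
```

Trying to assign to ratings[0] would raise a TypeError, which is the practical difference between a tuple and a list.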

Python Programming Fundamentals

This section covers the fundamentals of the Python language: logic and control structures, functions, and object-oriented programming.

Conditions and Branching — 15 min
Loops — 20 min
Functions — 40 min
Classes and Objects — 40 min

Total estimated time needed: 115 min
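The sketch below combines the four topics in one small program. The function describe, the class Lesson and the thresholds are hypothetical, invented only for this illustration.

```python
def describe(minutes):
    """Return a label for a lesson based on its length (branching)."""
    if minutes >= 40:
        return "long"
    elif minutes >= 20:
        return "medium"
    else:
        return "short"


class Lesson:
    """A tiny class bundling data (attributes) with behavior (methods)."""

    def __init__(self, title, minutes):
        self.title = title
        self.minutes = minutes

    def summary(self):
        return f"{self.title}: {self.minutes} min ({describe(self.minutes)})"


lessons = [Lesson("Loops", 20), Lesson("Functions", 40)]

# A for loop visits each item in the list in order.
for lesson in lessons:
    print(lesson.summary())
```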

Files

This section covers the basics of file handling in Python.

Reading files with open — 40 min
Writing files with open — 15 min

Total estimated time needed: 55 min
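A minimal sketch of writing and then reading a file with open is shown below. The file name example.txt is made up, and the example uses a temporary directory (my own choice, not something the notebooks require) so that it leaves nothing behind.

```python
import os
import tempfile

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")

    # Mode "w" creates (or overwrites) the file for writing.
    with open(path, "w") as f:
        f.write("spam\n")
        f.write("eggs\n")

    # Mode "r" opens the file for reading; iterate over it line by line.
    with open(path, "r") as f:
        lines = [line.strip() for line in f]

print(lines)
```

The with statement closes the file automatically, even if an error occurs inside the block.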

Python Data Analysis Library (Pandas)

This section covers an introduction to pandas, an open source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Introduction to Pandas Python — 15 min
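As a small taste of pandas, the sketch below builds a DataFrame from some of the lesson times listed in this article, filters it and computes an average. It assumes pandas is installed (it is preinstalled on Colab); the variable names are my own.

```python
import pandas as pd

# Lesson times taken from the Files and NumPy sections of this article.
df = pd.DataFrame({
    "lesson": ["Reading files", "Writing files", "1D NumPy", "2D NumPy"],
    "minutes": [40, 15, 30, 20],
})

# Boolean indexing keeps only the rows that satisfy the condition.
long_lessons = df[df["minutes"] >= 30]

# Column-wise aggregation.
mean_minutes = df["minutes"].mean()

print(long_lessons)
print(mean_minutes)
```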

NumPy

This section covers an introduction to NumPy, the fundamental package for scientific computing with Python.

NumPy makes it easier to do many operations that are commonly performed in data science. The same operations are usually computationally faster and require less memory in NumPy compared to regular Python.

1D NumPy in Python — 30 min
2D NumPy in Python — 20 min

Total estimated time needed: 50 min
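The sketch below previews the kind of 1D and 2D operations involved. It assumes NumPy is installed (it is preinstalled on Colab), and the arrays are made up for illustration.

```python
import numpy as np

# 1D arrays: arithmetic is applied element by element, with no explicit loop.
u = np.array([1, 0])
v = np.array([0, 1])
z = u + v
scaled = 2 * u
dot = np.dot(u, v)

# 2D arrays: shape, transpose and matrix-vector multiplication.
A = np.array([[1, 2], [3, 4]])
row_sums = A @ np.array([1, 1])

print(z, scaled, dot)
print(A.shape, A.T.tolist(), row_sums)
```

Because the loops run inside NumPy's compiled code rather than in the Python interpreter, the same operations on large arrays are usually much faster than equivalent list-based code.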

I hope these resources help you on your journey to becoming a better Python programmer.

References

Notebooks Source Code