Python Fundamentals for Data Science

Kashish Mogha
May 12, 2020

Before we get into the subject of this article, basic Python skills for data science, I should say that I expect you have already studied some basic programming. I’d also like you to take a moment to reflect on what you think data science is. If someone you ran into asked you what data science was all about, what would you tell them?

The history of data science goes back further than 2004, which is where Google’s search term history begins, but the search data at least gives a sense of how popular the area is now. I think that popularity comes from the networked, data-driven society we find ourselves living in. When people think of the term data scientist, they tend to think of Google or Amazon or Facebook, places with big artificial intelligence research teams, and these are certainly amazing companies doing great things with data science.

But data scientists aren’t limited to careers at tech companies. Data science is also one of those areas where you can ask ten people and get ten different answers.

In layman’s terms, data science is the process of using data to understand different things.

Okay, so let’s jump right into the topic and start talking about the Python programming language.

There are many other tools that one can use in data science, such as specialized statistical analysis languages like R, or more general-purpose programming languages like Java and C. I chose Python as the basis for this article for three reasons.

First, it’s easy to learn. Python is now the language of choice for introducing university students to programming. It’s used in eight out of 10 of the US’s top computer science programs. If you have programming experience, but not Python-specific experience, you can pick up Python very quickly.

Second, it’s full featured. Python is a very general programming language with a lot of built-in libraries.

Finally, Python has a significant set of data science libraries one can use.

I’m going to provide a very basic overview of the Python programming language. Okay, so let’s jump in with an example.

Python requires very little boilerplate code. In fact, if you just want to set the values of a couple of variables and output the result of adding them together, you can do so in three lines.

In this example, I’ll write three statements. The first two set the variables x and y, each to be some integer value. Then we’ll do some addition.
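A minimal sketch of those three statements might look like this. The values 1 and 2 are arbitrary choices for illustration.

```python
x = 1
y = 2
print(x + y)  # prints 3
```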

Note: Python doesn’t require keywords like var to declare a variable name, or semicolons at the end of lines, both of which are common in other languages. Python uses whitespace (indentation) to determine the scope of functions and loops, and end-of-line markers to determine where statements end.

The Python language has a built-in function called type which shows you the type of a given reference. Some common types include strings, integers, floats, functions, and the None type.
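As a quick illustration, here is type applied to a handful of values. The values themselves are arbitrary examples, not something specific to any dataset.

```python
string_type = type("This is a string")
none_type = type(None)
int_type = type(1)
float_type = type(1.0)

print(string_type)  # <class 'str'>
print(none_type)    # <class 'NoneType'>
print(int_type)     # <class 'int'>
print(float_type)   # <class 'float'>

# Functions are objects too, so they also have a type.
print(type(len))    # <class 'builtin_function_or_method'>
```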

Typed objects have properties associated with them, and these properties can be data or functions. A lot of Python is built around different kinds of sequences or collection types. There are three native kinds of collections that we will look at: tuples, lists, and dictionaries.

A tuple is a sequence of values which is itself immutable (it cannot be changed once created). We write tuples using parentheses. For example, a tuple might hold four items, two of them numbers and two of them strings.
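Here is a small sketch of such a tuple. The variable name and the values are made up for illustration; the last few lines show what happens if we try to change an item.

```python
mixed = (1, "a", 2, "b")

print(type(mixed))  # <class 'tuple'>
print(len(mixed))   # 4
print(mixed[1])     # a

# Tuples are immutable, so item assignment raises a TypeError.
try:
    mixed[0] = 99
except TypeError as error:
    print("Cannot modify a tuple:", error)
```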

Lists are very similar, but they are mutable, so we can change their length and the values of their elements. A list is declared using square brackets.

Lists support basic operations such as appending, iterating, and indexing.
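The operations mentioned above could be sketched as follows. The list contents are arbitrary and only meant to show the syntax.

```python
items = [1, "a", 2, "b"]

# Appending changes the list in place.
items.append(3.3)
print(len(items))  # 5

# Iterating visits each element in order.
for item in items:
    print(item)

# Indexing starts at zero.
first = items[0]
print(first)  # 1

# Lists can be concatenated with + and searched with in.
combined = [1, 2] + [3, 4]
print(combined)       # [1, 2, 3, 4]
print(1 in combined)  # True
```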

Perhaps the most interesting operation we can do with lists is called slicing. The square bracket syntax for accessing a single element will look familiar from other languages. In Python, the indexing operator also accepts multiple values: the first is the starting location, and the second is the end of the slice, which is not included in the result. Index values can also be negative, which is really cool: a negative index counts from the back of the list or string.
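A short sketch of slicing, using a string and a list that I made up for illustration. Slicing works the same way on both.

```python
text = "This is a string"

print(text[0])      # T
print(text[0:2])    # Th  (index 2 is excluded)
print(text[-1])     # g   (last character)
print(text[-4:-2])  # ri
print(text[:3])     # Thi (start defaults to 0)
print(text[3:])     # s is a string

numbers = [10, 20, 30, 40, 50]
middle = numbers[1:3]
tail = numbers[-2:]
print(middle)  # [20, 30]
print(tail)    # [40, 50]
```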

Now let’s talk about dictionaries.

Dictionaries are similar to lists and tuples in that they hold a collection of items, but they’re labeled collections: items are looked up by key rather than by position. (Since Python 3.7, dictionaries do preserve insertion order, but you still access items by key.) This means that for each value you insert into the dictionary, you must also give a key, which you later use to get that value out. In other languages, this structure is often called a map. In Python, we use curly braces to denote a dictionary.

Here is an example where we might link names to email addresses.
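A sketch of such a dictionary is below. The names and addresses are invented placeholders on the reserved example.com domain.

```python
emails = {
    "Alex": "alex@example.com",
    "Sam": "sam@example.com",
}

# Look up a value by its key.
print(emails["Alex"])  # alex@example.com

# Add a new key/value pair.
emails["Jordan"] = "jordan@example.com"

# Iterate over keys, over values, or over both at once.
for name in emails:
    print(name)
for address in emails.values():
    print(address)
for name, address in emails.items():
    print(name, address)
```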

String Formatting in Python:

Imagine we have purchase order details in a dictionary, which includes a number of items, a price, and a person’s name. We can write a sales statement string that marks where these values go using curly brackets. We can then call the format method on that string and pass in the values that we want substituted. The string formatting language allows us to do much more than this: we can control things like the number of decimal places for floating point numbers, whether to prepend positive numbers with a plus sign, or whether strings are left or right justified.
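Here is one way this could look. The dictionary keys, the values, and the wording of the sales statement are assumptions I picked for the example.

```python
sales_record = {"price": 3.24, "num_items": 4, "person": "Chris"}

sales_statement = "{} bought {} item(s) at a price of {:.2f} each for a total of {:.2f}"

message = sales_statement.format(
    sales_record["person"],
    sales_record["num_items"],
    sales_record["price"],
    sales_record["num_items"] * sales_record["price"],
)
print(message)
# Chris bought 4 item(s) at a price of 3.24 each for a total of 12.96

# A few other format specifiers:
signed = "{:+d}".format(4)          # force a plus sign on positive numbers
right = "{:>8}".format("Chris")     # right justify in a field of width 8
left = "{:<8}|".format("Chris")     # left justify in a field of width 8
print(signed)  # +4
print(right)   #    Chris
print(left)    # Chris   |
```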

String manipulation is a big part of data cleaning.

Reading & Writing CSV Files:

Let’s learn the basics of iterating through a CSV file to create dictionaries and collect summary statistics.

First, let’s import the csv module, which will assist us in reading our CSV file. Using some IPython magic (%precision 2), let’s set the floating point precision for printing to 2, then read in our mpg.csv using csv.DictReader and convert it to a list of dictionaries.

We can look at the column names of the CSV by calling the keys method on the first dictionary in the list. Suppose we want to find the average city MPG across all cars in our CSV file. We sum the city MPG entry across all the dictionaries in our list, converting each one to a float because the csv module reads every value as a string, and divide by the length of the list.
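The steps above can be sketched as follows. To keep the example runnable anywhere, I use a tiny in-memory stand-in for mpg.csv: the rows are made up, and the column names (in particular cty for city MPG) are assumptions about how the real file is laid out.

```python
import csv
import io

# Made-up stand-in for mpg.csv; column names are assumed.
# With the real file you would use: open("mpg.csv", newline="")
sample_file = io.StringIO(
    "manufacturer,model,cty,hwy\n"
    "audi,a4,18,29\n"
    "ford,mustang,15,23\n"
    "honda,civic,24,32\n"
)

# DictReader yields one dictionary per row, keyed by the header line.
mpg = list(csv.DictReader(sample_file))

columns = list(mpg[0].keys())
print(columns)  # ['manufacturer', 'model', 'cty', 'hwy']

# Every CSV value is read as a string, so convert before summing.
average_city_mpg = sum(float(row["cty"]) for row in mpg) / len(mpg)
print("{:.2f}".format(average_city_mpg))  # 19.00
```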

You can check my GitHub repo for more in-depth functionality.

Python Dates and Time:

A lot of the analysis we do might relate to dates and times. For instance, we might want to find the average number of sales over a given period, or determine whether a list of products was purchased in a given period. We’re not going to delve too deeply into time series analysis, but I wanted to show you some of the basics in Python.

In Python, we can get the current time in seconds since the epoch (which is January 1, 1970) using the time module. We can then convert that timestamp into a datetime object using the fromtimestamp function of the datetime class. When we print this value out, we see that the year, month, day, and so forth are also printed. The datetime object has handy attributes to get the hour, day, second, etc. Datetime objects also allow for simple math using time deltas.
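A minimal sketch of these operations, using only the standard library. The variable names and the 100-day delta are arbitrary choices for illustration.

```python
import datetime as dt
import time

# Seconds since the epoch, as a float.
now_seconds = time.time()
print(now_seconds)

# Convert the timestamp into a datetime object (local time).
now = dt.datetime.fromtimestamp(now_seconds)
print(now)

# Individual components are available as attributes.
print(now.year, now.month, now.day, now.hour, now.minute, now.second)

# A timedelta represents a duration and supports simple date math.
delta = dt.timedelta(days=100)
today = dt.date.today()
earlier = today - delta
print(earlier)
print(today > earlier)  # True
```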

Objects and map:

Up to this point, I haven’t spoken much about object-oriented Python. While functions play a big role in the Python ecosystem, Python does have classes which can have attached methods, and be instantiated as objects.

First, you define a class using the class keyword, followed by the class name and a colon. Anything indented below this is within the scope of the class. For example, in a definition of a Person class we might write two methods, set_name and set_location. We can then create an object of the class, call its methods, and print out its attributes using dot notation.
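A sketch of such a class is below. The method names follow the description above, while the parameter names and the sample values are my own placeholders.

```python
class Person:
    def set_name(self, new_name):
        # self refers to the instance the method is called on.
        self.name = new_name

    def set_location(self, new_location):
        self.location = new_location


person = Person()
person.set_name("Alex")
person.set_location("Toronto")

summary = "{} lives in {}".format(person.name, person.location)
print(summary)  # Alex lives in Toronto
```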

There are a couple of implications of object-oriented programming in Python that you should take away from this very brief example. First, objects in Python do not have private or protected members. If you instantiate an object, you have full access to any of the methods or attributes of that object. Second, there’s no need for an explicit constructor when creating objects in Python. We can add a constructor if we want to by declaring the __init__ method.
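Both points can be seen in a small sketch. The class name Customer and its attributes are hypothetical, chosen only for this example.

```python
class Customer:
    def __init__(self, name, location):
        # __init__ runs automatically when the object is created.
        self.name = name
        self.location = location


customer = Customer("Sam", "Lisbon")
print(customer.name, customer.location)  # Sam Lisbon

# Nothing is private: attributes can be read and reassigned from outside.
customer.location = "Porto"
print(customer.location)  # Porto
```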

Lambda and List Comprehension:

Lambdas are Python’s way of creating anonymous functions. These are the same as other functions, but they have no name. The intent is that they’re simple or short-lived, and it’s easier just to write out the function in one line.

We declare a lambda function with the word lambda, followed by a list of arguments, followed by a colon and then a single expression. This is key: there’s only one expression to be evaluated in a lambda. The lambda expression evaluates to a function reference, which we can assign to a name.

For instance, if a three-argument lambda is assigned to the name my_function, you would execute my_function and pass in three different parameters.
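Here is a sketch of what that could look like. The name my_function comes from the text; the body of the lambda (adding the three arguments) is an assumption made for illustration.

```python
# A lambda with three arguments and a single expression as its body.
my_function = lambda a, b, c: a + b + c

result = my_function(1, 2, 3)
print(result)  # 6

# The equivalent named function, for comparison.
def my_named_function(a, b, c):
    return a + b + c

print(my_named_function(1, 2, 3))  # 6
```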

Python has built-in support for creating collections using a more abbreviated syntax called list comprehensions.
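To show the difference, here is the same list built twice: once with an ordinary loop and once with a list comprehension. Collecting the even numbers below ten is just an example task I picked.

```python
# The long way: a loop with a condition and an append.
evens_loop = []
for number in range(0, 10):
    if number % 2 == 0:
        evens_loop.append(number)

# The same result as a list comprehension.
evens = [number for number in range(0, 10) if number % 2 == 0]

print(evens_loop)  # [0, 2, 4, 6, 8]
print(evens)       # [0, 2, 4, 6, 8]
```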

NumPy:

NumPy is a package widely used in the data science community which lets us work efficiently with arrays and matrices in Python.

Let’s walk through it in steps:

  1. Import NumPy as np.
  2. Make our first array. We can start by creating a list and converting it to an array, or do it more succinctly by passing the list directly.
  3. Make a multidimensional array by passing in a list of lists.
  4. If we pass in two lists with three elements each, we get a two by three array.
  5. Check the dimensions using the shape attribute.
  6. Use the arange function: we pass in a start, a stop, and a step size, and it returns evenly spaced values within that interval (the stop value itself is excluded).
  7. Suppose we want to convert this array of numbers to a three by five array. We can use reshape to do that, provided the array has exactly fifteen elements.
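The steps above can be sketched as follows, assuming NumPy is installed. The variable names and the specific numbers are my own choices; I picked arange(0, 30, 2) because it yields exactly fifteen values, which is what a three by five reshape needs.

```python
import numpy as np

# Step 2: build an array from a list, or pass the list directly.
my_list = [1, 2, 3]
x = np.array(my_list)
y = np.array([4, 5, 6])
print(x, y)

# Steps 3 and 4: a list of lists becomes a two-dimensional array.
m = np.array([[7, 8, 9], [10, 11, 12]])

# Step 5: shape reports (rows, columns).
print(m.shape)  # (2, 3)

# Step 6: start at 0, stop before 30, step by 2.
n = np.arange(0, 30, 2)
print(n.shape)  # (15,)

# Step 7: reshape the fifteen values into three rows of five.
n = n.reshape(3, 5)
print(n)
```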

And the list goes on and on...

NumPy has a lot to offer, so be sure to look at my repository to find out about more great features.

Thanks!

First step of developing skills to practice data science.