Getting started with Data Science using Python

Let’s say you have chosen a piece of data to work with. Depending on its size, its features, and its target variable, you may already know very well how to proceed! In practice, though, acquiring the data required for the research you wish to do is often not an easy task at all. So usually we’ll be exploring the datasets that are already available, which may range from MBs to GBs of data.

In order to analyze the data and work with it, we’ll need some programming tools in our pocket. Python will be the language we use for analyzing the data, visualising it, and deciding how to proceed and which algorithm to choose. To begin with, we’ll need a few basic libraries:

  1. NumPy : a fast numerical programming library. import numpy as np
  2. SciPy : scientific computing and statistics functions. import scipy as sp
  3. Matplotlib : a Python 2D plotting library. import matplotlib
  4. Pandas : a data analysis library. import pandas as pd

So we’ll mostly be using these libraries (listed above) for our data science work. Let’s begin with the imports of these modules:

import numpy as np # Numerical Programming Lib. 
import matplotlib as mpl # Imports matplotlib
import matplotlib.cm as cm # Colormaps
import matplotlib.pyplot as plt # Plotting feature
import pandas as pd # handling data as dataframes

Let’s start with some basics in Python:

Suppose you are asked to check whether an element in a table is of floating point type, of integer type, or something else. There are many ways to do this; one of them, in Python 2, is to import the types module, which, as the documentation says:

This module defines names for some object types that are used by the standard Python interpreter, but not for the types defined by various extension modules.
import types as typ # imports the types module
a = 5.0 / 4.0
if type(a) == typ.FloatType: # FloatType is available in Python 2 only
    print("a is of floating point data type.")

In Python 3, where types.FloatType no longer exists, the same check can very well be done by comparing the type of a with that of a known floating point value:

if type(a) == type(3.0):
    print("a is of floating point data type.")

Using list comprehensions gives cleaner code. Let’s say you want each element of a new list to be half of the corresponding original element:

list_ = [1, 4, 16, 36, 64] # original list
list_new = [x // 2 for x in list_] # integer halves: [0, 2, 8, 18, 32]

List comprehensions can prove to be very useful in some cases, though mostly in competitive coding; they are rarely used explicitly in data science, but there’s no harm in learning about them.

So let’s say we have some data stored in a list, and we want only the even values out of it. There are two ways we can do the work:

# Way 1 - the normal method
list_ = [1, 2, 3, 4, 5, 6] # original list
list_new = [] # create a new empty list
for x in list_:
    if x % 2 == 0:
        list_new.append(x)
# Way 2 - using a list comprehension
list_ = [1, 2, 3, 4, 5, 6] # original list
list_new = [x for x in list_ if x % 2 == 0]

Clearly, the second piece of code is cleaner and better, though either of them can be used without any issues. There is a good explanation of list comprehensions, which can be found here.

Now let’s say we have a file and we want to know how many words it contains. We can do it in the following way:

file_novel = open(filename, 'r') # open the file in read mode
file_novel_content = file_novel.read() # read the whole novel into a single string
file_novel.close() # close the file when done
# split file_novel_content into a list of words
file_novel_listcount = file_novel_content.split()
len(file_novel_listcount) # the number of words in the file

Using the with statement, we can make this better in a lot of ways, as we won’t need to close the file ourselves; that will be done automatically when the with block ends. The syntax for it is shown in the link here.
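As a quick sketch, the word-count snippet above can be rewritten with with like this (filename being the same path as before):

# the file is closed automatically when the with block exits
with open(filename, 'r') as file_novel:
    file_novel_content = file_novel.read() # the whole file as one string
file_novel_listcount = file_novel_content.split() # split into words
print(len(file_novel_listcount)) # number of words in the file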

For the following work, find the novel’s .txt file here. (Although the results may vary, as the file in the link is not exactly the same one used here.)

import numpy as np
import matplotlib.pyplot as plt
# seaborn sets up nicer plot styles and gives us more plotting options
import seaborn as sns

with open("alice_in_wonderland.txt", 'r') as file_:
    file_content = file_.read() # read the whole file into a single string
file_content_list = file_content.split()
print(len(file_content_list))
# Total number of words in Alice in Wonderland is 26443
# lowercase every word, then get the unique words using the set data structure
file_content_list = [i.lower() for i in file_content_list]
file_content_unique = set(file_content_list)
# create an empty dictionary and count the occurrences of each unique word
file_content_dict = {}
for word in file_content_unique:
    file_content_dict[word] = file_content_list.count(word)
# print(file_content_dict)
# sort the (word, count) pairs by count, highest first, and keep the top 100
L = sorted(file_content_dict.items(), key=lambda kv: kv[1], reverse=True)[:100]
topFreq = L[:20] # the 20 most frequent words
fig, ax = plt.subplots()
plt.ylabel('Frequency')
plt.xlabel('Words')
plt.title('Word vs Frequency')
# print(topFreq) - for testing purposes
pos = np.arange(len(topFreq)) # x positions of the bars
# print(pos) - for testing purposes
plt.bar(pos, [item[1] for item in topFreq])
plt.xticks(pos, [item[0] for item in topFreq])
plt.show()

The output of which looks like this:

We can plot the same data with different graph styles:

Line Plot, Scatter plot, and combined.

We can also draw it as a pie chart. But clearly, the best out of all these will be either the pie chart or the bar chart, and of the two, the bar chart looks better, to me at least. A sketch of how these variants could be produced is shown below.

Pie Chart
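For reference, here is a rough sketch of how such variants could be produced from the same data; it reuses pos and topFreq from the snippet above and is not necessarily how the original figures were generated:

# line plot and scatter plot of the same frequencies on one axis
plt.plot(pos, [item[1] for item in topFreq], label='line')
plt.scatter(pos, [item[1] for item in topFreq], label='scatter')
plt.xticks(pos, [item[0] for item in topFreq])
plt.legend()
plt.show()
# pie chart of the same top-20 word frequencies
plt.pie([item[1] for item in topFreq], labels=[item[0] for item in topFreq])
plt.show()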

For using subplots in matplotlib, and a tutorial on the same, follow this link and here; a small sketch is shown below as well.
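As a minimal sketch (again reusing pos and topFreq from above), plt.subplots can lay two of these views side by side:

# two axes side by side: a bar chart and a line plot of the same data
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.bar(pos, [item[1] for item in topFreq])
ax1.set_title('Bar chart')
ax2.plot(pos, [item[1] for item in topFreq])
ax2.set_title('Line plot')
plt.show()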

We’ll be following up with more ideas and things to work on. These are obviously practical things, which we learn with practice! I suggest going through the documentation of these libraries and exploring the additional arguments that the plotting functions accept, which can help us analyze the data in a better way; a small example follows.
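For example, here is a sketch with a few optional arguments I find handy (figsize on plt.subplots, color on plt.bar, rotation on plt.xticks); none of them are specific to this dataset:

fig, ax = plt.subplots(figsize=(10, 5)) # a wider figure
plt.bar(pos, [item[1] for item in topFreq], color='steelblue')
plt.xticks(pos, [item[0] for item in topFreq], rotation=45) # tilt labels so they don't overlap
plt.ylabel('Frequency')
plt.xlabel('Words')
plt.title('Word vs Frequency')
plt.show()

Small tweaks like these are easy to find by skimming the matplotlib documentation for bar and xticks.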
