# Getting started with Data Science using Python

Let’s say you have chosen a dataset to work with. Depending on its size, its features, and its target variable, you may already have a good idea of how to proceed. In practice, though, acquiring exactly the data you need for your research is rarely easy, so most of the time we end up exploring whatever datasets are available, which may range from a few MBs to many GBs.

To analyze and work with the data, we’ll need some programming tools in our pocket. Python will be our language of choice for analyzing the data, visualising it, and deciding how to proceed and which algorithm to choose. To begin with, we’ll need a few libraries:

1. `NumPy` : a fast numerical computing library. `import numpy as np`
2. `SciPy` : scientific computing, including statistics functions. `import scipy as sp`
3. `Matplotlib` : a 2D plotting library for Python. `import matplotlib`
4. `Pandas` : a data analysis library. `import pandas as pd`

We’ll mostly be using the libraries listed above for our data science workout, beginning with their imports:

```python
import numpy as np               # numerical programming library
import matplotlib as mpl         # imports matplotlib
import matplotlib.cm as cm       # colormaps
import matplotlib.pyplot as plt  # plotting interface
import pandas as pd              # handling data as dataframes
```

Suppose you are asked to check whether an element in a table is a floating-point number, an integer, or something else. There are many ways to do this; one is to import Python’s `types` module, which, as the documentation says:

> This module defines names for some object types that are used by the standard Python interpreter, but not for the types defined by various extension modules.
```python
import types as typ  # imports the types module

a = 5.0 / 4.0

# Note: typ.FloatType existed in Python 2 only;
# in Python 3, compare against the built-in float instead.
if type(a) == float:
    print("a is of floating point data type.")
```

Alternatively, this can be done by comparing the type of `a` with that of a known floating-point value:

```python
if type(a) == type(3.0):
    print("a is of floating point data type.")
```
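In Python 3, the idiomatic check is `isinstance`, which also accepts a tuple of types; a minimal sketch:

```python
a = 5.0 / 4.0

# isinstance is the idiomatic type check in Python 3
if isinstance(a, float):
    print("a is of floating point data type.")

# isinstance can check against several types at once
print(isinstance(a, (int, float)))  # True
```

Unlike a direct `type(...)` comparison, `isinstance` also matches subclasses, which is usually what you want.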

Using a list comprehension for cleaner code (say you want each element of the list to be half of its original value):

```python
list_ = [1, 4, 16, 36, 64]         # original list
list_new = [x / 2 for x in list_]  # [0.5, 2.0, 8.0, 18.0, 32.0]
# note: x // 2 would floor-divide, turning the first element into 0
```

List comprehensions can prove very useful in some cases; though they show up most often in competitive programming and only occasionally need to be used explicitly in data science, there’s no harm in learning them.

So let’s say we have data stored in a list and we want only the even numbers out of it; there are two ways we can do the work:

```python
# Way 1: a plain loop
list_ = [1, 2, 3, 4, 5, 6]  # original list
list_new = []               # new empty list
for x in list_:
    if x % 2 == 0:
        list_new.append(x)
```

```python
# Way 2: a list comprehension
list_ = [1, 2, 3, 4, 5, 6]  # original list
list_new = [x for x in list_ if x % 2 == 0]
```

Clearly, the second piece of code is cleaner and better, though either will do the job without raising any questions. The official Python tutorial has a good explanation of list comprehensions.
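For completeness, the same filter can also be written with the built-in `filter` function, though the comprehension is usually considered more readable; a sketch:

```python
list_ = [1, 2, 3, 4, 5, 6]

# filter() returns a lazy iterator, so wrap it in list() to materialize it
list_new = list(filter(lambda x: x % 2 == 0, list_))
print(list_new)  # [2, 4, 6]
```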

Now let’s say we have a file and we want to know how many words it contains; we can do it the following way:

```python
file_novel = open(filename, 'r')        # open the file in read mode
file_novel_content = file_novel.read()  # read the whole file as one string
file_novel.close()

# split file_novel_content into a list of words
file_novel_listcount = file_novel_content.split()
len(file_novel_listcount)               # length of the list = word count
```

Using the `with` statement, we can make this better in several ways: we won’t need to close the file ourselves, as that is done automatically when the block exits.
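A minimal sketch of the same word count using `with`; the sample file and its contents here are made up so the example is self-contained:

```python
# Create a tiny sample file so the sketch is self-contained;
# in practice you would open your own .txt file instead.
with open("sample.txt", "w") as f:
    f.write("Alice was beginning to get very tired\n")

# The with statement closes the file automatically,
# even if an exception is raised inside the block.
with open("sample.txt", "r") as file_novel:
    file_novel_content = file_novel.read()

# split() with no arguments splits on any run of whitespace
word_count = len(file_novel_content.split())
print(word_count)  # 7
```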

For the following work, you’ll need a plain-text copy of the novel; Project Gutenberg hosts free .txt editions of Alice’s Adventures in Wonderland. (The results may vary if your copy is not exactly the same as the one used here.)

```python
import numpy as np
import matplotlib.pyplot as plt
# seaborn sets up styles and gives us more plotting options
import seaborn as sns

with open("alice_in_wonderland.txt", 'r') as file_:
    file_content = file_.read()  # read the whole file as one string
    file_content_list = file_content.split()
    print(len(file_content_list))
# Total number of words in Alice in Wonderland is 26443

# lowercase everything, then get unique words using the set data structure
file_content_list = [i.lower() for i in file_content_list]
file_content_unique = set(file_content_list)

# build a word -> frequency dictionary
file_content_dict = {}
for word in file_content_unique:
    file_content_dict[word] = file_content_list.count(word)
# print(file_content_dict)

# sort by frequency, descending; note that dict.iteritems() and the
# tuple-unpacking lambda are Python 2 only, so we use items() instead
L = sorted(file_content_dict.items(), key=lambda kv: kv[1], reverse=True)[:100]

topFreq = L[:20]
fig, ax = plt.subplots()

plt.ylabel('Frequency')
plt.xlabel('Words')
plt.title('Word vs Frequency')

# print(topFreq) - for testing purposes
pos = np.arange(len(topFreq))
# print(pos) - for testing purposes
plt.bar(pos, [item[1] for item in topFreq])
plt.xticks(pos, [item[0] for item in topFreq])
plt.show()
```

The output is a bar chart of the 20 most frequent words plotted against their frequencies.

We can plot the same data with different graph styles.
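For instance, a horizontal bar chart often leaves more room for the word labels. A sketch with made-up frequencies (in practice the pairs would come from `topFreq` above):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

# Made-up (word, frequency) pairs standing in for topFreq
top_freq = [("the", 1500), ("and", 800), ("to", 700), ("a", 600), ("she", 500)]

pos = np.arange(len(top_freq))
plt.barh(pos, [count for _, count in top_freq])   # horizontal bars
plt.yticks(pos, [word for word, _ in top_freq])   # word labels on the y-axis
plt.xlabel("Frequency")
plt.title("Word vs Frequency (horizontal)")
plt.savefig("word_freq_barh.png")
```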

We can also draw it as a pie chart. But clearly, the best of all these will be either the pie chart or the histogram, and of the two, the histogram looks better, to me at least.
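As an aside, calling `list.count` once per unique word (as in the loop above) rescans the whole list each time; the standard library’s `collections.Counter` builds the same frequency table in a single pass:

```python
from collections import Counter

words = ["the", "cat", "sat", "on", "the", "mat", "the"]

# Counter maps each word to its number of occurrences in one pass
freq = Counter(words)
print(freq["the"])          # 3
print(freq.most_common(1))  # [('the', 3)] - pairs sorted by count, descending
```

`most_common(n)` replaces the `sorted(..., reverse=True)[:n]` step entirely.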

For using subplots in matplotlib, see the matplotlib documentation and its tutorial on `plt.subplots`.
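A minimal subplots sketch, drawing the same kind of data as a bar chart and a pie chart side by side (the words and counts are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

words = ["the", "and", "to", "a"]
counts = [1500, 800, 700, 600]  # made-up frequencies

# one row, two columns of axes
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

pos = np.arange(len(words))
ax1.bar(pos, counts)
ax1.set_xticks(pos)
ax1.set_xticklabels(words)
ax1.set_title("Bar chart")

ax2.pie(counts, labels=words, autopct="%1.1f%%")
ax2.set_title("Pie chart")

fig.savefig("word_freq_subplots.png")
```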

We’ll follow up with more ideas and exercises to work through. These are practical skills, which we learn with practice! I suggest going through the documentation of these libraries and exploring the extra arguments the plotting functions accept, which can help us analyze the data in better ways.

— Kushashwa Ravi Shrimali