Week 01: A Data Scientist's Journey

NumPy and Pandas

Disclaimer: this is raw and unedited, so excuse my typos.

This week started with me being busy, but forget about that. I am starting this as a continuous series about my journey into data science, showcasing what I learn every week. I had already learned most of the material in this article, but I will start from the beginning to practice my communication skills.

What is data science?

This is a multidisciplinary subject, so it tends to have many definitions. But I think Wikipedia's sums up most of them. Do not mind if yours is not included.

“Data science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, machine learning, data mining, and predictive analytics” -Wikipedia.

So a data scientist is:

“Someone who knows more statistics than a computer scientist and more computer science than a statistician.”— Josh Blumenstock

There are many languages that can help you work with data, but the popular ones are R and Python. I decided to go with Python because I have prior experience with it (I have worked with it for the past 2 or 3 years). The first step was to get an initial understanding of the libraries that can help me with mathematical computation and data manipulation. I spent some time this week working with NumPy and Pandas.

NumPy

This is the numerical computing package that is like a mother to most Python data analysis packages, with powerful features like the N-dimensional array and broadcasting. Since this was not meant to be a tutorial, I will point you to a great tutorial I found on dataquest.io, and below that there are links to other great tutorials too.
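
To make "N-dimensional array" and "broadcasting" a bit more concrete, here is a minimal sketch. The array values and the names matrix and column_offsets are made up for illustration and are not from any tutorial mentioned here.

```python
import numpy as np

# A 2-D array (3 rows, 4 columns) built from the integers 0..11.
matrix = np.arange(12).reshape(3, 4)

# A 1-D array with one value per column.
column_offsets = np.array([10, 20, 30, 40])

# Broadcasting: the 1-D array is stretched across every row,
# so no explicit Python loop is needed.
shifted = matrix + column_offsets

print(matrix.shape)  # (3, 4)
print(shifted[0])    # [10 21 32 43]
```

The shapes (3, 4) and (4,) are compatible because their trailing dimensions match, which is what lets NumPy apply the addition row by row.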

Pandas

This library provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It makes working with one-dimensional labelled arrays (Series) and two-dimensional tables (DataFrame) a lot easier. Go through this tutorial and thank me later.
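
As a rough sketch of the two structures, assuming some invented names and values (ages, people, and the cities are only for illustration):

```python
import pandas as pd

# A Series is a labelled one-dimensional array.
ages = pd.Series([25, 32, 47], index=["ann", "bob", "cat"], name="age")

# A DataFrame is a two-dimensional table; each column is a Series.
people = pd.DataFrame(
    {"age": [25, 32, 47], "city": ["Nairobi", "Dar es Salaam", "Kampala"]},
    index=["ann", "bob", "cat"],
)

print(ages["bob"])   # 32
print(people.shape)  # (3, 2)

# Selecting one column of a DataFrame gives back a Series.
print(isinstance(people["age"], pd.Series))  # True
```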

This week I have been using these two packages to explore datasets and do some simple cleaning: things like dropping unwanted columns or rows, finding the mean, max, min, and sum, renaming columns, and getting a general sense of what a dataset contains.
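
The operations mentioned above might look roughly like this. The dataset is hypothetical and invented only for this sketch; it is not the one from my notebook.

```python
import pandas as pd

# Hypothetical dataset, invented only for this illustration.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "score": [10, 20, 30, 40],
        "unused": [0, 0, 0, 0],
    }
)

# Drop an unwanted column, then an unwanted row (by index label).
cleaned = df.drop(columns=["unused"]).drop(index=[3])

# Rename a column.
cleaned = cleaned.rename(columns={"score": "points"})

# Basic summary statistics on one column.
print(cleaned["points"].mean())  # 20.0
print(cleaned["points"].max())   # 30
print(cleaned["points"].min())   # 10
print(cleaned["points"].sum())   # 60

# A general overview of the numeric columns.
print(cleaned.describe())
```

Note that drop and rename return new DataFrames here, so df itself is left unchanged.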

What I learned this week:

  • Do not chain indexing calls, since friends do not let friends chain calls. E.g., instead of df.loc['index1']['column2'], use a single lookup such as df.loc['index1', 'column2']. Always try to find the less expensive solution.
  • Do not iterate through a Series with a Python loop, since it can be quite expensive. Instead, use NumPy's built-in methods, which are vectorized and run in compiled code. E.g., in an IPython notebook (each %%timeit goes at the top of its own cell):

    import numpy as np
    import pandas as pd
    s = pd.Series(np.random.randint(0, 1000, 10000))

    %%timeit -n 100
    summary = 0
    for item in s:
        summary += item
    # 100 loops, best of 3: 1.74 ms per loop

    While if you decide to use NumPy:

    %%timeit -n 100
    summary = np.sum(s)
    # 100 loops, best of 3: 159 µs per loop
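  • Pandas often gives you a view of the data rather than a copy, since that is more memory-efficient. Changing a view can change the original, so call .copy() when you need an independent object.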
  • Use many data sources
  • Use statistical models.
  • Have good communication skills. (What does a 60% probability even mean? How can we visualize, validate, and understand the conclusions?)
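
For the points about chained calls and copies in the list above, here is a small sketch. The DataFrame and its labels are hypothetical and simply reuse the index1 and column2 names from the example.

```python
import pandas as pd

# Hypothetical frame reusing the labels from the list above.
df = pd.DataFrame(
    {"column1": [1, 2], "column2": [3, 4]},
    index=["index1", "index2"],
)

# Chained indexing: two separate lookups, one after the other.
chained = df.loc["index1"]["column2"]

# Single call: row and column labels are passed to .loc together.
direct = df.loc["index1", "column2"]
print(chained, direct)  # 3 3

# Assignment through a single .loc call updates df itself.
df.loc["index1", "column2"] = 30

# An explicit copy is independent of the original.
snapshot = df.copy()
snapshot.loc["index1", "column2"] = -1

print(df.loc["index1", "column2"])        # 30
print(snapshot.loc["index1", "column2"])  # -1
```

Both lookups return the same value when reading. The single .loc call matters most when assigning, because writing through a chained lookup may end up modifying a temporary object instead of df.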

Bye, till next week. Take care.

NB: I will link the IPython notebook that I used soon.

Update: The link to the notebook is here