Prerequisites before starting your Data Science journey

This is part of The ULTIMATE Curriculum in Data Science, which you can refer to for more topics related to Data Science.
The big technology trend is to make systems intelligent and data is the raw material. -Amod Malviya

All right mate, now you have started your journey into the “Adventures of Data Science” and there is no turning back. But you can’t fight in this field using old-school technologies, so I present everything you need as prerequisites before starting your Data Science journey.


Right now, there are three programming languages that are preferred in the Data Science community:

  • Python
  • R
  • JavaScript

Technically you can code in any language you like, but to keep this tutorial simple, I will use Python as my go-to language, and all of my tutorials will be in Python.

Right now, just follow these steps; afterwards I will give detailed information about each of them.

  • Install Python as instructed (Python 3.x is preferred).
  • Install pip.
  • Install Jupyter Notebook.
  • Run the following line at the command prompt:

pip install numpy pandas scipy matplotlib scikit-learn seaborn
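Once the install finishes, a quick sanity check (a minimal sketch; the exact version numbers you see will depend on when you install) is to import each library and print its version:

```python
# Sanity check: import each library installed above and print its version.
import numpy
import pandas
import scipy
import matplotlib
import sklearn   # scikit-learn is imported under the name "sklearn"
import seaborn

for mod in (numpy, pandas, scipy, matplotlib, sklearn, seaborn):
    print(mod.__name__, mod.__version__)
```

If any import fails, rerun the pip command above for that package.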


Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use both for research and for developing production systems. You will use it to write your code.


Jupyter Notebook is an open-source web application that allows us to create and share code and documents.

It provides an environment where you can document your code, run it, look at the outcome, visualize data, and see the results without leaving the environment. This makes it a handy tool for performing end-to-end data science workflows — data cleaning, statistical modeling, building and training machine learning models, visualizing data, and many, many other uses. You will write most of your code in this environment.

To learn how to write a Jupyter notebook, see the official Jupyter documentation.

(Figure: a typical Jupyter notebook)

Tip: if you want to run Python code online in your browser, try out Google Colab. It provides the same Jupyter-notebook-style experience in your browser.


NumPy is the foundational library for scientific computing in Python, and many other libraries use NumPy arrays as their basic inputs and outputs. In short, NumPy introduces objects for multidimensional arrays and matrices, as well as routines that allow developers to perform advanced mathematical and statistical functions on those arrays with as little code as possible.
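As a quick taste of what that means in practice, here is a minimal sketch of NumPy's vectorized arrays:

```python
import numpy as np

# Create a 2-D array and compute statistics without writing explicit loops.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.shape)        # (2, 3)
print(a.mean())       # 3.5
print(a.sum(axis=0))  # column sums: [5 7 9]

# Vectorized arithmetic: every element is doubled in one expression.
print(a * 2)
```

Operations like `mean` and `sum` run in optimized C code under the hood, which is why NumPy is so much faster than plain Python lists for numeric work.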


Pandas adds data structures and tools that are designed for practical data analysis in finance, statistics, social sciences, and engineering. Pandas works well with incomplete, messy, and unlabeled data (i.e., the kind of data you’re likely to encounter in the real world), and provides tools for shaping, merging, reshaping, and slicing data sets. (And if Python and Pandas were not enough of a menagerie, there is also Anaconda, which will require a separate tutorial.)

There are two main data structures in this library:

  • “Series”, which is one-dimensional
  • “DataFrame”, which is two-dimensional

You can combine these two structures to produce a new DataFrame, for example by appending a single row (a Series) to an existing DataFrame.
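A minimal sketch of that row-append operation (the names and values here are made up for illustration; note that `pd.concat` replaces the `DataFrame.append` method, which was removed in pandas 2.x):

```python
import pandas as pd

# A Series is a labeled 1-D array; a DataFrame is a 2-D labeled table.
s = pd.Series([170, 65], index=["height", "weight"], name="alice")
df = pd.DataFrame({"height": [180, 160], "weight": [80, 55]},
                  index=["bob", "carol"])

# Append the Series as a new row: turn it into a one-row DataFrame
# (its name becomes the row label) and concatenate.
df = pd.concat([df, s.to_frame().T])
print(df)
```

The Series' `name` attribute becomes the new row's index label, and its index labels line up with the DataFrame's columns.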

Scikit-learn

scikit-learn builds on NumPy and SciPy by adding a set of algorithms for common machine learning and data mining tasks, including clustering, regression, and classification. As a library, scikit-learn has a lot going for it. Its tools are well-documented and its contributors include many machine learning experts. What’s more, it’s a very curated library, meaning developers won’t have to choose between different versions of the same algorithm.

If you want to see all the algorithms that scikit-learn provides, go to its official website.

(Figure: scikit-learn’s cheat-sheet for choosing a preferred algorithm for your problem)
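To give a feel for the library's uniform fit/predict interface, here is a minimal sketch that trains a classifier on scikit-learn's built-in Iris dataset (the choice of k-nearest neighbors here is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset and split it into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit a k-nearest-neighbors classifier and score it on held-out data.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"accuracy: {accuracy:.2f}")
```

Swapping in a different algorithm (say, a decision tree or logistic regression) changes only the import and the constructor line; `fit`, `predict`, and `score` stay the same.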


SciPy builds on NumPy by adding a collection of algorithms and high-level commands for manipulating and visualizing data. This package includes functions for computing integrals numerically, solving differential equations, optimization, and more.
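A minimal sketch of two of those capabilities, numerical integration and optimization:

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi (the exact answer is 2).
value, abs_err = integrate.quad(np.sin, 0, np.pi)
print(value)

# Find the minimum of (x - 3)^2, starting the search at x = 0.
result = optimize.minimize(lambda x: (x - 3) ** 2, x0=0.0)
print(result.x)
```

Both functions return rich result objects: `quad` also reports an error estimate, and `minimize` reports convergence status alongside the solution.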


matplotlib is the standard Python library for creating 2D plots and graphs. It’s pretty low-level, meaning it requires more commands to generate nice-looking graphs and figures than with some more advanced libraries. However, the flip side of that is flexibility.

With a bit of effort you can make just about any visualization:

  • Line plots
  • Scatter plots
  • Bar charts and Histograms
  • Pie charts
  • Stem plots
  • Contour plots
  • Quiver plots
  • Spectrograms

There are also facilities for creating labels, grids, legends, and many other formatting entities with Matplotlib. Basically, everything is customizable.
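A minimal sketch of that low-level style, producing two of the plot types listed above side by side (the `Agg` backend is used so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, not a window
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)
rng = np.random.default_rng(0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Line plot with a label, legend, and title: each element is a separate call.
ax1.plot(x, np.sin(x), label="sin(x)")
ax1.set_title("Line plot")
ax1.legend()

# Scatter plot of random points.
ax2.scatter(rng.random(30), rng.random(30))
ax2.set_title("Scatter plot")

fig.savefig("plots.png")
```

Every label, legend, and tick is an explicit call, which is exactly the verbosity-for-flexibility trade-off described above.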

(Figure: basic Matplotlib plots)


Seaborn is a popular visualization library that builds on matplotlib’s foundation. The first thing you’ll notice about Seaborn is that its default styles are much more sophisticated than matplotlib’s. Beyond that, Seaborn is a higher-level library, meaning it’s easier to generate certain kinds of plots, including heat maps, time series, and violin plots.
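For example, a heat map of a correlation matrix takes a single Seaborn call (a sketch; the data here is random, generated purely for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # render to a file rather than a window
import numpy as np
import pandas as pd
import seaborn as sns

# Build a small DataFrame of random data and plot its correlation matrix.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("ABCD"))

# One call produces an annotated, color-mapped heat map.
ax = sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
ax.figure.savefig("heatmap.png")
```

The equivalent plot in raw Matplotlib would take many more lines of color-mapping and annotation code.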

(Figures: a Seaborn heat map and a Seaborn violin plot)

As of now, these modules are enough to start your Data Science journey. In the upcoming courses I will mention many other modules that you will need as you progress toward becoming the ULTIMATE Data Scientist.

To get the latest updates and tips, or if you have an issue or request, just post in the comments.

Till then….

Happy coding :)

And Don’t forget to clap clap clap…