Prerequisites before starting Data Science journey
This is part of The ULTIMATE Curriculum in Data Science which you can refer for more topics related to Data Science.
The big technology trend is to make systems intelligent and data is the raw material. -Amod Malviya
All right mate, now you have started your journey into the “Adventures of Data Science” and there is no turning back now. But you can’t fight in this field using old-school technologies, so I present all that you need as prerequisites before starting your Data Science journey.
Right now there are only three programming languages that are preferred in the Data Science community.
Technically you can code in whichever language you like, but to keep this tutorial simple, I will use Python as my go-to language, and all of my tutorials will be in Python.
For now, just follow these steps; afterwards I will give detailed information on each of them.
- Install Python as instructed below (Python 3.x is preferred).
Python doesn't come prepackaged with Windows, but that doesn't mean Windows users won't find the flexible programming… (www.howtogeek.com)
- Install PIP.
Many Python developers rely on a tool called PIP for Python to make everything much easier and faster. Installing and… (www.makeuseof.com)
- Install Jupyter notebook.
The Jupyter Notebook is an incredibly powerful tool for interactively developing and presenting data science projects… (www.dataquest.io)
- Run the following command in your terminal:
pip install numpy pandas scipy matplotlib scikit-learn seaborn
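After the installation finishes, a quick sanity check never hurts. This little script (my own addition, not part of any installer) just imports each library and prints its version; if any import fails, that library didn’t install correctly:

```python
# Sanity check: confirm each library installed and imports cleanly.
import numpy
import pandas
import scipy
import matplotlib
import sklearn       # this is how scikit-learn is imported
import seaborn

for module in (numpy, pandas, scipy, matplotlib, sklearn, seaborn):
    print(module.__name__, module.__version__)
```

If you see six names with version numbers and no traceback, you are good to go.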
Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use both for research and for building production systems. You will use it to write your code.
Jupyter Notebook is an open-source web application that allows us to create and share codes and documents.
It provides an environment where you can document your code, run it, look at the outcome, visualize data and see the results without leaving that environment. This makes it a handy tool for performing end-to-end data science workflows: data cleaning, statistical modeling, building and training machine learning models, visualizing data, and many, many other uses. You will write most of your code in this environment.
To learn how to work with Jupyter notebooks, see:
INTRODUCTION Jupyter Notebooks are a powerful way to write and iterate on your Python code for data analysis. Rather… (www.codecademy.com)
Tip: if you want to run Python code online in your browser, try out Google Colab. It provides the same Jupyter-notebook-style experience right in your browser.
NumPy is the foundational library for scientific computing in Python, and many other libraries use NumPy arrays as their basic inputs and outputs. In short, NumPy introduces objects for multidimensional arrays and matrices, as well as routines that allow developers to perform advanced mathematical and statistical functions on those arrays with as little code as possible.
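To give you a taste of that “as little code as possible” idea, here is a tiny sketch of mine showing vectorized operations on a NumPy array, with no loops in sight:

```python
import numpy as np

# A 2-D array (matrix) of numbers
data = np.array([[1.0, 2.0], [3.0, 4.0]])

# Vectorized operations apply to every element at once -- no loops needed
doubled = data * 2
col_means = data.mean(axis=0)  # mean of each column

print(doubled)    # [[2. 4.] [6. 8.]]
print(col_means)  # [2. 3.]
```

That one-liner `data.mean(axis=0)` replaces what would be a nested loop in plain Python.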
Pandas adds data structures and tools that are designed for practical data analysis in finance, statistics, social sciences, and engineering. Pandas works well with incomplete, messy, and unlabeled data (i.e., the kind of data you’re likely to encounter in the real world), and provides tools for shaping, merging, reshaping, and slicing data sets. (and if you are thinking that there are python and pandas in Data Science, wait there is also anaconda which will require a separate tutorial).
There are two main data structures in this library:
- “Series” — one-dimensional
- “DataFrame” — two-dimensional
The two work together: each row of a DataFrame is itself a Series, and one common operation is building a new DataFrame by appending a single row to an existing one by passing a Series.
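Here is a minimal sketch of that row-appending idea. Note one assumption on my part: the old `DataFrame.append` method was removed in pandas 2.0, so this uses `pd.concat` instead:

```python
import pandas as pd

# A Series is one-dimensional, like a single labeled column
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is two-dimensional, like a table
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

# Append a single row by passing a Series (via pd.concat,
# since DataFrame.append was removed in pandas 2.0)
row = pd.Series({"x": 5, "y": 6}, name=2)
df = pd.concat([df, row.to_frame().T])

print(df)  # three rows, columns x and y
```

The `name=2` on the Series becomes the new row's index label in the resulting DataFrame.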
scikit-learn builds on NumPy and SciPy by adding a set of algorithms for common machine learning and data mining tasks, including clustering, regression, and classification. As a library, scikit-learn has a lot going for it. Its tools are well-documented and its contributors include many machine learning experts. What’s more, it’s a very curated library, meaning developers won’t have to choose between different versions of the same algorithm.
If you want to see all the algorithms that scikit-learn provides, go to their official website.
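To show how little code a typical scikit-learn workflow takes, here is a quick sketch (my choice of dataset and classifier, purely for illustration) that trains a k-nearest-neighbors classifier on the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train a simple classifier and score it on held-out data
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("accuracy:", accuracy)
```

Every scikit-learn estimator follows this same `fit` / `predict` / `score` pattern, which is a big part of why the library is so pleasant to use.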
SciPy builds on NumPy by adding a collection of algorithms and high-level commands for manipulating and visualizing data. This package includes functions for computing integrals numerically, solving differential equations, optimization, and more.
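As a small illustration of those capabilities, here is a sketch that numerically integrates a function and finds a root, two of the bread-and-butter SciPy tasks mentioned above:

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi (exact answer: 2)
area, err = integrate.quad(np.sin, 0, np.pi)
print("area:", area)

# Find the root of x^2 - 4 on the interval [0, 5] (exact answer: 2)
root = optimize.brentq(lambda x: x**2 - 4, 0, 5)
print("root:", root)
```

Both answers come back accurate to many decimal places with a single function call each.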
matplotlib is the standard Python library for creating 2D plots and graphs. It’s pretty low-level, meaning it requires more commands to generate nice-looking graphs and figures than with some more advanced libraries. However, the flip side of that is flexibility.
With a bit of effort you can make just about any visualizations:
- Line plots
- Scatter plots
- Bar charts and Histograms
- Pie charts
- Stem plots
- Contour plots
- Quiver plots
There are also facilities for creating labels, grids, legends, and many other formatting entities with Matplotlib. Basically, everything is customizable.
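A minimal sketch tying a few of those pieces together: a line plot and a scatter plot on the same axes, with labels, a grid, and a legend. (The `Agg` backend line is my addition so the script also runs on machines without a display; inside Jupyter you would just call `plt.show()` instead of saving a file.)

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe on headless machines
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")            # line plot
ax.scatter(x[::10], np.cos(x[::10]), label="cos(x) samples")  # scatter plot
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
ax.grid(True)
fig.savefig("sine.png")  # in Jupyter, plt.show() displays it inline
```

Each of those `ax.` calls is one of the many small commands the paragraph above alludes to; the upside is that every one of them can be tweaked independently.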
Seaborn is a popular visualization library that builds on matplotlib’s foundation. The first thing you’ll notice about Seaborn is that its default styles are much more sophisticated than matplotlib’s. Beyond that, Seaborn is a higher-level library, meaning it’s easier to generate certain kinds of plots, including heat maps, time series, and violin plots.
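To show the difference in effort, here is a sketch that produces a violin plot in a single Seaborn call. The dataset is synthetic (made up by me for the example) so the snippet runs without any downloads:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for headless machines
import numpy as np
import pandas as pd
import seaborn as sns

# A small synthetic dataset: scores for two groups
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": ["A"] * 100 + ["B"] * 100,
    "score": np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)]),
})

# One call produces a fully styled violin plot; building the same
# chart in raw matplotlib would take noticeably more code
ax = sns.violinplot(data=df, x="group", y="score")
ax.figure.savefig("violin.png")
```

Seaborn even picks up the column names for the axis labels automatically, which is exactly the kind of higher-level convenience described above.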
For now, these modules are all you need to start your Data Science journey. In the upcoming courses I will cover many other modules that you will need as you progress towards becoming the ULTIMATE Data Scientist.
For the latest updates and tips, or if you have any questions or issues, just post in the comments.
Happy coding :)
And Don’t forget to clap clap clap…