
A Data Scientist’s Treasure Trove.

MD Fazal Mustafa
Developer Student Club, HIT
6 min read · Apr 3, 2020

--

Well, who is a Data Scientist? I know some of you are wondering that. And those of you who have been reading tech blogs and seeing the term in the media must be asking: what’s the difference between a Data Scientist and a Machine Learning Engineer? Aren’t they the same thing?

Let’s settle these questions first, and then we will go on the treasure hunt. Cool? Let’s begin.

So, a Data Scientist is someone who applies machine learning, statistical methods, and exploratory analysis to data to extract insights and aid decision making. They also do data cleaning, ETL (Extract, Transform and Load, because raw data is often not ready for data science techniques to be applied to it), a little bit of database management, data visualization, and data analysis. Yup, that’s it. I know it’s a lot, which is why most companies have Data Science teams.

Okay, so a Data Scientist is not the same as a Machine Learning Engineer. In my view, a Data Scientist can do the job of an ML engineer, but an ML engineer cannot do all of a Data Scientist’s job. Data Scientists are not hired just to train neural networks; many times they have to answer questions. They are handed a database and asked to answer any number of questions and draw insights from the data. An ML engineer, on the other hand, typically doesn’t derive insights, clean or prepare data, or manage databases. ML engineers work with very complex algorithms and are often required to come up with their own algorithms in order to get a model with the highest accuracy possible.

Okay, now let’s dive into the Treasure Island of a Data Scientist. We will discuss packages, or libraries, whichever you prefer to call them. For a Data Scientist there are 3 kinds of packages:

  1. Scientific Computing packages
  2. Visualization packages
  3. Machine Learning packages

Scientific Computing packages

1. Pandas

Pandas is an open-source Python package that provides high-performance, easy-to-use data structures and data analysis tools. It helps developers work with “labeled” and “relational” data intuitively. It’s based on two main data structures: “Series” (one-dimensional, like a list of items) and “DataFrame” (two-dimensional, like a table with multiple columns).

Pandas is a great tool for data wrangling, or munging. It is designed for quick and easy data manipulation, reading, aggregation, and visualization. Pandas takes data from a CSV or TSV file or a SQL database and creates a Python object with rows and columns called a DataFrame.
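To make that concrete, here is a minimal sketch of reading CSV data into a DataFrame and aggregating it. The tiny sales table, its column names, and the city names are made up for illustration; in practice you would pass a file path to `pd.read_csv` instead of an in-memory string.

```python
import io

import pandas as pd

# Hypothetical CSV content, kept in a string so the sketch needs no file on disk.
csv_text = """city,product,units,price
Kolkata,pen,10,5.0
Kolkata,notebook,4,40.0
Delhi,pen,7,5.5
Delhi,notebook,3,42.0
"""

# read_csv accepts a path or any file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# A single column of a DataFrame is a Series.
units = df["units"]

# Column arithmetic is applied row by row without an explicit loop.
df["revenue"] = df["units"] * df["price"]

# Aggregation: total revenue per city.
revenue_by_city = df.groupby("city")["revenue"].sum()
print(revenue_by_city)
```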

2. NumPy

NumPy is one of the most fundamental packages in Python for array-processing. It provides high-performance multidimensional array objects and tools to work with the arrays. NumPy is an efficient container of generic multi-dimensional data.

The library offers many handy features for performing operations on n-dimensional arrays and matrices in Python. It processes arrays that store values of the same data type and makes math operations on them easier through vectorization: an operation is applied to the whole array at once instead of element by element in a Python loop. This significantly improves performance and cuts execution time.
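A short sketch of what vectorization looks like in practice. The temperature values are arbitrary sample numbers chosen for illustration.

```python
import numpy as np

# A one-dimensional array; every element shares the same dtype (float64 here).
celsius = np.array([0.0, 21.5, 37.0, 100.0])

# Vectorized arithmetic: the formula is applied to every element at once,
# with no explicit Python loop.
fahrenheit = celsius * 9 / 5 + 32

# A two-dimensional array (2 rows, 3 columns) built from the range 0..5.
matrix = np.arange(6).reshape(2, 3)

# Reductions can run along an axis; axis=0 sums down each column.
column_sums = matrix.sum(axis=0)

print(fahrenheit)
print(column_sums)
```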

3. SciPy

The SciPy library contains modules for efficient mathematical routines such as linear algebra, interpolation, optimization, integration, and statistics. Its main functionality is built upon NumPy and its arrays.

SciPy uses arrays as its basic data structure. It has various modules for common scientific programming tasks such as linear algebra, calculus (integration and ordinary differential equations), and signal processing.
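Here is a small sketch touching three of those modules: integration, optimization, and linear algebra. The function being minimized and the linear system are toy examples invented for illustration.

```python
import numpy as np
from scipy import integrate, linalg, optimize

# Integration: area under sin(x) between 0 and pi (the exact answer is 2).
area, estimated_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimize a simple parabola whose minimum sits at x = 2.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Linear algebra: solve the system 3x + y = 9, x + 2y = 8.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
solution = linalg.solve(A, b)

print(area, result.x, solution)
```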

Visualization Packages

1. Matplotlib

This is a standard data science library for generating data visualizations such as two-dimensional diagrams and graphs (histograms, scatter plots, graphs in non-Cartesian coordinates). Matplotlib is one of those plotting libraries that are really useful in data science projects: it provides an object-oriented API for embedding plots into applications.

Its pyplot interface closely resembles MATLAB’s plotting commands, brought into the Python programming language. Matplotlib can make histograms, bar plots, scatter plots, area plots, pie plots, and many other visualizations.
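As a rough sketch of the object-oriented API, the snippet below draws a histogram and a scatter plot side by side. The random data is synthetic, and selecting the `Agg` backend is only an assumption so the code runs without a display; in a notebook you would normally drop that line and call `plt.show()`.

```python
import io

import matplotlib

matplotlib.use("Agg")  # Assumption: headless backend so the sketch runs anywhere.
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: 500 samples from a standard normal distribution.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)

# Object-oriented API: create a Figure with two Axes and draw on each one.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(samples, bins=20)
ax_hist.set_title("Histogram")
ax_scatter.scatter(samples[:-1], samples[1:], s=5)
ax_scatter.set_title("Scatter plot")
fig.tight_layout()

# Render to an in-memory PNG; fig.savefig("plots.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```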

2. Seaborn

Seaborn is a data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics. Put simply, Seaborn is an extension of Matplotlib with advanced features.

Well, the difference between Matplotlib and Seaborn is that Matplotlib is used for basic plotting (bars, pies, lines, scatter plots and so on), whereas Seaborn provides a variety of visualization patterns with simpler syntax and less code.
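A minimal sketch of that shorter syntax: one call plots a DataFrame, colors the points by a category, and labels the axes. The study-hours data and its column names are hypothetical, and the `Agg` backend line is only an assumption so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # Assumption: headless backend so the sketch runs anywhere.
import pandas as pd
import seaborn as sns

# Hypothetical data: hours studied versus exam score for two groups.
df = pd.DataFrame(
    {
        "hours_studied": [1, 2, 3, 4, 5, 6],
        "score": [52, 55, 61, 70, 74, 83],
        "group": ["A", "B", "A", "B", "A", "B"],
    }
)

# One call: Seaborn reads the columns by name, colors points by "group",
# and labels the axes. It returns a regular Matplotlib Axes object.
ax = sns.scatterplot(data=df, x="hours_studied", y="score", hue="group")
ax.set_title("Score vs. hours studied")
```

Because the return value is a plain Matplotlib Axes, anything you know from Matplotlib (titles, limits, saving the figure) still applies.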

3. Folium (Geographic Information System library)

It essentially helps in working with maps. Folium makes it easy to visualize data that’s been manipulated in Python on an interactive Leaflet map. It enables both the binding of data to a map for choropleth visualizations and the passing of rich vector/raster/HTML visualizations as markers on the map.

The library has a number of built-in tilesets from OpenStreetMap, Mapbox, and Stamen, and supports custom tilesets with Mapbox or Cloudmade API keys. Folium supports image, video, GeoJSON, and TopoJSON overlays.

Machine Learning Libraries

1. Scikit-learn

Scikit-learn is a robust machine learning library for Python. It features ML algorithms such as SVMs, random forests, k-means clustering, spectral clustering, and mean shift, along with utilities such as cross-validation. It is built on NumPy and SciPy, works directly with their arrays, and is part of the SciPy stack.

Data scientists use it for handling standard machine learning and data mining tasks such as clustering, regression, model selection, dimensionality reduction, and classification. Another advantage? It comes with quality documentation and offers high performance.
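Here is a short sketch of a typical classification workflow: split the data, fit a model, then score it. The choice of the bundled Iris dataset, a random forest, and these particular parameters is mine, purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Iris ships with scikit-learn, so nothing is downloaded.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Every estimator follows the same fit / predict / score pattern.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# 5-fold cross-validation gives a less noisy estimate than a single split.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5
)

print(test_accuracy, cv_scores.mean())
```

Swapping in another algorithm, say an SVM, usually means changing only the line that constructs the model.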

2. TensorFlow

TensorFlow is an AI library that helps developers create large-scale neural networks with many layers using data flow graphs. TensorFlow also facilitates the building of deep learning models, pushes the state of the art in ML/AI, and allows easy deployment of ML-powered applications.

It is used by many of the giant tech companies. It can be applied to voice/sound recognition, sentiment analysis, face recognition, time series, video detection, etc. It’s really powerful and one of the most mature and best community-backed ML libraries in the world.

3. Keras

Keras is TensorFlow’s high-level API for building and training deep neural networks. It is an open-source neural network library written in Python. Keras only provides high-level APIs, while TensorFlow provides both high-level and low-level APIs.

It’s very straightforward to use and provides developers with a good degree of extensibility. The library can use other packages (Theano or TensorFlow) as its backend. Moreover, Microsoft integrated CNTK (Microsoft Cognitive Toolkit) to serve as another backend. It’s a great pick if you want to experiment quickly using compact systems.
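To give a feel for the high-level API, here is a minimal sketch that defines, trains, and uses a tiny network through the Keras bundled with TensorFlow. The synthetic data, the layer sizes, and the number of epochs are arbitrary choices for illustration, not a recommended setup.

```python
import numpy as np
from tensorflow import keras

# Synthetic data: 200 rows with 4 features; the label is 1 when the
# features sum to a positive number.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 4)).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")

# Stack layers into a model: one hidden layer, one sigmoid output.
model = keras.Sequential(
    [
        keras.Input(shape=(4,)),
        keras.layers.Dense(8, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)

# compile() picks the optimizer and loss; fit() runs the training loop.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# predict() returns one probability per input row.
probabilities = model.predict(X[:3], verbose=0)
print(probabilities)
```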

4. PyTorch

PyTorch is a framework that is a great fit for data scientists who want to perform deep learning tasks easily. It allows tensor computations with GPU acceleration, and it is open-source. Tesla has said it uses PyTorch for the neural networks behind its self-driving features, with its own custom stack built on top of PyTorch for training them.

PyTorch is based on Torch, which is an open-source deep-learning library implemented in C, with a wrapper in Lua.
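A small sketch of tensor computation in PyTorch. It uses a GPU when one is available and falls back to the CPU otherwise. The automatic differentiation step at the end goes a little beyond what is described above, and the numbers are toy values.

```python
import torch

# Pick the GPU if PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors are created directly on the chosen device.
a = torch.ones(2, 3, device=device)
b = torch.full((3, 2), 2.0, device=device)

# Matrix multiplication: every entry is 1 * 2 summed over 3 terms, so 6.
product = a @ b

# Autograd: track operations on x, then ask for d(loss)/dx.
x = torch.tensor([1.0, 2.0, 3.0], device=device, requires_grad=True)
loss = (x ** 2).sum()
loss.backward()  # Fills x.grad with 2 * x.

print(product)
print(x.grad)
```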

Now, there are many other very useful, important, and famous packages like NLTK, spaCy, Bokeh, Plotly, ggplot, statsmodels, XGBoost, OpenCV, YOLO, etc. We will discuss them later. Today my aim was to introduce the 10 most important packages that are like a treasure for Data Scientists.

Know your author

I am a 2nd-year IT undergrad who is crazy about autonomous vehicles. That deep interest of mine makes me study Data Science, Machine Learning, Deep Learning, and Computer Vision. I have a crush on AR too. So sometimes, to follow that crush, I get involved in Unity and app development with Android or Flutter. This is my first technical blog. Show your love.
