The Data Science Starter Pack! What and Why?

Yash Gupta
Data Science Simplified
10 min read · Oct 7, 2020

If you had data to work on, what tool would you use? The first answer most of us have is Excel. Have you thought of going beyond Excel? Excel was launched in 1985, and it is one of the few programs of that age still in everyday use, thanks to constant updates by Microsoft and because its spreadsheets were a revolution in the way data was handled.

Excel is not just about its cells though, right? We also have functions, row and column operations, formatting, multiple sheets to work on, the renowned Macros, and the list goes on.

Ms Excel Interface

But there is no way that we would be stuck with Excel for three long decades, right? There has to be something better out there that can make our work simpler.

Today, there are tons of customizable and home-grown tools that companies and individuals use to make their data handling simpler. These range from Database Management Systems (DBMS) and programming languages to analytics tools, cloud services for handling and sharing data, and Data Viz (Visualization) tools. There’s a tool or application for almost every need today, and each tends to be strongest in its own segment.

In this article, we’ll go over a set of such applications (and languages) that you can use to work on your data. We’ll briefly discuss how each one is used and what its pros and cons are, so that you know what you need to learn to be the best at what you do. The tools discussed in this article are listed below:

  1. Python
  2. Tableau
  3. SQL
  4. Power BI
  5. TensorFlow
  6. Excel VBA
  7. Others

Note: There are other tools that you can use for the same tasks. The ones here make the cut because their extensive usage and popularity make them part of a ‘Starter’ pack.

Python:

The Anaconda Navigator to use Python’s packages.

The crown jewel of Data Science today, Python is like a Genie that can fulfill any wish that you have. Python is an open-source programming language that you can use for pretty much anything. Its wide availability and access to hundreds of open-source libraries let you pick packages suited to your needs.

It can work on datasets, matrices, lists, strings (text) etc. and can be used to run Machine Learning algorithms and build Neural Networks on them. It is also the most common language for Deep Learning, Natural Language Processing, Neural Networks (ANNs and RNNs), Artificial Intelligence and Data Science itself. The inclusion of a programming language was crucial for the process to be called a ‘science’, and Python is flexible enough to do almost anything you want.

It has extensive libraries for Data Science purposes such as pandas, NumPy, SciPy (including scipy.stats), scikit-learn, Altair, Seaborn, Matplotlib, Flask, ChainConsumer etc., which cover everything from data handling, cleaning and viz to Web Development and Machine Learning. If you want to master Data Science, you must learn Python.

def, in, mean, median, array, describe, info, class, list, set, tuple, string, plot, kdeplot, distplot.

You might be wondering what these words are. They are some of the keywords, types and functions you will meet in Python and its libraries. They sure are not 10101110 or 01010110, which are binary. For anyone reading this who assumes Python is hard to learn, there are only two requirements for learning Python:

  1. A Desktop/Laptop/Notebook etc.
  2. Elementary English (Reading and Writing, which I’m sure you know since you’re reading this)
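To show how readable those words are in practice, here is a minimal sketch that uses a few of them (def, in, mean, median, list, set, tuple). The function name `summarize` and the sample scores are made up for illustration, and only Python's standard library is used.

```python
from statistics import mean, median


def summarize(values):
    """Return the mean and median of a collection of numbers as a tuple."""
    return mean(values), median(values)


scores = [72, 85, 90, 66, 85]   # a list of made-up scores
unique_scores = set(scores)     # a set keeps each value only once
avg, mid = summarize(scores)    # the returned tuple is unpacked into two names

if 90 in scores:                # 'in' checks membership
    print("Top score found")

print(avg, mid, len(unique_scores))
```

Read aloud, each line says roughly what it does, which is the point being made above.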

Note: The breadth of the components in Python’s libraries makes it a credible alternative to R Programming. While some companies still prefer R for general use, there is very little that R can do that Python cannot.

For more on Python:

Tableau / Tableau Public:

Hands down the most amazing and aesthetic Data Visualization tool out there, Tableau can plot volumes of data for you in just seconds. It is highly recommended for everyone, from students to professionals. Tableau can generate almost any kind of plot that you need, such as a Donut Chart, Pie Chart, Histogram, Bar plot, Scatterplot, Box plot, Violin plot, Maps etc. It can also make Data Viz seem like a cakewalk thanks to its Drag and Drop interface and auto-recognition of variables as continuous or categorical.

Sounds overwhelmingly amazing? That’s not all. It comes with an inbuilt Data Interpreter which, in my experience, gets it right almost every time when you need to make slight changes to clean your data.

Recent versions of Tableau also come with Data Storytelling features alongside Dashboards that can take your reports to a whole new level.

P.S. It has a paid version and a FREE Public version, with which you can store your visualizations on Tableau’s cloud and access the resources on Tableau Public. All you need to do is sign up and download the application! It is very simple to use, and a detailed tutorial on how to use it is available on Tableau Public’s official website (the link for which will be at the end of this article).

Get to making some art now! For more tips on how to make your visualizations stand out, check out my previous blog:

SQL (Not Sequel):

Relational Database Management Systems for SQL

SQL, or as people like to call it, “Sequel”, stands for “Structured Query Language”. It is the way you communicate with databases stored in your system and manipulate the information within them through a DBMS (Database Management System) or Relational DBMS. It is supported by a wide range of systems such as MySQL, PostgreSQL, Oracle, SQL Server and Microsoft Access. They differ in the features they offer, but all of them use SQL as their language.

A database stores data in rows and columns in tables, which can then be queried and manipulated. Like Python, SQL reads like English, and commands are as simple as the words create, delete, drop, select etc. Needless to say, it’s a must to learn SQL to be able to work with Databases.
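As a rough illustration of how close those commands are to plain English, the sketch below runs CREATE, INSERT, SELECT, DELETE and DROP statements against a throwaway in-memory SQLite database through Python's built-in `sqlite3` module. SQLite is not one of the systems named above; it is used here only because it needs no installation. The `students` table and its rows are invented for the example.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE a table with two columns.
cur.execute("CREATE TABLE students (name TEXT, marks INTEGER)")

# INSERT a few made-up rows.
cur.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Asha", 82), ("Ravi", 45), ("Meera", 91)],
)

# SELECT the names of students who scored above 50, highest first.
cur.execute("SELECT name FROM students WHERE marks > 50 ORDER BY marks DESC")
passed = [row[0] for row in cur.fetchall()]

# DELETE the rows below 50, then count what is left.
cur.execute("DELETE FROM students WHERE marks < 50")
cur.execute("SELECT COUNT(*) FROM students")
remaining = cur.fetchone()[0]

# DROP removes the whole table.
cur.execute("DROP TABLE students")
conn.close()

print(passed, remaining)
```

The same statements would look almost identical in MySQL or PostgreSQL; mostly the connection step differs.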

Power BI:

Offered by Microsoft, the Power BI tools, which include Power BI Desktop, Power BI Pro, Power BI Service and Power BI Mobile, are a more flexible alternative to Tableau and offer better ways to clean data manually. Power BI comes with an inbuilt Query Editor and its own languages (M for queries and DAX for calculations) and can work on big volumes of data easily. It can auto-detect null entries or errors in the data.

Unlike Tableau, it is a bit heavy on system resources (with a runtime requirement of 1.6GB) and hence ranks below Tableau in this list (the list is unordered, but Power BI lacks Tableau’s speed). Nevertheless, there are pros to this giant that make it better than Tableau in some aspects. It allows users to make visualizations, establish two-way relationships between tables and make highly customizable dashboards (which can sometimes be better than Tableau’s).

P.S. If your desktop/laptop doesn’t support heavy programs, it’s best to avoid Power BI and go for Tableau.

It supports collaborative projects and project sharing over its cloud, so you can team up with your peers and colleagues and work on a project together in real time. It also comes with a Power BI Mobile tool that lets you access your projects from your mobile phone using just the internet.

Note: It also supports custom visuals, which can be downloaded from Microsoft’s marketplace (AppSource) like any other application, and you can also build your own custom visuals with JavaScript code to suit your specific needs.

TensorFlow 2.0:

Note: Skip this section if you’re not interested in learning Python and progressing towards the Machine Learning and Deep Learning segments.

TensorFlow is an open-source Deep Learning library developed by Google (version 2.0 is the one discussed here). It is freely available to anyone and everyone. It can also be accessed through Google’s Colaboratory (Colab) in case your local system doesn’t support it (it is available only for 64-bit processors). TensorFlow lets you build Deep Learning Sequential models and Perceptrons that you can customize, set the number of epochs for, and train for enough passes over the data to make your model stand out at analyzing it.

In simpler terms, you can train a Perceptron, which works loosely like a biological neuron, and then connect multiple Perceptrons together in layers: an input layer, one or more hidden layers and an output layer. That makes it an ANN, or Artificial Neural Network (with more than 4 or 5 hidden layers between input and output, it is generally called a Deep Neural Network). These work on the basis of the Universal Approximation Theorem, which you can study for further information.
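To make the idea of a single Perceptron concrete, here is a minimal sketch in plain Python, without TensorFlow, that trains one Perceptron to reproduce the logical AND of two inputs. The function names, the learning rate of 1 and the choice of the AND gate are assumptions made for this illustration; a real TensorFlow model would be built from its own layers and optimizers instead.

```python
def step(z):
    """Step activation: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, epochs=10, lr=1):
    """Classic perceptron learning rule for two inputs and one output."""
    w = [0, 0]  # one weight per input
    b = 0       # bias term
    for _ in range(epochs):
        for (x1, x2), target in samples:
            prediction = step(w[0] * x1 + w[1] * x2 + b)
            error = target - prediction
            # Nudge the weights and bias in the direction that reduces the error.
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b


# Truth table of the AND gate: ((input1, input2), expected output)
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_gate)
predictions = [
    step(weights[0] * x1 + weights[1] * x2 + bias) for (x1, x2), _ in and_gate
]
print(predictions)  # [0, 0, 0, 1]
```

Each pass over the four samples is one epoch. A single Perceptron like this can only learn linearly separable patterns, which is why they are stacked into layers.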

TensorFlow and other Deep Learning libraries like it can take you leaps and bounds ahead in your Data Science path. They enable you to train the machine on data in a way that most Machine Learning algorithms cannot. For more reading on how TensorFlow works and how amazing it is, check out the link down below.

Excel VBA:

Surprise!

*Image used for Representational Purposes only.

Excel does make the cut into this list. We spoke about how we wouldn’t be stuck with a decades-old tool, but it still makes the cut. Prior to 1985, people did not have a data handling tool of 1,048,576 rows by 16,384 columns that could host roughly 1.7 × 10¹⁰ cells. You can go ahead and try to compute that value. Spreadsheets have kept developing and the GUI in Excel has been getting easier and easier to use, but there was one underlying thing that none of us was told about at school.
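If you do want to compute it, the multiplication takes a couple of lines of Python. The product comes out to 17,179,869,184 cells, or about 1.7 × 10¹⁰.

```python
rows = 1_048_576     # rows in a modern Excel worksheet
columns = 16_384     # columns in a modern Excel worksheet

cells = rows * columns
print(cells)           # 17179869184
print(f"{cells:.2e}")  # 1.72e+10
```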

There is a high possibility that you know what Macros are. A Macro, in case you don’t know, records a sequence of actions performed by a user in Excel and stores it, so that it can be repeated whenever required by just clicking a button, skipping the entire manual process.

Let me ask you a question: do you know there’s a programming language built into Excel? Well, there is one. An application with this many features benefits from a language that lets users automate and extend it. In MS Excel, this language is VBA, or Visual Basic for Applications, developed by Microsoft.

It is a bit different compared to Python, SQL and DAX/M (the languages used in Power BI) due to its age, but it lets you do tasks that go beyond Excel’s inbuilt Functions. It allows you to record your own custom macros and store them. It also facilitates the creation of new functions.

For example, say you want to subtract 3 from a value and then subtract the value’s square from the result. (Don’t ask why you would do that. Rather, look at the fact that doing this manually over a lot of elements can take years, and no inbuilt function exists for this task.) In such a scenario, VBA comes to the rescue!
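The idea of a user-defined function can be sketched as follows. This is written in Python rather than VBA, to stay with the language covered earlier in this article, and it assumes the operation means (x − 3) − x². The function name and the sample values are made up for illustration.

```python
def minus_three_minus_square(x):
    """Subtract 3 from x, then subtract x squared from the result."""
    return (x - 3) - x ** 2


values = [1, 2, 5, 10]  # stand-ins for a column of cells
results = [minus_three_minus_square(v) for v in values]
print(results)  # [-3, -5, -23, -93]
```

In VBA the equivalent would be a custom function that can then be called from a cell like any inbuilt one, so the logic is written once and applied to as many elements as needed.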

Note: If you’re comfortable with using Excel for your Data needs and would like to skip the rest of the tools, do learn VBA and then your knowledge about Excel and how it works will be complete and at your disposal for any situation you encounter.

Others:

There are other tools and languages that you can use to handle data, analyze it, build models on it and visualize it, such as Traceis, Octave, MATLAB, R Programming, Google Sheets, Alteryx, QlikView, Google Charts, D3.js etc.

Alteryx, D3, QlikView, Traceis

Google your need and you’ll be presented with a list of applications that suit it. For more information on any application or language you’d like to know about, drop a comment down below and I’ll respond with the relevant resources. Most of the tools mentioned in this article are free to use or have a free version. Go ahead and get any of them and start using them on your data! Following are some useful links:

Tableau Public:

Microsoft Power BI:

MySQL:

For more such articles, stay tuned with us as we chart out paths on understanding data and coding and demystify other concepts related to Data Science and Coding. Please leave a review down in the comments. It was a long article, thank you very much for reading it all the way here! Great going!

Yash Gupta
Data Science Simplified

Lead Analyst at Lognormal Analytics and self-taught Data Scientist! Connect with me at - https://www.linkedin.com/in/yash-gupta-dss