A Beginner’s Guide to the World Within Data Science

The Goal

This post aims to touch on the major concepts and concentrations that make up data science. For each respective concept, I hope to effectively communicate a general definition along with an adequate explanation of the role it plays in the field. The following list reflects the relevant sub-topics that I will be covering:

  • Statistics
  • Linear Algebra
  • Programming
  • Machine Learning
  • Data Mining
  • Data Visualization

Statistics

What is it?

Business Dictionary defines statistics as a “branch of mathematics concerned with collection, classification, analysis, and interpretation of numeral facts for drawing inferences on the basis of their quantifiable likelihood”. Or, in simpler terms, statistics is collecting numbers, looking at numbers, and drawing conclusions from numbers.

Significance in Data Science

There are a number of ways that statistics comes into play in data science. These practices can be extremely helpful for interpreting data and producing interesting results.

Experimental Design: If you have a question and want that question answered then, whether you know it or not, you are most likely going to be conducting some sort of experiment to find the answer. This involves setting up that experiment, dealing with sample sizes, control groups, and more.

Frequentist Statistics: Statistical practices such as confidence intervals and hypothesis tests allow you to determine how much a result or data point matters. Being able to calculate significance and other important pieces of information from data will make you a stronger data scientist.

Modeling: Techniques such as regression and clustering are used often in data science for modeling work. Whether you’re trying to predict something, find the bigger picture in data, or understand the reasoning behind data, chances are you will end up using some sort of predictive modeling.
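To make the confidence interval and hypothesis test ideas a little more concrete, here is a minimal sketch using only Python’s standard library. The two samples are made up for illustration, the function names are my own, and I am assuming a normal approximation to keep things short (with samples this small you would normally reach for a t-distribution instead).

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the sample mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # roughly 1.96 for a 95% level
    margin = z * stdev(sample) / math.sqrt(len(sample))
    center = mean(sample)
    return center - margin, center + margin


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (z-test approximation)."""
    standard_error = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical experiment: a control group and a treatment group.
control = [12.1, 11.8, 12.5, 12.9, 11.6, 12.3, 12.7, 12.0]
treatment = [11.2, 11.5, 10.9, 11.8, 11.1, 11.4, 11.0, 11.6]

low, high = confidence_interval(control)
p_value = two_sample_p_value(control, treatment)
print(f"95% CI for the control mean: ({low:.2f}, {high:.2f})")
print(f"p-value for the difference in means: {p_value:.6f}")
```

A small p-value here suggests the gap between the two groups is unlikely to be pure chance, which is the kind of “how much does this result matter” question described above.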

How Important is it?

This is where things get a bit fuzzy and opinions begin to vary. In order to more accurately answer this question, I propose that we break up statistics into two groupings: new and old.

Old statistics such as regression and hypothesis tests are simplistic by nature. While they can be useful, many distinguished data scientists predict that they will be used less and less, becoming less important as statistical techniques evolve. On the other hand, new statistics like decision trees and predictive power are very useful and are used often by data scientists.

All this being said, I still recommend that aspiring data scientists work through general statistical theories and practices. Even if you won’t be using them in everyday work, they train analytical thinking and help you progress to the more advanced concepts that you will use regularly.

Resources

What’s the Difference Between Data Science and Statistics?: A great explanation on what differentiates the two fields.

Data Science and Statistics: Another great article on a similar matter.

7 Ways Data Scientists Use Statistics: Goes into more detail regarding implementation purposes.

Linear Algebra

What is it?

Personally, I found Wikipedia’s definition to be the most helpful: Linear algebra is the branch of mathematics concerning vector spaces and linear mappings between such spaces. This is a pretty good start; however, I believe we can simplify that definition a bit and make it less textbook-like.

Simply put, linear algebra is math dealing with straight stuff in space.

Significance in Data Science

Here are a few of the more prominent use cases for linear algebra in data science today:

Machine Learning: A great many machine learning techniques tie in with aspects of linear algebra. Just to name a few, there is principal component analysis, eigenvalues, and regression. This is especially true when you start working with high-dimensional data, as it tends to be represented with matrices.

Modeling: If you want to model behaviors in any way, you will likely find yourself using a matrix to break down samples into subgroups before doing so in order to establish accurate results. This requires you to use general matrix mathematics including inversion, derivation, and more.

Optimization: Understanding the various versions of least squares is very useful to any data scientist. It can be used for dimensionality reduction, clustering, and more, all of which play a part in optimizing networks or projections.

Some of you may have noticed the repetition of the words matrix and matrices. This is not a coincidence. Matrices are a big part of general linear algebra theory. The concepts used on matrices are equally effective on tables and data frames, two fundamental structures used in data science.
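To show how matrices and least squares fit together, here is a small sketch that fits a straight line by solving the normal equations, beta = (XᵀX)⁻¹Xᵀy, in plain Python. The data points and helper names are made up for illustration, and in real work you would most likely lean on a numerical library rather than hand-written matrix code.

```python
def transpose(m):
    """Swap the rows and columns of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    """Invert a 2x2 matrix using the closed-form formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Made-up observations lying close to the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

X = [[1.0, x] for x in xs]  # design matrix: intercept column plus x
y = [[v] for v in ys]       # observations as a column vector

Xt = transpose(X)
beta = matmul(inverse_2x2(matmul(Xt, X)), matmul(Xt, y))
intercept, slope = beta[0][0], beta[1][0]
print(f"fitted line: y = {intercept:.2f} + {slope:.2f}x")
```

Notice that the whole fit is just a transpose, two multiplications, and an inversion, which is why matrix operations keep coming up in modeling work.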

Resources

Linear Algebra Wikipedia Page: Good for all that textbook-style stuff.

What is the Point of Linear Algebra?: Great thread on Quora. Pay special attention to the first two answers by Sam Lichtenstein and Dan Piponi.

What Concepts of Linear Algebra Should One Master to be a Good Data Scientist?: Another quality Quora thread. The first answer by Lili Jiang was especially insightful.

Linear Algebra for Data Scientists: A good place to dive in if you’re looking for a quick run-through of the basics.

Programming

How Important is it?

If you plan on pursuing a career in data science, you’ll need to learn how to code, and code well. This is the reason why so many data scientists have a computer science background; it’s a big advantage. However, if you are not fortunate enough to be reading this with some programming experience, then don’t worry. Like most things in life, it can be self-taught.

When is it Most Important?

We’ve established that being able to program is an essential skill for data scientists no matter what domain you are in. That said, one-off scripting and typing commands by hand is not where programming really shines in data science; automation is. By writing programs to automate tasks you not only save yourself valuable time later, you also make your code much easier to debug, understand, and maintain.

As a general rule of thumb, if you have to do something more than twice, you should be writing a script or program to automate it.

What Should I Know?

Let’s move on to some of the crucial skills involved with programming in data science. Keep in mind that the following list is more focused on practical skills than specific practices (for instance, time management skills rather than right outer joins).

Development: Many data scientists today go by the name of “software developer” despite performing tasks very similar to those seen in data science. Data scientists who are familiar with software development practices are typically more comfortable than academics when working on larger-scale commercial projects.

Database: Data scientists are constantly using databases. In order to be effective in your work, you will need to have experience in this area. As NoSQL and cloud-oriented databases grow, traditional SQL databases face more competition than they used to. However, employers will still expect you to have a basic knowledge of SQL commands and database design practices.

Collaboration: Collaboration is key in software. You’re undoubtedly familiar with the age-old saying, “A team is only as good as its weakest link”. Well, despite its cliche connotation, this is also true of any data science team. Much of your work will be done in groups. Because of this, you will need to be polished in communicating with your team as well as maintaining relationships in order to maximize productivity.
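As a taste of the basic SQL knowledge mentioned under Database above, here is a minimal sketch using Python’s built-in sqlite3 module with an in-memory database. The orders table and its rows are hypothetical, made up purely for illustration.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 7.5)],
)

# Total spend per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

CREATE, INSERT, and a SELECT with GROUP BY cover a surprising share of everyday database work, so they are a reasonable place to start.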

Important Practices

If you were to ask any software developer or data scientist what the most important aspect of programming is in the workplace, they would surely reply with a one-word, three-syllable answer: maintenance. Simple, maintainable code trumps complex genius code almost always in the workplace. Your code is ultimately irrelevant if other programmers can’t understand it well enough to scale it and maintain it over time. There are a few easy ways to improve the maintainability of your code. They go as follows…

Do not use “hard” values in your code: Instead, use variables and inputs. They are dynamic in nature and will scale over time, as opposed to static values typed directly into the code. This small change in your code now will make your life much easier down the line.

Document and comment your code relentlessly: The most effective way to make your code easier to understand is to comment like crazy. By commenting with concise and informative notes, you will save yourself endless conversations explaining yourself to your peers.

Refactor your code: Remember that once you submit a piece of code, it doesn’t end there. Keep going back to your past work and looking for ways to optimize it and make it more efficient.
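Here is a small sketch of the first two practices in action. The function name, the data, and the default cutoff are all hypothetical; the point is only that the cutoff is a documented parameter rather than a number buried in the logic.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean.

    The cutoff is a parameter rather than a hard-coded number, so callers
    can tighten or loosen it without touching the function body.
    """
    center = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []  # every value is identical, so nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 50]

strict = flag_outliers(readings)                 # default cutoff of 3.0
lenient = flag_outliers(readings, threshold=2.0)
print(strict, lenient)
```

If the cutoff had been typed directly into the comparison, changing it would mean editing the function itself, and every caller would be stuck with the same value.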

Resources

Software Development Skills for Data Scientists: This is a great overview of important soft skills for programming practice.

The 5-Dimensions of the So Called Data Scientist: Interesting take on the different roles a data scientist can take. Pay close attention to the “Programming expert” and “Database expert”.

9 Must-Have Skills You Need to Become a Data Scientist: Good information in the “Technical Skills” section of this short article.

Machine Learning

What is it?

First off, machine learning is a part of the broader field of artificial intelligence. Artificial intelligence is a term coined by John McCarthy in 1956, first defined as “the science and engineering of making intelligent machines”. Within this field, machine learning has become more and more significant over time.

Machine learning allows us to teach computers how to program themselves so that we don’t have to write explicit instructions for certain tasks.

The Two Types

Machine learning can be broken down into two main forms of learning: supervised and unsupervised.

Supervised learning: The majority of practical machine learning today is done using supervised learning. Supervised learning is the process of an algorithm learning from data, producing its expected results, and then being corrected by the user in order for the algorithm to improve in accuracy on the next run. So in layman’s terms, think of the computer algorithm as the student and you as the teacher, correcting it and steering it in the right direction when needed.

Unsupervised learning: While this type of machine learning has less practical use right now, it is arguably the more interesting branch. Unsupervised learning is where the algorithms are left on their own to discover and identify the underlying structures in the data.
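To make the difference concrete, here is a toy sketch on made-up one-dimensional data. I am assuming two deliberately simple techniques for illustration: a nearest-centroid classifier for the supervised case (it learns from labeled examples) and a two-cluster k-means for the unsupervised case (it only ever sees the raw numbers). All names and values are hypothetical.

```python
from statistics import mean


def fit_centroids(points, labels):
    """Supervised: learn one centroid per label from labeled examples."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {label: mean(members) for label, members in groups.items()}


def predict(centroids, point):
    """Assign the label whose centroid lies closest to the point."""
    return min(centroids, key=lambda label: abs(centroids[label] - point))


def two_means(points, iterations=10):
    """Unsupervised: split unlabeled points into two clusters (k-means, k=2)."""
    centers = [min(points), max(points)]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for p in points:
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters


heights = [150.0, 152.0, 155.0, 180.0, 183.0, 185.0]
labels = ["short", "short", "short", "tall", "tall", "tall"]

centroids = fit_centroids(heights, labels)  # the "teacher" supplies the labels
prediction = predict(centroids, 178.0)
clusters = two_means(heights)               # no labels are given at all
print(prediction, clusters)
```

The supervised half needs the labels list to learn anything, while the unsupervised half discovers the two groups on its own and has no idea what to call them.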

Significance in Data Science

Machine learning is undoubtedly a big deal in today’s technology picture. Tony Tether and John Hennessy have already called it “the next internet” and the “hot new thing” respectively. Bill Gates has been quoted on the topic as well, stating that “a breakthrough in machine learning would be worth ten Microsofts”.

With notable applications like the self-driving car, image classification, and speech recognition, one can easily see what all the hype is about concerning machine learning. The field is growing and growing quickly, so hop on the bandwagon now or be left behind.

Resources

What is Machine Learning?: Good thread on Quora with several slightly differing answers that aim to define machine learning.

Short History of Machine Learning: A little more in-depth look at the history of machine learning.

Supervised and Unsupervised Machine Learning Algorithms: Clear, concise explanations of the types of machine learning algorithms.

Visualization of Machine Learning: Easily my favorite resource on this matter. Excellent visualization that walks you through exactly how machine learning is used.

Data Mining

What is it?

If you have perused a good amount of online data science resources, then you have likely seen the term “data mining” before. But what is this practice composed of exactly? After looking at various sources, I think it’s best to describe it as the following:

Data mining is the process of exploring data in order to extract important information.

Glossary

Through my experience, I have come across a number of other topics within data mining that I think will prove helpful to know. Below you’ll find a quick and easy definition list of data mining slang. Keep in mind that differentiating between them can get tricky, seeing as they are all very similar.

Data Wrangling: This is the act of converting data from its raw form into a more usable form. It usually consists of a few vital steps including cleaning and parsing into predefined structures.

Data Munging: The same exact thing as “Data Wrangling” shown above. Why we need two terms to describe this process I may never know…

Data Cleaning: A crucial step that involves detecting and correcting (or removing) corrupt, inaccurate, or missing values from the dataset.

Data Scraping: A technique in which a computer program reads in data coming from another program or website like Twitter.
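Here is a small sketch of what wrangling and cleaning can look like in practice, using plain Python. The raw lines and the three-field layout (name, age, salary) are made up for illustration, and dropping unusable rows is only one of several reasonable ways to handle them.

```python
# Hypothetical raw input: inconsistent spacing, a missing age, a corrupt salary.
raw_rows = [
    "  Alice , 34 , 72000",
    "Bob,,58000",
    "carol,29,not available",
    "Dave,41,91000  ",
]


def parse_row(line):
    """Wrangle one raw comma-separated line into a typed record.

    Returns None when the line cannot be used.
    """
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 3:
        return None
    name, age, salary = parts
    try:
        return {"name": name.title(), "age": int(age), "salary": float(salary)}
    except ValueError:
        return None  # a numeric field was missing or corrupt


records = [parse_row(line) for line in raw_rows]
clean = [record for record in records if record is not None]  # cleaning step
print(clean)
```

Parsing each line into a predefined structure is the wrangling part, and discarding the rows with missing or corrupt values is the cleaning part.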

Significance in Data Science

Everyone wants to make awesome predictive models and put together jaw-dropping visualizations. However, it is often overlooked that none of those things happen until you have performed the ‘Janitor’ work. A recent New York Times article (listed under Resources) found that data scientists spend roughly 50% to 80% of their time collecting and preparing data.

This harsh reality needs to be communicated to prospective data scientists out there. Between the lucrative base salary and the title of Sexiest Job of the 21st Century, young professionals are eyeing data scientist jobs without knowing the reality of the occupation.

Resources

What is Data Mining?: A good thread on Quora with various definitions of data mining.

What is Data Wrangling?: Short elaboration on what data wrangling is composed of exactly.

‘Janitor Work’ is Key Hurdle to Insights: Interesting article that goes into detail regarding the importance of data mining practices in the field of data science.

Data Visualization

What is it?

Data visualization may seem a bit more self-explanatory than other topics. However, I think you will come to realize that there is a bit more to data visualization than meets the eye. Let’s start things off by defining our topic, as redundant as it may seem.

Data visualization is the act of communicating data or results through some sort of picture or chart.

It is a common misconception that the most important part of a visualization is how attractive it looks. While appearance is very important, it is not the main goal. Ultimately, the goal is to communicate the insights you found within the data in the most easily accessible way for the brain. According to NeoMam Studios, researchers found that color visuals increase readers’ willingness to read by 80%.

Common Types

So now let’s look at some common types of visualizations. Keep in mind that this is by no means an exhaustive list. Rather it is simply some of the most common 2-D visualizations that I personally have encountered in my time. So here you go:

Multidimensional: These are graphs and charts dealing with multiple variables. This is far and away the most common form of visualization you will see. Some examples include: pie charts, histograms, and scatter plots.

Time-Driven: All of these visualizations use time as a baseline for communicating data effectively. Any of these can be a powerful tool for conveying change over a period of time. Some examples include: time series, Gantt charts, and arc diagrams.

Geospatial: As one might guess, geospatial visualizations are all concerned with location. These are commonly used for conveying insights about a specific area or region. Some examples include: dot distribution maps, proportional symbol maps, and contour maps.
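To give a feel for what sits underneath one of the charts mentioned above, here is a rough sketch of a histogram built with nothing but plain Python and text bars. The ages and the bin width are made up, and in practice you would use a plotting library to draw this properly.

```python
def histogram_counts(values, bin_width):
    """Count how many values fall into each bin of the given width."""
    counts = {}
    for value in values:
        start = (value // bin_width) * bin_width  # lower edge of the bin
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


ages = [23, 25, 31, 34, 35, 38, 41, 42, 47, 52]
bin_width = 10
counts = histogram_counts(ages, bin_width)

# One row per bin, one '#' per observation.
for start, count in counts.items():
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * count}")
```

Even in this crude form, the shape of the data is easier to take in from the bars than from the raw list of numbers.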

Key Elements

There are a few key elements that all great visualizations of data have in common. Below you will find a list of the ones that I found to be most significant.

Information: This deals with having accurate and consistent information. It doesn’t matter what your final product is if the data isn’t reliable.

Story: Your visualization should have meaning. It should be something that is relevant to a project or society in one way or another. What’s the point of producing a visual if nobody cares to see it?

Function: No matter how elaborate the information you are communicating, it is your job to make it concise and easy to understand. The ability to transform your data into a form that is usable for someone who is not necessarily from a technical background will prove very important. This is especially true in the workplace.

Attractive: Lastly, your visualization needs to be generally good-looking. It should draw people in and be pleasing to the eye. In order to do this you will have to take into account things such as balance, color, consistency, and sizing, among many others.

Significance in Data Science

Data visualization skills are extremely useful for data scientists, no matter what field you would like to concentrate in. Being able to effectively convey your data through images rather than words makes your message that much easier to understand and in turn, gives you a greater chance at making an impact with your work.

Resources

What Makes a Good Data Visualization: Excellent Venn diagram explaining the components of an excellent data visualization. I touched on some of it above, but I would still highly recommend giving it a look.

Duke Introduction to Visualization: A run-through of all the data visualization types as seen by Duke University Libraries.

Data Visualization Quora Thread: Some good questions about the general topic of data visualization for further reading.

Wrap it Up

So as you’ve probably realized by now, data science is a complex field that is composed of a variety of unique sub-topics. Understanding these sub-topics more clearly is a strong step in the right direction for an aspiring data scientist. I hope I was successful in accomplishing this.


Hope this was helpful. Make sure to follow me on Twitter and check out some more of my work at conordewey.com. Thanks for reading!