Data analysis is part of any data scientist’s daily work (along with data munging and cleansing). It is also important for many other people in the modern workforce, including system analysts, business owners, financial teams, and project managers.
However, most undergraduate programs do not (or at the very least, did not) teach the basics of data analysis. There are maths and statistics courses, as well as computer programming courses that involve data structures and algorithms.
Yet none of these focus on how to look at data sets from databases, CSVs, or other data sources available in the modern data world.
There might be an occasional project that requires data analysis. Some people might have been lucky enough to receive a set of projects that forced them to analyze data for the first time out of a database.
However, most students are left to figure it out themselves during their first job.
For students not planning to be programmers, understanding databases and SQL is a valuable skill. It allows them access to data that was once only available to database teams.
Managers are no longer satisfied with their teams not having access to data. Therefore, everyone needs to know how to work with data and derive analysis from it!
Data analysis is abstract. It is not math (although math is involved) and it is not English or accounting. It requires a hands-on approach to truly understand the pitfalls good analysts will run into.
Yet most students have not dealt with vague parameters and large data sets by the time they get their first job, which is a shame! Many students haven’t even heard of a data warehouse, yet this is where most of the data that helps managers make critical decisions resides.
In the modern business world, data analysis is not limited to data scientists. It is also key for analysts, system engineers, financial teams, PR, HR, marketing, and so on.
Thus, our team wanted to create a guide to help both new students and those interested in learning more about data science and analysis.
The Foundation of Good Data Science and Analytics
The first part of this series will cover the important soft skills required for good analysis. Data analysis is not only maths, SQL, and scripting. It is also about staying organized and being able to articulate your discoveries to managers.
This is one of many traits that successful data science and analytics teams display. It is important to point these traits out first because they lay the groundwork for the rest of the series.
After this section, we will discuss analysis processes and techniques, and give examples with data sets, SQL, and Python notebooks.
The term data storyteller has become associated with data scientists, but it is also important for anyone who uses data to be good at communicating their findings.
This skill is a subset of the broader skill of communication. Data scientists have access to multiple data sources from various departments. This gives them the responsibility to clearly explain what they are discovering to executives and subject-matter experts (SMEs) in multiple fields.
They take complex mathematical and technological concepts and create clear and concise messages that executives can act upon.
They don’t hide behind jargon; they translate their complex ideas into business-speak. Analysts and data scientists alike must be able to take numbers and return clearly stated ROIs and actionable decisions.
This means not only taking good notes and creating solid workbooks. It also means creating solid reports and walk-throughs for other teams.
How do you do that? This could be a post in itself but here are some quick tips to better communicate your ideas in a report or presentation.
- Label every figure, axis, data point, etc.
- Create a natural flow of data and notes in a notebook.
- Make sure to highlight your key findings and sell your conclusion. This is easier said than done when you have lots of data to prove your point.
- Imagine you are telling a story or writing an essay with data.
- Don’t bore your audience to death. Keep it short, sweet, and to the point.
- Avoid heavy math jargon! If you can’t explain your calculations in plain English, you don’t understand them.
- Peer-review your reports and presentations to ensure maximum clarity.
The video “The best stats you’ve ever seen” by Hans Rosling is a great example of data storytelling.
Data scientists and analysts aren’t always on the same team as the business owners and managers who come to them with questions. This makes it very important for analysts to listen diligently to what is being asked of them.
In large corporations, there is a lot of value in finding other teams’ pain points and problems and helping to solve them.
This means having empathy. Part of this skill comes from experience in the workforce, and another part simply comes from understanding other human beings.
Why are they asking for the analysis and how can you make it as clear and accurate for them as possible?
Miscommunication with the business owners can happen quite easily. Thus, the combination of listening diligently, as well as listening for what is not being said, is a great asset.
Besides being focused on details, data analysts and data scientists also need to focus on the context behind the data they are analyzing.
This means understanding the needs of the other departments who have requested the project, as well as actually understanding the processes behind the data they are analyzing.
Data typically represent the processes of a business. This could be a user interacting with an E-commerce site, a patient in a hospital, a project getting approved, software being purchased and invoiced, and so on.
All of these get represented in thousands of data warehouses and databases across the world, and they are often stored only slightly differently, with different business rules.
That means that data analysts need to understand those business rules and that logic. Otherwise, they can’t perform good analysis: they will make bad assumptions, and they will often create dirty and duplicate data.
All because they did not understand context. Context allows data-focused teams to make clearer assumptions. They are not forced to spend too much time in the hypothesis-phase where they are testing every possible theory. Instead, they can utilize context to help speed up the process of their analysis.
The metadata (i.e. the context) around data is like gold to a data scientist. It isn’t always there, but when it is, it makes our jobs easier.
Whether you’re using Excel or Jupyter Notebook, it is important for a data analyst to understand how to track their work.
Analysis involves many assumptions, questions, and trains of thought that can be lost if they are not noted down.
It is easy to come back the next day and forget what was analyzed, or how and why different queries and metrics were pulled. Thus, it is important to note everything down diligently as you go. Do not leave it until the next day, because some information will always be lost.
Creating a clear note-taking style makes it easier for everyone involved. We brought this up earlier, in communication, but we’ll mention it again.
Labeling, creating a natural flow of notes, and avoiding business jargon can help everyone involved, even the original note taker. It is pretty embarrassing when even the original note taker does not understand their own notes!
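As a rough illustration of this note-taking habit, here is a minimal sketch that records each analysis step together with the question and the assumption behind it. The `log_step` helper, the field names, and the sample queries are all hypothetical and invented for this example; a notebook cell or a spreadsheet tab would serve the same purpose.

```python
import json
from datetime import datetime, timezone


def log_step(log, question, query, assumption, result_summary):
    """Append one analysis step, with the reasoning behind it, to a log list."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,              # what we were trying to find out
        "query": query,                    # what we actually ran
        "assumption": assumption,          # what we took for granted
        "result_summary": result_summary,  # what we saw, in plain English
    }
    log.append(entry)
    return entry


# Hypothetical steps, invented for illustration only.
analysis_log = []
log_step(
    analysis_log,
    question="Did order volume drop in March?",
    query="SELECT COUNT(*) FROM orders WHERE order_month = '2024-03'",
    assumption="Cancelled orders are still counted as orders.",
    result_summary="Volume is lower than in February; need to check cancellations.",
)
log_step(
    analysis_log,
    question="How many March orders were cancelled?",
    query="SELECT COUNT(*) FROM orders WHERE order_month = '2024-03' AND status = 'cancelled'",
    assumption="The status column is kept up to date.",
    result_summary="Cancellations explain only part of the drop.",
)

# One JSON object per line keeps the log easy to read and easy to reload later.
log_text = "\n".join(json.dumps(entry) for entry in analysis_log)
print(log_text)
```

Writing the assumption next to the query is the important part: it is usually the first thing forgotten by the next day.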
Creative and Abstract Thinking
Creativity and abstract thinking help data scientists better hypothesize possible patterns and features they are seeing in their initial exploration phases.
Combining logical thinking with minimal data points, data scientists can lead themselves to several possible solutions. However, this requires out-of-the-box thinking.
Analysis is a combination of disciplined research and creative thinking. If an analyst is too limited by confirmation bias or process, they might not reach the correct conclusions.
If, on the other hand, they are thinking too wildly and not using basic deduction and induction to drive their search, they could spend weeks trying to answer a simple question as they wander through various data sets without clear goals.
Analysts need to be able to take big problems and data sets and break them down into smaller pieces. Sometimes, the two or three questions asked by a separate team can’t be answered with two or three answers.
Instead, the two or three questions themselves might need to be broken down into small, bite-sized questions which can be analyzed and supported by data.
Only then can the analyst go back and answer the larger questions. This is particularly true for large and complex data sets. It is becoming more and more important to be able to clearly break down analysis into its proper pieces.
Attention to Detail
Analysis requires attention to detail. Just because an analyst or data scientist is a big-picture person doesn’t mean they are not responsible for figuring out all the valuable details of a project.
Companies, even small ones, have lots of nooks and crannies. There are processes on top of processes, and not understanding those processes and their details limits the level of analysis that can be done.
This is especially true when you’re writing complex queries and programming scripts. It is very easy to incorrectly join a table or filter the wrong thing. Thus, it is key to always double- and triple-check your work. Also, if scripts are involved, peer reviews should be too.
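One simple way to double-check a join is to compare row counts and totals before and after it. The sketch below uses Python’s built-in `sqlite3` module with two hypothetical tables, `orders` and `customers`, invented for illustration. The `customers` table deliberately contains a duplicate key, so the join quietly multiplies rows and inflates revenue.

```python
import sqlite3

# Hypothetical tables and values, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 50.0), (2, 11, 20.0), (3, 10, 30.0);
    -- Customer 10 appears twice, so customer_id is not a unique key.
    INSERT INTO customers VALUES (10, 'West'), (10, 'East'), (11, 'West');
""")


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


JOIN_CLAUSE = "FROM orders o JOIN customers c ON c.customer_id = o.customer_id"

rows_before = scalar("SELECT COUNT(*) FROM orders")
rows_after = scalar(f"SELECT COUNT(*) {JOIN_CLAUSE}")
total_before = scalar("SELECT SUM(amount) FROM orders")
total_after = scalar(f"SELECT SUM(o.amount) {JOIN_CLAUSE}")

# If the join were one-to-one on the orders side, the counts would match.
join_is_safe = rows_before == rows_after

print(f"rows: {rows_before} -> {rows_after}")
print(f"revenue: {total_before} -> {total_after}")
if not join_is_safe:
    print("The join changed the row count; check the join key for duplicates.")
```

Here the three orders become five rows after the join, and the revenue total grows with them. A check like this takes seconds and catches one of the most common sources of wrong numbers in a report.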
Analysis requires curiosity. We will get into this when we break down the process. However, a step in the analysis process is listing out all the questions you believe are valuable to the analysis. This requires a curious mind that cares to know the answer.
Why is the data the way it is? Why are we seeing patterns? What can we use to find the answer? Who would know?
These are just some of the questions that can help to point analysis in the right direction. You need to have the drive and desire to know why.
Tolerance of Failure
Data science has a lot in common with other scientific fields, in the sense that there might be 99 failed hypotheses that lead to one successful solution.
Some data-driven companies only expect their machine learning engineers and data scientists to create new algorithms, or correlations, every 12 to 18 months. This depends on the size of the task and the type of implementation required (e.g. process implementation, technical, policy, etc.).
In all this work there is failure after failure and unanswered question after unanswered question, and analysts have to keep going.
The point is to get the answer, or clearly state why you can’t answer the question. However, you can’t just give up because the first few attempts failed.
Analysis can be a black hole for time. Question after question can be incorrect. That is why it is important to have a semi-structured process. One that guides analysts but doesn’t hold them back.
Data Science and Analytics Soft Skills
The skills analysts and data scientists need aren’t all about programming and statistical analysis.
They are also about making sure that the discovered insights are easily transferable. This allows other team members and managers to also gain from the analysis.
Analysts need to be able to do more than just reach a conclusion. They need to be able to create work that is easily reproducible and communicable.
This process doesn’t just save time. It, more importantly, helps leaders trust the analyst’s conclusion.
Otherwise, the analysts might be correct, but if they sound unsure, have bad notes, or are missing even one data point, it can instantly lead to distrust among their leaders.
Sadly, this is very true. Analysts’ work can instantly come into question when even just one data point is incorrect or communicated poorly.
We often recommend that data teams do a walk-through of their reports and presentations just to check for holes. A team member who is good at questioning every angle is great in these situations.
The more your team can pre-answer questions executives may have, the more likely the executives will sign off on the next leg of the project.
The Process of Data Analysis
In the next part, we will describe a process for analyzing data.
We will be setting up basic notebooks and describing simple processes that will help new and experienced data scientists and analysts make sure they are tracking their work effectively.