4 Things I Learned from Data Analysis

Kadek Byan Prihandana Jati
4 min read · Dec 2, 2018


Unlike coding, which has well-established frameworks like TDD (Test-Driven Development), I honestly don't have much experience playing with data or doing data analysis, and I had never heard about how to analyze data efficiently.

Maybe because my background is in writing code, in the field of data analysis I somehow feel lost when trying to analyze things and gain insights.

So, in this post I want to wrap up what I have learned about analyzing data efficiently with the tools we have at hand.

Problem & Question

This always comes first, before any data analysis: defining what the problem is and which questions we have to ask to gain insight. We run into the same thing when we code. We have to extract information about the problem and understand what is actually causing it, until we can formulate several questions that make the problem more visible.

Learn the Data

It's no wonder the Data Scientist role calls for someone who can learn things as fast as possible: when facing a data analysis, they have to learn the data first. Some people say that once we can see the trees, we can imagine what the forest looks like. When we want to analyze some data, learn the data by doing these things (a short notebook sketch follows the list):

  • See the schema first
  • Get the first 10 rows of data (sampling)
  • Understand what one row of data represents. Say we have flights data: one record means one flight, with information about the schedule (departure, arrival), destination airport, origin airport, weather data, type of airplane, the flight ID, and so on.
  • Get to know the processes that create the data as well. Sometimes the data comes from a client that tracks user activity; if our product is, say, a website or a mobile app, the client could be the user's device, and the data reaches us through the client app and our data pipeline infrastructure. Or, outside the IT industry, some data may be gathered from people or from a third party. It is important to know the source of our data, because that knowledge builds the mental model we use when perceiving the data.
  • Some say: be careful with null data. We can count the columns containing null values and do some basic exploration before digging deeper into our data.
  • Further pre-analysis rituals can be found on kaggle.com; there is a lot of basic data-preparation material there if you are curious.
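
Here is a minimal sketch of those first steps in a notebook, assuming pandas and a hypothetical flights.csv export:

```python
import pandas as pd

# Load the (hypothetical) flights dataset
flights = pd.read_csv("flights.csv")

# 1. See the schema first: column names and types
print(flights.dtypes)

# 2. Sample the first 10 rows to get a feel for one record
print(flights.head(10))

# 3. Count null values per column before digging deeper
print(flights.isnull().sum().sort_values(ascending=False))
```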

Keep Detailed & General Analysis Balanced to Avoid Getting Stuck

Data analysis is a series of detailed and general looks at the data. The detail state, for me, is when we drill into the data and look at its smallest parts. For example, still with the flights data, we look for details by filtering flights to a certain country, a certain airport, and a certain time window to find the small things that matter to us. The advantage of this detail state is that we get more focus and a smaller slice of data to look at; the disadvantage is that we might fail to see the impact on the data as a whole. The opposite of the detail state is the general one: doing aggregations and looking at correlations between the columns in our data.

The tool that can act as a guiding star while we analyze is the problem and the questions we defined at the beginning. Before doing any detailed or general analysis, ask: will this query lead us closer to our question and our problem? If it will not, drop the effort, because chasing a lot of "just want to know" information leads us closer to getting stuck.
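
As an illustration of the detail state, here is a rough sketch of that kind of drill-down, continuing from the flights DataFrame above; the column names (origin_country, origin_airport, departure_time) are assumptions for illustration:

```python
# Drill down to a narrow slice: one country, one airport, one month.
detail_view = flights[
    (flights["origin_country"] == "ID")
    & (flights["origin_airport"] == "DPS")
    & (flights["departure_time"].between("2018-01-01", "2018-01-31"))
]
print(detail_view.shape)
```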

Some general (broader) analysis techniques (a short sketch follows the list):

  • Searching for correlations in the data
  • Describing the data with basic statistics like average, min, max, percentiles, standard deviation
  • Creating clusters or a stratified sample from our data
  • Extracting new columns from our data
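
A minimal sketch of these broader looks, again using the flights DataFrame and assumed columns (delay_minutes, origin_country):

```python
# Basic descriptive statistics: mean, min, max, percentiles, std
print(flights.describe())

# Correlations between numeric columns
print(flights.corr(numeric_only=True))

# Extract a new column from existing ones
flights["is_delayed"] = flights["delay_minutes"] > 15

# A simple stratified sample: 1% of the rows from each origin country
sample = flights.groupby("origin_country", group_keys=False).apply(
    lambda g: g.sample(frac=0.01)
)
```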

Some detailed analysis techniques (a short sketch follows the list):

  • Basic operational joins (inner, left, right, left anti, right anti, full outer joins)
  • Window functions in SQL
  • Basic SQL (indeed)
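
Since my sketches here are in pandas rather than SQL, here is a rough equivalent of those operations; the airports lookup table and its columns are made up for illustration:

```python
import pandas as pd

# A small, hypothetical lookup table keyed by airport code
airports = pd.DataFrame({
    "airport_code": ["DPS", "CGK", "SIN"],
    "airport_name": ["Ngurah Rai", "Soekarno-Hatta", "Changi"],
})

# Basic operational join: inner join flights to airport metadata
# (use how="left", "right", or "outer" for the other join types).
enriched = flights.merge(
    airports, left_on="origin_airport", right_on="airport_code", how="inner"
)

# Window-function-style operation, like SQL's
# RANK() OVER (PARTITION BY origin_airport ORDER BY delay_minutes DESC)
enriched["delay_rank"] = (
    enriched.groupby("origin_airport")["delay_minutes"]
    .rank(ascending=False, method="dense")
)
```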

Refactoring: Keep our notebook clean & sleek

Refactoring might be the unsexiest practice from the coding world that I will mention in a data analysis context. But here, refactoring is as simple as keeping our analysis notebook clean and intentional. Clean means we have conventions for naming things and we write a mini documentation, so that when other people get involved in our investigation they can give us comments. Part of this refactoring is also making our "data sources" clean and intentional for the next investigation. For example, when we have 50 GB of data, instead of loading all of it into our notebook, we can write a query that filters those 50 GB down to the subset we actually need and save the "pre-processed" data in our storage. Maybe we save the data right after its first join with another dataset and make that our new source of data; it will speed up our next analyses when we want to do a lot of work there.
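
A minimal sketch of that pattern (in practice the heavy filtering would run in the warehouse or pipeline rather than in pandas; file names and columns are assumptions):

```python
import pandas as pd

# Filter the raw export once and persist the smaller, pre-processed slice
# as a new data source for the next analyses.
flights = pd.read_csv("flights.csv")                 # hypothetical raw export
subset = flights[flights["origin_country"] == "ID"]

# Requires a Parquet engine (pyarrow or fastparquet); any columnar format works.
subset.to_parquet("flights_id_preprocessed.parquet", index=False)

# Later notebooks start from the smaller file instead of the 50 GB original.
flights_id = pd.read_parquet("flights_id_preprocessed.parquet")
```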

To conclude this article, here are the 4 things:

  1. Defining problem & set of questions
  2. Learn the data
  3. Keep details & general things balanced
  4. Refactoring notebooks

I think that's all I have to share. If there is any question regarding this topic, let's have a conversation in the comment section below. Cheers!
