Fantastic data and where to find them

Simple data science methods you should know before starting as a data scientist

Novega Ghufron Arlinka
Binar Academy
7 min read · Feb 15, 2019


Photo by Franki Chamaki on Unsplash

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it… — Dan Ariely 2013

Is that quote still relevant today?

What exactly are data? Nowadays, everyone keeps talking about data or even big data. We read many articles that say data scientist is now the sexiest job of the 21st century.

In 2017, internet usage reached 47% of the world’s population, equivalent to 3.8 billion people. Every click, share, and transaction, even how many times you visit your ex’s Instagram profile, is recorded. With this impressive growth, data are generated 24/7 by countless platforms across the entire universe. And if parallel universes exist, imagine how much data there would be.

We humans need rest, but data never sleep.

By 2020, it’s predicted that 1.7 MB of data will be generated every second for every person. ‘The Big One’ is getting bigger and bigger. As it grows, what exactly should we do with all this data? For example, with an Instagram business account, we can see how many people engaged with our content, how many impressions it earned, and so on. But what’s next? If these insights show impressive growth, how do we keep it going? The ability to make data-driven decisions is crucial for any business, and the data out there are both big and complex.

Data Never Sleeps 6.0 https://www.domo.com/learn/data-never-sleeps-6

Some managers only want a simple chart that they can present at their board meeting each day. The questions are: which data should be presented? And which insights should they take into consideration? Those questions bring us back to a fundamental understanding of data analysis methods.

Data Analysis vs Data Analytics

Wait, before jumping into data analysis methods, what exactly is the difference between data analysis and data analytics? Here we go.

Analysis & Analytics. Source: Menkominfo Digitalent “Concept of Big Data”

Data analytics is a discipline that covers the management of the complete data lifecycle: collecting, cleaning, organizing, storing, analyzing, and even governing data. Data analytics enables data-driven decision-making that is backed by science and factual data, not only by past experience or intuition. In big data environments, data analytics has developed methods that use highly scalable distributed technologies and frameworks capable of analyzing large volumes of data from different sources.

Data analysis, by contrast, is the process of examining data to find facts, relationships, patterns, insights, and/or trends. For example, let’s say a coffee shop wants to analyze how the number of cups of coffee sold relates to daily working hours. The results of such an analysis would support decisions about how many cups of coffee the shop should prepare for a given level of working-hour activity.

So, data analytics is a broader area than data analysis: analysis is one stage of the lifecycle that analytics manages. There are several methods of data analysis, ranging from simple to complex, that can be used to process the data and get meaningful insights from it.

Various methods of Data Analysis:

  • Quantitative Analysis is a technique that focuses on quantifying patterns and finding correlations in the data. The results of this method are numerical insights that can be used for absolute numerical comparisons.
Quantitative Analysis
  • Data Mining is a specialized form of data analysis that targets large datasets. In Big Data environments, the term generally refers to automated, software-based techniques that sift through massive datasets to identify hidden patterns. Through those insights, data mining forms the basis for predictive analytics and business intelligence.
Photo by Mika Baumeister on Unsplash
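
To make "sifting through data to identify hidden patterns" a bit more concrete, here is a minimal sketch in Python that counts which pairs of items are often bought together. The baskets, the `frequent_pairs` helper, and the `min_support` threshold are all assumptions invented for this illustration. Real data mining tools work on far larger datasets with more efficient algorithms.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets, invented for illustration only.
baskets = [
    {"coffee", "croissant"},
    {"coffee", "croissant", "juice"},
    {"tea", "cookie"},
    {"coffee", "cookie"},
    {"coffee", "croissant", "cookie"},
]


def frequent_pairs(baskets, min_support):
    """Count how often each pair of items is bought together and keep
    the pairs that appear in at least `min_support` baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort the items so that (a, b) and (b, a) count as the same pair.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Print the frequent pairs, most common first.
for pair, count in sorted(frequent_pairs(baskets, 2).items(), key=lambda kv: -kv[1]):
    print(pair, count)
```

In this toy dataset, coffee and croissant turn up together more often than any other pair, which is the kind of pattern a shop could act on.
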
  • Statistical Analysis uses mathematical formulas to describe datasets via summarization, such as the mean, median, or mode of the dataset. It can also be used to infer patterns and relationships within the dataset: regression, for example, can be applied to predict the values of a dependent variable. Another example is correlation, which is used to determine whether two variables are related to each other. Say we want to know how a city’s distance from the sea relates to its temperature. We can express the correlation with a coefficient between -1 and +1. Correlation can be applied first to discover whether a relationship exists at all. Then, using the regression equation, we can predict the temperature of a city 250 kilometers away from the sea.
Linear Regression
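
The sea-distance example can be sketched in a few lines of Python. The numbers below are made up for illustration (they are not real measurements), and the helper names `pearson_r` and `fit_line` are assumptions of this sketch. It computes the correlation coefficient first, then fits a least-squares line and uses it to estimate the temperature 250 km from the sea.

```python
from math import sqrt

# Hypothetical observations, invented for illustration only:
# distance from the sea (km) and average temperature (degrees Celsius).
distance_km = [10, 50, 100, 150, 200, 300]
temperature_c = [24.0, 25.0, 26.5, 27.5, 29.0, 31.5]


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient, always between -1 and +1."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    slope = cov / var_x
    intercept = my - slope * mx
    return slope, intercept


r = pearson_r(distance_km, temperature_c)
slope, intercept = fit_line(distance_km, temperature_c)
predicted_250 = slope * 250 + intercept

print(f"correlation coefficient: {r:.3f}")
print(f"estimated temperature at 250 km: {predicted_250:.1f} C")
```

A coefficient close to +1 or -1 suggests a strong linear relationship, which is what makes the regression estimate worth looking at. A coefficient near 0 would mean the fitted line tells us very little.
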
  • Machine Learning is the hottest analysis method right now. It is used for clustering, classification, segmentation, and more. A machine learning model first learns the patterns in some data until it can make decisions by itself based on what it has learned. There are supervised and unsupervised learning methods. For example, supervised learning can be used for classification: sorting customers into labeled classes such as ‘loyal’ or ‘not-loyal’ based on some features. Meanwhile, unsupervised learning can be used for clustering: finding how many groups of customers exist based on similar purchasing history in a marketplace, without any labels. A machine learning model needs to be trained on training datasets until it achieves the expected accuracy, and then evaluated in a testing phase, before it is ready to be deployed in a real implementation.
Classification & Clustering model. Source: Menkominfo Digitalent “Concept of Big Data”
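
The difference between learning with and without labels can be shown with a small, dependency-free sketch. It uses two simple stand-in techniques chosen for this illustration: a 1-nearest-neighbor classifier for the supervised case and a basic k-means loop for the unsupervised case. The customer data, feature choices, and function names are all hypothetical.

```python
from math import dist

# Supervised: every training example carries a label.
# Hypothetical features: (orders per year, average basket value).
labeled_customers = [
    ((24, 80.0), "loyal"),
    ((30, 65.0), "loyal"),
    ((18, 90.0), "loyal"),
    ((2, 20.0), "not-loyal"),
    ((1, 35.0), "not-loyal"),
    ((4, 15.0), "not-loyal"),
]


def classify(features, training_data):
    """1-nearest-neighbor: copy the label of the closest labeled example."""
    _, label = min(training_data, key=lambda item: dist(item[0], features))
    return label


# Unsupervised: the same kind of points, but with no labels at all.
# Hypothetical features: (purchases of product A, purchases of product B).
unlabeled_customers = [(1, 2), (2, 1), (1, 1), (9, 10), (10, 9), (10, 10)]


def k_means(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


print(classify((20, 70.0), labeled_customers))
centroids, clusters = k_means(unlabeled_customers, centroids=[(1, 2), (9, 10)])
print(clusters)
```

Note that the features here are left unscaled to keep the sketch short. In practice, features on very different scales are usually normalized first, and a library such as scikit-learn would normally replace these hand-written loops.
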

Another task that machine learning can do is filtering, such as collaborative and content-based filtering. Filtering is the automated process of finding relevant items in a pool of items. Items can be filtered either based on a user’s own behavior or by matching the behavior of multiple users. On Netflix, for example, the recommender system suggests movies that we might like based on the similarity of users’ behavior (likes, ratings, streaming history, etc.).

Collaborative vs Content-based Filtering. Source: Towards Data Science
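
As a rough illustration of collaborative filtering, the sketch below scores unseen movies for a user by weighting other users' ratings with a cosine similarity. The users, movie titles, and ratings are invented, and this is only one simple way to do it. It is not a description of how Netflix's recommender actually works.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale, invented for illustration only.
ratings = {
    "ana": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "budi": {"Movie A": 4, "Movie B": 5, "Movie D": 5},
    "citra": {"Movie A": 1, "Movie C": 5, "Movie E": 4},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries.
    Movies a user has not rated count as 0."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Rank movies the user has not seen, weighting each other user's
    rating by how similar that user is to the target user."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        for movie, rating in other_ratings.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)


print(recommend("ana", ratings))
```

Here "ana" rates movies much like "budi" does, so the movie that "budi" loved and "ana" has not seen ranks first. A content-based filter would instead compare attributes of the movies themselves, such as genre or cast.
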

Machine learning models are also used to identify anomalies, abnormalities, and deviations, which can be advantageous, such as opportunities, or unfavorable, such as risks.
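
A full machine learning model is not needed to see the idea behind anomaly detection. The sketch below uses a simple statistical stand-in, the z-score, to flag values that sit far from the mean. The sales figures and the threshold of 2 standard deviations are assumptions chosen for this illustration.

```python
from statistics import mean, pstdev

# Hypothetical daily sales figures, invented for illustration only.
daily_sales = [102, 98, 105, 97, 101, 99, 103, 250, 100, 96]


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold,
    i.e. values more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


print(find_anomalies(daily_sales))
```

Whether the flagged day is an opportunity (a campaign that worked) or a risk (a data entry error or fraud) is a question the numbers alone cannot answer. That part still needs a human.
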

  • Semantic Analysis is a method that figures out the meaning of linguistic input in order to construct meaning representations. This analysis produces common-sense knowledge about ‘the world’ by extracting data and constructing models of ‘the world’. For example, say we want to know what people say about ice cream. We can use natural language processing (NLP) to extract information from textual data. By using NLP for text and speech recognition, an ice cream shop can transcribe customer calls into text data that are then mined for commonly recurring reasons for customer dissatisfaction. It doesn’t stop there: by applying sentiment analysis to textual data, we can also determine the bias or emotion of individuals, which the ice cream shop can use to identify customer satisfaction or dissatisfaction early, gauge product success or failure, and spot new trends. As another example, we can determine how people felt about the candidates in the 2016 US presidential election by capturing their tweets on Twitter.
Sentiment Analysis of People’s Tweets about Election Candidates in the USA 2016. Source: https://rampages.us/giny/2017/03/05/twitter-research/
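
The ice cream shop example can be approximated with a very small lexicon-based sketch: count positive and negative words in each review, then tally the most common complaints. The word lists and reviews below are invented for illustration, and real sentiment analysis relies on much richer language models than a hand-picked word list.

```python
import re
from collections import Counter

# Hypothetical word lists, invented for illustration only.
POSITIVE_WORDS = {"love", "great", "delicious", "creamy", "friendly"}
NEGATIVE_WORDS = {"melted", "bland", "slow", "expensive", "rude"}

# Hypothetical customer reviews.
reviews = [
    "I love this creamy ice cream!",
    "The ice cream was melted and the staff were rude.",
    "Too expensive, and my cone was melted.",
    "I bought ice cream.",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment(text):
    """Label a text by comparing its counts of positive and negative words."""
    words = tokenize(text)
    score = sum((w in POSITIVE_WORDS) - (w in NEGATIVE_WORDS) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def top_complaints(texts, n=3):
    """Most frequent negative words across all texts."""
    counts = Counter(w for t in texts for w in tokenize(t) if w in NEGATIVE_WORDS)
    return counts.most_common(n)


for review in reviews:
    print(sentiment(review), "|", review)
print(top_complaints(reviews, 1))
```

Even this crude approach surfaces a recurring reason for dissatisfaction in the toy reviews. It will, however, miss sarcasm, negation ("not great"), and any word outside its lists.
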

Conclusion:

  • Data are like cockroaches: once they are born, they keep growing, to infinity and beyond.
  • The ability to make data-driven decisions is crucial for any business.
  • By understanding the fundamental concepts of data analysis, we can be wiser in determining which data we should capture, analyze, and visualize in order to make accurate data-driven decisions.
  • Data analysis methods should be the first thing to learn if you’d like to be an expert in this omnipresent digital way of life.

On the other hand, as data keep growing more complex, the infrastructure behind the data should be one of the important things to consider. Cloud-based virtual computing is one of the solutions for big data management. Let’s discuss it in the next article.

Thank you!
