My First Week of Studying Data Science
I just started an online Data Science program, and last week I was introduced to three new concepts: Exploratory Data Analysis (EDA) and Visualization, Networks, and Unsupervised Learning. Let me share my key takeaways using simple examples.
EDA and Visualization
A data scientist works with data. This data comes from various sources and contains information that the data scientist wants to extract in order to solve a problem, make predictions, respond to a business request, etc. Exploring the data is the first step toward a deeper understanding of it. Several methods are available, such as visualization and descriptive statistics.
Visualizing the data helps to identify patterns or relationships in the data. In the example below, I generated a scatter plot from the Iris dataset, and we can see a relationship between petal length and petal width: the longer the petal, the wider it is. Indeed, the data points seem to form an upward line.
Descriptive statistics summarize the data with key numbers such as the average (mean), the standard deviation (std), the maximum value (max) or the minimum value (min) for variables such as sepal length.
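To make these key numbers concrete, here is a minimal sketch using only Python's built-in statistics module. The sepal lengths are made-up values chosen for illustration, not rows from the actual Iris dataset, and the variable names are my own.

```python
import statistics

# Illustrative sepal lengths in cm (made-up values, NOT taken from the Iris dataset)
sepal_length = [5.0, 4.5, 6.0, 5.5, 7.0, 5.0]

summary = {
    "mean": statistics.mean(sepal_length),
    "std": statistics.stdev(sepal_length),  # sample standard deviation (divides by n - 1)
    "min": min(sepal_length),
    "max": max(sepal_length),
}

# Print each statistic rounded to two decimals
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With a real dataset, a library such as pandas can produce the same kind of summary for every column at once, but the numbers mean the same thing.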
So the goal of EDA is to explore the data and gain an understanding by identifying patterns or relationships within the data. If we have a dataset with a large number of features (high dimensionality), we can perform a Principal Component Analysis (PCA) in order to reduce the number of features while preserving most of the information (dimensionality reduction). Read my learning experience with PCA here. Another way to work with such datasets is to use the Stochastic Neighbor Embedding (SNE) method to visualize the data in a lower dimension and identify any patterns or clusters. Below, we can see how clusters are formed by animal or insect, and also a certain relationship between big animals (elephant, horse and cow) on one hand and insects (spider and butterfly) on the other.
Networks
Network analysis, which is based on graph theory, is a way of studying how objects are connected to each other. The objects, which we call “nodes” or “vertices”, are connected by lines, which we call “edges” or “links”. For example, a road network is composed of cities (nodes) and roads (edges) that connect the cities.
Let’s take Facebook as a social network where our friends are the nodes and our relationships are the edges. Different centrality measures describe different aspects of a node within the network:
- a node with the highest degree centrality is the person with the largest number of edges (the most popular)
- a node with high eigenvector centrality is a popular person that is friends with popular people (well-connected)
- a node with high closeness centrality is the person that can reach other nodes quickly in the network and could best inform the group
- a node with high betweenness centrality is the person who connects different parts of the network, and whose removal could disrupt the network (the bridge)
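Two of these measures are simple enough to compute by hand. The sketch below builds a tiny friendship network and calculates degree and closeness centrality in plain Python. The names and friendships are hypothetical, and the closeness formula assumes every person can reach every other person (a connected network).

```python
from collections import deque

# Hypothetical friendships (undirected edges); the names are made up for illustration
friendships = [
    ("Ana", "Ben"), ("Ana", "Cy"), ("Ben", "Cy"), ("Cy", "Gus"),
    ("Cy", "Dee"), ("Dee", "Eve"), ("Dee", "Fay"), ("Eve", "Fay"),
]

# Build an adjacency list: each person maps to the set of their friends
graph = {}
for a, b in friendships:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(friends) / (n - 1) for node, friends in graph.items()}


def closeness_centrality(graph):
    """(n - 1) divided by the sum of shortest-path lengths to all other nodes.

    Assumes the graph is connected.
    """
    n = len(graph)
    scores = {}
    for start in graph:
        # Breadth-first search gives shortest path lengths in an unweighted graph
        dist = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in graph[current]:
                if neighbor not in dist:
                    dist[neighbor] = dist[current] + 1
                    queue.append(neighbor)
        scores[start] = (n - 1) / sum(dist.values())
    return scores


degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print("Most popular:", max(degree, key=degree.get))
print("Best placed to inform the group:", max(closeness, key=closeness.get))
```

In this small network the same person comes out on top for both measures, but that is a property of my made-up example, not a general rule. For real networks, a library such as NetworkX also offers eigenvector and betweenness centrality, which take more code to write by hand.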
Unsupervised Learning
Imagine you are in a class and a teacher hands you a deck of cards with different types of plants. Your teacher might explain that some plants are trees, some are shrubs and some are herbs, and why. You learn how to identify them based on your teacher’s lessons. This is supervised learning.
In unsupervised learning, you are in the same class with the same deck of cards, but without the teacher. This time, you look at the cards and figure out for yourself how to group them based on the similarities or patterns you pick up, without knowing what the right answer is.
In other words, you have a dataset you wish to explore but you do not know (nor does the computer) what the correct answers are. You run programs for the computer to explore the data on its own and find any patterns within the data.
In our example of unsupervised learning, you have collected data on plants, such as their size, their color, their leaves, whether or not they have flowers, etc. You run a specific algorithm, like clustering, that groups these plants based on similar features. As a result, you get groups of plants (or clusters) that share a common pattern that the algorithm discovered.
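As a rough illustration of clustering, here is a minimal k-means sketch in plain Python. K-means is just one clustering algorithm among several, and I picked it because it is short. The plant measurements are made up, I only use two numeric features (height and leaf length), and I pass in the starting centroids by hand to keep the result reproducible, whereas real implementations usually pick them randomly.

```python
import math

# Made-up plant measurements: (height in cm, leaf length in cm)
plants = [(10, 2.0), (12, 3.0), (11, 2.5), (200, 15.0), (220, 18.0), (210, 16.0)]


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, centroids, iterations=10):
    """Group points around the given starting centroids; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest centroid
        labels = [nearest(p, centroids) for p in points]
        # Step 2: move each centroid to the mean of the points assigned to it
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return labels, centroids


# Simplification: start from the first and last plant instead of random points
labels, centroids = k_means(plants, [plants[0], plants[-1]])
print(labels)
print(centroids)
```

Notice that the code never says which plants are "small" or "large". It only receives the measurements, and the two groups emerge from the data, which is the point of unsupervised learning.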
My first week of study was dense with new concepts and was a great introduction to the next topic: Machine Learning.