Data Science for Beginners: An Overview of What to Learn

Ethan Duong
8 min read · Jan 9, 2023

--

Data scientist has been called the sexiest job of the 21st century, and everyone wants to be sexy. The field has become highly competitive, raising the standard for employment.

Therefore, knowing how to use different tools is not enough; job seekers need to grasp the fundamental concepts and techniques, then apply them to create value (even when the impact is small).

This blog will discuss how to get started with data science and highlight the learning technique I have found most effective.

An Overview of Data Science Concepts

Combination of:

  • Mathematics
  • Statistics
  • Programming skills

-> A foundation in math and statistics will help you understand data science concepts. Programming skills will help you work with the various tools.

Applications

In the end, the purpose of data science is to extract meaningful insights from data.

Some of the most popular fields in Data Science: Natural Language Processing, Computer Vision, Machine Learning, Statistics, Mathematics, Programming, Data Analytics, and Business Intelligence.

Data science has many important applications in the above fields, such as:

  • Image Classification and Object Detection
  • Fraud and Anomaly Detection
  • Healthcare Management
  • Language translation and Text Analytics
  • Remote Sensing

Three important roles:

  • Data Analyst: analyzes data to support business decisions.
  • Data Scientist: extracts valuable information from big data.
  • Data Engineer: builds and maintains data pipelines.

How to learn Data Science (or anything else)

The more you find out about data science, the broader it becomes. It might make you feel overwhelmed because there are so many things to learn. Taking a few online courses or relying on online certificates is not enough to keep yourself motivated. Therefore, it is important to have a good strategy so you can learn efficiently.

From personal experience, the method I am using has proven effective, for me at least: it keeps me motivated and enables me to apply what I learn to create real value.

Learning method - “Project based learning”:

  1. Study basic concepts: quickly go through articles, online courses, and overview reports to grasp the basic concepts of the tool or technique (4–7 days).
    Example: if you want to learn Python, read overview articles about the language to understand its basic syntax, information sources, simple data structures, and basic applications.
  2. Work on projects: pick a simple project that fits your level and start working on it. You can keep updating your knowledge while working on the project. This process may take longer because you will have to do a lot of research and watch multiple tutorials (2–4 weeks).
  3. Repeat: repeat steps 1 and 2, but this time learn more complicated concepts and choose higher-level projects. How often you repeat this step depends on how much you want to master your skills.

Keep in mind:

  • You will never reach a point where you know everything about a topic or skill.
  • You will have to go through deliberate practice, putting sustained effort into improving your performance.
  • Do research and set a goal (using the SMART criteria), then aim single-mindedly at it.
  • Tell yourself that you will complete it no matter what, with no excuses. Keep believing that, whatever the outcome, you will end up in a better place.

What to learn in Data Science

At the beginning, you should pick one field in data science (mentioned above) and focus on it so you do not get overwhelmed by too many options.

Based on multiple articles, research, tutorials, and personal experience, this is my personal answer to the question: “Which concepts are must-haves to work in the field of data science?”

  • Basic knowledge
  • Mathematics and Statistics
  • Working with database
  • Python and its libraries
  • Data cleaning
  • Exploratory Data Analysis
  • Visualization

1. Basic knowledge

This includes information outside the technical aspects, related more to real-world situations. Some of the things you should know: the definition of data science, educational backgrounds, job characteristics, the nature of the work, salaries, global trends, and what the field means to you personally. Read and follow the news to keep this information current.

For example: even if you like data science, if you don't have a degree related to math, programming, or statistics, you will have to accept that it will be harder to compete with people who hold these degrees, as they have the advantage of an educational background (more or less).

2. Mathematics and Statistics

Mathematics

  • Linear Algebra: this branch of math is extremely useful in machine learning, as most machine learning models can be expressed in matrix form. A dataset itself is represented as a matrix. Linear algebra is used in data preprocessing, data transformation, and model evaluation.
  • Probability: helps to predict the unknown outcome of an event, which allows data scientists to evaluate the certainty of the outcomes of their work. Key concepts include probability distributions, statistical significance, exploratory data analysis, pattern analysis, hypothesis testing, and regression.
  • Calculus: this branch of mathematics deals with methods based on summing infinitesimal differences to determine and describe derivatives and integrals of functions. Deep learning and machine learning both rely heavily on the idea of gradient descent, and only those with a working knowledge of calculus can fully understand how such optimization methods work.
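As a toy illustration (invented data, not from the article) of how these branches meet in practice, the sketch below fits a line with gradient descent: the dataset is a NumPy array (linear algebra), and the update rule uses the partial derivatives of the loss (calculus).

```python
import numpy as np

# Fit y = w*x + b by gradient descent on mean squared error.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)                   # the dataset is just an array of numbers
y = 3.0 * X + 2.0 + rng.normal(0, 0.1, size=50)   # true w = 3, b = 2, plus a little noise

w, b = 0.0, 0.0
lr = 0.01                                          # learning rate
for _ in range(2000):
    error = (w * X + b) - y
    # Partial derivatives of mean squared error with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # converges close to the true values 3.0 and 2.0
```

The loop is exactly the "follow the negative gradient downhill" idea the calculus bullet refers to, just in two parameters instead of millions.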

Statistics

  • Descriptive statistics: learn about location estimates (mean, median, mode, trimmed statistics, and weighted statistics) and the measures of variability used to describe data. This is the initial stage of analyzing quantitative data, and the results can easily be visualized with graphs and charts.
  • Inferential statistics: involves defining business metrics, running A/B tests, designing hypothesis tests, and analyzing collected data and experiment results using confidence intervals, p-values, and alpha values.
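The location and variability estimates above can be sketched with Python's built-in statistics module (the ages data is invented for illustration):

```python
import statistics

ages = [23, 25, 25, 29, 31, 35, 35, 35, 40, 62]

mean = statistics.mean(ages)      # sensitive to the outlier 62
median = statistics.median(ages)  # robust middle value
mode = statistics.mode(ages)      # most frequent value
stdev = statistics.stdev(ages)    # sample standard deviation (a variability estimate)

# A trimmed mean drops the extremes before averaging (here: one value from each end)
trimmed = statistics.mean(sorted(ages)[1:-1])

print(mean, median, mode)  # 34 33.0 35
```

Comparing the mean (34) with the trimmed mean (31.875) already shows how a single outlier pulls a non-robust estimate upward.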

3. Working with database

This section covers the overlap between the Data Scientist and Data Engineer roles: developing pipelines that collect data from several sources and consolidate it into a single warehouse. The data needs to be represented in a highly usable format for further analysis.

Beginners can start by learning SQL and then move on to one RDBMS, such as MySQL or Oracle, and one NoSQL database. It is also worth taking elementary courses in cloud technologies and in frameworks like Agile and Scrum.
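As a minimal sketch of those SQL basics, the example below uses Python's built-in sqlite3 module, so no database server is needed; the table and column names are invented for illustration.

```python
import sqlite3

# An in-memory database: everything disappears when the connection closes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table and insert a few rows
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0)],
)

# Aggregate with GROUP BY: total sales per region
rows = cur.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 200.0), ('south', 200.0)]

conn.close()
```

The same SELECT / GROUP BY ideas carry over directly to MySQL or Oracle; only the connection setup changes.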

4. Python and its libraries

The Python programming language is widely used in scientific and research groups because it is simple and has a clean syntax. Moreover, Python has a huge set of libraries, such as NumPy, Pandas, Matplotlib, and Scikit-learn, that allow data scientists to work with data more efficiently.

Beginners should start with basic Python by taking a course on Udemy or Coursera. Some key constructs are lists, sets, tuples, dictionaries, and functions (remember to apply the learning method above).
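A quick sketch of those constructs (the values are invented for illustration):

```python
prices = [10, 20, 20, 30]          # list: ordered, allows duplicates
unique_prices = set(prices)        # set: unordered, duplicates removed
point = (3, 4)                     # tuple: ordered and immutable
stock = {"apples": 5, "pears": 0}  # dictionary: key -> value mapping

def total(values):
    """Function: sums a sequence of numbers."""
    return sum(values)

print(total(prices))          # 80
print(sorted(unique_prices))  # [10, 20, 30]
print(stock["apples"])        # 5
```

Once these feel natural, most beginner data tasks reduce to combining them: lists of records, dictionaries of counts, functions to transform them.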

5. Data cleaning

Data scientists spend most of their time cleaning data, and it is a mandatory job for beginners. You simply cannot get unbiased results from analyzing an uncleaned dataset.

Data cleaning is the process of identifying and fixing incorrect data. Below are the common steps in the data cleaning process:

  1. Remove irrelevant data
  2. Remove duplicates
  3. Standardize capitalization
  4. Convert data types
  5. Handle outliers
  6. Fix errors
  7. Translate language
  8. Handle missing values

I always start the process with a spreadsheet or Python (depending on the amount of data), as they offer simple and straightforward methods.
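A hypothetical sketch of a few of the steps above in pandas; the column names and values are invented, and the comment numbers refer to the step list.

```python
import pandas as pd

# A tiny "dirty" dataset: inconsistent capitalization, a duplicate row,
# numbers stored as text, and a missing value.
df = pd.DataFrame({
    "city": ["Hanoi", "hanoi", "Da Nang", None],
    "temp": ["30", "30", "28", "26"],
})

df["city"] = df["city"].str.title()        # 3. standardize capitalization
df["temp"] = df["temp"].astype(int)        # 4. convert data type (text -> int)
df = df.drop_duplicates()                  # 2. remove duplicates
df["city"] = df["city"].fillna("Unknown")  # 8. handle missing values

print(df)
```

The order matters: standardizing capitalization first lets drop_duplicates recognize "Hanoi" and "hanoi" as the same record.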

6. Exploratory Data Analysis

This analysis simply means investigating data to discover unknown patterns, spot anomalies, and test hypotheses with the help of statistics and graphical visualization.

As a beginner, Python would be the perfect tool for conducting EDA.

EDA steps:

  1. Data collection: the process of gathering, measuring, and analyzing accurate data from a variety of sources to find answers to a problem.
  2. Data cleaning: identifying and fixing incorrect data (see section 5).
  3. Univariate analysis: analyzing the data one variable at a time (no causes or relationships). The process describes the data and finds the patterns that exist within it. Common visualization techniques:

  • Box Plot: also called a whisker plot; displays the five-number summary of a dataset: minimum, first quartile, median, third quartile, and maximum.
  • Histogram: a plot that lets you discover and show the underlying frequency distribution (shape) of a set of continuous data.

  4. Bivariate analysis: this process uses two variables and compares them, enabling us to identify how one feature affects the other and to start further analysis to figure out the causes.

  • Scatter Plot: a two-dimensional data visualization that uses dots to represent the values of two different variables, one plotted along the x-axis and the other along the y-axis.
  • Bar Chart: represents categorical data, with rectangular bars having lengths proportional to the values that they represent.
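The univariate and bivariate plots described above can be sketched with Matplotlib on invented data (assuming Matplotlib is installed; the output file name eda_plots.png is arbitrary):

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; no display window needed
import matplotlib.pyplot as plt
import random

# Invented data: heights, and weights loosely correlated with them
random.seed(0)
heights = [random.gauss(170, 8) for _ in range(200)]
weights = [0.9 * h - 90 + random.gauss(0, 5) for h in heights]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(heights, bins=20)           # univariate: frequency distribution of one variable
ax1.set(title="Histogram", xlabel="height (cm)")
ax2.scatter(heights, weights, s=10)  # bivariate: relationship between two variables
ax2.set(title="Scatter plot", xlabel="height (cm)", ylabel="weight (kg)")
fig.tight_layout()
fig.savefig("eda_plots.png")
```

Reading the two panels side by side mirrors the EDA workflow: first describe each variable alone, then look for relationships between pairs.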

7. Visualization

Data visualization forms the backbone of all analytical projects. It helps in gaining insights into the dataset and is also used during data pre-processing. Having the right set of visualizations for different data types and business scenarios is the key to communicating results effectively.

Chart types and when to use them

Source: https://www.reddit.com/r/datascience/comments/bo8a0c/the_fun_way_to_understand_data_visualization/

Powerful visualization tools recommended for beginners:

  • Tableau: one of the most widely used data visualization tools. Tableau is built on scientific research to make analysis faster, easier, and more intuitive.
  • Power BI: an interactive data visualization software product developed by Microsoft with a primary focus on business intelligence.
  • Google Charts: one of the major players in the data visualization market. Coded with SVG and HTML5, Google Charts is famed for its capability to produce graphical and pictorial data visualizations.
  • Jupyter: a web-based application and one of the top-rated data visualization tools, enabling users to create and share documents containing visualizations.

Note:

  • You can pick just one tool and learn how to use it well.
  • Learning to use visualization tools is not as important as being able to use the right techniques to lay out your arguments.
  • A visualization is good when it speaks for itself without requiring the reader to read an explanation.
  • The main purpose of visualization is to convey a message rather than to lay out all the information.

Writer note:

This blog is my overview of the field of data science; you might want to check other sources for more detailed information.

Thank you for reading, I hope it helps!

References:

What is exploratory data analysis. Chartio. (n.d.). Retrieved January 9, 2023, from https://chartio.com/learn/data-analytics/what-is-exploratory-data-analysis/

Madhugiri, D. (2023, January 2). Ultimate Data Science Roadmap for 2023. Retrieved January 9, 2023, from https://www.knowledgehut.com/blog/data-science/data-science-roadmap

Simplilearn. (2022, December 16). 23 best data visualization tools for 2023. Simplilearn.com. Retrieved January 9, 2023, from https://www.simplilearn.com/data-visualization-tools-article


Ethan Duong

The place to share what I've learned, mostly tech-related ! Trying to keep the knowledge from fading overtime :) Reach me at ethan.duong1120@gmail.com