# All You Need to Know to Break into the Data World and Machine Learning

Data science has been called the “sexiest job of the 21st century”. Terabytes of data are produced every day, and it is time to take action! Many people are trying to break into one of the data-related fields; however, with so much confusion between the subfields and so many resources available on the web, it is easy to get lost on where to start. Many people end up learning a general set of skills and becoming data science generalists.

This is why we decided to create this article, which helps you discover the main data-related fields and choose the one that best suits you. We also summarize the competencies required for each subfield so you have an action plan for what to do next!

The roadmap here covers the **four** most frequent jobs in data and the required skills for each one. We will cover high-level details to help you discover what skills you are still lacking.

# Data Science

Data science can best be described as the “art of dealing with data”. As a data scientist, you are not simply using a programmatic tool to get from point A to point B. Instead, you start by defining point A and then map out all the possible paths from that point: you explore your input data, make assumptions, state hypotheses formally, test those hypotheses using different statistical and mathematical tools, design and run experiments if needed, evaluate the current cycle, develop programmatic tools if needed, and more.

Data science has three main components:

1. Machine learning & computer science skills

2. Math and statistics

3. Domain-related knowledge

Data science can be practiced with different technology stacks and tools. Here, we’ll start by listing the skills required in the Python stack.

**Skills required in the Python track**

- Familiarity with NumPy, pandas, sklearn, and matplotlib.
- Strong SQL skills; NoSQL skills are also highly sought after. This includes designing normalized schemas, good indexing techniques, and writing efficient queries.
- Data cleaning
- Good data visualization skills (tools like Tableau or libraries like matplotlib, seaborn, Bokeh, etc.)
- Statistical analysis skills. This includes familiarity with the different types of statistical questions.
- Experiment design and statistical testing (parametric and non-parametric testing)
- Familiarity with big data frameworks and infrastructure (Spark, Hive, Hadoop, MongoDB, etc.)
- Machine learning skills (the required skill level varies widely based on the business logic)
- Strong understanding of the full cycle of data science (stating a sharp question, exploratory data analysis, inference, formal statistical modeling, interpretation, and communication)
- Storytelling skills (PowerPoint, etc.)
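To make the SQL bullet concrete, here is a minimal sketch of a normalized schema, an index, and a check that a query actually uses that index. It relies only on Python's built-in `sqlite3` module; the `users` and `orders` tables, the index name, and the data are made up for illustration, and other database engines report their query plans differently.

```python
import sqlite3

# Hypothetical schema for illustration: user attributes live in one table
# only, and orders point to users through a foreign key (normalized design).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        country TEXT NOT NULL
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        amount REAL NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO users (id, country) VALUES (?, ?)",
    [(i, "EG" if i % 2 else "DE") for i in range(1, 101)],
)
conn.executemany(
    "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
    [(1 + i % 100, float(i % 50)) for i in range(5000)],
)

QUERY = "SELECT SUM(amount) FROM orders WHERE user_id = ?"


def query_plan(connection, sql, params):
    """Return SQLite's description of how it will run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


# Without an index SQLite has to scan the whole orders table.
plan_before = query_plan(conn, QUERY, (42,))

# With an index on the filtered column it can jump to the matching rows.
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = query_plan(conn, QUERY, (42,))

total_for_user_42 = conn.execute(QUERY, (42,)).fetchone()[0]

print("before index:", plan_before)
print("after index: ", plan_after)
```

Comparing the two printed plans is a quick habit for checking whether a slow query is scanning a table that an index could serve.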

Data science is a very broad field. You will usually need to acquire new skills based on the task you are assigned (how to build recommender systems, sequence modeling, etc.); we have only covered the essential skill set here.

# Data Analysis

Data analysis is basically about answering a business-related question using data. This question can be:

- Descriptive: You are simply describing the data sample you have and its related statistics. You are not interested in data outside your sample.
- Exploratory: You are exploring patterns, trends, seasonality, relationships, and distributions in the data. This is usually done using exploratory data analysis and visualization tools.
- Inferential: You are trying to draw a conclusion about the wider population from the sample you have, using hypothesis testing and other statistical testing techniques.
- Predictive: You are using statistical tools to predict some values based on other variables, such as revenue or new users' behavior.
- Causal: This type of question usually requires running one or more experiments to test for a causal relationship between two or more variables.
- Mechanistic: This one asks about the underlying mechanism linking two sets of variables. It is usually hard to uncover in an uncontrolled environment.
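The difference between a descriptive and an inferential question can be sketched in a few lines. The example below first summarizes two samples (descriptive), then uses a permutation test, one common non-parametric technique, to ask whether the gap between them is likely to hold beyond the sample (inferential). The session-length numbers and the names `design_a` and `design_b` are invented for illustration, and the code uses only the standard library.

```python
import random
import statistics

# Made-up samples: session lengths in minutes for two page designs.
design_a = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.8, 12.5, 9.9, 11.0]
design_b = [13.9, 12.7, 14.8, 13.1, 15.2, 12.9, 14.1, 13.6, 14.4, 12.8]

# Descriptive: summarize only the sample at hand.
summary = {
    name: {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}
    for name, values in (("A", design_a), ("B", design_b))
}


def permutation_test(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the share of random relabelings whose absolute mean difference
    is at least as large as the observed one (an approximate p-value).
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(x) - statistics.mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        left, right = pooled[:len(x)], pooled[len(x):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Inferential: is the observed gap plausible if the design made no difference?
p_value = permutation_test(design_a, design_b)

print(summary)
print("approximate p-value:", p_value)
```

A small p-value suggests the difference is unlikely to be a labeling accident; it does not by itself establish causation, which is what the causal question type and experiment design address.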

Data analysis can be considered a subfield of data science, usually suited to professionals with little or no technical background. It usually requires statistics and domain-related experience.

Until now, most data analysts have used tools like SPSS and similar ones; however, there is a growing trend toward hiring data analysts with R or Python skills, since these languages offer more powerful tools for predictive analytics and big data.

**Skills required in the Python track**

- Familiarity with NumPy, pandas, sklearn, and matplotlib
- Strong SQL skills; NoSQL skills are also highly sought after. Normally this includes writing efficient queries.
- Good data visualization skills (tools like Tableau, or libraries like matplotlib, seaborn, etc.)
- Statistical analysis skills
- Experiment design and statistical testing
- Understanding of basic predictive analytics tools like regression models, clustering, cohort analysis, etc.
- Strong understanding of the full cycle of data science (stating a sharp question, exploratory data analysis, inference, formal statistical modeling, interpretation, and communication)
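As a small illustration of the regression models mentioned above, here is a sketch of simple linear regression fitted with the closed-form least-squares formulas. The ad-spend and revenue figures are invented, and `fit_line` is a hypothetical helper written for this example. In practice you would more likely reach for a library such as sklearn; the standard-library version is used here only so the arithmetic stays visible.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y move together around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: monthly ad spend and revenue, both in thousands of dollars.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
revenue = [12.1, 13.9, 16.2, 17.8, 20.1, 21.9]

slope, intercept = fit_line(ad_spend, revenue)

# Predictive question: what revenue would we expect at a spend of 7?
predicted_revenue = intercept + slope * 7.0

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"predicted revenue at spend 7.0: {predicted_revenue:.2f}")
```

The prediction extrapolates slightly beyond the observed range, so it should be read with more caution than an estimate inside the range.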

# Machine Learning Engineering

AI is the field we use to automate processes that usually require human intelligence, especially in vision and language. Machine learning (ML) is the subfield of AI that does this using data. There are other, non-data-centric approaches in AI.

Machine learning is the most technically intensive of these tracks. It requires a range of technical skills, such as writing efficient queries and building learning algorithms that are efficient in both time and accuracy.

**Skills required in the Python track:**

- Familiarity with NumPy, pandas, sklearn, and matplotlib
- Strong SQL and NoSQL skills are essential.
- Good data visualization skills (tools like Tableau, or libraries like matplotlib, seaborn, etc.)
- Familiarity with big data frameworks and infrastructure (Spark, Hive, Hadoop, MongoDB, etc.)
- Strong understanding of basic ML algorithms (regression, classification, clustering, and dimensionality reduction)
- Feature engineering and hyper-parameter tuning
- Strong intuition for the different optimization algorithms and when to use each one
- Structuring and evaluating ML algorithms
- Understanding of different neural network structures and popular new architectures
- Reinforcement learning
- Strong familiarity with one or more of the deep learning frameworks (TensorFlow, Keras, Caffe, Torch, etc.)
- Network analysis
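Hyper-parameter tuning and model evaluation from the list above can be sketched with a tiny k-nearest-neighbors classifier. The idea shown is generic: pick the hyper-parameter (here `k`) on a validation split, then report accuracy on a held-out split that played no part in tuning. The synthetic data, the split sizes, and the helper names are all assumptions made for this example; frameworks such as sklearn provide ready-made versions of each step.

```python
import math
import random
from collections import Counter


def make_blobs(n_per_class, seed=0):
    """Two made-up 2-D Gaussian clusters, labeled 0 and 1."""
    rng = random.Random(seed)
    points = []
    for label, (cx, cy) in enumerate([(0.0, 0.0), (3.0, 3.0)]):
        for _ in range(n_per_class):
            points.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    rng.shuffle(points)
    return points


def knn_predict(train, point, k):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, evaluation, k):
    """Fraction of evaluation points whose label is predicted correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in evaluation)
    return correct / len(evaluation)


data = make_blobs(n_per_class=100)
train, validation, held_out = data[:120], data[120:160], data[160:]

# Hyper-parameter tuning: choose k using the validation split only.
candidate_ks = [1, 3, 5, 7, 9]
validation_scores = {k: accuracy(train, validation, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: validation_scores[k])

# Evaluation: estimate performance on data that was never used for tuning.
held_out_accuracy = accuracy(train, held_out, best_k)

print("validation scores:", validation_scores)
print("best k:", best_k, "held-out accuracy:", held_out_accuracy)
```

Keeping the held-out split untouched until the end is what makes the final number a fair estimate rather than an optimistic one.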

# Data Engineering

Data engineering is the field concerned with building data pipelines and infrastructure. This job is crucial to any company that has a huge amount of data and is planning to hire a data scientist. Usually, hiring a data engineer comes before hiring a data scientist.

**Skills required in the Python track:**

- In-depth knowledge of SQL and NoSQL solutions
- System architecture skills
- ETL and other data warehousing tools for efficient data storage and retrieval
- Familiarity with AWS or other cloud services for data lakes, data warehousing, etc.
- Big-data-based analytics (i.e., frameworks on top of MongoDB or Hadoop like Spark, Hive, MapReduce)
- Basic understanding of data modeling, ML, and statistical analysis
- Building efficient data pipelines
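An ETL pipeline, mentioned in the list above, boils down to three stages: extract raw records, transform them into a clean shape, and load them into storage that is efficient to query. The sketch below shows those stages at toy scale using only the standard library. The CSV content, the `orders` table, and the function names are invented for illustration; a production pipeline would add logging, scheduling, and a real warehouse.

```python
import csv
import io
import sqlite3

# Made-up raw export; a real pipeline would read from files, APIs, or queues.
RAW_CSV = """order_id,customer,amount,currency
1, alice ,19.99,usd
2,BOB,5.00,USD
3,carol,not_a_number,usd
4,alice,42.50,usd
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize text fields and drop rows with invalid amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine bad rows
        clean.append((
            int(row["order_id"]),
            row["customer"].strip().lower(),
            amount,
            row["currency"].strip().upper(),
        ))
    return clean


def load(rows, connection):
    """Load: write cleaned rows into a table; reruns overwrite by order_id."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, currency TEXT)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows
    )
    connection.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)

# Downstream consumers can now query clean, typed data.
totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
))
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(totals, row_count)
```

Using `INSERT OR REPLACE` keyed on `order_id` is one simple way to make the load step safe to rerun, which matters once a pipeline runs on a schedule.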

All of these fields are still fairly new in industry and not yet well established. That’s why you need to keep up with new skills, popular architectures, papers, etc.

We will follow up with another post covering recommended online courses and degrees for learning each skill, along with a quick dive into each of the bullet points above.