The Basic Question: Why Do You Want To Be A Data Scientist?
It is true that the “hottest job of the 21st century” has all the buzz, glam and lucre, but many enthusiasts are still confused about what this job entails. Fewer still understand what it takes to be a data scientist.
Here are some of the important skills required to become a data scientist:
I’d recommend fluency in Python, including familiarity with the standard libraries (Pandas, NumPy, Matplotlib, seaborn, SciPy, scikit-learn, etc.). More breadth and depth are better, but you have to start somewhere. R also plays a vital role in data science, but mastering Python is a must. Below is a list of the most in-demand programming languages.
The responsibilities of data analysts can vary across industries and companies, but fundamentally, data analysts use data to draw meaningful insights and solve problems. They analyze well-defined sets of data using data visualization tools like Python, Tableau and Power BI to answer tangible business needs: e.g. why sales dropped in a certain quarter, how internal attrition affects revenue, or how we can improve performance metrics like CSAT.
This involves the following tasks:
Feature engineering is one of the most important tasks in data analysis. The features in your data directly influence the predictive models you use and the results you can achieve. Identifying features, finding correlations between variables, and deciding which features to keep and which to drop all play a pivotal role in data analysis.
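As a small sketch of checking the correlation between a feature and a target, here is a plain-Python Pearson correlation (in practice you would use `pandas.DataFrame.corr()`); the e-commerce numbers below are made up for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical e-commerce data: does time on the app track yearly spend?
time_on_app = [12.1, 10.5, 15.2, 11.0, 14.8]
yearly_spend = [580, 510, 690, 530, 660]
r = pearson(time_on_app, yearly_spend)  # close to 1 => strong positive correlation
```

A feature with correlation near zero against the target is a candidate to drop; one near ±1 is a candidate to keep.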
Data wrangling is the process of cleaning, structuring and enriching raw data into a desired format for better decision making in less time. Data wrangling is increasingly ubiquitous at today’s top firms. Data has become more diverse and unstructured, demanding increased time spent culling, cleaning, and organizing data ahead of broader analysis.
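A minimal wrangling sketch in plain Python (pandas would be the usual tool): the raw records below are hypothetical, and the cleaning steps — trimming whitespace, parsing numbers, dropping incomplete rows — are typical examples rather than a fixed recipe:

```python
# Hypothetical raw export: inconsistent whitespace, numbers stored as text,
# and a missing value
raw = [
    {"name": " Alice ", "spend": "580"},
    {"name": "Bob",     "spend": ""},      # missing spend
    {"name": "  Carol", "spend": "690 "},
]

def wrangle(records):
    """Clean raw records: trim text, parse numbers, drop incomplete rows."""
    clean = []
    for row in records:
        spend = row["spend"].strip()
        if not spend:                      # drop rows with missing spend
            continue
        clean.append({"name": row["name"].strip(), "spend": float(spend)})
    return clean

clean_records = wrangle(raw)               # 2 structured rows ready for analysis
```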
EDA(Exploratory Data Analysis)
It is good practice to understand the data first and gather as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with modeling.
For example, while doing exploratory analysis on an e-commerce company’s data, we might come to know that Yearly Amount Spent has a positive correlation with Time on App.
The main goal is to visualize data and statistics, interpreting the displays to gain information. Data visualization is useful for data cleaning, exploring data structure, detecting outliers and unusual groups, identifying trends and clusters, spotting local patterns, evaluating modeling output, and presenting results.
Data visualization can be done using tools like Tableau, Power BI, Google Data Studio, Plx Dashboard, and Python libraries such as Matplotlib, seaborn and Plotly.
Below is a data visualization done using seaborn to find the correlation between features in e-commerce data.
Maths and Statistics:
Learning the theoretical background for data science or machine learning can be a daunting experience, as it involves multiple fields of mathematics and statistics, and a long list of online resources.
However, if you are a beginner in machine learning looking to get a job in industry, I don’t recommend studying all the math before starting actual practical work; this bottom-up approach is counter-productive, and you’ll get discouraged because you started with the theory (dull?) before the practice (fun!).
My advice is to do it the other way around (a top-down approach): learn how to code, learn how to use the PyData stack (Pandas, sklearn, Keras, etc.), get your hands dirty building real-world projects, and use library documentation and YouTube/Medium tutorials. THEN you’ll start to see the bigger picture and notice the theoretical background you lack to actually understand how those algorithms work. At that moment, studying math will make much more sense to you!
I will divide the resources into three sections (Linear Algebra, Calculus, Statistics and Probability). The list is in no particular order, and the resources are diversified among video tutorials, books, blogs, and online courses.
- Khan Academy Linear Algebra series (beginner friendly).
- Coding the Matrix course (and book).
- 3Blue1Brown Linear Algebra series.
- fast.ai Linear Algebra for coders course, highly related to modern ML workflow.
- First course in Coursera Mathematics for Machine Learning specialization.
- “Introduction to Applied Linear Algebra — Vectors, Matrices, and Least Squares” book.
- MIT Linear Algebra course, highly comprehensive.
- Stanford CS229 Linear Algebra review.
Web scraping is the process of automating data extraction in an efficient and fast way. With the help of web scraping, you can extract data from any website, no matter how large, onto your computer. Moreover, websites may have data that you cannot simply copy and paste.
There are different tools for web scraping, such as Scrapy and BeautifulSoup. The most common is BeautifulSoup, which parses the HTML of a web page. The data collected by web scraping is in an unstructured format; we convert it into structured data for analysis and exploration.
Basic knowledge of web scraping helps us turn unstructured data into a structured format and perform analysis.
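As a self-contained sketch of that unstructured-to-structured step, here is a scraper built on Python’s standard-library `html.parser` (BeautifulSoup offers a richer API for the same job); the product page below is a made-up snippet standing in for HTML you would fetch from a site:

```python
from html.parser import HTMLParser

# Hypothetical product page, fetched earlier (e.g. with urllib or requests)
html = """
<html><body>
  <h2 class="product">Widget A</h2><span class="price">$19.99</span>
  <h2 class="product">Widget B</h2><span class="price">$24.50</span>
</body></html>
"""

class PriceScraper(HTMLParser):
    """Turn unstructured HTML into structured (product, price) records."""
    def __init__(self):
        super().__init__()
        self._field = None
        self.rows = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("product", "price"):
            self._field = cls          # remember which field the next text fills

    def handle_data(self, data):
        if self._field == "product":
            self.rows.append({"product": data.strip()})
        elif self._field == "price":
            self.rows[-1]["price"] = data.strip()
        self._field = None

scraper = PriceScraper()
scraper.feed(html)
# scraper.rows now holds structured records ready for analysis
```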
Machine Learning, Deep Learning and AI:
You may have heard terms like data science, artificial intelligence, machine learning and deep learning. Within these fields, there are still many words that arouse curiosity.
What is artificial intelligence?
Artificial intelligence (AI) is based on the idea of a machine or computer program being able to think (reason), understand and learn like humans.
From this definition of intelligence, we can also say that artificial intelligence is the study of the possibility of creating machines able to apply knowledge received from data to manipulate their environment.
What is Machine learning?
Artificial intelligence is a very vast field. Machine learning (ML) is a subset of artificial intelligence: a set of statistical tools that learn from data and make predictions.
What is Deep Learning?
In classical machine learning, data mostly passes through algorithms that perform linear transformations on it to produce output.
Deep learning is a subset of machine learning in which data goes through multiple non-linear transformations to obtain an output.
‘Deep’ refers to the many steps in this case. The output of one step is the input to the next, and this is repeated to get a final output. Not all of these steps are linear: each step typically applies a linear matrix transformation followed by a non-linear transformation, such as a ReLU or sigmoid activation function.
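A single such step can be sketched in a few lines of plain Python; the weights and inputs below are arbitrary illustrative numbers, not a trained model:

```python
def linear(x, w, b):
    """A linear transformation: weighted sum of inputs plus a bias."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def relu(z):
    """A non-linear transformation: negative values are clipped to zero."""
    return max(0.0, z)

# One "deep" step: linear transform followed by a non-linearity.
# In a deep network, this output becomes the input of the next step.
x = [1.0, -2.0, 0.5]
h = relu(linear(x, w=[0.4, 0.3, -0.2], b=0.1))
```

Without the non-linearity, stacking many steps would collapse into one big linear transformation; the activation function is what makes depth useful.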
In short, to become a data scientist you should master all of these skills. Examples include:
- Linear Regression
- Logistic Regression
- K Nearest Neighbors
- Decision Trees and Random Forests
- Support Vector Machines
- K Means Clustering
- Principal Component Analysis
- Recommender Systems
- Natural Language Processing
- Neural Nets and Deep Learning
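To make one of the algorithms above concrete, here is K Nearest Neighbors sketched in plain Python (in practice you would reach for scikit-learn’s `KNeighborsClassifier`); the two-cluster training set is a made-up toy example:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs.
    """
    dists = sorted(
        (math.dist(features, query), label) for features, label in train
    )
    top_labels = [label for _, label in dists[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Tiny hypothetical dataset: two well-separated clusters
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.8), "b"), ((4.9, 5.1), "b"),
]
prediction = knn_predict(train, (1.1, 1.0))  # nearest neighbors are all "a"
```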
The following IDEs can be used for building models:
1. Jupyter Notebook
2. RStudio
The deployment of machine learning models is the process of making your models available in production environments, where they can provide predictions to other software systems.
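One common first step of deployment is serializing the trained model so a separate serving process can load it and answer predictions (real deployments typically wrap the loaded model in an API, e.g. with Flask or FastAPI). The `MeanModel` class below is a hypothetical stand-in for a trained model:

```python
import pickle

class MeanModel:
    """Hypothetical 'trained model': always predicts its training mean."""
    def __init__(self, values):
        self.mean_ = sum(values) / len(values)

    def predict(self, x):
        # A real model would use the input x; this toy one ignores it.
        return self.mean_

# Training environment: fit the model and serialize it
model = MeanModel([10, 20, 30])
blob = pickle.dumps(model)       # in practice: written to a file or model store

# Production environment: load the model and serve predictions
served = pickle.loads(blob)
```

Note that unpickling runs arbitrary code, so in production only load model files from sources you trust, or use a safer format such as ONNX.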