The Data Science primer for beginners

Megha Sinha
Published in Analytics Vidhya
5 min read · May 7, 2020

Have you wondered what ‘Data is the new oil’ means for your business, or how a digital transformation that leverages analytics, AI and ML could add value to your organization, but felt overwhelmed by the frenzy surrounding the world of data and analytics?

If you answered yes, this primer could be the resource you need to understand the field, engage in meaningful discussions about it, and ride the wave of Data Science.

The ‘Lingo’

A barrage of terminology is used in Data & Analytics conversations, and more often than not the terms are used loosely or interchangeably. This makes the conversation difficult for managers who are new to this world. The terms defined below are by no means exhaustive, but they will serve as a primer to help you get started.

Most commonly used terms in Data Science

The ABC of Machine Learning algorithms

Machine Learning (ML), as stated previously, is the method of enabling systems to learn on their own by providing them with data inputs. At a fundamental level, ML is the process of solving a problem by:

  1. Gathering a cleaned, usable dataset.
  2. Defining an algorithm, i.e. a series of steps to be followed.
  3. Building a model based on the data set.
  4. Training and testing the model to predict solutions to the identified problem.
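The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the dataset is made up, the column meanings (hours of training, calls per day) are hypothetical, and the nearest-centroid algorithm was picked for its simplicity, not because it is the method a real project would necessarily use.

```python
import random

# Step 1: a small, already-cleaned dataset (made-up numbers).
# Each row is (hours_of_training, calls_per_day, label).
DATA = [
    (2.0, 10.0, "low"), (3.0, 12.0, "low"), (2.5, 11.0, "low"),
    (3.5, 9.0, "low"), (1.5, 13.0, "low"), (2.8, 10.5, "low"),
    (8.0, 30.0, "high"), (9.0, 28.0, "high"), (7.5, 32.0, "high"),
    (8.5, 29.0, "high"), (9.5, 31.0, "high"), (7.8, 27.0, "high"),
]


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold some back for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


# Step 2: the algorithm. Nearest centroid: average the examples of each
# label, then assign a new point to the label with the closest average.
def fit(train_rows):
    sums = {}
    for x1, x2, label in train_rows:
        s = sums.setdefault(label, [0.0, 0.0, 0])
        s[0] += x1
        s[1] += x2
        s[2] += 1
    # Step 3: the "model" is simply one centroid per label.
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict(model, x1, x2):
    """Return the label whose centroid is closest to (x1, x2)."""
    return min(
        model,
        key=lambda label: (model[label][0] - x1) ** 2 + (model[label][1] - x2) ** 2,
    )


# Step 4: train on one part of the data, test on the part held back.
train_rows, test_rows = split(DATA)
model = fit(train_rows)
correct = sum(predict(model, x1, x2) == label for x1, x2, label in test_rows)
accuracy = correct / len(test_rows)
print(f"accuracy on held-out rows: {accuracy:.2f}")
```

Holding back some rows for testing is the important design choice here: the model is judged on examples it has not seen, which is closer to how it will be used in practice.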

Machine Learning is classified into supervised, unsupervised, and reinforcement learning. The names are largely self-explanatory. Take the simple example of teaching the English alphabet to a toddler. The teacher uses flashcards, handwritten letters, and/or puzzle pieces depicting the letters A through Z. With repetition, the child learns to tell one letter from another. Training a machine is no different. The flashcards, handwritten letters and puzzle pieces constitute the dataset (labeled, in this case), and the repeated showing of the cards plays the role of the algorithm; the machine then uses what it has learned to predict the outcome when it comes across similar data in the future. This is Supervised Learning: the target that needs to be predicted (e.g. the correct label for an image) is fed to the machine during the learning process.

Unsupervised learning, on the other hand, is learning in which no target is provided in advance: the data is unlabeled, and the system discovers structure in it, such as groups of similar items, on its own over successive iterations. Reinforcement learning can be compared to rewarding a child when he/she displays good conduct, but neither rewarding nor reprimanding otherwise. The child (the machine, in the current context) is thus incentivized to perform better each time.
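To make the contrast concrete, here is a minimal sketch of both ideas in plain Python. Everything in it is an assumption made for illustration: the order values are invented, the two-cluster k-means and the epsilon-greedy strategy are just simple representatives of unsupervised and reinforcement learning, and the function names are hypothetical.

```python
import random


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled numbers into two groups (k-means, k=2).

    Assumes the list contains at least two distinct values.
    """
    lo, hi = min(values), max(values)  # initial guesses for the two centers
    for _ in range(iterations):
        group_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        group_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo = sum(group_lo) / len(group_lo)
        hi = sum(group_hi) / len(group_hi)
    return group_lo, group_hi


def epsilon_greedy(reward_probs, rounds=2000, epsilon=0.1, seed=0):
    """Reinforcement: learn which action pays off using rewards alone."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was tried
    values = [0.0] * len(reward_probs)  # running average reward per action
    for _ in range(rounds):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore
        else:
            action = max(range(len(reward_probs)), key=lambda a: values[a])  # exploit
        # A reward arrives only for "good conduct"; otherwise nothing happens.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return counts, values


# No labels are given: the two customer segments emerge from the data.
order_values = [12, 15, 14, 13, 95, 102, 98, 11, 100]
small_orders, large_orders = two_means(order_values)
print(small_orders, large_orders)

# The learner is never told which action is better; it finds out via rewards.
counts, values = epsilon_greedy([0.2, 0.8])
print(counts)
```

In the first function nobody tells the machine which orders are "small" or "large"; in the second, nobody tells it which action is better. It only sees the rewards that follow its own choices.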

Widely implemented Machine learning use-cases

Analytics across the Value-Complexity matrix

Data analytics is classified into four categories on the basis of the value it delivers to an organization and the complexity involved in delivering that value.

  1. Descriptive Analytics: Low Value and Low Complexity. Descriptive analytics relies on past data to generate reports at a department or organization level that enable managers and teams to visualize how their businesses performed. This category of analytics helps to answer the question, “What happened?” E.g. MIS reports.
  2. Diagnostic Analytics: Low to Medium Value and Low to Medium Complexity. Diagnostic analytics helps to answer the question, “Why did it happen?” E.g. hypothesis building and problem-solving.
  3. Predictive Analytics: Medium to High Value and Medium to High Complexity. Predictive analytics helps to answer the question, “What will happen next, at a granular level?” E.g. predictive modeling.
  4. Prescriptive Analytics: High Value and High Complexity. Prescriptive analytics helps to answer the question, “What decision should be taken to perform better?” E.g. IoT in manufacturing, autonomous vehicles, etc.
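As a rough illustration of how the questions differ, the sketch below runs descriptive, predictive and prescriptive steps over the same tiny series. The sales figures, the capacity threshold and the straight-line trend are all assumptions chosen for simplicity; diagnostic analytics is left out because it is mostly hypothesis building rather than a single calculation.

```python
# Monthly sales for the past six months (made-up numbers).
monthly_sales = [100, 108, 121, 130, 139, 152]

# Descriptive: "What happened?" Summarize the past.
total = sum(monthly_sales)
average = total / len(monthly_sales)


# Predictive: "What will happen next?" Fit a straight-line trend by
# ordinary least squares and extend it one month ahead.
def fit_line(ys):
    n = len(ys)
    xs = range(n)  # months are numbered 0, 1, 2, ...
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(monthly_sales)
forecast = intercept + slope * len(monthly_sales)  # next month's estimate

# Prescriptive: "What decision should be taken?" Turn the forecast into
# an action. CAPACITY is a hypothetical business limit.
CAPACITY = 155
decision = "add capacity" if forecast > CAPACITY else "keep current capacity"

print(f"average so far: {average:.1f}")
print(f"forecast for next month: {forecast:.1f}")
print(f"suggested decision: {decision}")
```

Each step builds on the previous one, which is one way to read the rising value and complexity across the categories.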

The five keys to derive value from analytics, ML and AI

To sum up, here are the key points to remember when starting a data science project.

  1. Know and define your problem: A clear, concise, and measurable problem is the fundamental requirement for getting started with any project, data science or otherwise. The problem statement should be very specific, such as “Which agents will perform well if on-boarded?” or “Which customers will default on payments?”. It is equally important to define the metrics that will determine the success of your project. A clear definition of the problem and its metrics is half the battle won.
  2. Be open to alternative approaches: The what and why of a problem are more important than the how. Therefore, never approach a problem with the intent of using AI or ML to find answers. It is imperative to understand that other alternatives may solve your problem better than an AI/ML solution would. Build hypotheses using the classic MECE approach (mutually exclusive and collectively exhaustive) and reject/accept them backed by data and intelligent judgment.
  3. Be prepared to accept probabilistic rather than deterministic outcomes: Data Science is the convergence of mathematics, statistics, computer programming, and deep domain knowledge. Business managers should understand and accept the statistical probability of outcomes ahead of implementation, so that they do not later feel misled about its capabilities. For example, a sales agent’s performance at his next assignment can be predicted only with a certain probability, and should be accepted as such. The agent may or may not perform as the model predicts, owing to conditions that were not accounted for, such as a black swan event like the recent COVID-19 outbreak.
  4. Build a culture of iteration and continuous feedback: An iterative approach to developing data science projects ensures that you reach out to domain experts more often than a waterfall approach would. Build the first model, then continually perform error analysis with the help of domain experts to improve it until it meets the business requirements. This helps incorporate dynamic, changing business needs and paves the way to more agile teams.
  5. Harness the power of storytelling: All projects are implemented with an audience in mind. The audience in a data science project could be business leaders, end-users, contact center agents, the onboarding team, and so on. The solution to the problem should be presented such that it is easily accessible, clearly explainable and provides actionable insights to the audience. Projects that do not present clearly identifiable solutions fail, even when they build excellent models that lead to accurate outcomes.
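Point 3 above, on probabilistic outcomes, is easier to see with a small simulation. The sketch below assumes a hypothetical model that gives every agent an 80% chance of performing well; the probability, the number of agents and the function name are all invented for illustration.

```python
import random


def simulate_outcomes(predicted_probability, n_agents, seed=42):
    """Draw one True/False outcome per agent from the predicted probability."""
    rng = random.Random(seed)
    return [rng.random() < predicted_probability for _ in range(n_agents)]


# A hypothetical model says each agent has an 80% chance of performing well.
outcomes = simulate_outcomes(0.8, 1000)
share_performing = sum(outcomes) / len(outcomes)

print(f"share of agents who performed: {share_performing:.2f}")
print(f"agents who did not perform: {outcomes.count(False)}")
```

Across many agents the share that performs stays close to the predicted probability, yet some individual agents still fall short. That gap is the sense in which the prediction is probabilistic and not a guarantee.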
