# Bayes Theorem, Old but Gold!

## It is more than two centuries old, yet it powers some of the most used Machine Learning algorithms

Bayes' Theorem allows anyone, in a deceptively simple manner, to calculate a conditional probability where intuition often fails. You might have bumped into this theorem in Machine Learning when dealing with Maximum a Posteriori (MAP), a probability framework for fitting a model to a training dataset, or in classification predictive modeling problems such as the Bayes Optimal Classifier and Naive Bayes.
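To see how deceptively simple the theorem is, here is a minimal sketch in Python. The disease-testing numbers are hypothetical, chosen only to show how intuition fails with conditional probabilities:

```python
# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B).

def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) given P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical scenario: a disease affects 1% of people, the test
# detects it 99% of the time, but also gives 5% false positives.
p_disease = 0.01
p_pos_given_disease = 0.99
# Total probability of a positive test (law of total probability):
p_pos = p_pos_given_disease * p_disease + 0.05 * (1 - p_disease)

p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos)
print(round(p_disease_given_pos, 3))  # ~0.167
```

Even with a 99% accurate test, a positive result only implies about a 17% chance of actually having the disease, which is exactly the kind of counter-intuitive answer the theorem makes easy to compute.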

Reverend Thomas Bayes was a wealthy Presbyterian minister and amateur mathematician who lived in London in the eighteenth century. Without realizing it, the reverend created a completely new religion that influenced a great number of fields of study over…

# Jupyter Notebook for dummies, I mean, beginner Data Scientists!

## Agenda

• Advantages of using Jupyter notebook;
• How to get started with Jupyter notebook and its interface;
• Jupyter Kernels.

For a junior data scientist, notebooks are powerful tools to write and execute code and to visualize the results. You can also use them to take notes to remember your logical steps. All in one place! And you can track your evolution as your portfolio grows.

In this article, Jupyter Notebook takes the spotlight, as it is one of the most popular open-source IDEs for creating and sharing live code.

• It's a free, open-source, and interactive web tool;
• It keeps everything in a single document;

# Who’s the Data Engineer?

In recent decades, anyone who could retrieve value from data was considered a Data Analyst, and whoever was able to build backend platforms to support such data analysis was considered an ETL developer. With the introduction of Big Data and the new technologies that arose with it, we also saw the evolution and readaptation of these roles.

Let’s go through them…

Data analysis has always been present in business. Business Intelligence (BI) is a term that has been evolving for over 150 years. For example, it is assumed that the first usage of the term BI was by Richard Miller…

# Introduction to Statistics for Data Science

## Advanced Level — The Fundamentals of Inferential Statistics with Point Estimators and Confidence Intervals Estimates

In Statistics, to infer the value of an unknown parameter we use estimators. Estimation is the process used to make inferences, from a sample, about an unknown population parameter.

Based on a random sample of a population, a point estimate is the best single estimate of the parameter, although it is not absolutely accurate. Furthermore, if you continuously retrieve random samples from the same population, it is expected that the point estimate will vary from sample to sample.
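This sample-to-sample variation is easy to see numerically. The sketch below uses a simulated, hypothetical population (normally distributed heights with a true mean of 170) and computes the sample mean, the point estimate of the population mean, for several random samples:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical population: 100,000 values with true mean 170.
population = rng.normal(loc=170, scale=10, size=100_000)

# Draw several random samples and compute each sample's mean,
# i.e. the point estimate of the population mean.
point_estimates = [rng.choice(population, size=50).mean() for _ in range(5)]
print([round(m, 1) for m in point_estimates])  # each estimate differs slightly
```

No single estimate equals the true mean exactly, but all of them land close to it, which is precisely the behavior a confidence interval is designed to quantify.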

On the other hand, a confidence interval is an estimate constructed on the assumption that the true parameter will fall within a specified proportion regardless of the…

# Introduction to Statistics for Data Science

Although the Central Limit Theorem could be covered as part of the “Advanced Level — The Fundamentals of Inferential Statistics with Probability Distributions” post, it is my belief this theorem deserves a post of its own!

The first step of every statistical analysis you will perform is to determine whether the dataset you are dealing with is a population or a sample. As you might recall, a population is a collection of all items of interest in your study whereas a sample is a subset of data points from that population. Let’s take a short refresher!

• Population: it’s a number of…
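Before the refresher continues, here is a quick numeric sketch of the theorem itself, using a simulated, clearly non-normal (uniform) population: the means of repeated samples still cluster around the population mean with a roughly normal shape.

```python
import numpy as np

rng = np.random.default_rng(0)
# Non-normal population: uniform on [0, 1), true mean 0.5.
population = rng.uniform(0, 1, size=100_000)

# Draw many samples of size 30 and record each sample's mean.
sample_means = np.array([
    rng.choice(population, size=30).mean() for _ in range(2_000)
])

print(round(sample_means.mean(), 3))  # close to the population mean, 0.5
print(round(sample_means.std(), 3))   # close to sigma / sqrt(n)
```

Plotting a histogram of `sample_means` would show the familiar bell shape, even though the underlying population is flat, which is the heart of the Central Limit Theorem.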

# Introduction to Statistics for Data Science

## Advanced Level — The Fundamentals of Inferential Statistics with Probability Distributions

We've covered the basics of Descriptive Statistics in the first two posts of this series. It is time to move on to Inferential Statistics: methods that rely on probability theory and distributions to help us predict, in particular, the population’s values based on sample data.

We’ve seen that descriptive statistics provide information about our sample data through a concise summary. For example, we were able to calculate the mean and standard deviation of players’ heights in the English Premier League. Such information can provide valuable knowledge about a group of players.
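As a reminder of that kind of summary, here is a minimal sketch using a small hypothetical list of player heights in centimetres (made-up values, not real Premier League data):

```python
import statistics

# Hypothetical player heights in centimetres.
heights_cm = [175, 180, 182, 178, 185, 190, 176, 181]

mean_height = statistics.mean(heights_cm)
std_height = statistics.stdev(heights_cm)  # sample standard deviation

print(f"mean = {mean_height:.1f} cm, std = {std_height:.1f} cm")
```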

On…

# Introduction to Statistics for Data Science

## Intermediate Level — The Fundamentals of Descriptive Statistics

In the last post I went through some introductory but important statistical concepts to get you started in Data Science. We analysed the terms population and sample, the types of data you might work with, and the different types of measures you might apply to your data, such as measures of central tendency (mean, median, mode), measures of variability (variance, standard deviation) and measures of asymmetry (skewness and modality).
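The measures listed above can be computed in a few lines. This sketch uses a small hypothetical sample, the standard library for central tendency and variability, and scipy for skewness:

```python
import statistics
from scipy import stats

# Small hypothetical sample with a slight right tail.
data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 10]

central = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
}
variability = {
    "variance": statistics.variance(data),  # sample variance
    "std": statistics.stdev(data),
}
asymmetry = {"skewness": stats.skew(data)}  # > 0 indicates a right tail

print(central, variability, asymmetry)
```

Here the single large value (10) pulls the mean toward the tail and produces a positive skewness, a small example of how these measures complement each other.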

This is a more intermediate-level post where we will see some statistical concepts that are less well known but still important in your initial exploratory data analysis. …

# Introduction to Statistics for Data Science

## Basic Level — The Fundamentals of Descriptive Statistics

Statistics is a big part of a Data Scientist’s daily life. Each time you start an analysis, your first step, before applying fancy algorithms and making predictions, is to do some exploratory data analysis (EDA) and try to read and understand the data by applying statistical techniques. With this first data analysis, you are able to understand what type of distribution the data presents.
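That first EDA pass often starts with a one-line summary per column. A minimal sketch with pandas, on a small hypothetical dataset (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical toy dataset for a first exploratory pass.
df = pd.DataFrame({
    "pieces": [120, 250, 80, 560, 300, 150, 90, 410],
    "price": [9.99, 19.99, 7.49, 49.99, 24.99, 12.99, 8.49, 39.99],
})

# describe() reports count, mean, std, min, quartiles and max per
# column: a quick first read on how each variable is distributed.
print(df.describe())
print(df["pieces"].skew())  # rough indication of asymmetry
```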

At the end of this brief introduction, we will use a Lego dataset to make sense of these concepts.

Descriptive statistics is the analysis of data which helps to describe, show or…

# The lesser known Machine Learning system’s criteria

These are the lesser-known ways to classify a Machine Learning algorithm! In previous posts I’ve mentioned how to state your problem and some ways to classify an ML system; however, I have yet to mention classifying a system by whether or not it can learn incrementally from a stream of incoming data.

## Batch Learning

In Batch Learning (BL) a system cannot learn incrementally: it must be fed all the training data in order to generate the best model. This is usually very time-consuming as well as demanding on computing resources (CPU, memory, disk space, etc.)…
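The contrast with incremental learning can be sketched with scikit-learn's `SGDClassifier`, chosen here purely for illustration: `fit()` retrains on the full dataset at once (batch learning), while `partial_fit()` lets the model learn from chunks of a data stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # simple synthetic labels

# Batch learning: the whole training set is required up front.
batch_model = SGDClassifier(random_state=0).fit(X, y)

# Incremental learning: feed the data in chunks, as a stream would.
online_model = SGDClassifier(random_state=0)
for chunk in np.array_split(np.arange(len(X)), 10):
    online_model.partial_fit(X[chunk], y[chunk], classes=np.array([0, 1]))

print(batch_model.score(X, y), online_model.score(X, y))
```

Both models end up with similar accuracy on this toy data, but only the second one could keep learning as new data arrives without retraining from scratch.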