Efficiently exploring the parameter space through Bayesian Optimization with skopt in Python. TL;DR: my hyperparameters are always better than yours.

Explore vast canyons of the problem space efficiently — Photo by Fineas Anton on Unsplash

In this post, we will build a machine learning pipeline using multiple optimizers and use the power of Bayesian Optimization to arrive at the optimal configuration for all our parameters. All we need is the sklearn Pipeline and skopt.
You can use your favorite ML models, as long as they have an sklearn wrapper (looking at you, XGBoost and NGBoost).

About Hyperparameters

The critical point in finding the best model to solve a problem is not just the model itself. We also need to find the optimal parameters that make the model work best on the given dataset. This is called finding or searching hyperparameters. …
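Below is a minimal sketch of what such a search can look like, assuming skopt's BayesSearchCV together with an sklearn Pipeline; the dataset, model choice and parameter ranges are purely illustrative, not the article's exact setup.

```python
# A hedged sketch: Bayesian Optimization over Pipeline hyperparameters with skopt.
# Dataset, model and search ranges are illustrative choices.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from skopt import BayesSearchCV
from skopt.space import Integer, Real

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", GradientBoostingClassifier()),
])

# Parameter names are prefixed with the pipeline step they belong to.
search = BayesSearchCV(
    pipe,
    {
        "model__n_estimators": Integer(50, 500),
        "model__learning_rate": Real(1e-3, 1e0, prior="log-uniform"),
        "model__max_depth": Integer(2, 8),
    },
    n_iter=32,   # number of Bayesian Optimization iterations
    cv=3,
    n_jobs=-1,
)

search.fit(X, y)
print(search.best_params_, search.best_score_)
```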


A simple, yet meaningful probabilistic Pyro model to uncover change-points over time.

Solitude. A photo by Sasha Freemind on Unsplash

One profound claim and observation made by the media is that the rate of suicides among younger people in the UK has risen from the 1980s to the 2000s. You might find it on the news, in publications, or simply as a truth accepted by the population. But how can you make this measurable?

Making an assumption tangible

In order to make this claim testable, we look for data and find an overview of suicide rates, specifically for England and Wales, at the Office for National Statistics (UK), together with an overall visualization.

https://www.ons.gov.uk/visualisations/dvc661/suicides/index.html

Generally, one essential type of question to ask — in everyday life, business or academia — is when changes have occurred. You are under the assumption that something has fundamentally changed over time. In order to prove that, you have to quantify it. So you get some data on the subject matter and build a model that shows the points at which values changed, as well as the magnitude of those changes. We are going to look at exactly how to do that. …
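As a taste of what such a model can look like, here is a hedged Pyro sketch of a change-point over time. The series below is synthetic and the soft-switch formulation is one of several possible choices, not necessarily the exact model from the article.

```python
# A hedged Pyro sketch of a change-point over time: a rate switches from an early
# to a late level at an unknown year tau. The series below is synthetic, not the ONS data.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

years = torch.arange(1981., 2006.)
rates = torch.cat([torch.normal(6.0, 0.3, (12,)),    # synthetic "before" level
                   torch.normal(8.5, 0.3, (13,))])   # synthetic "after" level

def model(years, rates):
    tau = pyro.sample("tau", dist.Uniform(years.min(), years.max()))
    early = pyro.sample("early_rate", dist.Normal(5., 5.))
    late = pyro.sample("late_rate", dist.Normal(5., 5.))
    sigma = pyro.sample("sigma", dist.HalfNormal(1.))
    # A soft switch around tau keeps the model differentiable for NUTS.
    weight = torch.sigmoid(years - tau)
    mean = (1 - weight) * early + weight * late
    with pyro.plate("data", len(years)):
        pyro.sample("obs", dist.Normal(mean, sigma), obs=rates)

mcmc = MCMC(NUTS(model), num_samples=500, warmup_steps=500)
mcmc.run(years, rates)
mcmc.summary()  # inspect the posterior over tau and the two rate levels
```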


Make the best of missing data the Bayesian way. Improve model performance and make comparative benchmarks using Monte Carlo methods.

A missing frame ready to throw you off your model. Photo by Vilmos Heim on Unsplash.

The Ugly Data

How do you handle missing data, gaps in your data-frames or noisy parameters?
You have spent hours at work, in the lab or in the wild generating or curating a dataset for an interesting research question or hypothesis. Terribly enough, you then find that some of the measurements for a parameter are missing!
Another case that might throw you off is unexpected noise that was introduced at some point in the experiment and has doomed some of your measurements to be extreme outliers. …
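One way to handle this "the Bayesian way" is to treat the missing entries as latent variables and infer them alongside the model parameters. The following is a minimal Pyro sketch under that assumption, with made-up measurements; it is not the exact model discussed in the article.

```python
# A hedged Pyro sketch of Bayesian imputation: missing measurements become latent
# variables inferred together with the model parameters. Values below are made up.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

y = torch.tensor([2.1, 1.9, float("nan"), 2.4, float("nan"), 2.0])
missing = torch.isnan(y)

def model(y, missing):
    mu = pyro.sample("mu", dist.Normal(0., 10.))
    sigma = pyro.sample("sigma", dist.HalfNormal(5.))
    with pyro.plate("missing", int(missing.sum())):
        # Each gap is a latent draw from the same likelihood; its posterior is the imputation.
        pyro.sample("y_missing", dist.Normal(mu, sigma))
    with pyro.plate("observed", int((~missing).sum())):
        pyro.sample("y_observed", dist.Normal(mu, sigma), obs=y[~missing])

mcmc = MCMC(NUTS(model), num_samples=500, warmup_steps=500)
mcmc.run(y, missing)
print(mcmc.get_samples()["y_missing"].mean(dim=0))  # posterior means for the gaps
```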


Build better Data Science workflows with probabilistic programming languages and counter the shortcomings of classical ML.

The tools to build, train and tune your probabilistic models. Photo by Patryk Grądys on Unsplash.

We should always aim to create better Data Science workflows.
But in order to achieve that we should find out what is lacking.

Classical ML workflows are missing something

Classical Machine Learning pipelines work great. The usual workflow looks like this:

  1. Have a use-case or research question with a potential hypothesis,
  2. build and curate a dataset that relates to the use-case or research question,
  3. build a model,
  4. train and validate the model,
  5. maybe even cross-validate, while grid-searching hyper-parameters,
  6. test the fitted model,
  7. deploy the model for the use-case,
  8. answer the research question or hypothesis you posed.

As you might have noticed, one severe shortcoming of this workflow is that it does not account for the model's uncertainty or the confidence in its output. …


Modeling U.S. cancer-death rates with two Bayesian approaches: MCMC in STAN and SVI in Pyro.

Modeling death-rates across U.S. counties — Photo by Joey Csunyo on Unsplash

Single-parameter models are an excellent way to get started with probabilistic modeling. These models comprise one parameter that influences our observations and that we can infer from the given data. In this article we look at their performance and compare two well-established frameworks — the statistical language STAN and the Pyro Probabilistic Programming Language (PPL).

Kidney Cancer Data

One old and established dataset covers the cases of kidney cancer in the U.S. from 1980–1989, which is available here (see [1]). Given are U.S. counties, their total population and the number of reported cancer deaths.
Our task is to infer the rate of death from the given data in a Bayesian way.
An elaborate walk-through of the task can be found in section 2.8 of “Bayesian Data Analysis 3” [1].
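To give a flavor of such a single-parameter model, here is a hedged Pyro sketch for one county: a Gamma prior over the per-person death rate and a Poisson likelihood for the counts. The prior values echo the BDA3 discussion, but the population and death counts below are purely illustrative.

```python
# A hedged Pyro sketch of the single-parameter model for one county: a Gamma prior
# over the per-person death rate theta and a Poisson likelihood for observed deaths.
# The prior echoes the BDA3 discussion; the counts are illustrative, not the real data.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

population = torch.tensor(45000.)   # hypothetical county population
deaths = torch.tensor(4.)           # hypothetical observed kidney-cancer deaths

def model(population, deaths):
    theta = pyro.sample("theta", dist.Gamma(20., 430000.))
    pyro.sample("obs", dist.Poisson(theta * population), obs=deaths)

mcmc = MCMC(NUTS(model), num_samples=1000, warmup_steps=500)
mcmc.run(population, deaths)
theta_samples = mcmc.get_samples()["theta"]
print(theta_samples.mean(), theta_samples.std())  # posterior death rate per person
```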


The way from data novice to professional

Clock with reverse numeral, Jewish Town-hall Clock, Prague - by Richard Michael (all rights reserved).

There exists the idea that practicing something for over 10,000 hours (ten thousand hours) lets you acquire proficiency in a subject. The concept is based on the book Outliers by M. Gladwell. The mentioned 10k hours are the time you spend practicing or studying a subject until you have a firm grasp of it and can be called proficient. Though this number of hours is somewhat arbitrary, we will take a look at how they can be spent to gain proficiency in the field of Data Science.

Imagine this as a learning budget in your Data-apprenticeship journey. If I were to start from scratch, this is how I would spend those 10 thousand hours to become a proficient Data Scientist. …


One reason why Bayesian Modeling works with real-world data. The approximate lighthouse in the sea of randomness.

Photo by William Bout on Unsplash

When you want to gain more insights into your data, you rely on programming frameworks that allow you to interact with probabilities. All you have in the beginning is a collection of data points. It is just a glimpse into the underlying distribution from which your data comes. However, you do not want simple data points in the end. What you want are elaborate, talkative density distributions with which you can perform tests. For this, you use probabilistic frameworks like TensorFlow Probability, Pyro or STAN to compute posterior probabilities.
As we will see, this computation is not always feasible, and we rely on Markov Chain Monte Carlo (MCMC) methods or Stochastic Variational Inference (SVI) to solve those problems. Especially for large datasets, or even everyday medium-sized ones, we have to perform the magic of sampling and inference to compute values and fit models. If these methods did not exist, we would be stuck with a neat model that we had thought up, but with no way to know whether it actually makes sense. …
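As a small illustration of the SVI side, here is a minimal Pyro sketch that approximates the posterior over the mean and spread of some synthetic observations; the model and data are illustrative only.

```python
# A minimal Pyro SVI sketch: approximate the posterior over the mean and spread of
# synthetic observations when the exact posterior is out of reach. Illustrative only.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

data = torch.randn(500) * 0.5 + 3.0   # a glimpse into some underlying distribution

def model(data):
    mu = pyro.sample("mu", dist.Normal(0., 10.))
    sigma = pyro.sample("sigma", dist.HalfNormal(5.))
    with pyro.plate("data", len(data)):
        pyro.sample("obs", dist.Normal(mu, sigma), obs=data)

pyro.clear_param_store()
guide = AutoNormal(model)   # variational approximation of the posterior
svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())

for step in range(2000):
    svi.step(data)

print(guide.median(data))   # approximate posterior medians for mu and sigma
```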


The answer to: “Why is my model running forever?” or the classic: “I think it might have converged?”

A lot of the models that the average Data Scientist or ML engineer uses daily rely on numerical optimization methods. Studying how optimization behaves on different functions helps to gain a better understanding of how the process works.
The challenge we face on a daily basis is that someone gives us a model of how they think the world or their problem works. Now you, as a Data Scientist, have to find the optimal solution to that problem. For example, you look at an energy function and want to find the absolute, global minimum so that your tool works or your protein is stable. Maybe you have modeled user data and you want to find the ideal customer given all your input features — hopefully, continuous ones. …
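To make the distinction between a local and a global minimum concrete, here is a small sketch using scipy.optimize on a hypothetical one-dimensional "energy" function; the function itself is made up for illustration.

```python
# A small illustration with scipy.optimize: a bumpy, hypothetical "energy" function
# where a local optimizer can get stuck, while a global method searches more broadly.
import numpy as np
from scipy.optimize import minimize, differential_evolution

def energy(x):
    # Made-up one-dimensional energy landscape with several local minima.
    return np.sin(3 * x[0]) + 0.1 * (x[0] - 2.0) ** 2

local = minimize(energy, x0=[0.0], method="L-BFGS-B")
global_ = differential_evolution(energy, bounds=[(-10, 10)])

print("local  minimum: x =", local.x, "E =", local.fun)
print("global minimum: x =", global_.x, "E =", global_.fun)
```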


Connect the dots over time and forecast with confidence(-intervals).

Connecting dots across time — photo by israel palacio on Unsplash

Have you ever wondered how to account for uncertainties in time-series forecasts?
Have you ever thought there should be a way to generate data points from previously seen data and make judgement calls about certainty? I know I have.
If you want to build models that capture probabilities and carry confidences, we recommend using a probabilistic programming framework like Pyro.
In a previous article we looked at NGBoost and applied it to the M5 forecasting challenge on Kaggle. As a quick recap: the M5 forecasting challenge asks us to predict how the sales of Walmart items will develop over time. It provides around 4–5 years of data for items of different categories from different stores across different states, and asks us to forecast 28 days for which we have no information. For an overview of the challenge and dataset we still recommend this amazing notebook. …
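As a rough sketch of forecasting with credible intervals in Pyro, the following fits a simple linear-trend model to synthetic daily sales and then predicts 28 future days; it illustrates the idea only and is not the model used on the actual M5 data.

```python
# A hedged Pyro sketch of probabilistic forecasting: fit a simple linear-trend model
# to synthetic daily sales, then predict 28 future days with a credible interval.
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS, Predictive

t_train = torch.arange(300.)
sales = 20. + 0.05 * t_train + torch.randn(300) * 2.   # synthetic sales series

def model(t, y=None):
    intercept = pyro.sample("intercept", dist.Normal(0., 50.))
    slope = pyro.sample("slope", dist.Normal(0., 1.))
    sigma = pyro.sample("sigma", dist.HalfNormal(10.))
    mean = intercept + slope * t
    with pyro.plate("time", len(t)):
        return pyro.sample("obs", dist.Normal(mean, sigma), obs=y)

mcmc = MCMC(NUTS(model), num_samples=500, warmup_steps=500)
mcmc.run(t_train, sales)

# Forecast the next 28 days and read off a 90% credible interval per day.
t_future = torch.arange(300., 328.)
forecast = Predictive(model, posterior_samples=mcmc.get_samples())(t_future)["obs"]
lower, upper = forecast.quantile(0.05, dim=0), forecast.quantile(0.95, dim=0)
print(lower[:5], upper[:5])
```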


or how to predict something you don’t know with confidence-intervals.

Currently there is a prominent forecasting challenge happening on Kaggle, the M5 Forecasting Challenge. Given are the sales of product items over a long time range (1913 days, or around 5 years), calendar information and price info. An excellent in-depth analysis by M. Henze can be found here. The goal of the challenge is to make informed predictions on how the sales of the individual items will continue over the upcoming 28 days.
One can approach this challenge with different methods: LSTMs, MCMC methods and others come to mind. …
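For the NGBoost route, a hedged sketch could look like the following: the model predicts a full Normal distribution per test point, from which we derive an approximate 95% interval. The data is synthetic and the params attribute access reflects my reading of the NGBoost API, not the article's exact code.

```python
# A hedged NGBoost sketch: predict a full Normal distribution per test point and
# derive an approximate 95% interval instead of a single point forecast.
# Data is synthetic; the .params access is an assumption about the NGBoost API.
import numpy as np
from ngboost import NGBRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ngb = NGBRegressor(n_estimators=300).fit(X_train, y_train)

y_dist = ngb.pred_dist(X_test)        # one predicted distribution per test point
mu = y_dist.params["loc"]
sigma = y_dist.params["scale"]
lower, upper = mu - 1.96 * sigma, mu + 1.96 * sigma   # roughly a 95% interval
print(lower[:3], upper[:3])
```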

About

Richard Michael

I am a Data Scientist and M.Sc. student in Bioinformatics at the University of Copenhagen. You can find more content on my weekly blog http://laplaceml.com/blog
