Photo by Joshua Earle on Unsplash

Learn how to use the non-parametric approach to estimating the cumulative hazard function!

In the previous article, I described the Kaplan-Meier estimator. To give a quick recap, it is a non-parametric method of approximating the true survival function. This time, I will focus on another approach to visualizing a survival dataset: the hazard function and the Nelson-Aalen estimator. Once again, we will use the convenience of the lifelines library to quickly create the plots in Python.

1. The Nelson-Aalen estimator

With the Kaplan-Meier curves, we approximated the survival function, stating the probability of the event of interest (for example, the death event) not occurring by a certain time t.

An alternative approach to visualizing the aggregate information from a survival-focused dataset entails using the hazard function, which can be interpreted as the probability of the subject experiencing the event of interest within a small interval of time, assuming that the subject has survived up until the beginning of the said interval. For a more detailed description of the hazard function, please see this article. …
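For readers who want to jump ahead, a minimal sketch of the estimator in lifelines could look as follows (using the Waltons dataset bundled with the library purely as an illustration):

from lifelines import NelsonAalenFitter
from lifelines.datasets import load_waltons

# Example dataset bundled with lifelines:
# T holds the durations, E the event indicators (1 = event observed).
df = load_waltons()

# Fit the Nelson-Aalen estimator and plot the cumulative hazard function.
naf = NelsonAalenFitter()
naf.fit(durations=df["T"], event_observed=df["E"])
naf.plot_cumulative_hazard()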


Source: Unsplash

Facilitate access to survival analysis for your entire company!

In a previous article, I showed how we can create the Kaplan-Meier curves using Python. As much as I love Python and writing code, there are alternative approaches, each with its own unique set of benefits. Enter Tableau!


Tableau is a business intelligence tool used for creating elegant and interactive visualizations on top of data coming from a vast number of sources (you would be surprised how many distinct ones there are!). To make the definition even shorter, Tableau is used for building dashboards.

So why would a data scientist be interested in using Tableau instead of Python? When creating a Notebook/report with the results of a survival analysis exercise in Python, the reader will always be limited…


Photo by Tobias Tullius on Unsplash

Learn one of the most popular techniques used for survival analysis and how to implement it in Python!

In my previous article, I described the potential use-cases of survival analysis and introduced all the building blocks required to understand the techniques used for analyzing the time-to-event data.

I continue the series by explaining perhaps the simplest, yet very insightful, approach to survival analysis: the Kaplan-Meier estimator. After a theoretical introduction, I will show you how to carry out the analysis in Python using the popular lifelines library.

1. The Kaplan-Meier Estimator

The Kaplan-Meier estimator (also known as the product-limit estimator; you will see why later on) is a non-parametric technique for estimating and plotting the survival probability as a function of time. It is often the first step in carrying out the survival analysis, as it is the simplest approach and requires the fewest assumptions. …
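As a teaser of what follows, a minimal sketch with lifelines (again using the library’s bundled Waltons dataset for illustration) could look like this:

from lifelines import KaplanMeierFitter
from lifelines.datasets import load_waltons

# T: durations, E: event indicators (1 = event observed, 0 = censored).
df = load_waltons()

# Fit the Kaplan-Meier estimator and plot the estimated survival function.
kmf = KaplanMeierFitter()
kmf.fit(durations=df["T"], event_observed=df["E"])
kmf.plot_survival_function()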


Source: pixabay

Understand the basic concepts of survival analysis and what tasks it can be used for!

In our extremely competitive times, all businesses face the problem of customer churn/retention. To quickly give some context, churn happens when the customer stops using the services of a company (stops purchasing, cancels the subscription, etc.). Retention refers to keeping the clients of a business active (the definition of active highly depends on the business model).

Intuitively, companies want to increase retention by preventing churn. This way, their relationship with the customers is longer and thus potentially more profitable. What is more, in most cases the company’s cost of retaining a customer is much lower than that of acquiring a new customer, for example, via performance marketing. …


Photo by Ivo Rainha on Unsplash

These websites will help you keep up to date with the latest trends in data science

I think you will not argue with me when I state that data science is becoming one of the most popular fields to work in, especially given that Harvard Business Review named “data scientist” the sexiest job of the 21st century. The field has come a long way from the times when terms like data science and machine learning were still unknown and everything was gathered under the umbrella of statistics. However, we are far from the end of the journey.

That can also be a divisive aspect of data science: the field is developing so rapidly that it can be difficult to keep up with all the new algorithms, techniques, and approaches. So working in data science, similarly to software engineering, often requires constant learning and development. Don’t get me wrong, some people (myself included) like that a lot. Others prefer to learn for a few years and then simply live off that knowledge. …


Photo by nrd on Unsplash

Learn how the criterion works and how to implement it in Python from scratch!

There is a lot of content (books, blog posts, tutorials, etc.) out there about clustering in general and various ways of finding the optimal number of clusters in algorithms such as k-means clustering: the gap statistic, silhouette score, the infamous elbow (scree) plot, and many more.
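To make one of those classic approaches concrete, here is a minimal sketch (not taken from the article) of using the silhouette score with scikit-learn to pick the number of clusters on synthetic data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four well-separated clusters.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Fit k-means for a range of k and store the silhouette score of each fit.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest silhouette score is the suggested number of clusters.
best_k = max(scores, key=scores.get)
print(f"Best k: {best_k} (silhouette score: {scores[best_k]:.3f})")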

Thanks to COVID-19 and the newly found chunks of extra spare time, I was finally able to go back to my backlog of books and finish Andriy Burkov’s excellent The Hundred-Page Machine Learning Book. The book provides a very good overview of various aspects of machine learning and encourages the readers to dive deeper into the topics of their interest. …


Photo by UX Indonesia on Unsplash

Learn the basics of process mining and how to use process discovery algorithms in Python

Nowadays, the majority of the apps/websites we use on a daily basis are greatly interested in how we use them. That is because they want to learn from the users’ behavior and improve in order to attract or retain more users.

Imagine your favorite e-commerce app. Most likely, you are free to browse the products that the company is offering, and every now and then you are gently nudged to create an account in order to actually purchase something. …
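The excerpt does not name the library used in the article; assuming the pm4py package and its simplified interface, a process discovery sketch could look roughly like this (the log file name is hypothetical):

import pm4py

# Load a hypothetical event log stored in the XES format.
log = pm4py.read_xes("example_log.xes")

# Discover a Petri net with the classic Alpha Miner algorithm.
net, initial_marking, final_marking = pm4py.discover_petri_net_alpha(log)

# Visualize the discovered process model.
pm4py.view_petri_net(net, initial_marking, final_marking)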


Photo by Kaboompics.com from Pexels

Learn how to correctly calculate and interpret the effect size for your A/B tests!

As a data scientist, you will most likely come across the effect size while working on some kind of A/B testing. A possible scenario is that the company wants to make a change to the product (be it a website, mobile app, etc.) and your task is to make sure that the change will — to some degree of certainty — result in better performance in terms of the specified KPI.

This is when hypothesis testing comes into play. However, a statistical test can only inform about the likelihood that an effect exists. By effect, I simply mean a difference — it can just be a difference in either direction, but it can also be a more precise variant of a hypothesis stating that one sample is actually better/worse than the other one (in terms of the given metric). …
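To make the notion of effect size concrete: one of the most popular measures for comparing two sample means is Cohen’s d, the difference between the means divided by the pooled standard deviation. A minimal sketch (an illustration, not the article’s exact code):

import numpy as np

def cohens_d(sample_1, sample_2):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n1, n2 = len(sample_1), len(sample_2)
    var1 = np.var(sample_1, ddof=1)
    var2 = np.var(sample_2, ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (np.mean(sample_1) - np.mean(sample_2)) / pooled_std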


Photo by Ian Parker on Unsplash

A quick tour of the library and how it stands out from the old guard

Python has a few very well-developed and mature libraries for statistical analysis, the biggest two being statsmodels and scipy. These two contain a lot (and I mean a LOT) of statistical functions and classes that will cover all your use cases 99% of the time. So why are there still new libraries being released?

The newcomers often try to fill in a niche or to provide something extra that the established competition does not have. Recently, I stumbled upon a relatively new library called pingouin. Some key features of the library include:

  • The library is written in Python 3 and is based mostly on pandas and numpy. Operating directly on DataFrames is something that can definitely come in handy and simplify the workflow. …
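As an illustration of that DataFrame-centric workflow (a minimal sketch on synthetic data, not the article’s code), a t-test in pingouin is a one-liner:

import numpy as np
import pingouin as pg

# Two synthetic samples with slightly different means.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=0.3, scale=1.0, size=100)

# pingouin returns the whole test summary as a pandas DataFrame,
# including the T statistic, p-value, effect size, and Bayes factor.
results = pg.ttest(group_a, group_b)
print(results)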


Image by arielrobin from Pixabay

In a world full of metrics, we need to clearly recognize those that do more harm than good

Since the 1970s and 1980s, we have observed the meteoric rise of the metric as a performance measure in business management. Peter Drucker, the man who is said to have invented modern business management, is often quoted as saying that “you can’t manage what you can’t measure”. Since then, the idea embedded in the quote has been applied not only in the business environment but in basically all industries, including education, healthcare, the military, etc.

However, metrics are not a silver bullet for measuring the efficiency of different aspects of a business (and beyond). …


Photo by Gabriel Crismariu on Unsplash

Learn how to create custom imputers, including groupby aggregation for more advanced use-cases

Working with missing data is an inherent part of most machine learning projects. A typical approach would be to use scikit-learn’s SimpleImputer (or another imputer from the sklearn.impute module). However, the simplest approach is often not the best one, and we could gain some extra performance by using a more sophisticated approach.

That is why in this article I wanted to demonstrate how to code a custom scikit-learn based imputer. To make the case more interesting, the imputer will fill in the missing values based on the groups’ averages/medians.
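To give an idea of what such an imputer can look like, here is a minimal sketch (the column names and details are illustrative, not the article’s exact implementation):

from sklearn.base import BaseEstimator, TransformerMixin

class GroupImputer(BaseEstimator, TransformerMixin):
    """Impute missing values in target_col with per-group aggregates."""

    def __init__(self, group_col, target_col, metric="mean"):
        self.group_col = group_col
        self.target_col = target_col
        self.metric = metric  # "mean" or "median"

    def fit(self, X, y=None):
        # Learn the imputation value for each group from the training data.
        self.impute_map_ = X.groupby(self.group_col)[self.target_col].agg(self.metric)
        return self

    def transform(self, X):
        X = X.copy()
        missing = X[self.target_col].isna()
        # Map each missing row's group to the aggregate learned in fit().
        X.loc[missing, self.target_col] = X.loc[missing, self.group_col].map(self.impute_map_)
        return X

Because it inherits from BaseEstimator and TransformerMixin, such an imputer can be dropped into a scikit-learn Pipeline like any built-in transformer.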

Why should you write custom imputers as classes?

Before jumping straight into coding I wanted to elaborate on a few potential reasons why writing a custom imputer class (inheriting from scikit-learn) might be worth your…


Photo by NeONBRAND on Unsplash

If you are wondering what causes the SettingWithCopyWarning, this is the place for you!

Regardless of how long you have worked with pandas, be it a day or a year, sooner or later you are likely to run into the infamous SettingWithCopyWarning. In this article, I explain what causes the problem and how to properly address the issue.

A warning, not an error

Before I dive into the technicalities, I want to highlight that SettingWithCopyWarning is, as the name suggests, a warning, not an error. So the code we are executing will most likely not break and will still produce an end result. However, the end result might not be the one we actually intended to obtain.

The reason why I wanted to highlight the distinction is that we might be tempted to ignore the warning when we see that the code actually succeeds in returning a result. And as a matter of fact, the result might be correct! The best practice is to be extra careful and actually understand the underlying principles. This way, we can often save a lot of time trying to identify an obscure bug, which we could have avoided in the first place. …
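To make the discussion concrete, here is a minimal sketch of the most common trigger, chained indexing, together with the recommended fix:

import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# Chained indexing: the first [] can return a copy, so the assignment
# below may modify a temporary object instead of df and raise the warning.
df[df["group"] == "a"]["value"] = 0

# The recommended pattern: a single .loc call on the original DataFrame.
df.loc[df["group"] == "a", "value"] = 0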


Photo by William Iven on Unsplash

Stay up-to-date with the companies’ earnings with a few lines of code

At the moment of writing, we are close to the end of the Q1 2020 earnings season. For many investors and quantitative finance hobbyists, this is an especially interesting period, with multiple short-term investment opportunities. That is why in this article, I wanted to briefly introduce the concept of the earnings season and show how to easily extract all the important information using Python.

What is the earnings season?

The term earnings season refers to the periods of the year during which most publicly listed companies release their quarterly corporate earnings reports to the public. The earnings seasons occur in the months immediately following the end of each fiscal quarter, so they correspond to the months of January, April, July, and October. Typically, an earnings season lasts for about 6 weeks, after which the number of reported earnings subsides to the off-season frequency (there are still some companies reporting off-season, as their financial calendars might differ a bit from those of most companies). …


Photo by Anne Nygård on Unsplash

Recently, I watched the Data Science Pioneers movie by Dataiku, in which several data scientists talked about their jobs and how they apply data science in their daily work. …


Source: Unsplash

A case study of colorizing images coming from an old-school video game using Deep Learning in Python

Recently, I finished working on my Capstone Project for Udacity’s Machine Learning Engineer Nanodegree. As I knew the project would take quite a lot of time and energy, I wanted to work on something that would be genuinely interesting to me. For a long time, I had been intending to familiarize myself with a domain of computer vision, namely image colorization. I am (and have been for as long as I can remember) a big fan of video games, and that is why in this project I wanted to work with something close to my heart.

Recently, I saw a few posts on the Internet showing that by using Deep Learning it is possible to enhance the quality of emulated video games. By emulation, I mean running a game using dedicated software on a system different than the one the game was originally created for. An example could be playing Nintendo 64 games on a PC. By embedding pre-trained Neural Networks in the emulator software, it is possible to upscale the resolution to 4K or to increase the quality/sharpness of the textures. What is really amazing is that these solutions work out of the box for all games, not only for one or two games on which they were directly trained. …


Source: pexels.com

Learn the basics of working with RGB and Lab images to boost your computer vision projects!

Every computer vision project — be it a cat/dog classifier or bringing colors to old images/movies — involves working with images. And in the end, the model can only be as good as the underlying data — garbage in, garbage out. That is why in this post I focus on explaining the basics of working with color images in Python, how they are represented and how to convert the images from one color representation to another.

Setup

In this section, we set up the Python environment. First, we import all the required libraries:

import numpy as np
from skimage.color import rgb2lab, rgb2gray, lab2rgb
from skimage.io …
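The snippet above is cut off; a self-contained sketch of the basic conversions (with a hypothetical file name) could look like this:

from skimage.color import rgb2gray, rgb2lab
from skimage.io import imread

# Read an image as an RGB array of shape (height, width, 3).
image_rgb = imread("example_image.jpg")

# Convert to grayscale (height, width) and to the Lab color space (height, width, 3).
image_gray = rgb2gray(image_rgb)
image_lab = rgb2lab(image_rgb)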


Source: Unsplash

Learn how to carry out the Cohort Analysis to better understand the customers’ behavior

Cohort Analysis is a very useful and relatively simple technique that helps in getting valuable insights about the behavior of any business’ customers/users. For the analysis, we can focus on different metrics (dependent on the business model) — conversion, retention, generated revenue, etc.

In this article, I provide a brief theoretical introduction into the Cohort Analysis and show how to carry it out in Python.

Introduction to Cohort Analysis

Let’s start with the basics. A cohort is a group of people sharing something in common, such as the sign-up date for an app, the month of the first purchase, geographical location, or acquisition channel (organic users, users coming from performance marketing, etc.) …
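As a preview of the pandas workflow, a minimal retention-cohort sketch (with hypothetical file and column names) could look like this:

import pandas as pd

# Hypothetical transactions table with customer_id and order_date columns.
df = pd.read_csv("transactions.csv", parse_dates=["order_date"])

# A customer's cohort is the month of their first purchase.
df["order_month"] = df["order_date"].dt.to_period("M")
df["cohort"] = df.groupby("customer_id")["order_date"].transform("min").dt.to_period("M")

# Number of months elapsed since the cohort month.
df["period_number"] = (df["order_month"] - df["cohort"]).apply(lambda offset: offset.n)

# Count active customers per cohort and period, then pivot into a cohort table.
cohort_pivot = (
    df.groupby(["cohort", "period_number"])["customer_id"]
    .nunique()
    .unstack(fill_value=0)
)

# Retention: divide each row by the cohort's initial size (period 0).
retention = cohort_pivot.div(cohort_pivot[0], axis=0)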



How BUX used data to prototype a new feature and A/B test its performance

At BUX, we do our best to not only rely on our gut feeling about the directions we take but also follow a data-driven approach to decision making. If possible, we combine the user data we obtain via different forms of in-app activity tracking with qualitative feedback gathered from our users (CRM tools, questionnaires, meetups organized by BUX, etc.).

In this article, I would like to describe one of the bigger projects the team worked on recently, which resulted in redesigning one of the key features of the app (the change should already be visible in the latest versions of the app!). …


Source: Unsplash

And how to use a custom class to extract frames as images

In one of my first articles on Medium, I showed how to train a Convolutional Neural Network to classify images coming from old GameBoy games: Mario and Wario. After over a year, I wanted to revisit one aspect of the process: downloading videos (and potentially audio) from YouTube and extracting frames as images. We can use such images for various machine learning projects.

Setup

In my previous article, I used a library called pytube to download the videos. However, after some changes introduced by YouTube, it is not really usable anymore — any attempt to download videos results in KeyError: ‘url_encoded_fmt_stream_map’. …


Source: pixabay

Learn how to leverage the strengths of multiple models using a variant of ensemble learning

In this article, I describe a simple ensemble algorithm. In general, ensemble models combine multiple base models to improve the predicting performance. The best-known example of an ensemble model is the Random Forest, which — greatly simplifying the algorithm’s logic — combines multiple Decision Trees and aggregates their predictions using majority vote in case of a classification problem or by taking the average for regression tasks.

Similarly to the Random Forest, the Voting Ensemble estimates multiple base models and uses voting to combine the individual predictions into the final ones. However, the key difference lies in the base estimators. Models such as the Voting Ensemble (and the Stacking Ensemble) do not require the base models to be homogeneous. …
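In scikit-learn, this idea is implemented as the VotingClassifier (with a VotingRegressor counterpart for regression). A minimal sketch on a dataset bundled with the library:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Heterogeneous base models combined via majority ("hard") voting.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print(f"Test accuracy: {ensemble.score(X_test, y_test):.3f}")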


Source: pixabay

Learn how to create a selection of benchmark models for both classification and regression problems

What I really like about scikit-learn is that I often stumble upon functionalities I was not aware of before. My most recent “discovery” is the DummyClassifier. The dummy estimator does not learn any patterns from the features; it uses simple heuristics (inferred from the targets) to calculate the predictions.

We can use that naïve estimator as a simple sanity check for our more advanced models. To pass the check, the considered model should result in better performance than the simple benchmark.

In this short article, I show how to use the DummyClassifier and explain the available heuristics.
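As a quick preview (a minimal sketch using a dataset bundled with scikit-learn), the most basic usage looks like this:

from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# This heuristic always predicts the most frequent class in the training targets.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
print(f"Benchmark accuracy: {dummy.score(X_test, y_test):.3f}")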

Setup

We need to import the required…


Source: Unsplash

My short story of going from Medium articles to a book contract

In this article, I wanted to share my story of how writing articles on Towards Data Science started quite an adventure, which ended up as a published book. I will also describe what I learned on the way and what tools helped me in the process. Let’s start!

How it began

It was sometime in February of 2019 that I was contacted by an Acquisitions Editor asking if I would be interested in authoring a book. …


Source: Unsplash

Download historical stock prices with as little as one line of code!

The goal of this short article is to show how easy it is to download stock prices (and stock-related data) in Python. In this article, I present two approaches, both using Yahoo Finance as the data source. There are many alternatives out there (Quandl, Intrinio, Alpha Vantage, Tiingo, IEX Cloud, etc.); however, Yahoo Finance can be considered the most popular, as it is the easiest one to access (free and no registration required).

The first approach uses a library called yfinance and is definitely the easiest approach that I am aware of. The second one, with yahoofinancials, is a bit more complicated; however, for the extra effort we put into downloading the data, we receive a wider selection of stock-related data. …
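For the first approach, the one-line download with yfinance could look like this (the ticker and date range are illustrative):

import yfinance as yf

# Download daily OHLCV data for Apple over an illustrative date range.
df = yf.download("AAPL", start="2020-01-01", end="2020-06-30")
print(df.head())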


Learn how to create and implement trading strategies based on Technical Analysis!

BUX Zero, our new zero-commission stock trading app

Investing has always been associated with large amounts of money, both in terms of the invested amount as well as the costs associated with it. Here at BUX, we want to make investing accessible to everyone. That is why we recently launched BUX Zero in the Netherlands, and other European countries will follow soon! BUX Zero is a zero-commission stock trading app, which makes investing not only accessible but also easy to do directly from your phone.

This is the second article on backtesting trading strategies in Python. …


Source: Unsplash

Learn how to use violin plots and what their advantages are over box plots!

In general, violin plots are a method of plotting numeric data and can be considered a combination of the box plot with a kernel density plot. In the violin plot, we can find the same information as in the box plots:

  • median (a white dot on the violin plot)
  • interquartile range (the black bar in the center of violin)
  • the lower/upper adjacent values (the black lines stretched from the bar) — defined as first quartile — 1.5 …
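As a quick illustration (not from the article itself), a violin plot can be created in one line with seaborn, using its bundled tips dataset:

import matplotlib.pyplot as plt
import seaborn as sns

# The tips dataset ships with seaborn.
tips = sns.load_dataset("tips")

# One violin per day: a box-plot summary combined with a kernel density estimate.
ax = sns.violinplot(x="day", y="total_bill", data=tips)
ax.set_title("Distribution of the total bill per day")
plt.show()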

About

Eryk Lewinson

Data Scientist, ML/DL enthusiast, quantitative finance, gamer.
