In the previous article, I described the Kaplan-Meier estimator. To give a quick recap, it is a non-parametric method of approximating the true survival function. This time, I will focus on another approach to visualizing a survival dataset — using the hazard function and the Nelson-Aalen estimator. Once again, we will use the convenience of the
lifelines library to quickly create the plots in Python.
With the Kaplan-Meier curves, we approximated the survival function, stating the probability of the event of interest (for example, the death event) not occurring by a certain time t.
An alternative approach to visualizing the aggregate information from a survival-focused dataset entails using the hazard function, which can be interpreted as the probability of the subject experiencing the event of interest within a small interval of time, assuming that the subject has survived up until the beginning of the said interval. For a more detailed description of the hazard function, please see this article. …
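To make the definition concrete, here is a minimal, dependency-free sketch of how the Nelson-Aalen cumulative hazard is computed (the toy data and helper name are illustrative; in practice, lifelines' NelsonAalenFitter does this and adds confidence intervals):

```python
# Nelson-Aalen estimate of the cumulative hazard: H(t) is the sum, over
# event times t_i <= t, of d_i / n_i, where d_i is the number of events
# at t_i and n_i is the number of subjects still at risk just before t_i.

def nelson_aalen(durations, observed):
    """Return (event_times, cumulative_hazard) for right-censored data."""
    pairs = sorted(zip(durations, observed))
    n_at_risk = len(pairs)
    times, hazard = [], []
    h = 0.0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        # group all subjects with the same duration t
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]   # observed == 1 means the event occurred
            removed += 1            # censored subjects also leave the risk set
            i += 1
        if deaths:
            h += deaths / n_at_risk
            times.append(t)
            hazard.append(h)
        n_at_risk -= removed
    return times, hazard

# toy data: 1 = event observed, 0 = censored
durations = [1, 2, 2, 3, 5]
observed = [1, 1, 0, 1, 1]
times, H = nelson_aalen(durations, observed)
```

Plotting `H` against `times` as a step function gives the cumulative hazard curve that lifelines would draw for us.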
In a previous article, I showed how we can create the Kaplan-Meier curves using Python. As much as I love Python and writing code, there might be some alternative approaches with their unique set of benefits. Enter Tableau!
Tableau is a business intelligence tool used for creating elegant and interactive visualizations on top of data coming from a vast number of sources (you would be surprised how many distinct ones there are!). To make the definition even shorter, Tableau is used for building dashboards.
So why would a data scientist be interested in using Tableau instead of Python? When creating a Notebook/report with the results of a survival analysis exercise in Python, the reader will always be limited…
In my previous article, I described the potential use-cases of survival analysis and introduced all the building blocks required to understand the techniques used for analyzing the time-to-event data.
I continue the series by explaining perhaps the simplest, yet very insightful approach to survival analysis — the Kaplan-Meier estimator. After a theoretical introduction, I will show you how to carry out the analysis in Python using the popular
The Kaplan-Meier estimator (also known as the product-limit estimator, you will see why later on) is a non-parametric technique for estimating and plotting the survival probability as a function of time. It is often the first step in carrying out a survival analysis, as it is the simplest approach and requires the fewest assumptions. …
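The "product-limit" name becomes clear once you see the computation: at each event time, the running survival estimate is multiplied by one minus the fraction of at-risk subjects who experienced the event. A minimal sketch with made-up data (in practice, lifelines' KaplanMeierFitter does this and adds confidence intervals):

```python
def kaplan_meier(durations, observed):
    """Product-limit estimate of the survival function for right-censored data."""
    pairs = sorted(zip(durations, observed))
    n_at_risk = len(pairs)
    times, survival = [], []
    s = 1.0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        # group all subjects with the same duration t
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]   # observed == 1 means the event occurred
            removed += 1            # censored subjects also leave the risk set
            i += 1
        if deaths:
            s *= 1 - deaths / n_at_risk   # the "product limit" step
            times.append(t)
            survival.append(s)
        n_at_risk -= removed
    return times, survival

# toy data: 1 = event observed, 0 = censored
durations = [1, 2, 2, 3, 5]
observed = [1, 1, 0, 1, 1]
times, S = kaplan_meier(durations, observed)
```

Note how the censored observation at t=2 does not lower the curve directly; it only shrinks the risk set for later event times.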
In our extremely competitive times, all businesses face the problem of customer churn/retention. To quickly give some context, churn happens when the customer stops using the services of a company (stops purchasing, cancels the subscription, etc.). Retention refers to keeping the clients of a business active (the definition of active highly depends on the business model).
Intuitively, companies want to increase retention by preventing churn. This way, their relationship with the customers is longer and thus potentially more profitable. What is more, in most cases the company’s cost of retaining a customer is much lower than that of acquiring a new customer, for example, via performance marketing. …
I think you will not argue with me when I state that data science is becoming one of the most popular fields to work in, especially given that Harvard Business Review named “data scientist” the sexiest job of the 21st century. In the field, we have come a long way, from the times when terms like data science and machine learning were still unknown and everything was gathered under the umbrella of statistics. However, we are far from the end of the journey.
That can also be a divisive aspect of data science — the field is developing so rapidly that it can be difficult to even keep up with all the new algorithms, techniques, and approaches. So working in data science, similarly to software engineering, often requires constant learning and development. Don’t get me wrong, some people (myself included) like that a lot. Others prefer to learn for a few years and then simply reap the benefits of that knowledge. …
There is a lot of content (books, blog posts, tutorials, etc.) out there about clustering in general and various ways of finding the optimal number of clusters in algorithms such as k-means clustering: the gap statistic, silhouette score, the infamous elbow (scree) plot, and many more.
Thanks to COVID-19 and the newly found chunks of extra spare time, I finally could go back to my backlog of books and finished Andriy Burkov’s excellent The Hundred-Page Machine Learning Book. The book provides a very good overview of various aspects of machine learning and encourages the readers to dive deeper into the topics of their interest. …
Nowadays, the majority of apps/websites we use on a daily basis are greatly interested in how we use them. That is because they want to learn from the users’ behavior and improve in order to attract or retain more users.
Imagine your favorite e-commerce app. Most likely, you are free to browse the products that the company is offering, and every now and then you are gently nudged to create an account in order to actually purchase something. …
As a data scientist, you will most likely come across the effect size while working on some kind of A/B testing. A possible scenario is that the company wants to make a change to the product (be it a website, mobile app, etc.) and your task is to make sure that the change will — to some degree of certainty — result in better performance in terms of the specified KPI.
This is when hypothesis testing comes into play. However, a statistical test can only inform about the likelihood that an effect exists. By effect, I simply mean a difference — it can just be a difference in either direction, but it can also be a more precise variant of a hypothesis stating that one sample is actually better/worse than the other one (in terms of the given metric). …
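As a concrete illustration of quantifying such an effect, Cohen's d (one popular effect-size measure) standardizes the difference between two sample means by their pooled standard deviation. A small sketch with made-up samples:

```python
from statistics import mean, stdev

def cohens_d(sample_a, sample_b):
    """Cohen's d: standardized difference between two sample means,
    using the pooled (sample) standard deviation."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = stdev(sample_a) ** 2, stdev(sample_b) ** 2
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / pooled_sd

# toy example: some metric measured in a variant vs. a control group
variant = [12, 14, 13, 15, 16]
control = [10, 11, 12, 11, 13]
d = cohens_d(variant, control)
```

By a common rule of thumb, |d| around 0.2 is a small effect, 0.5 medium, and 0.8 or more large — the toy data above lands well into "large".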
Python has a few very well developed and mature libraries used for statistical analysis, with the biggest two being
scipy. These two contain a lot (and I mean a LOT) of statistical functions and classes that will cover your use-cases 99% of the time. So why are new libraries still being released?
The newcomers often try to fill a niche or provide something extra that the established competition does not have. Recently, I stumbled upon a relatively new library called
pingouin. Some key features of the library include:
numpy. Operating directly on
DataFrames is something that can definitely come in handy and simplify the workflow. …
Since the 1970s and 1980s, we have observed the meteoric rise of the metric as a performance measure in business management. Peter Drucker, the man who is said to have invented modern business management, once wrote that “you can’t manage what you can’t measure”. The idea embedded in that quote has since been applied not only in the business environment but in basically all industries, including education, healthcare, the military, etc.
However, metrics are not a silver bullet for measuring the efficiency of different aspects of the business (and not only). …
Working with missing data is an inherent part of the majority of machine learning projects. A typical approach would be to use
SimpleImputer (or another imputer from the
sklearn.impute module). However, often the simplest approach might not be the best one and we could gain some extra performance by using a more sophisticated approach.
That is why in this article I wanted to demonstrate how to code a custom
scikit-learn based imputer. To make the case more interesting, the imputer will fill in the missing values based on the groups’ averages/medians.
Before jumping straight into coding I wanted to elaborate on a few potential reasons why writing a custom imputer class (inheriting from
scikit-learn) might be worth your…
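As a preview of the idea (not the article's actual scikit-learn transformer — the helper name and data are made up for illustration), group-based imputation boils down to computing a per-group statistic on the observed values and filling the gaps with it:

```python
def group_mean_impute(rows, group_key, value_key):
    """Fill missing values (None) in value_key with the mean of the
    non-missing values within the same group_key group."""
    sums, counts = {}, {}
    for row in rows:
        v = row[value_key]
        if v is not None:
            g = row[group_key]
            sums[g] = sums.get(g, 0.0) + v
            counts[g] = counts.get(g, 0) + 1
    group_means = {g: sums[g] / counts[g] for g in sums}
    # return new rows, replacing None with the group's mean
    return [
        {**row, value_key: group_means[row[group_key]]}
        if row[value_key] is None else row
        for row in rows
    ]

data = [
    {"city": "A", "price": 10.0},
    {"city": "A", "price": None},
    {"city": "A", "price": 20.0},
    {"city": "B", "price": 4.0},
    {"city": "B", "price": None},
]
filled = group_mean_impute(data, "city", "price")
```

Wrapping this logic in a class with `fit`/`transform` methods (inheriting from scikit-learn's `BaseEstimator` and `TransformerMixin`) is what makes it pluggable into pipelines.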
SettingWithCopyWarning, this is the place for you!
Regardless of how long you have worked with
pandas, be it a day or a year, sooner or later you are likely to run into the infamous
SettingWithCopyWarning. In this article, I explain what causes the problem and how to properly address the issue.
Before I dive into the technicalities, I want to highlight that
SettingWithCopyWarning is — as the name suggests — a warning, not an error. So the code we are executing will most likely not break and produce the end result. However, the end result might not be the one we actually intended to obtain.
The reason why I wanted to highlight the distinction is that we might be tempted to ignore the warning when we see that the code actually succeeds in returning a result. And as a matter of fact, the result might be correct! The best practice is to be extra careful and actually understand the underlying principles. This way, we can often save a lot of time trying to identify an obscure bug, which we could have avoided in the first place. …
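A minimal illustration of the pattern that avoids the warning (using a toy DataFrame): chained indexing may write to a temporary copy, while a single .loc call writes to the original object:

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

# Chained indexing such as df[df["group"] == "a"]["value"] = 0 may assign
# into a temporary copy and trigger SettingWithCopyWarning -- the original
# df can silently remain unchanged.

# The reliable pattern: one .loc call with both the row and the column
# selection, so pandas modifies df itself.
df.loc[df["group"] == "a", "value"] = 0
```

After the `.loc` assignment, the rows of group "a" really are zeroed out in `df` — no copy, no warning.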
At the moment of writing, we are close to the end of the Q1 2020 earnings season. For many investors and quantitative finance hobbyists, this is an especially interesting period, with multiple short-term investment opportunities. That is why in this article, I wanted to briefly introduce the concept of the earnings season and show how to easily extract all the important information using Python.
The term earnings season refers to the time periods of the year during which most publicly listed companies release their quarterly corporate earnings reports to the public. The earnings seasons occur in the months immediately following the end of each fiscal quarter, so this would correspond to the months of January, April, July, and October. Typically, the earnings seasons last for about six weeks, after which the number of reported earnings subsides to the off-season frequency (there are still some companies reporting off-season, as their financial calendars might be slightly different from most). …
Recently I finished working on my Capstone Project for Udacity’s Machine Learning Engineer Nanodegree. As I knew the project would take quite a lot of time and energy, I wanted to work on something that would be genuinely interesting to me. For a long time, I had been intending to familiarize myself with a domain of computer vision, namely image colorization. I am (and have been for as long as I can remember) a big fan of video games and that is why in this project I wanted to work with something close to my heart.
Recently, I saw a few posts on the Internet showing that by using Deep Learning it is possible to enhance the quality of emulated video games. By emulation, I mean running a game using dedicated software on a system different than the one the game was originally created for. An example could be playing Nintendo 64 games on a PC. By embedding pre-trained Neural Networks in the emulator software, it is possible to upscale the resolution to 4K or to increase the quality/sharpness of the textures. What is really amazing is that these solutions work out of the box for all games, not only for one or two games on which they were directly trained. …
Every computer vision project — be it a cat/dog classifier or bringing colors to old images/movies — involves working with images. And in the end, the model can only be as good as the underlying data — garbage in, garbage out. That is why in this post I focus on explaining the basics of working with color images in Python, how they are represented and how to convert the images from one color representation to another.
In this section, we set up the Python environment. First, we import all the required libraries:
import numpy as np
from skimage.color import rgb2lab, rgb2gray, lab2rgb
from skimage.io …
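As a side note on what a color-to-grayscale conversion does under the hood, here is a numpy-only sketch using the Rec. 601 luma weights (skimage's rgb2gray performs a similar weighted sum, though with slightly different coefficients):

```python
import numpy as np

# a tiny 2x2 "image": height x width x 3 channels (R, G, B), values in [0, 1]
img = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
])

# grayscale = weighted sum over the channel axis (Rec. 601 luma weights)
weights = np.array([0.299, 0.587, 0.114])
gray = img @ weights   # shape (2, 2)
```

Pure red, green, and blue pixels map to different gray levels because the human eye is most sensitive to green, which is exactly what the weights encode.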
Cohort Analysis is a very useful and relatively simple technique that helps in getting valuable insights into the behavior of any business’s customers/users. For the analysis, we can focus on different metrics (dependent on the business model) — conversion, retention, generated revenue, etc.
In this article, I provide a brief theoretical introduction into the Cohort Analysis and show how to carry it out in Python.
Let’s start with the basics. A cohort is a group of people sharing something in common, such as the sign-up date to an app, the month of the first purchase, geographical location, acquisition channel (organic users, coming from performance marketing, etc.) …
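To make the idea concrete, here is a dependency-free sketch of a retention-style cohort table (the data model and function name are made up for illustration):

```python
def retention_table(events):
    """events: list of (user_id, cohort, period) tuples, where cohort is the
    period in which the user first appeared and period counts time since then.
    Returns {cohort: {period: share of the cohort active in that period}}."""
    cohort_users = {}
    active = {}
    for user, cohort, period in events:
        cohort_users.setdefault(cohort, set()).add(user)
        active.setdefault((cohort, period), set()).add(user)
    return {
        c: {
            p: len(users) / len(cohort_users[c])
            for (cc, p), users in sorted(active.items())
            if cc == c
        }
        for c in cohort_users
    }

events = [
    ("u1", "2020-01", 0), ("u2", "2020-01", 0),
    ("u1", "2020-01", 1),                      # only u1 came back next period
    ("u3", "2020-02", 0),
]
table = retention_table(events)
```

Period 0 is always 100% by construction (everyone in a cohort was active when the cohort formed); the interesting part is how quickly the later periods decay.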
At BUX, we do our best to not only rely on our gut feeling about the directions we take but also follow a data-driven approach to decision making. If possible, we combine the user data we obtain via different forms of in-app activity tracking with qualitative feedback gathered from our users (CRM tools, questionnaires, meetups organized by BUX, etc.).
In this article, I would like to describe one of the bigger projects the team worked on recently, which resulted in redesigning one of the key features of the app (the change should already be visible in the latest versions of the app!). …
In one of my first articles on Medium, I showed how to train a Convolutional Neural Network to classify images coming from old GameBoy games — Mario and Wario. After over a year, I wanted to revisit one aspect of the process — downloading videos (and potentially audio) from YouTube and extracting frames as images. We can use such images for various machine learning projects.
In my previous article, I used a library called
pytube to download the videos. However, after some changes introduced by YouTube, it is not really usable anymore — any attempt to download videos results in
KeyError: ‘url_encoded_fmt_stream_map’. …
In this article, I describe a simple ensemble algorithm. In general, ensemble models combine multiple base models to improve predictive performance. The best-known example of an ensemble model is the Random Forest, which — greatly simplifying the algorithm’s logic — combines multiple Decision Trees and aggregates their predictions using a majority vote in the case of a classification problem or by taking the average for regression tasks.
Similarly to the Random Forest, the Voting Ensemble estimates multiple base models and uses voting to combine the individual predictions to arrive at the final ones. However, the key difference lies in the base estimators. Models such as Voting Ensemble (and Stacking Ensemble) do not require the base models to be homogenous. …
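The hard-voting step itself is simple enough to sketch in a few lines (toy predictions; real implementations such as scikit-learn's VotingClassifier also support soft voting on predicted probabilities):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine the class predictions of several base models by hard voting.
    predictions: list of per-model prediction lists, all of equal length."""
    combined = []
    for votes in zip(*predictions):
        # pick the most common class among the models' votes for this sample
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# predictions of three heterogeneous base models for four samples
model_a = [0, 1, 1, 0]
model_b = [0, 1, 0, 0]
model_c = [1, 1, 1, 0]
final = majority_vote([model_a, model_b, model_c])
```

Because the base models can be of entirely different types (say, a logistic regression, a tree, and a k-NN), their errors tend to be less correlated, which is where the ensemble's gain comes from.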
What I really like about
scikit-learn is that I often stumble upon functionalities that I was not aware of before. My most recent “discovery” is the
DummyClassifier. The dummy estimator does not learn any patterns from the features; instead, it uses simple heuristics (inferred from the targets) to calculate the predictions.
We can use that naïve estimator as a simple sanity check for our more advanced models. To pass the check, the considered model should result in better performance than the simple benchmark.
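As an illustration of one such heuristic, the "most frequent" baseline can be sketched by hand (the class name below is made up; scikit-learn's DummyClassifier(strategy="most_frequent") provides the same behavior, among other strategies):

```python
from collections import Counter

class MostFrequentBaseline:
    """Predicts the most common training label for every sample,
    mimicking a 'most frequent' dummy classifier."""
    def fit(self, X, y):
        # remember only the modal class; the features X are ignored entirely
        self.constant_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.constant_] * len(X)

clf = MostFrequentBaseline().fit(X=[[0], [1], [2], [3]], y=[1, 0, 1, 1])
preds = clf.predict([[10], [20]])
```

On an imbalanced dataset such a baseline can look deceptively accurate, which is exactly why any real model should be required to beat it.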
In this short article, I show how to use the
DummyClassifier and explain the available heuristics.
We need to import the required…
In this article, I wanted to share my story of how writing articles on Towards Data Science started quite an adventure, which ended up as a published book. I will also describe what I learned on the way and what tools helped me in the process. Let’s start!
It was sometime in February of 2019 that I was contacted by an Acquisitions Editor with a question if I would be interested in authoring a book. …
The goal of this short article is to show how easy it is to download stock prices (and stock-related data) in Python. In this article, I present two approaches, both using Yahoo Finance as the data source. There are many alternatives out there (Quandl, Intrinio, Alpha Vantage, Tiingo, IEX Cloud, etc.), however, Yahoo Finance can be considered the most popular as it is the easiest one to access (free and no registration required).
The first approach uses a library called
yfinance and it is definitely the easiest approach that I am aware of. The second one, with
yahoofinancials, is a bit more complicated, however, for the extra effort we put into downloading the data, we receive a wider selection of stock-related data. …
Investing has always been associated with large amounts of money, both in terms of the invested amount as well as the costs associated with it. Here at BUX, we want to make investing accessible to everyone. That is why we recently launched BUX Zero in the Netherlands, and other European countries will follow soon! BUX Zero is a zero-commission stock trading app, which makes investing not only accessible but also easy to do directly from your phone.
In general, violin plots are a method of plotting numeric data and can be considered a combination of the box plot with a kernel density plot. In the violin plot, we can find the same information as in the box plots:
first quartile — 1.5 …
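The listed box-plot ingredients can be computed directly; a stdlib sketch with made-up data (note that exact quantile conventions differ between libraries):

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # one clear outlier

# quartiles: statistics.quantiles uses the "exclusive" method by default
q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1

# "adjacent values": the most extreme data points still lying within
# 1.5 * IQR of the box edges; anything beyond them is drawn as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
lower_adjacent = min(x for x in data if x >= lower_fence)
upper_adjacent = max(x for x in data if x <= upper_fence)
```

In this toy sample, the value 100 falls beyond the upper fence, so a box (or violin) plot would mark it as an outlier while the whisker stops at 9.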