# 1st place in Kaggle LANL Earthquake Prediction Competition

I am excited to report another winning solution on Kaggle, this time for the LANL Earthquake Prediction competition. The goal of this competition was to predict the remaining time until an earthquake occurs, based on acoustic data from laboratory experiments. The following description can also be found on Kaggle, together with the accompanying discussion.

Thanks a lot to the hosts of this competition, and congrats to all participants and, of course, to my amazing teammates.

What made this competition tricky was finding a proper CV setup that you could trust, as the public LB gave poor feedback on the private LB. …

# 1st place in Kaggle Quora Insincere Questions Classification Competition

A few weeks ago, I decided to try out Kaggle competitions and won first place in the very first competition I participated in: Kaggle Quora Insincere Questions Classification.

The goal of the competition was to classify Quora questions (text data) into two categories: sincere or insincere. In the following, I try to summarize some of the main points of our solution.

Model Structure

We played around with a variety of different model structures, but in the end resorted to a quite simple one. It’s basically a single Bi-LSTM 128 followed by a Conv1D with kernel size 1 only…

# Introduction to Bayesian Inference: A Coin Flipping Example

Originally published at my old Wordpress blog.

Recently, I have been more involved in teaching, and one part of my teaching efforts has been to provide an introduction to Bayesian inference. Personally, I have the intuition that this can best be achieved by working through a very simple example: namely, the classic coin flip example. It is also often helpful to accompany the elaborations on Bayesian statistics with the methods frequentists would use to tackle the same problems. After all, these are the things most people are familiar with.

To that end, I have prepared slides and a jupyter notebook:

Slides: Speakerdeck

Notebook: Nbviewer

If you have any comments, please let me know. I am also happy to receive help with extending the current work, e.g., by collaborating on GitHub.
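As a rough illustration of the coin flip example, the sketch below contrasts a Bayesian and a frequentist treatment of the same data. It is not taken from the slides or the notebook: the counts (7 heads, 3 tails), the uniform Beta(1, 1) prior, and all function names are assumptions chosen for illustration.

```python
import math


def beta_posterior(heads, tails, prior_a=1.0, prior_b=1.0):
    """Conjugate update: a Beta(a, b) prior plus coin flips gives a Beta posterior."""
    return prior_a + heads, prior_b + tails


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def binomial_two_sided_p(heads, n, p0=0.5):
    """Exact two-sided binomial test: sum the probabilities of all outcomes
    that are at most as likely as the observed one under H0: p = p0."""
    pmf = [math.comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(n + 1)]
    observed = pmf[heads]
    return sum(p for p in pmf if p <= observed * (1 + 1e-9))


# Hypothetical data: 7 heads and 3 tails in 10 flips.
heads, tails = 7, 3

# Bayesian view: a full posterior distribution over the coin's bias.
a, b = beta_posterior(heads, tails)      # uniform Beta(1, 1) prior (assumed)
posterior_mean = beta_mean(a, b)

# Frequentist view: a point estimate plus a p-value for H0 "the coin is fair".
mle = heads / (heads + tails)
p_value = binomial_two_sided_p(heads, heads + tails)

print(f"Posterior: Beta({a:.0f}, {b:.0f}), mean = {posterior_mean:.3f}")
print(f"MLE = {mle:.3f}, two-sided p-value = {p_value:.4f}")
```

The posterior mean is pulled slightly towards 0.5 compared with the maximum likelihood estimate, which is the effect of the prior; with more flips, the two estimates converge.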

# UDF in Google’s BigQuery: An example based on calculating text readability

Originally published at my old Wordpress blog.

In my data science workflow, I have recently started to heavily utilize Google’s BigQuery, which allows you to store and query large data in SQL style. Internally, Google uses its enormous processing power to guarantee blazing fast queries, even complex ones that operate on huge data. A certain amount of operations is free, and beyond that free quota, Google has a reasonable pricing model.

Due to the fast calculation of queries, it is mostly a good idea to do as much calculation inside a…

# Bayesian Correlation with PyMC

Originally published at my old Wordpress blog.

Recently, I have been getting more and more interested in Bayesian techniques and specifically, I have been researching how to approach classical statistical inference problems within the Bayesian framework. As a start, I have looked into calculating Pearson correlation.

To that end, I have found great resources in the blog of Rasmus Bååth, who wrote a two-part series about how to model correlation in a Bayesian way [1,2]. A very similar model has also been proposed and discussed in [3].

My main contribution here is that I show how to apply the…

# Handling huge matrices in Python

Originally published at my old Wordpress blog.

Everyone who does scientific computing in Python has to handle matrices at least sometimes. The go-to library for using matrices and performing calculations on them is Numpy. However, sometimes your matrices grow so large that you cannot store them any longer in memory. This blog post should provide insights into how to handle such a scenario.

The most prominent solution, and the one I would suggest first, is to use Scipy’s sparse matrices. Scipy is a package that builds upon Numpy but provides further mechanisms like sparse matrices which are regular matrices that…

# Determining Power Law parameter(s) using Bayesian modeling with PyMC

Originally published at my old Wordpress blog.

In a previous post, I talked about fitting the power law function to empirical data. Recently, I got highly interested in Bayesian modeling and probabilistic programming. I am currently re-reading the excellent freely available book “Probabilistic Programming and Bayesian Methods for Hackers” which provides a thorough tutorial for using the PyMC Python library. In order to get familiar with the framework, I had the idea to try to model the parameters of a power law function given empirical data using Bayesian inference.

I make the code and explanations available via an IPython notebook.
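To give a flavor of the idea, here is a minimal sketch that infers the exponent of a continuous power law from data. It does not reproduce the notebook: instead of PyMC's MCMC sampling it uses a plain grid approximation with a flat prior, and the synthetic data, the known `xmin`, the grid range, and all names are assumptions for illustration.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n samples from p(x) ~ x^(-alpha), x >= xmin, via inverse transform."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def log_likelihood(alpha, data, xmin):
    """Log-likelihood of a continuous power law with exponent alpha > 1."""
    n = len(data)
    log_sum = sum(math.log(x / xmin) for x in data)
    return n * math.log(alpha - 1.0) - n * math.log(xmin) - alpha * log_sum


def grid_posterior(data, xmin, grid):
    """Posterior over alpha on a grid, assuming a flat prior over the grid."""
    log_post = [log_likelihood(alpha, data, xmin) for alpha in grid]
    peak = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(42)
data = sample_power_law(2000, alpha=2.5, xmin=1.0, rng=rng)  # synthetic data

grid = [1.01 + 0.01 * i for i in range(400)]  # candidate exponents 1.01 .. 5.00
posterior = grid_posterior(data, xmin=1.0, grid=grid)
posterior_mean = sum(alpha * p for alpha, p in zip(grid, posterior))

print(f"Posterior mean of alpha: {posterior_mean:.3f}")
```

A grid works here because there is only one parameter; a sampler such as the one PyMC provides becomes the more practical choice once `xmin` or further parameters are inferred as well.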

# Statistical test for randomness in categorical data sequences

Originally published at my old Wordpress blog.

Previously, I worked a lot with sequences consisting of categorical data, i.e., sequences of categories where the set of categories is finite. As a prerequisite for my further modeling approaches to such data, I was interested in first applying a statistical test that gives me an idea of whether the data has been produced in a random fashion or based on some regularities.

My first idea was to use autocorrelation and subsequently apply something like a Ljung-Box test, which could give me insights into the overall randomness of a sequence. However, autocorrelation builds…

# Statistical Significance Tests on Correlation Coefficients

Originally published at my old Wordpress blog.

Recently, I had to determine whether two calculated correlation coefficients are statistically significantly different from each other. Basically, there are two types of scenarios: (i) you want to compare two dependent correlations, or (ii) you want to compare two independent correlations. I now want to cover both cases and present methods for determining the statistical significance.

Two independent correlations

This use case applies when you have two correlations that come from different samples and are independent of each other. An example would be that you want to know whether height and weight are…
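The excerpt is cut off before it names a method, so the following is only a sketch of one common approach to the independent case: Fisher's z-transformation of both coefficients followed by a two-sided z-test. The function name and the example values are hypothetical.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent
    Pearson correlations, using Fisher's z-transformation."""
    z1 = math.atanh(r1)  # Fisher transformation of each coefficient
    z2 = math.atanh(r2)
    # Standard error of the difference of the two transformed values.
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value under the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical example: r = 0.5 in one sample of 100, r = 0.3 in another sample of 100.
z_stat, p_value = compare_independent_correlations(0.5, 100, 0.3, 100)
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")
```

Each sample needs more than three observations for the standard error to be defined, and the approximation improves with larger samples.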

# The popularity of subreddits and domains on Reddit

Originally published at my old Wordpress blog.

In a previous blog post, I introduced a Reddit dataset I crawled which includes submission data for one complete year (2012-04-24 to 2014-04-23). I showed that the number of submissions each day was steadily rising, but that, interestingly, more submissions are added to Reddit during the week. However, weekend submissions seem to get more attention from Redditors on average. Motivated by comments on this blog post and some further ideas, I now extend the analysis of this dataset. …

## Philipp Singer

Data Scientist at UNIQA Insurance Group, PhD in CS, passionate about machine learning, statistics, data mining, programming, blockchain, and many other fields.
