Photo by fabio on Unsplash

Hello there!

My name is Niels, and I’m a data scientist at an EdTech company based in Oslo. I love writing code and working on projects that allow me to pick up new skills on my data science journey. In my spare time, I work on several (rather nerdy) academic projects on parliamentary speeches (an artefact of my PhD days).

On Medium, I blog about data science, data engineering, and other topics of interest — please check out some of my posts and my (new but growing!) blog series below.

Stay connected

Follow me on Medium or subscribe to my…

Hands-on Tutorials

A concise overview of Elasticsearch concepts and principles

Photo by Mathew Schwartz on Unsplash

Elasticsearch (ES) is a distributed search engine designed for scalability and redundancy. ES has become increasingly popular in recent years because it is robust and scales well when storing and handling large volumes of data for analytics, machine learning, and many other applications. In this blog post, we’ll go over four things that data practitioners need to master to get started with ES: provisioning an ES cluster, designing indices, writing queries, and optimising indices.

1. Setting up an ES cluster

First, you need to know how to set up an ES cluster.

Elasticsearch is a distributed search and analytics engine that is part of the Elastic…

A hands-on guide to automating version tags and changelogs based on your project’s commit history with commitizen

Photo by Yancy Min on Unsplash

Tired of keeping track of whether your code changes warrant a minor or major version increment? Too busy to keep a neat and tidy changelog? Try using commitizen, a command-line utility that “forces” you to write commit messages following the Conventional Commits standard (or a different, user-defined format). Once configured, commitizen will bump the semantic version of your code automatically based on your commits, and update your changelog.
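To make the idea concrete, here is a minimal sketch of how Conventional Commits messages can map onto semantic version increments (fix to patch, feat to minor, breaking change to major). This is a simplified illustration under my own assumptions, not commitizen's actual implementation; the names `required_bump` and `bump_version` are hypothetical.

```python
import re
from typing import List, Optional

# Simplified header pattern: type, optional (scope), optional "!" for breaking changes.
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<breaking>!)?: .+")


def required_bump(message: str) -> Optional[str]:
    """Return 'major', 'minor', 'patch', or None for a single commit message."""
    match = HEADER.match(message)
    if match is None:
        return None  # not a conventional commit; ignored in this sketch
    if match["breaking"] or "BREAKING CHANGE:" in message:
        return "major"
    if match["type"] == "feat":
        return "minor"
    if match["type"] == "fix":
        return "patch"
    return None  # e.g. docs or chore: no version change in this sketch


def bump_version(version: str, messages: List[str]) -> str:
    """Apply the largest bump implied by the messages to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    bumps = {required_bump(message) for message in messages}
    if "major" in bumps:
        return f"{major + 1}.0.0"
    if "minor" in bumps:
        return f"{major}.{minor + 1}.0"
    if "patch" in bumps:
        return f"{major}.{minor}.{patch + 1}"
    return version


print(bump_version("1.2.3", ["fix: handle empty input", "feat(cli): add dry-run flag"]))
# prints 1.3.0
```

The largest increment wins: one `feat` commit among several `fix` commits still yields a minor release.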

Commitizen is configured in five simple steps. First, you need to install commitizen:

pip install commitizen

Second, add a reference to your code’s current version number to your file, or add…

Make your code easier to debug by adding string representations of your Python classes

Photo by Joshua Aragon on Unsplash

In this blog post, we’ll take a brief look at the use of Python’s __repr__() method, and why you should use it when creating your own, custom Python classes.

A quick primer on __repr__()

When creating a Python class, adding a descriptive string representation of the class and its objects is useful for debugging purposes. Including the __repr__() method in your Python class is a common way to achieve this. It defines what the repr() function returns for your class instance, ideally a string that reproduces the code that was used to instantiate it.
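As a minimal sketch, here is what that can look like for a small class. The `Point` class and its attributes are hypothetical and only serve as an illustration.

```python
class Point:
    """A hypothetical example class with two coordinates."""

    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y

    def __repr__(self) -> str:
        # !r renders each attribute with its own repr(), so strings keep their quotes.
        return f"{type(self).__name__}(x={self.x!r}, y={self.y!r})"


point = Point(1, 2)
print(repr(point))  # Point(x=1, y=2)
```

Using `type(self).__name__` rather than a hard-coded class name keeps the output correct for subclasses.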

But what is the __repr__() method? According to the official documentation, __repr__()

A hands-on guide to writing Elasticsearch queries in Domain Specific Language, using the Python Elasticsearch Client

Photo by Christopher Burns on Unsplash

In this post, I’ll introduce the basics of querying in Elasticsearch (ES). We’ll look at how queries are structured (e.g. the filter vs. query context, and relevance scoring) in Elasticsearch Domain Specific Language (DSL) and apply them with the Python Elasticsearch Client. (And, if DSL makes your head spin, skip to the final section of this post, where we’ll go through the basics of running SQL queries against ES). All code used in this blog post is available in this GitHub repo.
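To give a flavour of the filter vs. query context distinction, here is a minimal sketch of a `bool` query built as a plain Python dictionary. The field names `title` and `year` and the helper `build_query` are assumptions for illustration; they do not come from the repo mentioned above.

```python
import json
from typing import Any, Dict


def build_query(text: str, min_year: int) -> Dict[str, Any]:
    """Combine a scored full-text clause with an unscored filter clause."""
    return {
        "query": {
            "bool": {
                # Query context: contributes to the relevance score (_score).
                "must": [{"match": {"title": text}}],
                # Filter context: yes/no matching only, no effect on the score.
                "filter": [{"range": {"year": {"gte": min_year}}}],
            }
        }
    }


print(json.dumps(build_query("data science", 2015), indent=2))
```

The resulting dictionary is the kind of request body that the Python Elasticsearch Client's `search` method sends to the cluster; how it is passed depends on the client version you have installed.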

Getting started

For this post, I’m assuming you are familiar with the basics of ES. I’m also assuming you have…

Making Sense of Big Data

Here are three strategies to reduce the store size of your Elasticsearch indices when dealing with large and continuous streams of data

Photo by Pietro Jeng on Unsplash

Elasticsearch (ES) has gained traction in recent years because it offers a robust and scalable engine for storing and analysing large volumes of data with low latency. If you’re a data engineer or data scientist working with large (and fast-growing) volumes of data, you’ll know that optimising for storage is a key component of building high-quality solutions. In this post, I discuss three strategies for optimising disk usage when using ES. Replication code for this blog post is available on GitHub.

A quick primer on Elasticsearch

Before we start exploring Elasticsearch (ES) storage optimisation, let’s review some ES fundamentals.

If you ask, say, four developers…

A hands-on guide to creating an ES index from a CSV file, and to managing your data with the Python Elasticsearch Client

Photo by Paul Green on Unsplash

Elasticsearch (ES) is a distributed search engine that is designed for scalability and redundancy. It is fast, and it is suited for storing and handling large volumes of data for analytics, machine learning, and other applications. ES has gained traction in recent years, and is an important technology for any data scientist’s (and data engineer’s) toolbox. In this blog post, we’ll provision our own (free) ES cluster with Bonsai, and use the Python Elasticsearch Client to set up an index, and write and query data. All code for this blog post is available on GitHub.
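As a rough sketch of the CSV-to-index step, the snippet below turns CSV rows into action dictionaries in the shape accepted by the bulk helper in `elasticsearch.helpers`. The index name `books`, the sample data, and the function `csv_to_actions` are hypothetical, and the call to the cluster is left as a comment so that the example runs without a connection.

```python
import csv
import io
from typing import Any, Dict, Iterator, TextIO

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE_CSV = "title,year\nDune,1965\nNeuromancer,1984\n"


def csv_to_actions(handle: TextIO, index_name: str) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per CSV row."""
    for row in csv.DictReader(handle):
        yield {"_index": index_name, "_source": dict(row)}


actions = list(csv_to_actions(io.StringIO(SAMPLE_CSV), "books"))
print(actions[0])

# With a connected client (not created here), the actions could be sent with:
#   from elasticsearch import helpers
#   helpers.bulk(client, actions)
```

Note that `csv.DictReader` reads every value as a string, so numeric fields need either an explicit conversion or a suitable index mapping.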

Setting up your Elasticsearch cluster

In a first step, we’ll…

Photo by John Barkiple on Unsplash

Here’s how to make assessing code complexity part of your Python development routine

So you’ve written a piece of Python code and it does the job. Great, but is your code sufficiently simple? Complex code is difficult to read and makes code maintenance more costly. Catching complexity early can save time, money, and a lot of frustration. In this post, I’ll show you how to use the wily command-line tool to trace the complexity of your code over time.

A quick primer on code complexity

Code complexity matters. Unnecessarily complex code is harder to read, and more difficult to maintain. If your code is hard to understand, it’s harder to spot existing bugs and easier to introduce new ones…

Build your own PostgreSQL database for free in three simple steps

For smaller projects, CSV is a great format for storing data. But what if you want to up your data management game? A relational database offers a more robust way to organise and manage your data. In this post, I show how you can transform your CSV files into a PostgreSQL database in three simple steps. I’ll also discuss some of the advantages of using a relational database.

The advantages of a relational database

A relational database is a database that divides data into linked tables that have shared data points. These shared data points allow us to combine information and create novel tables or views with…

Photo by Kevin Ku on Unsplash

Unit testing is key to producing quality code. Here’s how to automate it.

Unit testing is key to developing quality code. There’s a host of libraries and services available that you can use to perfect testing of your Python code. However, “traditional” unit testing is time-intensive and is unlikely to cover the full spectrum of cases that your code is supposed to be able to handle. In this post, I’ll show you how to use property-based testing with Hypothesis to automate testing of your Python code. I also discuss some of the advantages of using a property-based testing framework.
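To illustrate the idea behind property-based testing, here is a hand-rolled sketch that uses only the standard library: instead of asserting on a few hand-picked inputs, it checks general properties of `sorted()` against many randomly generated lists. This is not Hypothesis itself, and the function name `check_sort_properties` is hypothetical.

```python
import random
from collections import Counter
from typing import List


def check_sort_properties(trials: int = 200, seed: int = 0) -> None:
    """Check properties of sorted() on randomly generated integer lists."""
    rng = random.Random(seed)  # seeded so that any failure is reproducible
    for _ in range(trials):
        size = rng.randint(0, 20)
        data: List[int] = [rng.randint(-1000, 1000) for _ in range(size)]
        result = sorted(data)
        # Property 1: the output is in non-decreasing order.
        assert all(a <= b for a, b in zip(result, result[1:]))
        # Property 2: sorting an already sorted list changes nothing.
        assert sorted(result) == result
        # Property 3: the output holds exactly the same elements as the input.
        assert Counter(result) == Counter(data)


check_sort_properties()
print("all properties held")
```

Hypothesis replaces the hand-written loop with its `@given` decorator and strategies for generating inputs, and it also shrinks failing inputs to minimal examples.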

Property-based automated testing

Unit testing involves testing individual components of your code. A typical unit test…

Niels D. Goet

Data Scientist | PhD in Politics, @Oxford_University | Python, R, AWS, ML | EdTech | Author — Towards Data Science
