My name is Niels, and I’m a data scientist at an EdTech company based in Oslo. I love writing code and working on projects that allow me to pick up new skills on my data science journey. In my spare time, I work on several (rather nerdy) academic projects on parliamentary speeches (an artefact of my PhD days).
On Medium, I blog about data science, data engineering, and other topics of interest — please check out some of my posts and my (new but growing!) blog series below.
Elasticsearch (ES) is a distributed search engine designed for scalability and redundancy. ES has become increasingly popular in recent years because it is robust and scales well for machine learning, for storing and handling large volumes of data for analytics, and for many other applications. In this blog post, we’ll go over four things that data practitioners need to master to get started with ES: provisioning an ES cluster, designing indices, writing queries, and optimising indices.
First, you need to know how to set up an ES cluster.
Elasticsearch is a distributed search and analytics engine that is part of the Elastic…
Tired of keeping track of whether your code changes warrant a minor or major version increment? Too busy to keep a neat and tidy changelog? Try using commitizen, a command-line utility that “forces” you to write commit messages following the Conventional Commits standard (or a different, user-defined format). Once configured, commitizen will bump your code’s semantic version automatically based on your commits, and update your changelog.
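To make the idea of deriving a version bump from commit messages concrete, here is a minimal sketch in plain Python. It is not commitizen's implementation or API: the helper names `required_bump` and `bump_version` are hypothetical, and the mapping (breaking change to major, `feat` to minor, `fix` to patch) follows the Conventional Commits convention. Commitizen's own rules are configurable and may differ.

```python
import re

# Hypothetical helpers for illustration only; this is not commitizen's API.
# Matches headers such as "feat(api)!: add endpoint".
HEADER = re.compile(r"^(?P<type>\w+)(\([^)]*\))?(?P<breaking>!)?: .+")


def required_bump(messages):
    """Return 'major', 'minor', 'patch', or None for a list of commit messages."""
    order = {None: 0, "patch": 1, "minor": 2, "major": 3}
    bump = None
    for message in messages:
        header = message.splitlines()[0] if message else ""
        match = HEADER.match(header)
        if match is None:
            continue  # not a Conventional Commits message; ignore it
        if match["breaking"] or "BREAKING CHANGE:" in message:
            candidate = "major"
        elif match["type"] == "feat":
            candidate = "minor"
        elif match["type"] == "fix":
            candidate = "patch"
        else:
            candidate = None  # e.g. docs, chore: no release needed in this sketch
        if order[candidate] > order[bump]:
            bump = candidate
    return bump


def bump_version(version, bump):
    """Apply a bump to a 'MAJOR.MINOR.PATCH' version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version


commits = ["fix: handle empty input", "feat(api): add endpoint"]
print(bump_version("1.2.3", required_bump(commits)))  # 1.3.0
```

The largest bump found across the commits wins, which is why a single `feat` commit outranks any number of `fix` commits.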
Commitizen is configured in five simple steps. First, you need to install commitizen:
pip install commitizen
Second, add a reference to your code’s current version number to your __init__.py file, or add…
In this blog post, we’ll take a brief look at the use of Python’s __repr__() method, and why you should use it when creating your own, custom Python classes.
When creating a Python class, adding a descriptive string representation of the class and its objects is useful for debugging purposes. Including the __repr__() method in your Python class is a common way to achieve this. It allows you to call the repr() function directly on your class instance to reproduce the code that was used to instantiate it.
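As a minimal sketch of what this looks like in practice, consider the class below. The class name `Speech` and its fields are hypothetical and chosen only for illustration.

```python
class Speech:
    """Hypothetical example class; the name and fields are illustrative."""

    def __init__(self, speaker, year):
        self.speaker = speaker
        self.year = year

    def __repr__(self):
        # !r applies repr() to each field, so strings keep their quotes
        # and the output can be pasted back in as valid Python.
        return f"{type(self).__name__}(speaker={self.speaker!r}, year={self.year!r})"


speech = Speech("Jane Doe", 2021)
print(repr(speech))  # Speech(speaker='Jane Doe', year=2021)
```

Using `type(self).__name__` rather than a hard-coded class name keeps the representation accurate for subclasses.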
But what is the __repr__() method? According to the official documentation,
In this post, I’ll introduce the basics of querying in Elasticsearch (ES). We’ll look at how queries are structured (e.g. the filter vs. query context, and relevance scoring) in Elasticsearch Domain Specific Language (DSL) and apply them with the Python Elasticsearch Client. (And, if DSL makes your head spin, skip to the final section of this post, where we’ll go through the basics of running SQL queries against ES). All code used in this blog post is available in this GitHub repo.
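To give a rough idea of the difference between the query context and the filter context, the sketch below builds a `bool` query as a plain Python dictionary. The field names (`text`, `party`, `date`) and values are hypothetical and not taken from the post or its repo.

```python
import json

# Field names ("text", "party", "date") and values are hypothetical.
query_body = {
    "query": {
        "bool": {
            # Query context: clauses here contribute to the relevance score (_score).
            "must": [{"match": {"text": "climate policy"}}],
            # Filter context: clauses here only include or exclude documents
            # and do not affect the score.
            "filter": [
                {"term": {"party": "green"}},
                {"range": {"date": {"gte": "2019-01-01"}}},
            ],
        }
    }
}

# Print the JSON that would be sent to Elasticsearch.
print(json.dumps(query_body, indent=2))
```

With the Python Elasticsearch Client, a dictionary like this would typically be handed to the client's `search` method. The exact keyword arguments depend on the client version, so treat that step as an assumption to check against the version you have installed.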
For this post, I’m assuming you are familiar with the basics of ES. I’m also assuming you have…
Elasticsearch (ES) has gained traction in recent years because it offers a robust and scalable engine for storing and analysing large volumes of data with low latency. If you’re a data engineer or data scientist working with large (and fast-growing) volumes of data, you’ll know that optimising for storage is a key component of building high-quality solutions. In this post, I discuss three strategies for optimising disk usage when using ES. Replication code for this blog post is available on GitHub.
Before we start exploring Elasticsearch (ES) storage optimisation, let’s review some ES fundamentals.
Elasticsearch (ES) is a distributed search engine that is designed for scalability and redundancy. It is fast, and it is suited for storing and handling large volumes of data for analytics, machine learning, and other applications. ES has gained traction in recent years, and is an important technology for any data scientist’s (and data engineer’s) toolbox. In this blog post, we’ll provision our own (free) ES cluster with Bonsai, and use the Python Elasticsearch Client to set up an index, and write and query data. All code for this blog post is available on GitHub.
So you’ve written a piece of Python code and it does the job. Great, but is your code sufficiently simple? Complex code is difficult to read and makes code maintenance more costly. Catching complexity early can save time, money, and a lot of frustration. In this post, I’ll show you how to use the wily command-line tool to trace the complexity of your code over time.
Code complexity matters. Unnecessarily complex code is harder to read, and more difficult to maintain. If your code is hard to understand, it’s harder to spot existing bugs and easier to introduce new ones…
For smaller projects, CSV is a great format for storing data. But what if you want to up your data management game? A relational database offers a more robust way to organise and manage your data. In this post, I show how you can transform your CSV files into a PostgreSQL database in three simple steps. I’ll also discuss some of the advantages of using a relational database.
Unit testing is key to developing quality code. There’s a host of libraries and services available that you can use to perfect testing of your Python code. However, “traditional” unit testing is time intensive and is unlikely to cover the full spectrum of cases that your code is supposed to be able to handle. In this post, I’ll show you how to use property-based testing with Hypothesis to automate testing of your Python code. I also discuss some of the advantages of using a property-based testing framework.
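The idea behind property-based testing can be sketched with the standard library alone. The example below is a hand-rolled stand-in, not Hypothesis and not its API: it generates random inputs and checks that a property (here, an encode/decode round trip) holds for all of them. The functions `encode`, `decode`, and `check_round_trip` are hypothetical names. Hypothesis goes further by generating inputs from declared strategies and shrinking failing cases to minimal examples.

```python
import random
import string


def encode(text):
    """Run-length encode a string into (character, count) pairs."""
    pairs = []
    for char in text:
        if pairs and pairs[-1][0] == char:
            pairs[-1] = (char, pairs[-1][1] + 1)
        else:
            pairs.append((char, 1))
    return pairs


def decode(pairs):
    """Invert encode(): expand (character, count) pairs back into a string."""
    return "".join(char * count for char, count in pairs)


def check_round_trip(trials=200, seed=0):
    """Hand-rolled stand-in for a property-based test (not Hypothesis itself)."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 20)
        # A small alphabet makes repeated characters (the interesting case) likely.
        text = "".join(rng.choice("ab" + string.digits) for _ in range(length))
        # The property: decoding an encoded string returns the original.
        assert decode(encode(text)) == text, f"round trip failed for {text!r}"
    return trials


check_round_trip()
```

Instead of listing a handful of inputs and expected outputs by hand, the test states one rule and lets the generated inputs probe it, including edge cases such as the empty string.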
Unit testing involves testing individual components of your code. A typical unit test…