How I Learned to Stop Encoding and Love the QLattice

Curing your data preprocessing blues

Kevin Broløs
Jun 25, 2020

Data preprocessing is a necessary, but time-consuming, part of any data scientist’s workflow. In this article, I’ll talk about our product at Abzu, the QLattice, and its Python library, Feyn, and how it can help you spend less time on data preprocessing for machine learning tasks, with a few bonuses along the way.

We’re going to go through a data example using a pizza menu I’ve spent all my spare time digitizing for exactly this purpose. You can find the dataset, as well as this notebook, here on GitHub.

I’m gonna do a short recap on the QLattice, but otherwise assume that you've read or will read this for more information.

We also have a fun, lighter read: introducing the QLattice.

TL;DR on the QLattice


The QLattice (short for Quantum Lattice) is an evolutionary machine learning approach developed by Abzu that finds the best models for your data problem. It works by continually updating the QLattice with your best model structures, narrowing the search field so that increasingly better models emerge as they compete with each other.

Training works by asking your QLattice to generate thousands of graphs (the QGraph) containing mappings between your inputs and output, training them, and updating your QLattice to get even more, and better, graphs.

This is done using a Python library called Feyn [/ˈfaɪn/], which you can install with pip (pip install feyn) on Windows, macOS, or Linux.

If you want to follow along, you can sign up for free here.

Let’s load in the data real quick

We’re using pandas to load the data from a CSV file.

We’ve dropped some unique identifiers to avoid encouraging memorization, and the same goes for the comma-separated list of ingredients (since it’s mostly unique per pizza). We’ve kept tangential information, such as the ingredient count, instead. We could do further work to use more of the ingredients feature, but we’ll leave that as an exercise, since our objective isn’t to train a perfect model.
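
The loading itself is a couple of lines. Here’s a minimal sketch; the file name and the dropped column names are assumptions based on the description above:

import pandas as pd

# Load the pizza menu (adjust the file name to your copy of the dataset)
data = pd.read_csv("pizzas.csv")

# Drop the unique identifiers and the near-unique ingredients list
# (column names assumed) to avoid encouraging memorization
data = data.drop(columns=["name", "ingredients"])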

Admittedly, this dataset is already a lot nicer than what you’d find in the wild: it contains no NaN or otherwise missing values.

So with that out of the way, let’s take a closer look!

The bare necessities

This next part, depending on the algorithm, would normally be about data preparation before training, and that’s where Feyn comes to your aid.

With Feyn, we've taken a page out of the pythonic stylebook, and one of the core features is the batteries-included approach of just being able to drop in a dataset, and start extracting QGraphs. All the rest happens under the hood.

The only thing you need to consider is whether or not your column is:

  • Numerical
  • Categorical

And that’s it. Let’s take it for a spin!

Connection link established


Let’s first connect to our QLattice. I’m running this post in our interactive JupyterLab environment, so I don’t need to specify a token. If you were to connect to your own, you’d need both a URL and a token to authenticate.
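
In code, the connection looks something like this (a minimal sketch; the exact constructor arguments may differ between Feyn versions):

import feyn

# Inside Abzu's hosted JupyterLab, the configuration is picked up automatically
ql = feyn.QLattice()

# Connecting to your own QLattice instead (URL and token are placeholders):
# ql = feyn.QLattice(url="<your-qlattice-url>", api_token="<your-token>")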

Let’s have a look at our data types, and also present a helper function that creates a dict mapping each categorical input to what we call the ‘categorical’ semantic type. The function just guesses based on the pandas dtypes, but you could create the dict manually as well.
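
A minimal sketch of such a helper (the function name is my own; it treats every object-typed column as categorical, while numerical and boolean columns are left to Feyn’s default numerical handling):

def get_stypes(df):
    # Guess semantic types from the pandas dtypes: object columns
    # become 'categorical'; everything else stays numerical
    return {col: "categorical" for col, dtype in df.dtypes.items() if dtype == object}

stypes = get_stypes(data)
# {'type': 'categorical', 'size': 'categorical'}
print(data.dtypes)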

type                object
price                int64
size                object
vegetarian            bool
dairy                 bool
fish                  bool
ingredient_count     int64
dtype: object

And Bob’s your uncle!

Wait, was that it?

Yeah, that was really it. Now you just need to provide this dict in the stypes parameter when fitting your QGraph.

Notice how we haven't even fed any data into it yet. So far, we've only been working on the conceptual level of the problem domain.

That’s pretty neat!

So let’s train a model using the QLattice

Let’s just split the dataset into train and test for evaluation.
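
Here’s one way to do it (the split ratio and seed are my choices, not from the post):

from sklearn.model_selection import train_test_split

# Hold out a test set for evaluation
train, test = train_test_split(data, test_size=0.33, random_state=42)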

Next, we’ll run an update loop for our QLattice, training it on our pizza dataset to predict the price. We’re gonna gloss over the details here, but read the docs if you’re curious.
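
The loop looks roughly like this (a sketch; the method names follow early Feyn releases and may differ in your version):

# Extract a QGraph mapping our inputs to the price, passing the stypes dict from before
qgraph = ql.get_regressor(train.columns, output="price", stypes=stypes)

# Alternate between fitting the graphs and feeding the best ones back to the QLattice
for _ in range(20):
    qgraph.fit(train)
    ql.update(qgraph.best())

best = qgraph[0]  # the top graph after the final fit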

Let’s quickly evaluate the solution

It’s important to stop and check that what we have makes sense, so let’s check the RMSE on the train and test sets. The prices are in DKK.
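
A quick way to compute it (assuming best is the top graph from the loop above, and that graphs expose a predict method as in Feyn 1.x):

import numpy as np

def rmse(actual, predicted):
    # Root mean squared error, in the same units as the target (DKK)
    return np.sqrt(np.mean((actual - predicted) ** 2))

print("train", rmse(train["price"], best.predict(train)))
print("test", rmse(test["price"], best.predict(test)))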

train 8.637344036951065
test 13.75645201020599

We see a slight bit of overfitting, but it’s good enough for our purposes. Let’s plot it real quick.
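
A simple actual-versus-predicted scatter does the job (a sketch using matplotlib):

import matplotlib.pyplot as plt

# Predicted vs. actual prices on the test set; the dashed line marks a perfect fit
plt.scatter(test["price"], best.predict(test), alpha=0.5)
lims = [test["price"].min(), test["price"].max()]
plt.plot(lims, lims, "k--")
plt.xlabel("Actual price (DKK)")
plt.ylabel("Predicted price (DKK)")
plt.show()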

Looking into the semantic types


I’ve declared a simple function to help us inspect the graphs, but let’s back up first with some theory.

Feyn has two kinds of stypes: the numerical and the categorical type.

The numerical semantic type

The default behaviour in Feyn is min-max scaling, much like the MinMaxScaler you might already know and love from sklearn. It works at the input node level, auto-scaling your inputs to lie between -1 and 1 based on the minimum and maximum values it sees. The output node has the exact same behaviour, but also ensures that your output is automatically scaled back to the range of your expected values.
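
In other words, for a value x with an observed minimum and maximum, each numerical node computes something equivalent to this (a back-of-the-envelope illustration, not Feyn’s actual implementation):

def scale_to_unit_range(x, xmin, xmax):
    # Map [xmin, xmax] linearly onto [-1, 1]
    return 2.0 * (x - xmin) / (xmax - xmin) - 1.0

scale_to_unit_range(1.0, 1.0, 9.0)  # -1.0
scale_to_unit_range(9.0, 1.0, 9.0)  #  1.0
scale_to_unit_range(5.0, 1.0, 9.0)  #  0.0, the midpoint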

Let’s look at the ones from our trained graph:
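
The simple helper from before looks something like this (a sketch: indexing a graph yields its interactions in Feyn 1.x, but the name and state attributes, and especially _to_dict(), are assumptions about the API that may differ in your version):

from IPython.display import display

def show_state(graph, i):
    # Display the name and learned state of the i'th interaction
    interaction = graph[i]
    display(f"Name: {interaction.name}", interaction.state._to_dict())

show_state(best, 0)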

'Name: ingredient_count' 
{'feature_min': 1.0, 'feature_max': 9.0, 'auto_adapt': True}

The first interaction in the graph is ‘ingredient_count’, and as you can read from the state, it has detected a minimum value of 1.0 and a maximum value of 9.0.

'Name: vegetarian' 
{'feature_min': 0.0, 'feature_max': 1.0, 'auto_adapt': True}

The second interaction is ‘vegetarian’, which is a boolean feature, and Feyn has detected the min value of 0 and max value of 1. So this has just resulted in a recentering around 0.

The categorical semantic type


So what’s the magic behind the categorical variables? The categorical semantic type is essentially a kind of auto-adapting one-hot encoding. Let’s take an example:

Suppose you have the pizza menu, and you have three types of pizza offers: regular, family, and lunch. A traditional one-hot encoding approach would convert these into three mutually exclusive features: is_regular, is_family, and is_lunch.

What Feyn does is similar, but instead assigns a weight to each category that will be adjusted during training. So when you pass in a pizza that is family-sized, it'll use the weights for family-sized pizzas, and the same for regular and lunch offers. Unlike one-hot encoding, all of this happens within the same feature node.
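
Conceptually, the node behaves like a learned lookup table (the weights below are illustrative; the real learned values follow in the next snippet):

# One learned weight per category, adjusted during training
weights = {"regular": 0.28, "family": 0.78, "lunch": -0.02}

def categorical_node(value):
    # At prediction time, the node simply looks up the weight for its category
    return weights[value]

categorical_node("family")  # 0.78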

Let’s get real:

'Name: size'
{'categories': [
    ('regular', 0.2848099238254899),
    ('family', 0.7835609434141901),
    ('lunch', -0.021162856624498433)]}

We can see that this interaction has learned a weight for each feature value. This also shows us which values drive higher separation. For instance, ‘family’ has learned a high weight (close to 1), ‘regular’ sits a bit to the positive side of 0 (the center), and ‘lunch’ has a slight negative weight.

Interpreting this requires you to look at the full graph, but we see clear separation already where we can guess that family and regular drive higher prices than lunch.

'Name: type'
{'categories': [
    ('Panini', 0.10376852511743478),
    ('Salad', 0.14931395908685818),
    ('Starter', -0.1884395583120563),
    ('Pasta', 0.3851385143572459),
    ('Pizza', 0.38809084947386874)]}

The same pattern shows up for the type feature, where we can see pizzas and pasta commanding higher separation than salads and paninis, while the starters pull toward the negative, hinting at a lower predicted price.

So what does this mean?

In this post, we dive even deeper into the graphs, and we’ll further explore how to get insights out of your graphs in a future series. Hopefully, you’ve learned something new about how the semantic types save you time on data preprocessing, and how the categorical semantic type can even surface insights from your data and models that you wouldn’t have seen otherwise.

You’ll also have more insights the next time you order a pizza with three extra servings of cheese and you wonder about the price spike.

If you’re tempted to take a look yourself, head on over here and sign up for a QLattice!
