Redis streaming playground: A free ticket

Antonio Di Turi · Published in Data Reply · Oct 21, 2022 · 11 min read

In this article, we present an AWS-based, quasi-serverless Redis playground. It started as an internal project, and together with Redis we decided it was worth sharing. At Data Reply, innovation and sharing are part of our core values, and we truly hope you will find our work interesting and useful. Feel free to contact us with any questions or requests to collaborate.

What we do at Data Reply DE

At Data Reply DE, our job is to find innovative solutions to the technologically demanding requirements of some of the biggest industries in Germany. The major players in automotive, manufacturing & retail, and telecommunications run huge tech stacks, and it is our pleasure to bring them the best practices of the big data world.

Best practices, however, are not static.

New technologies are developed every day, and staying informed about every paradigm or framework is a challenge that nobody can face alone.

That’s why here at Data Reply, each employee can take 10 working days per year to learn about new technologies. During my first year, I had the opportunity to study for and earn several tech certifications: Kubernetes (CKA and CKAD), Confluent for Apache Kafka (CCDAK and CCAAK), AWS (Solutions Architect Associate), and Redis Developer.

Carving out working time to stay up to date is not always easy, and I truly hope these few pages help you get started in the world of Redis streaming.

Discover Redis

Nowadays every big data engineer has heard of Redis. Its efficiency and availability are two of the reasons this name shows up in so many large companies’ tech stacks, but there are many more, and we were interested in discovering them.

But what is Redis in the first place? What does this name even mean?

Redis = REmote DIctionary Server

So, we have a dictionary.

This dictionary is an abstraction for a key-value database.

The dictionary is also remote, which means it may run on a different server or instance. As you may already know, Redis runs in memory to guarantee fast lookups and updates. But Redis is not just that: the in-memory database company offers a wide variety of services, including Redis Cloud, an enterprise offering that helps you build Redis-based apps faster across different cloud providers.

Among other features, it offers intuitive data structures that do not require you to invest hours in the documentation: easy to install, easy to use, and lightweight.

The list of pros is already long and that’s why we were so curious about it.

None of my words can beat your own experience, so let’s try something together. Visit one of the pages of the command documentation: say we want to know how the APPEND command works.

At the bottom of the page, you will find an embedded Redis server ready for you to experiment with the command: if a corner case pops into your head, just try it out right there on the doc page.
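
For example, here is a minimal sketch of the same experiment using the redis-py client (host and port are assumptions for a local test server):

# Minimal sketch of the APPEND command via redis-py (pip install redis);
# host/port are assumptions for a local test server.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("greeting", "Hello")          # SET greeting "Hello"
r.append("greeting", " World")      # APPEND returns the new length: 11
print(r.get("greeting"))            # -> "Hello World"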

Redis Streaming

Data Reply is subdivided into business units (BUs): you can think of a business unit as a team of about 20 people with a layer of managers. Our BU focuses on streaming: 90% of our projects use Apache Kafka, and we had wanted to try out some new frameworks for a while.

The in-memory database’s promising streaming features, described in the RU202 course at Redis University, drew us in, and we decided to put the newly acquired knowledge into practice in an internal project.

The first idea was to build a playground ready to be used by our colleagues.

Our wish was that whoever wanted to experiment with Redis in a real scenario could do so without wasting time on setup.

In a tech world increasingly oriented toward platform engineering, it was clear that this could be the first milestone in our discovery journey.

With infrastructure as code, our goals could be achieved easily, and I was excited to build something that could save my colleagues a lot of time.

Even if you are very interested in the topic, when a client deadline approaches and the project must be delivered, it is hard to find the time to wrap your head around setting up the right AWS policies to access the S3 bucket or the Redis instance from a Lambda.

In the following paragraphs, we describe how we automated all of that in a ready-to-use Redis streaming playground. In our projects we usually work with Terraform, but since this was an exploratory journey, we decided to try out CloudFormation from AWS.

I know it is not the most popular tool for infrastructure as code, but we wanted to get first-hand insight. It is tough to judge what you don’t know, and you cannot base your tech decisions on reviews alone.

Maybe in the future we will publish a detailed comparison of the different infrastructure-as-code tools, but that is out of the scope of today’s article.

Infrastructure

In Germany there is a saying, “Lange Rede, kurzer Sinn”: roughly, “long speech, short meaning” — in other words, long story short. So enough blah-blah, let’s see the details of the project.

To download the playground, just execute the following command:

git clone https://github.com/DataReply/redis-streaming-playground.git

The general idea is to build a consumer-producer scenario: something simple that could be used as a building block for further development.

On top of that, we provided a couple of use cases just to give an overview of Redis's capabilities.

Below you can find a picture of the architecture.

High-level architecture

As a source, we loaded the datasets into an S3 bucket.

The producer and consumer code live in two separate Lambdas that use a Python client.

Python clients need their libraries, and we decided to zip those as described in the AWS documentation. A future version will replace the zipped files with a Docker image.

The objective here is not to reproduce a production-like scenario but to give our consultants the chance to play with the tech and see how it works: a Lambda seemed to be an interesting solution.

Between the two we have our streaming engine: a Redis instance.

As a final step, we wanted a visualization tool that could be used in a flexible and extensible way, so we decided on OpenSearch.

Since the infrastructure and the application are meant to be decoupled, so that different use cases (different consumers and producers) can be built on top, we decided to split the CloudFormation definition into two files.

The first file I am going to show builds the infrastructure side; no surprises, it is called infrastructure.yaml.

In this file we are going to tell the AWS CloudFormation tool to build the main components:

  • Network: the VPC, public and private subnets, public and private route tables, a NAT gateway, route table associations, and everything else needed for connectivity. Only small modifications were made to the template kindly published by AWS; no need to reinvent the wheel, let’s stand on the shoulders of giants
  • ElastiCache Redis: security group, subnet group, and cluster
  • OpenSearch: access policies and cluster
  • Lambda role: access to S3, the VPC, Redis, and OpenSearch

One of the main pros of CloudFormation is that the documentation and the naming conventions used to define properties are intuitive and capture well what the infrastructure-as-code tool is going to create.

Once again, if doubts or suggestions come up while reading the code, or if something is not clear, don’t hesitate to reach out: we will be happy to answer your questions or modify the template according to your proposals.

The code gives all the details needed to instantiate the components and allow communication between the involved parts. In the picture below you can find a more detailed overview of the architecture used for this project, including the VPC and subnets in AWS.

Architecture and network design

We did not follow all security best practices in this use case, since the objective is not to create a production-like environment but to ease the life of developers who want to try out the streaming features. However, we still placed the Lambdas in private subnets, reaching the internet through the public subnets.

For the same reason, we opted for the smallest instance available in the AWS catalog, which lets everybody run their trial at the lowest possible price. To be precise, the selected instance is cache.t2.micro.

If you want to save even more money, we advise deleting the OpenSearch section from the YAML file. You will not have a fancy UI, but sometimes plain logs are more than enough.

In the output section of infrastructure.yaml we declared:

  • the OpenSearch domain
  • the Redis endpoints
  • the VPC ID and the subnet IDs

Most of these outputs are needed for the Lambda setup.
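
If you prefer to grab those outputs programmatically rather than from the console, here is a hedged sketch with boto3 (the stack name is a placeholder):

# Hedged sketch: fetch the stack outputs with boto3; the stack name is a placeholder.
import boto3

cfn = boto3.client("cloudformation")
stack = cfn.describe_stacks(StackName="redis-streaming-playground")["Stacks"][0]
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack["Outputs"]}
print(outputs)  # e.g. Redis endpoints, OpenSearch domain, VPC and subnet ids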

Use case — Prerequisites

We will assume that you have an AWS account set up.

  1. Download the dataset from the GitHub repo and load it into S3
  2. Download the producer/consumer zips from the repo and put them into the same bucket
  3. Follow the README.md in the GitHub repo to set up the stack

The reason we used zips and not a Dockerfile was to save you some time: you only need to download the zip with the Python libraries and upload it to the S3 bucket, whereas with a Dockerfile you would have had to build, tag, and push the image to an ECR registry. This spares you the trouble of creating the ECR repo and pushing the image there.

Our advice is to pull the repo and upload the YAML defining the infrastructure (infrastructure.yaml) into CloudFormation (detailed instructions are in the repo’s README), so that it’s easier to apply any modification (like deleting the OpenSearch component).

The other option would be to load the definition file from an S3 bucket, but that is better suited to more static scenarios than to dynamic demos.
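
For completeness, here is a hedged boto3 sketch of creating the stack from the local template file (the stack name is a placeholder; the README remains the authoritative guide):

# Hedged sketch: create the stack from the local template with boto3.
import boto3

cfn = boto3.client("cloudformation")
with open("infrastructure.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="redis-streaming-playground",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM"],  # the template creates a role for the Lambdas
)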

Use case

In our first use case, we used a Smart Home dataset that we found for an online challenge (you can find a detailed description and schema here).

The dataset contains measurements for different plugs installed in 40 houses.

Events with the same timestamp are ordered randomly with respect to each other.

At this point it should be clear that working with a huge dataset is not our priority; we just want some data to play with.

In the first use case, the objective is to visualize the houses’ energy consumption at different hours of the day. The schema has one attribute called property: it is 0 for the “work” data (the cumulative measurement in kWh) and 1 for the “load” data (better suited to capturing smaller amounts of used energy).

Below you can find the logic of the Python “producer” script, which does the following:

  • Reads the data from the S3 bucket
  • Produces all the data into Redis
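
The full script ships with the repo; here is a hedged sketch of the same logic with boto3 and redis-py (bucket, key, stream, and endpoint names are placeholders):

# Hedged sketch of the producer Lambda: read the dataset from S3 and add
# every row to a Redis stream. Names in angle brackets are placeholders.
import csv
import io

import boto3
import redis

s3 = boto3.client("s3")
r = redis.Redis(host="<elasticache-endpoint>", port=6379, decode_responses=True)

def handler(event, context):
    # Download the dataset from S3 and parse it as CSV.
    obj = s3.get_object(Bucket="<playground-bucket>", Key="smart_home_dataset.csv")
    rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))
    for row in rows:
        # Each measurement becomes one stream entry.
        r.xadd("smart-home-stream", row)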

The command used to add the data into Redis is called XADD.

All the commands with an “X” prefix are related to streaming.

The command takes as inputs a key and a message.

As we stated in the introduction, Redis is a key-value datastore; that’s why every command that produces or consumes a data structure uses a key.

Consider the key as the “address” of the data we want to find.

The XADD command appends the message as a new entry to the stream stored at the specified key. When a new key is provided, a new stream is created; stream auto-creation can also be disabled using the NOMKSTREAM option.
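
In redis-py, that behavior looks roughly like this (stream and field names are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# XADD auto-creates the stream at the given key...
entry_id = r.xadd("sensor:house42", {"load": "12.5"})

# ...unless NOMKSTREAM is set: redis-py then returns None for a missing stream.
result = r.xadd("sensor:unknown", {"load": "0.0"}, nomkstream=True)
print(result)  # None if "sensor:unknown" does not exist yet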

For deeper details about how the command works, check the docs.

In the same YAML template, you can find the consumer code.

The consumer does the following:

  • Reads the data from Redis one message at a time
  • Filters for the load data
  • Produces the data into OpenSearch
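
Here is a hedged sketch of that loop with redis-py and the opensearch-py client (endpoints, index name, and authentication are placeholders; the real Lambda in the repo may differ):

# Hedged sketch of the consumer: XREAD from the stream, keep only the
# "load" records (property == "1"), and index them into OpenSearch.
import redis
from opensearchpy import OpenSearch

r = redis.Redis(host="<elasticache-endpoint>", port=6379, decode_responses=True)
os_client = OpenSearch(hosts=[{"host": "<opensearch-domain>", "port": 443}], use_ssl=True)

last_id = "0-0"
while True:  # a real Lambda would bound this loop to its time limit
    # Block for up to 5 seconds waiting for entries newer than last_id.
    batch = r.xread({"smart-home-stream": last_id}, count=100, block=5000)
    for _, entries in batch:
        for entry_id, fields in entries:
            last_id = entry_id
            if fields.get("property") == "1":  # keep the "load" measurements
                os_client.index(index="energy", body=fields)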

Please note that neither the producer nor the consumer is meant to be production code; they are just ready-to-go snippets meant as a starting point. Feel free to modify them.

In OpenSearch, we aggregate the data by hour and display the average watt usage in a bar graph.

Average Watt usage per hour
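
Under the hood, such a chart corresponds to a date-histogram aggregation with an average sub-aggregation; here is a sketch of the equivalent query, reusing the os_client from the consumer sketch (index and field names are assumptions):

# Sketch of the aggregation behind the chart: hourly buckets with an average.
query = {
    "size": 0,
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "hour"},
            "aggs": {"avg_watt": {"avg": {"field": "value"}}},
        }
    },
}
response = os_client.search(index="energy", body=query)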

Next Steps

With the example above we just wanted to show what is possible with this architecture.

In the GitHub repo, you will find instructions on how to reproduce the same use case step by step. You will also find code related to different use cases. Feel free to try them out.

In our project, we used building blocks from different sources and bound them together: we strongly hope this project can be another of those blocks for some of you.

Our final take on our Redis journey is overall very positive.

The architecture can be a first step toward designing solutions for different use cases we have already implemented, like event handling, data augmentation, real-time processing, and so on.

As an example, by substituting the OpenSearch block with another S3 bucket, some real-time processing or data augmentation can be done with Lambdas. If more hardware and computing time are needed, the Lambdas can be replaced with EC2 instances.

It’s clear that Redis is trying to play the “Kafka role” in this setup.

Redis offers streaming features like consumer groups in an intuitive way, and it may already be present in a company’s tech stack, making it easier for a developer to use it as a streaming hub.
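
As a taste, here is a hedged redis-py sketch of consumer groups (stream, group, and consumer names are illustrative):

# Sketch of Redis consumer groups via redis-py: XGROUP CREATE, XREADGROUP, XACK.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

try:
    r.xgroup_create("smart-home-stream", "analytics", id="0", mkstream=True)
except redis.exceptions.ResponseError:
    pass  # the group already exists

# ">" asks for entries never delivered to this group before.
batch = r.xreadgroup("analytics", "worker-1", {"smart-home-stream": ">"}, count=10)
for _, entries in batch:
    for entry_id, fields in entries:
        print(entry_id, fields)                             # process the entry...
        r.xack("smart-home-stream", "analytics", entry_id)  # ...then acknowledge it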

It will be our pleasure to deepen the analysis and understand better which tool is best suited for which use case.

Some of the next steps could be the following:

  • Handling data with an external schema using RedisJSON
  • A new production-like infrastructure with a k8s configuration
  • A new infrastructure aligned with security best practices
  • Stress-testing Redis read/write performance with heavier use cases
  • Adding other use cases

Let us know what may interest you most.

This post is in collaboration with Redis.

Learn more:

· Try Redis Cloud for free

· Watch this video on the benefits of Redis Cloud over other Redis providers

· Redis Developer Hub — tools, guides, and tutorials about Redis

· RedisInsight Desktop GUI


Antonio Di Turi, Data Reply

I am a Big Data Engineer consultant 👨🏻‍💻 with a passion for teaching 👨🏻‍🏫. I ❤️ writing because it's the best way to really learn anything! 🦾