Validating, scaling and evolving an API-backed dashboard application

Christian Kolbe · comsystoreply · 6 min read · Dec 2, 2021

Starting and validating a new idea is hard; scaling and evolving it can be even harder. Every day, we work on such challenges for and with our clients as passionate engineers, critical thinkers, and pragmatic consultants. The interesting part is typically not a static view of our product at any specific point in time, but how everything changes over time. Decisions are made; some turn out to be right, others turn out to be wrong once we revisit them. Especially in the initial, more exploratory phase, it is crucial to avoid too much overhead and complexity for often short-lived work, but at the same time to invest profound thought where it really matters. Since it is important to go through this process and its trade-offs over and over again, we thought it might be helpful to others to share our way of working as well as some insights we have gained along the way.

Therefore, we were looking for a project we could pursue for a couple of months alongside our daily work with customers, to showcase typical challenges and share some good practices for coping with them. Our use cases will deliberately not be anything special, so that what we describe stays relevant to a broad audience. So let us begin with this: often we want to integrate data from different sources and do something useful with it. Twitter is one well-known example with a properly documented API, and it will serve as our initial source of data to ingest. From here there are many paths we could take. We might turn the data into useful visualizations. We might build integrated applications on top of it. We could try to derive insights from the raw data. There are countless opportunities.

As a starting point, we will monitor trending hashtags or topics over time, for example to analyze the impact of marketing campaigns we are running on certain topics. But before we get there, here are a few more general thoughts to start with.

Table of Contents

(as we publish more posts, this will be updated)

How we will tackle this endeavor

We started as we would with a real product — build a small MVP as fast as possible and iterate from there. The product will evolve over time, and we don’t know yet where it will end up, just like in the real world out there. It might even be interesting if you, our readers, point us from time to time towards the next step that seems logical or at least worth a thought. In the upcoming blog posts, we will scale and evolve different parts of the showcase. We will always explain why we make certain changes and sometimes deep-dive into the corresponding technology trade-offs and decisions.

In the remainder of this blog post, we will outline our initial lightweight setup to store data in our own initial architecture, including the underlying cloud infrastructure. The visualization itself will be part of the next more frontend-focused article.

Why build it in the cloud?

For us, leveraging the cloud here is a natural choice. It gives us full flexibility for the evolution described above, we can use managed services to speed up the process, and we can easily scale up and down as desired. All the basic infrastructure for networking, databases, compute, etc. is already in place, so we can get going right away with the minimal upfront investment of registering an account (in our case, we even had that in place already) — that’s it! If we want to move fast, it is a good idea to avoid the overhead of complex multi-cloud setups and build natively on a single cloud provider. For us, simply due to our prior knowledge and experience, the first choice is AWS, but that might be different in your case.

Our initial building blocks

The fundamental choices we always have to make when setting up a new software product in the cloud are typically: “Which kind of compute shall we use?” and “Where do we store our data?”. There are countless options for both, but you can start simple if you have no clue about traffic patterns, long-term architecture, etc. yet. One way of doing so is to go down the serverless road and use Lambda and DynamoDB. Both are strictly pay-per-use, i.e. they come for free when not being used and scale without any big changes. Furthermore, both integrate very smoothly with many parts of the AWS ecosystem and are good building blocks for a modular architecture. Generally speaking, many of the advantages they offer for small pet projects apply here as well.

One core principle we strictly apply in every phase of our products is infrastructure-as-code. Instead of manually configuring the virtual resources and documenting how to set them up and deploy the software, all of this should be automated, fully reproducible, and under version control. Conceptually, it does not make much of a difference whether we use the classic AWS choice CloudFormation, the more recent CDK, or third-party tools such as Terraform — what matters is the way of working. In our case, we will use the CDK, which makes it easy to adopt good coding practices in our infrastructure. Proper modularity and reuse are simply easier to achieve if we define our infrastructure not just in some well-defined syntax but in real TypeScript code.
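To give a flavor of what that looks like, here is a minimal sketch using the CDK v2 API (not our actual stack code): a pay-per-request DynamoDB table and a TypeScript Lambda that is allowed to access it. The construct IDs, the generic pk/sk key schema, and the entry path are placeholders for illustration.

```typescript
import { Stack, StackProps, RemovalPolicy } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import { NodejsFunction } from 'aws-cdk-lib/aws-lambda-nodejs';
import { Runtime } from 'aws-cdk-lib/aws-lambda';

export class ShowcaseStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // One pay-per-request DynamoDB table with a generic pk/sk key schema
    const table = new dynamodb.Table(this, 'ShowcaseTable', {
      partitionKey: { name: 'pk', type: dynamodb.AttributeType.STRING },
      sortKey: { name: 'sk', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
      removalPolicy: RemovalPolicy.DESTROY,
    });

    // A TypeScript Lambda, bundled directly from its source file (placeholder path)
    const pollingFn = new NodejsFunction(this, 'PollingFn', {
      entry: 'src/polling/handler.ts',
      runtime: Runtime.NODEJS_14_X,
      environment: { TABLE_NAME: table.tableName },
    });

    // Permissions are plain method calls instead of hand-written IAM policies
    table.grantReadWriteData(pollingFn);
  }
}
```

Because constructs are ordinary TypeScript classes, they can be extracted, parameterized, and reused like any other code, which is exactly the kind of modularity we are after.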

Our architecture for the first step

As outlined above, our initial goal is to retrieve data from Twitter for pre-defined search terms and store it in our own database. As a starting point, we went with the search terms ‘aws’, ‘serverless’, and ‘kubernetes’. The corresponding data flow is visualized in this diagram:

Initial architecture of the showcase in AWS

Once every 30 minutes, a TypeScript Lambda fetches new data from the Twitter API, and the results are stored in our database. We have decoupled the polling and the storage stage of the data flow and integrated them via SQS for better modularity. The non-obvious connections in the diagram are probably the ones between our polling Lambda and the database: we store and retrieve the state for each search term to make sure we neither fetch duplicates nor miss any data. Note that we use only one DynamoDB table for all of this, in line with the single table design approach. In fact, this is one of the aspects worth putting notable effort into right from the start, since changing the data schema afterwards is often hard and rarely comes without a migration. In a subsequent article, we will explore the challenges and trade-offs of the single table design approach and see how the schema-less nature of DynamoDB affects the evolution of our project.
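As an illustration, a stripped-down polling handler could look roughly like the following. The key layout (`SEARCH#<term>` / `STATE`), the `axios` dependency, and the environment variable names are assumptions made for this sketch, not the showcase’s actual code.

```typescript
import { DynamoDB, SQS } from 'aws-sdk';
import axios from 'axios';

const ddb = new DynamoDB.DocumentClient();
const sqs = new SQS();
const TABLE_NAME = process.env.TABLE_NAME!;
const QUEUE_URL = process.env.QUEUE_URL!;

export const handler = async (): Promise<void> => {
  for (const term of ['aws', 'serverless', 'kubernetes']) {
    // Load the stored state for this search term so we neither miss nor duplicate tweets
    const state = await ddb.get({
      TableName: TABLE_NAME,
      Key: { pk: `SEARCH#${term}`, sk: 'STATE' },
    }).promise();
    const sinceId: string | undefined = state.Item?.lastTweetId;

    // Query the Twitter v2 recent search endpoint for new matches
    const response = await axios.get('https://api.twitter.com/2/tweets/search/recent', {
      headers: { Authorization: `Bearer ${process.env.TWITTER_BEARER_TOKEN}` }, // assumed env var
      params: { query: term, ...(sinceId ? { since_id: sinceId } : {}) },
    });
    const tweets: { id: string; text: string }[] = response.data.data ?? [];
    if (tweets.length === 0) continue;

    // Hand the raw tweets over to the storage stage via SQS
    await sqs.sendMessage({
      QueueUrl: QUEUE_URL,
      MessageBody: JSON.stringify({ term, tweets }),
    }).promise();

    // Persist the newest tweet id as the state for the next run
    await ddb.put({
      TableName: TABLE_NAME,
      Item: { pk: `SEARCH#${term}`, sk: 'STATE', lastTweetId: response.data.meta.newest_id },
    }).promise();
  }
};
```

The decoupling via SQS pays off here: the polling stage only needs to fetch and enqueue, while the storage stage can be changed, scaled, or re-driven independently.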

If you want to dig deeper, you can find all the code here.
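To round off the picture, the storage side of that flow might look like the sketch below, using the same hypothetical single-table key layout so that each tweet lives next to the state item of its search term. Again, attribute names and keys are illustrative assumptions.

```typescript
import { SQSEvent } from 'aws-lambda';
import { DynamoDB } from 'aws-sdk';

const ddb = new DynamoDB.DocumentClient();
const TABLE_NAME = process.env.TABLE_NAME!;

export const handler = async (event: SQSEvent): Promise<void> => {
  for (const record of event.Records) {
    const { term, tweets } = JSON.parse(record.body) as {
      term: string;
      tweets: { id: string; text: string; created_at?: string }[];
    };

    // One item per tweet; tweets and search-term state share the same table (single table design)
    for (const tweet of tweets) {
      await ddb.put({
        TableName: TABLE_NAME,
        Item: {
          pk: `SEARCH#${term}`,    // partition: all matches of one search term
          sk: `TWEET#${tweet.id}`, // sort key: the individual tweet
          text: tweet.text,
          createdAt: tweet.created_at,
        },
      }).promise();
    }
  }
};
```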

Where does that leave us?

With this initial setup in place, our application is tirelessly pumping tweets into our database. To give you a first impression: over the course of 2 weeks, we have collected 77,103 matches for ‘aws’, 75,406 for ‘serverless’, and 32,152 for ‘kubernetes’. However, without a proper visualization it is very hard to derive any more than the sheer numbers from that.

Distribution of search terms in our initial setup

This very early one-shot chart is a lot better already. However, it is not at all integrated, and we can probably do better by adding a proper frontend to our showcase. So this is exactly what we are going to work on now as a first validation for what we have built so far. Stay tuned for the next posts in this series!

This blog post is published by Comsysto Reply GmbH

Christian Kolbe

Developing Digital Products, Sustainable Solutions, Agile Processes, Organizations, and People as Partner at Comsysto Reply.