Building a data engineering project. Part 1 — from idea to system design.

Kirill Zaitsev
Published in CodeX
5 min read · Aug 31, 2021

Have you ever asked yourself, “What do I do to land a data engineering job?” After surfing the Internet for a while, I arrived at one top piece of advice: build projects. Still, several questions largely determine how much such a project improves your portfolio:

  • how to come up with a project idea
  • which technology could be used
  • what the design of the application could look like
  • which technical challenges are envisioned
  • how to make it portable and accessible to other developers
  • which production-grade touches could be added

I will try to answer these questions in a series of articles based on a project I built recently. It is hosted on GitHub and free for you to use and experiment with. Here is an outline:

I hope you will enjoy the series. Let’s begin.

Coming up with a project idea

To come up with an idea for your next data engineering project, there is often no need for prolonged brainstorming sessions, reading numerous articles, or studying books. While these are essential for executing your idea, there might be a more efficient way to develop a solid idea for the project in the first place.

My suggestion is as simple as observing your daily activities with a keen eye, asking yourself ‘how was it built?’ along the way, and viewing the answer from the perspective of your current area of interest. You do not have to replicate the system from scratch, as it took a lot of highly qualified people years to build it. Your goal is to come up with an idea that you can build from start to finish on your own, so simplify the problem, make assumptions, and cut some corners. Just make sure the idea still leads to an application that is usable in some way.

For example, while scrolling through an Instagram feed, you may stop for a moment and wonder how Instagram engineers created a system that shares heavy media content across millions of users with real-time updates. “What if I create a system that allows posting images that become visible in near real time to a user on another device?”

Another example is the way blogging works on the Medium platform. While reading a blog post, you notice that it has a reading time estimate, an author, a certain number of likes and comments, and so on. But how is the blog post stored? Is it in a SQL or a NoSQL database? What schema does it have? How do changes propagate to the database when I press the ‘like’ button? “Can I create a distributed database to store blog posts and author data?”

At this point, you will have an idea, and this is the first step.

What we will build

Have you ever wondered how search engines like Google or Bing work? They parse information from a web page, store it, and serve it. But how is it possible to store information about hundreds of millions of web pages and deliver it in a matter of milliseconds when a user searches for ‘hello world’? Think about it before proceeding. It is probably about mapping words (keys) to the web pages (values) that are considered the best fit for this input. Guessed a huge hash map? I refer you to this link to discover more.

An inverted index is a data structure that allows search engines to be blazingly fast in responding to our queries. How can we implement such a thing on our own, what should we use as a data source to build it, and how do we serve the results? I suggest pondering an answer for a little bit and then reading the design section to see my suggestions.
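
To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is not the implementation from the project: the names `build_inverted_index` and `search` and the sample pages are made up for illustration, and splitting on whitespace is a simplifying assumption.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the first word's documents, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data: page ID -> page text.
pages = {
    1: "hello world",
    2: "hello data engineering",
    3: "world of data",
}
index = build_inverted_index(pages)
print(search(index, "hello"))        # {1, 2}
print(search(index, "hello world"))  # {1}
```

The lookup cost depends on the number of query words and the size of their document sets, not on the total number of pages, which is what makes this structure attractive for search.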

By the way! Let’s add a bit more color to our emerging project idea. Imagine that we are curious about what people we know are saying on Twitter. A huge hash map again (our query -> people’s tweets)? Remember the search engine case? So it looks like we are going to build a service that monitors activity on Twitter in real time and lets its users find out what specific people are saying. Wow, this sounds more interesting than cooking up just an inverted index!

Designing the application

Data source: Our use case is a stream of real-time events, namely the tweets of the people we are interested in. The Twitter API is what we are targeting and, thankfully, it allows real-time monitoring of tweets with some straightforward Python.

Data ingestion: Sending data straight into a system is considered bad practice (I refer you to this article for some considerations on why), so let’s use a message queue that decouples our streaming data source from the application. In a real-world project, you would need to research the alternatives carefully before selecting a queue, such as RabbitMQ, Redis, Apache Kafka, and others. Since Apache Kafka is one of the most popular and well-documented tools at the time of writing, showing off your skills with it will be a plus in your portfolio.
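
The decoupling idea can be sketched without Kafka at all. The snippet below is only an in-process stand-in built on Python's standard `queue.Queue`: the producer plays the role of the tweet stream, the consumer plays the role of the application, and the names `produce`, `consume`, and `SENTINEL` are assumptions made for this illustration. A real Kafka setup adds persistence, partitions, and network transport, which this sketch does not model.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def produce(events, buffer):
    """Push events into the buffer; blocks while the buffer is full."""
    for event in events:
        buffer.put(event)
    buffer.put(SENTINEL)


def consume(buffer, processed):
    """Pull events at the consumer's own pace until the stream ends."""
    while True:
        event = buffer.get()
        if event is SENTINEL:
            break
        processed.append(event.upper())  # placeholder for real processing


# A small bounded buffer: the producer cannot overwhelm the consumer.
buffer = queue.Queue(maxsize=2)
processed = []
events = ["tweet one", "tweet two", "tweet three"]

producer = threading.Thread(target=produce, args=(events, buffer))
consumer = threading.Thread(target=consume, args=(buffer, processed))
producer.start()
consumer.start()
producer.join()
consumer.join()

print(processed)  # ['TWEET ONE', 'TWEET TWO', 'TWEET THREE']
```

Neither side calls the other directly. Each one only knows about the buffer, so the source and the application can be changed or restarted independently.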

How do you handle raw text data? Since Python has a huge ecosystem for text processing, we will use the NLTK library to clean up the text and split it into words, which become the keys of the inverted index hash map. By tokenization, I mean one step more than the usual splitting: converting the string representation of a word into an integer. This reduces potential hash map size issues (a string typically takes more memory than a single integer) and simplifies interaction with the map, as long as the mapping string -> token (an integer) stays the same.
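
Here is a rough sketch of the word -> integer mapping described above. To keep it dependency-free, a regular expression stands in for the NLTK cleanup step, which is an assumption of this example and much cruder than what NLTK offers. The `Vocabulary` class and the function names are hypothetical.

```python
import re


class Vocabulary:
    """Assigns each distinct word a stable integer ID, in order of first use."""

    def __init__(self):
        self._ids = {}

    def token_id(self, word):
        # A word seen before keeps its ID; a new word gets the next integer.
        if word not in self._ids:
            self._ids[word] = len(self._ids)
        return self._ids[word]


def clean_and_split(text):
    """Lowercase the text and keep only runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def encode(text, vocab):
    """Convert raw text into a list of integer tokens."""
    return [vocab.token_id(word) for word in clean_and_split(text)]


vocab = Vocabulary()
print(encode("Hello, world!", vocab))       # [0, 1]
print(encode("Hello again, World", vocab))  # [0, 2, 1]
```

The same `Vocabulary` instance has to be used for both indexing and querying. Otherwise the integers no longer refer to the same words.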

Building the index: I suggest Java as the language for creating the index. Its concurrent programming support gives us the means to build a hash map (the inverted index) that some workers modify in parallel while others serve it to the users. In addition, together with its derivative Scala, Java is among the most in-demand languages in data engineering, which makes experience with it a strong argument on your resume.
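
The writers-plus-readers idea can be illustrated in a few lines. The sketch below is in Python rather than Java, purely to stay consistent with the other examples in this article. In Java, a class such as `ConcurrentHashMap` would be a natural candidate, while here a single lock guards a plain dictionary. The class name and the sample (token, tweet ID) pairs are hypothetical.

```python
import threading
from collections import defaultdict


class ConcurrentInvertedIndex:
    """Inverted index (token ID -> tweet IDs) that is safe to share across threads."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._lock = threading.Lock()

    def add(self, token_id, tweet_id):
        # Writers take the lock so concurrent updates cannot corrupt a set.
        with self._lock:
            self._postings[token_id].add(tweet_id)

    def lookup(self, token_id):
        # Readers get a copy, so later writes do not change their result.
        with self._lock:
            return set(self._postings.get(token_id, ()))


def writer(index, batch):
    """Worker that indexes a batch of (token ID, tweet ID) pairs."""
    for token_id, tweet_id in batch:
        index.add(token_id, tweet_id)


index = ConcurrentInvertedIndex()
batches = [
    [(0, 100), (1, 100)],  # tweet 100 contains tokens 0 and 1
    [(0, 101), (2, 101)],  # tweet 101 contains tokens 0 and 2
]
threads = [threading.Thread(target=writer, args=(index, b)) for b in batches]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(index.lookup(0)))  # [100, 101]
```

One global lock is the simplest correct choice, but it serializes all access. Finer-grained locking is where Java's concurrency toolkit becomes useful.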

How do users interact with our app? Client-server architecture (check out the link to find out more) is a design pattern widely used for organizing communication between a service and its users over the network. But wait, that’s precisely our case! Our service already knows how to build the map (query -> tweets), so it takes just one additional entity, a server, that takes the built map and lets users query it by sending requests over the network.
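
As a rough illustration of that last piece, here is a sketch of a tiny HTTP server on top of a prebuilt map, using only Python's standard library. The sample `INDEX`, the `/search?q=...` URL shape, and the port are all assumptions made for this example, not the interface of the actual project.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical prebuilt map: query word -> tweets.
INDEX = {"hello": ["tweet-1", "tweet-2"], "world": ["tweet-1"]}


def answer_query(index, path):
    """Turn a request path such as /search?q=hello into a response dict."""
    params = parse_qs(urlparse(path).query)
    word = params.get("q", [""])[0].lower()
    return {"query": word, "tweets": index.get(word, [])}


class SearchHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the JSON result of the lookup."""

    def do_GET(self):
        body = json.dumps(answer_query(INDEX, self.path)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the server until interrupted (not called automatically here)."""
    HTTPServer(("localhost", port), SearchHandler).serve_forever()


# The lookup logic can be tried without starting the server:
print(answer_query(INDEX, "/search?q=Hello"))
# {'query': 'hello', 'tweets': ['tweet-1', 'tweet-2']}
```

Calling `serve()` and opening `http://localhost:8000/search?q=hello` in a browser would play the client role. Keeping `answer_query` separate from the handler makes the lookup logic easy to test on its own.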

UML diagram that summarizes the above points:

UML diagram of the application

Now we are good to go with implementing the architecture of our app. With this, I invite you to part 2 of the series.

Feel free to check out the final implementation of the app we will develop gradually using this link.
