Apache Kafka Guide #44 Social Media Application example

Paul Ravvich
Apache Kafka At the Gates of Mastery
4 min read · Apr 30, 2024

Hi, this is Paul, and welcome to part #44 of my Apache Kafka guide. Today we will walk through a social media application as a practical exercise in learning Apache Kafka.

Social Media Application Task

Imagine a company that operates a social media platform where users share images and engage with content through likes and comments. The platform must support the standard social media functionality: users can create posts, like them, and comment on them. Users also need to see the total number of likes and comments on each post in real time, reflecting the dynamic nature of user interaction. Because a high volume of data is expected from day one, the platform must sustain significant throughput to handle the influx of user activity. Finally, the company wants to feature trending posts, so the design has to accommodate a broad range of user requirements for an enhanced social media experience.

  • The ability for users to create posts, place ‘likes,’ and leave comments.
  • Real-time display of the total count of ‘likes’ and comments for each post.
  • Preparedness for a large influx of data on the launch day.
  • Capability for users to view popular or ‘trending’ posts.

Solution with Apache Kafka

In this architectural discussion, we’re addressing the integration of three key entities within Kafka: posts, likes, and comments. Each of these entities will be managed as a separate topic, a decision stemming from the architectural design. The post entity, for example, will be generated by a posting service that allows users to create content, incorporating text, links, hashtags, and more. This content, once validated, is directed to the dedicated post topic.
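As a minimal sketch of what the posting service might emit, the snippet below builds the (key, value) pair for the posts topic. The `build_post_event` helper and its field names are illustrative assumptions, not part of the article's design:

```python
import json
import time

def build_post_event(user_id: str, post_id: str, text: str,
                     hashtags: list[str]) -> tuple[str, bytes]:
    """Validate a new post and build the (key, value) pair for the 'posts' topic.

    The key is the user_id, so all of a user's posts land in the same partition.
    """
    if not text.strip():
        raise ValueError("post text must not be empty")
    event = {
        "event_type": "post_created",
        "user_id": user_id,
        "post_id": post_id,
        "text": text,
        "hashtags": hashtags,
        "created_at": int(time.time()),
    }
    # Serialized as JSON bytes, ready to hand to a producer,
    # e.g. producer.send("posts", key=key.encode(), value=value)
    return user_id, json.dumps(event).encode("utf-8")
```

The validation step here stands in for whatever content checks the posting service performs before the event reaches the topic.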

In parallel, user interactions such as likes and comments will be handled by a specialized service. This service acts as a producer, channeling user-generated likes and comments into their respective topics. Despite the possibility of merging these into a single topic, the distinct nature of the data justifies maintaining separate topics for likes and comments.

A significant challenge arises in aggregating this data to reflect the dynamic nature of user interactions — likes and comments accruing in real-time. A traditional database approach could potentially struggle with load and concurrency issues. This is where Kafka’s strength becomes apparent; it effectively separates data production from aggregation. Utilizing Kafka Streams, we can aggregate data from posts, likes, and comments topics to calculate metrics such as the total number of likes or comments a post has received.
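Kafka Streams is a Java library, but the aggregation it performs is easy to sketch in a language-neutral way. The function below mirrors what a `groupByKey().count()` over the likes topic computes, assuming each consumed event carries a `post_id` field:

```python
from collections import Counter

def aggregate_likes(like_events) -> Counter:
    """Sketch of a Kafka Streams groupByKey().count() over the 'likes' topic.

    Each event is keyed by post_id; the result maps post_id -> total likes.
    In a real Streams application this state lives in a changelog-backed
    KTable rather than an in-memory dict.
    """
    counts = Counter()
    for event in like_events:
        counts[event["post_id"]] += 1
    return counts
```

The same pattern applies to the comments topic, giving the real-time totals each post displays.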

Furthermore, to identify trending posts — those receiving significant attention in terms of likes and comments — we can deploy Kafka Streams for real-time analysis, pinpointing posts that dominate user engagement within a given timeframe. These insights then feed into services designed to refresh and update the feed, ensuring that users have access to the most relevant and engaging content. These services, which act as consumers, ultimately facilitate the presentation of this curated content on the user interface, enriching the overall user experience.
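Trending detection adds a time dimension: engagement is counted per post inside a window, and the top posts are surfaced. The sketch below shows the idea with a single tumbling window; the window size, `top_n`, and field names are assumptions for illustration, while a real Kafka Streams job would use its windowed-aggregation API:

```python
from collections import Counter

def trending_posts(events, window_start: int, window_size_s: int,
                   top_n: int = 3) -> list[str]:
    """Count engagement events (likes + comments) per post inside one
    tumbling time window and return the most engaged posts, best first.
    """
    window_end = window_start + window_size_s
    counts = Counter(
        e["post_id"] for e in events
        if window_start <= e["timestamp"] < window_end
    )
    return [post_id for post_id, _ in counts.most_common(top_n)]
```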

Summary

Regarding the posts topic, a post can originate from multiple producers, which calls for a highly distributed system. As the partition key I would choose the user ID: it lets us retrieve a user's posts in sequential order within a partition. We also want a long retention period for this topic, since posts are long-lived content.

When it comes to likes and comments, these elements also stem from multiple producers, necessitating an equally distributed approach due to the anticipated higher volume of data compared to the posts themselves. For a large social media site, implementing around 100 partitions might be a practical strategy given the significance of these interactive elements. In this context, selecting the post ID as the key differs from the choice for posts. The rationale for preferring post ID over user ID is to consolidate all likes and comments associated with a specific post within the same partition. This organization facilitates the efficient aggregation of data by Kafka Streams applications.
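The property we rely on here is that Kafka's default partitioner maps the same key to the same partition. A small sketch of that mapping, using CRC32 as a stand-in for the murmur2 hash Kafka's Java client actually uses:

```python
import zlib

NUM_PARTITIONS = 100  # the sizing estimate discussed above

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition, like Kafka's default partitioner.

    Kafka's Java client uses murmur2; CRC32 stands in here only to show the
    property that matters: the same key always lands in the same partition.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

With `post_id` as the key, every like and comment for a given post is routed to one partition, so a Kafka Streams instance reading that partition sees the post's full interaction history.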

Lastly, on the subject of data handling within Kafka, it is recommended to format the data as events. An event, for example, could detail that user_1 created post ID 10 at 7:40 PM, along with additional necessary information about the post’s content and interactions such as likes and deletions. Structuring data in this event-driven manner simplifies the processing and management of the information flow.

Topic posts:

  • Can be produced by multiple sources.
  • High distribution is necessary for the expected data volume; on the order of 30 partitions could be a reasonable starting point.
  • I would select user_id as the key for partitioning.
  • A long retention period for the data on this topic is desirable.

Topics likes and comments:

  • Also produced by multiple sources.
  • High distribution is required as the data volume is expected to be significantly larger.
  • I would choose post_id as the key for distribution.

The data in Kafka should be formatted as “events”:

  • User 1 created post_id 1 at 3:00 PM.
  • User 2 liked post_id 10 at 4:00 PM.
  • User 1 deleted post_id 456 at 7:00 PM.
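The three example interactions above might look like this as serialized events (the field names are illustrative, not a prescribed schema):

```python
import json

# The example interactions, expressed as JSON events for their topics.
events = [
    {"event_type": "post_created", "user_id": 1, "post_id": 1,   "at": "15:00"},
    {"event_type": "post_liked",   "user_id": 2, "post_id": 10,  "at": "16:00"},
    {"event_type": "post_deleted", "user_id": 1, "post_id": 456, "at": "19:00"},
]

# Each event serializes to a self-describing message value.
serialized = [json.dumps(e) for e in events]
```

Because each event states who did what, to which post, and when, downstream consumers can rebuild any view of the data without coordinating with the producers.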

Thank you for reading until the end!

Paul Ravvich
Apache Kafka At the Gates of Mastery

Software Engineer with over 10 years of experience. Join me for tips on Programming, System Design, and productivity in tech! New articles every Tuesday and Thursday!