Apache Kafka Guide #43 Taxi Application example

Paul Ravvich
Apache Kafka At the Gates of Mastery
4 min readApr 25, 2024

Hi, this is Paul, and welcome to part #43 of my Apache Kafka guide. Today we will walk through a taxi application example as a practical exercise in learning Apache Kafka.

Taxi Application Task

Our latest case study centers on GetTaxi, a company that connects users with nearby taxi drivers instantly on demand. The business has approached us with specific requirements for building their service. They want a system where users can easily find a driver close to them. Additionally, the pricing model should adapt dynamically, increasing when driver availability is low or user demand is high. Moreover, they require that all location data, both before and during the trip, be stored in an analytics database. This is crucial for accurately calculating costs and for supporting any data science work the company wishes to pursue. The task at hand is to devise a strategy for implementing these features with Kafka.

  • Users should be able to quickly match with a nearby driver.
  • Pricing should increase during peak demand or when driver availability is low.
  • All location data before and during a trip should be captured in an analytics store for accurate cost calculation.

Solution with Apache Kafka

The architectural design revolves around user and taxi locations. The idea is to track each user's position continuously while the application is open and running, which is crucial for matching users with nearby drivers.

The architecture includes a user application that communicates with Kafka indirectly. Instead of a direct connection, a user position service acts as a proxy between the user application and Kafka, serving as a producer to the user position topic. This setup is designed to handle a high volume of data efficiently.

Similarly, the taxi driver application follows the same pattern, sending its data to the taxi position Kafka topic via a designated service. Given the nature of both applications, it’s anticipated that the data volume from taxis might even surpass that from users, indicating a significant flow of information.
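The two position services would publish records like the one built below. This is only a sketch: the payload shape and ID format are assumptions, not from GetTaxi's actual spec, and the actual `producer.send` call to Kafka is shown as a comment since it requires a running broker.

```python
import json

def build_position_record(entity_id: str, lat: float, lon: float, ts_ms: int):
    """Build a (key, value) pair for the position topics.

    Keying by entity_id (a user_id or taxi_id) guarantees that all
    positions for the same user or taxi land in the same partition,
    preserving per-entity ordering.
    """
    key = entity_id.encode("utf-8")
    value = json.dumps(
        {"id": entity_id, "lat": lat, "lon": lon, "ts": ts_ms}
    ).encode("utf-8")
    return key, value

# The position service would then hand the pair to a Kafka producer, e.g.:
# producer.send("user_position_topic", key=key, value=value)
```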

To manage these distinct data streams effectively, two separate topics have been created, reflecting the fundamental differences between users and taxi drivers as entities. From this setup, a surge pricing topic emerges. This requires integrating data on user and taxi positions to calculate surge pricing accurately.

For the surge pricing computation, either Kafka Streams or Apache Spark could be used. This component would process the inputs from both the user and taxi position topics, outputting the results to a surge pricing topic. Kafka Streams is notable for its flexibility in consuming multiple input topics and executing various computations.
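The per-region logic inside such a streams application might look like the toy function below. The ratio-based formula and the cap are illustrative assumptions; a real pricing model would be far more involved.

```python
def surge_multiplier(ride_requests: int, available_drivers: int,
                     base: float = 1.0, cap: float = 3.0) -> float:
    """Toy surge model: the multiplier grows with the demand/supply ratio
    observed in a region over some time window, bounded by base and cap."""
    if available_drivers == 0:
        return cap  # no supply at all: charge the maximum multiplier
    ratio = ride_requests / available_drivers
    return round(min(cap, max(base, base * ratio)), 2)
```

In a Kafka Streams deployment this calculation would run over windowed counts joined from the user and taxi position topics, with the result written to the surge pricing topic keyed by region.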

Once the surge pricing topic is established, it can inform the user application about the service cost or provide an estimated cost, enhancing transparency and user experience. A taxi cost service, acting as a consumer of the surge pricing data, will facilitate this process.
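The taxi cost service could combine the consumed multiplier with trip parameters roughly as follows; the base fare and per-kilometer rate are placeholder numbers, not figures from the case study.

```python
def estimate_cost(distance_km: float, multiplier: float,
                  base_fare: float = 2.5, per_km: float = 1.2) -> float:
    """Estimate a trip price from distance and the current surge multiplier
    read off the surge pricing topic."""
    return round((base_fare + per_km * distance_km) * multiplier, 2)
```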

Finally, to support data analytics and satisfy the data scientists’ needs, the architecture proposes leveraging a tool like Kafka Connect to funnel the data into analytics storage, such as Amazon S3, instead of traditional options like Hadoop. This approach aims to optimize data accessibility and utility for analytical purposes, thereby empowering data science efforts with rich, real-time data streams.
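As an illustration, the Confluent S3 sink connector could move these topics into a bucket. The connector name, bucket, region, and flush size below are placeholder assumptions:

```json
{
  "name": "taxi-analytics-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "2",
    "topics": "user_position_topic,taxi_position_topic,surge_pricing_topic",
    "s3.region": "us-east-1",
    "s3.bucket.name": "gettaxi-analytics",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000"
  }
}
```

With a sink like this in place, the data scientists query S3 directly and never touch the Kafka cluster itself.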

Summary

The topics of taxi and user positions can involve multiple producers and are highly important, particularly during periods of high demand such as Christmas or New Year’s Eve. For managing these topics, a distributed approach is essential to handle the volume effectively. When assigning keys for data organization, using user ID for user positions and taxi ID for taxi positions is the preferred strategy. This ensures that data for both users and taxis is organized efficiently, recognizing that such data is ephemeral and doesn’t require long-term retention in Kafka. Consequently, this approach reduces the need for extensive data storage within Kafka.
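The key choice matters because Kafka assigns a record to a partition by hashing its key. The sketch below uses crc32 as a stand-in (Kafka's default producer actually uses murmur2) to show the property we rely on: the same user or taxi ID always lands on the same partition, so per-entity ordering is preserved.

```python
import zlib

def partition_for(key: bytes, num_partitions: int = 30) -> int:
    # Kafka's default partitioner uses murmur2; crc32 stands in here
    # to demonstrate the property that matters: the mapping from key
    # to partition is deterministic.
    return zlib.crc32(key) % num_partitions
```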

Moreover, the surge pricing topic, produced by the Kafka Streams application, tends to be high in volume and can vary by region. Additional inputs, such as weather or event information, could also be fed into the Kafka Streams application. Incorporating these data points would improve the accuracy of the pricing model and make it more responsive to varying conditions.

Topics “taxi_position_topic” and “user_position_topic”:

  • These are topics that support multiple data producers.
  • They need to be highly partitioned (e.g., 30+ partitions) to handle large data volumes.
  • The key selection would ideally be “user_id” or “taxi_id”.
  • The data is transient and doesn’t likely require long-term retention.
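A topic along these lines could be created with the standard CLI. The partition count, replication factor, and one-hour retention below are illustrative choices, not requirements from the case study:

```shell
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic user_position_topic \
  --partitions 30 --replication-factor 3 \
  --config retention.ms=3600000
```

Setting `retention.ms` low reflects the transient nature of position data: once the analytics store has a copy, Kafka does not need to keep it around.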

Topic “surge_pricing_topic”:

  • Surge pricing calculation is sourced from the Kafka Streams application.
  • Surge pricing can be region-specific, so this topic may see high data volumes.
  • Other topics, such as “weather” or “events”, can be incorporated into the Kafka Streams application.

Thank you for reading until the end.


Software Engineer with over 10 years of XP. Join me for tips on Programming, System Design, and productivity in tech! New articles every Tuesday and Thursday!