
School Bus Tracker System Architecture

Joud W. Awad

In this blog post, we will explore how to build a tracking system for vehicles, specifically focusing on school buses. The design of this system will enable the monitoring of school buses across a broad spectrum of educational institutions. Additionally, we will enhance our system with a notification feature.

This feature will alert parents when a bus is nearing a specific student’s home or when the student is being dropped off. Throughout this post, I will cover a wide range of technical topics related to the development and implementation of this system.

Understanding the Functional Requirements

We want to design a system that keeps track of school bus locations. The system serves multiple schools, and each school includes multiple students.
The following are the requirements for each school:

  1. Frequent Updates on Bus Locations: The system must update and display the current locations of buses at high frequencies to ensure accurate tracking.
  2. Visibility for Stakeholders: It is essential to maintain a real-time visualization of bus positions on a map, accessible to both parents and school managers.
  3. Proximity Notifications: When a bus approaches a child’s house, the system should automatically send a notification to the parents. While our primary focus will be on implementing Push Notifications, the architecture should be extensible to support additional notification methods such as SMS.
  4. Data Isolation Between Schools: Each school should have access only to information about its own fleet of buses, ensuring data privacy and system integrity.

Non-functional requirements

The business objectives provide a basis for identifying several critical non-functional requirements for the school bus tracker system:

  1. Scalability: The system must efficiently handle scaling, both in terms of the number of buses actively transmitting location updates and the number of parents concurrently tracking their students’ locations.
  2. High Availability: The system must maintain high availability to ensure continuous service and real-time tracking without interruptions.
  3. Reliability: The system should guarantee no loss of notifications. Every alert generated by the system, particularly those indicating that a bus is nearing a student’s home, must reach the intended recipients without fail.
  4. Security: Given the sensitive nature of location data, the system must incorporate robust security measures to protect against unauthorized access and potential data breaches.
  5. Privacy: Strict privacy controls must be enforced to ensure that users can only access the location information of buses they are directly associated with. This will prevent any unauthorized viewing of a vehicle’s location.

Back-of-the-Envelope Estimation

To better understand the scale and demands of our proposed school bus tracker system, let’s perform some initial calculations based on the expected usage:

1. Location Update Requests Per Second (RPS)

  • Total Buses: There are 2000 schools, each with an estimation of 10 buses. Thus, we have 20,000 buses in total.
  • Update Frequency: Each bus sends a location update every 2 seconds.

Calculation of RPS:

  • Total requests over 2 seconds: 20,000 (each of the 20,000 buses sends one update)
  • RPS = 20,000 requests / 2 seconds = 10,000 RPS

This results in a very high rate of incoming data, necessitating substantial backend resources to handle the load effectively.

2. Parent App Usage

  • Total Students: 2,000 schools × 300 students per school = 600,000 students
  • Active Parents: assuming roughly 30% of parents (180,000) are tracking at any given time, each checking about once every 2 minutes
  • Checks Per Minute = 180,000 parents / 2 minutes = 90,000 checks per minute
  • Checks Per Second = 90,000 checks per minute / 60 seconds = 1,500 RPS

Although the rate of data reads is significant, it is within the capacity of modern databases and should not pose a major challenge.

Proposed High-Level Design

The high-level design of our school bus tracker system revolves around several critical components. Here’s an overview of each:

  • API Design
  • Algorithms to Find Student Houses
  • Bus Location Update
  • Data model

API Design

We will use a RESTful API to design a simple and powerful backend that handles location updates and bus location views.

GET /v1/buses/locations

  • Authentication: Utilize an “X-Authorization” header with a JWT to validate the logged-in user.
  • Function: The endpoint identifies the buses associated with the user’s children and returns the current locations.

Response Body

{
  "buses": [
    {
      "bus_id": "e3d54bd9-bbbc-4d8e-a653-a937c7c4a215",
      "driver": {
        "id": "a23e80d5-679d-4041-96bd-20d3954099ff",
        "name": "Rosella Huel"
      },
      "children": [
        {
          "id": "0d95d921-97ba-48a9-8f5d-73b54c4318f5",
          "name": "Herminia Rippin"
        }
      ],
      "location": {
        "latitude": 40.712776,
        "longitude": -74.005974
      }
    }
  ]
}

WebSocket /v1/buses/{bus_id}/location

  • Function: Each bus sends its current location through the WebSocket connection, allowing for continuous data flow without needing to open new HTTP connections.

Request Parameters:

{
  "latitude": ...,
  "longitude": ...,
  "timestamp": ...
}
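
To make the bus side concrete, here is a minimal sketch of a client that opens a WebSocket connection to this endpoint and pushes its coordinates every two seconds. It uses the popular ws package; the gateway URL and bus ID are placeholders, and the coordinates would come from the bus's GPS unit.

const WebSocket = require('ws');

// Placeholder gateway URL and bus ID for illustration only
const BUS_ID = 'e3d54bd9-bbbc-4d8e-a653-a937c7c4a215';
const ws = new WebSocket(`wss://tracker.example.com/v1/buses/${BUS_ID}/location`);

ws.on('open', () => {
  // Push the current GPS reading every two seconds
  setInterval(() => {
    ws.send(JSON.stringify({
      latitude: 40.712776,   // would come from the on-board GPS unit
      longitude: -74.005974,
      timestamp: Date.now()
    }));
  }, 2000);
});

ws.on('error', (err) => console.error('WebSocket error', err));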

Algorithms To Find Student Houses

We aim to ensure that our database contains a comprehensive list of student house locations. These locations should be organized such that we can notify parents approximately two minutes before the bus arrives at each location.

There are two types of geospatial indexing approaches, which are shown in the following figure:

GeoSpatial Indexing

Evenly Divided Grid

A straightforward approach is to divide the world into small, evenly-spaced grids. In this system, each grid could encompass multiple houses, with every house on the map assigned to one specific grid.

Evenly Divided Grid

However, this method may not suit our business needs due to a major issue: house distribution is significantly varied. We need the ability to control the size of the grids more precisely to meet our estimated notification times effectively.

Geohash

Geohash is better than the evenly divided grid option. It works by reducing the two-dimensional longitude and latitude data into a one-dimensional string of letters and digits. Geohash algorithms work by recursively dividing the world into smaller and smaller grids with each additional bit. Let’s go over how Geohash works at a high level.

First, divide the planet into four quadrants along the prime meridian and the equator.

GeoHash
  • Latitude range [-90, 0] is represented by 0
  • Latitude range [0, 90] is represented by 1
  • Longitude range [-180, 0] is represented by 0
  • Longitude range [0, 180] is represented by 1

Second, divide each grid into four smaller grids. Each grid can be represented by alternating between longitude and latitude bits.

GeoHash Dividing

Repeat this subdivision until the grid size is within the precision desired. Geohash usually uses base32 representation. Let’s take a look at two examples.

Geohash of the Google headquarters (length = 6)

01001 10110 01001 10000 11011 11010 (base32 in binary) → 9q9hvu (base32)

Geohash of the Facebook headquarters (length = 6):

01001 10110 01001 10001 10000 10111 (base32 in binary) → 9q9jhr (base32)

Geohash has 12 precisions (also called levels), as shown in the table below. The precision determines the size of the grid. We are only interested in geohashes of length 6 or 7: anything longer than 7 produces grids that are too small, while anything shorter than 6 produces grids that are too large, which would make our notification approach mostly inefficient.

| Geohash Length | Grid Width x Height        |
|----------------|----------------------------|
| 1 | 5,009.4 km x 4,992.6 km |
| 2 | 1,252.3 km x 624.1 km |
| 3 | 156.5 km x 156 km |
| 4 | 39.1 km x 19.5 km |
| 5 | 4.9 km x 4.9 km |
| 6 | 1.2 km x 609.4 m |
| 7 | 152.9 m x 152.4 m |
| 8 | 38.2 m x 19 m |
| 9 | 4.8 m x 4.8 m |
| 10 | 1.2 m x 59.5 cm |
| 11 | 14.9 cm x 14.9 cm |
| 12 | 3.7 cm x 1.9 cm |

How do we choose the right precision? We want to find the minimal Geohash length that covers the whole circle drawn by the user-defined radius. The corresponding relationship between the radius and the length of Geohash is shown in the table below.

| Radius (Kilometers)    | Geohash Length  |
|------------------------|-----------------|
| 2 km (1.24 mile) | 5 |
| 1 km (0.62 mile) | 5 |
| 0.5 km (0.31 mile) | 6 |
| 0.086 km (0.05 mile) | 7 |
| 0.015 km (0.009 mile) | 8 |
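
As a small illustration of the table above, a lookup helper can translate a desired notification radius into the matching geohash length. This is only a sketch of the table, not a library function:

// Maps a notification radius (in km) to a geohash length,
// following the radius-to-length table above
function geohashLengthForRadius(radiusKm) {
  if (radiusKm >= 1) return 5;      // 1-2 km
  if (radiusKm >= 0.5) return 6;    // ~0.5 km
  if (radiusKm >= 0.086) return 7;  // ~0.086 km
  return 8;                         // ~0.015 km and below
}

console.log(geohashLengthForRadius(0.5)); // 6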

For more information about how GeoHash works, you can use this site.

const geohash = require('ngeohash');

// Example coordinates for a location
const latitude = 40.6892; // Latitude for Statue of Liberty
const longitude = -74.0445; // Longitude for Statue of Liberty

// Generate the GeoHash
const geohashCode = geohash.encode(latitude, longitude, 7);
console.log("GeoHash code:", geohashCode);

This approach is highly effective for our current setup, although there are minor issues related to boundary definitions. However, these will not impact our existing framework.
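
One of those boundary issues is that a house can sit just across the edge of the bus's current cell. The usual workaround is to check the cell and its eight neighbors; ngeohash (the same library used above) exposes a neighbors helper for this:

const geohash = require('ngeohash');

// A bus near the edge of a cell would otherwise miss houses in the adjacent cell,
// so we query the bus's own cell plus its eight surrounding cells.
const busCell = geohash.encode(40.712776, -74.005974, 7);
const cellsToCheck = [busCell, ...geohash.neighbors(busCell)];

console.log('Cells to query for nearby houses:', cellsToCheck);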

Quadtree

A quadtree is a data structure that is commonly used to partition a two-dimensional space by recursively subdividing it into four quadrants (grids) until the contents of the grids meet certain criteria. For example, the criterion can be to keep subdividing until the number of businesses in the grid is not more than 100. This number is arbitrary as the actual number can be determined by business needs (houses in each grid).

With a quadtree, we build an in-memory tree structure to answer queries. Note that quadtree is an in-memory data structure and it is not a database solution. It runs on each server, and the data structure is built at server start-up time.

The following figure visualizes the conceptual process of subdividing the world into a quadtree. Let’s assume the world contains 200m (million) houses (Just a random number to demonstrate the usage).

QuadTree Basics

The root node represents the whole world map. The root node is recursively broken down into 4 quadrants until no nodes are left with more than 100 businesses.

QuadTree Advanced

How to get nearby businesses with Quadtree?

  1. Build the quadtree in memory.
  2. After the quadtree is built, start searching from the root and traverse the tree, until we find the leaf node where the search origin is. If that leaf node has 100 businesses, return the node. Otherwise, add businesses from its neighbors until enough businesses are returned.
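
To illustrate the structure rather than provide a production implementation, here is a compact quadtree sketch: each node covers a rectangle, holds up to CAPACITY houses, and splits into four children once it overflows, mirroring the subdivision rule described above.

const CAPACITY = 100; // maximum houses per leaf, chosen by business needs

class QuadTreeNode {
  constructor(x, y, width, height) {
    this.bounds = { x, y, width, height }; // lower-left corner plus extent
    this.houses = [];
    this.children = null; // becomes an array of four nodes once subdivided
  }

  contains(point) {
    const { x, y, width, height } = this.bounds;
    return point.x >= x && point.x < x + width &&
           point.y >= y && point.y < y + height;
  }

  subdivide() {
    const { x, y, width, height } = this.bounds;
    const w = width / 2;
    const h = height / 2;
    this.children = [
      new QuadTreeNode(x, y, w, h),
      new QuadTreeNode(x + w, y, w, h),
      new QuadTreeNode(x, y + h, w, h),
      new QuadTreeNode(x + w, y + h, w, h)
    ];
    // Push the houses held by this node down into the new children
    for (const house of this.houses) {
      this.children.some((child) => child.insert(house));
    }
    this.houses = [];
  }

  insert(house) {
    if (!this.contains(house)) return false;
    if (!this.children && this.houses.length < CAPACITY) {
      this.houses.push(house);
      return true;
    }
    if (!this.children) this.subdivide();
    return this.children.some((child) => child.insert(house));
  }

  // Walk down to the leaf grid that contains a given point (step 2 above)
  findLeaf(point) {
    if (!this.contains(point)) return null;
    if (!this.children) return this;
    for (const child of this.children) {
      const leaf = child.findLeaf(point);
      if (leaf) return leaf;
    }
    return null;
  }
}

// Usage: treat longitude as x and latitude as y for simplicity
const world = new QuadTreeNode(-180, -90, 360, 180);
world.insert({ x: -74.005974, y: 40.712776, id: 'house-1' });
console.log(world.findLeaf({ x: -74.0, y: 40.7 }).houses.length); // 1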

Real-world quadtree example

Yext provided an image that shows a constructed quadtree near Denver. We want smaller, more granular grids for dense areas and larger grids for sparse areas.

QuadTree Real World Example

However, Quadtree has operational drawbacks for our use case: adding new houses requires updating the in-memory tree on every server, so changes take time to appear on the map, and we have no direct control over grid size, since subdivision is driven by the number of houses in a grid rather than by its physical area. Both issues make Quadtree a weaker fit here.

Google S2

The Google S2 geometry library is another big player in this field. Similar to Quadtree, it is an in-memory solution. It maps a sphere to a 1D index based on the Hilbert curve (a space-filling curve). The Hilbert curve has a very important property: two points that are close to each other on the Hilbert curve are also close in 1D space, and searching 1D space is much more efficient than searching 2D space. Interested readers can play with an online tool for the Hilbert curve.

S2 is a complicated library, but it is widely used at companies such as Google and Tinder, and it brings two notable advantages for our use case:

  • S2 is great for geofencing because it can cover arbitrary areas with varying levels. Geofencing allows us to define perimeters that surround the areas of interest and to send notifications to users who are in those areas.
Google S2
  • Another advantage of S2 is its Region Cover algorithm. Instead of having a fixed level (precision) as in GeoHash, we can specify a min level, max level, and max cells in S2. The result returned by S2 is more granular because the cell sizes are flexible. If you want to learn more, take a look at the S2 tool.

Recommendation

As a concluding recommendation, it is advisable to weigh the merits of using GeoHash versus Google S2, as both offer the functionality suited for our application.

However, it is important to note that GeoHash values can be stored and indexed directly in a database, whereas S2 operates as an in-memory solution. For the purposes of simplicity and the scope of this article, we will proceed with the GeoHash solution.

Bus Location Update

Each bus transmits a location update approximately every two seconds. Given the write-heavy nature of location updates, a direct write to the database could be feasible but may not be optimal. An alternative approach involves writing the most recent locations to a Redis Cache while maintaining a historical log of bus locations in the database. This method enhances the scalability of our write operations.

Additionally, it’s important to consider how the Parents app receives these updates. Instead of querying the database repeatedly to refresh the bus’s location, we can leverage Redis Pub/Sub to provide real-time notifications.

Understanding Redis Pub/Sub:

Redis Pub/Sub functions as an efficient, lightweight message bus. Channels, also known as topics, are cost-effective and easy to create. A modern Redis server, equipped with gigabytes of memory, can support millions of channels.

Redis Pub/Sub Architecture

In our design, location updates received via a WebSocket server are published to a specific bus’s channel on the Redis Pub/Sub server. New updates are then instantly republished to all subscribers of that bus, ensuring real-time location updates on the map.
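
On the publishing side, the WebSocket server only needs to push each incoming update onto the bus's channel; the parent-facing servers pick it up from there. A minimal sketch with the node-redis client, assuming a channel naming convention of bus#<bus_id>:

const redis = require('redis');

const publisher = redis.createClient({ url: 'redis://localhost:6379' });
publisher.on('error', (err) => console.log('Redis Client Error', err));

// Called whenever a location update arrives over the bus's WebSocket connection
async function publishBusLocation(busId, location) {
  // One channel per bus, e.g. "bus#e3d54bd9-..."; subscribers receive the raw JSON
  await publisher.publish(`bus#${busId}`, JSON.stringify(location));
}

(async () => {
  await publisher.connect();
  await publishBusLocation('e3d54bd9-bbbc-4d8e-a653-a937c7c4a215', {
    latitude: 40.712776,
    longitude: -74.005974,
    timestamp: Date.now()
  });
})();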

Location Update Delivery

Considering the high frequency of updates from each bus, using a RESTful API to receive these updates would be inefficient. A more effective solution is to use WebSockets, which maintain a persistent connection and allow for continuous data transmission without necessitating a full TCP handshake every few seconds.

Data model

In this section, we discuss the read/write ratio and the schema design for our application.

Read/write ratio

The system exhibits a high write volume, as numerous buses emit updates every two seconds. Consequently, the number of writes substantially exceeds the queries from parents tracking their students’ locations. Given this write-heavy nature, a NoSQL database is preferable. For the purposes of this blog, we will utilize DynamoDB.

DynamoDB as a Database

DynamoDB is a scalable and highly available database designed to manage large volumes of reads and writes. It requires thoughtful schema design but offers robust features that are beneficial for our use case.

We propose the creation of three primary tables:

1. Parent Table
2. Students Table
3. Bus Location History Table

In DynamoDB, a table consists of a Primary Key (PK) and an optional Sort Key (SK). A unique combination of PK and SK (or just PK if SK is not used) is essential to prevent duplicate records. Querying is typically constrained to these keys; for queries using other attributes, a Local Secondary Index (LSI) or Global Secondary Index (GSI) may be employed.

Given our prior discussions, GeoHash will be employed for location queries. It is crucial to define our queries precisely before schema creation to ensure the database is optimized for our operational requirements.

Parent Table

Parent Table

This table denormalizes data to minimize query volume when retrieving the locations of a parent's children: the bus IDs associated with each parent are cached on the parent record to enhance performance.

  • Key: parent_id
  • Value: bus_ids

Student Table

Student Table

Includes a GeoHash field to determine the geographic block to which a student belongs; the Latitude and Longitude fields represent where the house is located and are used to calculate the estimated distance to arrival.
In addition, we establish a Global Secondary Index (GSI) with geohash as the partition key (PK) and bus_id as the sort key (SK). This GSI facilitates queries targeting a specific geographical area for a specific bus, and it is used to notify parents about bus arrival, as we will see in the “Storing Bus Location History” section.
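
Assuming the GSI is named something like geohash-bus_id-index (the actual name is up to the schema), the query for the students a given bus should notify in one cell might look like the following sketch, using the AWS SDK v3 Document Client:

const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, QueryCommand } = require('@aws-sdk/lib-dynamodb');

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fetch the students of a given bus whose houses fall inside one geohash cell.
// Table and index names are illustrative placeholders.
async function studentsInCell(geohashCell, busId) {
  const result = await ddb.send(new QueryCommand({
    TableName: 'Students',
    IndexName: 'geohash-bus_id-index',
    KeyConditionExpression: 'geohash = :gh AND bus_id = :bus',
    ExpressionAttributeValues: { ':gh': geohashCell, ':bus': busId }
  }));
  return result.Items;
}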

Bus Location History Table

Bus Location History Table

Stores latitude and longitude as separate attributes so the bus's location can be displayed accurately on the map for parents.

While it is possible to consolidate all data into a single table — a common practice in DynamoDB for simplification — this blog will illustrate using separate tables to clarify the concepts for readers.

System Design (Bus Flow)

Upon reviewing each component of our system, we are now ready to develop the initial diagram of our application. Here are key points to consider while designing our system architecture:

  1. Bus Notification: Buses will send location updates every two seconds using a WebSocket connection to our server.
  2. Data Handling: Our server will push these data updates to a durable storage platform (not directly to a database) for Async processing.
  3. Location Caching: The most recent bus location is cached using Redis.
  4. Real-time Publishing: The updated location is then published to parents subscribed to the bus channel.
  5. Database Updates: Finally, the location data is saved in the database.
  6. Geo-Location Awareness: When a bus enters a new GeoHash location, all parents residing in that area are notified.

Bus Location Report

Bus Location Report Architecture
  1. WebSocket Gateway: We use an AWS Managed WebSocket Gateway to establish a WebSocket connection and send location updates every two seconds.
  2. API Gateway and Lambda: The API Gateway triggers a Lambda function after every message, which forwards the data to a data stream platform such as Kafka or Kinesis.
  3. Data Durability: The data remains in the stream until it is processed.
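
A rough sketch of the Lambda from step 2, assuming Kinesis as the stream (the stream name is a placeholder): the handler takes the WebSocket message body and appends it to the stream, partitioned by bus so that a single bus's updates stay ordered.

const { KinesisClient, PutRecordCommand } = require('@aws-sdk/client-kinesis');

const kinesis = new KinesisClient({});

// Invoked by the WebSocket API Gateway for every message received from a bus
exports.handler = async (event) => {
  const update = JSON.parse(event.body); // { latitude, longitude, timestamp, ... }

  // Assumption: the bus ID either travels in the payload or is resolved from the connection
  const partitionKey = update.busId || event.requestContext.connectionId;

  await kinesis.send(new PutRecordCommand({
    StreamName: 'bus-location-updates', // placeholder stream name
    PartitionKey: partitionKey,         // keeps a single bus's updates ordered
    Data: Buffer.from(JSON.stringify(update))
  }));

  return { statusCode: 200 };
};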

The preceding architecture effectively addressed our initial objectives. Now, let’s shift our focus to the third and fourth points outlined in our design: managing the current location and facilitating parent notifications.

Bus Location Update & Notify Users

Bus Location Update Architecture
  • Data Stream to Redis: Data from Kafka/Kinesis is processed by a worker, which then saves it to a Redis instance. This process involves setting a TTL (Time to Live) for the data.
  • Redis Pub/Sub: Upon successful caching, the new location is published on a Redis Pub/Sub channel dedicated to that specific bus (e.g., bus#uuid). This enables real-time notification to all connected clients.
  • Queue for Asynchronous Processing: After publishing, the data is queued into SQS for asynchronous processing and database storage.

When pushing data to the queue, it’s essential to include both the newly received location and the latest location stored in the Redis Cache. Additionally, maintaining an in-memory HashMap can help minimize read calls to Redis, enhancing efficiency.
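
A hedged sketch of that worker logic is shown below: it keeps the last seen location per bus in a local HashMap to avoid extra Redis reads, writes the new location to Redis with a TTL, and enqueues both the previous and the new location for the downstream consumer. The queue URL and TTL are illustrative.

const redis = require('redis');
const { SQSClient, SendMessageCommand } = require('@aws-sdk/client-sqs');

const cache = redis.createClient({ url: 'redis://localhost:6379' });
cache.on('error', (err) => console.log('Redis Client Error', err));
// Note: await cache.connect() once at worker start-up before handling updates

const sqs = new SQSClient({});
const QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/bus-location-history'; // placeholder

// Local HashMap of the last location seen per bus, to avoid extra Redis reads
const lastKnown = new Map();

async function handleLocationUpdate(busId, location) {
  const previous = lastKnown.get(busId) || null;

  // Keep only the freshest location in Redis; let it expire if the bus goes silent
  await cache.set(`bus:location:${busId}`, JSON.stringify(location), { EX: 60 });
  lastKnown.set(busId, location);

  // Hand both locations to the asynchronous consumer that stores history and notifies parents
  await sqs.send(new SendMessageCommand({
    QueueUrl: QUEUE_URL,
    MessageBody: JSON.stringify({ busId, previous, current: location })
  }));
}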

Thus far, we have successfully managed to receive location updates from the bus and store both the current location and live updates in the Redis Cache. These updates are then seamlessly published to the parent app for all users to access.

Storing Bus Location History

Storing Bus Location History Architecture
  • Data Processing from SQS: Data pulled from the queue via a Lambda worker is saved in the database with the bus ID and its latitude-longitude coordinates.
  • Conditional Notifications: To optimize notification efficiency, a check is performed to determine if there has been a change in the GeoHash of the last observed location. If so, parents associated with the new GeoHash location are identified, and notifications are dispatched to SQS.
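
The cell-change check itself is small: re-encode the previous and current coordinates at the chosen precision and only fan out when the value changes. A sketch, reusing ngeohash and the precision of 7 used earlier:

const geohash = require('ngeohash');

const PRECISION = 7; // ~150 m cells, per the geohash table above

// Returns the new cell if the bus has crossed into a different geohash cell, otherwise null
function cellChanged(previous, current) {
  const newCell = geohash.encode(current.latitude, current.longitude, PRECISION);
  if (!previous) return newCell;
  const oldCell = geohash.encode(previous.latitude, previous.longitude, PRECISION);
  return newCell === oldCell ? null : newCell;
}

// In the Lambda worker: if the cell changed, look up the students of this bus in the
// new cell (e.g. via the GSI query shown earlier) and enqueue the notifications.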

Notification Flow Design

Notification Flow Architecture
  • Data Retrieval: When a notification is needed, the system first attempts to retrieve customer information from the cache. If unavailable, it accesses the database.
  • SNS for Fan-Out: SNS is utilized to distribute notifications based on user preferences.
  • End Processing: Messages are queued in another SQS and processed by Lambda functions to deliver the final notification to the end user.
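
For the fan-out step, here is a sketch of publishing to an SNS topic with a message attribute carrying the preferred channel, so that per-channel subscriptions (push, SMS, and so on) can filter on it. The topic ARN is a placeholder:

const { SNSClient, PublishCommand } = require('@aws-sdk/client-sns');

const sns = new SNSClient({});
const TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:bus-arrival-notifications'; // placeholder

async function notifyParent(parentId, message, channel) {
  await sns.send(new PublishCommand({
    TopicArn: TOPIC_ARN,
    Message: JSON.stringify({ parentId, message }),
    // Subscription filter policies can route on this attribute (e.g. push vs. SMS)
    MessageAttributes: {
      channel: { DataType: 'String', StringValue: channel }
    }
  }));
}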

This system design ensures real-time update dissemination and effective data management for location-based notifications. Please review the end-to-end architecture diagram carefully to fully grasp the interactions between each component.

End-2-End Architecture

This architecture might initially seem complex, so please take your time to thoroughly review and understand each component of the system.

System Design (Parent App Flow)

The architecture of the Parent App is simpler compared to the Bus Flow and can be illustrated in a single diagram.

Parent App Architecture

Upon opening the app, it performs three main actions:

  1. Initial WebSocket Connection: The app establishes an initial connection with the WebSocket Gateway.
  2. Query Current Bus IDs: It queries the current bus IDs from the Redis Cache to fetch the latest location information and display them on the map.
  3. Listen to Bus Channels: Using the retrieved bus IDs, the app subscribes to their respective channels on the ECS side. This ensures that any new location updates are received through Redis Pub/Sub and delivered via the WebSocket Gateway to the user’s phone.

Server-Side Logic For Handling Connection

It’s crucial to understand the behavior of ECS when a new connection is established, as this forms the basis of the Parent App architecture.

When a user connects to the WebSocket Gateway, a designated function on the ECS service is triggered. This function retrieves the list of student IDs from DynamoDB and allows the current ECS instance to listen to the corresponding bus channels.

Below is a sample code snippet demonstrating this process:

const redis = require('redis');

// Create a dedicated Redis client for subscriptions
const subscriber = redis.createClient({
  url: 'redis://localhost:6379'
});

subscriber.on('error', (err) => {
  console.log('Redis Client Error', err);
});

// Function to subscribe to a new channel
async function subscribeToChannel(channelName) {
  await subscriber.subscribe(channelName, (message, channel) => {
    console.log(`Received message: ${message} from channel: ${channel}`);
  });
  console.log(`Subscribed to ${channelName}`);
}

(async () => {
  // The client must be connected before any subscribe call
  await subscriber.connect();

  // Initially subscribe to a default channel
  await subscribeToChannel('initialChannel');

  // Example of subscribing to a new channel dynamically based on some event or condition
  setTimeout(() => {
    // Simulating a new channel subscription trigger after 10 seconds
    subscribeToChannel('bus#uuid1');
  }, 10000);
})();

Every new update to subscribed channels is instantly reflected in the ECS instance. Using the AWS SDK, the ECS publishes live updates through the WebSocket Gateway to the mobile application.

To optimize performance, we maintain an in-memory HashMap for subscribed channels for each ECS instance. This allows efficient management of channel subscriptions, ensuring that duplicate subscriptions are discarded. Additionally, we store WebSocket Gateway IDs of connected users along with their corresponding Channel IDs in memory. This approach streamlines the notification process, eliminating the need for frequent database queries.
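
A sketch of that bookkeeping on the ECS side: one set records which channels this instance has already subscribed to, a map ties each channel to the WebSocket connection IDs listening on it, and incoming Pub/Sub messages are relayed through the API Gateway Management API. The endpoint is a placeholder, and the subscriber is the connected node-redis client from the earlier snippet.

const { ApiGatewayManagementApiClient, PostToConnectionCommand } = require('@aws-sdk/client-apigatewaymanagementapi');

const gateway = new ApiGatewayManagementApiClient({
  endpoint: 'https://example.execute-api.us-east-1.amazonaws.com/prod' // placeholder
});

// channel -> Set of WebSocket connection IDs listening to that bus
const channelConnections = new Map();
// channels this ECS instance has already subscribed to, so we never subscribe twice
const subscribedChannels = new Set();

function registerListener(channelName, connectionId, subscriber) {
  if (!channelConnections.has(channelName)) {
    channelConnections.set(channelName, new Set());
  }
  channelConnections.get(channelName).add(connectionId);

  if (!subscribedChannels.has(channelName)) {
    subscribedChannels.add(channelName);
    // subscriber is the already-connected node-redis client
    subscriber.subscribe(channelName, (message) => relay(channelName, message));
  }
}

async function relay(channelName, message) {
  const connections = channelConnections.get(channelName) || new Set();
  for (const connectionId of connections) {
    await gateway.send(new PostToConnectionCommand({
      ConnectionId: connectionId,
      Data: Buffer.from(message)
    }));
  }
}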

Moreover, the client app incorporates logic to automatically reconnect to the WebSocket Gateway in case of connection loss, ensuring uninterrupted communication between the server and the mobile application.

Scalability Deep Dive

Now that we have finalized the design for each application component, let’s explore strategies for independent scalability. It’s crucial to consider that some components are managed by AWS, which provides scalability features out of the box.

WebSocket servers

WebSocket servers can be auto-scaled based on usage, particularly with managed services like API Gateway WebSocket, where AWS handles scaling automatically. ECS and Lambda are responsible for executing specific logic in response to WebSocket messages, and they can be scaled easily. Lambda functions can be scaled by configuring the concurrency settings, while ECS instances can be scaled based on CPU and memory usage.

DynamoDB

DynamoDB is highly scalable and available, offering two scaling options: On-Demand and Provisioned Capacity. While On-Demand scales automatically based on usage, Provisioned Capacity allows for predictable performance and cost. Monitoring is essential to ensure optimal performance, especially given our usage pattern.

Cache & Pub/Sub (Redis)

AWS ElastiCache provides a hosted Redis service with auto-scaling capabilities. ElastiCache Serverless offers automatic workload traffic accommodation by continuously monitoring resource utilization and scaling out as needed, adding new shards, and redistributing data without downtime.

Data Streaming (Kafka/Kinesis)

Both Kafka (via Amazon MSK) and Kinesis are available as managed services on AWS and scale well with the right configuration. Regardless of the chosen platform, AWS manages scalability, relieving us of operational concerns.

SQS (Simple Queue Service)

AWS offers two types of SQS queues, FIFO and Standard (unordered), each with its own quotas and limitations.

FIFO Queues:
FIFO queues support a quota of 300 transactions per second per API action (SendMessage, ReceiveMessage, and DeleteMessage). With batching, FIFO queues can handle up to 3,000 messages per second per API action, representing 300 API calls, each containing a batch of 10 messages.

Given our Back-Of-Envelope Estimation, which indicates an expected write rate of 10,000 requests per second (RPS), utilizing FIFO queues — even with batching — may not be optimal. In such cases, we have several alternatives:

  1. Standard Queue: Utilize a Standard (unordered) queue and manage potential duplication in the application logic.
  2. Data Streaming Platform: Replace SQS on the high-throughput location path with a data streaming platform such as Kafka or Kinesis, keeping queues only for the notification process with SNS.
  3. Over-Provisioning Queues: Over-provision queues and implement consistent hashing logic to distribute the load among these queues. This approach can help handle higher write rates efficiently.
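
For the over-provisioning option, the routing logic can be as simple as hashing the bus ID onto one of N queues so that a given bus's messages always land on the same queue. The sketch below uses plain modulo hashing rather than a full consistent-hash ring, with placeholder queue URLs:

const crypto = require('crypto');

// Placeholder URLs for the over-provisioned queues
const QUEUE_URLS = [
  'https://sqs.us-east-1.amazonaws.com/123456789012/bus-updates-0',
  'https://sqs.us-east-1.amazonaws.com/123456789012/bus-updates-1',
  'https://sqs.us-east-1.amazonaws.com/123456789012/bus-updates-2'
];

// Hash the bus ID so every message for the same bus goes to the same queue
function queueForBus(busId) {
  const hash = crypto.createHash('md5').update(busId).digest();
  const index = hash.readUInt32BE(0) % QUEUE_URLS.length;
  return QUEUE_URLS[index];
}

console.log(queueForBus('e3d54bd9-bbbc-4d8e-a653-a937c7c4a215'));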

Alternative to Redis pub/sub

Erlang is presented as a viable alternative to Redis pub/sub for routing tasks, particularly praised for its suitability in highly distributed and concurrent applications. This recommendation stems from Erlang’s ecosystem, which includes the Erlang or Elixir programming languages, the BEAM virtual machine, and OTP runtime libraries. The advantage of Erlang lies in its efficient handling of lightweight processes that are inexpensive to create and manage, allowing millions to run concurrently on a single server with minimal resource use.

For practical application, using Erlang would involve replacing the Redis pub/sub setup with a distributed Erlang system where each user is represented by an individual process. These processes could efficiently manage real-time updates, such as location changes, and facilitate a network of updates among a user’s friends, making Erlang a strong candidate for systems requiring scalable, real-time data distribution, assuming the availability of Erlang programming expertise.

Final Thoughts

In this comprehensive blog, we’ve delved into various aspects of system architecture, focusing on caching, pub/sub, and handling high volumes of writes. We explored two distinct architecture flows in detail, providing insights into each component’s functionality and interactions.

Moreover, we concluded the blog with a thorough examination of scalability strategies, emphasizing the ability to scale each component independently. By leveraging managed services and implementing best practices, our systems are engineered to be highly scalable and available, capable of accommodating changing demands and ensuring seamless performance.

I encourage you to take your time to review the blog thoroughly. If you have any questions or suggestions for further improvement, please feel free to share your thoughts in the comments section. Your feedback is invaluable in enhancing the content and providing greater value to our readers.
