Developing a High-Traffic Data Service With AWS Kinesis
Hi everyone,
In this series of articles, we will describe how we store and process the high-traffic data coming into one of our projects.
The topics that we will briefly touch on in this series of articles will be as follows:
- Service Needs
- Cloud solutions for problems: Kinesis Data Stream, Data Firehose, S3 Buckets
- Data ETL Tools (Extract, Transform, and Load): Glue Spark, Glue Data Catalog
- Query Optimization: Athena, Partitioning
- Conclusion
Service Needs
Our service collects data from external data sources, receiving an average of 1,000 webhook requests per second. Based on new customer potential, we expect this to reach 10,000 requests per second.
We need to store and query this data in both granular and aggregated forms. We use an RDBMS to join the aggregated results with other tables.
Solutions For Problems
In this section, let's briefly go over the solutions we used for each problem; the details will follow in later sections.
- Handling high-load requests → Data streaming: convert incoming data into a stream so that high request rates can be absorbed. Related technology: AWS Kinesis Data Streams
- Keeping the data in a warehouse instead of an RDBMS → Data warehouse: write the stream directly to AWS S3 in partitioned form and query it there. Related technology: AWS Kinesis Data Firehose
- Repartitioning data according to query needs → Data partitioning and transformation: Spark Glue Jobs
- Querying with SQL, which is simple, understandable, and easy to maintain and manage, so the data can be used effectively: AWS Glue Data Catalog
- Running high-performance queries directly in the data warehouse without moving the data elsewhere: Amazon Athena
Having summarized the current situation, let's move on to the main topic of this article: handling high-traffic data with Kinesis. We will cover the remaining topics in follow-up articles.
AWS Kinesis
A suite of products developed by Amazon to easily collect, process and analyze data streams in real-time.
Amazon Kinesis makes it easy to collect, process, and analyze real-time streaming data so you can get timely insights and react quickly to new information. It offers key capabilities to cost-effectively process streaming data at any scale, along with the flexibility to choose the tools that best suit the requirements of your application. With Amazon Kinesis, you can ingest real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications. Amazon Kinesis also enables you to process and analyze data as it arrives and respond instantly, instead of having to wait until all your data is collected before processing can begin.
Advantages
- Real-time: Amazon Kinesis enables you to ingest, buffer, and process streaming data in real-time, so you can derive insights in seconds or minutes instead of hours or days.
- Fully managed: Amazon Kinesis is fully managed and runs your streaming applications without requiring you to manage any infrastructure.
- Scalable: Amazon Kinesis can handle any amount of streaming data and process data from hundreds of thousands of sources with very low latencies.
Kinesis and High-Load Data
The main change we made to the existing system was to take advantage of this scalable structure: instead of queuing and processing requests, we convert the incoming data into the format we want and push it to a data stream. This lets us reach high RPS easily and removes bottlenecks such as RabbitMQ and MySQL.
It is possible to do this transformation with Lambda on the AWS side, but that brings high costs in high-RPS systems. Since Lambda is billed per request (plus execution duration), the cost grows in direct proportion to request volume.
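As a rough illustration (assuming Lambda's published on-demand request price of about $0.20 per million invocations, and ignoring duration charges): at 10,000 RPS that is roughly 10,000 × 86,400 × 30 ≈ 26 billion invocations per month, or around $5,200 per month in request charges alone, before any GB-second cost for execution time.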
By routing the incoming stream data with Kinesis Firehose to S3, we achieved a data warehouse solution instead of the traditional relational database. Let’s continue with how we use these tools, their limitations, and their challenges.
AWS Kinesis Data Streams
Amazon Kinesis Data Streams is a serverless streaming service that makes it easy to capture, process, and store data streams at any scale. The most important factors in selecting Kinesis Data Streams were that there is no server to manage and that it can dynamically provide the required capacity with its on-demand mode.
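As a minimal sketch (assuming the AWS SDK for Java v2 and a hypothetical stream name), creating a stream in on-demand mode looks roughly like this:

import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.CreateStreamRequest;
import software.amazon.awssdk.services.kinesis.model.StreamMode;
import software.amazon.awssdk.services.kinesis.model.StreamModeDetails;

public class CreateOnDemandStream {
    public static void main(String[] args) {
        try (KinesisClient kinesisClient = KinesisClient.create()) {
            // On-demand mode: no shard count to size up front; capacity scales automatically.
            kinesisClient.createStream(CreateStreamRequest.builder()
                    .streamName("incoming-webhook-events") // hypothetical stream name
                    .streamModeDetails(StreamModeDetails.builder()
                            .streamMode(StreamMode.ON_DEMAND)
                            .build())
                    .build());
        }
    }
}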
How does it work?
The diagram below shows the architecture of the Kinesis Data Stream. It can process and analyze data in real-time.
The output of a Kinesis Data Stream can serve as the input to another system, and multiple applications can consume the same stream independently and in parallel.
Components
The main building block of a Kinesis Data Stream is the shard.
- Shard: A shard is a uniquely identified sequence of data records; one or more shards together make up a data stream. Each shard offers fixed capacity: up to 5 read transactions per second for a total of 2 MB/s of read throughput, and up to 1,000 write records per second for a total of 1 MB/s of write throughput. The capacity of a stream therefore depends on how many shards it contains.
- Data Record: The unit of data written into a shard. It consists of three parts: a partition key, a sequence number, and a data blob.
- Partition Key: The key used to distribute data records across shards. Kinesis hashes the key (with MD5) and the resulting value determines which shard's hash key range the record falls into; see the sketch after this list. The maximum key length is 256 characters.
- Sequence Number: The sequence number assigned to each data record within its shard.
- Data Blob (Payload): The data itself. We serialize our JSON payload to a byte array before writing it.
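To make the partition-key behaviour more concrete, here is an illustrative sketch (not the exact internal algorithm) assuming evenly split hash key ranges: the key is MD5-hashed to a 128-bit value, and the shard whose range contains that value receives the record. The partition key and shard count below are made-up example values.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ShardMappingSketch {
    public static void main(String[] args) throws Exception {
        String partitionKey = "account-42"; // hypothetical partition key

        // Kinesis hashes the partition key with MD5 into a 128-bit integer.
        byte[] md5 = MessageDigest.getInstance("MD5")
                .digest(partitionKey.getBytes(StandardCharsets.UTF_8));
        BigInteger hashKey = new BigInteger(1, md5); // value in [0, 2^128)

        // Assuming N shards with evenly split hash key ranges,
        // shard i owns [i * 2^128 / N, (i + 1) * 2^128 / N).
        int shardCount = 4;
        BigInteger keySpace = BigInteger.ONE.shiftLeft(128);
        int shardIndex = hashKey.multiply(BigInteger.valueOf(shardCount))
                .divide(keySpace)
                .intValue();
        System.out.println("Record would land on shard " + shardIndex);
    }
}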
Quotas and Limits
There is no upper limit on the number of data streams you can create in your AWS account; in on-demand mode, however, the default maximum is 50 data streams. AWS Support can raise this limit per account (500 is the hard limit). With a simple calculation, that allows writes of up to 500,000 RPS (500 × 1,000 records per second).
Kinesis Data Firehose, on the other hand, has a limit of roughly 2,000 RPS for direct writes. But if we feed Firehose from a Data Stream, this limit does not apply.
The maximum payload size of a Kinesis record is 1 MB, measured before Base64 encoding.
Another life-saving feature is the data retention period, which can be adjusted between 24 hours and 365 days; longer retention brings more cost.
Even if we fail to persist the data properly downstream, knowing that the raw data is still available in the Data Stream gives us confidence while building the system.
API Sample
import java.sql.Timestamp;
import java.time.Instant;

import com.fasterxml.jackson.databind.ObjectMapper;
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

ObjectMapper mapper = new ObjectMapper();

// Serialize the incoming event to JSON bytes and write it to the stream;
// the timestamp-based partition key spreads records across shards.
PutRecordRequest putRecordRequest = PutRecordRequest.builder()
        .streamName(kinesisDeliveryStreamName)
        .data(SdkBytes.fromByteArray(mapper.writeValueAsBytes(event)))
        .partitionKey(Timestamp.from(Instant.now()).toString())
        .build();

kinesisClient.putRecord(putRecordRequest);
This example demonstrates how we send an incoming event object to the Data Stream using the AWS SDK for Java v2.
AWS Kinesis Data Firehose
Data Firehose reliably loads real-time streams into data lakes, warehouses, and analytics services. In our new system, Data Firehose is positioned as the intermediary that carries data from the stream to S3. Like all Kinesis tools, Firehose scales automatically without any manual intervention. It can convert incoming data to formats such as Apache Parquet and dynamically partition the stream data without requiring us to build any processing pipeline of our own.
We will cover dynamic partitioning and its importance in more detail later; in short, partitioning yields much better query performance and plays a role similar to an index in a traditional database.
Firehose can connect to more than 30 fully integrated AWS services and streaming targets such as Amazon Simple Storage Service (S3) and Amazon Redshift. We selected S3 as our target to minimize the costs and benefit from Athena.
How does it work?
Amazon Kinesis Data Firehose is an extract, transform, and load (ETL) service that reliably captures, transforms and delivers streaming data to data lakes, data stores, and analytics services.
Data received from a Kinesis Data Stream or via the SDK can be transformed with Lambda, but keep in mind that this adds Lambda cost. Firehose itself can convert records to Apache Parquet or Apache ORC without any Lambda. Parquet is the most suitable option if the data needs to be queried: its columnar layout and compression make it far more efficient to scan than raw JSON.
If you enable format conversion, Firehose asks you to select a Glue schema and converts the incoming data to the target format using the schema you created beforehand.
Dynamic Partitioning
Dynamic partitioning is one of the most important features of Data Firehose, and one that must be used carefully.
Data from the stream is written into a buffer inside Firehose, and when that buffer fills up or its interval expires, the contents are delivered to the final destination. Both the buffer size and the buffer interval are configurable; in our setup the relevant values were a 128 MB buffer size and a 60-second interval.
No more than 500 partitions can be active in this buffer at once. This value can be increased up to 5,000 by contacting AWS Support. When the limit is exceeded, the affected records are no longer processed; they are written in their raw form directly to the error destination in S3.
For example, if account_id is the partition key and more than 500 distinct customers send events within 60 seconds (or within 128 MB of data), the system will not work properly. You have to pick your partition keys carefully.
NOTE: Writing data to S3 under a prefix such as 'data' eliminates many problems later on. It keeps the error content separate, since Firehose has to store erroneous records in the same S3 bucket.
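To make the account_id example above concrete, here is a minimal sketch of the relevant part of a Firehose S3 destination configured with the AWS SDK for Java v2; the prefixes and the account_id key are illustrative assumptions, not our production values.

import software.amazon.awssdk.services.firehose.model.DynamicPartitioningConfiguration;
import software.amazon.awssdk.services.firehose.model.ExtendedS3DestinationConfiguration;

// Fragment of an extended S3 destination (bucket ARN, role ARN, buffering hints,
// and the metadata-extraction processor are omitted): records are partitioned on a
// hypothetical account_id key, delivered under a 'data/' prefix, and failed records
// go to a separate 'error/' prefix in the same bucket.
ExtendedS3DestinationConfiguration s3Destination = ExtendedS3DestinationConfiguration.builder()
        .prefix("data/account_id=!{partitionKeyFromQuery:account_id}/")
        .errorOutputPrefix("error/!{firehose:error-output-type}/")
        .dynamicPartitioningConfiguration(DynamicPartitioningConfiguration.builder()
                .enabled(true)
                .build())
        .build();

In a real delivery stream, the partitionKeyFromQuery namespace refers to keys extracted from each record by Firehose's built-in JQ-based metadata extraction, which has to be enabled alongside dynamic partitioning.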
Another feature of Firehose is that it can store incoming data in a backup S3 Bucket before transforming or format conversion. Having a backup of your data in case something goes wrong can always save lives.
Basic Architecture With Kinesis
Above, you can see the structure that forms the backbone of our new system. Our main motivation was to write the incoming data to an S3 bucket securely and in the desired format, without being stuck on limits. With cloud and streaming technology, we no longer lose any of the high-load incoming data; the system scales automatically, requires no maintenance, and the data is stored in a secure environment.
In the following posts, we will explain how we repartition, query, and store this data and how the expired data is deleted. Follow us…