Serverless Streaming At Scale with Azure SQL

Kappa and Lambda architectures with a post-relational touch, to create the perfect blend for near-real-time IoT and analytics.

Implement a Kappa or Lambda architecture on Azure using Event Hubs, Stream Analytics, and Azure SQL to ingest at least 1 billion messages per day into a 16-vCore database.

Long Story Short

Serverless: This is the way

Kappa and Lambda Architectures

Ingesting data with Event Hubs

Partitioning is King

# Assumes `stream` is an existing Spark structured streaming DataFrame.
import uuid
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# UUID generator for the eventId column; a plain Python UDF is assumed here.
generate_uuid = F.udf(lambda: str(uuid.uuid4()), StringType())

stream = (stream
    .withColumn("deviceId", ...)
    .withColumn("deviceSequenceNumber", ...)
    .withColumn("type", ...)
    .withColumn("eventId", generate_uuid())
    .withColumn("createdAt", F.current_timestamp())
    .withColumn("value", F.rand() * 90 + 10)
    # deviceId doubles as the Event Hubs partition key, so all events from a device land in the same partition
    .withColumn("partitionKey", F.col("deviceId"))
)

Throughput Units

Decisions

Rule of thumb: create a number of partitions equal to the number of Throughput Units you have, or expect to have in the future. For example, 1 billion messages per day is roughly 11,600 messages per second; with each standard Throughput Unit rated at 1 MB/sec or 1,000 events/sec of ingress, that calls for roughly 12 Throughput Units, and therefore at least as many partitions.

Processing Data with Stream Analytics

Stream Analytics supports a powerful SQL-like declarative language: tell it what you want, and it will figure out how to do it, fast.

SELECT
    [user],
    feature,
    DATEDIFF(
        second,
        LAST(Time) OVER (
            PARTITION BY [user], feature
            LIMIT DURATION(hour, 1)
            WHEN Event = 'start'
        ),
        Time
    ) AS duration
FROM
    input TIMESTAMP BY Time
WHERE
    Event = 'end'

Embarrassingly parallel jobs
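
When the Event Hubs input, the query steps, and the output are all partitioned the same way, the job is embarrassingly parallel: every partition flows through its own independent pipeline with no data shuffling between nodes, which lets the job put additional Streaming Units to effective use. A minimal sketch of a pass-through query that preserves the Event Hubs partitioning; the input and output names are assumptions, and the explicit PARTITION BY is needed on older compatibility levels:

SELECT
    *
INTO
    [sqloutput]
FROM
    [eventhubinput] PARTITION BY PartitionId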

Streaming Units

Storing and Serving Data with Azure SQL

With Azure SQL you can do both schema-on-read and schema-on-write, via native JSON support

CREATE TABLE [dbo].[rawdata]
(
    [BatchId] [UNIQUEIDENTIFIER] NOT NULL,
    [EventId] [UNIQUEIDENTIFIER] NOT NULL,
    [Type] [VARCHAR](10) NOT NULL,
    [DeviceId] [VARCHAR](100) NOT NULL,
    [DeviceSequenceNumber] [BIGINT] NOT NULL,
    [CreatedAt] [DATETIME2](7) NOT NULL,
    [Value] [NUMERIC](18, 0) NOT NULL,
    [ComplexData] [NVARCHAR](MAX) NOT NULL,
    [EnqueuedAt] [DATETIME2](7) NOT NULL,
    [ProcessedAt] [DATETIME2](7) NOT NULL,
    [StoredAt] [DATETIME2](7) NOT NULL,
    [PartitionId] [INT] NOT NULL
)
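
The [ComplexData] column keeps the original JSON payload, so it can be queried schema-on-read whenever a new question comes up, without changing the table. A minimal sketch, assuming the payload carries a hypothetical temperature attribute and a tags array:

-- Schema-on-read over the JSON payload; 'temperature' and 'tags' are hypothetical attributes.
SELECT
    [EventId],
    [DeviceId],
    JSON_VALUE([ComplexData], '$.temperature') AS Temperature,
    T.[value] AS Tag
FROM
    dbo.[rawdata]
CROSS APPLY
    OPENJSON([ComplexData], '$.tags') AS T
WHERE
    ISJSON([ComplexData]) = 1
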
SELECT TOP(100)
    [EventId],
    [Type],
    [Value],
    [ComplexData],
    DATEDIFF(MILLISECOND, [EnqueuedAt], [ProcessedAt]) AS QueueTime,
    DATEDIFF(MILLISECOND, [ProcessedAt], [StoredAt]) AS ProcessTime,
    [StoredAt]
FROM
    dbo.[rawdata2]
WHERE
    [DeviceId] = 'contoso://device-id-471'
AND
    [PartitionId] = 0
ORDER BY
    [DeviceSequenceNumber] DESC

End-to-end ingestion latency is around 300 msec.

Sizing Azure SQL database for ingestion at scale

As expected, Azure SQL handled the additional workload without breaking a sweat.

The deployed solution can handle 1 billion messages a day, with an average processing latency of less than a second.
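
One way to check how much headroom is left while the load test runs is to sample the sys.dm_db_resource_stats DMV, which Azure SQL populates roughly every 15 seconds:

-- Recent CPU, data IO and log write utilization, newest first.
SELECT TOP (20)
    [end_time],
    [avg_cpu_percent],
    [avg_data_io_percent],
    [avg_log_write_percent],
    [avg_memory_usage_percent]
FROM
    sys.dm_db_resource_stats
ORDER BY
    [end_time] DESC;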

Partitioning is King, again.

CREATE PARTITION FUNCTION [pf_af](int)
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);

-- The indexes below are created on a partition scheme built on that function;
-- mapping every partition to [PRIMARY] here is an assumption.
CREATE PARTITION SCHEME [ps_af]
    AS PARTITION [pf_af] ALL TO ([PRIMARY]);

CREATE CLUSTERED INDEX [ixc] ON [dbo].[rawdata] ([StoredAt] DESC)
    WITH (DATA_COMPRESSION = PAGE)
    ON [ps_af]([PartitionId]);

CREATE NONCLUSTERED INDEX ix1 ON [dbo].[rawdata] ([DeviceId] ASC, [DeviceSequenceNumber] DESC)
    WITH (DATA_COMPRESSION = PAGE)
    ON [ps_af]([PartitionId]);

CREATE NONCLUSTERED INDEX ix2 ON [dbo].[rawdata] ([BatchId])
    WITH (DATA_COMPRESSION = PAGE)
    ON [ps_af]([PartitionId]);

Scale-out the database

Columnstore, Switch-In and Switch-Out
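
A minimal sketch of the switch-out side, assuming a staging table [rawdata_staging] and a reporting table [rawdata_cs] with a clustered columnstore index (both hypothetical names): a cold partition is switched out of the hot ingestion table as a metadata-only operation, then bulk-loaded into the columnstore table for analytics.

-- The staging table must share the partition scheme, columns, indexes and compression of dbo.rawdata.
ALTER TABLE [dbo].[rawdata]
    SWITCH PARTITION 2 TO [dbo].[rawdata_staging] PARTITION 2;

-- Load the switched-out rows into the columnstore table used for reporting, then empty the staging table.
INSERT INTO [dbo].[rawdata_cs] WITH (TABLOCK)
SELECT * FROM [dbo].[rawdata_staging];

TRUNCATE TABLE [dbo].[rawdata_staging];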

Still not fast enough?

Conclusion

Azure SQL is a great database for IoT and HTAP workloads.

