Can Tarantool beat Redis in IoT?

Tarantool
10 min read · Jul 30, 2021

Cyberpunk 2077 is already around us. Yesterday’s fiction is today’s reality: smart devices, an ‘intelligent’ living environment, automation of manual labour, and equipment control across time and space. A complete triumph of cybernetics, with remote startup and control. Ray Bradbury, Arthur Clarke and Stanisław Lem predicted what we now call the ‘Internet of Things’.

Technologies inspire business people, engineers, inventors and visionaries to invent new applications and products. New products require appropriate technologies. This article is about how a company named Ready For Sky implemented Tarantool in-memory DBMS in a smart home solution.

What do we do?

We make smart household appliances. They can be controlled locally with buttons, or remotely via a mobile app or by voice. Smart devices can also control each other without human interaction: for example, an air conditioner or floor heating turns on when a door sensor fires. We use Tarantool to make all this work quickly and securely.

The beginning

There is a service implemented in Python 3 and Tornado (let’s call it Forwarder). It receives requests from voice assistants supported by Yandex, Google, Mail.ru, MTS and others. It responds with the appliance status: whether the kettle has turned on, whether the pressure cooker has started its program, and so on.

The Forwarder converts each request into binary commands for the appliances and saves them in Redis. The CC (Control Center) service picks the commands up via Redis pub-sub. CC is written in Java/Kotlin + RxKotlin. It forwards the commands to the equipment via an MQTT broker, using specific topics, and receives status responses through the same broker.

CC has three main objectives:

  • Monitor statuses of devices;
  • Control connections between devices and gateways which are connected to a broker;
  • Forward commands and responses.

These are the objectives of a dispatcher. The Forwarder listens to all Redis events via a subscription to the pub-sub channel: it reads the responses to commands and the device statuses, then conveys the information to the voice services in an appropriate form.

Although it looks complex, the scheme is quite simple. Redis is a bus for communicating data via pub-sub and a cache for storing device statuses. CC listens to both Redis and the MQTT broker at the same time, while the Forwarder listens to Redis.

Problems

The first one is related to CC. This service is built with RxKotlin, a framework for reactive programming. It couldn’t cope even with 400 messages per second. It had a memory leak, and we still don’t know why. The service was hard to debug: to understand what had happened to a device and its representation on the service side, we needed access to part of the CC state, and there was no way to get it.

A service coded using the RxKotlin framework for reactive programming failed to process even four hundred messages per second

We had to reboot the system via cron once every several days. As a result, we were losing device statuses and had to restore sessions and connections to devices. The load on the MQTT broker was getting out of control. All of that hurt the user experience. The last straw was that we did not have solid backend expertise in Java and Kotlin.

The second one is infrequent but sudden losses of messages from Redis. Its pub-sub pattern is not the most reliable mechanism: a client does not acknowledge the messages it receives from a channel, and Redis never redelivers a message, which leads to losses.

The third one was that device statuses, commands and responses were all stored in Redis. It also held information about which devices were assigned to which users, as well as the gateway and device bindings.

All of that had to be retrieved quickly, so a number of additional keys were used in Redis as indexes, and Redis could end up occupying all the available server memory. There were about 10 keys per device and per gateway on average. And after every CC reboot via cron, that memory had to be cleared as well.

Anyway, we could get used to all that. But a new problem emerged. We are keen to sell as many products as possible. This means we need more gateways to control smart devices via the Internet. So the load on the infrastructure is growing dramatically.

Enter

When we were looking for a solution, we knew that one big company created Tarantool. It handles enormous loads — up to 500,000 requests per second, and even higher. Nokia built a whole smart home system using Tarantool. Awesome, it turned out to be hot stuff! So we read the manuals and decided to give it a try.

Tarantool is used in a smart home system

What we liked apart from tried-and-true cases

Tarantool can be used as an application server with a database. You can write code that works closely with the data: it captures and stores events, deletes things, and runs processes. Last but not least, you can receive and save data on the spot. Very handy!
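
For instance, here is a minimal sketch of what such application-server code might look like. The space, sequence and function names are made up for illustration; this is not our production code.

local clock = require 'clock'

box.cfg{}

-- a space for incoming device events, stored right where the code runs
local events = box.schema.space.create('events', { if_not_exists = true })
events:format {
    {name = 'id',        type = 'unsigned'},
    {name = 'device_id', type = 'string'},
    {name = 'payload',   type = 'string'},
    {name = 'ts',        type = 'number'}
}
box.schema.sequence.create('event_id', { if_not_exists = true })
events:create_index('primary', {
    parts = {'id'},
    sequence = 'event_id',
    if_not_exists = true
})

-- callable remotely over the binary protocol (net.box)
function save_event(device_id, payload)
    -- box.NULL lets the sequence assign the next id
    return events:insert({ box.NULL, device_id, payload, clock.realtime() })
end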

You can connect to a running Tarantool instance and get a console with a REPL. This is useful for debugging, administering, and investigating incidents.
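
For example, a debugging session over the admin console might look roughly like this (the space and device id below are hypothetical, and tarantoolctl is assumed to be set up):

$ tarantoolctl connect localhost:3301
localhost:3301> box.space.devices:count()            -- how many device records are stored
localhost:3301> box.space.devices:get('kettle-42')   -- inspect the state of one device
localhost:3301> require('fiber').info()              -- see what the fibers are busy with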

There are built-in tools for sharding and replication. There is a very cool tool called Cartridge with the necessary data primitives, including a convenient way to classify services by role and cluster them. It would have taken us a lot of effort to do the same with bare Python.
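
To give an idea of what ‘classification of services by role’ means, here is a rough sketch of a Cartridge role module, based on the Cartridge documentation rather than on our code (as mentioned below, we were not using Cartridge yet at this point, and the role name is made up):

-- a hypothetical Cartridge role: instances with this role enabled
-- would run the device-monitoring logic of the cluster
local function init(opts)
    if opts.is_master then
        -- create spaces, start background fibers, etc.
    end
    return true
end

local function stop()
    return true
end

local function validate_config(conf_new, conf_old)
    return true
end

local function apply_config(conf, opts)
    return true
end

return {
    role_name = 'device-monitor',
    init = init,
    stop = stop,
    validate_config = validate_config,
    apply_config = apply_config,
    dependencies = { 'cartridge.roles.vshard-router' },
}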

Given all that, we made our final decision.

Writing the code

We chose not to take Cartridge with sharding and replication right away, but rather to start with very basic things and proceed step by step. Time frames were tight, about 4 or 5 months, and the budget was small. So we wrote a new traffic service from scratch with bare Tarantool. We dropped Redis at the very beginning because Tarantool had everything we needed, and even more.

However, we faced many minor troubles. For example, there are rather few Lua libraries in the open-source world, especially for working with MQTT. The native Tarantool connector didn’t fit because MQTT version 5 was required, so we had to adapt a small library called luamqtt.

Example: using luamqtt with Tarantool:

local fiber = require 'fiber'
local log = require 'log'
local mqtt = require 'mqtt'
local ioloop = require 'mqtt.ioloop'
local connector = require 'app.tnt_mqtt_connector'

-- message and error handlers are defined elsewhere in the module (omitted here)
local handle_msg, handle_error

local M = {}
local client
ioloop = ioloop.get(true, { sleep_function = fiber.sleep })

function M.subscribe(topic, qos)
    client:subscribe {
        topic = topic,
        qos = qos or 0,
        no_local = true
    }
end

function M.unsubscribe(topic)
    client:unsubscribe { topic = topic }
end

function M.publish(topic, payload, qos)
    client:publish({
        topic = topic,
        payload = payload,
        qos = qos or 0
    })
end

function M.start(opts)
    client = mqtt.client {
        id = opts.client_id,
        uri = opts.host .. ':' .. tostring(opts.port),
        clean = opts.clean or true,
        connector = connector,
        version = mqtt.v50,
        keep_alive = opts.keep_alive or 60,
        reconnect = opts.reconnect or true,
        username = opts.user,
        password = opts.pass
    }

    client:on {
        connect = function(connack)
            if connack.rc ~= 0 then
                return log.error(
                    'mqtt error: error = %s, ack = %s',
                    connack:reason_string(),
                    connack
                )
            end
            log.info('mqtt: connected')
        end,
        message = handle_msg,
        error = handle_error
    }

    ioloop:add(client)
    -- drive the luamqtt event loop from a Tarantool fiber
    fiber.create(function()
        while true do
            ioloop:iteration()
            fiber.yield()
        end
    end)
end

return M

The code doesn’t show the handle_msg and handle_error functions, but it’s clear what they are for. The tnt_mqtt_connector module is a modified version of the connector that ships with the library for working over LuaSocket. luamqtt is a plain and friendly library that can be used in different Lua environments.
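
As a purely hypothetical illustration of what such handlers could look like (this is not the code from our module; the routing logic is the application-specific part that is omitted):

local log = require 'log'

-- 'client' here is the module-level luamqtt client from the listing above
local function handle_msg(msg)
    -- msg carries the topic and the binary payload coming from a device
    log.info('mqtt message: topic = %s, %d bytes', msg.topic, #msg.payload)
    client:acknowledge(msg)  -- confirm receipt for QoS > 0 messages
    -- ...parse the payload and update the device state in a space...
end

local function handle_error(err)
    log.error('mqtt client error: %s', err)
end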

We had to think about how to process and respond to messages. It’s not really handy to write code where every incoming request creates a new fiber that does something and calls fiber.sleep(t) for a timeout. This is not suitable for networking services that convert and route messages: their operation does not always fit the ‘request-response’ scheme. With a large number of fibers, Tarantool’s throughput could also decrease. We ran a small experiment to check this: we created many fibers that sleep on a blocking condition and measured the actual duration of fiber.sleep in the main fiber. Along the way, we found an error.

local fiber = require 'fiber'
local clock = require 'clock'
local log = require 'log'

local LIMIT = 32000

local lock = fiber.cond()
local counter = 0

-- create LIMIT fibers that block on the condition variable
local t1 = clock.realtime()
for i = 1, LIMIT do
    fiber.create(function()
        lock:wait()
        counter = counter + 1
    end)
end

local t2 = clock.realtime()

log.info('fibers created, took %s seconds', t2 - t1)

-- measure how accurate fiber.sleep is while the fibers are parked
t1 = clock.realtime()
fiber.sleep(10)
t2 = clock.realtime()
log.info('real sleep time: %s seconds', t2 - t1)

-- wake everyone up and see how long it takes
t1 = clock.realtime()
lock:broadcast()
fiber.yield()
t2 = clock.realtime()

log.info('fibers unlocked, counter is %s; took %s seconds', counter, t2 - t1)

Let’s first have a look at how the script works with LIMIT = 32000.

Everything works so far: no errors, and fiber.sleep() is reasonably accurate:

fibers created, took 0.32585644721985 seconds
real sleep time: 9.6747360229492 seconds
fibers unlocked, counter is 32000; took 0.026816844940186 seconds

Let’s try to create more fibers… For example, LIMIT = 50000. The result is somewhat discouraging:

SystemError fiber mprotect failed: Cannot allocate memory
SystemError fiber mprotect failed: Cannot allocate memory
fatal error, exiting the event loop

Changing the memory setting (memtx_memory) didn’t help; fiber stacks are not allocated from the memtx arena. We observed this in Tarantool 2.5 and 2.6. The error itself is quite logical: memory is limited. Therefore, you should avoid the uncontrolled creation of numerous fibers that sleep en masse and occupy memory. If you have to do this, it is advisable to wrap the fiber.create call in pcall.
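
A minimal sketch of such a guard, with a helper name we made up for illustration:

local fiber = require 'fiber'
local log = require 'log'

-- wrap fiber creation so that a stack allocation failure is reported
-- instead of killing the caller
local function try_spawn(fn, ...)
    local ok, fib = pcall(fiber.create, fn, ...)
    if not ok then
        log.error('failed to create fiber: %s', fib)
        return nil
    end
    return fib
end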

We had to switch our thinking promptly to finite-state machines. Timeouts for particular states of a finite-state machine can be implemented with a queue and expiration. Fibers are cool, but they should be used wisely.
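
We won’t go into our exact setup here, but as a sketch of the ‘queue and expiration’ idea: with the tarantool/queue module, a task can be scheduled with a delay and a time-to-live (the tube name and payload below are made up):

local queue = require 'queue'

-- a fifottl tube supports per-task delay and ttl
local timeouts = queue.create_tube('fsm_timeouts', 'fifottl', { if_not_exists = true })

-- schedule a check: the task becomes ready in 5 seconds and expires after 60
timeouts:put({ device_id = 'kettle-42', expect = 'pong' }, { delay = 5, ttl = 60 })

-- a worker fiber takes ready tasks and runs the timeout logic
local task = timeouts:take(1)  -- wait up to 1 second for a ready task
if task ~= nil then
    -- task[1] is the task id, task[3] is the payload
    timeouts:ack(task[1])
end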

In our case, timeouts are organized quite simply. For each device, we store the state and the sending time of the last message in a space (a table). If a response is expected from a device (for example, to a ping), we retrieve by index all the devices for which the timeout logic needs to run. This is roughly how it looks:

local clock = require 'clock'
local fiber = require 'fiber'

local states = {
    OFFLINE = 0,
    ONLINE = 1,
    OPENING = 2,
    READY = 3,
    WAITING = 4
}

-- create space for user devices
local devices = box.schema.space.create('devices', {
    engine = 'memtx',
    if_not_exists = true
})
devices:format {
    {name = 'id', type = 'string'},
    {name = 'state', type = 'number'},
    {name = 'ts', type = 'number', is_nullable = true}
}

-- primary index on the device id (required before any secondary index)
devices:create_index('primary', {
    parts = {'id'},
    if_not_exists = true
})

-- create index for timeouts and states
devices:create_index('pulse', {
    parts = {'state', 'ts'},
    type = 'tree',
    unique = false,
    if_not_exists = true
})

-- returns a pass-through mapper that yields every `ticks` records
local function auto_yield(ticks)
    ticks = ticks or 512
    local count = 0
    return function(...)
        count = count + 1
        if count % ticks == 0 then
            fiber.yield()
        end
        return ...
    end
end

local function get_expired(state, timeout)
    local now = math.floor(clock.realtime())
    return box.space.devices.index.pulse
        :pairs({states[state], now - timeout},
               {iterator = box.index.LT})
        :map(auto_yield())
        :take_while(function(device)
            return device.state == states[state]
        end)
        :filter(function(device)
            return now - device.ts > timeout
        end)
end

The get_expired function is invoked every half-second inside a fiber. The auto_yield() function allows interrupting the iteration and passing control to other fibers. Iterators perform very well in Tarantool, but there is one caveat: when iterating over a large piece of data, there is a risk of blocking all the other fibers and Tarantool processes for some time. Iterators don’t yield control by themselves, so it has to be done manually. That’s why we came up with the auto_yield trick: the function calls fiber.yield every 512 records.
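
The polling fiber itself is not shown above; here is a minimal sketch of how it might look, reusing the states table and the devices space from the listing. The WAITING state, the 30-second timeout, and the handler body are made-up examples, not our actual FSM logic:

local fiber = require 'fiber'

-- hypothetical timeout handler: what exactly happens on expiry
-- (marking the device offline, re-pinging, notifying the Forwarder) is up to the FSM
local function handle_timeout(device)
    box.space.devices:update(device.id, {{'=', 'state', states.OFFLINE}})
end

-- every half-second, walk over the devices whose state has expired
fiber.create(function()
    while true do
        get_expired('WAITING', 30):each(handle_timeout)
        fiber.sleep(0.5)
    end
end)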

We redesigned the service that processes requests from voice assistants so that it works with Tarantool directly. We used the asynctnt library to access Tarantool from Python/asyncio, plus a wrapper around the same library for working with a queue. This way, we got a compact and simple service for controlling smart appliances. It is already in production.

Bottom line

We don’t have to reboot the service via cron every night, and we no longer worry about memory leaks. Users have less trouble, and we handle problems much more easily now: we can connect to the Tarantool instance and sort the issue out on the spot.

The system is now more responsive and faster. There is no excessive message forwarding between services, and requests complete quicker. Equipment control via voice assistants is now more stable, so testers and technical support have much less trouble. Same as developers :)

The old CC could consume 9 gigabytes of server memory within a few days. The average CPU load was around 40% on all four cores, with about three thousand smart devices connected at a time. Now the maximum number of devices online approaches four thousand.

The new CC on Tarantool produces only 7% CPU load. As for memory, it consumes 1 gigabyte or less, even under peak load.

But the main thing is that users have stopped complaining that they can’t control their devices. The remaining isolated cases are caused by things like radio interference, when Bluetooth doesn’t work properly.

The most interesting part is still ahead. We haven’t done clustering and sharding of the traffic service yet, and we are not using Cartridge yet. We still need to migrate some of the services written in Python/Tornado to Tarantool; that’s when we are going to need Cartridge and clustering.

Thank you for reading. Here are two useful links if you are interested.
Watch this video to get a better understanding of Tarantool.
Join the Telegram group to get free 24/7 consultation and support for open-source Tarantool.
