Scaling a Flume agent to handle 120K events/sec
Apache Flume is a distributed service for collecting large amounts of data, such as logs. A Flume agent has three components: source, channel, and sink. In simple terms, data flows from the source to the sink through the channel: the source produces the data, the channel buffers it, and the sink writes the data to storage. We can increase the throughput of a Flume agent using the following methods.
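For example, a minimal agent definition wiring one source to one sink through a memory channel looks like this (agent and component names are illustrative):

```
agent.sources = src1
agent.channels = ch1
agent.sinks = snk1

agent.sources.src1.type = netcat
agent.sources.src1.bind = localhost
agent.sources.src1.port = 44444
agent.sources.src1.channels = ch1

agent.channels.ch1.type = memory

agent.sinks.snk1.type = logger
agent.sinks.snk1.channel = ch1
```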
Batching
We can tune the batchSize and batchDurationMillis properties of the source and sink to increase throughput. The right values for these properties depend on the type of source/sink and the ingestion latency we can afford. By tuning the batch size and transaction capacity we were able to reach 15K events/second. For more details on these properties, follow this blog.
<agent_name>.sinks.<sink_name>.batchSize = 10000
<agent_name>.sinks.<sink_name>.batchDurationMillis = 10000
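Since a sink takes a whole batch from the channel within one transaction, the channel's capacity and transactionCapacity must be at least as large as the batch size; for example (values are illustrative):

```
<agent_name>.channels.<channel_name>.capacity = 100000
<agent_name>.channels.<channel_name>.transactionCapacity = 10000
```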
Sink Parallelization
In our case, the sink became the bottleneck of the pipeline due to its slower processing rate. We tried to increase the throughput of the sink with the following approaches.
Sink Group
A sink group allows us to group multiple sinks as one. We configured a sink group with the load-balancing sink processor to write events through multiple sinks, but the throughput remained the same. After checking the forums, we found that only one sink in a sink group is active at a time, so performance is the same as with a single sink.
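For reference, a sink group with the load-balancing processor is configured like this (names follow the examples in this post):

```
agent.sinkgroups = group1
agent.sinkgroups.group1.sinks = sink1 sink2
agent.sinkgroups.group1.processor.type = load_balance
agent.sinkgroups.group1.processor.selector = round_robin
```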
Multiple Sinks
Instead of attaching a sink group to the channel, we connected the sinks directly to the channel. Throughput should increase since each sink runs in its own thread, but we didn't see any drastic improvement. After further testing we found that the channel had become the bottleneck, since multiple sinks were competing for a single channel.
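This setup simply points every sink at the same channel:

```
agent.sinks = sink1 sink2
agent.sinks.sink1.channel = channel1
agent.sinks.sink2.channel = channel1
```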
Multiple Sinks — Multiple Channels
We configured multiple channels and attached a sink to each channel. A replicating channel selector was used to copy the events to all channels. Throughput increased linearly with every channel-sink pair. Now we had to find a way to distribute, rather than copy, events across multiple channels.
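The test setup for this experiment pairs each sink with its own channel and fans the source out to all of them:

```
agent.channels = channel1 channel2
agent.sources.source1.channels = channel1 channel2
agent.sources.source1.selector.type = replicating
agent.sinks.sink1.channel = channel1
agent.sinks.sink2.channel = channel2
```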
Multiplexing Channel Selector
Flume natively supports two channel selectors: replicating and multiplexing. The replicating channel selector cannot be used here, since it copies the same event to every channel. With the multiplexing channel selector we can map events to a particular channel based on the value of a specific header. This approach needs a uniformly distributed key, and the number of channels is limited to the cardinality of that key. It is hard to find such a key for every event stream.
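A multiplexing selector is configured by mapping header values to channels; the header name and values below are hypothetical:

```
agent.sources.source1.selector.type = multiplexing
agent.sources.source1.selector.header = region
agent.sources.source1.selector.mapping.us = channel1
agent.sources.source1.selector.mapping.eu = channel2
agent.sources.source1.selector.default = channel1
```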
Round-Robin Channel Selector
To overcome the limitations of the multiplexing channel selector, we developed a Round-Robin channel selector, which distributes events across all channels in round-robin fashion.
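In Flume, a custom channel selector typically extends org.apache.flume.channel.AbstractChannelSelector and returns, for each event, the list of required channels. The heart of a round-robin selector is just a counter advanced on every event. Below is a minimal, Flume-independent sketch of that idea; the class and method names are our own for illustration, not taken from the repository linked later:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of round-robin selection: each call to pick()
// returns the next item in the list, wrapping around at the end.
class RoundRobinPicker<T> {
    private final List<T> items;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinPicker(List<T> items) {
        this.items = items;
    }

    T pick() {
        // floorMod keeps the index non-negative even if the counter overflows
        int i = Math.floorMod(next.getAndIncrement(), items.size());
        return items.get(i);
    }
}
```

In the real selector, the items would be the configured channels, and pick() would be called once per event so consecutive events land on different channels. The AtomicInteger makes the counter safe if the source delivers events from multiple threads.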
Build Round-Robin Channel Selector
git clone https://github.com/saravsars/flume-round-robin-channel-selector.git
cd flume-round-robin-channel-selector
mvn clean package
cd target
cp flume-round-robin-channel-selector-1.0.jar $FLUME_CLASSPATH/
Configure Round-Robin Channel Selector
agent.sources = source1
agent.sinks = sink1 sink2
agent.channels = channel1 channel2
agent.sources.source1.channels = channel1 channel2
agent.sinks.sink1.channel = channel1
agent.sinks.sink2.channel = channel2
agent.sources.source1.selector.type = com.sars.flume.RoundRobinChannelSelector
After enabling batching and the round-robin channel selector, we were able to achieve 120K events/second on an eight-core machine.
Thank you for reading!