Santander’s Big Data Architecture
Notes from Cloudera’s blog on Santander’s architecture, built with HBase, Flume, Kafka, RocksDB & Spark.
Flafka (Flume + Kafka) is used to enrich incoming stream data and write it out to multiple channels.
- Each Flume agent runs its own local RocksDB instance to host the metadata used for enrichment. RocksDB is refreshed hourly from output computed by Spark. Keeping RocksDB local gives higher performance by avoiding RPC calls.
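The bullet above can be sketched as follows. This is an illustrative sketch with hypothetical names: a plain dict stands in for the embedded RocksDB instance, and `refresh` models the hourly reload from Spark's batch output; the point is that lookups stay in-process, with no network round trip.

```python
class LocalEnrichmentStore:
    """Per-agent metadata store for enrichment (dict as a stand-in for RocksDB)."""

    def __init__(self):
        self._db = {}

    def refresh(self, computed_metadata):
        """Replace the store's contents with the latest batch-computed output
        (in the real pipeline, the hourly Spark job's results)."""
        self._db = dict(computed_metadata)

    def lookup(self, key, default=None):
        """Local lookup: no RPC to a remote metadata service."""
        return self._db.get(key, default)


# Hourly refresh from (hypothetical) Spark output, then fast local lookups.
store = LocalEnrichmentStore()
store.refresh({"acct-42": {"segment": "retail", "region": "EU"}})
print(store.lookup("acct-42"))  # {'segment': 'retail', 'region': 'EU'}
```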
- Custom Flume enrichment interceptors process events from upstream, transform them by looking up metadata in RocksDB, and write the results to a Kafka topic.
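A sketch of the interceptor flow described above. A real Flume interceptor is Java code implementing `org.apache.flume.interceptor.Interceptor`; here hypothetical Python functions model the same data flow: take a batch of events, join each with locally held metadata, and emit enriched records destined for a Kafka topic (a list stands in for the producer).

```python
def enrich_event(event, metadata_db):
    """Join a raw event with locally held metadata (stand-in for RocksDB)."""
    meta = metadata_db.get(event["account_id"], {})
    return {**event, **meta}


def intercept_batch(events, metadata_db, kafka_out):
    """Process a batch of upstream events and publish enriched results
    (appending to a list stands in for a Kafka producer send)."""
    for event in events:
        kafka_out.append(enrich_event(event, metadata_db))


metadata = {"acct-42": {"segment": "retail"}}
out = []  # stand-in for a Kafka topic
intercept_batch([{"account_id": "acct-42", "amount": 10.0}], metadata, out)
print(out)  # [{'account_id': 'acct-42', 'amount': 10.0, 'segment': 'retail'}]
```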
- HBase: an observer coprocessor was used to compute derived data on each put and store it under a different row key. This had several issues: 1. Since the computed result lives in a different row, atomicity cannot be guaranteed. 2. Every transaction has to read the existing computed data and update it with the latest results, which cut write throughput to roughly 1/20th.
- To avoid these problems, instead of precomputing and storing derived values, they are calculated on the fly by an endpoint coprocessor running on the HBase region server.
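The two bullets above can be contrasted in a small sketch (hypothetical names, a dict standing in for an HBase table). Precompute-on-write issues two separate writes per put, one for the raw row and one for the derived row, so they are not atomic, and it must read the old derived value first. Compute-on-read keeps only raw rows and aggregates on demand, which is roughly what an endpoint coprocessor does server-side over the rows in a region.

```python
table = {}  # stand-in for an HBase table: row key -> value

def put_precomputed(key, amount):
    """Observer-style approach: maintain a derived row alongside raw rows."""
    table[key] = amount                     # write 1: raw row
    prev = table.get("derived:total", 0)    # extra read on every put
    table["derived:total"] = prev + amount  # write 2: derived row (separate, not atomic)

def total_on_the_fly(prefix):
    """Endpoint-style approach: compute the aggregate at read time over raw
    rows only, as a server-side scan over a row range would."""
    return sum(v for k, v in table.items() if k.startswith(prefix))

put_precomputed("txn:001", 10)
put_precomputed("txn:002", 5)
print(table["derived:total"])    # 15
print(total_on_the_fly("txn:"))  # 15, without storing any derived row
```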
Note to self to read further:
- Read about local stream processing by Ted Malaska: http://blog.cloudera.com/blog/2015/06/architectural-patterns-for-near-real-time-data-processing-with-apache-hadoop/
- How do you write custom Flume interceptors?
- Read more about Flafka. Why not just use Kafka?
- HBase: observer coprocessors vs. endpoint coprocessors.