Digital Files Analytics (DFA) System's Ingestion Platform: a real-time data ingestion platform.

Here I want to introduce you to the real-time data ingestion platform used by the Digital Files Analytics (DFA) system to stream extracted data from heterogeneous sources such as images, PDFs, and videos. You can find more detail about DFA using this link.

Environment Setup:-

Installation of the DFA ingestion system is very simple. Decompress the tar and set the environment in .profile to make the scripts easier to access. Make sure you set the appropriate permissions for the user group. We are using Kafka along with ZooKeeper, so make sure you have both installed and running.
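As a concrete sketch, the .profile entries might look like the following; the variable name and install path are hypothetical placeholders, so adjust them to wherever you unpacked the tar. The jps check simply confirms that the Kafka broker and ZooKeeper processes are up.

export DFA_INGEST_HOME=/usr/local/share/dfa-ingest
export PATH=$PATH:$DFA_INGEST_HOME/bin

jps | grep -E 'Kafka|QuorumPeerMain'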
For demo purposes, create a topic named "XEROX_DOCUMENTS" with a single partition and only one replica.

kafka-topics.sh --create --zookeeper victor:2181 --replication-factor 1 --partitions 1 --topic XEROX_DOCUMENTS

We can now see that topic if we run the list topic command:

kafka-topics.sh --list --zookeeper victor:2181
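To double-check the partition count and replication factor, we can also describe the topic (assuming the same ZooKeeper host and port as above):

kafka-topics.sh --describe --zookeeper victor:2181 --topic XEROX_DOCUMENTS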

Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default, each line will be sent as a separate message.

kafka-console-producer.sh --broker-list victor:6667 --topic XEROX_DOCUMENTS

Kafka also has a command line consumer that will dump out messages to standard output.

kafka-console-consumer.sh --bootstrap-server victor:6667 --topic XEROX_DOCUMENTS --from-beginning
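As a quick end-to-end smoke test, we can pipe a line into the console producer and watch it appear in the console consumer running in another terminal. The JSON record below is only an illustrative placeholder, not the actual DFA message format:

echo '{"filename":"sample_invoice.png","source":"scanner-01"}' | kafka-console-producer.sh --broker-list victor:6667 --topic XEROX_DOCUMENTS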

Initial Configuration of Apache Flink:-

Installation is very simple. You only need to decompress the tar into the /usr/local/share/flink-1.3.2 directory and configure the shell environment variables in .profile to make it easier to use the scripts.

export FLINK_HOME=/usr/local/share/flink-1.3.2
export PATH=$PATH:$FLINK_HOME/bin
Start a Local Flink Cluster:
 $FLINK_HOME/bin/start-local.sh
 Starting jobmanager daemon on host victor.

The JobManager’s web frontend should be running. We can check it as follows:

tail $FLINK_HOME/log/flink-*-jobmanager-*.log | grep JobManager
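Another quick sanity check is to query the JobManager's web frontend directly. This assumes the default web port 8081 (jobmanager.web.port) and that the frontend is reachable on the same host; the overview endpoint returns a small JSON summary of the cluster:

curl http://victor:8081/overview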

Initial Configuration of Apache HBase

Verify that HBase is running by using the HBase Web UI on port 16010. We will keep the schema for our use case straightforward. The rowID will be the filename, and there will be two column families: "info" and "obj". The "info" column family will contain all the fields we extracted from the images. The "obj" column family will hold the bytes of the actual binary object (if it is smaller than 10 MB; otherwise it is stored on HDFS). We use the MOB path in HBase for objects larger than 10 MB. The name of the table in our case will be "dfds".

The command below will create the table and enable replication on a column family called "info". It's crucial to specify the option REPLICATION_SCOPE => '1'; otherwise the HBase Lily Indexer will not receive any updates from HBase.
The IS_MOB parameter specifies whether a column family can store MOBs, while MOB_THRESHOLD specifies how large an object has to be before it is considered a MOB.

create 'dfds', {
  NAME => 'info',
  DATA_BLOCK_ENCODING => 'FAST_DIFF',
  REPLICATION_SCOPE => '1'
},
{
  NAME => 'obj',
  IS_MOB => true,
  MOB_THRESHOLD => 10240000
}

Using the shell commands describe 'dfds' and list, we can check whether the table was created successfully.
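To confirm that writes land where we expect, we can also put a test row from the HBase shell and read it back; the row key and column value below are made-up placeholders:

put 'dfds', 'sample_invoice.png', 'info:source', 'scanner-01'
get 'dfds', 'sample_invoice.png'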

Build Leptonica and Tesseract.

For how to install Tesseract OCR, go to the URL below, where I have set up the environment on CentOS 6; the same steps are applicable here for the DFiP setup.
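As a rough outline of what that setup involves (the package list, version numbers, and source-build flow here are assumptions; follow the linked guide for the exact steps), Leptonica is built first and Tesseract is then built against it:

sudo yum install -y autoconf automake libtool gcc-c++ libpng-devel libjpeg-devel libtiff-devel zlib-devel
cd leptonica-1.74.4 && ./configure && make && sudo make install
cd ../tesseract-3.05.02 && ./autogen.sh && ./configure && make && sudo make install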

OCR stands for "Optical Character Recognition". Once we have Tesseract OCR configured, the resulting system will be able to convert images with embedded text into text files. Tesseract is licensed under the Apache License v2.0.
The "How to install" tutorial above is meant as a practical guide; it does not cover the theoretical background of OCR or the algorithms used in Tesseract. Those are treated in plenty of other documents on the web.
Tesseract is supported beautifully on Ubuntu and installs without issues (via apt-get), but on CentOS it requires some effort and the correct versions to build.
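Once the build is done, a quick way to verify the OCR step on its own is to run Tesseract against a sample image from the command line; the file names below are placeholders. Tesseract appends .txt to the output base name, so the recognized text ends up in scanned_page.txt:

tesseract scanned_page.png scanned_page
cat scanned_page.txt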