No-Code Data Collect API

Dima Statz
Published in The Startup
Dec 5, 2019

Introduction

Building a data pipeline that handles 1,000,000 or more events per second is not a trivial task. To handle such heavy traffic, every data pipeline component must be designed and implemented properly. Fortunately, not every component has to be built from scratch: the open-source community offers a number of very solid solutions. In this article, I will show how a data ingestion component, one of the most volume-sensitive parts of a data pipeline, can be built without writing a single line of code, just by using free and open-source building blocks.

Please visit the https://github.com/dimastatz/no-code-data-ingest.git repo to see all the required scripts and configurations, and give it a star if you like it.

Overview

A modern data pipeline should contain all the steps required to automate the movement, transformation, analysis and consumption of data. You can see a high-level data flow of a six-step data pipeline in the diagram below. The first step contains all data-generating sources. Anything can be a data-generating source: software systems, hardware devices or even human beings. Once data is generated, it should be ingested into the data pipeline. The second step, collect/ingest, provides all the components that collect data and store it in the data store. After data is stored, it can be processed. The main goal of data processing is to prepare the data for use by the analytics database; this is usually called ETL (extract, transform and load). When ETL is done, the data can be analyzed and then consumed by reporting systems.

The main focus of this article is the Data Collect step.

Data Pipeline Steps

Data Collect/Ingest

Data can be ingested into a data pipeline by either pull or push. We will focus here on ingesting data by push, meaning that external data sources interact with the data pipeline explicitly by sending their data to a known interface. The best example of such an interface is a REST API. Pushing events directly to the data pipeline's REST API is a simple and very fast way to ingest events, event batches or log files, and it is especially well suited to real-time use cases. In the diagram below you can see an example of such a Data Collect service. A data source generates data and sends it as HTTP requests to a DNS name, which AWS Route 53 resolves to the Nginx server's IP address.

Data Collect
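For example, pushing a single event or a whole log file boils down to plain HTTP requests. A minimal sketch with curl (the /event endpoint and its data query parameter match the Nginx configuration shown later in this article; the hostname is illustrative):

# push a single event
$ curl "https://collect.example.com/event?data=user_signed_up"

# push a log file line by line as individual events
$ while read -r line; do curl -G "https://collect.example.com/event" --data-urlencode "data=$line"; done < app.log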

The core components of the no-code Data Ingest are Nginx and Fluentd. Both are free and open-source software. Nginx is used as the web server and Fluentd as the log-shipping component.

Nginx

Inside Nginx Worker Process

Nginx is a high-performance HTTP server: it handles incoming connections, performs security checks, applies rate limits, reads request data, writes data to disk and returns an HTTP response to the client. On the diagram below you can see the processing pipeline inside an Nginx worker. The most interesting steps are the ones marked with exclamation-point icons. If you want to handle HTTP requests with your own code, you should add an upstream configuration; for example, uWSGI with Python and Flask/Django, or FastCGI and C++, will get the job done. In our case, we are looking for a no-code solution, so no upstream is used. Instead, all incoming data is collected in the form of Nginx access logs (the 'access.log' file).

Inside Nginx Worker

Nginx Configuration

One of the most convenient ways to start with Nginx is to use the official Nginx Docker image. Here is our pretty simple Dockerfile: we use the Nginx image from Docker Hub, remove the default Nginx configuration files and copy in data-ingest.conf:

FROM nginx
RUN rm /etc/nginx/conf.d/*
ADD data-ingest.conf /etc/nginx/conf.d/

The second file is data-ingest.conf, the Nginx configuration. The full file is available on GitHub; here I would like to explain its most interesting section, the /event location. With the configuration below, Nginx handles GET requests to /event and logs the request data to /var/log/nginx/events_data.log:

location /event {
    default_type text/plain;
    add_header Allow "GET" always;
    access_log /var/log/nginx/events_data.log getdata;
    if ( $request_method !~ ^(GET)$ ) {
        return 405 'GET allowed only\n';
    }
    return 200 '$time_local\nURI: $request_uri\n:$request_id\n';
}
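Before wiring everything together, you can build the image and let Nginx check the configuration syntax. A minimal sketch, assuming the repository layout described later in this article and that the getdata log format referenced by access_log is defined elsewhere in the full data-ingest.conf (the image tag is illustrative):

# build the Nginx image from the Dockerfile above
$ docker build -t data-ingest-nginx ./nginx

# let Nginx validate the configuration inside the container
$ docker run --rm data-ingest-nginx nginx -t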

Fluentd

Fluentd is a cross-platform, open-source data collection project originally developed at Treasure Data. It is comparable to other log shippers such as Logstash, Apache Flume and rsyslog, but it has a clear advantage when it comes to the number of plugins available.

As with Nginx, the best way to start with Fluentd is to use one of its Docker images. In the attached Fluentd Dockerfile section (the full Dockerfile is here) you can see that the Apache Kafka and AWS S3 plugins are added.

RUN apk update \
 && apk add --no-cache \
      ca-certificates \
      ruby ruby-irb \
      su-exec==${SU_EXEC_VERSION}-r0 \
      dumb-init==${DUMB_INIT_VERSION}-r0 \
 && apk add --no-cache --virtual .build-deps \
      build-base \
      ruby-dev gnupg \
 && echo 'gem: --no-document' >> /etc/gemrc \
 && gem install oj -v 2.18.3 \
 && gem install json -v 2.1.0 \
 && gem install fluentd -v 0.14.25 \
 && gem install fluent-plugin-kafka --no-document \
 && gem install fluent-plugin-s3 -v 1.0.0 --no-document \
 && apk del .build-deps \
 && rm -rf /tmp/* /var/tmp/* /usr/lib/ruby/gems/*/cache/*.gem
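To confirm that both plugins made it into the image, you can build it and list the installed gems (the image tag is illustrative):

# build the Fluentd image from the Dockerfile above
$ docker build -t data-ingest-fluentd ./fluentd

# list the installed Fluentd plugins inside the container
$ docker run --rm data-ingest-fluentd gem list | grep fluent-plugin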

Fluentd configuration files allow users to control input and output behavior. For the Data Collect API, Fluentd is configured with a single data source, the Nginx log file, and three output stores: stdout, a local file and AWS S3. With this configuration, Fluentd tails /var/log/nginx/events_data.log and writes all rows as-is (parse type none) to the output stores.

<source>
  @type tail
  path /var/log/nginx/events_data.log
  pos_file /var/log/fluentd/td-agent/access.log.pos
  tag nginx.access
  <parse>
    @type none
  </parse>
</source>

<match nginx.access>
  @type copy
  <store>
    @type stdout
  </store>
  <store>
    @type file
    path /var/log/fluentd
  </store>
  <store>
    @type s3
    # aws key_id and key are optional for EC2 with an IAM role
    #aws_key_id YOUR_AWS_KEY_ID
    #aws_sec_key YOUR_AWS_SECRET_KEY

    # change bucket name and region
    s3_bucket e2e-test-io
    s3_region us-west-2
    path logs/
    s3_object_key_format %{path}%{time_slice}_%{index}.%{file_extension}
    buffer_path /var/log/fluent/s3
    time_slice_format %Y%m%d-%H-%M
    time_slice_wait 1m
    utc
    format json
  </store>
</match>
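Fluentd can parse and validate a configuration file without starting the pipeline via its --dry-run flag. A quick sanity check, reusing the image built above and mounting the local fluent.conf (run from the fluentd folder; paths and image tag are illustrative):

# check the configuration without starting the pipeline
$ docker run --rm -v "$(pwd)/fluent.conf:/fluentd/etc/fluent.conf" --entrypoint fluentd data-ingest-fluentd --dry-run -c /fluentd/etc/fluent.conf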

When writing output to AWS S3, Fluentd creates one S3 object per minute: the time_slice_format %Y%m%d-%H-%M slices the data by minute, and time_slice_wait 1m tells Fluentd how long to wait for late events before flushing a slice. Object names follow that pattern, for example 20191203-17-01_0.gz.

Run it

To get a general feeling of how the Data Collect API works, you can run it locally. Just clone the following repository:

$ git clone https://github.com/dimastatz/no-code-data-ingest.git

Ensure that you have all the required prerequisites on your machine: git, docker and docker-compose.

After cloning the repo, navigate to the no-code-data-ingest folder and add execute permissions to run_all.sh and stop_all.sh. For example, you can do it as follows:

$ cd no-code-data-ingest/
$ chmod -R 777 *.sh

After that, navigate to the no-code-data-ingest/fluentd folder and edit fluent.conf to set your AWS region and bucket.

Now you can run the run_all.sh script:

$ ./run_all.sh

This bash script prepares your local machine by setting up all the necessary folders and permissions and then runs docker-compose. The docker-compose.yml file defines the Nginx and Fluentd services and mounts a shared volume at /tmp/data-collector:

version: '3'
services:
  nginx:
    build: nginx/.
    volumes:
      - /tmp/data-collector:/var/log/
    ports:
      - "8080:8080"
  fluentd:
    build: fluentd/.
    volumes:
      - /tmp/data-collector:/var/log/
    ports:
      - "8888:24224"

After running run_all.sh, you will see in your terminal that Docker builds Nginx and Fluentd images.

Building nginx
Step 1/3 : FROM nginx
latest: Pulling from library/nginx
000eee12ec04: Pull complete
eb22865337de: Pull complete
bee5d581ef8b: Pull complete
Digest: sha256:50cf965a6e08ec5784009d0fccb380fc479826b6e0e65684d9879170a9df8566
Status: Downloaded newer image for nginx:latest
---> 231d40e811cd
Step 2/3 : RUN rm /etc/nginx/conf.d/*
---> Running in d1d40abbd03c
Removing intermediate container d1d40abbd03c
---> 89a1280987b5
Step 3/3 : ADD data-ingest.conf /etc/nginx/conf.d/
---> eee6692acf94
Successfully built eee6692acf94
Successfully tagged no-code-data-ingest_nginx:latest
Building fluentd
Step 1/18 : FROM alpine:3.5
3.5: Pulling from library/alpine

After the build process finishes, docker-compose runs the Nginx and Fluentd containers, and you will see that Fluentd is running and following the tail of Nginx's events_data.log:

fluentd_1  | 2019-12-04 13:15:25 +0000 [info]: #0 starting fluentd worker pid=14 ppid=6 worker=0
fluentd_1 | 2019-12-04 13:15:26 +0000 [info]: #0 following tail of /var/log/nginx/events_data.log
fluentd_1 | 2019-12-04 13:15:26 +0000 [info]: #0 fluentd worker is now running worker=0
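You can also check the state of both services and follow the container logs with docker-compose directly:

# show the state of the nginx and fluentd services
$ docker-compose ps

# follow the Fluentd container logs (Ctrl-C to stop)
$ docker-compose logs -f fluentd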

Now you can perform end-to-end tests. Open a new terminal and run the following curl command

$ curl localhost:8080/event?data=event_content

Curl performs an HTTP GET request to the local HTTP service on port 8080. Nginx handles the request and writes it to the events_data.log file. After that, Fluentd ships the new events_data.log rows to stdout:

2019-12-04 13:26:25.935724194 +0000 nginx.access: {"message":"04/Dec/2019:13:26:25 +0000, GET, data=event_content, 264, 200"}
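Because both containers mount the shared /tmp/data-collector volume, you can also inspect the raw Nginx log and the local copy written by the file store directly on the host (the exact file names under the fluentd path depend on Fluentd's buffering and flush timing):

# raw Nginx access log for the /event location
$ tail /tmp/data-collector/nginx/events_data.log

# local copies written by the file output in fluent.conf
$ ls -l /tmp/data-collector/fluentd*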

Fluentd will also ship logs to AWS S3:

$ aws s3 ls e2e-test-io/logs/
2019-12-03 17:01:46 118 20191203-17-01_0.gz
2019-12-03 17:04:01 119 20191203-17-02_0.gz
2019-12-03 17:06:02 98 20191203-17-04_0.gz
2019-12-03 17:07:02 92 20191203-17-05_0.gz
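Each object is a gzipped file of JSON lines (format json in fluent.conf), so you can pull the newest one down and inspect its contents with the AWS CLI (bucket and prefix follow the configuration above):

# copy the newest object to stdout and decompress it
$ latest=$(aws s3 ls s3://e2e-test-io/logs/ | sort | tail -n 1 | awk '{print $4}')
$ aws s3 cp "s3://e2e-test-io/logs/$latest" - | gunzip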

That’s all: you now have a working Data Collect API that ships logs to AWS S3. By editing fluent.conf you can add any output store of your choice: Apache Kafka, AWS Kinesis, Elasticsearch, Splunk, BigQuery and many others.

Performance

NGINX is well known for its great performance. Nginx uses an asynchronous, event-driven architecture to cope with very high loads, and it handles loads that vary wildly while keeping RAM usage, CPU usage and latency predictable. Benchmarks on https://openbenchmarking.org/ show that Nginx can handle up to about 10K requests per second (RPS) per core. A more optimistic benchmark from https://www.nginx.com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/ shows up to 40K RPS on a single core and about 1.2M RPS on 32 cores. The exact numbers are, of course, a matter of configuration and setup details, but the bottom line is that a target of 1M RPS is achievable.
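If you want a rough feel for these numbers on your own machine, you can point an HTTP load generator at the local setup, for example wrk (not part of this setup, install it separately; thread and connection counts are illustrative):

# 4 threads, 100 open connections, 30 seconds against the local /event endpoint
$ wrk -t4 -c100 -d30s "http://localhost:8080/event?data=load_test"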

Conclusion

The no-code development approach is growing in popularity as the number and quality of open-source building blocks increase and, at the same time, as companies deal with a limited supply of software developers. In this example, we saw that data ingestion for high-load pipelines can be built by writing zero lines of code, and that all the building blocks involved are completely free of charge.
