No code — way forward for development

Mani sankar J
Capillary Technologies
10 min read · Jun 10, 2022

Capillary Technologies is a B2B SaaS platform that provides intelligent, technology-driven solutions to help businesses drive growth and build long-lasting relationships with their customers. One of the crucial parts of successfully delivering this is the initial onboarding of customers (referred to as brands from here on) onto our platform. This series of blog posts talks about how we built an internal No-Code data integration tool called Connect+ to simplify this process.

What is a No-Code data integration tool?

Simply put, an integration tool is software that facilitates data ingestion from a source to a destination. This may include cleansing, transforming, and mapping the data in a way that the destination system understands. Think of it as an adapter between the source and the destination. And if the user can achieve this data integration seamlessly without writing code, it is a No-Code data integration tool.

Why build this?

Each brand has its own way of processing data at its end, so there isn't a single fixed way that data can be ingested into our platform. Source data can arrive in various formats such as CSV, JSON, archived files, etc., and can reside in various storage types like an FTP server, AWS S3, Azure Blob, etc. The destination can be either the Capillary platform or any other external service. A few brands may need data massaging (cleansing) or re-mapping of their data, and a few others may need a complete transformation such as decryption.

Earlier, we used to write a separate integration service for each brand, catering to its specific use case. While this was fine initially, as the platform gained more and more brands it became increasingly complex to maintain and monitor. We noticed that we had more than 25 such brand integrations actively running, amounting to more than 100 dataflows processing at least one file every day.

Another drawback is that the time and energy spent on writing a separate integration service for every brand can be significant. The turnaround time for data integration therefore increases, delaying the onboarding process. We also observed a lot of commonality across these integrations, with minor differences in specific areas.

Wait, Isn’t that the same thing?

Hence we wanted to build a tool that is relatively easy to maintain and can be used by brand PoCs (non-technical teams) to create these data integrations quickly.

Key Design Decisions

  • No coding knowledge is required for the person creating these integrations
  • Access control for the tool — basic roles such as ADMIN and USER
  • Templatizing frequently used types of integrations — workflows can be created easily from predefined templates
  • Ability to scale well in the future to process multiple datasets in parallel
  • Easy monitoring, alerting, and error tracing for the tool
  • Simple reporting on the workflows that have been created

A Few Examples of Commonly Used Flows

  • Pull a CSV file from SFTP → Ingest into Capillary platform using APIs
  • Pull a Zip file from S3 → Extract files → Join them based on a common header → Put them in an SFTP location
  • Pull files from FTP → Join them → Ingest into Capillary platform using APIs → Get error report of failed records due to validation

In the next section, we will look at Apache NiFi and how we leveraged that project to build Connect+.

What is Flow-Based Programming?

According to Wikipedia, flow-based programming (FBP) is a programming paradigm that defines applications as networks of “black box” processes, which exchange data across predefined connections by message passing, where the connections are specified externally to the processes. These black-box processes can be reconnected endlessly to form different applications without having to be changed internally.

Simply put, it is a network of different components, each performing a specific task, connected to one another to achieve the desired data processing. Conventional programming generally produces synchronous flows with tight coupling between modules, and taking that approach would again make the various flows increasingly complex to maintain because of long-running processes, resource allocation, and so on. Flow-based programming, on the other hand, enables asynchronous processing of datasets, reuse of components as building blocks, and efficient resource allocation.

Example of flow-based programming for parsing a log file and alerting on a particular error
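
To make the idea concrete, here is a minimal, illustrative sketch of flow-based programming in plain Java (not NiFi code, and not taken from the article; all class and variable names are invented): two "black box" components connected only by queues, one parsing log lines and the other alerting on a particular error, mirroring the figure above.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A tiny flow-based-programming sketch: each component sees only its
// input/output queues (the "connections"), never the other components.
public class LogAlertFlow {

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> rawLines = new LinkedBlockingQueue<>();
        BlockingQueue<String> errorLines = new LinkedBlockingQueue<>();

        // Component 1, the "parser": reads raw lines and forwards only ERROR lines.
        Thread parser = new Thread(() -> {
            try {
                while (true) {
                    String line = rawLines.take();
                    if (line.contains("ERROR")) {
                        errorLines.put(line);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Component 2, the "alerter": consumes error lines and raises an alert.
        Thread alerter = new Thread(() -> {
            try {
                while (true) {
                    System.out.println("ALERT: " + errorLines.take()); // stand-in for a real alerting call
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        parser.setDaemon(true);
        alerter.setDaemon(true);
        parser.start();
        alerter.start();

        // Feed some sample log lines into the network.
        for (String line : List.of("INFO started", "ERROR disk full", "INFO done")) {
            rawLines.put(line);
        }
        Thread.sleep(500); // give the components a moment to process before the JVM exits
    }
}
```

Reconnecting the same components differently (for example, adding a second consumer on the error queue) changes the application without changing the components themselves, which is exactly the property FBP is after.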

We evaluated a few services along with Apache NiFi and based on the key design decisions mentioned in the previous part, we decided to go ahead with NiFi. The following table shows a high-level evaluation of the services.

Evaluation Matrix for StreamSets, Nexia, and Apache NiFi

Apache NiFi:

Apache NiFi is a software project from the Apache Software Foundation, designed around FBP concepts to automate the flow of data between software systems. It comes with many features that help us create data integrations with ease, including a user interface where workflows can be built by simply dragging and dropping components and creating connections between them. NiFi ships with a bundle of processors out of the box that cater to many use cases, such as fetching a file from FTP or inserting records into a database. Developers can also create their own custom processors if none of the existing processors serve their purpose (a minimal sketch follows the terminology list below).

NiFi Terminology:

  • Flow File — A data object that flows through the various processors/components. It carries the data as well as attributes that describe the data, stored as key-value pairs.
  • Processor — A component that performs some work/task. It has access to the data and attributes of the flow files that pass through it.
  • Connection — Acts as a link between processors and enables communication between them.
  • Process Group — A combination of processors and connections that represents a workflow.
  • Flow File Repository — The repository that maintains all the flow files in the system.
  • Content Repository — A flow file doesn't hold its data within itself; it holds a reference to the content repository, where the actual data resides.
  • Provenance Repository — Every action on a flow file generates an event called a provenance event. This repository contains all such events generated by the dataflows.
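
To tie these terms together, here is a minimal, hypothetical custom processor (a sketch built on NiFi's public Java processor API, not one of our Connect+ processors): it pulls a flow file from an incoming connection, adds an attribute, and routes it to a "success" relationship.

```java
import java.util.Set;

import org.apache.nifi.annotation.documentation.CapabilityDescription;
import org.apache.nifi.annotation.documentation.Tags;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

@Tags({"example", "demo"})
@CapabilityDescription("Tags each flow file with a custom attribute and routes it to success.")
public class TagFlowFileProcessor extends AbstractProcessor {

    // A relationship is a named output that a connection can be attached to.
    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Flow files that were tagged successfully")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Set.of(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
        // Pull the next flow file from an incoming connection, if any.
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Attributes are key-value pairs carried alongside the reference to the content.
        flowFile = session.putAttribute(flowFile, "tagged.by", "TagFlowFileProcessor");
        // Hand the flow file to whatever is connected to the "success" relationship.
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```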

Architecture of NiFi

Apache NiFi architecture. Image taken from the Apache NiFi documentation.

  • NiFi runs in a JVM
  • The web server provides REST APIs that can be used to create or modify dataflows (see the sketch after this section)
  • The flow controller is responsible for running the processors
  • NiFi can also be deployed as a cluster to scale horizontally; ZooKeeper takes care of electing a cluster coordinator

Apache NiFi in cluster mode. Image taken from the Apache NiFi documentation.
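
As a quick, hedged illustration of that REST surface (not from the article; it assumes an unsecured NiFi listening on localhost:8080), the current flow status can be fetched with plain Java:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch: ask the NiFi web server for the current flow status.
public class NifiStatusCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/nifi-api/flow/status"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // The response is a JSON document with queued flow file counts, active threads, etc.
        System.out.println(response.body());
    }
}
```

The same API is what lets external tools create and modify dataflows programmatically, which becomes relevant for Connect+ later in this post.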

Creating a Simple Dataflow in NiFi

Let's get our hands dirty by creating a basic dataflow in NiFi: a simple flow that lists files on a remote SFTP server and pushes them to an AWS S3 location.

  • Follow the instructions in the Apache NiFi installation guide to download and install NiFi and get it up and running.
  • Open http://localhost:8080/nifi in a browser and you will see a canvas displayed on the screen. (The port depends on the Apache NiFi version and may vary; it can be found in the nifi-app.log file inside the logs directory of the installation.)
  • Create a process group by dragging and dropping the process group item from the top panel onto the canvas. Provide a name for your dataflow.
  • Enter the process group by double-clicking it. You will again see an empty canvas, since we haven't added any components yet.
  • Create a processor by dragging the processor item from the top panel. A popup will appear asking which type of processor to add.
  • Select the ListSFTP processor from the list and confirm. The processor is now added to the canvas.
  • Create FetchSFTP and PutS3Object processors in the same way.
  • Connect these processors by hovering over the centre of a processor and dragging the connection to the destination processor, then specify the relationship for the connection. The dataflow looks like the following after creating "success" connections between the processors.
  • Set the properties of the processors by double-clicking them. The following image shows the properties being set for the ListSFTP processor.
  • Once the necessary properties are set, the processor will show that it is in a stopped state. For each processor you added, auto-terminate the relationships that are not used.
  • Start all the processors individually, or start the whole process group.

That's it! We have created our first dataflow, which transfers files from SFTP to S3 storage. You can also create your own custom NiFi processors (as sketched earlier) if the existing processor types don't cater to your needs.

The next section talks about how we wrapped Apache NiFi with our own APIs to create an internal tool called Connect+.

Connect+

While Apache NiFi has many features under the hood, we discovered that to achieve our use cases and provide a better experience for creating, maintaining, and monitoring dataflows, a wrapper service would be required. The following are the key factors behind this conclusion.

  • In NiFi, users can create their own arbitrary dataflows, whereas we wanted strict control over what can be created.
  • We wanted a cleaner UI that surfaces tenant-related information, for a better data transformation configuration experience.
  • We wanted to hide NiFi-specific details from the end user, who should not need to know how NiFi works internally.

Architecture

  • We have created a service called Glue in Spring Boot, which is responsible for creating the dataflows by communicating with NiFi through the NiFi APIs (see the sketch after this list). It also bundles the UI, which is written in React.
  • The metadata describing processors, their connections, properties, etc. is stored in a MySQL database, along with user information and their access to various workspaces.
  • A Redis datastore is used to store user sessions, since the Glue application is stateless. It also stores the OAuth tokens generated in the dataflows to call our platform APIs.
  • MongoDB is used to store the provenance and error events created by NiFi. These are later processed to create the error reports.
  • We use NewRelic as our monitoring and alerting tool, and hence we also sync the provenance events to NewRelic.
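
As an illustration only (Glue's internals are not shown here, so this is a hypothetical sketch rather than our actual implementation), a Spring service can create a processor inside a NiFi process group through the NiFi REST API; the group ID, processor type, and properties would come from the metadata stored in MySQL. The class name, base URL, and payload layout below are assumptions.

```java
import java.util.Map;

import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

// Hypothetical Glue-style service that provisions NiFi components from stored metadata.
// Check the NiFi REST API documentation for the exact entity schema of your NiFi version.
@Service
public class NifiFlowProvisioner {

    private final RestTemplate restTemplate = new RestTemplate();
    private final String nifiBaseUrl = "http://localhost:8080/nifi-api"; // assumption: unsecured local NiFi

    public String createProcessor(String processGroupId, String processorType, Map<String, String> properties) {
        // A ProcessorEntity: newly created components start at revision version 0.
        Map<String, Object> body = Map.of(
                "revision", Map.of("version", 0),
                "component", Map.of(
                        "type", processorType, // e.g. org.apache.nifi.processors.standard.ListSFTP
                        "position", Map.of("x", 100.0, "y", 100.0),
                        "config", Map.of("properties", properties)));

        String url = nifiBaseUrl + "/process-groups/" + processGroupId + "/processors";
        // NiFi responds with the created ProcessorEntity as JSON.
        return restTemplate.postForObject(url, body, String.class);
    }
}
```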

Monitoring

Apache NiFi can expose metrics in Prometheus format. We enable the Prometheus Reporting Task in NiFi and configure Prometheus to scrape the metrics from NiFi. Dashboards can then be built in Grafana, with monitoring and alerting set up on top of that data.

We also push the provenance events to NewRelic after enriching them with a few details, and we created dashboards to get real-time statistics on the data being processed.

Sample Grafana dashboard based on Prometheus queries

Reporting

Provenance events in NiFi don't contain much information on their own and are difficult to search. We used the Provenance Reporting Task in NiFi, which calls our Glue API to store the events in MongoDB; before storing the events, we enrich them with more information so that they are more searchable. For reporting via email, we wrote a job that goes through the provenance events stored in MongoDB and generates a report that is mailed to dataflow users at a scheduled time. We used the Quartz scheduler for scheduling the reports.
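
As a rough illustration of the scheduling piece (the job class, group names, and cron expression below are assumptions, not details from our implementation), wiring up such a report with Quartz looks roughly like this:

```java
import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

// Hypothetical report job: in Glue this would read the enriched provenance
// events from MongoDB, build the report, and email it to the dataflow users.
public class ProvenanceReportJob implements Job {

    @Override
    public void execute(JobExecutionContext context) {
        System.out.println("Generating and mailing the provenance report...");
    }

    public static void main(String[] args) throws Exception {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();

        JobDetail job = JobBuilder.newJob(ProvenanceReportJob.class)
                .withIdentity("provenanceReport", "reports")
                .build();

        // Assumed schedule: every day at 08:00.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("dailyReportTrigger", "reports")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0 8 * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
        scheduler.start();
    }
}
```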

Scalability

In our infrastructure, NiFi is deployed as a Kubernetes StatefulSet running in cluster mode, currently with 3 nodes. We are aware that this doesn't allow auto-scaling, but most of the data ingestion that happens through this platform is stable in terms of volume and we don't observe spikes as such.

Scalability issues will arise, however, as more and more brands adopt this service, and we ideally want to reach a point where manual evaluation and intervention are not needed in order to scale up. Scaling up here refers to processing multiple datasets concurrently; the NiFi cluster can be expanded to more nodes based on metrics like CPU utilization.

Conclusion

  • We used Apache NiFi to automate the flows and got rid of redundant integrations.
  • We built a wrapper UI, API, and custom processors around it to achieve simplicity, access control, monitoring, error reporting, and custom requirements.
  • New workflow requirements can also be onboarded with minimal or no coding knowledge just by adding metadata.
  • We were able to reduce brand onboarding onto our platform from a few days to a few hours.
