September 7, 2017
Data Engineering groups in large enterprises are typically decentralized. Teams develop specialized skill sets in particular areas of data processing, and have specific charters. For example, a team may be responsible for data acquisition. Another may be responsible for cleansing, transforming, normalizing and analyzing data. Another team of data scientists may be responsible for consuming this data, and applying machine learning models to derive insights from data. This results in the creation of complex data processing dependencies in a large enterprise. Typically, these dependencies are events generated by a given process that another process may depend on. Some examples of these events could be:
- Arrival of new data
- Successful completion of a process
- Failure of a process
- Time-based events
- Creation of a file at a given location
A process may have a dependency on a single such event, or a complex combination of multiple such events.
Also, typically in a production environment, processes are not run one-off. They are run periodically — hourly, daily, weekly, etc. They are typically scheduled on a time basis. However, it is often impossible to determine the time to run a process, because of the dependencies mentioned earlier. Moreover, a delay in running an upstream process would mean that a downstream process scheduled to run at a particular time now does not have its input ready when it is triggered. Such dependencies, although common, can lead to wastage of resources, and unpredictability in your data processing systems. In such scenarios, it would be much better for processes to be scheduled based on specific events — commonly known as event triggers. To illustrate this, let’s consider a few scenarios:
Solstice Networking is a large telecommunications company. The data infrastructure group at Solstice has recently set out to develop a data lake. The data acquisition team is responsible for acquiring data from aggregators, and other sources, and make various feeds available periodically for other teams across the organization to consume. For example, the data science team uses the device-data feed to test their models. The data analytics team consumes these feeds and runs aggregations, normalizations, and analytics to generate periodic reports. These downstream processes depend on the feeds for meeting their production SLAs. As a result, they would expect predictability in terms of when the upstream processes complete successfully, and when feeds are available for processing. In addition, each of these feeds has different characteristics — be it arrival rate, data size, SLAs, contents or location. These characteristics may also change over time — e.g. a feed may be produced at a different location than before. The downstream processes would need to know these characteristics accurately when they are triggered, so that they can use this information to run correctly.
Pixeltube Corporation has a large customer care organization, with a tremendous volume of support requests. To make the customer support process more streamlined, and to better help their customers, they have developed a complex data processing system with the help of the data infrastructure team. The main aim of this system is to help solve customer issues in the least possible time with the least possible use of resources. All customer support calls have been instrumented to collect data about the issue at hand, the suggested remedies, and the resolution. The data collection team collects this data from all the servers where it is sent and makes it available at a central location on an Apache Hadoop cluster periodically. The data analytics team cleanses, transforms and aggregates this data and makes it available in a standard format. The same cluster also has other data such as device metadata and user profiles, generated by other processes and teams. The data science team generates, maintains and tests their models across these datasets and derives insights that can help in predicting solutions to future customer problems. A reporting team uses all these datasets to generate timely reports that can be used by the customer support team. As you can see, this is a complex system with a lot of interdependencies across multiple teams. For this process to function correctly, it is necessary for the system supporting it to allow users to express these dependencies clearly, and then manage them at runtime.
The above two scenarios are representative of data engineering/infrastructure organizations of large enterprises today. The common theme that cuts across all such organizations is complex interdependencies between processes. If these dependencies are not managed correctly, it can result in operational hazards as well as financial losses. Some of the typical characteristics of such interdependent systems are:
- Teams with different responsibilities perform specialized tasks in a data processing system
- There are complex runtime dependencies between teams in production (tight coupling), but during development and testing, teams run their processes independently (loose coupling).
- In production systems, all the processes are typically run periodically, and not one off.
- It is difficult, and more importantly inefficient to schedule these processes on a time basis.
- A more efficient way of scheduling such processes is via event based triggers
- Merely firing an event from an upstream process that would trigger a downstream process is not enough. The upstream and downstream processes need a handshake in the form of some payload containing information about the upstream process.
- For operational purposes, it is important to be able to visualize these dependencies, so problems/issues can be traced effectively and resolved.
Event-based triggers in the Cask Data Application Platform (CDAP) address these requirements. They allow users to trigger workflows based on specific, system generated events such as:
- The change in state of an upstream workflow
- The arrival of a new partition in a partitioned datasets
Here is a video that illustrates event based triggers in action in CDAP.
To try out Event Triggers and other new CDAP features, download CDAP Local Sandbox or spinning up an instance of the CDAP Cloud Sandbox on AWS or Azure. CDAP is also available to install in distributed environments. Reach out to us via chat or email should you have any questions or issues, or just want to give us your valuable feedback!