Data Pipeline and Data Lifecycle Management

Kiran Mainali
Big Data Processing
4 min read · Nov 15, 2021

Data Pipeline

A data pipeline, or data analytics pipeline, is a series of data processing steps that move data from a source to a destination. A pipeline is an engine for data transformation built from a series of interconnected automated and manual operations or tasks. Data pipelines are the foundation of analytics, reporting, and machine learning: their processes move and transform data from sources to destinations, generating new value along the way [1]. The tasks in a data pipeline may run individually or as a group with an agreed execution order [2]. In some pipelines the order is essential, whereas in others the task order is interchangeable. Each data analytics project has its own purpose, and a data analytics pipeline is created to meet that purpose. The tools we use and the tasks we define differ depending on the project goal.
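To make this concrete, here is a minimal Python sketch of a pipeline as an ordered series of tasks. The task names (extract, clean, aggregate) and the toy records are assumptions for illustration, not part of any particular tool:

```python
# A minimal sketch of a data pipeline as an ordered series of tasks.

def extract():
    # In practice this would read from a database, API, or file.
    return [{"user": "a", "amount": 10}, {"user": "b", "amount": -5}]

def clean(records):
    # Drop records that fail a simple validity check.
    return [r for r in records if r["amount"] >= 0]

def aggregate(records):
    # Produce a new value from the cleaned data.
    return {"total": sum(r["amount"] for r in records)}

def run_pipeline():
    # The execution order matters here: clean() must run before aggregate().
    raw = extract()
    valid = clean(raw)
    return aggregate(valid)

if __name__ == "__main__":
    print(run_pipeline())  # {'total': 10}
```

Swapping clean() and aggregate() would change the result, while two independent cleaning steps could run in either order; this is the distinction about task ordering made above.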

Figure 1: Simple Data Pipeline

In general, a data pipeline is built for efficiency, to minimize or eliminate manual processing. Several commercial and open-source tools are available for building data analytics pipelines, and many comparative studies of them exist [3,4]. However, the choice of tools and technologies for a data analytics pipeline depends on the organization, its people and technology, and its data governance and data management policies [5].

Data Lifecycle

Figure 2: Data Lifecycle Stages

Figure 2 illustrates a general data lifecycle. Planning the project is the first step: the purpose and procedures for data collection, transformation, analysis, publishing, and storage are defined at this stage, and the data lifecycle governance and data quality management plans are formulated. Here we do not consider planning part of the data lifecycle itself. The data lifecycle proceeds through the collection/creation, processing, analysis, and publishing stages. At each stage, data is archived in persistent storage and simultaneously held in temporary memory for easy access by the following stages. Temporary copies are deleted immediately after the next stage has used them, so any stage that needs the same data again later must fetch it from persistent storage. Persistently stored data is eventually destroyed once its purpose has been fulfilled.
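The storage behavior described above can be sketched in a few lines of Python. The stage names and the dict-based stand-ins for persistent storage and temporary memory are assumptions for illustration:

```python
# Sketch of the lifecycle storage rule: each stage archives its output to
# persistent storage and hands a temporary copy to the next stage, which
# deletes the temporary copy as soon as it consumes it.

persistent_storage = {}  # survives the lifecycle, destroyed only at disposal
temp_cache = {}          # short-lived hand-off between adjacent stages

def run_stage(name, func, input_key):
    # Prefer the temporary copy; fall back to persistent storage otherwise.
    data = temp_cache.pop(input_key, None)
    if data is None:
        data = persistent_storage.get(input_key)
    result = func(data)
    persistent_storage[name] = result  # archive for future access
    temp_cache[name] = result          # fast hand-off to the next stage
    return name

key = "collect"
persistent_storage[key] = [3, 1, 2]   # stand-in for collected/created data
temp_cache[key] = [3, 1, 2]

key = run_stage("process", lambda d: sorted(d), key)
key = run_stage("analyze", lambda d: {"max": max(d)}, key)
key = run_stage("publish", lambda d: "report: " + str(d), key)
print(persistent_storage)
```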

A data governance policy governs the lifecycle and controls the flow of data by implementing rule-based decisions for storage (temporary and persistent), archival, access, and disposal. In addition, data quality management checks the quality of the data (both input and output) at every process throughout the data lifecycle.
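As a rough sketch, rule-based governance decisions and quality checks of the kind described above might look like the following; the retention period, the quality rule, and the record layout are invented for illustration:

```python
# Hedged sketch of rule-based governance: a disposal rule and a per-stage
# data quality check. Real policies would be far richer than this.

from datetime import date, timedelta

RETENTION = timedelta(days=365)  # assumed disposal rule

def should_dispose(archived_on: date, today: date) -> bool:
    # Disposal decision: purge once the retention window has elapsed.
    return today - archived_on > RETENTION

def passes_quality_check(record: dict) -> bool:
    # Input/output quality rule applied at every lifecycle stage:
    # here, simply that required fields are present and non-empty.
    return bool(record.get("id")) and record.get("value") is not None

record = {"id": "r1", "value": 42}
print(passes_quality_check(record))                          # True
print(should_dispose(date(2020, 1, 1), date(2021, 11, 15)))  # True
```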

The data lifecycle presented above is cyclic: published data can be collected from storage, processed, and analyzed again to deliver different analytics results. The lifecycle therefore continues, with each cycle producing new data as a result.

Data Pipeline in Data Lifecycle Management

Figure 3: Data Pipeline in the Data Lifecycle Management

Data pipelines are a means of sequencing data processes throughout the data lifecycle [2]. A data pipeline usually starts from data collection or creation, as in Figure 3, where pipelines P1, P2, and P6 consume data directly from the source, while the remaining pipelines access data from storage (temporary or persistent). A data pipeline can cover the whole data lifecycle or be built for one specific stage; how much of the lifecycle a pipeline covers is decided in the planning phase, based on the project scope. A data pipeline is not required to send its data to a storage service; it can route data to other pipelines or applications instead [6]. For example, in Figure 3, pipeline P3 is assumed to take its input directly from P2 and route its output directly to P4 without storing the data in the storage service.
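This routing idea can be sketched as follows; the pipeline bodies are placeholders, and only the wiring between P2, P3, and P4 reflects the figure:

```python
# Sketch of pipeline-to-pipeline routing: P3 hands its output straight to
# P4 instead of writing to a storage service.

def p2(source):
    # P2 reads directly from the source.
    return [x * 2 for x in source]

def p3(data, downstream):
    # P3 takes P2's output and routes its own output directly to the
    # downstream pipeline (P4 here), skipping the storage service.
    transformed = [x + 1 for x in data]
    return downstream(transformed)

def p4(data):
    return sum(data)

source = [1, 2, 3]
print(p3(p2(source), downstream=p4))  # ((1*2)+1) + ((2*2)+1) + ((3*2)+1) = 15
```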

Figure 3 illustrates how data pipelines are used across the different lifecycle stages. We outline several possible data pipelines with dotted ovals; many other pipelines could be constructed to suit a given data analytics project's requirements. The figure shows data moving from one lifecycle stage to another while being transformed by a series of tasks inside each pipeline. The block at the top of the figure represents a pipeline with one or more interconnected tasks.

Moving data from one stage to another requires a system of interconnected tools, technologies, and processing steps assembled inside the data pipeline. A data pipeline, then, is a means of transporting data from one stage to another while transforming it along the way.
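One way to picture this "transport with transformation" view, under the assumption that each hop between stages is a pure processing step, is to compose the steps into a single callable:

```python
# Sketch of stage-to-stage transport as a composition of transformations.
# The individual steps are assumptions for illustration.

from functools import reduce

def compose(*steps):
    # Chain processing steps so each stage's output feeds the next stage.
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

stage_transfer = compose(
    lambda rows: [r.strip() for r in rows],   # processing
    lambda rows: sorted(set(rows)),           # analysis prep
    lambda rows: "\n".join(rows),             # publish format
)

print(stage_transfer([" beta ", "alpha", "beta"]))  # alpha\nbeta
```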

References:

  1. J. Densmore, Data Pipelines Pocket Reference, First Edition, 2020.
  2. B. Plale, I. Kouper, The Centrality of Data: Data Lifecycle and Data Pipelines, in: Data Anal. Intell. Transp. Syst., Elsevier Inc., 2017: pp. 91–111. https://doi.org/10.1016/B978-0-12-809715-1.00004-3.
  3. A. Bansal, S. Srivastava, Tools Used in Data Analysis: A Comparative Study, 2018. https://www.ijrra.net/Vol5issue1/IJRRA-05-01-04.pdf.
  4. H. Khalajzadeh, M. Abdelrazek, J. Grundy, J. Hosking, Q. He, A Survey of Current End-User Data Analytics Tool Support, in: Proc. — 2018 IEEE Int. Congr. Big Data, BigData Congr. 2018 — Part 2018 IEEE World Congr. Serv., Institute of Electrical and Electronics Engineers Inc., 2018: pp. 41–48. https://doi.org/10.1109/BigDataCongress.2018.00013.
  5. Z.A. Al-Sai, R. Abdullah, M.H. Husin, Critical Success Factors for Big Data: A Systematic Literature Review, IEEE Access. 8 (2020) 118940–118956. https://doi.org/10.1109/ACCESS.2020.3005461.
  6. G. Alley, What is a Data Pipeline? | Alooma, (2018). https://www.alooma.com/blog/what-is-a-data-pipeline.
