The Data Engineering Lifecycle

Dom N
Nov 8, 2022

A primer on the 5 key components of how data engineering is delivered end-to-end to provide business value. This is the first of a 6-part series and is a summary of my thoughts and learnings from the book ‘The Fundamentals of Data Engineering’, along with other talks and articles on these topics.

The DE Lifecycle (from the book)

As seen in the image above, the lifecycle has 5 main stages, with 6 undercurrents (security, data management, DataOps, data architecture, orchestration and software engineering) touching each one. The 5 main parts are:

  • Generation
  • Ingestion
  • Storage
  • Transformation
  • Serving (ML, Analytics, Reverse ETL)

Generation

A source system is any origin of data that outputs raw data we later want to use. Source systems come in many forms, such as transactional databases, IoT devices, flat files (CSV, XML), RSS feeds, web services and many more.
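
To make this concrete, here is a minimal Python sketch of reading raw data from two common kinds of source system: a CSV flat file and a JSON web service. The file name and endpoint URL are placeholders, not real sources.

```python
# A minimal sketch (not a real pipeline) of reading raw data from two common
# kinds of source system: a CSV flat file and a JSON web service.
import csv
import json
import urllib.request


def read_flat_file(path: str) -> list[dict]:
    """Read a CSV flat file into a list of row dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def read_web_service(url: str) -> dict:
    """Fetch and decode a JSON payload from a web service endpoint."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


# Hypothetical usage -- neither the file nor the endpoint exists here:
# orders = read_flat_file("orders.csv")
# events = read_web_service("https://example.com/api/events")
```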

Data Engineers (DEs) need to understand how to interact with these sources and communicate with the upstream stakeholders that manage them. This can be difficult because there are commonly many different systems a DE must be aware of.

Key considerations for evaluating source systems (a small profiling sketch follows this list):

  • What are the essential characteristics of the data source?
  • Length of data persistence, i.e. long term or quickly deleted?
  • Rate of data generated, i.e. events/size per second/hour?
  • Level of consistency, i.e. standards, formatting, cleanliness of data
  • Frequency of errors or duplicates
  • Timing of data values, i.e. are some later than others?
  • What is the schema? How are schema changes communicated (if any)? Is it schemaless or fixed schema?
  • Frequency of refresh (align to use case)
  • Type of tables e.g. periodic snapshots or update events or logs etc?
  • Performance considerations when reading from the source
  • Data quality checks, i.e. are they in place upstream or do you need to do it?
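
Several of these questions (schema, duplicates, error rates) can be answered by profiling a sample extract. Below is a rough sketch using pandas; the column names and metrics are purely illustrative, not a prescribed checklist.

```python
# A rough profiling sketch: sample a source extract and report schema,
# duplicates and null rates. Column names below are purely illustrative.
import pandas as pd


def profile_source(df: pd.DataFrame, key_columns: list[str]) -> dict:
    """Return simple quality metrics for a sample of source data."""
    return {
        "schema": df.dtypes.astype(str).to_dict(),                # inferred column types
        "row_count": len(df),
        "duplicate_rows": int(df.duplicated(subset=key_columns).sum()),
        "null_rate_per_column": df.isna().mean().round(3).to_dict(),
    }


# Toy extract: order_id 2 appears twice and one amount is missing.
sample = pd.DataFrame({"order_id": [1, 2, 2, 4], "amount": [10.0, None, 5.5, 8.0]})
print(profile_source(sample, key_columns=["order_id"]))
```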

Ingestion

Source system generation and ingestion are often the bottlenecks for many organisations. Source systems are typically managed externally, and ingestion can stop working at unpredictable times. If these two stages are unreliable, the failures ripple across the rest of your lifecycle.

Key considerations for evaluating ingestion patterns:

  • What are the use-cases for this data? Can I re-use this data instead of creating multiple versions of the same dataset?
  • Destination after ingestion
  • Frequency of access
  • Volume of data during arrival
  • Format of data during arrival
  • Pull vs push

The biggest debate at this phase lies in deciding between batch and streaming ingestion. Batch ingestion means taking data in chunks, which could be defined by size, time period or a threshold. Streaming means consuming data in a continuous, (near) real-time flow. Batch has historically been the most popular method for moving data; however, we anticipate streaming to overtake it in the coming years. Considerations for batch vs streaming:

  • Can downstream storage handle stream flow?
  • Microbatch vs real-time? Would per-minute batches suffice?
  • What specific benefits do I gain from using streaming instead of batch?
  • What’s the difference in cost (money, time, maintenance, downtime, opportunity cost)?
  • Would an ML model benefit from continuous training?

As a concluding thought on ingestion patterns: batch ingestion works well in almost all use-cases (e.g. weekly reporting or model training), so only adopt streaming when there is a real business case for it with better trade-offs.
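
To make the contrast concrete, here is a toy sketch of the two styles. An in-memory queue stands in for a real message broker, and the chunking logic is deliberately simplified.

```python
# A toy contrast between batch and streaming ingestion of the same events.
# An in-memory queue stands in for a real message broker.
import queue


def ingest_batch(events: list, chunk_size: int = 2) -> None:
    """Batch style: load events in fixed-size chunks."""
    for i in range(0, len(events), chunk_size):
        chunk = events[i : i + chunk_size]
        print(f"loaded a batch of {len(chunk)} events")


def ingest_stream(source: "queue.Queue", timeout: float = 0.5) -> None:
    """Streaming style: consume events one by one as they arrive."""
    while True:
        try:
            event = source.get(timeout=timeout)
        except queue.Empty:
            break  # no more events in this toy demo
        print(f"processed event {event['id']}")


events = [{"id": n} for n in range(5)]
ingest_batch(events)

broker = queue.Queue()
for e in events:
    broker.put(e)
ingest_stream(broker)
```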

Storage

Storage is relevant across all parts of the lifecycle and can happen in different ways in each part. Choosing storage solutions is important and involves many trade-offs. Generally speaking, storage can happen on premises (on-prem) or in the cloud, which has seen mass adoption in recent years for good reason (we will not touch on the benefits of cloud computing here).

Key considerations for evaluating storage systems:

  • Compatibility with required write and read speeds (+ SLAs)
  • Bottleneck considerations for downstream processes
  • How the storage works, i.e. does it prioritise long term storage or frequent and fast reads? Does it support complex query patterns?
  • Scalability
  • Capturing metadata (schema evolution, data flows, lineage) and managing governance (MDM or golden records + compliance)
  • Is it schemaless (schema on read) or fixed schema (schema on write)?

Not all data is created equal, and depending on the use-case, retrieval patterns will differ greatly. We can therefore refer to some data as ‘hot data’, meaning it is accessed very frequently (many times per day or even per minute), and other data as ‘cold data’ (seldom queried and suitable for archiving).
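
As a toy illustration, routing a dataset to a storage tier might look something like the sketch below; the thresholds and tier descriptions are made up for the example.

```python
# A toy illustration of routing datasets to storage tiers by access frequency.
# The thresholds and tier descriptions are invented for the example.
def choose_tier(reads_per_day: float) -> str:
    if reads_per_day >= 100:
        return "hot: low-latency store optimised for frequent, fast reads"
    if reads_per_day >= 1:
        return "warm: standard object storage"
    return "cold: archival storage, cheap to keep but slow to retrieve"


for dataset, reads in [("clickstream", 5000), ("monthly_report", 2), ("2015_backups", 0.01)]:
    print(dataset, "->", choose_tier(reads))
```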

Transformation

This is where the DE adds the most value to the business. Without this step, data won’t fulfil its potential as decision support for the business. Transformations can range from basic to very complex. On the more basic side, we often see data being cast to different data types or standardised into common formats. We might then see transformation of the schema, normalisation of data or aggregation. The most complex (and most impactful) transformations, however, come from applying business logic to the raw data. This can also be considered ‘data modelling’.

An example of data modelling might be creating the tables in your data warehouse to reflect the hierarchy and functions of your business unit. A specific business rule you might need to implement is defining what a customer is. Are they someone who has simply created an account? Do they need to have made a purchase? What time period do you consider?
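
Here is a hedged sketch of one possible version of that business rule, treating a customer as an account holder with at least one purchase in the last 12 months. The tables, columns and cutoff date are illustrative.

```python
# One possible "customer" rule: an account holder with at least one purchase
# in the last 12 months. Tables, columns and the cutoff date are illustrative.
import pandas as pd

accounts = pd.DataFrame({
    "account_id": [1, 2, 3],
    "created_at": pd.to_datetime(["2021-01-10", "2022-03-05", "2022-09-20"]),
})
purchases = pd.DataFrame({
    "account_id": [1, 1, 3],
    "purchased_at": pd.to_datetime(["2022-06-01", "2022-10-15", "2021-02-01"]),
})

cutoff = pd.Timestamp("2022-11-08") - pd.DateOffset(months=12)
recent_buyers = purchases.loc[purchases["purchased_at"] >= cutoff, "account_id"]

# Account 3 has purchased before, but not recently, so it is not a "customer" here.
accounts["is_customer"] = accounts["account_id"].isin(recent_buyers)
print(accounts)
```

In practice a rule like this usually lives in the warehouse’s transformation layer (e.g. as a SQL model), but the logic is the same.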

Key considerations for transforming:

  • Comparing cost to ROI of the transformation
  • Is the transformation simple and self-contained?
  • What business rules does the transformation support?

Transformation can also happen in other parts of the lifecycle, e.g. in flight as data is ingested, where logic is applied to records in a stream before they land in the destination. Examples of this include adding timestamps or enriching records with extra fields/calculations.
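
A small sketch of that kind of in-flight enrichment might look like this; the conversion rate and field names are invented purely for illustration.

```python
# A small sketch of in-flight transformation: each record is enriched with an
# ingestion timestamp and a derived field as it passes through the pipeline.
# The conversion rate and field names are invented for illustration.
from datetime import datetime, timezone
from typing import Iterable, Iterator


def enrich(records: Iterable[dict]) -> Iterator[dict]:
    for record in records:
        yield {
            **record,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "amount_usd": round(record["amount_eur"] * 1.05, 2),  # toy FX rate
        }


raw = [{"order_id": 1, "amount_eur": 20.0}, {"order_id": 2, "amount_eur": 7.5}]
for enriched_record in enrich(raw):
    print(enriched_record)
```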

The last major category of transformation is featurization. This is often done in preparation for a machine learning model and involves combining domain expertise with extensive data science knowledge to derive and enhance features.
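
As a rough illustration, featurization might turn raw purchase events into per-account features like the sketch below; the choice of features is illustrative, not taken from the book.

```python
# A rough featurization sketch: raw purchase events become per-account features
# a model might consume. The choice of features is illustrative only.
import pandas as pd

purchases = pd.DataFrame({
    "account_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 5.0, 12.5, 7.0],
    "purchased_at": pd.to_datetime(
        ["2022-09-01", "2022-10-20", "2022-08-15", "2022-10-01", "2022-11-01"]
    ),
})

features = purchases.groupby("account_id").agg(
    purchase_count=("amount", "size"),
    avg_amount=("amount", "mean"),
    last_purchase=("purchased_at", "max"),
)
features["days_since_last_purchase"] = (
    pd.Timestamp("2022-11-08") - features["last_purchase"]
).dt.days
print(features)
```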

Serving Data

The last stage of the lifecycle is where you will interact the most with other stakeholders. This is the part where downstream stakeholders actually receive the benefit of your labour. We categorise the 3 main outputs for serving data as:

  1. Analytics — includes published reports or dashboards, ad-hoc analyses on data. Can be further split into BI, operational or embedded analytics.
  2. Machine Learning — includes serving data used for the purpose of prediction or decision making.
  3. Reverse ETL — involves feeding the result of transformed data back into a source or other system for further use (a small sketch follows this list).
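
For reverse ETL, a minimal sketch might push a metric computed in the warehouse back to an operational tool over HTTP. The endpoint, payload shape and churn score below are hypothetical.

```python
# A hedged reverse ETL sketch: push a metric computed in the warehouse (here, a
# churn score) back to an operational tool over HTTP. The endpoint and payload
# shape are hypothetical.
import json
import urllib.request


def push_to_crm(rows: list, endpoint: str) -> None:
    for row in rows:
        request = urllib.request.Request(
            endpoint,
            data=json.dumps(row).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as resp:
            print("sent account", row["account_id"], "status", resp.status)


scores = [{"account_id": 1, "churn_score": 0.12}, {"account_id": 2, "churn_score": 0.87}]
# push_to_crm(scores, "https://crm.example.com/api/contacts")  # hypothetical endpoint
```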

Concluding Thoughts

So we’ve briefly discussed the 5 key stages of the data engineering lifecycle and some considerations for each. We will dive into each of these in more depth in other articles.

I hope you got something out of this and learned a little about the different components of the lifecycle a good DE should understand to be productive in their business.
