Pipeline: Your Data Engineering Resource

Your one-stop-shop to learn data engineering fundamentals, absorb career advice and get inspired by creative data-driven projects — all with the goal of helping you gain the proficiency and confidence to land your first job.

Extract. Transform. Read.

Data Engineering Un-contained

3 min read · Mar 27, 2025


The following short read is an excerpt from my weekly newsletter, Extract. Transform. Read., which goes out to 2,000+ aspiring data professionals. If you enjoy this snippet, you can sign up and receive your free project ideation guide.

For a STEM discipline, data engineering involves a lot of abstraction, evident in everything from temporary SQL views to complex, multi-task Airflow DAGs. Perhaps most abstract of all is the concept of containerization, the process of running an application in a clean, standalone environment. That's the simplest definition I can provide.

Since neither of us has all day, I won’t get too into the weeds on containerization, but I will offer a brief explanation followed by some best practices.

If my simple definition doesn't provide enough detail, consider this example. One afternoon, while on a walk, I explained to my non-technical (but very intelligent) wife that running a container image on an infrastructure layer, like a virtual machine, is like setting up a computer with an operating system that contains only what is minimally necessary to run an application.

In our example, the application was a game from her childhood that she wanted, theoretically, to run. The instructions, including the installation of the game and the OS needed to run it, would be the container image.

If you're using a service like Docker, the spark that jumpstarts the engine of infrastructure is the Dockerfile, which contains detailed instructions in the form of one-word commands like:

  • FROM (specifies a base image, which could be something like "python:version" or another Docker image)
  • COPY (copies the files you need from the build context into a directory inside the container)
  • RUN (executes commands at build time, such as installing dependencies with "pip")
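
To make those commands concrete, here is a minimal sketch of a Dockerfile, assuming a hypothetical Python app whose code lives in main.py with dependencies listed in requirements.txt:

    # Start from a slim Python base image (version tag is an example)
    FROM python:3.12-slim

    # Set the working directory inside the container
    WORKDIR /app

    # Copy the dependency list and application code from the build context
    COPY requirements.txt main.py ./

    # Install dependencies with pip at build time
    RUN pip install --no-cache-dir -r requirements.txt

    # Run the application when the container starts
    CMD ["python", "main.py"]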

Like other tech concepts, the desire for candidates knowledgeable about containerization appears in job descriptions in the form of buzzwords like Kubernetes (cluster management) and Docker (the industry standard for creating, maintaining and running containers and images).

Photo by Ian Taylor on Unsplash

To stand out as a container-izer, definitely take time to learn the quirks of container platforms like Docker, but also:

  • Consider using :slim versions of images to reduce image size, both when stored in a remote repository and when pulled and run
  • Understand image tags and how to ensure you're actually using the "latest" version of each image (the :latest tag is a convention, not a guarantee of recency)
  • Construct your local directory and CI/CD pipeline properly so the Dockerfile is within the scope of the build step (see the sketch after this list)
  • Double-check file paths when writing CLI commands and when creating YAML files
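
Putting a few of those tips together, here is a hedged sketch of the build step; the image name (my-pipeline) and version tag are hypothetical:

    # Build from the directory containing the Dockerfile ("." is the build context)
    docker build -t my-pipeline:1.0.2 .

    # Pin explicit base image tags; :latest simply points at whatever was last tagged "latest"
    docker pull python:3.12-slim

    # Verify which images and tags exist locally
    docker image ls my-pipeline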

Obviously, Docker and containerization go well beyond the scope of this brief overview. One of the challenges I faced was learning how to properly inject environment variables with API keys into a container at run time.
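
As a hedged sketch of one common pattern (the key path, mount point, and image name below are assumptions for illustration), you can mount a service-account key into the container and point an environment variable at it at run time, rather than baking the secret into the image:

    # Mount a GCP service-account key read-only into the container, then
    # expose its path via the environment variable Google client libraries check
    docker run \
      -v "$HOME/keys/gcp-key.json:/secrets/gcp-key.json:ro" \
      -e GOOGLE_APPLICATION_CREDENTIALS=/secrets/gcp-key.json \
      my-pipeline:1.0.2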

Assuming others have faced the same issue, I wrote about how to authenticate with GCP when running a Docker image.

Thanks for ingesting,

-Zach Quinn
