Do all tasks have data targets attached to them?

Published in Databand, an IBM Company

Jun 24, 2021

Ifat Yaakobi: Do all tasks have data targets attached to them? Or is it possible to have a task that does not have any input or output target? How common is this situation?

Harper: Great question. This will be difficult to address due to several complexities around the idea of a ‘data target’, but I’ll do my best.

First, it’s important to recognize that the terms used here (data target, task) are closely tied to Databand.ai terminology. Let’s start with some definitions, follow with a general overview, then close with how that ties to Databand.ai.

These are the terms I want to clarify:

Dataset — a collection of data points

Task — a collection of step(s) to enact a logical process

Pipeline — a collection of tasks

Data Target(s) — Databand object which represents a client’s dataset

Generally speaking, a task does not need a dataset to function. However, without receiving, seeking, or producing data, the task cannot know the context in which it is operating and cannot inform any tasks further downstream. Even the simplest task needs to produce a dataset, if only a 1 or a 0, so the pipeline knows how to proceed: did the task pass or fail?

So, in general, I would say it’s extremely uncommon for a task to have no dataset at all. Whether that dataset is persisted or held in memory is what distinguishes a data target from a dataset in general.

Bringing this back to Databand.ai, data targets can represent any type of dataset. However, I believe users’ default behavior will be to track persisted datasets, such as tables, files, or objects. For this reason, I would say not all tasks have an attached data target, and it would be a common scenario to see a task without one.
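To make the distinction concrete, here is a minimal Python sketch. It assumes nothing about the Databand API: `DataTarget`, `Task`, and `run_pipeline` are hypothetical names invented for illustration. Every task returns an in-memory result (1 for pass, 0 for fail), but only the task that writes a persisted dataset has a data target attached.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical types for illustration only; this is not the Databand API.
@dataclass
class DataTarget:
    """A persisted dataset (table, file, or object) that is being tracked."""
    path: str


@dataclass
class Task:
    name: str
    # Returns 1 (pass) or 0 (fail): the smallest possible in-memory dataset.
    run: Callable[[], int]
    # Persisted datasets tracked for this task; empty if there are none.
    targets: List[DataTarget] = field(default_factory=list)


def run_pipeline(tasks: List[Task]) -> Dict[str, int]:
    """Run tasks in order and stop at the first failure."""
    statuses: Dict[str, int] = {}
    for task in tasks:
        statuses[task.name] = task.run()
        if statuses[task.name] == 0:
            break
    return statuses


pipeline = [
    # Produces only a pass/fail signal, so it has no data target.
    Task("check_source_available", run=lambda: 1),
    # Writes a table, so a data target is attached.
    Task("load_orders", run=lambda: 1, targets=[DataTarget("warehouse.orders")]),
]

statuses = run_pipeline(pipeline)
untracked = [task.name for task in pipeline if not task.targets]
print(statuses)
print(untracked)
```

In this sketch both tasks produce data the pipeline relies on, yet only `load_orders` would show up with an attached data target.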

In summary, a data target is a Databand object that captures information about a dataset in a client’s system. A task in Databand is not required to have a data target, and I do not think it would be uncommon to see tasks without one.

Ifat Yaakobi: One more question about data targets. When I look at a target’s volume history, which view gives more interesting insights: observing the target’s volume history on its own, or comparing it to other dataset histories in the run and task?

Harper: It is always interesting to understand a target’s volume over time. Whether it is interesting to look at that volume history relative to other datasets depends entirely on how closely those data targets are correlated.

To determine correlation, I would want to answer questions like:

Are these data targets from the same source?

Are these data targets from the same domain? (Finance, Healthcare, etc.)

Do these data targets have the same granularity?

Are these data targets created by the same task or pipeline?

Do these data targets have the same velocity/cadence?
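The questions above can be sketched as a simple checklist in Python. This is only an illustration under stated assumptions: `TargetProfile`, its fields, and the example values are hypothetical and do not come from Databand. The idea is that the more attributes two data targets share, the more likely it is that comparing their volume histories is meaningful.

```python
from dataclasses import dataclass, fields
from typing import List


# Hypothetical description of a data target; not a Databand object.
@dataclass(frozen=True)
class TargetProfile:
    source: str       # system the data comes from
    domain: str       # e.g. finance, healthcare
    granularity: str  # what one record represents
    producer: str     # task or pipeline that creates the target
    cadence: str      # how often the target is updated


def shared_attributes(a: TargetProfile, b: TargetProfile) -> List[str]:
    """Return the names of the attributes on which both targets agree."""
    return [f.name for f in fields(a) if getattr(a, f.name) == getattr(b, f.name)]


# Made-up example values for illustration.
orders = TargetProfile("erp", "finance", "row per order", "load_orders", "daily")
refunds = TargetProfile("erp", "finance", "row per refund", "load_refunds", "daily")

print(shared_attributes(orders, refunds))  # ['source', 'domain', 'cadence']
```

Here the two targets share a source, a domain, and a cadence, which suggests their volume histories could reasonably be viewed side by side, while the differing granularity means the raw counts should not be expected to match.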

Ifat Yaakobi: OK. It sounds like there are a lot of dependencies and considerations here. Thanks for this feedback.

This post is a part of the AMA Data Engineering series hosted by Databand.ai. Data Engineering AMAs (Ask Me Anything) are an opportunity for community members to get their burning questions answered by our team of experts. Curious about any topic from pipeline design and best practice to use cases and data engineering workflow? Leave a comment below and we may feature your question next!

___________________________________________________________________

Who’s Harper?

After years of studying Accounting, Mathematics, and Economics, Harper stumbled into the world of Big Data and has never looked back. Most recently, Harper has led Data Engineering teams in the NLP and Data Ops spaces, where he prioritizes folks’ psychological safety above all else. In his current role as Data Solution Architect for Databand.ai, Harper loves conversations around Data Engineering pain points and how best to solve them.

Written by (Michael) Harper (he/him)