Ia! AWS! DataPipelines Fhtagn!

Hillel Wayne
6 min read · Sep 7, 2020

--

Note: this was an article I was writing in 2017, got halfway through the first draft, and then never touched the company again. It’s unedited, completely outdated, and the entire second half is missing. I’m only publishing it because I wanted to share the draft with a couple of people. Caveat emptor.

AWS says that Data Pipelines are “a web service that helps you reliably process and move data between different AWS compute and storage services, as well as on-premise data sources, at specified intervals.” Nothing more than a simple ETL service, as natural and simple as the air we breathe. This is a lie they’ve fed us to keep us in happy ignorance. Data Pipelines is an eldritch horror. It is the ambergris of dead gods rotting between the stars. It is the atoms-madness that uncoils to the programmer-priests of AWS and so thrust upon us, machines unmade and unknowable by man or son. Behold our digital Necronomicon.

This recounts our experiences with Data Pipelines, our scattered recollections of a doomed expedition. The memories are hazy: while we set off in April, only recently was I released from the sanitarium. Nonetheless, may this serve as a warning to others… or a manual for the truly foolhardy, or desperate.

Motivation

At eSpark Learning we build individualized learning tools for students and teachers. As a side effect of this we accrue a lot of actionable data that principals and superintendents find useful. For example, if students in their school are consistently underperforming in specific reading standards, they can []. [Also, we make more money. etc]. Most of this data, though, is tied up in event logs in our data warehouse. We wanted to make this more accessible and decided in April to build a data dashboard for principals.

We can generate analysis and reports from our Redshift warehouse, but that doesn’t scale. Redshift chokes on more than 15 simultaneous connections, and the queries can be fairly intensive. This meant on-demand reporting would be too expensive and difficult to do. Instead, every ten minutes we’d crunch the new event rows and upload them to a dedicated Postgres database that stored our analysis. Principals would get a decent resolution on analytics, we’d get a performant solution, everybody is happy!

Kinda like this

Since the rest of our infra was on AWS, we decided to use Data Pipelines for our ETL.

In sunken R’lyeh dead Cthulhu lies waiting.

Let’s begin with what we wanted. Ideally, the ETL is pretty simple:

  1. Run a query on our Redshift database
  2. Copy the result to an S3 bucket
  3. Upload the CSV to our Postgres database.

Data Pipelines has a GUI you can use to visually build the pipeline, which turns out to be a terrible mistake, but for now we were naive. Here’s what that looks like:

Shouldn’t be too hard to understand. Every time period, DP spins up an EC2 instance and runs the copy activities. Each activity copies to a data node. For our SQL nodes we have to specify the database they belong to. We don’t need to do that for S3.
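In the definition format (the kind of thing you’d get exporting from the GUI), that picture comes out as a handful of objects wired together by refs. A sketch of the Redshift-to-S3 leg only — the object types (Schedule, Ec2Resource, SqlDataNode, S3DataNode, CopyActivity, RedshiftDatabase) are real DP types as I remember them, but the ids, cluster name, bucket, and myRedshiftPassword parameter are made up for illustration, and the `*` on `*password` is DP’s encrypted-field marker, if memory serves:

```json
{
  "objects": [
    { "id": "DefaultSchedule", "type": "Schedule",
      "period": "10 minutes", "startAt": "FIRST_ACTIVATION_DATE_TIME" },
    { "id": "MyEc2", "type": "Ec2Resource",
      "schedule": { "ref": "DefaultSchedule" } },
    { "id": "RedshiftDb", "type": "RedshiftDatabase",
      "clusterId": "our-cluster", "username": "etl",
      "*password": "#{myRedshiftPassword}" },
    { "id": "EventRows", "type": "SqlDataNode",
      "database": { "ref": "RedshiftDb" },
      "selectQuery": "select ..." },
    { "id": "StagingCsv", "type": "S3DataNode",
      "filePath": "s3://our-bucket/analysis.csv" },
    { "id": "RedshiftToS3", "type": "CopyActivity",
      "runsOn": { "ref": "MyEc2" },
      "input": { "ref": "EventRows" },
      "output": { "ref": "StagingCsv" } }
  ]
}
```

The S3-to-Postgres leg would be a second CopyActivity pointed the other way.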

Makes sense, right? Good. Nyarlathotep hungers for our souls and doesn’t like waiting.

The Falcon Can’t Find the Falconer

“Hey,” you might say, “how do you give AWS access to the pipelines?” Good question! Obviously we can’t hardcode the password, because that’s a good way to get your data ransomed. DP allows for custom parameters you can inject at runtime. All custom variables are of the form myVariable — yes, the my prefix is mandatory. Then if you put password: foo-#{myVariable} in an object field and inject myVariable=bar at activation, DP will interpolate that into password: foo-bar at runtime! So how do you inject a parameter in a GUI?

You can’t.

In fact, the GUI won’t even consider the pipeline as valid! You see, you used myVariable in a field, which means you have to add myVariable to the list of defined parameters in the pipeline definition. There’s also, coincidentally, no way to do that in the GUI. You can only declare parameters by defining the pipeline via API.

Let that sink in. The Data Pipelines GUI can’t create valid pipelines.

And we’re just getting started.
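The interpolation itself is simple enough to sketch. This is a toy reimplementation of the substitution semantics as I understand them, not AWS’s actual code:

```ruby
# Toy sketch of Data Pipelines' #{myVariable} interpolation.
def interpolate(field, params)
  field.gsub(/\#\{(\w+)\}/) do
    name = $1
    # Custom parameters must carry the mandatory my prefix.
    raise ArgumentError, "not a my-prefixed parameter: #{name}" unless name.start_with?("my")
    params.fetch(name)
  end
end

# Single quotes matter here: in a double-quoted Ruby string, #{...}
# is Ruby's own interpolation, which would fire before DP ever saw it.
puts interpolate('password: foo-#{myVariable}', { "myVariable" => "bar" })
# password: foo-bar
```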

CLIents of a Dark God

Okay, a broken GUI isn’t the end of the world. We were planning on doing everything through scripts anyway. AWS provides SDKs and CLIs for the DP API. [Acronym Joke]. To set up a data pipeline in a script, you have to do the following:

  1. Call create_pipeline with a pipeline name and a unique_id.
  2. Call put_pipeline_definition with your pipeline schema and the pipeline_id.
  3. Call activate_pipeline with any parameter overrides and the pipeline_id.

For the record, unique_id ≠ pipeline_id. The unique_id is used for absolutely nothing in the data pipeline or the API and, in fact, cannot even be queried for after creation. It exists solely to make creating pipelines more difficult.

When you create a pipeline, the return value is the pipeline_id, by which I mean the pipeline_id in a JSON object:

{
"pipelineId": "df-00627471SOVYZEXAMPLE"
}

Here we run into another problem with data pipelines: inexplicable breakages from side effects. For some reason, piping the output in bash (say, to pull the id with jq) caused that specific pipeline to bug out and become unusable. Using the CLI was out of the question; we’d do it in Ruby instead.
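The three-call dance looks something like this in Ruby. I’m sketching against the aws-sdk Data Pipeline client interface from memory — the method names (create_pipeline, put_pipeline_definition, activate_pipeline) are real, but treat the exact argument and response shapes as assumptions. Taking the client as an argument lets you exercise the orchestration without summoning AWS itself:

```ruby
# The create / put / activate dance, written against anything that
# quacks like Aws::DataPipeline::Client.
def deploy_pipeline(client, name:, unique_id:, objects:, parameter_values: [])
  # unique_id: required here, then never seen again.
  resp = client.create_pipeline(name: name, unique_id: unique_id)
  pipeline_id = resp[:pipeline_id] # the id you actually need from now on

  client.put_pipeline_definition(pipeline_id: pipeline_id, pipeline_objects: objects)
  client.activate_pipeline(pipeline_id: pipeline_id, parameter_values: parameter_values)
  pipeline_id
end
```

With the real gem you’d pass Aws::DataPipeline::Client.new; in tests, any stub with those three methods will do.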

The Color from out of Space

Now that we can actually upload our pipeline definition, how do we specify it? There are actually two different JSON schemas DP uses:

Importing or exporting from GUI or CLI:

{
  objects: [
    {
      field1: value1,
      field2: value2
    }
  ],
  parameters: [
    { type, id }
  ],
  values: {
    id1: value1,
    id2: value2
  }
}

Importing or exporting from API, validating from CLI:

{
  objects: [
    {
      fields: [
        { key: field1, string_value: value1 },
        { key: field2, string_value: value2 }
      ]
    }
  ],
  parameters: [
    {
      attributes: [
        { key: type, string_value: }
      ]
    }
  ],
  values: [ { id, value } ]
}

We’ll call the former schema gallant and the latter, goofus style. Fun facts:

  • A goofus schema of a data pipeline is about three times as large as the gallant schema.
  • The goofus schema stores the parameter values as an array, meaning if you want to replace a parameter value you have to iterate through the array and modify the existing key. If you don’t, DP will silently convert your parameter’s type to an array type.
  • There’s no provided way to convert between the two schemas. You have to do it yourself.
  • Did I mention you’re writing these definitions by hand, and that working with the goofus schema is absolutely awful?
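Since AWS won’t convert for you, here’s the gist of what a hand-rolled gallant-to-goofus converter amounts to. A simplified sketch, not our exact code: real pipeline objects also carry top-level id and name fields that don’t get folded into the fields array, and everything here gets stringified:

```ruby
# Convert a gallant-style definition (nested hashes) into the goofus
# style (arrays of { key, string_value } pairs).
def to_goofus(gallant)
  {
    objects: gallant[:objects].map do |obj|
      { fields: obj.map { |k, v| { key: k.to_s, string_value: v.to_s } } }
    end,
    parameters: gallant[:parameters].map do |param|
      { attributes: param.map { |k, v| { key: k.to_s, string_value: v.to_s } } }
    end,
    # Gallant stores values as a hash; goofus wants an array of pairs.
    values: gallant[:values].map { |id, v| { id: id.to_s, string_value: v.to_s } }
  }
end

# Because goofus values are an array, replacing one means finding and
# mutating the existing entry. Append a duplicate id instead and DP
# silently converts your parameter to an array type.
def set_goofus_value(values, id, new_value)
  existing = values.find { |v| v[:id] == id }
  if existing
    existing[:string_value] = new_value
  else
    values << { id: id, string_value: new_value }
  end
  values
end
```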

So: our choices were bash or Ruby. If we use bash, we run into the sporadic piping bug, and we’re trying to modify JSON objects with bash. If we use Ruby, we’re forced to use pipelines in a squamous mockery of a JSON schema.

Instead, we did both.

I belong to Azathoth now.

Validating

“Pipeline activation failed.” Why did it fail? We can call the ValidatePipelineDefinition API, which is supposed to tell us the errors. Instead it asks us to reupload the pipeline definition. Even though we already defined it, we need to upload it through that specific call in order to validate the definition. And it only accepts goofus schema.

“What if we validate through the CLI?” Good idea, that allows for gallant schema. It would require another subprocess call but — haha, just kidding, validate-pipeline-definition breaks with consistency and is the only CLI command that exclusively works with goofus schema. validate-pipeline-definition hates you and doesn’t want you to forget that.

“What if we validate through the GUI?” Good idea, but remember that the GUI rejects valid pipelines. If you check the GUI to find what’s wrong, you’ll be swamped under dozens of errors that aren’t errors at all. Fun fact: at least once we ran into a case where a pipeline activated properly, but after we looked at it in the GUI it retroactively broke on the command line, too. Using the GUI should be a last resort, after blind guessing and sacrificing your interns to Yog-Sothoth.

“What if we sacrifice interns to Yog-Sothoth?” Good idea, but interns are a nonrenewable resource. Let’s just go with the blind guessing.

Now I never have to log into Medium ever again! Woop woop!
