From cURL to Automated Workflow

Getting ARK fund data is not hard; handling the downloaded data is

Geoffrey Hung
USF-Data Science
12 min read · Feb 24, 2021

--

I am not a big fan of the stock market, but even I keep hearing about the famous ARK funds recently [their bets on Tesla and Square!]. Frankly, I admit I am very curious about what they did; after all, who doesn't like money?

Interestingly, ARK funds are transparent enough to publish their holdings daily so people can study what they are holding, but they only share the snapshot of that day [ie: what they are holding right now]. To know what they held before and how the holding weights changed over time, we need multiple snapshots.

In this article, I will walk you through my side project from last weekend: scraping the data from the ARK site with the command line, pipelining it into BigQuery, testing data quality with dbt, and lastly automating and scheduling the whole workflow with GitHub Actions.

High-level flow

We will explore some data engineering problems and WHY these tools are useful or even necessary if you don't want to be a firefighter/human debugger all the time. Our goal is to leverage as many engineering tools as possible to create an automated workflow, so that our limited life can be spent on something more meaningful.

>>>TALK IS CHEAP<<<

Functional Requirements

  1. Get the data
  2. Store and access the data
  3. Check data in (1)
  4. Regularly repeat (1), (2), (3)

Nice-to-have:

  1. Cost: Both time and $, ideally zero after development
  2. Quality: Either alert me when something is wrong, or let me use the data directly without second-guessing it
  3. Reliability: Don't rely on a human or a specific computer
  4. Workflow: Monitor the whole process, alert me if things go wrong

Normal approach: this is not unheard of!

  1. Get the data: Manually click the links and save them as CSV
  2. Store and access the data: Store somewhere in a shared folder and open with Excel
  3. Check data in (1): Eyeball check OR create many pivot tables
  4. Regularly repeat (1), (2), (3): Human

More advanced approach: use code

  1. Get the data: requests in Python OR curl in the command line
  2. Store and access the data: store somewhere in a shared folder and use a Jupyter Notebook to access the data
  3. Check data in (1): Many plots OR many cells of asserts using Pandas
  4. Regularly repeat (1), (2), (3): cron job

The drawbacks are obvious:

  • What if someone is on sick leave or the computer is down and we miss a few days? [Data lost!]
  • Who knows where exactly they store the files? [maybe it is written in some doc/email/shared folder? who knows]
  • Can we trust their checks? [Sorry, I can't, because I can't even trust my own checks]

In short, these approaches are unreliable; there are endless ways for things to go wrong. Some people think that is the value of a data team: having the capability to hypothesize about potential data issues. But being able to do it doesn't mean I need to do it "manually" and "repeatedly".

Recall what makes software eat the world: The marginal cost of software approaches zero. We should develop reliable solutions to automate as much as possible to reduce the marginal cost!

Get the data: command line

Firstly, how can we get the data? Getting it from the ARK site is no more than running the following command in the terminal:

curl result

But seeing it in the terminal is not fun; we want to download it as a CSV file! No problem, just add > ARKK.csv
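A minimal sketch, assuming the holdings URL ARK exposed at the time of writing (it may well have changed since):

```bash
# Fetch the ARKK holdings snapshot and save it as a CSV file
curl -s "https://ark-funds.com/wp-content/fundsiteliterature/csv/ARK_INNOVATION_ETF_ARKK_HOLDINGS.csv" > ARKK.csv
```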

We open the CSV and see, hmmm, what are these things… these remarks are not part of the data… After observing the pattern, it turns out we can simply remove the last 3 rows. What can I do? Maybe write a Python script to remove the last 3 rows [seems like overkill].

csv weird ending

For a simple thing like this, consider the command line; here I build a pipe:

tail -r | sed '1,3d' | tail -r simply means:

reverse the rows, delete the first 3 rows, reverse the rows

aka: remove the last 3 rows
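Putting it together (note: tail -r is the BSD/macOS flavor; on GNU systems, head -n -3 does the same job in one step):

```bash
# Strip the 3 disclaimer rows at the bottom of the downloaded CSV
cat ARKK.csv | tail -r | sed '1,3d' | tail -r > ARKK_clean.csv

# GNU coreutils equivalent
head -n -3 ARKK.csv > ARKK_clean.csv
```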

Am I smart enough to write this on my own? Surely not; Google it!

Nice materials to learn more about scripting

With an extra for loop and some transformations, that's the core part of collect_and_load.sh! OK, now we have a script in hand that downloads the data into a folder when we run it. The next problem is: where should we store the data so that we can access it later?
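A hedged sketch of that core idea; the fund list, URL pattern, and output folder here are placeholders, and the real collect_and_load.sh differs:

```bash
#!/usr/bin/env bash
set -euo pipefail

TODAY=$(date +%F)
mkdir -p data

for FUND in ARKK ARKQ ARKW ARKG ARKF; do
  # Placeholder URL pattern -- ARK's real per-fund URLs differ and change over time
  URL="https://ark-funds.com/download/${FUND}_HOLDINGS.csv"
  curl -s "$URL" \
    | tail -r | sed '1,3d' | tail -r \
    > "data/${TODAY}_${FUND}.csv"   # drop the 3 trailing disclaimer rows
done
```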

Store & Access the data: BigQuery

Although people say data is all just binary in the end, different data formats require appropriate ways to store and access it. Data could be tabular [like CSV], objects [like JSON], images [like PNG], text [like strings], audio [like WAV], etc.

In our case, the file format is CSV. It belongs to a huge family [structured data], which is the most typical data format, and the standard way to query it is SQL [it basically appears in every DS interview]. So the problem is: how can we set up a database so that we can upload and query the data?

Aside:

For relational databases, there are two main types: OLAP (Online analytical processing) and OLTP (Online transaction processing).

OLAP is for analytics purposes; when people talk about a data warehouse, they usually refer to OLAP. It can run aggregation queries over tons and tons of data in seconds. One of the reasons it is so fast is that the data is stored in columnar format [eg: Parquet], ie: data in the same column are stored nearby, so some people also call them columnar databases. Famous cloud solutions include: BigQuery [GCP], Athena, Redshift [AWS], Snowflake

OLTP is for transaction purposes; when people develop applications [try to google some web application tutorial that requires a database], they usually refer to OLTP. It can access and operate on a single row very fast because the data is stored in row format, ie: data in the same row are stored nearby. Famous solutions include: MySQL, PostgreSQL [SQLAlchemy, often mentioned alongside them, is a Python toolkit for talking to such databases rather than a database itself]

You can check out the Data Engineering lecture at https://stanford-cs329s.github.io/syllabus.html for an overview

The solution I picked was Google BigQuery; the key reasons are:

  • Serverless: Pay what I use, I don’t want to do any DevOps work for a side project and the free quota is enough
  • Scalability: Although it won’t be a case for this project, it is always nice to have a solution that can handle TBs+ level data
  • Console: query in the browser, no client software is needed
  • dbt: dbt supports BigQuery [explained below]

I am not biased toward GCP; AWS does equally well, check out Athena / Redshift

The steps are boring [It is a compliment because it means simplicity], basically:

  1. Go to GCP / BigQuery, create a project and dataset, then use the schema in bq_schema.txt and set date as the partition field to create the table

Aside:

Partitioning is one of the easiest and most useful optimizations. Imagine we have a TB-level data warehouse; we definitely don't want to scan the whole dataset for every query [because the pricing is based on how much data is processed]. Partitioning means designating a field as a kind of "index": when you filter on the partition field in the where-clause, it will ONLY scan the data in those partitions, hence reducing the scan size!
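For example, creating the date-partitioned table with the bq CLI could look roughly like this (project, dataset, and table names are placeholders; the console UI achieves the same thing):

```bash
# Create a table partitioned on the date column, using the schema file from the repo
bq mk --table \
  --schema bq_schema.txt \
  --time_partitioning_field date \
  my-project:ark.holdings
```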

2. Generate a service account credential and put it in the right location

Required permission reference

3. Use bq_load.py to load data to BigQuery: we first remove any existing data for that date [to avoid duplication] and then load the new data

Reference: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv
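A hedged sketch of that delete-then-load step using the google-cloud-bigquery client; the table name, file name, and date are placeholders, and the real bq_load.py may differ:

```python
from google.cloud import bigquery

client = bigquery.Client()             # picks up the service account credentials
table_id = "my-project.ark.holdings"   # placeholder project/dataset/table
load_date = "2021-02-23"

# 1. Remove rows already loaded for this date to avoid duplicates on re-runs
client.query(
    f"DELETE FROM `{table_id}` WHERE date = @d",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("d", "DATE", load_date)]
    ),
).result()

# 2. Append the cleaned CSV into the partitioned table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
with open("ARKK.csv", "rb") as f:
    client.load_table_from_file(f, table_id, job_config=job_config).result()
```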

Data is loaded to BQ yoho

More optimizations could be done for bigger data, but I don't think they matter much for this project; I just share some possibilities below:

a. Load to GCS first instead of loading to BigQuery directly. This keeps the raw data safely in Cloud Storage and lets us load bigger files into BigQuery [local upload only allows 10MB max]. A rough sketch follows this list.

b. Save the CSV as gzip/Parquet. I skipped this for convenience & showcase purposes because the data is small
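For (a), the route would roughly be (bucket, dataset, and table names are placeholders):

```bash
# Stage the file in Cloud Storage, then load it into BigQuery from there
gsutil cp ARKK.csv gs://my-ark-bucket/2021-02-23/ARKK.csv
bq load --source_format=CSV --skip_leading_rows=1 \
  ark.holdings gs://my-ark-bucket/2021-02-23/ARKK.csv
```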

Great, now we can use the script to get the data, and store & query it in BigQuery successfully. What's next? I remember a conversation with a former colleague [another team, non-tech person]; I told her: giving you a number is easy, giving you the right number is super hard!

Why hard? The data we use in school is mostly cleaned, but in reality anything can happen. How can we have more confidence in the data quality?

Disclaimer:

At most, we can only check that the data fulfils certain logic; we can't auto-correct wrong data. It is similar to testing in software: if the tests fail, there is definitely a problem, but even if everything passes, it doesn't mean the program is correct

Check the data: dbt

Nowadays, with more and more cloud solutions, storage and data access are rarely the bottleneck for most companies; out-of-the-box solutions such as BigQuery (Redshift / Athena), Google Cloud Storage (S3), and Apache Spark can store and process TB-level data easily

Still, one of the most annoying things in DS is dirty data. It is like a disease: the earlier you find it, the easier it is to cure, but it is hard to detect until someone actually uses the data.

Let’s try a dummy example:

Suppose we want to join `users` information onto the `orders` table; can you spot the potential issues?

Join two tables

My answers below:

orders:

  • negative amount!?

users:

  • duplicated uid!?!?
  • NULL value!?!?!?

join:

  • since uid is duplicated in the users table, the join may fan out and inflate the joined table

If you can spot them all, nice!

Now imagine the tables become tens of columns and millions of rows, and I want you to join across multiple tables. What will you do?

This is one of the most common errors because, in practice, I simply look at the schema and start joining until the result looks good. The queries are vulnerable to these little issues because they are not easy to enumerate and are super annoying to examine even if you can list them all. When the bomb explodes, the downstream pipelines are usually already polluted :(

This pattern is similar to software: it is hard to imagine any mature program without testing [unit testing at least]. So is it possible to introduce testing to the data world? Let me introduce you to dbt (data build tool)!

dbt is famous for doing data transformations in the data warehouse, yet my favorite feature is data testing: I can use SQL to write unit tests! How?

Using the ARK CSV as an example, I want to assert a few things:

Example ARKK data
  1. Numeric fields should be positive
  2. The sum of weights should be (almost) 100%
  3. (date, fund, company) pair should be unique

Implemented test cases

What does this query do? It returns records where shares, market_value, or weight ≤ 0! [If the data is normal, it returns no records]
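A minimal sketch of such a test, assuming a dbt source named ark with a holdings table (dbt treats any returned rows as failures):

```sql
-- tests/assert_positive_values.sql  (file name is illustrative)
select *
from {{ source('ark', 'holdings') }}
where shares <= 0
   or market_value <= 0
   or weight <= 0
```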

That's it; dbt handles the rest. I simply run:
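```bash
dbt test   # compiles every test to SQL and runs it against the warehouse
```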

Notice that I skip the setup part: dbt needs credentials and a profiles.yml to connect to the data warehouse. Under the hood, it compiles our tests [turns them into queries] and runs them: PASS if nothing is returned, otherwise ERROR.
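For reference, a BigQuery profiles.yml looks roughly like this (profile name, project, dataset, and keyfile path are placeholders):

```yaml
my_ark_profile:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: service-account
      keyfile: ./gcp-credentials.json   # the service account key from the earlier step
      project: my-project
      dataset: ark
      threads: 1
```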

dbt test output

Wooow, TBH I am a little relieved when I see all the tests pass. I know I can work on data that makes some sense, instead of wasting hours or days only to find out it was a data issue.

I personally think this is a game-changing feature because we can adopt testing practices from software engineering! Maybe it won't be dbt in the end, but something similar with a low learning curve should be part of the data engineering tech stack

On one hand, it encourages the team to write data tests instead of ad hoc exploration queries. We can leverage what others have developed and build on top of it. I can be less anxious when I need to review 200+ lines of SQL and figure out what went wrong.

On the other hand, it enables a new workflow that checks data quality from upstream. With more iterations and better test coverage, data quality can improve gradually because we are leveraging the team's intelligence [don't make the same mistake twice]

dbt itself is worth a series of articles, and I only use one of its features. To learn more about dbt, try their tutorial: https://docs.getdbt.com/tutorial

Below are my immature opinions on dbt:

  • Integrating dbt testing into an existing data warehouse project is straightforward: just start a dbt project, set up the source.yml, and you can start writing tests. You don't need to migrate the transformation (the T in ELT) part yet! Strongly suggested for existing projects if your current data warehouse is 'test-naked'
  • I think dbt transformations are great if your data size is small [so that you can afford full-table regeneration]. But for partitioned data, I don't think the materializations are mature enough for big data yet; feel free to correct me if you know how to precisely update one partition. I suggest using it to manage views as a start!

I am not biased toward dbt; you should also check out https://greatexpectations.io/

Automate the work: GitHub Actions

To recap what we have achieved so far: we have a script to get the data, we can load & query it in BigQuery, and we have data testing with dbt. Things seem quite automated already, except that someone still needs to run the commands from time to time :(

I definitely don't want to do this myself. Instead, we will use GitHub Actions as our CI & scheduler and build the workflow. Recall what we want it to do:

  1. At a specific time daily, kickstart the workflow
  2. Use the script to download the data and load it to BigQuery
  3. Run dbt test
  4. Save the downloaded data to our GitHub repo as well [optional]

With clear goals in mind, time to Google!

Firstly, I found a nice GitHub repo that downloads the data from the ARK site using a GitHub Actions workflow, except it only keeps the latest data, whereas we want to keep every date. Still, it is a nice base template!

Then I want to load data to BigQuery. I definitely don't want random people to be able to load data into my BigQuery, so some credentials are needed. After some googling, I found the google-github-actions repo for setting up credentials in GitHub Actions

Next, we want to run dbt test in GitHub Actions, and I found code snippets from others, like this

After much trial and error, I got my final script!
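The real workflow file lives in the repo; below is a condensed, hedged sketch of its shape. The action versions, secret names, and file paths are assumptions and may be outdated:

```yaml
name: ark-daily
on:
  schedule:
    - cron: '30 0,8,16 * * *'   # 0030 / 0830 / 1630 (GitHub Actions cron runs in UTC)
  workflow_dispatch: {}

jobs:
  collect-load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      # Make GCP credentials available to bq_load.py and dbt
      - uses: google-github-actions/setup-gcloud@master
        with:
          service_account_key: ${{ secrets.GCP_SA_KEY }}
          project_id: ${{ secrets.GCP_PROJECT }}
          export_default_credentials: true

      - name: Download ARK holdings and load to BigQuery
        run: bash collect_and_load.sh

      - name: Run dbt tests
        run: |
          pip install dbt-bigquery   # or plain dbt, depending on the dbt version
          dbt test --profiles-dir .

      - name: Commit downloaded CSVs back to the repo
        run: |
          git config user.name github-actions
          git config user.email github-actions@github.com
          git add data/
          git commit -m "Daily ARK holdings" || echo "nothing new to commit"
          git push
```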

Trial and error with GitHub Actions

Code Structure:

The numbers represent my development steps; one means my first step

Conclusion:

That's it: I have a workflow that runs daily at 0030, 0830, and 1630 to download data from the ARK site, load it to BigQuery, run quality checks using dbt, and push any newly downloaded data to the GitHub repo.

Scheduled workflow

We use a GitHub Actions workflow as the scheduler and Google BigQuery as the database, all within the free quota. And the most amazing part is that they are detached from me and my computer; I don't need to worry too much about service reliability.

My attitude toward outsourcing service reliability to cloud services [from The Hunger Games]

If next time the data source isn't the ARK site anymore, of course I will need to modify some code, but the workflow is similar and scales to much bigger data (up to the point where BigQuery can't handle it OR you can't afford the bill; I guess the latter comes first)

OK, the article is longer than I expected; let's stop here. I hope you find it useful!

You can star my repo!

OR add me on LinkedIn [welcome, but please leave a few words to indicate you are not a zombie]!

Stay safe everyone! Remember to wear the mask!
