Day 5: Feed old events

Gijs Nelissen
Solving Marketing Attribution
2 min read · Apr 28, 2020

I have a full archive of all Segment events in our Postgres warehouse that I now want to feed into the system. Let's try to make that happen.

I have exported the Postgres pages and identifies tables into CSV files of 500k rows each.

-- file 1 (ORDER BY keeps the LIMIT/OFFSET paging stable)
SELECT * FROM prezly_prod_frontend.pages WHERE date_part('year', timestamp) = 2019 ORDER BY timestamp LIMIT 500000;
-- file 2
SELECT * FROM prezly_prod_frontend.pages WHERE date_part('year', timestamp) = 2019 ORDER BY timestamp LIMIT 500000 OFFSET 500000;

The database runs on Amazon Aurora (PostgreSQL-compatible), so exporting the files went pretty smoothly.

Now I have created two scripts to get that data into our new setup by:

  1. reading the CSV file and converting each row to JSON (with a CSV parsing library)
  2. converting the flat rows (one column per property) into the event structure we need
  3. importing the records in batches of 25 using DynamoDB batchWrite

Here is the code, which I copied (and modified) from a Stack Overflow answer:

src/tasks/import_page_from_file.js
src/tasks/import_identify_from_file.js
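
In rough outline, both scripts do something like this. This is a minimal sketch, assuming the csv-parse library and the aws-sdk DocumentClient; the table name and column names are placeholders, not the real schema:

// Sketch only — library choice, table name and columns are assumptions.
const fs = require('fs');
const { parse } = require('csv-parse/sync');
const AWS = require('aws-sdk');

const dynamo = new AWS.DynamoDB.DocumentClient();
const TABLE = process.env.EVENTS_TABLE || 'attribution-events'; // assumed table name

async function importFile(path) {
  // 1. read the CSV export and turn every row into a plain object
  const rows = parse(fs.readFileSync(path), { columns: true });

  // 2. map the flat columns onto the event structure the attribution code expects
  const events = rows.map((row) => ({
    anonymousId: row.anonymous_id,
    type: 'page',
    timestamp: row.timestamp,
    properties: { url: row.url, referrer: row.referrer },
  }));

  // 3. write in chunks of 25 — the hard limit for a single DynamoDB batchWrite call
  for (let i = 0; i < events.length; i += 25) {
    const chunk = events.slice(i, i + 25);
    await dynamo.batchWrite({
      RequestItems: {
        [TABLE]: chunk.map((event) => ({ PutRequest: { Item: event } })),
      },
    }).promise();
  }
}

importFile(process.argv[2]).catch(console.error);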

This dataset goes all the way back to 2016 and has millions of records. Because batchWrite uploads everything pretty efficiently and triggers the processing functions at a high rate, we have to make a change to the DynamoDB tables and switch to on-demand pricing.

Update the Resources section in the serverless config:
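
Switching to on-demand capacity comes down to setting BillingMode on the table resource. A minimal sketch of the relevant serverless.yml block, assuming a table keyed on anonymousId and timestamp (the actual names and keys may differ):

# Sketch — only BillingMode is the point here; table name and keys are assumed.
resources:
  Resources:
    EventsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: attribution-events
        BillingMode: PAY_PER_REQUEST   # on-demand instead of provisioned capacity
        AttributeDefinitions:
          - AttributeName: anonymousId
            AttributeType: S
          - AttributeName: timestamp
            AttributeType: S
        KeySchema:
          - AttributeName: anonymousId
            KeyType: HASH
          - AttributeName: timestamp
            KeyType: RANGE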

A quick sls deploy does the job. (I did tear down the entire infrastructure a few times while testing this.)

Now let’s feed some events:

node --max-old-space-size=4000 ./src/tasks/import_identify_from_file.js data/identifies.csv
node --max-old-space-size=4000 ./src/tasks/import_page_from_file.js data/2019_part_1.csv
node --max-old-space-size=4000 ./src/tasks/import_page_from_file.js data/2019_part_2.csv
node --max-old-space-size=4000 ./src/tasks/import_page_from_file.js data/2019_part_3.csv

It took about an hour to get all the events in, but now all history from 2018, 2019 and 2020 should be in the attribution tables.

With a larger dataset it might be more effective to go from SQL straight to DynamoDB, or to use one of the Amazon migration services. In my case it was a great chance to see how it all performs under some stress and to clean up the console.log calls scattered throughout the code :-)

Tomorrow I will explore how I can feed some events back to Segment using analytics.js.

---

Previous posts:

Day 0: Why am I doing this?
Day 1: The Plan
Day 2: Capturing and Storing segment events
Day 3: Analysing the Referrer (upcoming)
Day 4: Running in production + API
Day 5: Importing Historic Data
Day 6: Feeding attribution back to segment
Day 7: Trigger webhooks + Visitor Source Reporting (mixpanel)
