Day 5: Feed old events
I have a full archive of all segment events in our Postgres warehouse that I now want to feed the system. Let’s try to make that happen.
I have exported the postgres pages
and identifies
table in CSV files of 500k rows per file.
// file 1
SELECT * FROM prezly_prod_frontend.pages WHERE date_part('year', timestamp) = 2019 LIMIT 500000;// file 2
SELECT * FROM prezly_prod_frontend.pages WHERE date_part('year', timestamp) = 2019 LIMIT 500000 OFFSET 500000;
The database is using Amazon Aurora (postgres variant) so exploring the files went pretty smooth.
Now I have created two scripts to get that data into our new setup by:
- reading the file converting to JSON (some library)
- converting the flat (one column per property) to the event structure we need
- importing the records by 25 using dynamo
batchWrite
Here is the code which I copied (and modified) from a StackOverflow article
This dataset goes all the way back to 2016 and has millions of records. Because the batchWrite will upload this stuff pretty efficiently and trigger the process functions at a really good speed we have to make some changes to the DynamoDB and switch to On-Demand pricing.
Update the Resources
part in serverless
A quick sls deploy
will do the job. I did remove the entire infrastructure a few times testing this.
Now let’s feed some events:
node --max-old-space-size=4000 ./src/tasks/import_identify_from_file.js data/identifies.csvnode --max-old-space-size=4000 ./src/tasks/import_page_from_file.js data/2019_part_1.csvnode --max-old-space-size=4000 ./src/tasks/import_page_from_file.js data/2019_part_2.csvnode --max-old-space-size=4000 ./src/tasks/import_page_from_file.js data/2019_part_3.csv
It took about an hour to get all the events in there but now all history from 2018, 2019 and 2020 should be in the attribution tables.
With a larger dataset it might be more effective to go from SQL straight to DynamoDB or use one of Amazon Migration Services. In my case it was a great chance to see how it all performs under some stress and clean up the console.log
calls throughout the code :-)
Tomorrow I will explore how I can feed some events back to segment using analytics.js.
— -
Previous posts:
Day 0: Why am I doing this?
Day 1: The Plan
Day 2: Capturing and Storing segment events
Day 3: Analysing the Referrer (upcoming)
Day 4: Running in production + API
Day 5: Importing Historic Data
Day 6: Feeding attribution back to segment
Day 7: Trigger webhooks + Visitor Source Reporting (mixpanel)