Data transfer is a fundamental concept to Simon. At the core of our system, we load data from core customer systems and then ultimately transfer data out to marketing, sales, and support channels. Jason gave a talk yesterday at an AWS meetup in Santa Monica and provided a mile-high view of the topic.
So much of the big data ecosystem assumes that your data is just “there”. In reality, your data is both coming from somewhere and then ending up somewhere else.
ETL problems are pervasive, and many people are tackling them without even knowing it. If you’re building a business dashboard, you’re doing ETL. If you’re setting customer records programmatically into Salesforce, you’re doing ETL. If you’re syncing a customer segment into a Facebook custom audience, you’re joining customer behavior and then syncing it out to the web, which is ETL too.
ETL is inherently brittle and hard to test. Every stage can fail: sources break (database connection timeouts), transforms hit limits (out of memory errors during computation), and loads run into trouble (API connection issues, rate limiting). You have to set the expectation that things are going to break, and you need a clear picture of how they’re going to break.
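To make "expect it to break" concrete, here is a minimal sketch of one common way to handle transient failures such as timeouts and rate limits: retrying with exponential backoff. This is an illustration rather than the approach from the talk, and `TransientError` and `with_retries` are hypothetical names made up for this example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def with_retries(step, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run step(); on TransientError, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return step()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure instead of hiding it
            sleep(base_delay * 2 ** attempt)
```

Anything that is not a `TransientError` propagates immediately, so a real bug fails fast while a flaky connection gets a few more chances. The `sleep` parameter is injectable only so the backoff can be tested without actually waiting.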
Finally, as in any software engineering discipline, basic programming principles can be employed to maximize the reliability of these processes. Testing, logging, graphing, and system idempotency are critical.
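Idempotency is the least familiar item on that list, so here is a small sketch of what it can look like in a load step: rows are keyed by a stable ID, and re-running the same load overwrites them instead of duplicating them. The `customers` table, its columns, and `load_customers` are assumptions made up for this example, and SQLite stands in for whatever destination you actually load into.

```python
import sqlite3


def load_customers(conn, rows):
    """Load (id, email) rows so that re-running the same load changes nothing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id TEXT PRIMARY KEY, email TEXT)"
    )
    # INSERT OR REPLACE overwrites the row with the same primary key,
    # so a retried or repeated load never creates duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO customers (id, email) VALUES (?, ?)", rows
    )
    conn.commit()
```

With a load like this, a job that dies halfway through can simply be run again from the start, which pairs well with the expectation that things will break.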
See slides here for more detail — enjoy!