Do you duplicate the entire production dataset for mining, or does all inquiry start with aggregate…
Sean Gubelman

One of the core infrastructure teams maintains a follower of the main production database for analysis and reporting purposes as part of the broader data replication system. That adds a little bit of load to the other replicas, but it’s not too painful.

Right now, the larger problem for us is that a single monolithic database is not a great idea for scaling, so lots of other data stores are appearing in difference places, some of which it will be critical for us to have access to for research purposes. So we’re having to re-advocate for having this sort of access over and over again as our data ecosystem gets more diverse.

Most of our work, though, focuses on behavioral data from clients. Messages like “someone started watching a channel”, “someone chatted”, etc. That doesn’t live in the production database at all and is generally a richer dataset for studying products. But for accounting-level data or anything transactional or some kinds of user metadata we need the production replica.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.