Mixpanel ingests petabytes of event data over the network from mobile, browser, and server-side clients. Because networks are unreliable, clients may retry sending an event until they receive a 200 OK response from Mixpanel. This retry strategy avoids data loss, but it can create duplicate events in the system. Analyzing data that contains duplicates is problematic: it gives an inaccurate picture of what happened, and it causes Mixpanel to diverge from other client data systems that we may sync with, such as data warehouses. This is why we care deeply about data integrity.
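To make the failure mode concrete, here is a minimal sketch of how retrying until a 200 OK can produce a duplicate. It is a local simulation, not Mixpanel's client or API: `FlakyServer`, `send_with_retry`, and the `drop_ack_on_attempts` parameter are hypothetical names invented for illustration. The assumed scenario is that the server stores the event but its response is lost on the way back, so the client cannot tell the write succeeded.

```python
from dataclasses import dataclass, field


@dataclass
class FlakyServer:
    """Hypothetical stand-in for an ingestion endpoint (not a real API)."""
    drop_ack_on_attempts: set  # attempt numbers whose response is lost in transit
    stored: list = field(default_factory=list)
    attempts: int = 0

    def ingest(self, event: dict) -> int:
        self.attempts += 1
        self.stored.append(event)  # the event is persisted either way
        if self.attempts in self.drop_ack_on_attempts:
            # The write succeeded, but the client never sees the 200.
            raise TimeoutError("response lost")
        return 200


def send_with_retry(server: FlakyServer, event: dict, max_attempts: int = 5) -> int:
    """Resend the event until a 200 arrives; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            if server.ingest(event) == 200:
                return attempt
        except TimeoutError:
            continue  # no acknowledgement, so the client tries again
    raise RuntimeError("gave up after max_attempts")


# The first response is lost, so the client sends the same event twice.
server = FlakyServer(drop_ack_on_attempts={1})
attempts = send_with_retry(server, {"event": "signup", "user": "u1"})
duplicates = len(server.stored) - 1
```

In this run the client needs two attempts, and the server ends up holding two identical copies of the same event. No data was lost, but the stored data now overcounts what actually happened.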

Today, we’re excited to share our solution to deduplicating event data at petabyte scale. …

