Stories by Jeremy Stanley on Medium

An All-In Founder Forum

Jeremy Stanley — Wed, 17 May 2023 12:53:10 GMT

The All-In Founder Forum retreat in Tahoe in 2023, from left to right: Dan, Kaitlyn, Laura, Jeremy, Abby, Colin, Celine, Mitch, and Jeffrey

“I can’t keep this up,” confided the founder to a group of supportive peers who had become close friends. “I burned out months ago, and now I’m at the point where… I can’t even find the words.” With their head in their hands, a heavy silence filled the air as the fellow founders on the call nodded in understanding and empathy.

“I know I need to slow down, but whenever I try, I feel like I’m losing control,” the founder continued. “My fear of failure, or really, my fear of proving my parent’s doubts about me, is just overwhelming.”

Although the above conversation is a fictional blend of real experiences, it captures the deep self-reflection many startup founders may need to face. Coping with such intense emotions can be debilitating, but founders don’t have to confront these challenges alone.

One year into COVID-19, First Round organized a series of remote Founder Forums. They assembled groups of early-stage founders and led them in six virtual sessions to share best practices, build community, and, perhaps most importantly, provide support as the world changed rapidly.

Founders are busy people. We have employees to hire, products to build, prospects to pitch, and investors to update. Devoting two hours every month was a daunting prospect. But we each chose First Round as our seed investor partly because of the strength and diversity of their community. So we committed and dove in.

Those early forums were great. But they were temporary by design, participation was inconsistent, and the subjects were mostly tactical. Over time, they would fade into the background noise of our early-stage journeys.

But Dan Siroker had a plan. He had previously participated in forums organized by YPO while he was the co-founder and CEO of Optimizely. He wanted to take the best of what he had experienced in prior forums, optimize it for early-stage founders, and execute it virtually.

So Dan joined two of the First Round forums. After they concluded, he invited the most open and committed participants with varied backgrounds into a new forum that could live on indefinitely. We call this group the All-In Founder Forum, and our shared goal is to become the best versions of ourselves.

This post aims to inspire other founders to create similar forums. In addition to our story, we will share some of how we organize and sustain the forum. And in the end, we invite you to join us by applying for our one open spot.

The members include, from left to right in the header photo above, taken at our 2023 annual retreat in Tahoe:

Dan Siroker at Rewind, access perfect memory
Kaitlyn Knopp at Pequity, compensate with confidence
Laura Del Beccaro at Sora, seamless HR workflows
Jeremy Stanley at Anomalo, automate data quality
Abigail Holtz at The Lobby, scale influencer marketing
Colin Morelli at Source, improve patient outcomes
Celine Halioua at Loyal, dog lifespan extension
Mitch Gordon at Verto, begin college abroad
Jeffrey Dietrich at Rarebird, mental wellness coffee

The All-In Founder Forum meets virtually for two hours every month on the fourth Wednesday from 2–4 pm PT. We also meet in person for a three-day yearly retreat in the first week of February.

During these meetings, we discuss topics that span our work, personal, and family lives. We discuss wins, losses, challenges, and everything in between. Nothing is off-limits. Everything is confidential. We go deep and are vulnerable with one another because we believe that is critical to becoming the best versions of ourselves.

When communicating with each other, we observe various depths of meaning:

The first 20% are the cliff notes, which are cursory, sanitized, and generic
The next 60% are the meat of the story, the intellectually interesting details and logic
The next 15% are the feelings, both positive and negative, and how they affect us
The final 5% is the layer behind those feelings, the deeper narratives and emotions we are often unaware of in ourselves

We always encourage each other to find and explore that final 5%. We demonstrate by example, never judge one another, and share our support and love for one another through thick and thin.

This has led us through topics such as:

Depression and anxiety and the struggle for happiness
Love and empathy and the beauty of relationships
Narcissism and selfishness and the guilt they leave behind
Glory and satisfaction in achieving the perceived impossible
Childhood trauma and challenges and the reverberations that echo
Intelligence and creativity and the wonder of creation

Sharing this 5% may seem dramatic, scary, or indulgent. But every time someone goes there (and it happens often), we discover something new about ourselves. And our bond strengthens.

We want the All-In Founder Forum to last forever. But that takes a great deal of thoughtfulness and commitment. So we set high expectations for when and how we participate in these sessions and meetings.

Every member attends the in-person retreat and keeps work to the barest possible minimum during those days. We began our last retreat by sharing what we had each given up to participate. They included:

Missing time with family, children, and beloved pets
Delegating or delaying mission-critical customer or development activity
Stepping away from a fundraising process
Navigating layoffs or restructuring

This reinforced how important the retreat was to each of us and highlighted what, if anything, might still be distracting us.

We also hold ourselves accountable. We have a written charter to which we have all committed, outlining our norms and responsibilities. We have defined roles for moderation, logistics, and recruiting, which rotate annually.

In addition, every member is expected to attend every monthly virtual session, arrive on time, and remain through the end.

This may sound extreme, but the temptation to de-prioritize this forum is high in any given week. It is important to us, but it is never urgent. If we succumb, then the group dynamics suffer. We don’t share the same journey, we don’t benefit from each other’s wisdom, and resentment can build.

In the last year, members attended the sessions fully 84% of the time, arrived late or left early 9%, and missed only 7% of the time (vacations). We take attendance seriously and enforce it strictly.

We organize each session roughly as follows:

[3 min] Transition
[2 min] Short meditation to clear our minds
[1 min] Confidentiality reminder
[5 min] Check-in: on a scale from 0 to 100%, how present are you, and what, if any, urgent items might distract you today
[45 min] Updates — what’s the ONE business, family, or personal update since our last forum you want to share with the group and the 5% feeling associated with it?
[5 min] Break
[60 min] Four-Step Forum Exploration, where one member who is stuck on something goes deep into their challenge, and we help explore their circumstances and feelings, what they need from the forum, how our own experiences resonate with their challenge, and how we can support them going forward

Combined with our devoted attention, these ingredients lead to meaningful and valuable sessions and a tightly bonded group of friends. When individuals need urgent support on specific topics, we organize infrequent and optional ad-hoc meetings, usually on the weekend.

Our paths as founders can be lonely, harrowing, and deeply gratifying experiences. We are often isolated from our teams, co-founders, friends, and loved ones. Most of us will fail, and our journeys are rollercoasters of extreme highs and lows.

But we chose this path, and we are privileged to have the opportunity to dedicate ourselves to creating new businesses. The struggles we encounter create lives that are more worth living. The All-In Founder Forum has amplified our experiences.

We hope this post inspires others to create new founder forums that can have the same impact as ours. But we are also recruiting a tenth member to join us. We seek candidates who:

Are founders of venture-backed startups that are varied and non-competitive
Are individuals who will embrace vulnerability and commitment
Will enhance the diversity of ethnicity, gender, identity, experience, and personality in our forum

If you are interested in joining us, please tell us more about yourself and your startup here.

Detecting Extreme Data Events

Jeremy Stanley — Tue, 16 Nov 2021 16:53:51 GMT

Automatically detect and understand extreme events in real-world data

When analyzing data, we often think about high-level metrics — how much revenue did we generate yesterday? How many users queried this API end-point? How many times was this product purchased?

But sometimes these high-level metrics can become heavily skewed by extreme events.

For example, a bug in processing a discount code can allow a customer to check out with a -$10 shopping cart. This might go unnoticed for a while, allowing the code to be shared online. Now thousands of customers are checking out with negative shopping carts, affecting your revenue and profitability metrics.

In other cases, such events don’t affect metrics, but can still negatively impact your business. For example, a new IP address can begin generating 3% of your site visits because a competitor is scraping your site. If undetected, the competitor can spend days or weeks with unrestricted access to your content.

How many black swans are lurking in your data lake?

Staying on top of these rare events can be a daunting task in a warehouse containing 100s of tables with 1,000s of columns and millions or billions of rows of data. As such, an automated solution is crucial.

Such rare events are called outliers. And in the rest of this article, we’ll briefly cover traditional outlier detection approaches and why they don’t work well for most real-world use cases.

Then we’ll explain our automated solution to outlier detection, and show examples using public data from New York 311 incident reports.

There are many algorithmic approaches to detecting outliers in samples of data. For example, from the scikit-learn documentation, we have this comparison of techniques on a toy dataset:

Example outlier detection methods applied to a simple two-dimensional dataset.

In the above visualization, each column is an outlier detection method. Each point is colored blue if the approach has classified it as an outlier and orange otherwise.

In this example, all of these methods appear to be reasonable. But such “toy” datasets fail to capture the complexity of identifying outliers for a real business amassing vast volumes of data.

Real-world applications have tens or hundreds of columns of data in a wide range of formats, not just two continuous values. Data also changes over time — what was an outlier six months ago may have become a regular event today.

There are also many outliers in real-world datasets that are simply irrelevant. The user that bought 500 lemons may be an outlier, but the business may not need to do anything in response.

Finally, such approaches can be challenging to interpret, especially when working with large datasets. Understanding why a specific data point is an outlier is critical to informing any action taken.

At Anomalo, we have taken a practical approach to outlier identification with our Entity Outlier check.

First, we narrow the scope of the problem by requiring two pieces of input:

What type of entity do we want to identify as an outlier?
What specific metric do we want to evaluate for outliers?

For example, if we want to detect when new IP addresses begin scraping our site, then the IP address would be the entity, and site visits would be the metric.

Then we apply time series anomaly detection and root cause analysis techniques to provide very reliable, clear, and relevant outlier detection.

As an example, consider the following configuration for the New York 311 call service request data:

Configuring an entity outlier check in Anomalo

The entity we are monitoring here is the incident_address, which is the location in NY where the 311 incident report occurred. And the metric we are evaluating is the number of complaints per address, computed using the count(1) SQL snippet.

We are trying to answer the question: were there any addresses with an exceptionally high number of complaints on the most recent date?

The check creates a dataset that has the following shape:

address: where the call was reported from
date: reporting date
complaints: the number of complaints from that address on this date

We then take the maximum number of complaints by date, creating a time series with this shape:

date: reporting date
max_complaints: the maximum number of complaints received by a single address on this date
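The two reshaping steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not Anomalo's implementation; the sample rows, addresses, and function names are invented for the example.

```python
from collections import Counter

# Hypothetical raw rows: one (address, date) pair per complaint.
rows = [
    ("1 Example St", "2021-07-17"),
    ("1 Example St", "2021-07-17"),
    ("2 Sample Ave", "2021-07-17"),
    ("3 Demo Blvd", "2021-07-18"),
    ("3 Demo Blvd", "2021-07-18"),
    ("3 Demo Blvd", "2021-07-18"),
    ("2 Sample Ave", "2021-07-18"),
]

def complaints_per_entity(rows):
    """Step 1: count complaints per (address, date), like count(1) with a GROUP BY."""
    return Counter(rows)

def max_complaints_by_date(per_entity):
    """Step 2: keep the largest per-address count for each date."""
    series = {}
    for (address, date), complaints in per_entity.items():
        series[date] = max(series.get(date, 0), complaints)
    return dict(sorted(series.items()))

per_entity = complaints_per_entity(rows)
max_series = max_complaints_by_date(per_entity)
print(max_series)
```

Taking the maximum collapses thousands of entities into one value per date, so a single time series model can watch the whole table.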

Then we can model this as a time series and detect if there is an outlier on the most recent date:

The time series of the maximum number of complaints per address in the 311 complaints data, with July 18th shown as a clear outlier

On most days, the maximum number of complaints is between 10 and 50, with regular spikes upwards of 500. But on the most recent date, July 18th, there was an address with 2,227 complaints, which is truly exceptional!
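To make the idea of flagging the most recent date concrete, here is a rough sketch that swaps the time series model for a much simpler robust z-score (distance from the median, scaled by the median absolute deviation). The series values and the threshold of 5.0 are assumptions chosen for illustration; only the final value of 2,227 comes from the example above.

```python
from statistics import median

def latest_is_outlier(series, threshold=5.0):
    """Flag the last value if it sits far above the typical range of the history."""
    history, latest = series[:-1], series[-1]
    center = median(history)
    mad = median([abs(x - center) for x in history])
    # 1.4826 makes the MAD comparable to a standard deviation for normal data;
    # fall back to 1.0 when the history is constant to avoid dividing by zero.
    scale = 1.4826 * mad if mad > 0 else 1.0
    score = (latest - center) / scale
    return score > threshold, score

# Illustrative daily maximums: mostly 10 to 50, one earlier spike, then the surge.
max_complaints = [22, 31, 18, 45, 520, 27, 35, 40, 29, 2227]
flagged, score = latest_is_outlier(max_complaints)
print(flagged, round(score, 1))
```

A rule this simple would also flag the regular spikes around 500, which is one reason a proper time series model that learns the usual spike pattern is a better fit in practice.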

We can then automatically see which locations were responsible for this massive surge in complaints:

The address 672 East 231st Street clearly has far more complaints than any of the other top addresses

We then automatically profile all of the complaints about this address and compare them to other incidents that occurred on the same day to identify what segments characterize them:

A root cause analysis of the spike in complaints, indicating that they are all for “Loud Music/Party”

In particular, it looks like 100% of the complaints at this location were of the ‘Loud Music/Party’ description, compared to only 35% of all other complaints on July 18th.
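The comparison behind this kind of explanation can be approximated by computing, for each value of a column, its share among the outlier entity's rows and its share among all other rows, then picking the value with the largest gap. The sketch below is a simplified stand-in for a real root cause analysis; the rows, the second complaint type, and the function names are hypothetical.

```python
from collections import Counter

def segment_shares(rows, column):
    """Share of rows taken by each distinct value of a column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def top_distinguishing_segment(outlier_rows, other_rows, column):
    """Value whose share among the outlier rows most exceeds its share elsewhere."""
    inside = segment_shares(outlier_rows, column)
    outside = segment_shares(other_rows, column)
    value = max(inside, key=lambda v: inside[v] - outside.get(v, 0.0))
    return value, inside[value], outside.get(value, 0.0)

# Hypothetical data: every complaint at the outlier address is about loud music,
# while 7 of 20 (35%) of the other complaints that day are.
outlier_rows = [{"descriptor": "Loud Music/Party"}] * 4
other_rows = [{"descriptor": "Loud Music/Party"}] * 7 + [{"descriptor": "Blocked Driveway"}] * 13

segment, share_inside, share_outside = top_distinguishing_segment(
    outlier_rows, other_rows, "descriptor"
)
print(segment, share_inside, share_outside)
```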

Either this was an epic party that irritated more than 2,000 people, or a few people submitted complaints over and over again out of frustration. (Note that there is no information about who submitted the complaint in this dataset, so we can’t tell for sure.)

Extreme events can be incredibly important — in the words of Nassim Nicholas Taleb:

“I know that history is going to be dominated by an improbable event, I just don’t know what that event will be.” — Nassim Nicholas Taleb

Finding extreme events in a modern cloud data warehouse filled with Terabytes or Petabytes of data can be a very daunting task. Most outliers are meaningless.

But by narrowly focusing on key entities and metrics, and using robust time series models and root cause analysis techniques, we can clearly and automatically identify and explain the important ones.

Working with our customers, Anomalo has developed a targeted outlier detection approach that is easy to configure, well-calibrated, and provides rich and actionable context for users.

To get started with Anomalo, and begin identifying outliers proactively in your data, be sure to request a demo.

Detecting Extreme Data Events was originally published in Anomalo on Medium, where people are continuing the conversation by highlighting and responding to this story.

Effective Data Monitoring

Jeremy Stanley — Wed, 17 Mar 2021 14:59:52 GMT

Ten steps to ensure your data monitoring solution is effective.

Every time a data alert fires (or fails to fire), one of four possible outcomes occurs.

In a perfect world, every alert received would be about a real data quality issue you cared about (a true positive). No alerts would be sent when there were no issues you cared about (a true negative).

In reality, most data quality monitoring solutions are far from perfect. They send alerts that are not useful (false positives). These distract your data team and erode confidence in your monitoring solution.

Or real data quality issues are missed by the monitoring tool (false negatives). These compromise your business decisions and data products and erode trust in your data.
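The four outcomes can be written down as a small classification function. This is only a bookkeeping sketch; the log of alert outcomes below is invented for illustration.

```python
from collections import Counter

def classify_outcome(alert_fired: bool, real_issue: bool) -> str:
    """Map one alert decision onto the four possible outcomes."""
    if alert_fired and real_issue:
        return "true positive"
    if alert_fired and not real_issue:
        return "false positive"
    if not alert_fired and real_issue:
        return "false negative"
    return "true negative"

# Hypothetical log of (alert_fired, real_issue) pairs for one week of checks.
log = [(True, True), (True, False), (False, False), (False, True), (False, False)]
tally = Counter(classify_outcome(fired, real) for fired, real in log)
print(dict(tally))
```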

In this article, we will cover ten steps you can take to reduce false positive and false negative alerts and to mitigate their impact when they do occur.

1. Use dynamic data testing strategies

Most data testing strategies begin with simple rules, such as:

column x is never NULL
table y row count is between 1,000,000 and 2,000,000

These rules are perfect for cases where you know exactly how you want your data to behave. But they come with several drawbacks:

Any violation of the rule, no matter how small, generates an alert.
They require time from data subject matter experts to create.
They may need frequent maintenance over time as your data changes.

You can reduce your false positives and false negatives by using dynamic data testing strategies.

A predicted range test, which utilizes a time series model, effectively identifies the spike in NULL % without any manual configuration or maintenance.

Dynamic tests use time series models (or other machine learning techniques) to adapt to your data over time and alert only when there is a sudden meaningful change. Such tests require less work to set up and increase test coverage while reducing false positives caused by misconfiguration or data drift over time.
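The difference between a static rule and a dynamic test can be sketched as follows. The dynamic check here uses a plain mean plus or minus a few standard deviations of recent history as a stand-in for a real time series model; the history values, the band width of 4.0, and the function names are assumptions for illustration.

```python
from statistics import mean, stdev

def static_check(null_pct, max_allowed=0.0):
    """Static rule: the column is never NULL, so any NULL at all fails."""
    return null_pct <= max_allowed

def predicted_range(history, width=4.0):
    """Range the next value is expected to fall in, learned from recent history."""
    mu, sigma = mean(history), stdev(history)
    return mu - width * sigma, mu + width * sigma

def dynamic_check(history, latest, width=4.0):
    """Dynamic test: pass unless the latest value leaves the predicted range."""
    low, high = predicted_range(history, width)
    return low <= latest <= high

# Hypothetical daily NULL % for a column that is normally about 2% NULL.
history = [1.9, 2.1, 2.0, 2.2, 1.8, 2.0, 2.1]

print(static_check(2.2))            # fails every day, even though 2% is normal
print(dynamic_check(history, 2.2))  # passes: within the usual range
print(dynamic_check(history, 9.5))  # fails: a sudden spike
```

The static rule alerts on routine behavior, while the dynamic test stays quiet until the NULL rate moves well outside what the column usually does.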

2. Check only the latest data by default

By default, your platform should only check the most recent data in a table.

Checking the latest data should be a default that users can easily turn off.

Limiting checks to the latest data saves data warehouse costs and reduces false positive alerts from historical data that you no longer care to fix. It should be easy for users to disable this for any tables that are not append-only.

Checks can also keep track of their run history and send notifications only when encountering new issues in the table.
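One simple way to notify only on new issues is to remember which failures have already been reported. The class below is a minimal in-memory sketch; the class name, the check name, and the use of dates as issue keys are hypothetical, and a real system would persist this state.

```python
class CheckHistory:
    """Remembers which issues each check has already reported."""

    def __init__(self):
        self._seen = set()

    def new_issues(self, check_name, failing_keys):
        """Return only the failing keys not reported in earlier runs, then remember them."""
        fresh = [k for k in failing_keys if (check_name, k) not in self._seen]
        self._seen.update((check_name, k) for k in failing_keys)
        return fresh

history = CheckHistory()
first_run = history.new_issues("user_id_not_null", ["2021-03-01"])
second_run = history.new_issues("user_id_not_null", ["2021-03-01", "2021-03-02"])
print(first_run, second_run)
```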

3. Support no-code configuration changes

Inevitably a data quality rule will generate a false positive alert. In these cases, users should be able to adjust their checks easily. Users will be reluctant to make changes if they have to edit code or change a complex YAML configuration file.

The types of changes users often make include:

Widening the expected range for a data outcome
Narrowing the scope of a rule using a where SQL clause
Waiting for updated-in-place data to arrive before applying a rule
Changing thresholds for machine learning alerts

Advanced options for adjusting a key metric or data validation rule that can reduce the risk of false-positive and false-negative alerts.

The UI to make changes should be one click away from the alert. It should be easy to understand and well documented. Finally, there should be an audit trail of changes to allow for easy reversion if needed.

4. Prioritize your data quality rules

Not all data quality rules are equally important. In some cases, users may be experimenting with the platform and don’t want to be alerted. In other cases, rules may be critically important, and any deviation from expected behavior should generate loud alerts.

In addition to changing alert behavior, priority levels can also change how alerts or tables appear in dashboards based upon the severity of failing alerts.

The first table has two failing alerts, including one that is high priority. The second table has one failing alert. The third and fourth tables have low-priority alerts, and the fifth table passed without issues.

5. Use APIs to run high priority rules in your pipelines

For data validations where you have very high confidence that any issues would be real and have significant adverse consequences, it can make sense to run these alerts in your pipelines.

An example of how data quality checks can be run in a pipeline to quarantine and avoid publishing bad data.

For example, in Apache Airflow, you could use an API to execute data quality checks on transformed data and then poll for check results and publish data if there are no failures.

If a check does fail, you could run automated tasks to fix the bad data, abort the remainder of the DAG (sometimes, no data is better than bad data), or quarantine bad records using SQL produced in the API to query for good and bad data.
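The run, poll, then publish-or-quarantine flow might look roughly like the sketch below. It does not use the Airflow API or any real data quality API: FakeQualityClient and its methods are hypothetical stand-ins so the control flow can run on its own, and in a real DAG each step would typically be its own task.

```python
import time

class FakeQualityClient:
    """Stand-in for a data quality API client; every method here is hypothetical."""

    def __init__(self, results):
        self._results = results  # check name -> True (passed) or False (failed)
        self._polls = 0

    def run_checks(self, table):
        return "job-1"  # pretend job id

    def get_status(self, job_id):
        self._polls += 1
        # Report "running" on the first poll, then return the final results.
        if self._polls < 2:
            return {"state": "running"}
        return {"state": "done", "results": self._results}

def gate_publish(client, table, poll_seconds=0.0, max_polls=10):
    """Run checks on a table, wait for results, and decide what the pipeline does next."""
    job_id = client.run_checks(table)
    for _ in range(max_polls):
        status = client.get_status(job_id)
        if status["state"] == "done":
            failed = sorted(name for name, ok in status["results"].items() if not ok)
            return ("quarantine", failed) if failed else ("publish", [])
        time.sleep(poll_seconds)
    raise TimeoutError(f"checks for {table} did not finish")

good = gate_publish(FakeQualityClient({"row_count": True, "no_nulls": True}), "orders")
bad = gate_publish(FakeQualityClient({"row_count": False, "no_nulls": True}), "orders")
print(good, bad)
```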

6. Cluster similar issues together into single alerts

Data quality issues often strike multiple columns or segments of data at the same time. Such cases should be correlated together into a single alert if they affect the same rows of data.

Three columns had an increase in NULL values on the same set of records, and so are clustered together in this alert.

In the above (masked) alert, three of 88 columns have an unusual increase in NULL values in the same rows of data. Clustering reduces the number of alerts the team has to review and can help identify the underlying issue.
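A bare-bones version of this clustering groups column-level issues by the exact set of rows they affect. The column names and row ids below are invented, and matching on identical row sets is a simplifying assumption; a production system would more likely tolerate partial overlap.

```python
from collections import defaultdict

def cluster_by_rows(column_issues):
    """Group columns whose issues affect exactly the same rows.

    column_issues maps a column name to the set of affected row ids.
    """
    clusters = defaultdict(list)
    for column, row_ids in column_issues.items():
        clusters[frozenset(row_ids)].append(column)
    return [(sorted(cols), sorted(rows)) for rows, cols in clusters.items()]

# Hypothetical NULL spikes: three columns hit the same rows, one is unrelated.
issues = {
    "shipping_city": {101, 102, 103},
    "shipping_state": {101, 102, 103},
    "shipping_zip": {103, 102, 101},
    "coupon_code": {250},
}
alerts = cluster_by_rows(issues)
print(alerts)
```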

7. Scan samples of raw data rows for any unexpected changes

With many important source tables, each containing hundreds of data columns, manually specifying and managing data quality rules for each source and column is untenable.

Instead, use unsupervised data monitoring to scan random samples of rows in source tables for significant anomalies.

A time-series view of table anomalies in a BigQuery public COVID dataset. The table columns are on the vertical axis and time is on the horizontal axis. The circle sizes correspond to the strength of the anomalies.

A summary like the one above can be reviewed regularly to quickly identify unexpected and concerning changes that should be addressed and monitored explicitly in the future.

8. Route notifications to teams with ownership and accountability

Many companies initially route all of their data quality alerts to a single channel in Slack or Microsoft Teams. However, users in that channel will have to ignore many alerts they may not be interested in. A single channel can also reduce the accountability for addressing individual alerts, as they are easily lost in the channel noise.

Instead, a best practice is to set up separate channels for individual teams.

In each team channel, you can include users who depend upon or maintain the tables that are routed to that channel. As alerts arrive, they can use emoji reactions to classify their response to alerts.

Examples of common emoji reactions to alerts in Slack or Microsoft Teams.

Common reactions include:

✅ the issue has been fixed
🔥 an important alert
🛠️ a fix is underway
🆗 expected behavior, nothing needed
👀 under review

Or users can @ mention their colleagues in a thread to diagnose and resolve the underlying issue.

9. Provide actionable context for issues to accelerate triage

When an alert fires, it is frustrating to get a message like:

column user_id in table fact_table has NULL values

This alert puts the onus on the user to answer questions like:

Why does this alert matter?
What # and % of user_id values are affected?
How often has this alert failed in the recent past?
Who configured this alert, and why?
What dashboards or ML models depend on fact_table?
What raw data source contributed user_id to fact_table?

Notifications should include this information directly or link to data catalog platforms that do.
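As a sketch of what a more actionable notification could contain, the function below assembles answers to several of the questions above into one message. All field names, counts, owners, and dependencies are hypothetical; in practice they would come from the monitoring tool and a data catalog.

```python
def build_alert(table, column, null_count, row_count, failures, runs, owner, downstream):
    """Assemble an alert message that carries its own context."""
    pct = 100.0 * null_count / row_count if row_count else 0.0
    lines = [
        f"column {column} in table {table} has NULL values",
        f"affected: {null_count:,} of {row_count:,} rows ({pct:.1f}%)",
        f"history: failed {failures} of the last {runs} runs",
        f"configured by: {owner}",
        f"downstream: {', '.join(downstream)}",
    ]
    return "\n".join(lines)

message = build_alert(
    table="fact_table",
    column="user_id",
    null_count=1250,
    row_count=50000,
    failures=1,
    runs=30,
    owner="analytics-eng",
    downstream=["weekly_active_users dashboard", "churn model"],
)
print(message)
```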

Additionally, notifications should include samples of raw data that highlight good and bad values:

Sample bad rows (with NULL timestamp values) compared to good rows.

Advanced statistical methods can analyze the underlying data and produce root cause analyses that identify exactly where the issue is occurring.

A root cause analysis that identifies the segments of data (venuestate = ‘NY’ in this case) that most clearly identify where the underlying data quality issue has occurred.

10. Collect and learn from user feedback

Inevitably, your data quality solution will send alerts that are not useful. In these cases, it is important to collect that feedback.

An example of buttons used to provide feedback on an alert.

Over time, a data quality monitoring solution can be tuned using machine learning to suppress alerts that users do not find useful.

To effectively monitor your data, your system should produce comprehensive, targeted, and accurate alerts.

First, be sure to minimize false-positive alerts. Migrate static tests to more intelligent dynamic tests that adjust with your data. Ensure users can adjust alert priorities and subscribe to notifications they care about. Check only the latest data by default and make it easy for rules to be edited.

Next, reduce the burden on users of false-positive alerts. Cluster similar issues together and provide the right context with alerts. Use API integrations to prevent bad data from continuing through pipelines. Then ensure your system can adapt to user feedback over time.

Finally, make your testing strategy comprehensive enough that you do not miss real data quality issues (false negatives). Use dynamic testing and user-friendly interfaces to make configuring alerts easy. And leverage row-level unsupervised monitoring to scan for issues your other alerts miss.

Combined, these solutions ensure your alerts are high quality, your users are productive and engaged, and the quality of the data you depend upon keeps increasing over time.

To learn more about how data teams use Anomalo to reduce false positive and false negative alerts, request a demo.

Effective Data Monitoring was originally published in Anomalo on Medium, where people are continuing the conversation by highlighting and responding to this story.

Unsupervised Data Monitoring

Jeremy Stanley — Tue, 26 Jan 2021 16:33:08 GMT

Part 1 — Monitoring the quality of structured data at scale

To compete in a data-driven world, organizations must consolidate data into centralized warehouses and use it to enhance products and inform decisions.

Data is now a strategic asset. But how can organizations ensure they can trust the data underpinning these products and decisions?

Most data teams conclude that they need to begin testing their data — using a carefully maintained library of rules.

The Sisyphean task of creating and maintaining data quality rules in large warehouses. Credit: Josie Stanley.

But monitoring all of the data in an enterprise warehouse can be daunting. It is common for such warehouses to contain tens of thousands or even hundreds of thousands of tables.

Top 10 tables: Drive board-level metrics and company goals
Top 100 tables: Cross-functional data that inform product and operations
Top 1,000 tables: Detailed data owned by single teams driving process and products
Remaining 1,000+ tables: Special purpose project or feature data

The critical tables in an organization should be thoroughly tested and monitored. See our post “Airbnb-quality data for all” for details on how that can be achieved. But what about the rest?

If the average table contains 50 columns, and each column requires 5 rules or metrics to be well monitored, then a warehouse with 1,000 important tables requires managing 250,000 rules!

Even if each rule requires just 10 minutes to maintain each year, that would require a dedicated team of 20 highly trained data professionals (most of whom would quickly quit in protest of the drudgery).
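The arithmetic behind these figures is easy to check. The only number not taken from the text is the assumption of roughly 2,080 working hours per person per year (40 hours times 52 weeks).

```python
tables = 1_000
columns_per_table = 50
rules_per_column = 5
minutes_per_rule_per_year = 10
hours_per_person_per_year = 40 * 52  # assumption: 2,080 working hours per year

rules = tables * columns_per_table * rules_per_column
hours = rules * minutes_per_rule_per_year / 60
people = hours / hours_per_person_per_year

print(rules)             # total rules to maintain
print(round(hours))      # maintenance hours per year
print(round(people, 1))  # full-time people needed
```

That works out to 250,000 rules and roughly 41,700 hours of maintenance a year, or about 20 full-time people.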

Yet, an organization still depends upon the quality of that data. This is where unsupervised data monitoring can be critical.

Uber summarized this well in their post on Monitoring Data Quality at Scale:

Conventional wisdom says to use some variant of statistical modeling to explain away anomalies in large amounts of data. However, with Uber facilitating 14 million trips per day, the scale of the associated data defies this conventional wisdom. Hosting tens of thousands of tables, it is not possible for us to manually assess the quality of each piece of back-end data in our pipelines.

This first post in the series will demonstrate how Anomalo uses unsupervised learning to monitor data quality at scale. In subsequent posts, we will cover:

The key requirements of our system and why traditional time series and outlier detection approaches do not work
The architecture of the Anomalo unsupervised learning system — from modeling the data to explaining root causes of issues
How we define and minimize false positives, and how we benchmark our algorithm using our data chaos library

The role of unsupervised data monitoring

To illustrate how our unsupervised monitoring works, we will use a simple demo environment with just one table, public.fact_listing:

A single table of concert and sporting event ticket sales data configured in a demo warehouse.

This is a demo dataset of concert and sporting event ticket sales data. You can see that we have 3 checks passing on this table and that the checks run daily when the data is fresh.

Clicking into the table presents the table homepage:

The table home page, where new checks are added and check results are inspected.

As you can see, the Data freshness check has already run for this table. It determines when the data is complete each day and automatically kicks off all other checks.

At the bottom are Key metrics and Validation rules sections where the user would leverage our time series models or library of custom validations to check their data.

In-between is the Table anomalies section, which contains two checks that are automatically configured for any monitored table:

The two unsupervised learning variants used to identify table anomalies.

Our machine learning model learns a representation of the typical data in the table. As new data arrives, it detects if that data is meaningfully different from what appeared in the table before.

We run two variants of this algorithm:

no increase in NULL values: a constrained model looking for significant increases in NULL values
no anomalous records: our full machine learning algorithm, which identifies changes in continuous distributions, categorical values, time durations, or even relationships between columns

The first variant runs at a high priority level and notifies users when a sudden spike in NULL values is observed, as this may indicate missing data that should be fixed quickly. The second variant, no anomalous records, produces a log of meaningful changes in each table.

For example, in this Oxford COVID-19 Government Response dataset in BigQuery, in the column public_information_campaigns_flag, the value 1 almost entirely disappeared and was replaced with NULL values on October 30th.

Of course, not all columns are important, and so the user can control which tables and columns they wish to see notifications for.

Finding and characterizing data chaos with unsupervised learning

To illustrate how this works in practice, let’s introduce an anomaly into this dataset. We will use our command-line client to trigger one of our chaos operations: TimeColumnZero. This introduces zero values into a column at a specific point in time.

Causing chaos in a table by inserting zero values into a column on a given date for a subset of rows.

The column numtickets in fact_listing will now contain 30% zero values on 2021-01-17 whenever the venuestate is equal to “NY.” This illustrates a common data quality issue: an invalid value suddenly appears for a fraction of rows in a key data segment.
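For readers who want to reproduce this kind of chaos on their own data, the operation can be imitated in a few lines. This is a hypothetical re-creation, not the Anomalo command-line client: the function name, its parameters, and the synthetic rows are all assumptions.

```python
import random

def time_column_zero(rows, column, date, fraction, where, seed=0):
    """Return a copy of rows with column set to 0 for about `fraction` of the
    rows on `date` that satisfy the `where` predicate."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        row = dict(row)  # copy so the original rows are left untouched
        if row["date"] == date and where(row) and rng.random() < fraction:
            row[column] = 0
        out.append(row)
    return out

# Synthetic listings: two dates, three states, ticket counts from 1 to 8.
states = ["NY", "CA", "TX"]
rows = [
    {"date": d, "venuestate": states[i % 3], "numtickets": 1 + i % 8}
    for d in ("2021-01-16", "2021-01-17")
    for i in range(900)
]

chaotic = time_column_zero(
    rows, "numtickets", "2021-01-17", 0.30, lambda r: r["venuestate"] == "NY"
)
zeros = sum(r["numtickets"] == 0 for r in chaotic)
print(zeros)
```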

Next, we can re-run the table anomaly checks, and we find that the no anomalous records variant fails (as it is looking for any meaningful change):

A failing table anomaly check after having introduced chaos into the table.

We can click into view details to see the explanation:

Summarizing the anomaly in natural language and evaluating its severity.

The table anomaly check has correctly identified that the column numtickets has a sudden increase in 0 values. Note that we never told it to look at this column or to look for zero values.

It also scores the anomaly's severity (this one is strong) and compares that to a learned threshold for this table, which accounts for how much background noise there is in each column. Because the severity exceeds the threshold, this check fails and is highlighted in the user interface (and could notify the user in Slack or Teams).

Scrolling down, we see a custom visualization selected based on the anomaly type and data distribution:

The algorithm chooses a dynamic visualization to help contextualize the anomaly, in this case showing the distribution of the top values in the column on the prior and current dates as a tornado.

In this case, a top values chart shows the most common numtickets values and compares the distribution on 2021–01–16 (the left bars) to the distribution on 2021–01–17 (the right bars). You can see that the value 0 was not there before and is now suddenly 10.4% of records.
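The comparison behind a top values chart can be approximated with a few lines of Python. This is a simplified sketch on made-up values; the helper names are hypothetical and the real visualization logic is not described in this post.

```python
from collections import Counter

def value_shares(values):
    """Share of rows taken by each distinct value."""
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}

def compare_top_values(prior, current, k=5):
    """Return (value, prior_share, current_share) for the top-k values of
    either day, sorted by the largest absolute change in share."""
    p, c = value_shares(prior), value_shares(current)
    top_prior = sorted(p, key=p.get, reverse=True)[:k]
    top_current = sorted(c, key=c.get, reverse=True)[:k]
    keys = set(top_prior) | set(top_current)
    table = [(v, p.get(v, 0.0), c.get(v, 0.0)) for v in keys]
    return sorted(table, key=lambda r: abs(r[2] - r[1]), reverse=True)

# Synthetic numtickets values for two consecutive days.
prior = [2] * 50 + [4] * 30 + [1] * 20
current = [2] * 45 + [4] * 27 + [1] * 18 + [0] * 10
comparison = compare_top_values(prior, current)
```

The value with the largest shift comes first, which in this synthetic example is the newly appearing 0.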

Scrolling further, we can see the Root Cause Explanation, which analyzes the raw data underlying our unsupervised model to identify if there are segments of the data where the issue is most prominent:

A root cause analysis performed to identify where the anomalous records occurred.

As you can see, the algorithm correctly identifies that NY is where the anomaly occurred. 100% of the anomalous rows are in that state, but only 29.8% of the population rows are in NY.
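The core comparison in this kind of root cause view is a segment's share of anomalous rows against its share of all rows. Below is a minimal sketch of that comparison, assuming we already have a way to flag anomalous rows (here a simple predicate, whereas the post describes using the unsupervised model). All names and data are hypothetical.

```python
def segment_concentration(rows, segment_column, is_anomalous):
    """For each segment value, return (share of anomalous rows,
    share of all rows)."""
    anomalous = [r for r in rows if is_anomalous(r)]
    result = {}
    for value in {r[segment_column] for r in rows}:
        population_share = sum(r[segment_column] == value for r in rows) / len(rows)
        if anomalous:
            anomalous_share = (
                sum(r[segment_column] == value for r in anomalous) / len(anomalous)
            )
        else:
            anomalous_share = 0.0
        result[value] = (anomalous_share, population_share)
    return result

# Synthetic rows: zeros only occur in NY, which is 30% of the table.
rows = (
    [{"venuestate": "NY", "numtickets": 0}] * 90
    + [{"venuestate": "NY", "numtickets": 4}] * 210
    + [{"venuestate": "CA", "numtickets": 4}] * 700
)
shares = segment_concentration(rows, "venuestate",
                               lambda r: r["numtickets"] == 0)
```

A segment whose anomalous share far exceeds its population share, like NY here, is a strong candidate for where the issue lives.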

This automatic identification of where an issue occurs is powerful. Without it, a user would need to examine records, trace lineages, or repeatedly query and visualize the data to isolate the issue.

The algorithm can find even more complex issues, such as when the relationship changes between columns. For example, consider this chaos operation:

Introducing a more complex form of chaos, where a single column is randomly shuffled on a date.

Here we are shuffling the priceperticket column so that the values no longer correspond to the correct rows. The actual values remain the same (and have the same distribution and mean), but the relationship between that column and other columns in the table has been broken.
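To see why a shuffle is invisible to single-column statistics, consider this small Python sketch on synthetic rows. It assumes that totalprice equals numtickets times priceperticket, which is an assumption made for this illustration.

```python
import random
from statistics import mean

rng = random.Random(0)

# Synthetic listings where totalprice = numtickets * priceperticket.
rows = []
for _ in range(1000):
    n = rng.randint(1, 10)
    price = rng.choice([20, 50, 100, 250])
    rows.append({"numtickets": n, "priceperticket": price,
                 "totalprice": n * price})

def consistent_fraction(rows):
    """Fraction of rows where the three columns still agree."""
    ok = sum(r["numtickets"] * r["priceperticket"] == r["totalprice"]
             for r in rows)
    return ok / len(rows)

# Shuffle one column so values no longer line up with their rows.
shuffled_prices = [r["priceperticket"] for r in rows]
rng.shuffle(shuffled_prices)
shuffled = [dict(r, priceperticket=p) for r, p in zip(rows, shuffled_prices)]

mean_before = mean(r["priceperticket"] for r in rows)
mean_after = mean(r["priceperticket"] for r in shuffled)
```

The mean and the full set of values are identical before and after, yet most rows no longer satisfy the relationship between columns.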

In this case, the anomaly cannot be easily summarized in natural language, but we can still score its severity.

Here, the algorithm identifies that priceperticket is the most problematic column, and that the issue is related to listid and totalprice as well.

Examining the distribution of priceperticket values on 1/16 and 1/17 shows that the anomaly is the strongest in the lowest and highest priceperticket values:

This visualization bins the continuous column into deciles. While the distribution is unchanged between the two days, we can see that the anomaly severity (color) is more intense in extreme deciles.

We can examine a sample of individual rows, where we can see that the algorithm is scoring every value in the table for how anomalous it is:

The most anomalous row in a sample of the data — with color indicating how much credit each value contributed towards the anomaly.

In this example, the $1,960 price per ticket doesn’t make any sense in the context of 28 tickets for $1,568. Plus, the price per ticket is unusually high for the Vampire Weekend show.

Zooming out, we can see this effect across many rows:

The most anomalous 25 rows in a sample of the table, again colored by how much each value contributed to the anomaly.

Now you can more clearly see which specific values of the table the algorithm believes are contributing the most to the anomaly, as indicated by the severity color scale on the right-hand side.

This granular allocation of the anomaly into the table's specific values is key. It allows our system to visualize and explain the underlying issue clearly.

When embarking on a journey to monitor and test data quality in your warehouse, it makes sense to start simple. You can use open source libraries, write your own tests, or leverage a platform like ours at Anomalo to thoroughly test your data.

But as warehouses, organizations, and testing ambitions scale, simple rule and time series based approaches fall over. They cannot effectively cover the long tail of data quality issues that commonly occur.

That is where unsupervised data monitoring comes in. You leverage a machine to learn the structure of your data and monitor for significant unexpected changes. It notifies you when a meaningful negative change occurs and presents you with visual summaries and explanations that dramatically accelerate your triage and resolution times.

Stay tuned for our next post in this series, which will explain how our unsupervised algorithm works behind the scenes.

To get started with Anomalo, and begin monitoring your data at scale using our algorithms, be sure to request a demo.

Unsupervised Data Monitoring was originally published in Anomalo on Medium, where people are continuing the conversation by highlighting and responding to this story.

Airbnb-quality data for all

Jeremy Stanley — Wed, 02 Dec 2020 16:33:00 GMT

How to build and maintain high quality data without raising billions

Airbnb has always been a data driven company.

Back in 2015, they were laying the foundation to ensure that data science was democratized at Airbnb. Since then, they have grown to more than 6,000 people and have raised more than $6B of venture funding.

To stay data driven through this massive change has required making big investments in data quality, as outlined by their recent Data Quality at Airbnb series: Part 1 — Rebuilding at Scale and Part 2 — A New Gold Standard.

The first two Data Quality at Airbnb posts: Part 1 — Rebuilding at Scale and Part 2 — A New Gold Standard

Companies aspiring to be as data driven and successful as Airbnb will also need to prioritize data quality.

It does not matter how much data is collected, how fast it can be queried, how insightful analyses are or how intelligent a machine learning model is. If the underlying data is unreliable and of poor quality, then everything that depends upon it will also be suspect.

Fortunately, companies no longer need to reinvent the wheel or make massive investments to improve and maintain high quality data. New startups, such as ours at Anomalo, are building the technology needed to monitor, triage and root cause data quality issues efficiently at scale.

In their first post, Part 1 — Rebuilding at Scale, Airbnb set the following goals for themselves.

The five data quality goals set by Airbnb in Part 1 — Rebuilding at scale, and where Anomalo can help.

All of these investments are critical. Projects like dbt are making it easier to build high quality data pipelines, and many open source data discovery tools are under development.

But in this article I want to focus on two of their goals in particular: ensuring important data meets SLAs and is trustworthy and routinely validated.

In their latest post, Part 2 — A New Gold Standard, Airbnb outlines the following automated data quality checks they run to validate important datasets:

The data quality checks Airbnb runs on important datasets, from Part 2 — A New Gold Standard.

To summarize, Airbnb has the following key requirements:

Row count time series analysis
Has the table row count dropped below a predicted range? Have the row counts plummeted for any key segments? Did fresh data arrive in the table on time?
Time series anomaly detection for key business metrics
Did a metric suddenly move outside of a predicted range? Is the movement unusual given seasonality, trend and holidays?
Standard run-time data quality checks
Are basic data constraints satisfied: no unexpected NULL or blank values, no violations of unique constraints, strings match expected patterns, timestamps follow a defined order, etc.?
Perfect metric consistency checks
Do columns and metrics satisfy critical relationships that can be expressed as simple equations?

For the rest of this post, we will use open source data in BigQuery to illustrate how each of these requirements is supported in Anomalo.

1. Row count time series analysis
Has the table row count dropped below a predicted range? Have the row counts plummeted for any key segments? Did fresh data arrive in the table on time?

The first question to answer for any production dataset is “has it been updated?” In particular, are there any records from the most recent date and is the row count within an expected range?

If not, then something probably broke upstream in the collection, loading or processing of the table. This must be fixed before anyone can trust the latest data.

In Anomalo, every table is automatically configured to monitor if the most recent period of data is complete.

For example, for the NY 311 service request data, here is the time series of row counts by day, and the predicted interval used to determine if the row counts are complete:

Monitoring the row count for NY 311 service request data by day, the grey points are the actual row counts, and the green band is the predicted interval.

For November 23rd, we expected at least 5,223 rows, and in fact there were 7,056.
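A lower bound of this kind can be approximated very crudely without a time series model. The sketch below uses the median and the median absolute deviation of recent daily counts; it is a simplified stand-in, not the model Anomalo uses, and the history values are invented.

```python
from statistics import median

def predicted_lower_bound(history, z=3.0):
    """Robust lower bound: median minus z scaled median absolute deviations.
    The 1.4826 factor makes the MAD comparable to a standard deviation
    for normally distributed data."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    return med - z * 1.4826 * mad

def row_count_complete(history, latest, z=3.0):
    """True when the latest row count is at or above the lower bound."""
    return latest >= predicted_lower_bound(history, z)

# Hypothetical daily row counts for the previous week.
history = [7100, 6900, 7300, 7000, 6800, 7200, 7050]
bound = predicted_lower_bound(history)
```

A real implementation would also need to account for trend, day of week and holidays, which this sketch ignores.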

You can also pick any column and check that the row counts haven’t plummeted for any important segments in that column. For example, we can evaluate the row counts by borough (neighborhood) in the NY 311 data:

Predicted row counts for November 23rd in the NY 311 service request data for every borough (neighborhood) value. If any borough falls below the 99% predicted interval, a notification would be sent.

Anomalo customers use this feature to ensure their data is complete for a wide variety of segments, ranging from geography (country, state) to event or API call type to device platform (iOS vs. Android).

Finally, we can also tell if the data was significantly delayed. This can indicate that an upstream processing stage is taking longer than normal, and may eventually cause data to be incomplete when users query for it.

Here is how long it took for each of the last 8 days of data to arrive for the New York 311 dataset:

How long it typically takes NY 311 service request data to appear in BigQuery — such delays are common for 3rd party datasets, and can increase suddenly causing downstream query issues

On average, it takes around 25 hours for the New York 311 data to arrive in BigQuery, and you can easily set a threshold of when you would like to be notified for delayed data:

You control when you are notified if a given table isn’t complete yet.
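The underlying check is simple to express. Here is a minimal sketch, assuming a hypothetical 30 hour threshold and that we can tell whether data for the period has arrived; none of these names come from the product.

```python
from datetime import datetime, timedelta

def is_delayed(period_end, now, threshold, arrived):
    """True when data for the period has not arrived and more than
    `threshold` has passed since the period ended."""
    return (not arrived) and (now - period_end) > threshold

period_end = datetime(2020, 11, 24)   # end of the 2020-11-23 data day
threshold = timedelta(hours=30)       # hypothetical notification threshold

check_at_25h = is_delayed(period_end, datetime(2020, 11, 25, 1),
                          threshold, arrived=False)
check_at_32h = is_delayed(period_end, datetime(2020, 11, 25, 8),
                          threshold, arrived=False)
```

With a typical arrival time of around 25 hours, a threshold somewhat above that avoids alerting on normal days while still catching real delays.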

For more on incomplete data, why it happens so often, and how tools like Anomalo can be used to monitor for it, check out our post on When Data Disappears.

2. Time series anomaly detection for key business metrics
Did a metric suddenly move outside of a predicted range? Is the movement unusual given seasonality, trend and holidays?

Once we know that data is complete in a timely manner, the next step is to ensure that any metrics we compute from the data are within expected ranges.

In Anomalo, you can easily configure new metrics. For example, we can monitor the mean score for posts in the Hacker News dataset in BigQuery:

It takes just a few clicks to begin monitoring a metric — in this case the mean of the score column in the Hacker News dataset in BigQuery

If the score ever moves sharply outside of the predicted interval, a notification is sent in Slack. This saves a lot of time spent watching metrics for suspicious changes.

Underneath the hood, Anomalo is building a sophisticated time series model, which decomposes the metric into an overall trend (blue), holiday spikes (orange), season of year (red) and day of week (teal) components:

Behind the scenes of how Anomalo monitors a time series, it is decomposed into overall trend (blue), holiday spikes (orange), season of year (red) and day of week (teal) components

It looks like Hacker News scores have been trending up over time, are sensitive to holidays (Thanksgiving and Christmas look good, others bad), and are much higher on weekends. Hopefully someone posts this article on Hacker News on a Sunday near Christmas 🎄.

In Anomalo, metrics can be defined using a variety of pre-defined aggregates:

The types of metrics that can be easily defined and monitored in Anomalo.

Or you can define a custom metric using any SQL aggregate statement available in your warehouse. For example, in the SF Fire Department service calls dataset in BigQuery, we can measure how many minutes on average it takes for the fire department to reach the scene of a call, and be alerted whenever it takes longer than 15 minutes:

Adding a metric that uses a custom SQL aggregate statement to define the average time between when a SF Fire department call was received, and when they arrived on the scene.
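To make the idea of a custom SQL aggregate concrete, here is a small sketch that runs one against an in-memory SQLite table from Python. The table name, the column names `received_timestamp` and `on_scene_timestamp`, and the sample rows are assumptions for illustration. SQLite's `julianday` stands in for whatever timestamp difference function your warehouse provides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls ("
    "call_date TEXT, received_timestamp TEXT, on_scene_timestamp TEXT)"
)
conn.executemany("INSERT INTO calls VALUES (?, ?, ?)", [
    ("2020-10-18", "2020-10-18 10:00:00", "2020-10-18 10:08:00"),
    ("2020-10-18", "2020-10-18 11:00:00", "2020-10-18 11:12:00"),
    ("2020-10-19", "2020-10-19 09:00:00", "2020-10-19 10:10:00"),
])

# Custom aggregate: average minutes between call received and arrival.
METRIC_SQL = (
    "AVG((julianday(on_scene_timestamp) - julianday(received_timestamp))"
    " * 24 * 60)"
)
query = (
    f"SELECT call_date, {METRIC_SQL} FROM calls "
    "GROUP BY call_date ORDER BY call_date"
)
avg_minutes = dict(conn.execute(query).fetchall())

# Flag any day where the average response time exceeds 15 minutes.
alerts = [day for day, minutes in avg_minutes.items() if minutes > 15]
```

The same pattern applies to any aggregate expression: compute it per day, then compare each value to a fixed or predicted threshold.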

It looks like the average time for the SF Fire Department to respond to calls increased dramatically on October 19th:

It usually only takes 10 minutes to arrive at a scene, but that spiked to longer than an hour on October 19th.

To learn more about testing your data with time series and unsupervised learning models, check out our post on dynamic data testing.

3. Standard run-time data quality checks
Are basic data constraints satisfied: no unexpected NULL or blank values, no violations of unique constraints, strings match expected patterns, timestamps follow a defined order, etc.?

Ensuring that metrics are within expected ranges is important, but ultimately only tests the surface of your data. Even if all of your metrics are within expected tolerances, there could be cracks appearing in your data foundation.

That is where rule based data quality checks come in. These checks typically test that a condition is never or always satisfied.

For example, in Anomalo we can easily set up a number of foundational data quality checks on the Hacker News dataset in BigQuery, such as testing that id is always unique:

Adding a new validation rule in Anomalo can be done in just a few clicks.

We can then see at a glance which rules have passed, and which have failed:

Every table has a dashboard showing the latest status for all of the configured validation rules.

For failing checks, such as “timestamp is never NULL”, we can see a summary of the issue:

When validation rules fail, we show exactly how many records are affected.

A sample of bad and good records we can scan through:

You can scan through good and bad records to better understand the context of the data quality issue.

And a statistical analysis identifying the segments of data where the issue is most prominent:

Anomalo performs automated statistical analyses to identify if the issue is more common in any subset of the table. This saves a lot of time in triaging and root causing data quality issues.

Depending upon the nature of the check, the summary visualization changes to provide the most useful context. In this case, it appears there are many stories with duplicate titles:

The most common duplicate Hacker News titles.

Knowing not just that the data is broken, but exactly how, where and when the issue occurs is critical to quickly triaging, root causing and fixing it.
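The rule based checks in this section reduce to simple predicates over rows. The following sketch shows two of them on a handful of made-up stories; the function names and the result format are hypothetical, not Anomalo's API.

```python
from collections import Counter

def check_unique(rows, column):
    """Fixed rule: every value in `column` appears exactly once."""
    counts = Counter(row[column] for row in rows)
    bad = [row for row in rows if counts[row[column]] > 1]
    return {"passed": not bad, "bad_count": len(bad), "bad_sample": bad[:5]}

def check_never_null(rows, column):
    """Fixed rule: `column` is never None."""
    bad = [row for row in rows if row[column] is None]
    return {"passed": not bad, "bad_count": len(bad), "bad_sample": bad[:5]}

# Synthetic rows loosely shaped like Hacker News stories.
stories = [
    {"id": 1, "title": "Show HN: A", "timestamp": "2020-11-01"},
    {"id": 2, "title": "Ask HN: B", "timestamp": None},
    {"id": 3, "title": "Show HN: A", "timestamp": "2020-11-02"},
]
unique_id = check_unique(stories, "id")
unique_title = check_unique(stories, "title")
timestamp_present = check_never_null(stories, "timestamp")
```

Returning the count and a sample of bad rows alongside the pass or fail result is what makes a failing check quick to triage.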

4. Perfect metric consistency checks
Do columns and metrics satisfy critical relationships that can be expressed as simple equations?

Real world data is complicated. While key metric and fixed rule based checks may capture 95% of the common data quality use cases, the last 5% often require complex logic specific to an organization and dataset.

For example, in the New York Police Department dataset of motor vehicle collisions on BigQuery, there are multiple fields tracking how many people were injured or killed in an accident.

We can validate that the fatality counts are consistent with the following:

Setting up a custom validation to compare metrics — in this case fatalities for people killed in NYPD motor vehicle collisions.

When run, we find that there are 0.003% of records (45 of 1,734,004 as of 2020-11-27) that have inconsistent fatality counts.
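A consistency check like this is a per-row equation. Here is a minimal sketch, assuming column names along the lines of `number_of_persons_killed` and its per-category counterparts; these names and the sample records are assumptions for illustration.

```python
# Assumed per-category fatality columns that should sum to the total.
KILLED_PARTS = (
    "number_of_pedestrians_killed",
    "number_of_cyclist_killed",
    "number_of_motorist_killed",
)

def fatalities_consistent(row):
    """True when the total equals the sum of the per-category counts."""
    return row["number_of_persons_killed"] == sum(row[c] for c in KILLED_PARTS)

def failure_rate(rows):
    """Return (fraction of inconsistent rows, list of inconsistent rows)."""
    bad = [r for r in rows if not fatalities_consistent(r)]
    return len(bad) / len(rows), bad

def collision(total, pedestrians, cyclists, motorists):
    return dict(zip(("number_of_persons_killed",) + KILLED_PARTS,
                    (total, pedestrians, cyclists, motorists)))

collisions = [
    collision(0, 0, 0, 0),
    collision(1, 1, 0, 0),
    collision(0, 0, 0, 1),  # motorist killed but missing from the total
    collision(0, 0, 0, 0),
]
rate, bad_rows = failure_rate(collisions)
```

On the real table the failing fraction is tiny (45 of 1,734,004 rows), which is why reporting both the rate and the offending rows matters.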

When such a validation rule fails, we can also see sample bad and good records:

Specific examples of good and bad records often make it clear why an issue is occurring.

In this case, it appears that there are records where motorists were killed, and yet they are not appearing in the total. Again, we also show any segments of the data that indicate where the issue is occurring most often (limited in this case to just records with some non-zero fatality rows).

When the 3rd vehicle is unspecified this data quality issue happens more frequently.

In this case, when at least one of the fatality columns is non-zero this issue is most likely to happen when the contributing_factor_vehicle number 3 column is unspecified. This intelligence could be a meaningful hint towards identifying where and how this data quality issue arose.

Today, it is easier than ever to become a data driven organization.

The volume and diversity of data keep growing. Warehouses like Snowflake and BigQuery make storing and querying it all easy. From BI and visualization through to machine learning, there are a plethora of tools that we can leverage to generate insights and create value.

But most companies have only just begun the journey to ensure the data they depend upon can be trusted. As demonstrated by Airbnb, investing in data quality is critical to staying data driven as data organizations mature.

At Anomalo, we have built a platform that will allow any company to achieve and sustain the same vision of high quality data. If you are interested in starting a trial, head to our site to learn more or request a demo.

Airbnb-quality data for all was originally published in Anomalo on Medium, where people are continuing the conversation by highlighting and responding to this story.

Dynamic Data Testing

Jeremy Stanley — Wed, 18 Nov 2020 16:03:18 GMT

Data is rarely static, so why should data tests be?

When testing data, our first instinct is to reach for perfection. Can’t we write down a clear set of rules that govern exactly how our data should behave, just like we do when testing software?

Of course we can’t! Data isn’t software, and shouldn’t be tested in the same way.

When testing your data, there are far more factors that are out of your control.

The reality is that the factors affecting your data that are out of your control will usually far outweigh those that are in your control.

As your organization grows, your business decisions, processes, products and code can all change your data in unexpected ways. And your data is truly at the mercy of many external factors, from how users behave, to what events occur, to the combined actions of competitors, suppliers or market forces.

To test data effectively, we need tests that adapt to these forces.

In this post, we outline a framework for data testing, from static tests that can be written in SQL, to dynamic tests that require statistics or machine learning. Then we compare both approaches with an example from COVID-19 data in the EU.

In practice, data can be tested with the following four broad approaches:

The relative importance of static and dynamic data testing strategies.

Fixed rules make a statement in absolute terms about a dataset, such as “this column is never NULL” or “this string always matches a pattern.” These tests are great when your data must be perfect in some clear and known way.

Specified ranges require a computed number to be within a pre-determined interval, such as “values should be zero for 1–3% of records” or “the mean of a column should be between 13 and 16”. These can be used when you know in advance that a key metric or data statistic should never drift outside of a range.

Predicted ranges are just like specified ranges, except the range is predicted by a time series model. The user can control how much uncertainty should be in the predicted interval, such as “the mean is within a 95% predicted confidence interval”. These are more powerful tests that can find any significant change in key metrics or summary statistics.

Unsupervised detection is the most sophisticated approach, where anomalous changes are found in an important dataset. All that is required is that the user specify what data is important. Such tests can identify unexpected changes that you hadn’t thought to test for. Stay tuned for future posts on unsupervised detection.
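The two static approaches are easy to express in code, which helps show how they differ from the dynamic ones. This is a minimal sketch with hypothetical helper names and synthetic values: a fixed rule demands a statistic be exactly zero, while a specified range only requires it to stay inside bounds chosen in advance.

```python
def null_fraction(values):
    """Fraction of values that are None."""
    return sum(v is None for v in values) / len(values)

def fixed_rule_never_null(values):
    """Fixed rule: the column is never NULL."""
    return null_fraction(values) == 0

def within_specified_range(statistic, low, high):
    """Specified range: the statistic stays within pre-determined bounds."""
    return low <= statistic <= high

# Synthetic column where 98% of the values are missing.
values = [None] * 98 + [3, 7]
fraction = null_fraction(values)
passes_fixed_rule = fixed_rule_never_null(values)
passes_range = within_specified_range(fraction, 0.975, 0.985)
```

A predicted range test has the same shape as `within_specified_range`, except that `low` and `high` come from a model rather than from a person.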

An example of each of the four types of data testing strategies on EU CDC COVID-19 data.

Dynamic testing strategies such as predicted ranges or unsupervised detection have some significant advantages. They are easier to set up and easier to maintain over time. They can also be used to test any data for any condition, regardless of the current quality of the data.

Of course, there are still very good reasons to use static tests. They are powerful when you know exactly how your data should behave, and want to be alerted even if the data varies only slightly from this expectation.

But relying only on static tests leads either to poor test coverage — where the majority of important data is not well tested, or to a high maintenance burden that will prevent a testing strategy from being sustainable.

Let’s consider an example. The European CDC provides COVID-19 data hosted in BigQuery here. In addition to statistics like cases and deaths broken out by country and date, this dataset also tracks intensive care patients.

But many of the intensive care records appear to be NULL. For example, in the BigQuery console we find that 98% are NULL:

Using the BigQuery console to compute the fraction of NULL records in the cumulative_intensive_care_patients column of the covid19_open_data_eu dataset.

Suppose we are back at July 1st, and we want to set manual bounds for the percent of NULL values in the cumulative_intensive_care_patients column.

We review the percentages by day, and decide on a bound of 97.5% — 98.5%:

We begin testing the NULL % (grey line) with a tight expected range (green band).

Fast-forward to August 6th, and the NULL percentage has dropped below our initial guess.

The NULL % eventually drifts below our expected range.

We investigate and find that this is a natural trend due to expanding data collection. Worried about getting more false positive alerts, we widen our interval to 97% to 99%, and everything looks good for a few months:

We widen the range, and everything is fine for a few months.

But then a sudden spike occurs on November 8th that we miss entirely:

We entirely miss a concerning large spike in missing values. Note that this spike appears to have been a temporary issue, and has since been resolved in the BigQuery data.

Instead, if we had used a predicted range test, this data quality issue would have been caught immediately:

A predicted range test, which utilizes a time series model, effectively identifies the spike in NULL % without any manual configuration or maintenance.

Behind the scenes, this test uses a time series model which dynamically adjusts to the data. The model controls for changes in trend (blue) and seasonality (purple). It then produces a well calibrated predicted range (green). This allows us to clearly identify the anomaly (red):

The anatomy of a predicted range test, where trend (blue) and seasonality (purple) are controlled for, and a predicted interval (green) makes it clear that the most recent observation (red) is anomalous.

Predicted range tests should:

Control for changes in trend and seasonality, without over-reacting
Adjust for holidays, which can cause sudden spikes or dips in metrics
Identify and treat historical outliers, so they do not unduly influence future predictions
Accurately predict an interval of possible outcomes based on historical variance in the series
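As a rough illustration of the idea, the sketch below fits a straight-line trend to a synthetic NULL % series and builds an interval from a robust estimate of the residual spread. It is far simpler than the model described here (no seasonality or holiday handling), and every name and number in it is an assumption for the example.

```python
from statistics import median

def linear_fit(ys):
    """Least-squares slope and intercept for y against x = 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def predicted_range(history, z=3.0):
    """Forecast the next point from the trend, with a band based on the
    median absolute deviation of residuals so outliers have less influence."""
    slope, intercept = linear_fit(history)
    residuals = [y - (intercept + slope * x) for x, y in enumerate(history)]
    med = median(residuals)
    spread = 1.4826 * median(abs(r - med) for r in residuals)
    forecast = intercept + slope * len(history)
    return forecast - z * spread, forecast + z * spread

# Synthetic NULL % drifting down from 98.0 with a small day-to-day wiggle.
history = [98.0 - 0.005 * d + (0.02 if d % 2 else -0.02) for d in range(120)]
low, high = predicted_range(history)

spike = 98.6
static_ok = 97.0 <= spike <= 99.0   # widened manual bounds miss the spike
dynamic_ok = low <= spike <= high   # trend-aware band flags it
```

Because the band follows the drift in the series, it stays tight enough to flag a spike that a widened static range would accept.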

Once these factors are accounted for, predicted range tests are a very powerful data testing strategy.

To effectively test their data, companies should use a portfolio of testing strategies. Static tests such as fixed rules or specified ranges should be used only when there are clearly known expectations about data that is already of high quality.

The majority of data tests should be dynamic to ensure high data test coverage that adapts as your data changes without requiring constant maintenance.

We are building a data testing product with a strong emphasis on dynamic testing over at Anomalo. So, if you’re interested in easily enabling dynamic tests for your data, head to our site to learn more or request a demo.

Dynamic Data Testing was originally published in Anomalo on Medium, where people are continuing the conversation by highlighting and responding to this story.

When Data Disappears

Jeremy Stanley — Tue, 10 Nov 2020 16:32:55 GMT

The most common data quality issue is no data at all

When we think of data quality, the first issues that come to mind are visible problems like duplicate rows, NULL values or corrupted records. But in fact the most common data quality issue is that data has simply disappeared.

In this post we will describe how data disappears, what the common causes are and what data teams can do to identify these issues.

Consider how companies process data into their warehouses:

Raw data is captured through logging systems or from external sources, then data loading systems pre-process the raw data and load it into a data warehouse. Then complex SQL pipelines filter data for important records, join multiple sources together and perform complex aggregations.

The resulting tables, often referred to as “fact tables”, are the golden datasets of an organization. Cross-functional teams leading strategic initiatives rely upon them, product managers make decisions using them and operations and sales teams are managed based on them.

It is hard to overestimate the importance of ensuring these tables are reliable.

But important fact tables can be the result of many transformations across disparate systems linking varied source datasets. This complexity increases the likelihood of incomplete data:

Data processing stages, popular platforms and examples of what breaks along the way

However, unlike some other data quality issues, there isn’t a single SQL query to validate that data is complete. This is because incomplete data can take many forms.

The simplest issue is that there is no recent data at all.

The data went entirely missing on the most recent date

Even this can be fraught with challenges, as we need to know when there should be data there. Ideally, systems track how long a dataset typically takes to update and alert when new data is significantly delayed.

A more nuanced issue that can go undetected is that there are fewer records than expected, or that data disappeared for a small period of time.

The data appears to be incomplete on the most recent date.

In this case, you need a time series model to predict what range of row counts is expected. Such models need to control for trend, weekday seasonality, annual seasonality and holidays. Doing this reliably at scale can be challenging.

Even more difficult to handle is when data disappears for an important segment of the data.

An important segment of the data is almost entirely gone, but overall row counts remain plausible.

The missing segment might be small enough to not affect the overall row counts, but still mission critical for your business.

So what happens when data disappears? The consequences can be widespread:

Dashboards fail to update, or show misleading charts based on limited data
Machine learning models miss training on the latest observations or are biased by incomplete data
Data products deliver stale or unrepresentative data to internal or external consumers

Usually, a data organization’s first instinct is to rely upon the monitoring of the systems that produce the data. Infrastructure engineering is monitoring the logging system with production metrics. And data engineering is monitoring the data coordination and loading systems for outages or missing data.

But as an organization matures, the way data is produced increases in complexity, and it becomes dangerous to rely upon monitoring of individual components:

Data processing flows almost always start simply, but become increasingly complex over time.

At Anomalo, we’ve found the only way to be certain your data is available is to test it independently from the systems producing it.

Data processing pipelines are complex, and require independent monitoring of the important data in the warehouse.

To ensure that data is available and complete, we run the following sequence of tests:

Are there any rows from yesterday?
Is the row count above a predicted lower bound?
Is there any missing data at the very end of the day?
Are there any key segments with far fewer than the expected number of rows?
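The four tests above can be sketched as a single function over one day of rows. In this illustration the lower bounds are passed in directly (in practice they would come from a time series model), and the column names `day`, `hour` and `route` are hypothetical.

```python
from collections import Counter

def completeness_checks(rows, day, lower_bound, segment_column,
                        segment_lower_bounds, last_hour=23):
    """Run four completeness tests for a single day of data."""
    todays = [r for r in rows if r["day"] == day]
    segment_counts = Counter(r[segment_column] for r in todays)
    return {
        "has_rows": len(todays) > 0,
        "row_count_ok": len(todays) >= lower_bound,
        "end_of_day_present": any(r["hour"] == last_hour for r in todays),
        "segments_ok": all(segment_counts[s] >= bound
                           for s, bound in segment_lower_bounds.items()),
    }

# Synthetic day where data stops arriving after hour 7.
rows = [
    {"day": "2020-10-29", "hour": hour, "route": route}
    for hour in range(8)
    for route in ("A", "B")
    for _ in range(10)
]
results = completeness_checks(
    rows, "2020-10-29", lower_bound=400,
    segment_column="route", segment_lower_bounds={"A": 100, "B": 100},
)
```

Running the tests in this order is useful because each later test only adds information once the earlier, cheaper ones have passed or failed.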

To set this up for a new table only requires a few pieces of information:

How to configure a table to be monitored for missing data in Anomalo.

Then, if we ever discover an issue with incomplete data we send a notification to the relevant teams’ Slack, Teams, PagerDuty or e-mail:

A slack notification showing an incomplete data issue. The green band is the predicted row count range per day, and the dots and lines are the actual row counts.

For example, the above Slack message is for a Google BigQuery Public dataset of San Francisco transit stops. We saw a big decrease in records on 2020–10–29. Our model predicted that there should be at least 10,053 rows, but only 3,672 rows exist. What happened to the other roughly 6,400 rows?

We also note that this was the first time this table had failed to load data on time in the last 34 runs, providing users with a sense of just how unusual or extreme this behavior is.

Digging deeper, we show exactly what happened on the 29th:

Predicted row counts per hour range (green band) versus actual (lines and dots).

It appears that the data disappeared by 8am, and never returned that day.

When it is relevant, we show a breakdown of exactly which segments are appearing less frequently than expected:

A detailed view by segment, showing the predicted range of rows as green bars, and red dots indicating what actually happened for 4 types of content.

This chart is from a COVID-19 online news dataset, and is showing that on November 4th there were fewer articles online about Cases, Quarantine, Prices and Ventilators than expected.

These details help data teams rapidly triage issues, identify root causes, and communicate to the affected teams internally. In many cases, they even help accelerate the development and deployment of resolutions. Moving quickly and confidently to identify and resolve such issues greatly reduces the chances they negatively impact the rest of the company.

Before any dataset is used for mission critical decisions or products, data-driven organizations should validate the quality of the data using an automated system. The first step to get right is to ensure the data hasn’t disappeared.

To learn more about Anomalo, request a demo.

When Data Disappears was originally published in Anomalo on Medium, where people are continuing the conversation by highlighting and responding to this story.

3 NIPS Papers We Loved

Jeremy Stanley — Thu, 14 Dec 2017 18:10:01 GMT

Know your model’s limits, interpret its behavior and learn from variable length sets.

One of two “breakout sessions” with presenter and GIANT screen for scale.

At NIPS 2017 what surprised me the most was not the size of the crowds (they were huge), the extravagance of the parties (I sleep early) or the controversy of the “rigor police” debate (it was entertaining).

No, what surprised me the most was the number of papers I saw that (when combined with talks and posters) were both relatively easy to understand and of immediate practical use.

In this post, I will briefly explain three of our favorites:

Knowing your model’s limits
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Lakshminarayanan et al. 2017, paper & video (1:00:10)
Interpreting model behavior
A Unified Approach to Interpreting Model Predictions
Lundberg et al. 2017, paper, video (17:45) & github
Learning from variable length sets
Deep Sets
Zaheer et al. 2017, paper & video (16:00)

I’d like to extend a huge thank-you to Balaji Lakshminarayanan, Scott Lundberg, Manzil Zaheer and their co-authors for doing this work and presenting at NIPS. Their cogent presentations and detailed answers to my many questions at their poster sessions enabled and inspired this post.

Knowing your model’s limits

Lakshminarayanan et al. 2017, paper & video (1:00:10)

Deep learning models can be surprisingly brittle. They can fail to generalize on data drawn from slightly different distributions and can give very different predictions given minor changes in the learning algorithm or initialization.

This raises the question: can we know when our deep learning models are uncertain about their predictions?

If so, this would help in many applications at Instacart, such as:

How uncertain are we about an item being in stock at a store location?
How much risk is there in a grocery delivery being late?
Is there a chance we should explore showing a rare item for a search?
What range of delivery demand should we anticipate at a store location?

In particular, anytime you make a decision based upon many noisy predictions, you risk favoring observations with large noise values (common in ranking for search or ads, or in optimization for pricing or logistics applications). Ensuring you control for prediction uncertainty to avoid this effect can be important.

Other methods can quantify uncertainty, but they have drawbacks. For example, Bayesian methods require assumptions about priors and are computationally expensive.

This paper provides an elegant method to quantify the uncertainty in deep learning models:

Lakshminarayanan et al. 2017 (video)

In practice you:

Choose a distribution for your output (Gaussian if you are optimizing for MSE, Poisson for counts, etc.)
Change the final layer in your deep network to output a variance estimate (or other distribution parameters) in addition to an estimate for the mean
Minimize the negative log-likelihood for the output distribution (e.g., with a custom loss function in Keras)
Train M networks in this way, each with a different random initialization
Let your final predicted distribution be the evenly weighted mixture of distributions from the M networks
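The last two steps of the recipe can be sketched in plain Python. This is a minimal illustration, assuming Gaussian outputs and using made-up per-network predictions (`member_means` and `member_vars` are hypothetical values, not results from the paper). It summarizes the evenly weighted mixture by its mean and variance:

```python
import math

def gaussian_nll(y, mean, variance):
    """Negative log-likelihood of y under a Gaussian N(mean, variance)."""
    return 0.5 * math.log(2 * math.pi * variance) + (y - mean) ** 2 / (2 * variance)

def combine_ensemble(means, variances):
    """Mean and variance of an evenly weighted mixture of Gaussians."""
    m = len(means)
    mix_mean = sum(means) / m
    # Mixture variance = average second moment minus squared mixture mean.
    mix_var = sum(v + mu ** 2 for mu, v in zip(means, variances)) / m - mix_mean ** 2
    return mix_mean, mix_var

# Hypothetical (mean, variance) outputs of M=5 networks for a single input.
member_means = [2.0, 2.5, 1.5, 3.0, 1.0]
member_vars = [1.0, 1.2, 0.8, 1.1, 0.9]

mean, var = combine_ensemble(member_means, member_vars)
```

The combined variance is the average of the per-network variances plus the spread of the per-network means, so disagreement between ensemble members shows up directly as extra uncertainty.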

While the paper also adds adversarial training (hard to implement for discrete inputs), some of their experiments showed that this was less important.

What is critical is that your network must produce an estimate of mean and variance, and then optimize the negative log-likelihood loss function. If you assume your errors are Gaussian distributed, then your loss function is:

Lakshminarayanan et al. 2017 (paper)

Where 𝜇 is the network’s estimate of the mean (conditioned on weights θ and input 𝒙), and σ² is the network’s estimate of the variance. If you assume a constant σ, this can be simplified to classical regression with MSE.
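For reference, the Gaussian negative log-likelihood for a single observation can be written out as follows. This is reconstructed from the definitions of 𝜇 and σ² given here, so the notation may differ slightly from the paper’s:

```latex
-\log p_\theta(y \mid x)
  = \frac{\log \sigma_\theta^2(x)}{2}
  + \frac{\left(y - \mu_\theta(x)\right)^2}{2\,\sigma_\theta^2(x)}
  + \text{constant}
```

When σ² does not depend on 𝒙, the first term is a constant and the second is proportional to the squared error, which is why the loss collapses to MSE.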

For an example on implementing a similar loss function in Keras, see the WTTE package, which uses a Weibull distribution instead of a Gaussian.

The following toy example from their paper illustrates the impact, where each red point is drawn from y = x³ + ε where ε ∼ N(0, 3²), the blue line is y = x³ and the grey range is the method’s variance estimate conditioned on x:

Lakshminarayanan et al. 2017 (paper)

The leftmost plot shows the variance of training M=5 simple networks which only output the mean and were optimized for MSE. Each model produces only a point estimate, and there is little variance observed over the ensemble.

The second plot shows the results of following the above recipe but with M=1. In this case, the network produces a distribution, but its level of uncertainty remains constant even when generalizing outside of its domain.

The third plot includes adversarial training (note how little difference it makes) with M=1, and the final plot does everything (mean and variance outputs, adversarial training and M=5). Only the final plot does a reasonable job of estimating uncertainty outside of the range of the training data.

The authors then show that an ensemble of networks trained in this way on digit classification with MNIST data does a far better job of estimating its uncertainty than other techniques like Monte Carlo dropout:

Lakshminarayanan et al. 2017 (paper)

In the above visualization, they vary the number of networks in the ensemble, and compare Monte Carlo dropout (green) to a simple ensemble (red) to an ensemble with adversarial training (blue). The grey curves use random data augmentation (rather than adversarial), and show that using the adversarial approach is what adds incremental value to a simple ensemble.

Finally, and perhaps most impressive of all, the authors show that their method responds appropriately when presented with data from an entirely different domain (letters rather than numbers):

Lakshminarayanan et al. 2017 (video)

The blue plots show the uncertainty (measured in entropy given this is a classification problem) for digit classification when presented with numbers. The bottom red plots show the uncertainty when presented with letters.

When using just 1 network in the ensemble (how most deep learning models are deployed), the model trained only on numbers gives equally confident (but obviously wrong) classification results for letters! But increasing to even just 5 networks produces significantly less confident predictions.

Interpreting model behavior

Lundberg et al. 2017, paper, video (17:45) & github

Most complex machine learning models are black boxes — we simply cannot fully understand how they work. However, we can gain deeper insight locally into the predictions that they make, and through this insight can better understand our data and models.

This understanding can be used to:

Build intuition for how our algorithms behave
Alter end user experiences to provide more context for predictions
Debug model building issues arising from data quality, model fit or generalization ability
Measure the value of different features in a model, and inform decisions for future data collection and engineering

At Instacart, we often want to deeply understand models we build such as:

The expected time until a user places their next order, as a function of their past order, delivery, site and rating behavior
Which product pairs are good replacements for each other in case we cannot find what the customer originally requested
How our customers react to limited delivery availability options or busy pricing

The SHAP (SHapley Additive exPlanations) paper and package provide an elegant way to decompose a model’s predictions into additive effects, which can then be easily visualized.

For example, here is a visualization that explains a LightGBM prediction of the chance a household earns $50k or more from a UCI census dataset:

Lundberg et al. 2017 (github)

In this case, the predicted log-odds of high income is -1.94. The largest factor depressing this prediction is young age (blue), and the largest factor increasing it is marital status (red).

Furthermore, you can visualize the aggregate impact of features on model predictions over an entire dataset with visualizations like these:

Lundberg et al. 2017 (github)

Here they find that Age is the most predictive feature, largely because there is a distinct group (the young) with low income. Capital Gain is the next most predictive, in part because of both very high and very low contributions.

This is a huge improvement over the typical information gain based variable importance visualizations commonly used with packages like XGBoost and LightGBM, which only show the relative importance of each feature:

R XGBoost Vignette

The package can also provide rich partial dependence plots which show the range of impact that a feature has across the training dataset population:

Lundberg et al. 2017 (github)

Note that the vertical spread of values in the above plot represents interaction effects between Age and other variables (the effect of Age changes with other variables). This is in contrast to traditional partial dependence plots, which show only the effect of varying Age in isolation.

To understand how the SHAP algorithm works, consider this example for a single observation:

Lundberg et al. 2017 (video)

Their model is predicting the chance of high income, and on average predicts a base rate of 20% for the entire population, denoted by E[f(x)]. For this specific example (named John in the talk), they predict a 55% probability, denoted by f(x).

The SHAP values answer the question of how they got from 20% to 55% for John.

Lundberg et al. 2017 (video)

They begin by ordering the features randomly, perhaps starting with Age, and ask how much the average prediction of 20% changes for users whose age is the same as John’s, denoted E[f(x) | x₁]. This can be found by integrating f(x) over all other features besides x₁ in the training dataset (a process that can be done efficiently in trees).

Suppose that they find that the prediction goes up to 35%, and so this gives them an estimate for the effect of Age, ϕ₁=15%. They then iteratively repeat this process through the remaining variables (concluding with marital status), to estimate ϕ₂, ϕ₃ and ϕ₄ for each of the other three features in this example:

Lundberg et al. 2017 (video)

However, unless a model is purely additive, the estimates for ϕ will vary with the ordering of features chosen. The SHAP algorithm solves this by averaging over all possible orderings of the N features (there are N! of them, which reduces to a weighted sum over the 2ᴺ feature subsets). The computational burden of evaluating all of them is alleviated by sampling M of them and using a regression model to attribute the impact from the samples to each feature.
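To make the ordering idea concrete, here is a minimal brute-force sketch for a toy model with three features. Everything in it is hypothetical: the `model` function, the `background` rows and the example `john` are invented for illustration. The conditional expectation is approximated by averaging over the background rows, which assumes the features are independent. This is not the SHAP package’s implementation, which uses much faster approximations.

```python
from itertools import permutations

def model(x):
    """Toy non-additive model (hypothetical): note the age * gain interaction."""
    age, married, gain = x
    return 0.1 + 0.2 * age + 0.1 * married + 0.3 * age * gain

# Hypothetical "training data" used to average out the unknown features.
background = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 1)]

def expected_value(x, known):
    """Average prediction with the features in `known` fixed to x's values."""
    total = 0.0
    for row in background:
        mixed = tuple(x[i] if i in known else row[i] for i in range(len(x)))
        total += model(mixed)
    return total / len(background)

def shapley_values(x):
    """Average each feature's marginal effect over every feature ordering."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        known = set()
        previous = expected_value(x, known)
        for i in order:
            known.add(i)
            current = expected_value(x, known)
            phi[i] += current - previous
            previous = current
    return [p / len(orderings) for p in phi]

john = (1, 1, 1)
base = expected_value(john, set())  # plays the role of E[f(x)]
phi = shapley_values(john)
```

By construction, `base + sum(phi)` equals `model(john)`, which is the additive decomposition the visualizations rely on. The loop over `permutations` grows factorially with the number of features, which is why sampling is needed in practice.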

The paper justifies the above approach using game theory, and further shows that this theory unifies other interpretation methodologies such as LIME and DeepLIFT:

Lundberg et al. 2017 (video)

And finally, because no NIPS paper would be complete without an MNIST example, they show that the SHAP algorithm does a better job at explaining what part of an 8 represents the essence of an 8 (as opposed to a 3):

Lundberg et al. 2017 (paper)

This shows that their approach can work well even for deep learning models.

Learning from variable length sets

Zaheer et al. 2017, paper & video (16:00)

Established deep learning architectures exist for modeling sparse categorical data (embeddings), sequence data (LSTMs) and image data (CNNs). But what do you do if you want your model to depend upon a variable length unordered set of inputs?

This was precisely the question we asked ourselves at Instacart a year ago while pondering our work on sorting grocery shopping lists in our Deep Learning with Emojis (Not Math) post.

I was overjoyed (and humbled) to see this paper at the NIPS poster session Wednesday night, which generalizes our work, and immediately reminded me of this tweet by Rachel Thomas:

Tweet by Rachel Thomas

In the Deep Sets paper, the authors explain that set based modeling problems fall into two classes:

Zaheer et al. 2017 (video)

In the permutation invariant case, you want to be able to re-order the inputs into your model without affecting the prediction (which is often into a space of a different dimension from your input).

For example, at Instacart we could predict:

How much time it will take to pick a basket of groceries at a store location
Will a user add to cart any item given a query and a set of product search results
How efficient will we be in a city given a set of deliveries and their location and due times, and a set of shoppers and their locations and current status

In the permutation equivariant case, you will produce a predicted value for every input in the set, and you want to be able to re-order the inputs and ensure that the ordering of the outputs changes accordingly.

For example, at Instacart we could predict:

The probability that each item in a set will be picked by an in-store shopper next given the previous item and store (our Deep Learning with Emojis (Not Math) use case)
Which of the products a user has purchased in the past they will re-purchase in their next order (our 3 Million Instacart Orders, Open Sourced use case)

The paper proves that any such set based architecture must take the following form:

Zaheer et al. 2017 (video)

For the permutation invariant case, the architecture will look like this:

Zaheer et al. 2017 (video)

Where ϕ is an arbitrary neural network architecture applied iteratively over every set element 𝒙 (for example, using the Keras TimeDistributed layer wrapper). The outputs of ϕ must then be summed along the set dimension, and can then be passed into yet another arbitrary neural network ⍴, which can produce the final output predictions.
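The invariant architecture can be sketched without any deep learning framework. In this minimal illustration, `phi` and `rho` are tiny hand-written functions with hypothetical fixed weights, standing in for the arbitrary networks ϕ and ⍴:

```python
import math

def phi(x):
    """Per-element network (hypothetical fixed weights): scalar -> 2-d features."""
    return [math.tanh(0.5 * x), math.tanh(-0.3 * x + 0.1)]

def rho(pooled):
    """Network applied to the pooled set representation."""
    return 1.5 * pooled[0] - 0.7 * pooled[1] + 0.2

def deep_set(elements):
    """Apply phi to every element, sum over the set dimension, then apply rho."""
    pooled = [0.0, 0.0]
    for x in elements:
        features = phi(x)
        pooled = [p + f for p, f in zip(pooled, features)]
    return rho(pooled)

a = deep_set([1.0, 2.0, 7.0])
b = deep_set([7.0, 1.0, 2.0])  # same set, different order
c = deep_set([1.0, 2.0])       # a shorter set works with the same function
```

Because summation ignores order and accepts any number of terms, `a` and `b` agree (up to floating-point rounding) and no padding or truncation is needed for `c`.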

For the permutation equivariant case, the architecture is the same as above, but instead of using ⍴ you use DeepSets layers:

Zaheer et al. 2017 (video)

Where you can see that the output is equivariant to the ordering of the input given the symmetry in weight sharing.
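A minimal sketch of one such layer for scalar set elements follows. It assumes the simplest weight-sharing form, in which each output depends on its own input through one shared weight (`lam`) and on the sum of all inputs through another (`gamma`). Both weight values and the `tanh` nonlinearity are arbitrary choices for illustration:

```python
import math

def equivariant_layer(xs, lam=0.8, gamma=-0.1):
    """One output per input: a shared weight on the element plus a shared weight on the set sum."""
    total = sum(xs)
    return [math.tanh(lam * x + gamma * total) for x in xs]

inputs = [1.0, 2.0, 7.0]
outputs = equivariant_layer(inputs)

# Re-ordering the inputs re-orders the outputs in exactly the same way.
permuted_outputs = equivariant_layer([7.0, 1.0, 2.0])
expected = [outputs[2], outputs[0], outputs[1]]
```

Since every element is treated with the same two weights, permuting the inputs can only permute the outputs.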

The paper provides an obligatory MNIST example, where they seek to learn an architecture that can sum hand-written digits:

Zaheer et al. 2017 (video)

In this case you want the architecture to be permutation invariant, so that sum(1, 2) = sum(2, 1), and to handle variable length input such as sum(1, 2, 7).

Two simple alternative approaches both fail:

Zaheer et al. 2017 (video)

On the left hand side, they concatenate the digits and pass them into a hidden layer, but this fails to handle variable sequence length inputs. On the right hand side, they pass them into a recurrent layer, but the results will not be order invariant.

How big of a deal is that? In practice, they found that both GRU and LSTM layers failed dramatically to generalize to sequence lengths longer than 10:

Zaheer et al. 2017 (video)

This paper is particularly rich with application examples, ranging from image tagging, to outlier detection, to point-cloud classification:

Zaheer et al. 2017 (paper)

Summary

Beyond all the hype, NIPS 2017 was an amazing event, and these three papers demonstrate how practically useful these conferences are for applied AI and Machine Learning work. In each case, the authors’ work provided mathematical rigor, practical advice, and experimental validation for questions we have been pondering at Instacart.

I hope that you are now as excited by these ideas as we are! If you are interested in working on one of the many challenging problems we have at Instacart, check out our careers page at careers.instacart.com.

Again, I’d like to thank Balaji Lakshminarayanan, Scott Lundberg, Manzil Zaheer and their co-authors for their work, and everyone involved in organizing NIPS 2017. I’d also like to thank Jeremy Howard for his feedback on this post.

3 NIPS Papers We Loved was originally published in tech-at-instacart on Medium, where people are continuing the conversation by highlighting and responding to this story.

700 Women Founders

Jeremy Stanley — Sat, 24 Jun 2017 22:06:10 GMT

Analyzing 700 women founders, and the VCs that invested in them.

On Friday, I awoke to a trickle of twitter updates about an article in The Information, detailing the unwanted advances of a VC, Justin Caldbeck of Binary Capital. The article was shocking, with three women on the record (and three others off) accusing him of blatantly abusing his position of power with women founders.

Over the course of the day, the trickle intensified, with reports from Pando, Axios and finally a reaction from Reid Hoffman. The day concluded with another Axios article covering the VC’s indefinite leave of absence.

Much has been written about the gender and minority diversity of employees at technology companies (this Fortune article, for example). Yet little has been written about the gender diversity of VC investments.

This surprised me — as founders play crucial roles in building diverse and inclusive companies.

So I downloaded the freely available Crunchbase © 2013 Snapshot database of companies, their employees and their funding rounds. I extracted companies who raised capital in 2009–2013 in Seed, A, B or C rounds.

Then, using the Gender package in R I identified the likelihood that each founder was a woman or man. While gender analysis from first names is far from perfect, for over 80% of founder names we can be 95% certain they are female or male, and so aggregate conclusions should be reliable.

Not surprisingly, I found that a woefully small percentage of founders are women. Of the 17,961 investments by 2,435 investors in 6,771 founders from 3,867 companies analyzed with gendered names, only 10.5% were women.

Women are even less likely to be represented in late funding rounds. The percentage of women founders increased until 2012, and then regressed in 2013. Some investment regions (Berlin, Tel Aviv) have a higher percentage of women founders, while others (Philadelphia, Austin) have materially lower percentages.

Women are also more likely to be founders in companies tagged as fashion, e-commerce or gaming startups. They are much less likely to be founders in companies tagged as real-time, big data or marketing startups.

I then ranked the top 100 VCs (by companies invested in) over these 5 years by their percentage of women founders, and found that the 10 least diverse VC portfolios all had 5% or fewer women founders, and that only the top two most diverse VC portfolios had just over 20%.

The Analysis

Gender is not a field that is reported by Crunchbase, so I used the first names of the founders (identified by searching for ‘founder’ or ‘Founder’ in their title) to estimate gender.

This word cloud shows the top 500 founder first names, and colors predominantly male names in orange, female names in green, and unidentifiable names (e.g., ‘J.’) in grey.

Top 500 most common founder first names by inferred gender (orange = male, green = female, grey = unknown)

There are many Michael and David founders, but it’s immediately clear that female names are few and far between.

The percent of founders that are women drops significantly as you progress from Seed to A, B and C rounds of funding.

Percent of women founders by round

Over time, the percentage of women founders rose from roughly 7% in 2009 to almost 13% in 2012, but receded back to only 10% in 2013.

Percent of women founders over time

Some regions (Berlin and Tel Aviv) have 16% or more women founders, while others (Philadelphia and Austin) have fewer than 3%. But these regions also have small sample sizes (70 or fewer companies).

New York, however, has a significantly higher percent of women founders (13%) as compared to the SF Bay area (10%).

Percent of women founders by region and region size.

There are also tags that are much more or less likely to be associated with startups founded by women.

Percent of companies invested in by tag that have one or more women founders

The VC Rankings

Last, but not least, are the rankings of the top 100 VCs.

Top 100 investors sorted by percent of women founders

All of these firms have 24+ founders in this sample (and some as many as 374), and so the variation in results is clearly significant, but data points from firms with fewer investments may be less meaningful. Note that the size of the circles represents the firm’s number of portfolio companies.

The discrepancy between the top and bottom VCs becomes starkly clear when we compare the names of their founders:

Founder first names for the top 5 VCs (left) and bottom 5 VCs (right), where orange are men and green are women

Keep in mind that these findings are simply a summary of readily available data. A reputation for abusive behavior might cause a VC to have fewer women founders, but so could their investment focus, their networks, coincidence and countless other factors. Let’s not jump to any conclusions on specific VCs.

I would also hope that some VCs have made significant progress since 2013, but for now we’ll have to wait for Crunchbase to make more recent data available without subscription to find out.

My Opinion

I believe the women who have said that there are other, as yet unreported examples of sexism and abuse of power in both the VC and broader technology community. I hope more brave women speak out and ignite fires of indignation that, with sufficient attention, will lead to meaningful change.

But I also believe in the power of data, of transparency, and of holding our community accountable to outcomes.

I hope that this analysis can begin that for VCs.

The Code & Data

This post was made significantly better because of feedback from Michelle Suwannukul and Daniel Tunkelang, thank you both!

Space, Time and Groceries

Jeremy Stanley — Tue, 13 Jun 2017 17:59:27 GMT

Grocery delivery visualized in python with datashader.

At Instacart, we deliver a lot of groceries. By the end of next year, 80% of American households will be able to use Instacart. Our challenge: complete every delivery on-time, with the right groceries.

Over the course of a week, we traverse cities all over the United States many times over while delivering groceries:

Routes followed by shoppers in SF, Austin, Boston and Miami

How do we bring order to the chaos?

In the remainder of this post, we’ll first introduce the logistics problem Instacart is solving, outline the architecture of our systems and describe the GPS data we collect. Then we will conclude by touring a series of datashader visualizations:

Example datashader visualizations at Instacart

Visualizations like these help us to build intuition about our system, generate hypotheses for improvements, sanity check our changes, identify best practices and improve our operations.

But before we get too caught up in these visualizations, let’s first quickly cover the problem we are solving.

Logistics @ Instacart

When using our app to order groceries, you first choose a retailer, and then shop for groceries to be delivered. Over the course of a few hours, we have thousands of such orders to deliver. Doing this efficiently is the job of our logistics systems.

At its simplest, our logistics problem can be viewed as solving a TSP (traveling salesman problem) where the shopper must go to the store first. For example, the shopper drives to the store, picks your groceries (along with two other orders), and then delivers them in a sequence:

There are many algorithms for solving TSPs, and perfect solutions can be found for up to tens of thousands of deliveries. Even with millions of deliveries, heuristics can come within 2–3% of the optimal solution.

But in practice we have a fleet of shoppers to fulfill orders. Each will be given a batch of orders to shop for, and will then deliver those orders in sequence:

This problem is called a VRP (Vehicle Routing Problem), which generalizes the traveling salesman problem. (Generalizes is a mathy euphemism for ‘even harder to solve optimally’.)

But we can’t stop there. Instacart is named Instacart for a reason — we commit to narrow delivery windows for our customers (usually 1 hour long). So only a subset of assigned routes will be viable, and we must jointly optimize the expected timeliness of our deliveries with the speed of our movement.

This is called a VRPTW (Vehicle Routing Problem with Time Windows):

If only life were so simple!

In reality, not all of our shoppers are equivalent. Some have large vehicles, others have small ones. Some have club-cards for retailers like Costco, while others do not. Some can fulfill alcohol orders, while others may not. This means our problem is capacitated, and so we can add a big C to the front of our acronym.

Furthermore, each vehicle can take more than one trip, which lets us append the letters MT (for Multiple Trips). So really, we have a CVRPTWMT:

Oh, and everything evolves continuously under many sources of uncertainty. New orders are placed. Shoppers come on and off of shift. Weather, traffic and other events wreak havoc on plans. Such problems are referred to as being stochastic, which gives us one more letter — an S!

So in the end, we are left to solve an SCVRPTWMT (😰). They say that the longer the acronym, the harder the problem is to solve.

But don’t fret, all is not lost.

Naïve to Novel

A simple system can be implemented for routing shoppers that accomplishes some of our goals without a great deal of complexity:

Sort orders by when they are due
Find the shopper who is free that can do the first order the fastest
Search remaining orders for any that can be added without being late
Dispatch the orders found to this shopper
Repeat

This will optimize for fulfilling the most urgent orders in the most timely fashion, and seek efficiencies where possible as a secondary objective.
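The five steps above can be sketched as a toy program. This is a simplified illustration, not Instacart’s system. It assumes a single store at position 0 on a one-dimensional street, travel time equal to distance, a fixed pick time per order, and deliveries made in due-time order. All names and numbers (`Order`, `Shopper`, `STORE`, `PICK_MINUTES`, the sample data) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Order:
    name: str
    position: float  # customer location, in minutes of travel along the street
    due: float       # minutes from now

@dataclass
class Shopper:
    name: str
    position: float

STORE = 0.0
PICK_MINUTES = 5.0  # assumed picking time per order

def delivery_times(shopper, batch):
    """Arrival time at each customer: drive to store, pick everything, deliver in order."""
    clock = abs(shopper.position - STORE) + PICK_MINUTES * len(batch)
    here = STORE
    times = []
    for order in batch:
        clock += abs(order.position - here)
        here = order.position
        times.append(clock)
    return times

def on_time(shopper, batch):
    return all(t <= o.due for t, o in zip(delivery_times(shopper, batch), batch))

def greedy_dispatch(orders, shoppers):
    pending = sorted(orders, key=lambda o: o.due)  # 1. sort by due time
    free = list(shoppers)
    plan = {}
    while pending and free:                        # 5. repeat
        first = pending.pop(0)
        # 2. the free shopper who can complete the first order fastest
        shopper = min(free, key=lambda s: delivery_times(s, [first])[0])
        batch = [first]
        for order in list(pending):                # 3. add orders that stay on time
            if on_time(shopper, batch + [order]):
                batch.append(order)
                pending.remove(order)
        plan[shopper.name] = [o.name for o in batch]  # 4. dispatch
        free.remove(shopper)
    return plan

orders = [Order("A", 10, 30), Order("B", 12, 60), Order("C", -20, 45)]
shoppers = [Shopper("s1", 2), Shopper("s2", -5)]
plan = greedy_dispatch(orders, shoppers)
```

Here order B rides along with A because it lies in the same direction, while C would be late on that route and goes to the second shopper. The sketch never reconsiders an assignment once made, which is the main weakness of a greedy approach.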

We began with a variation of this kind of simple greedy algorithm, and have since introduced novel approaches that have had a dramatic impact on our speed, without compromising on late deliveries or order quality:

We have halved the number of minutes per delivery in San Francisco, and continue to set aggressive goals.

Some of the changes we have introduced include:

Machine learning to predict the distribution of time expected for any given shopper and assignment
Decomposing the CVRPTW into sub-problems (clustering deliveries, shopper assignment) and solving these sub-problems to near optimality
Applying heuristics for limiting search spaces, dealing with anomalies, fine-tuning solutions and adapting under uncertainty
Re-computing batch plans every minute and making dispatching decisions just in time

The application that decides what orders each shopper should fulfill is called our ‘fulfillment engine’, and it is just one component of our overall logistics system, which also forecasts demand and shopper behavior, manages capacity and busy pricing and plans and adapts our staffing:

These systems are highly interdependent, and we are increasingly using simulations to optimize them jointly under many sources of uncertainty.

In the months to come we will publish more detailed posts about these systems and the fun engineering, machine learning, optimization and operations challenges they present, so stay tuned!

The Data

In order to optimize the assignment and routing of our shoppers, and to communicate effectively with our consumers, we collect a stream of GPS location data.

For example, these are what ten updates might look like for a fictional shopper:

Every ~10 seconds, we collect the timestamp, latitude and longitude, speed, direction and accuracy reported by the device. The latitude and longitude are shown here rounded to 4 digits, but are collected to 6 digits in production. The speed is measured in miles per hour (this fictional shopper might be walking to their car). The direction is in degrees, and is -1 when a shopper is at a halt. The accuracy is in meters, and indicates the expected error of the measurement from the real position.

Over the course of a single day, we collect tens of millions of these updates across the country.

Datashader

Datashader provides the ability to quickly and interactively visualize millions, or even billions of points.

An animation of interactively zooming into a SF datashader plot

For more information on datashader, I recommend you start with their plotting pitfalls notebook. Many of the visualizations in this post are modeled after their NYC Taxi and OpenSky notebooks.

Note that for these visualizations, we show only data points where shoppers are moving quickly while driving and delivering groceries, or we are zoomed into store locations. This is to protect the privacy of our shoppers and our customers.

Accuracy

First, let’s inspect the accuracy of the GPS data we collect:

GPS updates in San Francisco highlighting accurate (blue) measurements and inaccurate (red) measurements

The red points represent inaccurate measurements (more than 10 meters), whereas those in blue are accurate measurements (10 meters or less). We can immediately see that accuracy is poorer in the financial district (upper-right), where tall buildings obstruct the GPS. But even there the data accumulates to clearly show an outline of the streets. The measurements are also less accurate inside of any city block, where presumably the shoppers are indoors and the GPS signal is obstructed.
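The idea behind a plot like this can be sketched in plain Python: label each update using the 10 meter threshold, bin it into a pixel grid, and count each label per pixel. This shows the aggregation concept only and does not use datashader’s API. The sample points, the bounding box and the grid size are made up.

```python
from collections import Counter

ACCURACY_THRESHOLD_M = 10.0  # 10 meters or less counts as accurate

def rasterize(points, lat_range, lon_range, width, height):
    """Count accurate and inaccurate GPS updates falling in each pixel."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    counts = {}
    for lat, lon, accuracy_m in points:
        if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
            continue  # outside the plotted area
        col = int((lon - lon_min) / (lon_max - lon_min) * width)
        row = int((lat - lat_min) / (lat_max - lat_min) * height)
        label = "accurate" if accuracy_m <= ACCURACY_THRESHOLD_M else "inaccurate"
        counts.setdefault((row, col), Counter())[label] += 1
    return counts

# Hypothetical (latitude, longitude, accuracy in meters) updates.
points = [
    (37.7750, -122.4190, 5.0),
    (37.7751, -122.4189, 8.0),
    (37.7950, -122.4000, 35.0),
    (37.7952, -122.4001, 10.0),
]
pixels = rasterize(points, (37.70, 37.82), (-122.52, -122.36), width=4, height=3)

totals = Counter()
for pixel_counts in pixels.values():
    totals.update(pixel_counts)
```

Each pixel can then be colored by the balance of its two counts, for example blue where accurate updates dominate and red where inaccurate ones do.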

We can zoom into one of our store locations and see the shoppers moving through the parking lot with highly accurate GPS locations, but losing that signal within the stores themselves:

GPS updates around a store location highlighting accurate (blue) measurements and inaccurate (red) measurements

Furthermore, there appear to be buildings or other obstructions that ‘shade’ the GPS accuracy over certain parts of this parking lot (see the left hand side).

Speed

We can filter the data to just the accurate observations, and then color the observations based on the speed the shopper is moving at:

Speed of movement in San Francisco (dark blue is slow, yellow is fast)

This clearly shows our shoppers moving fastest on the highways in San Francisco, and slowest in the financial district. It also shows that moving quickly from the south to the north side of the city is difficult, as there are no fast routes making that connection.

Direction

If we instead color each point by the direction the shopper was moving in, we can clearly see the organization of the city streets:

GPS updates while moving with color mapped to direction of movement

One way roads alternate from one block to the next, some roads have traffic moving both ways, and other roads switch directions at certain intersections. The roundabouts and circular exit and entrance ramps make nice color wheels.

Store Location

When shoppers are delivering groceries, we know the store location they originated from, and so can color the map to visualize what stores frequently deliver to each neighborhood:

Paths taken when delivering from 10 store locations in SF (deliveries per location sampled to be constant)

Some stores dominate large areas, especially on the edge of the city. In more congested neighborhoods many streets are frequently traversed by shoppers from multiple stores, and the colors blend together into mixed hues.

Shopper State

We also measure where each shopper is in their workflow at any given moment. This lets us see what shoppers are doing inside the store locations (when measurement is accurate enough):

High accuracy GPS updates from within a store location

The checkout area (brown) is near the shopping area (purple), but the staging area (pink) is on another side of the store.

Or, by visualizing paths instead of points, we can clearly see the movement of shoppers through store parking lots:

Paths followed in the parking lot of a store location while in different states in the app (color)

Each lane is (mostly) one way (pink or orange), and shoppers enter the store to pick up groceries on one side of the building (blue). You can even see where shoppers typically park while waiting for their next order (yellow).

Visualizations like these help us to:

Build intuition for how our logistics system functions at scale
Generate hypotheses for ways to improve our algorithms or operations
Confirm that changes to production have the expected behavior
Make better operational decisions about parking spaces, store locations and our product offering

If you are interested in joining the team to help us engineer, optimize or analyze our logistics systems, or to work on any of the other many challenging problems we have at Instacart, check out our careers page at careers.instacart.com.

Space, Time and Groceries was originally published in tech-at-instacart on Medium, where people are continuing the conversation by highlighting and responding to this story.