Apply Regular Expressions in a data project

A journey of applying Regular Expressions in one of our projects, and the lessons I learned.

Andrew Zhu (Shudong Zhu)
Data Science at Microsoft

Imagine for a moment that you have gigabytes, or even terabytes, of text data in front of you. Now suppose that you must extract specific text that follows a fuzzy pattern from all that data, within a limited amount of processing time. How would you proceed?

In the Azure Advisor Score service that we built for our customers, we applied Regular Expressions to not just one but three scenarios involving massive text data: 1) extracting specific information from records that include an Azure Resource ID; 2) extracting action descriptions from the raw text of Azure Advisor Recommendations; and 3) extracting the potential cost-saving amount (for example, 123) and currency code (for example, USD) from an Advisor Recommendation's raw string. We designed these Regular Expression applications around three goals:

  • Fast enough that the solution can handle 100 million rows in one run and return data in less than 10 minutes.
  • Flexible enough that the extraction solution works both in the data pipeline and in the analysis script.
  • Extensible enough that we can easily add new ingredients when new requirements arise.
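
To make the first and third scenarios concrete, here is a minimal Python sketch. It is not the production code behind Azure Advisor Score. The patterns, function names, and sample strings are illustrative assumptions: the Resource ID pattern assumes the common `/subscriptions/.../resourceGroups/.../providers/...` layout, and the cost-saving pattern assumes the raw string contains an amount followed by a three-letter currency code.

```python
import re

# Assumed Azure Resource ID layout (hypothetical pattern for illustration):
# /subscriptions/<id>/resourceGroups/<group>/providers/<namespace>/<type>/<name>
RESOURCE_ID = re.compile(
    r"/subscriptions/(?P<subscription_id>[^/\s]+)"
    r"/resourceGroups/(?P<resource_group>[^/\s]+)"
    r"/providers/(?P<provider>[^/\s]+)"
    r"/(?P<resource_type>[^/\s]+)"
    r"/(?P<resource_name>[^/\s]+)",
    re.IGNORECASE,
)

# Assumed raw-string shape: a number followed by a three-letter currency code,
# for example "123 USD". The real raw format may differ.
COST_SAVING = re.compile(r"(?P<amount>\d+(?:\.\d+)?)\s*(?P<currency>[A-Z]{3})\b")


def extract_resource_fields(text):
    """Return the named parts of the first Resource ID found, or None."""
    match = RESOURCE_ID.search(text)
    return match.groupdict() if match else None


def extract_cost_saving(text):
    """Return (amount, currency_code) from the first match, or None."""
    match = COST_SAVING.search(text)
    if match is None:
        return None
    return float(match.group("amount")), match.group("currency")


if __name__ == "__main__":
    record = (
        "/subscriptions/00000000-0000-0000-0000-000000000000"
        "/resourceGroups/rg-demo/providers/Microsoft.Compute"
        "/virtualMachines/vm01"
    )
    print(extract_resource_fields(record))
    print(extract_cost_saving("Potential savings: 123 USD"))
```

Compiling each pattern once at module level and wrapping it in a small function is one way to reuse the same logic in both a pipeline job and an analysis script. Adding a new extraction then means adding one more pattern and function.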
