How To Extract Data With RegEx In UiPath

Abhimanyu Thite
Published in Globant
6 min read · Jun 22, 2023
RegEx helps UiPath extract data.

In this blog, we will discuss what a RegEx is, why we need it, and how to use it for string manipulation in UiPath. We will also look at two ways to apply it: the built-in “Matches” activity in UiPath Studio, or a VB.NET expression that calls Regex.Match in any activity where a value can be assigned.

Why is RegEx required?

In UiPath, regular expressions (regex) are used for text manipulation, data extraction, and pattern matching. They let you extract specific information from unstructured or semi-structured text, validate data formats, modify the text, clean data, automate workflows, and scrape web pages. Regex aids in effectively processing and extracting useful information from textual data, hence boosting UiPath’s capabilities in text-based automation applications.

What is RegEx?

A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in the text. Usually, such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.
Here are some examples of texts along with potential use cases for regular expressions:

Email Addresses:
Text: “Contact us at info@example.com for more information.”
Use Case: Extracting email addresses from text.
Regex:

\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b

Phone Numbers:
Text: “Call us at +1 (123) 456-7890 for assistance.”
Use Case: Extracting phone numbers from text.
Regex:

\+\d{1,}\s?\(\d{3}\)\s?\d{3}-\d{4}

Dates:
Text: “The event will take place on 2023-05-15.”
Use Case: Extracting dates in a specific format from the text.
Regex:

\d{4}-\d{2}-\d{2}
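
The three patterns above can be tried out before they go into a workflow. The following is a minimal sketch in Python, used here only as a convenient stand-in for the .NET engine that UiPath relies on; for these inputs the patterns should behave the same way in both. The helper name find_all is hypothetical.

```python
import re

# Patterns from the examples above, written as Python raw strings.
EMAIL = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"
PHONE = r"\+\d{1,}\s?\(\d{3}\)\s?\d{3}-\d{4}"
DATE = r"\d{4}-\d{2}-\d{2}"


def find_all(pattern, text):
    """Return the text of every non-overlapping match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


emails = find_all(EMAIL, "Contact us at info@example.com for more information.")
phones = find_all(PHONE, "Call us at +1 (123) 456-7890 for assistance.")
dates = find_all(DATE, "The event will take place on 2023-05-15.")

print(emails)  # ['info@example.com']
print(phones)  # ['+1 (123) 456-7890']
print(dates)   # ['2023-05-15']
```

Note that the date pattern only checks the shape of the text (four digits, two digits, two digits); it does not verify that the value is a real calendar date.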

RegEx in UiPath

Extracting data with regex in UiPath is straightforward. The extraction can be done with the built-in Matches activity or directly with VB.NET code.

Let’s take an example of an email body to find out the Invoice Number, Product, Payment, and Due date:

Hi Recipient_name,

I hope you’re well.

Please see attached Invoice Number ABC-000002 for Product ABC with Payment 14000/- Rupees which is due on 31/01/2023. Don’t hesitate to reach out if you have any questions.

Kind regards,
Sender_name

1. Using the Matches activity

Read your text into a variable of type String. In this example, the email body text is stored in in_InputText, a String variable. With the help of the RegEx Builder, we can create the expression for the pattern we are searching for.

By selecting “Advanced” we can enter the expression directly:

Properties:

  • Input: Use your text String.
  • Pattern: The RegEx pattern, given as an expression. Make sure it is enclosed in double quotes. In this case, the expression is
(?<=Product)(.*)(?=with)
  • Result: A variable of type IEnumerable<Match>; it can be created with Ctrl+K
  • RegEx options:

Retrieve values from the RegEx output:
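
The Result collection holds one Match object per occurrence, and in .NET each Match exposes its text through the Value property. As a rough analogue only, the sketch below uses Python's re module in place of the Matches activity; the shortened sample sentence and the variable names are assumptions made for illustration.

```python
import re

# Shortened version of the email sentence, used only for this illustration.
text = "Invoice Number ABC-000002 for Product ABC with Payment 14000/- Rupees"

# re.finditer yields one match object per occurrence, similar in spirit
# to the collection that the Matches activity returns.
matches = list(re.finditer(r"(?<=Product)(.*)(?=with)", text))

# group(0) plays the role of Match.Value; strip() removes the spaces
# that surround the product name in the matched text.
values = [m.group(0).strip() for m in matches]

print(len(matches))  # 1
print(values)        # ['ABC']
```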

2. Using an Assign

Using the built-in activities ‘Matches’ and ‘Is Match’ is often helpful. However, for many cases there is a more direct way to extract data from text with RegEx in UiPath: call Regex.Match in an Assign activity.

System.Text.RegularExpressions.Regex.Match(in_InputText, "(\w+)-(\d*)").Value
  • in_InputText: String variable that holds the email body text
  • strInvoiceNumber: String variable on the left side of the Assign that stores the invoice number
  • System.Text.RegularExpressions: the Microsoft .NET namespace that contains the Regex class
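
In .NET, Regex.Match returns the first match, and its Value is an empty string when nothing matches. The sketch below imitates that behavior in Python so the expression can be tested outside UiPath; first_match_value is a hypothetical helper, and the sample sentence is a shortened version of the email body.

```python
import re


def first_match_value(text, pattern):
    """Imitate Regex.Match(text, pattern).Value: first match, or "" if none."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""


in_input_text = "Please see attached Invoice Number ABC-000002 for Product ABC"

str_invoice_number = first_match_value(in_input_text, r"(\w+)-(\d*)")
missing = first_match_value("no invoice here", r"(\w+)-(\d*)")

print(str_invoice_number)  # ABC-000002
print(repr(missing))       # ''
```

Checking for the empty string before using the value is a cheap way to catch emails that do not contain the expected field.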

Patterns used to extract information from the email body shown above:

Invoice Number:

(\w+)-(\d*)

Explanation:

  • (\w+) This part of the regular expression matches one or more word characters. Word characters include letters (a-z, A-Z), digits (0-9), and underscores (_). The parentheses () are used to capture and group the matched characters.
  • - This matches a hyphen character - literally.
  • (\d*) This part of the regular expression matches zero or more digits. The parentheses () are again used to capture and group the matched digits.
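
To make the two capture groups concrete, here is a small Python sketch (again a stand-in for the .NET engine). It also shows a side effect of \d*: because zero digits are allowed, an ordinary hyphenated word matches too. The stricter variant with \d+ and the second sample sentence are suggestions for illustration, not part of the original workflow.

```python
import re

line = "Please see attached Invoice Number ABC-000002 for Product ABC"

m = re.search(r"(\w+)-(\d*)", line)
prefix = m.group(1)  # text captured by (\w+)
digits = m.group(2)  # text captured by (\d*)

print(m.group(0))  # ABC-000002
print(prefix)      # ABC
print(digits)      # 000002

# \d* accepts zero digits, so a hyphenated word alone is enough to match.
loose = re.search(r"(\w+)-(\d*)", "a well-known vendor").group(0)
print(loose)  # well-

# Requiring at least one digit with \d+ rejects that case.
strict = re.search(r"(\w+)-(\d+)", "a well-known vendor")
print(strict)  # None
```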

Product:

(?<=Product)(.*)(?=with)

Explanation:

  • (?<=Product) This part of the regular expression uses a positive look-behind assertion. It asserts that the match should be preceded by “Product”. In other words, it checks if the string being matched is preceded by the word “Product” without including it in the final match.
  • (.*) This part of the regular expression captures any character (except newline characters) zero or more times. As written, the quantifier is greedy, so it extends to the last “with” on the line. The sample email contains only one “with”, so the result is the same as with the lazy form (.*?), which would stop at the first “with”. The captured text includes the spaces around “ABC”, so trim it before use.
  • (?=with) This part of the regular expression uses a positive lookahead assertion. It asserts that the match should be followed by the string “with”. In other words, it checks if the string being matched is followed by the word “with” without including it in the final match.
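
The following Python sketch (a stand-in for the .NET engine) applies the Product pattern to a shortened version of the email sentence and then contrasts the greedy and lazy forms. The second sentence, with two occurrences of “with”, is made up for this illustration.

```python
import re

line = "Invoice Number ABC-000002 for Product ABC with Payment 14000/- Rupees"

raw = re.search(r"(?<=Product)(.*)(?=with)", line).group(0)
product = raw.strip()  # remove the surrounding spaces

print(repr(raw))  # ' ABC '
print(product)    # ABC

# Hypothetical sentence containing "with" twice.
two_withs = "Product ABC with Payment 14000 with tax"

greedy = re.search(r"(?<=Product)(.*)(?=with)", two_withs).group(0)
lazy = re.search(r"(?<=Product)(.*?)(?=with)", two_withs).group(0)

print(repr(greedy))  # ' ABC with Payment 14000 '
print(repr(lazy))    # ' ABC '
```

When the text after the field may contain the closing word more than once, the lazy form is usually the safer choice.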

Payment:

(?<=Payment )(\d*)

Explanation:

  • (?<=Payment ) This part of the regular expression uses a positive look-behind assertion. It asserts that the match should be preceded by “Payment ” (the word followed by a space). In other words, it checks if the string being matched is preceded by “Payment ” without including it in the final match. The trailing space matters: without it, the match would start directly after the word, where no digit follows.
  • (\d*) This part of the regular expression captures zero or more digits. The parentheses () are used to capture and group the matched digits.
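
A short Python sketch (standing in for the .NET engine) shows the Payment pattern on the email sentence, and what happens if the space inside the look-behind is dropped. The variable names are chosen for this illustration.

```python
import re

line = "for Product ABC with Payment 14000/- Rupees which is due on 31/01/2023."

# Look-behind includes the trailing space, so the match starts at the digits.
payment = re.search(r"(?<=Payment )(\d*)", line).group(0)
print(payment)  # 14000

# Without the space, the match starts right after "Payment", where the next
# character is a space. \d* then matches zero digits: an empty string.
no_space = re.search(r"(?<=Payment)(\d*)", line).group(0)
print(repr(no_space))  # ''
```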

Due Date:

(?<=due on )((\d*(/)*)*)

Explanation:

  • (?<=due on ) This part of the regular expression uses a positive look-behind assertion. It asserts that the match should be preceded by the string “due on ” (including the trailing space). In other words, it checks if the string being matched is preceded by the phrase “due on ” without including it in the final match.
  • ((\d*(/)*)*) This part of the regular expression captures zero or more occurrences of digits and slashes. The outermost set of parentheses captures the entire pattern. Within that, the pattern (\d*(/)*) captures zero or more digits followed by zero or more slashes, and the asterisk * after it repeats that group zero or more times.
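
The sketch below runs the Due Date pattern in Python (a stand-in for the .NET engine). Because the pattern accepts any run of digits and slashes, a stricter alternative is shown next to it; that alternative, and the assumption that the date is written as day/month/year, are not from the original workflow.

```python
import re
from datetime import datetime

line = "Payment 14000/- Rupees which is due on 31/01/2023. Don't hesitate"

# Pattern as used in the article: any run of digits and slashes.
due_date = re.search(r"(?<=due on )((\d*(/)*)*)", line).group(0)
print(due_date)  # 31/01/2023

# Stricter variant (an assumption): exactly two digits, two digits, four digits.
strict = re.search(r"(?<=due on )\d{2}/\d{2}/\d{4}", line).group(0)

# Parsing confirms the text is a real date; day/month/year order is assumed.
parsed = datetime.strptime(strict, "%d/%m/%Y").date()
print(parsed.isoformat())  # 2023-01-31
```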

Here is the output. Each line shows a label for the extracted field on the left and, on the right, the value returned by the corresponding RegEx pattern explained above, applied to the email body text one after the other:
<Label for what is extracted>: <Result of RegEx pattern>

Conclusion

Using the Matches activity or VB.NET code with RegEx patterns reduces the effort of string manipulation on unstructured data. It enables pattern matching, validation, and extraction of specific information from text data.
