Extracting text from a string with regex groups in Dart

Suragch
Jan 26 · 5 min read
[Image: Text extraction from an LRC lyrics file]

Text manipulation is a common programming problem. However, I generally try to avoid regular expressions (regex) because they are completely unreadable:

RegExp _email = RegExp(r"^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))$");

Every now and then, though, it’s easier to use them than to come up with a custom parser. One such case is extracting groups of text from a longer string. The rules for regular expressions are generally the same in any language, but I’ll show the specifics of how to do this in Dart.

Defining the problem

Say you are trying to extract the time and lyrics from an LRC text file:

[00:12.30]Twinkle, twinkle, little star
[00:17.60]How I wonder what you are
[00:21.50]Up above the world so high
[00:25.30]Like a diamond in the sky

You can get each line pretty easily like this:

List<String> rawLines = text.split('\n');
print(rawLines[0]); // [00:12.30]Twinkle, twinkle, little star
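One caveat with split('\n'): if the file was saved with Windows line endings, every line keeps a trailing '\r'. A minimal sketch of an alternative, assuming you are fine importing dart:convert, is to use LineSplitter, which handles "\n", "\r\n", and "\r". The helper name splitLines is made up for this example.

```dart
import 'dart:convert';

/// Splits [text] into lines without keeping the line terminators.
/// (`splitLines` is a hypothetical helper name, not part of any library.)
List<String> splitLines(String text) => const LineSplitter().convert(text);

void main() {
  const text = '[00:12.30]Twinkle, twinkle, little star\r\n'
      '[00:17.60]How I wonder what you are';
  final rawLines = splitLines(text);
  print(rawLines.length); // 2
  print(rawLines[0]); // [00:12.30]Twinkle, twinkle, little star
}
```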

But what you’d really like to do is get the time and words for each line, something like this:

final time = Duration(minutes: 0, seconds: 12, milliseconds: 300);
final words = 'Twinkle, twinkle, little star';

That means there are four separate groups to extract: the minutes, the seconds, the fractional seconds, and the words.

There is a regex to do that, but just giving it to you directly would be another one of those unreadable regexes, so let’s solve the problem one step at a time.

The basic regex

To create a regex matcher, you use the RegExp class in Dart.

final regex = RegExp(r'');

You put the regular expression inside the quote marks. It’s useful to use a raw string (that is, one starting with r) so that you don't have to escape so many things later.
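To see what the raw string saves you, here is a small sketch comparing the two spellings of the same pattern (\d means "a digit"). In a normal Dart string both \ and $ have their own meaning, so each needs an extra backslash before the regex engine ever sees it.

```dart
void main() {
  // Normal string: `\` starts an escape sequence and `$` starts string
  // interpolation, so both have to be escaped.
  final escaped = RegExp('^\\d+\$');
  // Raw string: the pattern is passed through exactly as written.
  final raw = RegExp(r'^\d+$');

  print(escaped.pattern); // ^\d+$
  print(escaped.pattern == raw.pattern); // true
}
```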

Match the start and end of the line

This isn’t strictly necessary, but if you’re working with a whole line anyway, matching the start and end of the line might prevent some surprises that you could get by matching something else.

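  • ^ matches the start of the string (or the start of each line if you create the RegExp with multiLine: true)
  • $ matches the end of the string (or the end of each line with multiLine: true)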

That means so far your regex should look like this:

final regex = RegExp(r'^$');
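As a quick illustration of the kind of surprise the anchors prevent, the sketch below compares an anchored and an unanchored pattern on some literal text. It also shows that, by default, Dart applies ^ and $ to the whole string rather than to each line inside it, which is fine here because every string holds a single line.

```dart
void main() {
  final unanchored = RegExp(r'star');
  final anchored = RegExp(r'^star$');

  // Without anchors, a match anywhere in the string counts.
  print(unanchored.hasMatch('little star')); // true
  // With anchors, the whole string has to fit the pattern.
  print(anchored.hasMatch('little star')); // false
  print(anchored.hasMatch('star')); // true

  // By default ^ and $ refer to the whole string, not to lines within it.
  print(anchored.hasMatch('little\nstar')); // false
  // With multiLine: true they also match at line breaks.
  print(RegExp(r'^star$', multiLine: true).hasMatch('little\nstar')); // true
}
```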

Match the constant parts

The characters that won’t vary in the string are [, :, ., and ].

But [, ., and ] all have special meanings in regex so you have to escape them by prefixing them with \:

  • \[
  • \.
  • \]

That makes the regex look like this:

final regex = RegExp(r'^\[:\.\]$');
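If you are ever unsure which characters need a backslash, Dart can do the escaping for you with the static RegExp.escape method. This isn’t needed for the regex in this article; the sketch below just shows what goes wrong without escaping and how RegExp.escape avoids it.

```dart
void main() {
  const literal = '[00:12.30]';

  // Unescaped, the brackets form a character class (one character out of
  // 0, :, 1, 2, ., 3), so the pattern no longer means "this exact text".
  print(RegExp('^$literal\$').hasMatch(literal)); // false

  // RegExp.escape adds the backslashes needed to match the text literally.
  final escaped = RegExp.escape(literal);
  print(RegExp('^$escaped\$').hasMatch(literal)); // true
}
```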

That doesn’t actually match our line at all right now because we still need to add the variable text.

Match the variable parts

Again, the four groups that you want to capture are minutes, seconds, fractional seconds, and words.

You can use the following patterns to match:

  • [0-9]+: Match one or more digits. The brackets match any one character in the given range, and + is a quantifier meaning one or more of whatever precedes it. (Alternatively, you could use \d instead of [0-9], but I find [0-9] easier to remember.)
  • .*: Match zero or more characters. The . matches any single character except a line break, and * is a quantifier meaning zero or more of whatever precedes it. We’ll use this for the song lyrics so that a line can have a time stamp but no words.

That makes the regex look like this:

final regex = RegExp(r'^\[[0-9]+:[0-9]+\.[0-9]+\].*$');

The match is actually complete, but you don’t have a way to extract the variable parts. You use groups for that.

Capture the groups

You can capture groups by surrounding them with parentheses:

final regex = RegExp(r'^\[([0-9]+):([0-9]+)\.([0-9]+)\](.*)$');
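As an aside, Dart also supports named groups, which some people find easier to read than numbered ones. This is an alternative to the numbered groups used in the rest of the article, not a replacement for them. A minimal sketch of the same pattern with names, where the group names are my own choice:

```dart
void main() {
  // Same pattern as above, but each group gets a name: (?<name>...)
  final regex = RegExp(
    r'^\[(?<minutes>[0-9]+):(?<seconds>[0-9]+)\.(?<fraction>[0-9]+)\](?<words>.*)$',
  );

  final match = regex.firstMatch('[00:17.60]How I wonder what you are');
  if (match != null) {
    print(match.namedGroup('minutes')); // 00
    print(match.namedGroup('seconds')); // 17
    print(match.namedGroup('fraction')); // 60
    print(match.namedGroup('words')); // How I wonder what you are
    // Named groups are still numbered, so group(4) is the same text.
    print(match.group(4)); // How I wonder what you are
  }
}
```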

Now you’re ready to extract the parts inside the parentheses.

Pulling it all together

Here is how you extract the text you want:

final line = '[00:12.30]Twinkle, twinkle, little star';
final regex = RegExp(r'^\[([0-9]+):([0-9]+)\.([0-9]+)\](.*)$');
// firstMatch returns null if the line doesn't match; ! asserts that it did.
final match = regex.firstMatch(line)!;
final everything = match.group(0)!; // [00:12.30]Twinkle, twinkle, little star
final minutes = match.group(1)!; // 00
final seconds = match.group(2)!; // 12
final fraction = match.group(3)!; // 30
final words = match.group(4)!; // Twinkle, twinkle, little star

Notes:

  • The way you actually perform the matching is to call firstMatch on the regex. You can use firstMatch because each string holds exactly one line. firstMatch returns null when the string doesn’t match, so check for that (or assert with !) before reading the groups. If you hadn’t split the text into lines first, you could create the RegExp with multiLine: true (so that ^ and $ match at each line break) and call regex.allMatches, which gives you an iterable collection of matches to loop over.
  • As you can see, group(0) is the entire match, while group(1) to group(4) are the parts you surrounded with parentheses.
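Here is a rough sketch of the allMatches route over the whole text, without splitting it into lines first. One detail matters: because the pattern uses ^ and $, the RegExp needs multiLine: true, otherwise the anchors only match at the very start and end of the entire string.

```dart
void main() {
  const text = '[00:12.30]Twinkle, twinkle, little star\n'
      '[00:17.60]How I wonder what you are\n'
      '[00:21.50]Up above the world so high\n'
      '[00:25.30]Like a diamond in the sky';

  // multiLine: true lets ^ and $ match at every line break.
  final regex = RegExp(
    r'^\[([0-9]+):([0-9]+)\.([0-9]+)\](.*)$',
    multiLine: true,
  );

  for (final match in regex.allMatches(text)) {
    final time = '${match.group(1)}:${match.group(2)}.${match.group(3)}';
    print('$time -> ${match.group(4)}');
  }
  // 00:12.30 -> Twinkle, twinkle, little star
  // 00:17.60 -> How I wonder what you are
  // 00:21.50 -> Up above the world so high
  // 00:25.30 -> Like a diamond in the sky
}
```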

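The extracted time groups are still strings, so if you want a Duration you’ll need to convert them. The fraction is padded on the right because 30 after the decimal point means 0.30 seconds, that is, 300 milliseconds:

final time = Duration(
  minutes: int.parse(minutes),
  seconds: int.parse(seconds),
  // padRight, not padLeft: '30' must become '300', not '030'
  milliseconds: int.parse(fraction.padRight(3, '0')),
);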
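Putting the matching and the conversion into one function might look like the sketch below. The names LyricLine and parseLrcLine are hypothetical, and returning null for lines that don’t match is just one possible design choice; you might prefer to throw a FormatException instead. Dropping fraction digits beyond milliseconds is also an assumption of this sketch.

```dart
/// One parsed LRC line. (`LyricLine` is a made-up name for this sketch.)
class LyricLine {
  const LyricLine(this.time, this.words);
  final Duration time;
  final String words;
}

final _lrcLine = RegExp(r'^\[([0-9]+):([0-9]+)\.([0-9]+)\](.*)$');

/// Returns the parsed line, or null when [line] is not in the expected format.
LyricLine? parseLrcLine(String line) {
  final match = _lrcLine.firstMatch(line);
  if (match == null) return null;

  // The fraction holds the digits after the decimal point, so pad on the
  // right: "3" -> 300 ms, "30" -> 300 ms, "305" -> 305 ms.
  // Any digits beyond milliseconds are dropped.
  final fraction = match.group(3)!.padRight(3, '0').substring(0, 3);

  return LyricLine(
    Duration(
      minutes: int.parse(match.group(1)!),
      seconds: int.parse(match.group(2)!),
      milliseconds: int.parse(fraction),
    ),
    match.group(4)!,
  );
}

void main() {
  final parsed = parseLrcLine('[00:12.30]Twinkle, twinkle, little star');
  print(parsed?.time); // 0:00:12.300000
  print(parsed?.words); // Twinkle, twinkle, little star
  print(parseLrcLine('not a lyric line')); // null
}
```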

That’s it for regex groups in Dart. If you can get past the poor readability of the regex matching patterns, they can be a convenient way to extract what you need from text strings.

I posted the original version of this article on Stack Overflow, but they aren’t exactly happy to see more regex questions over there. I’ve expanded my answer here for Medium.

Flutter Community

Articles and Stories from the Flutter Community

Suragch

Written by

Suragch

A Flutter and Dart developer. Follow me on Twitter @suragch1 to get updates of new articles.
