nerdErg

Regex - quick guide

05 April 2024

What is Regex

Regex stands for Regular Expression. A regex is a string of text that defines a search pattern for matching other text. For example, the regular expression ^Australian.*2023 finds text that starts with "Australian" and later contains the number "2023".

The example begins with an anchor, ^, which pins "Australian" to the start of the text.

Just after "Australian" is a dot . which is a character class that matches any character.

The dot is followed by a star *, a quantifier that means 0 or more of the previous item. Together, .* matches any run of characters, including none at all.

Finally there is the "2023" which matches the exact character sequence 2023. The 2023 can be anywhere in the text after "Australian".

When we say character we mean any letter, number or symbol. Anything you can type is a character, including, for example, a space. A regex normally matches within a single line, because by default the dot does not match the new line character (written \n).

The regular expression ^Australian.*2023 will match the following text (often referred to as strings, or strings of characters):

  • Australian Open 2023

  • Australian Parliament Committee hearings 2023-2024, Canberra.

  • Australian zimmerflex-#202344777662AQ-z

It won’t match:

  • The Australian Open 2023.

  • 2023 Australian of the year.

  • Australian of the year 2022.
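The two lists above can be checked with a few lines of code. This is a minimal sketch using Python's standard re module; the variable names (pattern, samples, results) are just illustrative choices, not part of any particular tool.

```python
import re

# The pattern from the introduction: "Australian" at the start of the text,
# then any run of characters, then the digits 2023 somewhere later.
pattern = re.compile(r"^Australian.*2023")

samples = [
    "Australian Open 2023",
    "Australian Parliament Committee hearings 2023-2024, Canberra.",
    "Australian zimmerflex-#202344777662AQ-z",
    "The Australian Open 2023.",
    "2023 Australian of the year.",
    "Australian of the year 2022.",
]

# search() returns a match object when the pattern is found, otherwise None.
results = {text: pattern.search(text) is not None for text in samples}

for text, matched in results.items():
    print("match   " if matched else "no match", "|", text)
```

The first three strings are reported as matches and the last three are not, which agrees with the lists above.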

Regex Components

Anchors

Anchors are used at the beginning and the end of a string or expression.

  • ^ Use to specify the beginning of the string or expression.

  • $ Use to specify the end of the string or expression.

e.g. ^Fred Flintstone$ would match a string that is only "Fred Flintstone" with nothing before or after.
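To see what the anchors add, the sketch below compares the anchored pattern with the same text unanchored, using Python's re module. The longer test string is made up for illustration.

```python
import re

anchored = re.compile(r"^Fred Flintstone$")
unanchored = re.compile(r"Fred Flintstone")

exact = "Fred Flintstone"
longer = "Mr Fred Flintstone of Bedrock"   # illustrative string, not from the guide

# With both anchors, the name must be the whole string.
exact_anchored = anchored.search(exact) is not None
longer_anchored = anchored.search(longer) is not None

# Without anchors, the name can appear anywhere in the string.
longer_unanchored = unanchored.search(longer) is not None

print(exact_anchored, longer_anchored, longer_unanchored)
```

Only the anchored pattern rejects the longer string; the unanchored one finds the name in the middle of it.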

Quantifiers

Quantifiers specify how many of the previous character, character class or group of characters you want:

  • * Finds 0 or more. e.g. fa*b finds "fb", "fab", "faaaaaaaaaab"

  • + Finds 1 or more. e.g. fa+b finds "fab", "faaaaaaaaaab"

  • ? Finds 0 or 1. e.g. fa?b finds "fb", "fab"

  • {n} Finds exactly n repetitions. e.g. a{3} finds "aaa"

  • {x,n} Finds at least x and at most n repetitions. e.g. fa{1,3}b finds "fab", "faab", "faaab"
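The quantifiers above can be tried side by side. This is a small sketch in Python's re module; the candidate strings include a few deliberate non-matches, and the names examples and matched are illustrative.

```python
import re

# Each pattern is paired with candidate strings, some of which should fail.
examples = {
    r"fa*b": ["fb", "fab", "faaaaaaaaaab"],
    r"fa+b": ["fb", "fab", "faaaaaaaaaab"],
    r"fa?b": ["fb", "fab", "faab"],
    r"a{3}": ["aa", "aaa"],
    r"fa{1,3}b": ["fb", "fab", "faab", "faaab", "faaaab"],
}

# fullmatch() succeeds only when the whole string fits the pattern.
matched = {
    pattern: [text for text in texts if re.fullmatch(pattern, text)]
    for pattern, texts in examples.items()
}

for pattern, texts in matched.items():
    print(pattern, "->", texts)
```

Using fullmatch rather than search keeps the comparison strict: "faaaab" contains no substring that fits fa{1,3}b, but "aaaa" would still contain "aaa" if a{3} were only searched for.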

Grouping

You can group a sequence of characters together with parentheses. Regex also has the notion of a capture: the text matched by a capture group can be reused in a result, for example when you want to find and replace a string. We won’t cover captures here.

Groups are used for matching a sequence a number of times, or for creating a set of sequences that could be matched. A simple pair of parentheses around a sequence of characters, e.g. (cat), creates a capture group that can then have a quantifier added. Examples:

  • schrodinger’s (cat )?was here matches "schrodinger’s cat was here" and "schrodinger’s was here" (the space after "cat" sits inside the group so that it is optional too)

  • the (.at)+ sat on the mat matches "the cat sat on the mat", "the ratcat sat on the mat", "the #atgatpatcatmatfat6atbat sat on the mat"
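Where the spaces go around an optional group is easy to get wrong, so here is a hedged sketch in Python's re module. It uses shortened, made-up phrases rather than the exact examples above, and all variable names are illustrative.

```python
import re

# The space sits inside the optional group, so leaving out "cat "
# still leaves exactly one space between the remaining words.
optional = re.compile(r"the (cat )?was here")
has_cat = optional.fullmatch("the cat was here") is not None
no_cat = optional.fullmatch("the was here") is not None

# With a space on both sides of the group, both spaces are required
# even when the group itself is absent.
strict = re.compile(r"the (cat)? was here")
single_space = strict.fullmatch("the was here") is not None
double_space = strict.fullmatch("the  was here") is not None

# A repeated group: one or more three-character chunks ending in "at".
repeated = re.compile(r"the (.at)+ sat on the mat")
chunks = repeated.fullmatch("the ratcat sat on the mat")
last_chunk = chunks.group(1)  # a repeated group keeps only its last repetition

print(has_cat, no_cat, single_space, double_space, last_chunk)
```

The strict pattern accepts the phrase without "cat" only when it is written with two spaces, which is rarely what is wanted.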

Groups can be split into OR blocks using the pipe special character |, for example:

  • the (cat|dog) was here matches "the cat was here" and "the dog was here"

  • the cats? (was|were) here matches "the cat was here", "the cats were here", "the cats was here"
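The OR blocks can be exercised the same way. A minimal sketch with Python's re module follows; the extra test sentences are illustrative additions.

```python
import re

pets = re.compile(r"the (cat|dog) was here")
verbs = re.compile(r"the cats? (was|were) here")

# group(1) reports which alternative inside the parentheses was used.
which_pet = pets.fullmatch("the dog was here").group(1)
bird = pets.fullmatch("the bird was here")  # None: "bird" is not an alternative

sentences = [
    "the cat was here",
    "the cats were here",
    "the cats was here",
    "the cat were here",
    "the cats are here",
]
verb_matches = [s for s in sentences if verbs.fullmatch(s)]

print(which_pet, bird, verb_matches)
```

The second pattern accepts every combination of cat/cats and was/were, grammatical or not, because the two choices are independent of each other.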

Sets of Characters

You can define a range or set of characters that could be matched by putting them in square brackets. Unlike a group, which defines a specific sequence of characters, a set matches any one of the listed characters, so [CRM]at matches "Cat", "Rat" and "Mat".

The set of characters can be represented as a range, for example:

  • [1-3]0 any digit from 1 to 3 followed by 0, matches "10", "20", "30"

  • [a-z] matches any one lower case letter from a to z

  • [a-zA-Z] matches any one lower or upper case letter from a to z

You can put a ^ at the beginning of the set to negate it, so that it matches any character NOT in the set, for example:

  • ^[^:]* matches everything from the beginning of the string up to, but not including, the first : which is useful for search and replace. If you had a string "2023-05-23 12:52:45" it would match "2023-05-23 12"
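The set examples above can be sketched in Python's re module as follows. The extra candidates ("Bat", "40", "00") are illustrative non-matches, and the variable names are arbitrary.

```python
import re

# A set matches exactly one character out of those listed.
set_hits = [w for w in ["Cat", "Rat", "Mat", "Bat"] if re.fullmatch(r"[CRM]at", w)]

# A range inside a set: one digit from 1 to 3, followed by a literal 0.
range_hits = [n for n in ["10", "20", "30", "40", "00"] if re.fullmatch(r"[1-3]0", n)]

# A negated set: take characters from the start until the first colon.
timestamp = "2023-05-23 12:52:45"
before_colon = re.match(r"^[^:]*", timestamp).group()

print(set_hits, range_hits, before_colon)
```

Note that ^ means two different things in the last pattern: outside the brackets it is the start anchor, and as the first character inside the brackets it negates the set.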

Shorthand Character Classes

There are shortcuts for common sets of characters, called shorthand character classes. We met the . class in the introduction, which matches any character.

  • . every character except for new lines

  • \n the new line

  • \t a tab

  • \d a digit (0-9) same as [0-9]

  • \D NOT a digit (0-9) same as [^0-9]

  • \w a word character (latin) same as [a-zA-Z0-9_]

  • \W NOT a word character same as [^a-zA-Z0-9_]

  • \s whitespace of any kind (space, tab, new line)

  • \S NOT whitespace (anything other than a space, tab or new line)
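A short sketch of the shorthand classes in Python's re module follows. The sample text is made up for illustration, and other regex engines may differ in small details.

```python
import re

# Made-up sample containing a tab (\t) and a new line (\n).
text = "Order 66:\tship 12 crates\nby Friday"

digits = re.findall(r"\d+", text)           # runs of digits
words = re.findall(r"\w+", text)            # runs of word characters
fields = re.split(r"\s+", text)             # split on any whitespace
first_line = re.match(r".*", text).group()  # the dot stops at the new line

print(digits)
print(words)
print(fields)
print(repr(first_line))
```

In Python 3, \w and \d on ordinary strings also match non-Latin letters and digits; passing the re.ASCII flag restricts them to the [a-zA-Z0-9_] and [0-9] sets listed above.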

A complete guide to Regular Expressions can be found at https://www.regular-expressions.info/