Regex - quick guide
05 April 2024
What is Regex
Regex stands for Regular Expression. It is a string of text that defines a search pattern for matching text. For example, the following regular expression finds text that starts with "Australian" and contains the number "2023" somewhere after it: ^Australian.*2023
The beginning of that example has an anchor, ^, which pins "Australian" to the start of the text.
Just after "Australian" is a dot, ., which is a character class that matches any single character.
The dot is followed by a star, *, which is a quantifier meaning 0 or more of the preceding item, so together .* matches any run of characters.
Finally there is "2023", which matches the exact character sequence 2023. Because of the .* before it, the 2023 can be anywhere in the text after "Australian".
Note: when we say character we mean any letter, number or symbol. Anything you can type is a character, including, for example, a space. A regex will normally only match within a single line, that is, up to a new line character (written \n), because the dot does not match new lines by default.
The regular expression ^Australian.*2023 will match the following text (often referred to as strings, or strings of characters):
- Australian Open 2023
- Australian Parliament Committee hearings 2023-2024, Canberra.
- Australian zimmerflex-#202344777662AQ-z
It won’t match:
- The Australian Open 2023.
- 2023 Australian of the year.
- Australian of the year 2022.
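The matches listed above can be checked with a short script. This is only a sketch: the guide does not name a language or tool, so the use of Python's built-in re module and the variable names here are assumptions made for illustration.

```python
import re

# The pattern from the introduction: "Australian" at the start of the text,
# then any run of characters, then "2023".
pattern = re.compile(r"^Australian.*2023")

samples = [
    "Australian Open 2023",
    "Australian Parliament Committee hearings 2023-2024, Canberra.",
    "Australian zimmerflex-#202344777662AQ-z",
    "The Australian Open 2023.",
    "2023 Australian of the year.",
    "Australian of the year 2022.",
]

# search() scans the text, but the ^ anchor still forces the match
# to begin at the very first character.
results = {text: bool(pattern.search(text)) for text in samples}

for text, matched in results.items():
    print("match   " if matched else "no match", "|", text)
```

The first three samples are reported as matches and the last three as non-matches.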
Regex Components
Anchors
Anchors pin a match to the beginning or the end of a string or expression.
- ^ specifies the beginning of the string or expression.
- $ specifies the end of the string or expression.
e.g. ^Fred Flintstone$ would match a string that is only "Fred Flintstone" with nothing before or after.
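To see the effect of each anchor separately, here is a small sketch in Python (an assumed choice of language; the word list is made up for the example). It applies a start anchor, an end anchor, and both together to the same three words.

```python
import re

# Hypothetical sample words, chosen so that each anchor selects a different subset.
words = ["fred", "freddy", "alfred"]

# ^ requires the match to begin at the start of the string.
starts_with = [w for w in words if re.search(r"^fred", w)]

# $ requires the match to finish at the end of the string.
ends_with = [w for w in words if re.search(r"fred$", w)]

# Both anchors together: the string must be exactly "fred".
exactly = [w for w in words if re.search(r"^fred$", w)]

print(starts_with)
print(ends_with)
print(exactly)
```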
Quantifiers
Quantifiers specify how many of the previous character, character class or group of characters you want:
- * finds 0 or more, e.g. fa*b finds "fb", "fab", "faaaaaaaaaab"
- + finds 1 or more, e.g. fa+b finds "fab", "faaaaaaaaaab"
- ? finds 0 or 1, e.g. fa?b finds "fb", "fab"
- {n} finds exactly n repetitions, e.g. a{3} finds "aaa"
- {x,n} finds between x and n repetitions (inclusive), e.g. fa{1,3}b finds "fab", "faab", "faaab"
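The quantifiers above can be compared side by side by running each one against the same list of strings. The sketch below assumes Python's re module; the helper name full_matches and the candidate list are made up for the example. It uses fullmatch so that a string only counts when the whole of it fits the pattern.

```python
import re

# Hypothetical test strings with zero to four "a" characters between "f" and "b".
candidates = ["fb", "fab", "faab", "faaab", "faaaab"]


def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [text for text in candidates if re.fullmatch(pattern, text)]


star = full_matches(r"fa*b")        # 0 or more "a"
plus = full_matches(r"fa+b")        # 1 or more "a"
optional = full_matches(r"fa?b")    # 0 or 1 "a"
exactly = full_matches(r"fa{3}b")   # exactly 3 "a"
ranged = full_matches(r"fa{1,3}b")  # between 1 and 3 "a"

for label, found in [("*", star), ("+", plus), ("?", optional),
                     ("{3}", exactly), ("{1,3}", ranged)]:
    print(label, found)
```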
Grouping
You can group a sequence of characters together with parentheses. Regex also has the notion of a capture, where the text matched by a group can be reused in a result, for example when you find and replace a string. We won’t cover captures here.
Groups are used for matching a sequence a number of times, or for creating a set of sequences that could be matched. A simple pair of parentheses around a sequence of characters, e.g. (cat), creates a capture group that can then have a quantifier added. Examples:
- Schrödinger’s (cat )?was here matches "Schrödinger’s cat was here" and "Schrödinger’s was here" (the space sits inside the group so that it is optional along with "cat")
- the (.at)+ sat on the mat matches "the cat sat on the mat", "the ratcat sat on the mat", "the #atgatpatcatmatfat6atbat sat on the mat"
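The difference a group makes to a quantifier can be seen in a short sketch. It assumes Python's re module, and the "ha" patterns are a made-up illustration rather than part of the guide: with parentheses the + repeats the whole sequence, without them it only repeats the single character before it.

```python
import re

# With a group, + repeats the whole sequence "ha".
grouped = re.compile(r"(ha)+")
# Without a group, + only repeats the "a".
ungrouped = re.compile(r"ha+")

laugh_grouped = bool(grouped.fullmatch("hahaha"))
laugh_ungrouped = bool(ungrouped.fullmatch("hahaha"))
stretched_ungrouped = bool(ungrouped.fullmatch("haaa"))

# The pattern from the list above: one or more three-character chunks ending in "at".
mat = re.compile(r"the (.at)+ sat on the mat")
sentences = [
    "the cat sat on the mat",
    "the ratcat sat on the mat",
    "the sat on the mat",
]
mat_results = [bool(mat.fullmatch(s)) for s in sentences]

print(laugh_grouped, laugh_ungrouped, stretched_ungrouped)
print(mat_results)
```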
Groups can be split into OR blocks using the pipe special character |, for example:
- the (cat|dog) was here matches "the cat was here" and "the dog was here"
- the cats? (was|were) here matches "the cat was here", "the cats were here", "the cats was here"
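The two OR patterns above can be tried out as follows. This is a sketch assuming Python's re module; the non-matching sentences ("cow", "are") are added here purely to show what the pipe rules out.

```python
import re

# Either "cat" or "dog" is accepted in the group.
pets = re.compile(r"the (cat|dog) was here")
pet_results = {
    s: bool(pets.fullmatch(s))
    for s in ["the cat was here", "the dog was here", "the cow was here"]
}

# An optional "s" combined with a choice between "was" and "were".
verbs = re.compile(r"the cats? (was|were) here")
verb_results = {
    s: bool(verbs.fullmatch(s))
    for s in [
        "the cat was here",
        "the cats were here",
        "the cats was here",
        "the cats are here",
    ]
}

print(pet_results)
print(verb_results)
```

Note that the pattern has no notion of grammar: "the cats was here" matches just as readily as "the cats were here".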
Sets of Characters
You can define a range or set of characters that could be matched by putting them in square brackets. Unlike a group that defines a specific sequence of characters this says any of these characters, so [CRM]at matches "Cat", "Rat" and "Mat".
The set of characters can be represented as a range, for example:
- [1-3]0 any digit from 1 to 3 followed by 0, matches "10", "20", "30"
- [a-z] matches all lower case letters from a to z
- [a-zA-Z] matches all lower and upper case letters from a to z
You can put a ^ at the beginning of the set to negate it, so that it matches any character that is not in the set. Outside the brackets, ^ is still the start anchor. For example:
- ^[^:]* matches everything from the beginning of the string up to, but not including, the first colon, which is useful for search and replace. If you had a string "2023-05-23 12:52:45" it would match "2023-05-23 12"
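The set, range and negated-set examples in this section can be checked with the sketch below. It assumes Python's re module; the extra samples "Bat" and "40" and the replacement text "HH" are made up to show a non-match and a search and replace.

```python
import re

# A set: any one of C, R or M, followed by "at".
animals = ["Cat", "Rat", "Mat", "Bat"]
in_set = [w for w in animals if re.fullmatch(r"[CRM]at", w)]

# A range: one digit from 1 to 3, followed by 0.
tens = ["10", "20", "30", "40"]
in_range = [n for n in tens if re.fullmatch(r"[1-3]0", n)]

# A negated set: everything from the start up to, but not including, the first colon.
timestamp = "2023-05-23 12:52:45"
before_colon = re.search(r"^[^:]*", timestamp).group(0)

# The same pattern used for search and replace.
replaced = re.sub(r"^[^:]*", "HH", timestamp)

print(in_set)
print(in_range)
print(before_colon)
print(replaced)
```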
Shorthand Character Classes
There are shortcuts for commonly used sets of characters, called shorthand character classes. We met the . class in the introduction, which matches every character.
- . every character except for new lines
- \n the new line
- \t a tab
- \d a digit (0-9), same as [0-9]
- \D NOT a digit, same as [^0-9]
- \w a word character (Latin), same as [a-zA-Z0-9_]
- \W NOT a word character, same as [^a-zA-Z0-9_]
- \s whitespace of any kind (space, tab, new line)
- \S NOT whitespace (anything other than space, tab, new line)
There are many more; see https://www.regular-expressions.info/shorthand.html
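A few of the shorthand classes listed above can be tried out as follows. This is a sketch assuming Python's re module, and the sample strings are made up for the example.

```python
import re

# \d+ picks out each run of digits.
digits = re.findall(r"\d+", "Order 66 shipped on 2023-05-23")

# \D+ picks out each run of non-digits.
non_digits = re.findall(r"\D+", "a1b22c")

# \w+ picks out each run of word characters (letters, digits, underscore).
words = re.findall(r"\w+", "cat_1 sat, mat!")

# \s+ matches any run of spaces, tabs or new lines, so it works as a separator.
fields = re.split(r"\s+", "a  b\tc\nd")

# The dot stops at a new line, so .* only covers the first line.
first_line = re.search(r".*", "line one\nline two").group(0)

print(digits)
print(non_digits)
print(words)
print(fields)
print(first_line)
```

One caveat specific to Python 3: on ordinary text strings, \d and \w also match digits and letters from non-Latin scripts unless the re.ASCII flag is passed, so they are only equivalent to [0-9] and [a-zA-Z0-9_] when that flag is used.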
A complete guide to Regular Expressions can be found at https://www.regular-expressions.info/