One of the core principles of Unix systems is the extensive use of text data: configuration files, as well as input and output data in *nix systems, are often organized as plain text. Regular expressions are a powerful tool for manipulating text data. This guide delves into the intricacies of using regular expressions in Bash, helping you fully harness the power of the command line and scripts in Linux.
Regular expressions are specially formatted strings used to search for character patterns in text. They resemble shell wildcards in some ways, but their capabilities are much broader. Many text-processing utilities in Linux and programming languages include a regular expression engine. However, different programs and languages often employ different regular expression dialects. This article focuses on the POSIX standard to which most Linux utilities adhere.
The grep program is the primary tool for working with regular expressions. It reads the named files, or standard input when no file is given, searches for matches to a specified pattern, and outputs all matching lines. grep is pre-installed on most distributions.
You can try the commands in a virtual machine or a VPS to practice using regular expressions.
The syntax of grep is as follows:
grep [options] regular_expression [file...]
The simplest use case for grep is finding lines that contain a fixed substring. In the example below, grep outputs all lines that contain the sequence nologin:
grep nologin /etc/passwd
Output:
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
games:x:5:60:games:/usr/games:/usr/sbin/nologin
...
grep has many options, which are detailed in the documentation. Here are some useful options for working with regular expressions:
-v
— Inverts the match criteria. With this option, grep
outputs lines that do not contain matches:
ls /bin | grep -v zip
# Output:
411toppm 7z 7za 7zr ...
-i
— Ignores case.
-o
— Outputs only the matches, not the entire lines:
ls /bin | grep -o zip
# Output:
zip zip zip zip ...
-w
— Searches for lines containing whole words matching the pattern.
ls /bin | grep -w zip
# Output:
gpg-zip
zip
For comparison, the same command without the -w option also includes lines where the pattern appears as a substring within a word.
ls /bin | grep zip
# Output:
bunzip2
bzip2
bzip2recover
funzip
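Because the contents of /bin differ between systems, the options above can also be tried on a fixed list. The following is a minimal sketch; the names stored in the `names` variable are made-up sample data, not real command output.

```shell
# Hypothetical sample data standing in for the output of `ls /bin`.
names='bzip2
gpg-zip
zip
ZIPNOTE
tar'

# -v: print lines that do NOT contain "zip" (the match is case-sensitive).
inverted=$(printf '%s\n' "$names" | grep -v zip)

# -i: ignore case, so "ZIPNOTE" matches as well.
insensitive=$(printf '%s\n' "$names" | grep -i zip)

# -w: match "zip" only as a whole word; "-" is not a word character.
whole=$(printf '%s\n' "$names" | grep -w zip)

printf 'inverted:\n%s\n\ninsensitive:\n%s\n\nwhole words:\n%s\n' \
  "$inverted" "$insensitive" "$whole"
```

Feeding grep from printf keeps the example reproducible, since the result no longer depends on which programs happen to be installed.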
As previously mentioned, there are multiple dialects of regular expressions. The POSIX standard defines two main types of implementations: Basic Regular Expressions (BRE), which are supported by almost all POSIX-compliant programs, and Extended Regular Expressions (ERE), which allow for more complex patterns but aren't supported by all utilities. We'll start by exploring the features of BRE.
We've already encountered simple regular expressions. For example, the expression “zip” represents a string with the following criteria: it must contain at least three characters; it includes the characters “z”, “i”, and “p” in that exact order; and there are no other characters in between. Characters that match themselves (like “z”, “i”, and “p”) are called literals. Another category is metacharacters, which are used to define various search criteria. Metacharacters in BRE include:
^ $ . [ ] * \ -
To use a metacharacter as a literal, you need to escape it with a backslash (\). Note that some metacharacters also have special meanings in the shell, so enclose the regular expression in quotes when passing it as a command argument.
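As a small illustration of escaping, the sketch below compares an unescaped dot with an escaped one. The two file names are invented sample data.

```shell
# Hypothetical sample data: two names that differ in one character.
files='notes.txt
notes_txt'

# Unescaped dot: "." matches any character, so both lines match.
any_char=$(printf '%s\n' "$files" | grep 'notes.txt')

# Escaped dot: "\." matches only a literal period.
# Single quotes keep the shell from interpreting the backslash.
literal_dot=$(printf '%s\n' "$files" | grep 'notes\.txt')

printf 'unescaped:\n%s\n\nescaped:\n%s\n' "$any_char" "$literal_dot"
```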
The dot (.) metacharacter matches any character in that position. For example:
ls /bin | grep '.zip'
Output:
bunzip2
bzip2
bzip2recover
funzip
gpg-zip
gunzip
gzip
mzip
p7zip
pbzip2
preunzip
prezip
prezip-bin
streamzip
unzip
unzipsfx
One important detail: the zip program itself isn’t included in the output because the dot (.) metacharacter increases the required match length to four characters.
The caret (^) and dollar sign ($) in regular expressions serve as anchors. This means that, when included, a match can only occur at the start of a line (^) or at the end ($).
ls /bin | grep '^zip'
# Output:
zip
zipcloak
zipdetails
zipgrep
...
ls /bin | grep 'zip$'
# Output:
funzip
gpg-zip
gunzip
...
ls /bin | grep '^zip$'
# Output:
zip
The regular expression ^$ matches empty lines.
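One possible use of the ^$ pattern is counting or removing empty lines. The sketch below assumes a small invented text and uses the -c option, which is not covered above: it prints the number of matching lines instead of the lines themselves.

```shell
# Hypothetical sample text containing two empty lines.
text='first

second

third'

# -c prints how many lines match; here, how many lines are empty.
empty_count=$(printf '%s\n' "$text" | grep -c '^$')

# Inverting the match with -v keeps only the non-empty lines.
non_empty=$(printf '%s\n' "$text" | grep -v '^$')

printf 'empty lines: %s\n%s\n' "$empty_count" "$non_empty"
```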
Besides matching any character in a given position (.), regular expressions allow for matching a character from a specific set. This is done with square brackets. The following example searches for strings matching bzip or gzip:
ls /bin | grep '[bg]zip'
# Output:
bzip2
bzip2recover
gzip
All metacharacters lose their special meaning within square brackets, except two: the caret (^) and the hyphen (-).
If a caret (^) is placed immediately after the opening bracket, the characters in the set are treated as excluded from that position. For example:
ls /bin | grep '[^bg]zip'
Output:
bunzip2
funzip
gpg-zip
gunzip
mzip
p7zip
preunzip
prezip
prezip-bin
streamzip
unzip
unzipsfx
With negation, we get a list of filenames containing zip but preceded by any character other than b or g. Note that zip is not included here; the negation requires the presence of some character in that position. The caret serves as a negation only if it appears immediately after the opening bracket; otherwise, it loses its special meaning.
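The effect of the caret's position can be checked with a short sketch. The names below are invented sample data, and one of them deliberately begins with a literal ^ character.

```shell
# Hypothetical sample data; the third name starts with a literal caret.
names='bzip
gzip
^zip
xzip'

# Caret first: a negated set, any character except b or g before "zip".
negated=$(printf '%s\n' "$names" | grep '[^bg]zip')

# Caret not first: it is an ordinary member of the set {b, ^}.
literal_caret=$(printf '%s\n' "$names" | grep '[b^]zip')

printf 'negated:\n%s\n\nliteral caret:\n%s\n' "$negated" "$literal_caret"
```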
Using a hyphen (-), you can specify character ranges. This lets you match a range of characters or even multiple ranges. For instance, to find all filenames that start with a letter or a number:
ls ~ | grep '^[A-Za-z0-9]'
Output:
backup
bin
Books
Desktop
docker
Documents
Downloads
GNS3
...
When using character ranges, one challenge is that ranges can be interpreted differently based on locale settings. For instance, in a locale that collates in dictionary order (aAbB...zZ), the range [A-Z] may match lowercase letters as well, that is, every letter except lowercase a. To address this, the POSIX standard provides several classes that represent various character sets. Some of these classes include:
[:alnum:]
— Alphanumeric characters; equivalent to [A-Za-z0-9]
in ASCII.
[:alpha:]
— Alphabetic characters; equivalent to [A-Za-z]
in ASCII.
[:digit:]
— Digits from 0 to 9.
[:lower:]
and [:upper:]
— Lowercase and uppercase letters, respectively.
[:space:]
— Whitespace characters, including space, tab, carriage return, newline, vertical tab, and form feed.
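Character classes don’t provide an easy way to express partial ranges, like [A-M]. Here’s an example that uses a character class to list entries containing an uppercase letter (the trailing .* matches the rest of the line and does not change which lines are selected):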
ls ~ | grep '[[:upper:]].*'
Output:
Books
Desktop
Documents
Downloads
GNS3
GOG Games
Learning
Music
...
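The classes listed above can be tried on a fixed list, as in the sketch below. The entries are invented sample data. Note that a class name is written inside a bracket expression, which is why the brackets are doubled. The last command shows an alternative that is sometimes used: forcing the C locale with LC_ALL=C so that a range such as [A-Z] follows ASCII order.

```shell
# Hypothetical sample data standing in for the output of `ls ~`.
entries='Books
backup
2024-report
GOG Games'

# Lines that start with an uppercase letter.
starts_upper=$(printf '%s\n' "$entries" | grep '^[[:upper:]]')

# Lines that start with a digit.
starts_digit=$(printf '%s\n' "$entries" | grep '^[[:digit:]]')

# Lines that contain whitespace anywhere.
has_space=$(printf '%s\n' "$entries" | grep '[[:space:]]')

# In the C locale, the range [A-Z] covers only ASCII uppercase letters.
c_range=$(printf '%s\n' "$entries" | LC_ALL=C grep '^[A-Z]')

printf '%s\n---\n%s\n---\n%s\n---\n%s\n' \
  "$starts_upper" "$starts_digit" "$has_space" "$c_range"
```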
Most POSIX-compliant applications and those using BRE (such as grep and the stream editor sed) support the features discussed above. The POSIX ERE standard allows for more expressive regular expressions, though not all programs support it. The egrep program traditionally supported the ERE dialect; today grep itself supports ERE when run with the -E option, which POSIX specifies.
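The difference between the two dialects shows up as soon as a pattern contains a character that is special only in ERE. The sketch below uses the plus sign, with two invented input lines.

```shell
# In BRE (plain grep), "+" is an ordinary character.
bre_literal=$(printf '%s\n' 'a+b' 'aab' | grep 'a+b')

# In ERE (grep -E), "+" is a quantifier: one or more "a", then "b".
ere_quantifier=$(printf '%s\n' 'a+b' 'aab' | grep -E 'a+b')

# In ERE, a literal plus sign has to be escaped.
ere_escaped=$(printf '%s\n' 'a+b' 'aab' | grep -E 'a\+b')

printf 'BRE: %s\nERE: %s\nERE escaped: %s\n' \
  "$bre_literal" "$ere_quantifier" "$ere_escaped"
```

The same pattern text therefore selects different lines depending on whether -E is given, which is worth keeping in mind when copying expressions between tools.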
In ERE, the set of metacharacters is expanded to include:
( ) { } ? + |
Alternation allows for a match with one of multiple expressions. Similar to square brackets that allow a character to match one of several characters, alternation allows for matching one of multiple strings or regular expressions. Alternation is represented by the pipe (|):
echo "AAA" | grep -E 'AAA|BBB'
# Output:
AAA
echo "BBB" | grep -E 'AAA|BBB'
# Output:
BBB
echo "CCC" | grep -E 'AAA|BBB'
# Output: (no match)
You can group elements of regular expressions and treat them as a single unit using parentheses. The following expression matches filenames starting with bz, gz, or zip. Without the parentheses, the regular expression would change meaning to match filenames starting with bz or containing gz or zip.
ls /bin | grep -E '^(bz|gz|zip)'
Output:
bzcat
bzgrep
bzip2
bzip2recover
bzless
bzmore
gzexe
gzip
zip
zipdetails
zipgrep
zipinfo
zipsplit
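To see what the parentheses change, the grouped and ungrouped forms can be run against the same input. This is a sketch on invented sample names, chosen so that two of them contain gz or zip without starting with it.

```shell
# Hypothetical sample data.
names='bzcat
gzip
zipinfo
bunzip2
pigz'

# Grouped: the anchor applies to all three alternatives.
grouped=$(printf '%s\n' "$names" | grep -E '^(bz|gz|zip)')

# Ungrouped: only "bz" is anchored; "gz" and "zip" may appear anywhere.
ungrouped=$(printf '%s\n' "$names" | grep -E '^bz|gz|zip')

printf 'grouped:\n%s\n\nungrouped:\n%s\n' "$grouped" "$ungrouped"
```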
Quantifiers specify the number of times an element can occur. ERE supports several quantifiers:
?
— Matches the preceding element zero or one time, meaning the previous element is optional:
echo "tet" | grep -E 'tes?t'
# Output:
tet
echo "test" | grep -E 'tes?t'
# Output:
test
echo "tesst" | grep -E 'tes?t'
# Output: (no match)
*
— Matches the preceding element zero or more times. Unlike ?
, this element can appear any number of times:
echo "tet" | grep -E 'tes*t'
# Output:
tet
echo "test" | grep -E 'tes*t'
# Output:
test
echo "tesst" | grep -E 'tes*t'
# Output:
tesst
+
— Similar to *
, but requires at least one match of the preceding element:
echo "tet" | grep -E 'tes+t'
# Output: (no match)
echo "test" | grep -E 'tes+t'
# Output:
test
echo "tesst" | grep -E 'tes+t'
# Output:
tesst
In ERE, the metacharacters { and } allow you to specify minimum and maximum match counts for the preceding element in four possible ways:
{n}
— Matches if the preceding element occurs exactly n
times.
{n,m}
— Matches if the preceding element occurs at least n
and at most m
times.
{n,}
— Matches if the preceding element occurs n
or more times.
{,m}
— Matches if the preceding element occurs no more than m
times. This form is a GNU extension that POSIX does not define.
Example:
echo "tet" | grep -E "tes{1,3}t"
# Output: (no match)
echo "test" | grep -E "tes{1,3}t"
# Output:
test
echo "tesst" | grep -E "tes{1,3}t"
# Output:
tesst
echo "tessst" | grep -E "tes{1,3}t"
# Output:
tessst
echo "tesssst" | grep -E "tes{1,3}t"
# Output: (no match)
Only the lines where s appears one, two, or three times match the pattern.
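Exact repetition counts are handy for fixed-width fields. The sketch below assumes a date-like layout of nnnn-nn-nn, chosen only for illustration, and also shows the BRE spelling, in which the braces are escaped.

```shell
# Hypothetical sample data: only the first line has the nnnn-nn-nn layout.
lines='2024-01-15
24-1-15
2024-001-15'

# ERE: braces are metacharacters and are written unescaped.
ere_dates=$(printf '%s\n' "$lines" | grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$')

# BRE: the same intervals are written with backslash-escaped braces.
bre_dates=$(printf '%s\n' "$lines" | grep '^[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}$')

printf 'ERE: %s\nBRE: %s\n' "$ere_dates" "$bre_dates"
```

The pattern checks only the shape of the string, not whether the digits form a valid calendar date.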
To conclude, let’s look at a couple of practical examples of how regular expressions can be applied.
Suppose we have a list of phone numbers where the correct format is (nnn) nnn-nnnn. Out of a list of 10 numbers, three are incorrectly formatted.
cat phonenumbers.txt
Output:
(185) 136-1035
(95) 213-1874
(37) 207-2639
(285) 227-1602
(275) 298-1043
(107) 204-2197
(799) 240-1839
(218) 750-7390
(114) 776-2276
(7012) 219-3089
The task is to identify the incorrect numbers. We can use the following command:
grep -Ev '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$' phonenumbers.txt
Output:
(95) 213-1874
(37) 207-2639
(7012) 219-3089
Here, we used the -v option to invert the match and output only lines that do not match the specified format. Since parentheses are considered metacharacters in ERE, we escaped them with backslashes to treat them as literals.
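The same pattern can also be wrapped in a small shell function for use in scripts. This is only a sketch: the function name is_valid_phone is hypothetical, and it relies on the -q option, which is not covered above and makes grep report the result through its exit status without printing anything.

```shell
# Hypothetical helper: succeeds (exit status 0) when $1 has the form "(nnn) nnn-nnnn".
is_valid_phone() {
  # -q suppresses output; only the exit status of grep is used.
  printf '%s\n' "$1" | grep -Eq '^\([0-9]{3}\) [0-9]{3}-[0-9]{4}$'
}

# Try the helper on three numbers taken from the list above.
for number in '(185) 136-1035' '(95) 213-1874' '(7012) 219-3089'; do
  if is_valid_phone "$number"; then
    echo "$number: ok"
  else
    echo "$number: bad format"
  fi
done
```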
The find command supports checking paths with regular expressions through its -regex test. It’s important to note that, unlike grep, which matches parts of lines, find requires the whole path to match. Suppose we want to identify files and directories containing spaces or potentially problematic characters.
find . -regex '.*[^-_./0-9a-zA-Z].*'
The .* sequences at the beginning and end represent any number of any characters, which is necessary because find expects the entire path to match. Inside the square brackets, we use negation to exclude valid path characters, meaning any file or directory name containing characters other than hyphens, underscores, periods, slashes, digits, or Latin letters will appear in the output.
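To try the command without touching real data, it can be run in a scratch directory. The sketch below creates a few files with invented names in a temporary directory and removes it afterwards. It assumes a find implementation that provides -regex (GNU and BSD find do, but POSIX does not require it) and a mktemp that accepts -d.

```shell
# Scratch directory with hypothetical file names; it is removed at the end.
workdir=$(mktemp -d)
touch "$workdir/report.txt" "$workdir/my file.txt" \
      "$workdir/data_2024.csv" "$workdir/what?.log"

# Run find from inside the directory so that every path starts with "./".
# sort makes the order of the results predictable.
flagged=$(cd "$workdir" && find . -regex '.*[^-_./0-9a-zA-Z].*' | sort)

printf '%s\n' "$flagged"
rm -r "$workdir"
```

Only the names containing a space or a question mark are reported; the directory itself (.) and the two well-behaved names consist solely of permitted characters.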
This article has covered a few practical examples of Bash regular expressions. Creating complex regular expressions might seem challenging at first. But over time, you’ll gain experience and skill in using regular expressions for searches across various applications that support them.