Chapter 3. Regular expressions

20. Regular Expression Fundamentals

Regular expressions (or regex) are used in many *nix commands, including sed.

Beginning of line ( ^ )

The caret symbol ^ matches at the start of a line.

Display lines which start with 103:

$ sed -n '/^103/ p' employee.txt
103,Raj Reddy,Sysadmin

Note that ^ anchors the match to the beginning of a line only when it is the first character in the regular expression. In this example, ^103 matches all the lines that begin with 103.
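The note above can be checked with a small sketch. The input lines here are made up for illustration (only the 103 row comes from employee.txt), and the second command assumes sed's default basic regular expression syntax, where a ^ that is not at the start of the pattern is treated as an ordinary character.

```shell
# ^ anchors the match only when it is the first character of the pattern.
anchored=$(printf '103,Raj Reddy,Sysadmin\n9103,Hypothetical,Row\n' | sed -n '/^103/p')

# Anywhere else in a basic regular expression, ^ is an ordinary character.
literal=$(printf 'a^b\nab\n' | sed -n '/a^b/p')

printf '%s\n' "$anchored" "$literal"
```

Only the line that begins with 103 is selected by the first command, and only the line containing a literal a^b by the second.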

End of line ( $ )

The dollar symbol $ matches the end of a line.

Display lines which end with the letter r:

$ sed -n '/r$/ p' employee.txt
102,Jason Smith,IT Manager
104,Anand Ram,Developer
105,Jane Miller,Sales Manager

Single Character (.)

The special meta-character “.” (dot) matches any character except the end of the line character.

  • . matches single character
  • .. matches two characters
  • ... matches three characters
  • etc.

In the following example, the pattern "J followed by three characters and a space" will be replaced with "Jason followed by a space". So, "J... " matches both "John " and "Jane " from employee.txt, and these two lines are replaced accordingly as shown below.

$ sed -n 's/J... /Jason /p' employee.txt
101,Jason Doe,CEO
105,Jason Miller,Sales Manager
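Each dot stands for exactly one character, so the number of dots matters. The sketch below uses three names taken from employee.txt and a placeholder replacement X (chosen only to make the changed lines easy to spot) to show that "J... " does not match "Jason ", which has four characters between the J and the space.

```shell
# "J... " needs J, exactly three more characters, then a space.
# "John " and "Jane " qualify; "Jason " has one character too many.
result=$(printf 'John Doe\nJason Smith\nJane Miller\n' | sed 's/J... /X /')
printf '%s\n' "$result"
```

The Jason Smith line passes through unchanged, while the other two lines start with X.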

Zero or more Occurrences (*)

The special character “*” (star) matches zero or more occurrences of the previous character. For example, the pattern “1*” matches zero or more occurrences of “1”.

For this example create the following log.txt file:

$ vi log.txt
log: Input Validated
log:
log: testing resumed
log:
log:output created

Suppose you would like to view only the lines that contain "log:" followed by a message. The message might immediately follow the log: or might have some spaces. You don't want to view the lines that contain "log:" without anything.

Display all the lines that contain "log:" followed by zero or more spaces followed by a character:

$ sed -n '/log: *./ p' log.txt
log: Input Validated
log: testing resumed
log:output created

Note: In the above example the dot at the end is necessary. If it is omitted, sed also prints the lines that contain only "log:".
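To see why the trailing dot matters, the following sketch runs both patterns against the same five log lines (held in a shell variable instead of log.txt so the example is self-contained). Because " *" can match zero spaces, the pattern without the dot is satisfied by "log:" alone.

```shell
log=$(printf 'log: Input Validated\nlog:\nlog: testing resumed\nlog:\nlog:output created\n')

# With the dot: something must follow "log:" and the optional spaces.
with_dot=$(printf '%s\n' "$log" | sed -n '/log: *./p')

# Without the dot: " *" matches zero spaces, so every "log:" line matches.
without_dot=$(printf '%s\n' "$log" | sed -n '/log: */p')

printf '%s\n---\n%s\n' "$with_dot" "$without_dot"
```

The first command selects the three lines that carry a message; the second selects all five.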

One or more Occurrences (\+)

The special character “\+” matches one or more occurrences of the previous character. For example, a space before “\+”, i.e. “ \+”, matches one or more space characters. Let us use the same log.txt as an example file.

Display all the lines that contain "log:" followed by one or more spaces:

$ sed -n '/log: \+/ p' log.txt
log: Input Validated
log: testing resumed

Note: In addition to not matching the "log:" only lines, the above example also didn't match the line "log:output created", as there is no space after "log:" in this line.
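The \+ operator is a GNU sed extension to basic regular expressions, so the sketch below assumes GNU sed for the first command. It also shows two spellings that should behave the same way using only standard syntax: the interval \{1,\} and a doubled character followed by a star.

```shell
log=$(printf 'log: Input Validated\nlog:\nlog: testing resumed\nlog:\nlog:output created\n')

# GNU extension: one or more spaces after "log:".
gnu=$(printf '%s\n' "$log" | sed -n '/log: \+/p')

# Standard interval: at least one space.
interval=$(printf '%s\n' "$log" | sed -n '/log: \{1,\}/p')

# One space followed by zero or more spaces.
doubled=$(printf '%s\n' "$log" | sed -n '/log:  */p')

printf '%s\n' "$gnu"
```

All three variables end up holding the same two lines.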

Zero or one Occurrence (\?)

The special character “\?” matches zero or one occurrence of the previous character. In the example below the space after "log:" is optional, so every line that contains "log:" matches.

$ sed -n '/log: \?/ p' log.txt
log: Input Validated
log:
log: testing resumed
log:
log:output created
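Since the log.txt example matches every line, the effect of \? is easier to see in a substitution. This is a minimal sketch assuming GNU sed (where \? is available in basic regular expressions); the three input words and the replacement X are made up for illustration.

```shell
# "u\?" makes the u optional: both "color" and "colour" match,
# but "colouur" (two u's) does not.
result=$(printf 'color\ncolour\ncolouur\n' | sed 's/colou\?r/X/')
printf '%s\n' "$result"
```

The first two words are replaced and the third is left alone.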

Escaping the Special Character (\)

If you want to search for a special character (for example, * or dot) as literal text, you have to escape it with a backslash in the regular expression.

$ sed -n '/127\.0\.0\.1/ p' /etc/hosts
127.0.0.1 localhost.localdomain localhost
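The difference between an escaped and an unescaped dot can be sketched as follows. The second input line is a hypothetical entry invented for this example; it is not something you would find in a real /etc/hosts.

```shell
hosts=$(printf '127.0.0.1 localhost\n127a0b0c1 hypothetical-entry\n')

# Unescaped dots match any character, so both lines are selected.
unescaped=$(printf '%s\n' "$hosts" | sed -n '/127.0.0.1/p')

# Escaped dots match only a literal ".".
escaped=$(printf '%s\n' "$hosts" | sed -n '/127\.0\.0\.1/p')

printf '%s\n---\n%s\n' "$unescaped" "$escaped"
```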

Character Class ([0-9])

A character class is a list of characters enclosed in square brackets. It matches exactly one character, which can be any one of the characters listed.

Match any line that contains 2 or 3 or 4:

$ sed -n '/[234]/ p' employee.txt
102,Jason Smith,IT Manager
103,Raj Reddy,Sysadmin
104,Anand Ram,Developer

Within the square brackets, you can use a hyphen to specify a range of characters. For example, [0123456789] can be written as [0-9], and alphabetic ranges can be specified as [a-z], [A-Z], etc.

Match any line that contains 2 or 3 or 4 (alternate form):

$ sed -n '/[2-4]/ p' employee.txt
102,Jason Smith,IT Manager
103,Raj Reddy,Sysadmin
104,Anand Ram,Developer

21. Additional Regular Expressions

OR Operation (|)

The pipe character (|) is used to specify that either of two whole subexpressions could occur in a position. “subexpression1\|subexpression2” matches either subexpression1 or subexpression2.

Print lines containing either 101 or 102:

$ sed -n '/101\|102/ p' employee.txt
101,John Doe,CEO
102,Jason Smith,IT Manager

Please note that the | symbol is escaped with a backslash (\).

Print lines that contain a character from 2 to 3 or that contain the string 105:

$ sed -n '/[2-3]\|105/ p' employee.txt
102,Jason Smith,IT Manager
103,Raj Reddy,Sysadmin
105,Jane Miller,Sales Manager
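The escaped \| form is a GNU sed extension to basic regular expressions. As a sketch of an alternative, the -E option (available in GNU sed and the BSD sed implementations) switches to extended regular expressions, where the pipe is written without a backslash. The input is the first three lines of employee.txt.

```shell
employees=$(printf '101,John Doe,CEO\n102,Jason Smith,IT Manager\n103,Raj Reddy,Sysadmin\n')

# Basic regular expression: the pipe must be escaped (GNU sed).
bre=$(printf '%s\n' "$employees" | sed -n '/101\|102/p')

# Extended regular expression: the pipe is special on its own.
ere=$(printf '%s\n' "$employees" | sed -E -n '/101|102/p')

printf '%s\n' "$ere"
```

Both commands select the 101 and 102 lines.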

Exactly M Occurrences ({m})

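A regular expression followed by {m} matches exactly m occurrences of the preceding expression. In sed's default (basic) regular expression syntax the braces are escaped, so the operator is written as \{m\}.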

For this example create the following numbers.txt file.

$ vi numbers.txt
1
12
123
1234
12345
123456

Print lines that contain any digit (will print all lines):

$ sed -n '/[0-9]/ p' numbers.txt
1
12
123
1234
12345
123456

Print lines consisting of exactly 5 digits:

$ sed -n '/^[0-9]\{5\}$/ p' numbers.txt
12345
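The ^ and $ anchors in the command above are what make it mean "exactly 5 digits and nothing else". The following sketch compares the anchored and unanchored patterns on the same numbers (kept in a shell variable instead of numbers.txt).

```shell
numbers=$(printf '1\n12\n123\n1234\n12345\n123456\n')

# Anchored: the whole line must be exactly five digits.
anchored=$(printf '%s\n' "$numbers" | sed -n '/^[0-9]\{5\}$/p')

# Unanchored: any line containing five digits in a row matches,
# including longer numbers.
unanchored=$(printf '%s\n' "$numbers" | sed -n '/[0-9]\{5\}/p')

printf '%s\n---\n%s\n' "$anchored" "$unanchored"
```

Without the anchors, 123456 is also printed, because it contains a run of five digits.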

M to N Occurrences ({m,n})

A regular expression followed by {m,n} indicates that the preceding item must match at least m times, but not more than n times. The values of m and n must be non-negative; POSIX only guarantees support for values up to 255.

Print lines consisting of at least 3 but not more than 5 digits:

$ sed -n '/^[0-9]\{3,5\}$/ p' numbers.txt
123
1234
12345

A regular expression followed by {m,} is a special case that matches m or more occurrences of the preceding expression.
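As an illustration of the open-ended form, here is a small sketch using the same numbers as numbers.txt. The bounds 4 and 1,2 are arbitrary choices for the example.

```shell
numbers=$(printf '1\n12\n123\n1234\n12345\n123456\n')

# \{4,\}: four or more digits, no upper limit.
four_plus=$(printf '%s\n' "$numbers" | sed -n '/^[0-9]\{4,\}$/p')

# \{1,2\}: one or two digits.
one_or_two=$(printf '%s\n' "$numbers" | sed -n '/^[0-9]\{1,2\}$/p')

printf '%s\n---\n%s\n' "$four_plus" "$one_or_two"
```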

Word Boundary (\b)

\b is used to match a word boundary. \b does not match any character itself; it matches the position at the beginning (\bxx) or end (xx\b) of a word. Thus \bthe\b will find "the" but not "they", while \bthe will find both "the" and "they".

Create the following sample file for testing.

$ cat words.txt
word matching using: the
word matching using: thethe
word matching using: they

Match lines containing the whole word "the":

$ sed -n '/\bthe\b/ p' words.txt
word matching using: the

Please note that if you don't specify the \b at the end, it will match all lines.

Match lines containing words that start with “the”:

$ sed -n '/\bthe/ p' words.txt
word matching using: the
word matching using: thethe
word matching using: they
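Word boundaries are especially handy in substitutions, where you want to change a whole word without touching longer words that contain it. The sketch below assumes GNU sed, since \b is a GNU extension; the input sentence and the uppercase replacement are made up for illustration.

```shell
# Only the standalone word "the" is replaced; "they" and "thethe"
# are left alone because "the" is not bounded on both sides there.
result=$(printf 'the they thethe the\n' | sed 's/\bthe\b/THE/g')
printf '%s\n' "$result"
```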

Back References (\n)

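Back references let you group expressions for further use. A part of the pattern enclosed in \( and \) forms a group, and \1 refers back to the text matched by the first group (\2 to the second, and so on).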

Match only the line that has the word "the" repeated twice:

$ sed -n '/\(the\)\1/ p' words.txt
word matching using: thethe

Using the same logic, the regular expression "\([0-9]\)\1" matches a two-digit number in which both digits are the same, like 11, 22, 33, ...
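The repeated-digit pattern can be tried out as follows. The sketch also shows, as an assumption about a typical use rather than something taken from the text above, that the same \1 and \2 references can appear in the replacement part of a substitution, here to swap the first two comma-separated fields of an employee.txt line.

```shell
# \([0-9]\)\1 : a digit followed by the same digit again.
doubled=$(printf '11\n12\n22\n123\n' | sed -n '/\([0-9]\)\1/p')

# Groups can also be reused in the replacement: swap fields 1 and 2.
swapped=$(printf '101,John Doe,CEO\n' | sed 's/^\([^,]*\),\([^,]*\)/\2,\1/')

printf '%s\n---\n%s\n' "$doubled" "$swapped"
```

The first command keeps 11 and 22; the second turns the line into John Doe,101,CEO.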

22. Sed Substitution Using Regular Expression

The following are a few sed substitution examples that use regular expressions.

Replace the last two characters in every line of employee.txt with ",Not Defined":

$ sed 's/..$/,Not Defined/' employee.txt
101,John Doe,C,Not Defined
102,Jason Smith,IT Manag,Not Defined
103,Raj Reddy,Sysadm,Not Defined
104,Anand Ram,Develop,Not Defined
105,Jane Miller,Sales Manag,Not Defined

Delete the rest of the line starting from “Manager”:

$ sed 's/Manager.*//' employee.txt
101,John Doe,CEO
102,Jason Smith,IT
103,Raj Reddy,Sysadmin
104,Anand Ram,Developer
105,Jane Miller,Sales

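Delete everything from "#" to the end of each line, and then delete the lines that are left empty (this removes all lines that start with "#"):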

sed -e 's/#.*// ; /^$/ d' employee.txt

Create the following test.html for the next example:

$ vi test.html
<html><body><h1>Hello World!</h1></body></html>

Strip all html tags from test.html:

$ sed -e 's/<[^>]*>//g' test.html
Hello World!
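The character class [^>] (any character except >) is what keeps each match inside a single tag. A sketch of what happens with the more obvious-looking pattern <.*> is given below: because .* matches as much as it can, it runs from the first < to the last > on the line and removes the text in between as well.

```shell
html='<html><body><h1>Hello World!</h1></body></html>'

# Greedy: one match spans the whole line, so nothing is left.
greedy=$(printf '%s\n' "$html" | sed 's/<.*>//g')

# Bounded: each match stops at the first ">", so only tags are removed.
bounded=$(printf '%s\n' "$html" | sed 's/<[^>]*>//g')

printf '[%s]\n[%s]\n' "$greedy" "$bounded"
```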

Remove all comments and blank lines:

sed -e 's/#.*//' -e '/^$/ d' /etc/profile

Remove only the comment lines (lines that start with "#"). Leave the blank lines:

sed -e '/^#.*/ d' /etc/profile

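You can convert DOS newlines (CR/LF) to Unix format using sed. When you copy a DOS file to Unix, you will find \r\n at the end of each line. The command below removes the last character of every line, so use it only on files in which every line ends with a carriage return.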

Convert the DOS file format to Unix file format using sed:

sed 's/.$//' filename
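Because 's/.$//' deletes whatever character comes last, it damages lines that have no carriage return. The following sketch shows one possible way to target only the carriage return. The variable name cr and the two sample lines are assumptions made for this example.

```shell
# Hold a literal carriage return in a variable so it can be used
# inside the sed script without relying on \r escape support.
cr=$(printf '\r')

# Removes the last character of every line, CR or not.
naive=$(printf 'dos line\r\nunix line\n' | sed 's/.$//')

# Removes a trailing carriage return only.
targeted=$(printf 'dos line\r\nunix line\n' | sed "s/${cr}\$//")

printf '%s\n---\n%s\n' "$naive" "$targeted"
```

With the first command the second line loses its final letter; with the second command both lines come out intact and without the carriage return.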