Python Regular Expression – Special Characters

There are various special characters and sequences in regular expressions. Let’s look at them one by one.

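1. \d – Any digit character

The backslash lowercase d matches any single digit from 0 to 9. For str patterns in Python 3 it also matches other Unicode decimal digits, unless the re.ASCII flag is set.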

Let’s say we have a phone number in a text document and we want to search for it.

In [1]: import re

In [2]: text = 'My number is 5348482075'

In [3]: re.findall(r'\d', text)
Out[3]: ['5', '3', '4', '8', '4', '8', '2', '0', '7', '5']

To match the whole number as a single string, we can use the Kleene plus +, which matches one or more occurrences of the element that precedes it.

In [4]: re.findall(r'\d+', text)
Out[4]: ['5348482075']
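If we also want to know where the number sits in the text, re.search can help. The snippet below is a minimal sketch on the same text. It uses a raw string (the r prefix) so that the backslash reaches the re module unchanged.

```python
import re

text = 'My number is 5348482075'

# re.search returns a match object for the first match, or None if nothing matches.
match = re.search(r'\d+', text)

number = match.group()      # the matched digits as a string
start, end = match.span()   # start and end index of the match inside text

print(number)       # 5348482075
print(start, end)   # 13 23
```

Unlike re.findall, which returns every match as a list of strings, re.search stops at the first match and keeps its position.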

2. \D – Any non-digit character

The backslash uppercase D matches any character that is not a digit. It is the negation of \d.

In [5]: re.findall(r'\D', text)
Out[5]: ['M', 'y', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ' ', 'i', 's', ' ']

In [6]: re.findall(r'\D+', text)
Out[6]: ['My number is ']
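A common use of \D is to strip everything except the digits from a formatted number. Here is a small sketch using re.sub. The formatted string is a made-up example, not taken from the text above.

```python
import re

# Hypothetical formatted phone number, used only for illustration.
formatted = '(534) 848-2075'

# Replace every non-digit character with an empty string.
digits_only = re.sub(r'\D', '', formatted)

print(digits_only)  # 5348482075
```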

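3. \w – Any alphanumeric character

The backslash lowercase w matches any alphanumeric character, i.e. a-z, A-Z and 0-9. It also matches the underscore _. For str patterns in Python 3 it matches letters and digits from other scripts as well, unless the re.ASCII flag is set.

In [7]: re.findall(r'\w+', text)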
Out[7]: ['My', 'number', 'is', '5348482075']
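To see the underscore behaviour in action, here is a short sketch. The sample line and the names in it are invented for illustration.

```python
import re

# Hypothetical sample line; the names are made up.
line = 'user_name = value42 + 7'

# \w+ keeps the underscore inside a word but stops at spaces, = and +.
words = re.findall(r'\w+', line)

print(words)  # ['user_name', 'value42', '7']
```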

4. \W – Any non-alphanumeric character

The backslash uppercase W matches any character that is not alphanumeric and not an underscore. It is the negation of \w.

In [8]: re.findall(r'\W', text)
Out[8]: [' ', ' ', ' ']
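Since \W matches the separators between words, it can be handy with re.split. The following is a minimal sketch. The second string is a made-up example.

```python
import re

text = 'My number is 5348482075'

# Split wherever one or more non-word characters appear.
parts = re.split(r'\W+', text)
print(parts)  # ['My', 'number', 'is', '5348482075']

# Hypothetical string with punctuation, for illustration only.
punctuated = 'Hello, world! 42'
pieces = re.split(r'\W+', punctuated)
print(pieces)  # ['Hello', 'world', '42']
```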

5. \s – Any whitespace character

The backslash lowercase s matches any whitespace character, i.e. space, newline ( \n ), tab ( \t ) and carriage return ( \r ), as well as form feed ( \f ) and vertical tab ( \v ).

In [9]: re.findall(r'\s', text)
Out[9]: [' ', ' ', ' ']

In [10]: re.findall(r'My\snumber', text)
Out[10]: ['My number']
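The example text only contains plain spaces, so here is a small sketch with other kinds of whitespace. The messy string is invented for illustration. It also shows one possible use, collapsing runs of whitespace into a single space.

```python
import re

# Hypothetical string mixing tab, space, newline and carriage return.
messy = 'My\tnumber \n is\r\n5348482075'

# Each whitespace character is matched individually.
found = re.findall(r'\s', messy)
print(found)  # ['\t', ' ', '\n', ' ', '\r', '\n']

# Collapse every run of whitespace into one space.
normalized = re.sub(r'\s+', ' ', messy)
print(normalized)  # My number is 5348482075
```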

6. \S – Any non-whitespace character

The backslash uppercase S matches any character that is not whitespace. It is the negation of \s.

In [11]: re.findall(r'\S+', text)
Out[11]: ['My', 'number', 'is', '5348482075']
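On this text \S+ and \w+ give the same result, but they differ as soon as punctuation appears. A short sketch with a made-up string:

```python
import re

# Hypothetical sample containing punctuation.
sample = 'e-mail: a_b@example.com!'

# \S+ keeps everything between whitespace together, punctuation included.
non_space = re.findall(r'\S+', sample)
print(non_space)  # ['e-mail:', 'a_b@example.com!']

# \w+ breaks at every character that is not a letter, digit or underscore.
word_runs = re.findall(r'\w+', sample)
print(word_runs)  # ['e', 'mail', 'a_b', 'example', 'com']
```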

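7. \b – Word boundary

The backslash lowercase b matches the empty position at the start or end of a word, that is, between a word character ( \w ) and a non-word character ( \W ) or the start or end of the string. It does not consume any characters and is used to isolate whole words.

Let’s say a text contains both dog and dogecoin, and we want to match only the word dog, not dogecoin.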

In [12]: re.findall('dog', 'dog dogecoin')
Out[12]: ['dog', 'dog']

The above pattern matches the word dog as well as the characters dog inside dogecoin. To match only the word dog, we can use the word boundary.

In [15]: re.findall(r'\bdog\b', 'dog dogecoin')
Out[15]: ['dog']

In [16]: re.findall('\\bdog\\b', 'dog dogecoin')
Out[16]: ['dog']
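The boundary does not have to be on both sides. As a sketch, putting \b on only one side matches words that start or end with dog. The word hotdog is added here purely for illustration.

```python
import re

# Same idea as above, with one extra made-up word.
text = 'dog dogecoin hotdog'

# Boundary on the left only: words that start with dog.
starts_with_dog = re.findall(r'\bdog\w*', text)
print(starts_with_dog)  # ['dog', 'dogecoin']

# Boundary on the right only: words that end with dog.
ends_with_dog = re.findall(r'\w*dog\b', text)
print(ends_with_dog)  # ['dog', 'hotdog']
```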

If you look carefully at the examples above, you can see that I used a raw string (the r prefix) for ‘\bdog\b’. This is because in an ordinary Python string literal, \b is the escape sequence for the backspace character. So if we write the pattern without making it a raw string, we get an empty list.

In [17]: re.findall('\bdog\b', 'dogecoin')
Out[17]: []

This happens because Python is looking for a backspace character followed by dog and then another backspace, which we do not have here.

A raw string makes Python treat the backslash as a normal character, so \b reaches the re module unchanged. Another way to get the same pattern is to escape each backslash with a second backslash, as we did above.
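The difference is easy to check by looking at the strings themselves. A minimal sketch:

```python
plain = '\b'      # one character: the backspace character
raw = r'\b'       # two characters: a backslash followed by b
doubled = '\\b'   # the same two characters, written with an escaped backslash

print(len(plain), len(raw), len(doubled))  # 1 2 2
print(plain == '\x08')                     # True
print(raw == doubled)                      # True
```

Only the two-character forms reach the re module as the word boundary \b.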
