Python Regular Expression – Greedy vs Non Greedy quantifiers

Spread the love

So far we talked about various quantifier in regular expression like Asterisk , Plus , Question mark and curly braces . In this post we will go one step further and try to understand the difference between greedy and non greedy quantifiers.

Greedy Match –

A greedy match in regular expression tries to match as many characters as possible.

For example [0-9]+ will try to match as many digits as possible. It gets never enough of it. It’s too greedy.

In [2]: re.findall('[0-9]+', '12345678910')
Out[2]: ['12345678910']

By default all quantifiers are greedy. They will try to match as many characters as possible.

In [3]: # zero or more occurrences

In [4]: re.findall('[0-9]*', '12345678910')
Out[4]: ['12345678910', '']

In [5]: # one or more occurrences

In [6]: re.findall('[0-9]+', '12345678910')
Out[6]: ['12345678910']

In [7]: # zero or one occurrences

In [8]: re.findall('[0-9]?', '12345678910')
Out[8]: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '1', '0', '']

In [9]: # exactly 5 occurrences

In [10]: re.findall('[0-9]{5}', '12345678910')
Out[10]: ['12345', '67891']

In [11]: # at least 2 but not greater than 5

In [12]: re.findall('[0-9]{2,5}', '12345678910')
Out[12]: ['12345', '67891']

Non Greedy Match –

A Non-Greedy match tries match as few characters as possible. You can make default quantifiers *, +, ?, {}, {m,n}, non greedy by appending a question marks after them like this – *?, +?, ??, {m}?, {m,n}?

Non Greedy Asterisk (*?) –

In [15]: re.findall('[0-9]*', '12345678910')
Out[15]: ['12345678910', '']

In [16]: re.findall('[0-9]*?', '12345678910')
Out[16]: 
['',
 '1',
 '',
 '2',
 '',
 '3',
 '',
 '4',
 '',
 '5',
 '',
 '6',
 '',
 '7',
 '',
 '8',
 '',
 '9',
 '',
 '1',
 '',
 '0',
 '']

The greedy version of asterisk [0-9]* matches zero or more occurrences of the number. The non greedy version of asterisk [0-9]*? matches zero or one occurrences of the number.

Non Greedy Plus (+? ) –


In [17]: re.findall('[0-9]+', '12345678910')
Out[17]: ['12345678910']

In [18]: re.findall('[0-9]+?', '12345678910')
Out[18]: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '1', '0']

The greedy version of plus [0-9]+ matches one or more occurrences of the number. The non greedy version [0-9]+? matches only once occurrences of the number.

Non Greedy Question mark ( ?? ) –


In [19]: re.findall('[0-9]?', '12345678910')
Out[19]: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '1', '0', '']

In [20]: re.findall('[0-9]??', '12345678910')
Out[20]: 
['',
 '1',
 '',
 '2',
 '',
 '3',
 '',
 '4',
 '',
 '5',
 '',
 '6',
 '',
 '7',
 '',
 '8',
 '',
 '9',
 '',
 '1',
 '',
 '0',
 '']

In [21]: 

The greedy version of question mark [0-9]? matches zero or one occurrences of the number. So, it first consumed 1 then 2 then 3 and so on and at last an empty string. The non greedy version of question marks [0-9]?? consumes an empty string, then a number, then again an empty string followed by a number as so on. It is trying to match as few numbers as possible that is why we are seeing this kind of pattern.

Non Greedy curly braces –

In [25]: re.findall('[0-9]{5}', '12345678910')
Out[25]: ['12345', '67891']

In [26]: re.findall('[0-9]{5}?', '12345678910')
Out[26]: ['12345', '67891']

The greedy and non greedy versions both matches 5 digits as curly braces matches exactly the specified number of occurrences.

In [27]: re.findall('[0-9]{2,5}', '12345678910')
Out[27]: ['12345', '67891']

In [28]: re.findall('[0-9]{2,5}?', '12345678910')
Out[28]: ['12', '34', '56', '78', '91']

Here the greedy version matches 5 digits but the non greedy second version matches only two digits.

Rating: 1 out of 5.

Leave a Reply