Regular expression¶

A regular expression (regex) is a pattern that describes a piece of text to match.
A literal string such as owl is the simplest regular expression pattern: it matches exactly those characters.

import re
# the pattern here is `Mohammed`
print( re.findall("Mohammed", "My name is Mohammed"))

['Mohammed']

Character Sets (or Classes)¶

The character class [ ] matches a single character out of the set of characters it contains.
- One way to understand [ ] is to see it as a blank _____ that must be filled with exactly one of the characters inside the [ ].

# the pattern here is `gr[ea]y`
print( re.findall("gr[ea]y", "In American English, it is grey, but in British English it is gray."))

['grey', 'gray']

The character class [ ] matches ONLY one character, so repeating a character inside it has no effect.

print(re.findall("gr[aa]y", "graay"))
print(re.findall("gr[aae]y", "graey"))

[]
[]

Range

A hyphen between two characters inside a character class specifies a range.
- [0-9] matches a single digit from 0 to 9
- [a-f] is equivalent to [abcdef]

print(re.findall("0x[0-9a-f]", "0x1f"))
print(re.findall("0x[0-9][a-f]", "0x1f"))

['0x1']
['0x1f']
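
As a minimal sketch of the difference between one class and two, the following puts both ranges inside a single [ ] and then repeats that class. The sample string and variable names are illustrative assumptions, not part of the original notebook.

```python
import re

# [0-9a-f] is ONE class covering two ranges, so it consumes one character.
# Writing the class twice consumes two characters.
one_char = re.findall("0x[0-9a-f]", "0x1f 0xa0")
two_chars = re.findall("0x[0-9a-f][0-9a-f]", "0x1f 0xa0")

print(one_char)   # ['0x1', '0xa']
print(two_chars)  # ['0x1f', '0xa0']
```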

Negation

A caret ^ at the beginning of a character set negates it: the set matches any single character except the ones listed.
[^a] matches any single character except a

print(re.findall("gr[^a]y", "grey"))

['grey']
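
A negated class still has to consume exactly one character, and that character does not have to be a letter. The short sketch below illustrates both points; the sample string is an assumption made up for this example.

```python
import re

# "gr-y" matches because "-" is not "a".
# "gry" does not match: [^a] consumes the "y" and nothing is left for the final "y".
words = "grey gray gr-y gry"
matches = re.findall("gr[^a]y", words)

print(matches)  # ['grey', 'gr-y']
```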

Shorthand Character Classes¶

\d matches a single digit, [0-9]
\w matches a single word character (letter, digit, or underscore), [a-zA-Z0-9_]
\s matches a single whitespace character (space, tab, line break)
. matches any single character except a line break

# raw strings (r"...") keep backslashes such as \d intact for the regex engine
print(re.findall(r"class_\d", "class_1"))
print(re.findall(r"class\s\d", "class 10"))
print(re.findall(r"\w\w\w\w\w\s\d\d", "class 10"))
print(re.findall(".......", "1%0=NaN"))

['class_1']
['class 1']
['class 10']
['1%0=NaN']
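
To see that a shorthand is just an abbreviation for an explicit class, the sketch below runs both forms on the same text. It passes re.ASCII so that \w is limited to [a-zA-Z0-9_]; without that flag, Python str patterns also accept non-ASCII letters and digits. The sample text and variable names are assumptions for illustration.

```python
import re

text = "user_42 logged in at 09:15"

# With re.ASCII, \w is exactly [a-zA-Z0-9_]
shorthand = re.findall(r"\w+", text, re.ASCII)
explicit = re.findall(r"[a-zA-Z0-9_]+", text)

# \d\d needs two adjacent digits
digit_pairs = re.findall(r"\d\d", text)

print(shorthand)    # ['user_42', 'logged', 'in', 'at', '09', '15']
print(digit_pairs)  # ['42', '09', '15']
```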

Anchors¶

Anchors match positions, not characters.

^ matches at the beginning of a string
$ matches at the end of a string
\b matches at a word boundary (the position between a word character \w and a non-word character or the edge of the string)

print(re.findall("^Ali", "My name is Ali"))
print(re.findall("Ali$", "My name is Ali"))
print(re.findall("^Ali", "Ali is my name"))

[]
['Ali']
['Ali']
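
The word boundary \b has no example above, so here is a minimal sketch of it. The sample sentence is an assumption chosen so that Ali also appears inside longer words.

```python
import re

text = "Ali met Alice and Alison"

# Without boundaries, Ali is also found inside Alice and Alison
without_boundary = re.findall(r"Ali", text)

# \b on both sides keeps only the standalone word
with_boundary = re.findall(r"\bAli\b", text)

print(without_boundary)  # ['Ali', 'Ali', 'Ali']
print(with_boundary)     # ['Ali']
```

Note the raw string r"...": in a normal Python string, "\b" is a backspace character rather than a regex word boundary.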

Alternation¶

Alternation | works like a logical or: it matches any one of several alternative patterns.

pattern = 'Mohammed|Ali|boxer'
string = "Mohammed is my name. Mohammed Ali is my full name, and I'm a boxer"
print( re.search(pattern, string).group() )

Mohammed
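
re.search returns only the leftmost match, which is why a single name is printed. As a small sketch on the same pattern and string, re.findall shows every place where one of the alternatives matches; the variable names first and all_matches are illustrative.

```python
import re

pattern = "Mohammed|Ali|boxer"
string = "Mohammed is my name. Mohammed Ali is my full name, and I'm a boxer"

first = re.search(pattern, string).group()  # leftmost match only
all_matches = re.findall(pattern, string)   # every non-overlapping match

print(first)        # Mohammed
print(all_matches)  # ['Mohammed', 'Mohammed', 'Ali', 'boxer']
```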

Group¶

To match either grey is a nice color or gray is a nice color, group the alternatives using ( ) so that the alternation applies only to that part of the pattern. The general form is (pattern1|pattern2|...).

pattern = '(grey|gray) is a nice color'
string = 'Gray is a nice color'
print(re.search(pattern, string, re.I).group()) # re.I ignores letter case

Gray is a nice color
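
The sketch below shows what the parentheses change. Without them, | splits the whole pattern into grey on one side and gray is a nice color on the other. With them, group(1) also reports which alternative matched. The lowercase sample string is an assumption for this example.

```python
import re

string = "grey is a nice color"

grouped = re.search("(grey|gray) is a nice color", string)
ungrouped = re.search("grey|gray is a nice color", string)

print(grouped.group())    # grey is a nice color
print(grouped.group(1))   # grey  (text captured by the first group)
print(ungrouped.group())  # grey  (only the left alternative matched)
```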

Repetition¶

+ matches the preceding token 1 or more times
? makes the preceding token optional (0 or 1 times)
* matches the preceding token 0 or more times
{n,m} matches the preceding token at least n and at most m times; {n} matches it exactly n times

print(re.findall(r"\w+\s\d", "class 10"))
print(re.findall(r"colou?r", "color or colour"))
print(re.findall(r"\d*\stimes", "2 times 33 times 9999 times"))
print(re.findall(r"\w{5}\s?\d+", "class 10 class_10 class_123"))

['class 1']
['color', 'colour']
['2 times', '33 times', '9999 times']
['class 10', 'lass_10', 'lass_123']
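
The examples above use {5} but not the two-number form, so here is a minimal sketch of {n,m}. The sample numbers are an assumption; \b is added in the second pattern so that part of a longer number is not accepted.

```python
import re

text = "7 42 123 2024 99999"

# 2 to 4 digits anywhere: the first four digits of 99999 also qualify
loose = re.findall(r"\d{2,4}", text)

# 2 to 4 digits forming a whole number
whole_numbers = re.findall(r"\b\d{2,4}\b", text)

print(loose)          # ['42', '123', '2024', '9999']
print(whole_numbers)  # ['42', '123', '2024']
```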

Quantifiers can also be applied to a group.

print(re.search("data(set)?", "data" ).group())
print(re.search("data(set)?", "dataset" ).group())

data
dataset
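
One thing to watch for when combining groups with re.findall: if the pattern contains a capturing group, findall returns the text of the group rather than the whole match. The sketch below, on an assumed sample string, shows this and the non-capturing form (?:...) that avoids it.

```python
import re

text = "data dataset"

# Capturing group: findall returns what (set) captured, '' when it was skipped
captured = re.findall("data(set)?", text)

# Non-capturing group (?:...): findall returns the whole match
whole = re.findall("data(?:set)?", text)

print(captured)  # ['', 'set']
print(whole)     # ['data', 'dataset']
```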

Greedy and Lazy Repetition¶

Repetition quantifiers are greedy: they expand the match as far as they can. In the following example, the strings <em>, </em>, and <em>website</em> all fit the pattern <.+>, but the search starts at the first < and the greedy + extends to the last >, so the longest substring is returned.

print(re.search("<.+>", "This is my first <em>website</em>").group())

<em>website</em>

To make the quantifier lazy (match as few characters as possible), add ? after the repetition operator, in this case after the +.

print(re.search("<.+?>", "This is my first <em>website</em>").group())

<em>
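
As a short sketch on the same string, re.findall makes the difference easier to see, because the lazy pattern finds each tag separately. The last pattern is an alternative that uses a negated character class instead of a lazy quantifier; it is shown here only for comparison and the variable names are illustrative.

```python
import re

html = "This is my first <em>website</em>"

greedy = re.findall("<.+>", html)      # one long match
lazy = re.findall("<.+?>", html)       # stops at the first > each time
negated = re.findall("<[^>]+>", html)  # any characters except >, then >

print(greedy)   # ['<em>website</em>']
print(lazy)     # ['<em>', '</em>']
print(negated)  # ['<em>', '</em>']
```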