# Regular expression

• A regular expression (regex) is a pattern that describes a set of strings to match in text.
• Literal text such as owl is the simplest regular expression pattern: it matches exactly that text.
In [24]:
import re
# the pattern here is Mohammed
print( re.findall("Mohammed", "My name is Mohammed"))

['Mohammed']


### Character Sets (or Classes)

• The character class [ ] matches a single character out of the set of characters listed inside it.
• One way to understand [ ] is to see it as a blank _ that must be filled with exactly one of the characters inside the [ ].
In [33]:
# the pattern here is gr, then either e or a, then y
print( re.findall("gr[ea]y", "In American English, it is grey, but in British English it is gray."))

['grey', 'gray']

• A character class [ ] matches exactly ONE character, no matter how many characters are listed inside it.
In [36]:
print(re.findall("gr[aa]y", "graay"))
print(re.findall("gr[aae]y", "graey"))

[]
[]


range

• A hyphen inside a character class allows you to specify a range.
• [0-9] matches a single digit from 0 to 9
• [a-f] is equivalent to [abcdef]
In [51]:
print(re.findall("0x[0-9a-f]", "0x1f"))
print(re.findall("0x[0-9][a-f]", "0x1f"))

['0x1']
['0x1f']
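
Several ranges can share one character class. The following is a minimal sketch (the sample string and the variable names are made up for illustration) that accepts both lowercase and uppercase hexadecimal digits:

```python
import re

# One character class can hold several ranges: digits, a-f and A-F.
hex_digit = "[0-9a-fA-F]"

# Two hex digits after the 0x prefix; "0xZZ" has no hex digits, so it is skipped.
matches = re.findall("0x" + hex_digit + hex_digit, "0x1F 0xab 0xZZ")
print(matches)  # ['0x1F', '0xab']
```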


negation

• A caret ^ at the beginning of a character set negates it: the set matches any single character except the ones listed.
• [^a] matches any single character except a
In [53]:
print(re.findall("gr[^a]y", "grey"))

['grey']
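
A negated set still has to consume exactly one character. The sketch below uses a made-up sample string to show that gry is not matched, because nothing is left for the final y once [^a] has consumed a character:

```python
import re

# [^a] must match one character that is not "a"; it does not mean "nothing here".
words = "grey gray gry gr9y"
negated_matches = re.findall("gr[^a]y", words)
print(negated_matches)  # ['grey', 'gr9y']
```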


### Shorthand Character Classes

• \d matches a single digit [0-9]
• \w matches a single word character (letter, digit, or underscore); for ASCII text this is [a-zA-Z0-9_]
• \s matches a single whitespace character (space, tab, line break)
• . matches any single character except a line break
In [79]:
# raw strings (r"...") pass backslashes such as \d to the regex engine unchanged
print(re.findall(r"class_\d", "class_1"))
print(re.findall(r"class\s\d", "class 10"))
print(re.findall(r"\w\w\w\w\w\s\d\d", "class 10"))
print(re.findall(".......", "1%0=NaN"))

['class_1']
['class 1']
['class 10']
['1%0=NaN']
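
Two details about the shorthand classes are easy to miss in Python 3: patterns with backslashes are safest as raw strings, and \w is Unicode-aware for str patterns unless re.ASCII is passed. A minimal sketch, with a sample string chosen only for illustration:

```python
import re

# A raw string (r"...") keeps the backslash intact for the regex engine.
digits = re.findall(r"\d", "class 10")
print(digits)  # ['1', '0']

sample = "\u00e9_1"  # the letter é, an underscore, and a digit

# For str patterns, \w is Unicode-aware by default, so é counts as a word character.
unicode_words = re.findall(r"\w", sample)
print(unicode_words)  # ['é', '_', '1']

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w", sample, re.ASCII)
print(ascii_words)  # ['_', '1']
```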


### Anchors

Anchors match positions and not characters.

• ^ matches at the beginning of a string
• $ matches at the end of a string
• \b matches a word boundary
In [121]:
print(re.findall("^Ali", "My name is Ali"))
print(re.findall("Ali$", "My name is Ali"))
print(re.findall("^Ali", "Ali is my name"))

[]
['Ali']
['Ali']
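
The word boundary anchor \b can be illustrated in the same way. The sentence below is a made-up example; note that the pattern must be a raw string, because "\b" in an ordinary Python string is a backspace character:

```python
import re

sentence = "Ali met Alice and Khalid"

# Without anchors the literal also matches the start of "Alice".
loose = re.findall(r"Ali", sentence)
print(loose)  # ['Ali', 'Ali']

# \b requires a word boundary on both sides, so only the standalone word matches.
whole_word = re.findall(r"\bAli\b", sentence)
print(whole_word)  # ['Ali']
```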


### Alternation

• Alternation | works like "or": it lets the pattern match any one of several alternative patterns.
In [102]:
pattern = 'Mohammed|Ali|boxer'
string = "Mohammed is my name. Mohammed Ali is my full name, and I'm a boxer"
print( re.search(pattern, string).group() )

Mohammed
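
re.search returns only the first match in the string. As a small sketch of the difference, re.findall with the same pattern collects every match, and the second example (an illustrative pattern not taken from the text above) shows that alternatives are tried from left to right:

```python
import re

pattern = "Mohammed|Ali|boxer"
string = "Mohammed is my name. Mohammed Ali is my full name, and I'm a boxer"

# re.findall collects every match, scanning the string from left to right.
all_matches = re.findall(pattern, string)
print(all_matches)  # ['Mohammed', 'Mohammed', 'Ali', 'boxer']

# Alternatives are tried in order, so the first one that matches wins.
first_wins = re.search("Ali|Alice", "Alice").group()
print(first_wins)  # Ali
```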


### Grouping

• To match either grey is a nice color or gray is a nice color, group the alternatives with ( ) so that | applies only inside the parentheses. For example, (pattern1|pattern2|...).
In [103]:
pattern = '(grey|gray) is a nice color'
string = 'Gray is a nice color'
print(re.search(pattern, string, re.I).group()) # re.I ignores letter case

Gray is a nice color
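
Parentheses also capture the text they match. The sketch below, using made-up sample strings, shows how to read the captured text and how the non-capturing form (?:...) changes what re.findall returns:

```python
import re

match = re.search("(grey|gray) is a nice color", "Gray is a nice color", re.I)
print(match.group(0))  # Gray is a nice color  (the whole match)
print(match.group(1))  # Gray                  (text captured by the first group)

# With one capturing group, re.findall returns only the captured text.
captured = re.findall("(grey|gray) is", "grey is nice, gray is nice")
print(captured)  # ['grey', 'gray']

# (?:...) groups without capturing, so re.findall returns the whole match.
uncaptured = re.findall("(?:grey|gray) is", "grey is nice, gray is nice")
print(uncaptured)  # ['grey is', 'gray is']
```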


### Repetition

• + matches the preceding token 1 or more times
• ? makes the preceding token in the regular expression optional (0 or 1 times)
• * matches the preceding token 0 or more times
• {n,m} matches the preceding token at least n and at most m times; {n} matches it exactly n times.
In [163]:
print(re.findall(r"\w+\s\d", "class 10"))
print(re.findall("colou?r", "color or colour"))
print(re.findall(r"\d*\stimes", "2 times 33 times 9999 times"))
print(re.findall(r"\w{5}\s?\d+", "class 10 class_10 class_123"))

['class 1']
['color', 'colour']
['2 times', '33 times', '9999 times']
['class 10', 'lass_10', 'lass_123']

• Quantifiers can also be applied to a group, so (set)? makes the whole substring set optional.
In [164]:
print(re.search("data(set)?", "data" ).group())
print(re.search("data(set)?", "dataset" ).group())

data
dataset
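
The {n,m} form and quantified groups can be sketched as follows. The sample strings are invented for illustration, and the group is written as (?:...) so that re.findall returns whole matches rather than the captured text:

```python
import re

# {2,3} matches two or three digits and, being greedy, takes three when it can.
runs = re.findall(r"\d{2,3}", "1 22 333 4444")
print(runs)  # ['22', '333', '444']

# A quantifier placed after a group repeats the whole group.
laughs = re.findall(r"(?:ha)+", "ha haha hahaha")
print(laughs)  # ['ha', 'haha', 'hahaha']
```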


### Greedy and Lazy Repetition

• By default, the repetition quantifiers are greedy: they expand the match as far as they can. In the following example, each of <em>, </em>, and <em>website</em> fits the pattern <.+> on its own, but a greedy quantifier returns the longest match that starts at the first <.
In [143]:
print(re.search("<.+>", "This is my first <em>website</em>").group())

<em>website</em>

• To make a quantifier lazy (stop at the shortest possible match), add ? after the repetition operator, in this case after the +.
In [144]:
print(re.search("<.+?>", "This is my first <em>website</em>").group())

<em>
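
For a side-by-side comparison, here is a short sketch that runs the greedy and lazy patterns through re.findall on the same string. The third pattern, a negated character class, is an alternative added here for illustration; it finds the same tags without relying on laziness:

```python
import re

html = "This is my first <em>website</em>"

# Greedy: .+ runs to the last > in the string.
greedy = re.findall("<.+>", html)
print(greedy)  # ['<em>website</em>']

# Lazy: .+? stops at the first > it can.
lazy = re.findall("<.+?>", html)
print(lazy)  # ['<em>', '</em>']

# Negated class: one or more characters that are not >, then >.
negated = re.findall("<[^>]+>", html)
print(negated)  # ['<em>', '</em>']
```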