Regular expression

  • A regular expression (regex) is a pattern describing a certain amount of text.
  • A literal string such as owl is the simplest regular expression pattern: it matches exactly that text.
In [24]:
import re
# the pattern here is `Mohammed`
print( re.findall("Mohammed", "My name is Mohammed"))
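A further small sketch of literal matching, using the owl pattern mentioned above. The sample sentence and variable names are made up for illustration. It suggests that a literal pattern finds every occurrence and is case-sensitive by default:

```python
import re

# Hypothetical sample sentence; "Owl" with a capital O is not matched.
text = "An owl hooted, a second owl answered, and the Owl Cafe closed."

matches = re.findall("owl", text)
print(matches)
```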

Character Sets (or Classes)

  • The character class [ ] allows you to choose a single character out of the included set of characters.
    • A way to understand [ ] is to see it as a blank _____ that must be filled with exactly one of the characters listed inside the [ ].
In [33]:
# the pattern here is `gr[ea]y`
print( re.findall("gr[ea]y", "In American English, it is grey, but in British English it is gray."))
['grey', 'gray']
  • The character class [ ] matches ONLY one character, no matter how many characters it lists.
In [36]:
print(re.findall("gr[aa]y", "graay"))
print(re.findall("gr[aae]y", "graey"))
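Both calls above return an empty list, because the class consumes only one character. As a minimal sketch (the test strings and variable names are illustrative), repeating a character inside the class changes nothing, and consuming two characters takes two classes:

```python
import re

# Duplicates inside a class are redundant: [aa] behaves exactly like [a].
one_char = re.findall("gr[aa]y", "gray graay")

# Two consecutive classes are needed to consume two characters.
two_chars = re.findall("gr[ae][ae]y", "graay graey")

print(one_char)
print(two_chars)
```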


  • A hyphen inside a character class allows you to specify a range.
    • [0-9] matches a single digit from 0 to 9
    • [a-f] is equivalent to [abcdef]
In [51]:
print(re.findall("0x[0-9a-f]", "0x1f"))
print(re.findall("0x[0-9][a-f]", "0x1f"))
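In the cell above, the first pattern stops after one hex digit, while the second requires a digit followed by a letter. A hedged sketch of a slightly more general variant follows. The sample text is invented, and upper-case letters are an added assumption:

```python
import re

# Hypothetical input: two valid hex bytes and one invalid token.
text = "0x1f 0xA3 0xzz"

# One class may hold several ranges: digits plus lower- and upper-case a-f.
# Two consecutive classes match exactly two hex digits after the 0x prefix.
hex_bytes = re.findall("0x[0-9a-fA-F][0-9a-fA-F]", text)
print(hex_bytes)
```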


  • A caret at the beginning of the character set negates it: the class matches any single character except those listed.
  • [^a] matches any single character except a
In [53]:
print(re.findall("gr[^a]y", "grey"))
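A negated class still has to consume one character. It does not mean "nothing here". The following sketch, with an invented test string, illustrates both points:

```python
import re

# "gray" is rejected because of the a; "gry" is rejected because
# [^a] must consume a character between "gr" and "y".
negated = re.findall("gr[^a]y", "grey gray gr8y gry")
print(negated)
```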

Shorthand Character Classes

  • \d matches a single digit [0-9]
  • \w matches a single word character (letter, digit, or underscore) [a-zA-Z0-9_]
  • \s matches a single whitespace character (space, tab, line break)
  • . matches any single character except a line break
In [79]:
# raw strings (r"...") keep the backslashes intact for the regex engine
print(re.findall(r"class_\d", "class_1"))
print(re.findall(r"class\s\d", "class 10"))
print(re.findall(r"\w\w\w\w\w\s\d\d", "class 10"))
print(re.findall(r".......", "1%0=NaN"))
['class_1']
['class 1']
['class 10']
['1%0=NaN']
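To see each shorthand on its own, here is a small sketch. The sample text and variable names are assumptions made for illustration:

```python
import re

# Hypothetical sample containing digits, spaces, and a tab.
text = "room_7 costs 40\tUSD"

digits = re.findall(r"\d", text)   # every single digit
spaces = re.findall(r"\s", text)   # spaces and the tab

# . matches any character except a line break, so the "\n" is skipped.
no_newline = re.findall(r".", "a\nb")

print(digits)
print(spaces)
print(no_newline)
```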


Anchors match positions and not characters.

  • ^ matches at the beginning of a string
  • $ matches at the end of a string
  • \b matches at a word boundary (the position between a word character and a non-word character)
In [121]:
print(re.findall("^Ali", "My name is Ali"))
print(re.findall("Ali$", "My name is Ali"))
print(re.findall("^Ali", "Ali is my name"))
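The cell above covers ^ and $. For \b, a minimal sketch with an invented sentence could look like this:

```python
import re

# Hypothetical sentence in which "Ali" also starts two longer words.
text = "Ali, Alice, and Alibaba"

# Without anchors, "Ali" is also found inside "Alice" and "Alibaba".
anywhere = re.findall(r"Ali", text)

# \b on both sides keeps only the standalone word.
whole_word = re.findall(r"\bAli\b", text)

print(anywhere)
print(whole_word)
```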


  • Alternation | works like a logical or: it matches any one of several patterns.
In [102]:
pattern = 'Mohammed|Ali|boxer'
string = "Mohammed is my name. Mohammed Ali is my full name, and I'm a boxer"
# re.search returns the first match found when scanning from the left
print(re.search(pattern, string).group())
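A single search reports only the leftmost match. As a sketch using the same pattern and string, re.findall lists every alternative that occurs:

```python
import re

pattern = 'Mohammed|Ali|boxer'
string = "Mohammed is my name. Mohammed Ali is my full name, and I'm a boxer"

first = re.search(pattern, string).group()   # leftmost match only
every = re.findall(pattern, string)          # all matches, in order

print(first)
print(every)
```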


  • To match either grey is a nice color or gray is a nice color, group the alternatives using ( ). For example, (pattern1|pattern2|...).
In [103]:
pattern = '(grey|gray) is a nice color'
string = 'Gray is a nice color'
print(re.search(pattern, string, re.I).group()) # re.I ignores letter case
Gray is a nice color
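The parentheses matter because | otherwise splits the whole pattern in two. A short sketch of the difference (variable names are illustrative):

```python
import re

text = "grey is a nice color"

# Without a group the alternatives are "grey" OR "gray is a nice color".
ungrouped = re.search("grey|gray is a nice color", text).group()

# With a group only the first word varies.
grouped = re.search("(grey|gray) is a nice color", text).group()

print(ungrouped)
print(grouped)
```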


  • + matches the preceding token 1 or more times
  • ? makes the preceding token optional (0 or 1 times)
  • * matches the preceding token 0 or more times
  • {n,m} matches the preceding token at least n and at most m times; {n} matches it exactly n times.
In [163]:
print(re.findall(r"\w+\s\d", "class 10"))
print(re.findall(r"colou?r", "color or colour"))
print(re.findall(r"\d*\stimes", "2 times 33 times 9999 times"))
print(re.findall(r"\w{5}\s?\d+", "class 10 class_10 class_123"))
['class 1']
['color', 'colour']
['2 times', '33 times', '9999 times']
['class 10', 'lass_10', 'lass_123']
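The {n,m} form is not exercised in the cell above, so here is a hedged sketch with an invented list of numbers. The \b anchors are added so that a long run of digits is not partially matched:

```python
import re

# Hypothetical input: runs of 1 to 5 digits.
text = "1 22 333 4444 55555"

# Between two and four digits, as a whole word.
two_to_four = re.findall(r"\b\d{2,4}\b", text)

# Exactly three digits.
exactly_three = re.findall(r"\b\d{3}\b", text)

print(two_to_four)
print(exactly_three)
```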
  • Quantifiers can also be applied to a group, so that the whole group is repeated or made optional.
In [164]:
print(re.search("data(set)?", "data").group())
print(re.search("data(set)?", "dataset").group())
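To contrast a quantifier on a single token with one on a group, consider this small sketch (the strings are made up for illustration):

```python
import re

text = "hahaha haaa"

# Without a group, + repeats only the preceding character "a".
single = re.search("ha+", text).group()

# With a group, + repeats the whole sequence "ha".
grouped = re.search("(ha)+", text).group()

print(single)
print(grouped)
```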

Greedy and Lazy Repetition

  • The repetition quantifiers are greedy: they expand the match as far as they can. In the following example, starting at the first <, the pattern <.+> could stop at the end of <em>, but because + is greedy it runs on to the last > and returns <em>website</em>.
In [143]:
print(re.search("<.+>", "This is my first <em>website</em>").group())
  • To make a quantifier lazy (it stops at the shortest possible match), add ? after it, in this case after the +. The same search then returns <em>.
In [144]:
print(re.search("<.+?>", "This is my first <em>website</em>").group())
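Listing all matches makes the difference easier to see. A minimal sketch on the same string (variable names are illustrative):

```python
import re

html = "This is my first <em>website</em>"

# Greedy: one long match from the first < to the last >.
greedy = re.findall("<.+>", html)

# Lazy: each tag is matched separately.
lazy = re.findall("<.+?>", html)

print(greedy)
print(lazy)
```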