Regular expression

  • A regular expression (regex) is a pattern describing a certain amount of text.
  • A literal string such as owl is the simplest regular expression pattern.
In [24]:
import re
# the pattern here is `Mohammed`
print( re.findall("Mohammed", "My name is Mohammed"))
['Mohammed']

Character Sets (or Classes)

  • The character class [ ] allows you to choose a single character out of the included set of characters.
    • A way to understand [ ] is to see it as a blank _____ that must be filled with exactly one of the characters listed inside the [ ].
In [33]:
# the pattern here is `gr[ea]y`: "gr", then either "e" or "a", then "y"
print( re.findall("gr[ea]y", "In American English, it is grey, but in British English it is gray."))
['grey', 'gray']
  • The character class [ ] matches ONLY one character, no matter how many characters are listed inside it or how often one is repeated.
In [36]:
print(re.findall("gr[aa]y", "graay"))
print(re.findall("gr[aae]y", "graey"))
[]
[]

range

  • A hyphen inside a character class allows you to specify a range.
    • [0-9] matches a single digit from 0 to 9
    • [a-f] is equivalent to [abcdef]
In [51]:
print(re.findall("0x[0-9a-f]", "0x1f"))
print(re.findall("0x[0-9][a-f]", "0x1f"))
['0x1']
['0x1f']
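Several ranges can be combined inside one class. The sketch below matches two-digit hexadecimal literals, first with lowercase letters only and then with both cases. The sample string is made up for illustration.

```python
import re

text = "0x1f 0xA0 0x7"  # hypothetical sample input

# Lowercase hex digits only: "0xA0" is skipped because "A" is not in [0-9a-f],
# and "0x7" is skipped because the pattern requires two digits after "0x".
lower_only = re.findall("0x[0-9a-f][0-9a-f]", text)

# Adding the A-F range accepts uppercase hex digits as well.
any_case = re.findall("0x[0-9a-fA-F][0-9a-fA-F]", text)

print(lower_only)  # ['0x1f']
print(any_case)    # ['0x1f', '0xA0']
```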

negation

  • A caret ^ at the beginning of the character set negates it: the set matches any single character except the ones listed.
  • [^a] matches any single character except a
In [53]:
print(re.findall("gr[^a]y", "grey"))
['grey']
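A negated class still has to consume exactly one character; it does not mean "a may be absent". A small sketch with made-up inputs:

```python
import re

# "a" is the one excluded character, so "gray" does not match.
excluded = re.findall("gr[^a]y", "gray")

# There is no character between "gr" and "y" to fill the slot, so no match.
missing = re.findall("gr[^a]y", "gry")

# Any other single character is accepted, including digits.
accepted = re.findall("gr[^a]y", "gr8y grey")

print(excluded)  # []
print(missing)   # []
print(accepted)  # ['gr8y', 'grey']
```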

Shorthand Character Classes

  • \d matches a single digit [0-9]
  • \w matches a single word character (letter, digit, or underscore), i.e. [a-zA-Z0-9_] for ASCII text
  • \s matches a single whitespace character (space, tab, line break, ...)
  • . matches any single character except a line break
In [79]:
# raw strings (r"...") keep Python from interpreting the backslashes itself
print(re.findall(r"class_\d", "class_1"))
print(re.findall(r"class\s\d", "class 10"))
print(re.findall(r"\w\w\w\w\w\s\d\d", "class 10"))
print(re.findall(".......", "1%0=NaN"))
['class_1']
['class 1']
['class 10']
['1%0=NaN']
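In Python 3, the shorthand classes follow Unicode definitions by default when the pattern is a str, so \w also matches accented letters. The re.ASCII flag restricts \w to [a-zA-Z0-9_]. A minimal sketch, using a made-up sample string:

```python
import re

text = "caf\u00e9 na\u00efve"  # the words "café" and "naïve" (sample input)

# Default behaviour for str patterns: Unicode-aware \w
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, the accented letters no longer count as word characters
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['café', 'naïve']
print(ascii_words)    # ['caf', 'na', 've']
```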

Anchors

Anchors match positions and not characters.

  • ^ matches at the beginning of a string
  • $ matches at the end of a string
  • \b matches at a word boundary (the position between a word character and a non-word character)
In [121]:
print(re.findall("^Ali", "My name is Ali"))
print(re.findall("Ali$", "My name is Ali"))
print(re.findall("^Ali", "Ali is my name"))
[]
['Ali']
['Ali']
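A word boundary \b matches the position between a word character (\w) and a non-word character, or the start or end of the string. The sketch below uses a made-up sentence, and a raw string so that Python does not read \b as a backspace character:

```python
import re

text = "Ali met Alice"  # hypothetical sample sentence

# Without boundaries, the "Ali" inside "Alice" is found as well.
without_boundary = re.findall("Ali", text)

# With \b on both sides, only the standalone word matches.
with_boundary = re.findall(r"\bAli\b", text)

print(without_boundary)  # ['Ali', 'Ali']
print(with_boundary)     # ['Ali']
```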

Alternation

  • Alternation | works like a logical or. It lets the regex choose between several patterns.
In [102]:
pattern = 'Mohammed|Ali|boxer'
string = "Mohammed is my name. Mohammed Ali is my full name, and I'm a boxer"
print( re.search(pattern, string).group() )
Mohammed
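re.search stops at the first match, which is why only Mohammed is printed. To see every alternative that matches, re.findall can be applied to the same pattern and string:

```python
import re

pattern = 'Mohammed|Ali|boxer'
string = "Mohammed is my name. Mohammed Ali is my full name, and I'm a boxer"

# findall scans the whole string and returns each match in order of appearance
all_matches = re.findall(pattern, string)

print(all_matches)  # ['Mohammed', 'Mohammed', 'Ali', 'boxer']
```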

group

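  • To match either grey is a nice color or gray is a nice color, group the alternatives with ( ), for example (pattern1|pattern2|...). Without the parentheses, | splits the whole pattern, so grey|gray is a nice color would match either grey alone or gray is a nice color.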
In [103]:
pattern = '(grey|gray) is a nice color'
string = 'Gray is a nice color'
print(re.search(pattern, string, re.I).group()) # re.I ignores letter case
Gray is a nice color
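Parentheses also capture the text they match. As a sketch of what that means in practice: group(1) returns only the captured alternative, and re.findall returns the captured group rather than the whole match unless the group is made non-capturing with (?:...).

```python
import re

string = 'Gray is a nice color'

match = re.search('(grey|gray) is a nice color', string, re.I)
whole = match.group()    # the entire match
first = match.group(1)   # only the text captured by the first group

# findall returns the captured group when the pattern contains one ...
captured = re.findall('(grey|gray) is a nice color', string, re.I)
# ... and the whole match when the group is non-capturing
non_capturing = re.findall('(?:grey|gray) is a nice color', string, re.I)

print(whole)          # Gray is a nice color
print(first)          # Gray
print(captured)       # ['Gray']
print(non_capturing)  # ['Gray is a nice color']
```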

Repetition

  • + matches the preceding token 1 or more times
  • ? makes the preceding token optional (0 or 1 times)
  • * matches the preceding token 0 or more times
  • {n,m} matches the preceding token at least n and at most m times; {n} matches it exactly n times
In [163]:
print(re.findall(r"\w+\s\d", "class 10"))
print(re.findall("colou?r", "color or colour"))
print(re.findall(r"\d*\stimes", "2 times 33 times 9999 times"))
print(re.findall(r"\w{5}\s?\d+", "class 10 class_10 class_123"))
['class 1']
['color', 'colour']
['2 times', '33 times', '9999 times']
['class 10', 'lass_10', 'lass_123']
  • Quantifiers can also be applied to a group
In [164]:
print(re.search("data(set)?", "data" ).group())
print(re.search("data(set)?", "dataset" ).group())
data
dataset
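The bounded quantifier {n,m} can be sketched in the same way. The inputs below are made up; {n,} (no upper bound) and a quantified non-capturing group (?:...) are included for comparison.

```python
import re

numbers = "1 22 333 4444"  # hypothetical sample input

# Between 2 and 3 digits: "1" is too short, "4444" yields only its first 3 digits
two_to_three = re.findall(r"\d{2,3}", numbers)

# 2 or more digits: leaving out the maximum removes the upper bound
two_or_more = re.findall(r"\d{2,}", numbers)

# A quantifier after a group repeats the whole group: exactly two "ab" in a row
repeated_group = re.findall(r"(?:ab){2}", "ab abab")

print(two_to_three)    # ['22', '333', '444']
print(two_or_more)     # ['22', '333', '4444']
print(repeated_group)  # ['abab']
```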

Greedy and Lazy Repetition

  • The repetition quantifiers are greedy: they expand the match as far as they can. In the following example, starting from the first <, the pattern could stop at <em> or continue to <em>website</em>. The default behavior is to take the longest of these.
In [143]:
print(re.search("<.+>", "This is my first <em>website</em>").group())
<em>website</em>
  • To make a quantifier lazy (it stops as soon as the rest of the pattern can match), add ? after it, in this case after the +.
In [144]:
print(re.search("<.+?>", "This is my first <em>website</em>").group())
<em>
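With re.findall the difference between the two modes shows up for every tag, not just the first. A negated character class such as [^>] is a common alternative to a lazy quantifier here. A brief sketch on the same string:

```python
import re

html = "This is my first <em>website</em>"

# Greedy: one match that runs from the first "<" to the last ">"
greedy = re.findall("<.+>", html)

# Lazy: each match stops at the nearest ">"
lazy = re.findall("<.+?>", html)

# Negated class: "one or more characters that are not >" cannot run past a tag's end
negated = re.findall("<[^>]+>", html)

print(greedy)   # ['<em>website</em>']
print(lazy)     # ['<em>', '</em>']
print(negated)  # ['<em>', '</em>']
```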