Session 7

1) Regular Expressions

A regular expression is a pattern that describes a set of strings; a program uses it to search text for matches. Shell wildcards in Unix commands such as "rm *.*" are similar in spirit, but the syntax of regular expressions is more elaborate. Several Unix programs (grep, sed, awk, ed, vi, emacs) use regular expressions, and many modern programming languages (such as Java) also support them.

In Python, a regular expression is first compiled:

keyword = re.compile(r"the ")

(The string "the " is compiled into a pattern object, which is stored in the variable keyword. The re module must be imported first with "import re".) Then the search for the regular expression is executed:

keyword.search(line)

To find lines that do not contain the regular expression, negate the test:

not keyword.search(line)

To search for a string that is stored in a variable, use:

keyword = re.compile(variable)
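
One caveat when the pattern comes from a variable: any special characters in it are interpreted as regular expression syntax. If the text should be matched literally, re.escape can be applied first. A minimal sketch, written for Python 3; the contents of the variable and the sample line are invented for illustration:

```python
import re

# Hypothetical search text; the dot is a special character.
variable = "3.5"

as_pattern = re.compile(variable)             # "." matches any character
as_literal = re.compile(re.escape(variable))  # "." matches only a dot

line = "version 365 released"
found_as_pattern = bool(as_pattern.search(line))  # "365" fits the pattern 3.5
found_as_literal = bool(as_literal.search(line))  # no literal "3.5" in the line
print(found_as_pattern, found_as_literal)
```

Whether to escape depends on the purpose: a pattern typed in by someone who knows regular expressions should be compiled as it is, while plain search text is usually safer escaped.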

For a case-insensitive search, use:

keyword = re.compile(r"the ",re.I)
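
The three kinds of search described above (matching, non-matching and case-insensitive) can be sketched together as follows. The sample lines are invented for illustration rather than taken from alice.txt, and the code is written for Python 3:

```python
import re

# Invented sample lines standing in for the contents of a file.
lines = ["The Rabbit ran.", "Down the hole.", "Alice fell."]

sensitive = re.compile(r"the ")
insensitive = re.compile(r"the ", re.I)

# Lines that contain "the " exactly as written.
with_the = [line for line in lines if sensitive.search(line)]
# Lines that contain "the " in any mix of upper and lower case.
with_any_the = [line for line in lines if insensitive.search(line)]
# Lines that do not contain "the " (case-sensitive test, negated).
without_the = [line for line in lines if not sensitive.search(line)]

print(with_the)
print(with_any_the)
print(without_the)
```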

Example:

#!/usr/bin/env python3
import re

# open the file and read all of its lines
with open("alice.txt", "r") as infile:
    text = infile.readlines()

# compiling the regular expression:
keyword = re.compile(r"the ")

# searching the file content line by line:
for line in text:
    if keyword.search(line):
        # each line still ends with its own newline
        print(line, end="")

To print both the string that was matched and the line in which it was found, the following script can be used. (Note that this version only works when searching for lines that do contain the regular expression. It does not work when searching for lines that do not contain it, because in that case there is no match to print.)

#!/usr/bin/env python3
import re

# open the file and read all of its lines
with open("alice.txt", "r") as infile:
    text = infile.readlines()

# compiling the regular expression:
keyword = re.compile(r"the ")

# searching the file content line by line:
for line in text:
    # search() returns a match object, or None if nothing was found
    result = keyword.search(line)
    if result:
        print(result.group(), ":", line, end="")
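
Besides the matched text, the match object returned by search also records where in the line the match was found. A small sketch for Python 3, using an invented sample line:

```python
import re

keyword = re.compile(r"the ", re.I)
line = "Alice followed the rabbit."  # invented sample line

result = keyword.search(line)
if result:
    matched_text = result.group()  # the text that matched
    position = result.span()       # (start, end) indexes within the line
    print(matched_text, position)

# When nothing matches, search returns None, which counts as false.
no_result = keyword.search("Curiouser and curiouser!")
print(no_result)
```

This is why the script above cannot report non-matching lines: for those lines there is no match object to call group() on.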

Exercises

(See the lists of special characters below.)

1.1 Retrieve all lines from alice.txt that do not contain "the ". Retrieve all lines that contain "the" with lower or upper case letters (hint: use the ignore case option).

2.1 Retrieve lines that have two consecutive o's.
2.2 Retrieve lines that contain a three letter string consisting of "s", then any character, then "e", such as "she".
2.3 Retrieve lines with a three letter word that starts with s and ends with e.
2.4 Retrieve lines that contain a word of any length that starts with s and ends with e. Modify this so that the word has at least four characters.
2.5 Retrieve lines that start with a. Retrieve lines that start with a and end with n.
2.6 Retrieve blank lines. Think of at least two ways of doing this.
2.7 Retrieve lines that do not contain the blank space character.
2.8 Retrieve lines that contain more than one blank space character.

3 Add a few lines with numbers etc. to the end of the alice.txt file so that you can search for the following regular expressions:

3.1 an odd digit followed by an even digit (e.g. 12 or 74)
3.2 a letter followed by a non-letter followed by a number
3.3 a word that starts with an upper case letter
3.4 the word "yes" in any combination of upper and lower case letters
3.5 the word "the" one or more times
3.6 a date in the form of one or two digits, a dot, one or two digits, a dot, two digits
3.7 a punctuation mark

4.1 Write a script that asks users for their name, address and phone number. Test each input for plausibility. For example, a phone number should contain no letters and should have a certain length, an address should have a certain format, etc. Ask the user to repeat the input if your script identifies it as incorrect.

4.2 Concerning your projects: what kind of checking is needed to ensure that users fill in the forms in a sensible manner? Make certain that your form can handle all kinds of input. For example, users can have several first names, middle initials, several last names (which may or may not be hyphenated).

Special Characters

.	Any single character except a newline
^	The beginning of the line or string
$	The end of the line or string
*	Zero or more of the preceding character or group
+	One or more of the preceding character or group
?	Zero or one of the preceding character or group
{5,10}	Five to ten of the preceding character or group
	for example: * equals {0,}; + equals {1,}
	? equals {0,1}
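
The quantifiers in this table can be tried out with a short sketch like the one below (Python 3). The helper name fits and the test strings are invented for illustration; each pattern is anchored with ^ and $ so that the whole string has to fit:

```python
import re

def fits(pattern, text):
    """Return True if the pattern is found in the text."""
    return re.search(pattern, text) is not None

words = ["b", "bo", "boo"]

star = [fits(r"^bo*$", w) for w in words]      # zero or more o's
plus = [fits(r"^bo+$", w) for w in words]      # one or more o's
optional = [fits(r"^bo?$", w) for w in words]  # zero or one o
# two to three o's
counted = [fits(r"^bo{2,3}$", w) for w in ["bo", "boo", "booo", "boooo"]]

print(star, plus, optional, counted)
```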

More special characters

[qjk]	Either q or j or k
[^qjk]	Neither q nor j nor k
[a-z]	Any character from a to z inclusive
[^a-z]	Any character that is not a lower case letter
[a-zA-Z]	Any letter
[a-z]+	Any non-empty sequence of lower case letters
jelly|cream	Either jelly or cream
(eg|le)gs	Either eggs or legs
(da)+	Either da or dada or dadada or...
\n	A newline
\t	A tab
\w	Any alphanumeric (word) character. For ASCII text, the same as [a-zA-Z0-9_]
\W	Any non-word character. For ASCII text, the same as [^a-zA-Z0-9_]
\d	Any digit. For ASCII text, the same as [0-9]
\D	Any non-digit. For ASCII text, the same as [^0-9]
\s	Any whitespace character: space, tab, newline, etc.
\S	Any non-whitespace character
\b	A word boundary, outside [] only
\B	Not a word boundary
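
A few of the entries above, applied to one invented sample sentence (Python 3). The sketch uses re.findall and re.finditer, which return all matches rather than just the first one; they are not introduced elsewhere in these notes:

```python
import re

text = "Sue has 2 eggs, 4 legs and a dadada."  # invented sample

digits = re.findall(r"\d", text)             # every single digit
capitalised = re.findall(r"[A-Z]\w*", text)  # upper case letter, then word characters
all_a = re.findall(r"a", text)               # every letter a
lone_a = re.findall(r"\ba\b", text)          # only a as a word of its own
# finditer yields match objects, so group() gives the whole match
# even though the pattern contains parentheses.
eggs_or_legs = [m.group() for m in re.finditer(r"(eg|le)gs", text)]
repeated_da = re.search(r"(da)+", text).group()

print(digits, capitalised, len(all_a), lone_a, eggs_or_legs, repeated_da)
```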

Escapes for special characters

\|	Vertical bar
\[	An open square bracket
\)	A closing parenthesis
\*	An asterisk
\^	A caret symbol
\/	A slash (in Python a slash also matches without the backslash)
\\	A backslash
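
The difference between an escaped and an unescaped special character can be seen in a sketch like this one (Python 3, invented sample text). Raw strings (r"...") are used so that Python itself leaves the backslashes alone:

```python
import re

text = r"yes|no costs 2*3 in C:\temp"  # invented sample containing |, * and \

bar = re.findall(r"\|", text)            # escaped: a literal vertical bar
either = re.findall(r"yes|no", text)     # unescaped: "yes" or "no"
star = re.findall(r"2\*3", text)         # escaped: the literal text 2*3
unescaped_star = re.findall(r"2*3", text)  # unescaped: zero or more 2s, then a 3
backslash = re.findall(r"\\", text)      # escaped: a literal backslash

print(bar, either, star, unescaped_star, backslash)
```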
