Wednesday, 1 May 2013

Learning python regular expressions

https://developers.google.com/edu/python/regular-expressions

Python Regular Expressions

Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions themselves sufficient for our Python exercises and shows how regular expressions work in Python. The Python "re" module provides regular expression support.

In Python a regular expression search is typically written as:

  match = re.search(pat, str)

The re.search() function takes a regular expression pattern and a string and searches for that pattern within the string. If the search succeeds, search() returns a match object; otherwise it returns None. The search is therefore usually followed immediately by an if-statement that tests whether it succeeded, as in the following example, which searches for the pattern 'word:' followed by a 3-letter word (details below):

  import re

  text = 'an example word:cat!!'
  match = re.search(r'word:\w\w\w', text)
  # If-statement after search() tests if it succeeded
  if match:
    print('found', match.group())  ## 'found word:cat'
  else:
    print('did not find')

The code match = re.search(pat, str) stores the search result in a variable named "match". Then the if-statement tests the match -- if true the search succeeded and match.group() is the matching text (e.g. 'word:cat'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.

The 'r' at the start of the pattern string designates a Python "raw" string, which passes backslashes through unchanged. This is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r' just as a habit.
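To see why the raw prefix matters, here is a minimal sketch comparing the same pattern text written with and without the 'r'. The variable names and the sample sentence are made up for illustration.

```python
import re

# In a regular string literal Python turns '\b' into a backspace character,
# while the raw literal leaves the backslash for the regex engine to read.
plain = '\bcat\b'
raw = r'\bcat\b'

print(len(plain))  # 5: backspace, c, a, t, backspace
print(len(raw))    # 7: backslash, b, c, a, t, backslash, b

text = 'a cat sat'
plain_match = re.search(plain, text)  # looks for literal backspace characters
raw_match = re.search(raw, text)      # looks for 'cat' at word boundaries

print(plain_match)        # None
print(raw_match.group())  # cat
```

Without the 'r', the same effect needs doubled backslashes ('\\bcat\\b'), which gets hard to read quickly.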

Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)
. (a period) -- matches any single character except newline '\n'
\w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
\b -- boundary between word and non-word
\s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
^ = start, $ = end -- match the start or end of the string
\ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
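The patterns in the list above can be tried one at a time. The following is only a sketch: the sample sentence and the variable names are invented for this illustration.

```python
import re

text = 'Send 2 emails to bob_smith@example.com today'

first_word_char = re.search(r'\w', text).group()   # first letter, digit or underbar
first_digit = re.search(r'\d', text).group()       # first decimal digit
first_space = re.search(r'\s', text).group()       # first whitespace character
any_three = re.search(r'...', text).group()        # any three non-newline chars
print(first_word_char, first_digit, repr(first_space), any_three)  # S 2 ' ' Sen

# \b only matches where a word char meets a non-word char, so the 'to'
# hidden inside 'today' is not a candidate.
standalone_to = re.search(r'\bto\b', text)
print(standalone_to.span())  # (14, 16)

# \. matches a real period instead of "any character".
literal_dot = re.search(r'example\.com', text).group()
print(literal_dot)  # example.com

# ^ and $ pin the match to the start and end of the string.
starts_with_send = re.search(r'^Send', text)
starts_with_emails = re.search(r'^emails', text)
ends_with_today = re.search(r'today$', text)
print(bool(starts_with_send), bool(starts_with_emails), bool(ends_with_today))
# True False True
```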

Basic Examples

Joke: what do you call a pig with three eyes? piiig!

The basic rules of regular expression search for a pattern within a string are:

The search proceeds through the string from start to end, stopping at the first match found
All of the pattern must be matched, but not all of the string
If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text

  ## Search for pattern 'iii' in string 'piiig'.
  ## All of the pattern must match, but it may appear anywhere.
  ## On success, match.group() is matched text.
  match = re.search(r'iii', 'piiig')  # found, match.group() == "iii"
  match = re.search(r'igs', 'piiig')  # not found, match is None

  ## . = any char but \n
  match = re.search(r'..g', 'piiig')  # found, match.group() == "iig"

  ## \d = digit char, \w = word char
  match = re.search(r'\d\d\d', 'p123g')  # found, match.group() == "123"
  match = re.search(r'\w\w\w', '@@abcd!!')  # found, match.group() == "abc"
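The rule that the search stops at the first match can be checked with match.span(), which reports where the match starts and ends. This is a small sketch with a made-up input string.

```python
import re

# Two candidates exist ('12' and '34'); search() reports the leftmost one.
match = re.search(r'\d\d', 'a12b34')
if match:
    print(match.group())  # 12
    print(match.span())   # (1, 3)

# All of the pattern must match: three digits in a row never occur here.
no_match = re.search(r'\d\d\d', 'a12b34')
print(no_match)  # None
```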

Thursday, 10 May 2012

Awk - with loops and regex

Awk built-in variables:

FNR         The input record number in the current input file.
NF          The number of fields in the current input record.
NR          The total number of input records seen so far.
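For readers more at home in Python, the three counters can be mimicked with a short generator. This is a sketch, not how awk is implemented: the function name scan and the in-memory sample "files" are invented for the illustration.

```python
import io

def scan(files, sep=','):
    """Yield (NR, FNR, NF, fields) for each record, mimicking awk's counters."""
    nr = 0
    for f in files:
        fnr = 0  # awk resets FNR at the start of each input file
        for line in f:
            nr += 1   # NR keeps counting across files
            fnr += 1
            # Caveat: awk reports NF = 0 for an empty line, this reports 1.
            fields = line.rstrip('\n').split(sep)
            yield nr, fnr, len(fields), fields

first = io.StringIO('a,b,c\nd,e\n')
second = io.StringIO('f,g,h,i\n')
records = list(scan([first, second]))
for nr, fnr, nf, fields in records:
    print(nr, fnr, nf, fields)
# 1 1 3 ['a', 'b', 'c']
# 2 2 2 ['d', 'e']
# 3 1 4 ['f', 'g', 'h', 'i']
```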

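// An awk command that prints, for each field of a comma-separated line, the field number, the field, and 1 or 0 depending on whether the line matches the regex "UST". The loop condition i<(NF) stops before the last field; use i<=NF to include it.
awk -F'[,]' '{print FNR; for(i=1;i<(NF);i++) { print i,$i, /UST/;} }' matrace_20120430234915.dat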

// An awk command that finds each field containing a date in the format "yyyy-mm-dd", splits it on "-", adds 1 to the day, and prints the result. The day is incremented numerically, so there is no rollover at the end of a month, and %d drops leading zeros from the month and day.
awk -F'[,]' '{print FNR; for(i=1;i<(NF);i++) { print i,$i; if(match($i,/(....)-(..)-(..)/)) { split($i,a,"-"); printf("%d-%d-%d\n",a[1],a[2],(a[3]+1));};} }'

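// The same transformation using sprintf to rebuild the whole comma-separated line, so fields without a date pass through unchanged. The first and last fields are copied without being checked for a date.
awk -F'[,]' '{x=sprintf("%s",$1); for(i=2;i<(NF);i++) { if(match($i,/(....)-(..)-(..)/)) { split($i,a,"-"); x=sprintf("%s,%s-%s-%d",x,a[1],a[2],(a[3]+1));} else { x=sprintf("%s,%s",x,$i); } } x=sprintf("%s,%s",x,$(NF)); print x;}'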
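For comparison, here is a rough Python sketch of the same idea. It is a variation rather than a line-for-line port: it checks every field (including the first and last), only accepts fields that are exactly a yyyy-mm-dd date, and uses datetime so the day rolls over correctly at month and year ends. The function name bump_dates and the sample line are made up.

```python
import re
from datetime import datetime, timedelta

DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

def bump_dates(line, sep=','):
    """Add one day to every field that is exactly a yyyy-mm-dd date."""
    out = []
    for field in line.split(sep):
        if DATE_RE.fullmatch(field):
            try:
                day = datetime.strptime(field, '%Y-%m-%d') + timedelta(days=1)
                field = day.strftime('%Y-%m-%d')
            except ValueError:
                pass  # looks like a date but is not valid (e.g. month 13): keep as is
        out.append(field)
    return sep.join(out)

result = bump_dates('id7,2012-04-30,UST,2012-12-31')
print(result)  # id7,2012-05-01,UST,2013-01-01
```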

Regex - regular expressions

http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=regularExpressions

A regular expression is a special string that describes a search pattern.

A regular expression (regex) is one or more non-empty branches, separated by '|'.
'|' - the regex matches anything that matches one of the branches

A branch is one or more atoms, concatenated.

An atom may be followed by a '*', '+', '?', or a bound.
* - 0 or more occurrences of the atom
+ - 1 or more occurrences of the atom
? - 0 or 1 occurrence of the atom
a{1,3} - a bound: the atom a occurs between 1 and 3 times
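The repetition operators and '|' behave the same way in Python's re module, so they can be tried there. A small sketch with invented sample strings:

```python
import re

zero_ok = re.search(r'pi*g', 'pg')          # * allows zero i's
one_needed = re.search(r'pi+g', 'pg')       # + needs at least one i
many = re.search(r'pi+g', 'piiig')          # + takes as many i's as it can
optional = re.search(r'colou?r', 'color')   # ? makes the u optional
bounded = re.search(r'a{1,3}', 'caaaat')    # a bound: at most three a's
either = re.search(r'cat|dog', 'hot dog')   # '|' tries each branch

print(zero_ok.group())   # pg
print(one_needed)        # None
print(many.group())      # piiig
print(optional.group())  # color
print(bounded.group())   # aaa
print(either.group())    # dog
```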

An atom is one of the following:
a regular expression enclosed in '()' (matching a match for the regular expression)
a bracket expression (see below)
'.' (matching any single character)
'^' (matching the null string at the beginning of a line)
'$' (matching the null string at the end of a line)
a '\' followed by one of the characters '^.[$()|*+?{\' (matching that character taken as an ordinary character)
a single character with no other significance (matching that character)
There is one more type of atom, the back reference: '\' followed by a non-zero decimal digit d matches the same sequence of characters matched by the d-th parenthesized subexpression (numbering subexpressions by the positions of their opening parentheses, left to right), so that (e.g.) '([bc])\1' matches 'bb' or 'cc' but not 'bc'.
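The back-reference example can be checked directly in Python, whose re module supports the same \1 syntax. In this sketch the second pattern and its sample sentence are additions for illustration.

```python
import re

# '([bc])\1' needs the same character twice, not just two characters from the set.
pattern = r'([bc])\1'
results = {candidate: bool(re.search(pattern, candidate))
           for candidate in ['bb', 'cc', 'bc']}
print(results)  # {'bb': True, 'cc': True, 'bc': False}

# A common use: spotting a word that is accidentally repeated.
doubled = re.search(r'(\w+) \1', 'it was was here')
print(doubled.group())   # was was
print(doubled.group(1))  # was
```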

Bracket expressions such as [0-9] or [a-z] specify a character range.
A bracket expression is a list of characters enclosed in '[]'. It normally matches any single character from the list.
If the list begins with '^', it matches any single character not from the rest of the list.
If two characters in the list are separated by '-', this is shorthand for the full range of characters between those two inclusive (e.g. '[0-9]' matches any decimal digit).
With the exception of ']', '^', and '-', all other special characters, including '\', lose their special significance within a bracket expression.
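Bracket expressions can be tried in Python as well. One difference from the description above: in Python's re module a backslash keeps its escaping role inside brackets, so this sketch sticks to cases where both behave the same. The sample strings are invented.

```python
import re

vowel = re.search(r'[aeiou]', 'rhythm and blues')  # any one character from the list
non_letter = re.search(r'[^a-z]', 'abc-def')       # leading ^ negates the list
digits = re.search(r'[0-9]+', 'room 404')          # range shorthand, repeated
operator = re.search(r'[.*+]', 'a+b')              # . * + are ordinary inside []

print(vowel.group())       # a
print(non_letter.group())  # -
print(digits.group())      # 404
print(operator.group())    # +
```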