Regular expression: match start or whitespace

Can a regular expression match whitespace or the start of a string?

I’m trying to replace the currency abbreviation GBP with a £ symbol. I could just match anything starting with GBP, but I’d like to be a bit more conservative and look for certain delimiters around it.

>>> import re
>>> text = u'GBP 5 Off when you spend GBP75.00'

>>> re.sub(ur'GBP([\W\d])', ur'£\g<1>', text) # matches GBP with any prefix
u'\xa3 5 Off when you spend \xa375.00'

>>> re.sub(ur'^GBP([\W\d])', ur'£\g<1>', text) # matches at start only
u'\xa3 5 Off when you spend GBP75.00'

>>> re.sub(ur'(\W)GBP([\W\d])', ur'\g<1>£\g<2>', text) # matches whitespace prefix only
u'GBP 5 Off when you spend \xa375.00'

Can I do both of the latter examples at the same time?

Answers:

Method 1

Use the OR “|” operator:

>>> re.sub(r'(^|\W)GBP([\W\d])', u'\g<1>£\g<2>', text)
u'\xa3 5 Off when you spend \xa375.00'
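
For anyone on Python 3, here is a minimal sketch of the same alternation idea. It assumes plain `str` literals instead of the `u''`/`ur''` prefixes, and the `XGBP5` input is only a made-up check that a letter-prefixed GBP is left alone.

```python
import re

text = 'GBP 5 Off when you spend GBP75.00'

# (^|\W) captures either the empty string at the start of the text or one
# non-word character; [\W\d] captures the delimiter or digit after GBP.
pattern = re.compile(r'(^|\W)GBP([\W\d])')

result = pattern.sub(r'\g<1>£\g<2>', text)
print(result)  # £ 5 Off when you spend £75.00

# Hypothetical input: GBP glued to a preceding letter is not touched.
untouched = pattern.sub(r'\g<1>£\g<2>', 'XGBP5')
print(untouched)  # XGBP5
```

Both captured groups are written back in the replacement, so the surrounding delimiters survive the substitution.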

Method 2

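\b is a word boundary: a zero-width position between a word character and a non-word character, so it can sit next to whitespace, at the beginning of a line, or next to a non-alphanumeric symbol (\bGBP\b). Note that \bGBP\b on its own will not match the GBP in GBP75.00, because P and 7 are both word characters.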
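
A quick sketch of what the word-boundary pattern does with the question's sample text (Python 3 assumed). It shows that the trailing boundary rules out a GBP that is followed directly by a digit.

```python
import re

text = 'GBP 5 Off when you spend GBP75.00'

# A word boundary sits between a word character and a non-word character
# (or the edge of the string). After the second GBP comes "7", which is a
# word character, so there is no boundary there and that GBP is skipped.
result = re.sub(r'\bGBP\b', '£', text)
print(result)  # £ 5 Off when you spend GBP75.00
```

So the pattern as written only handles the "GBP 5" case, not "GBP75.00".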

Method 3

This replaces GBP if it’s preceded by the start of a string or a word boundary (which the start of a string already is), and after GBP comes a numeric value or a word boundary:

re.sub(ur'\bGBP(?=\b|\d)', u'£', text)  # raw string, so \b is a word boundary and not a backspace

This removes the need for any unnecessary backreferencing by using a lookahead. Inclusive enough?
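
Here is a hedged Python 3 sketch of the lookahead approach. The second input string is invented purely to show two cases the pattern skips. It also checks why the pattern should be a raw string.

```python
import re

text = 'GBP 5 Off when you spend GBP75.00'

# In a non-raw string literal, '\b' is the backspace character (0x08),
# so the pattern must be written as a raw string.
print('\b' == '\x08')  # True

# The lookahead checks what follows GBP without consuming it,
# so nothing has to be captured and written back.
pattern = re.compile(r'\bGBP(?=\b|\d)')
result = pattern.sub('£', text)
print(result)  # £ 5 Off when you spend £75.00

# Hypothetical input: "GBPs" (letter after GBP) and "EGBP" (letter before)
# are both left unchanged.
skipped = pattern.sub('£', 'GBPs and EGBP 5')
print(skipped)  # GBPs and EGBP 5
```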

Method 4

A left-hand whitespace boundary – a position in the string that is either a string start or right after a whitespace character – can be expressed with

(?<!\S)   # A negative lookbehind requiring no non-whitespace char immediately to the left of the current position
(?<=\s|^) # A positive lookbehind requiring a whitespace or start of string immediately to the left of the current position
(?:\s|^)  # A non-capturing group matching either a whitespace or start of string
(\s|^)    # A capturing group matching either a whitespace or start of string

See a regex demo. Python 3 demo:

import re
rx = r'(?<!\S)GBP([\W\d])'
text = 'GBP 5 Off when you spend GBP75.00'
print( re.sub(rx, r'£\1', text) )
# => £ 5 Off when you spend £75.00

Note you may use \1 instead of \g<1> in the replacement pattern, since there is no need for an unambiguous backreference when it is not followed by a digit.
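
To illustrate when the unambiguous form does matter, here is a small sketch with a made-up pattern and input (not from the question). A literal digit directly after the backreference changes how the short form is read.

```python
import re

# Append a literal "0" right after group 1: \g<1> keeps the two apart.
ok = re.sub(r'(\d)', r'\g<1>0', 'a5')
print(ok)  # a50

# r'\10' is read as a reference to group 10, which this pattern lacks,
# so the re module raises an error.
try:
    re.sub(r'(\d)', r'\10', 'a5')
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
print(ambiguous_failed)  # True
```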

BONUS: A right-hand whitespace boundary can be expressed with the following patterns:

(?!\S)    # A negative lookahead requiring no non-whitespace char immediately to the right of the current position
(?=\s|$)  # A positive lookahead requiring a whitespace or end of string immediately to the right of the current position
(?:\s|$)  # A non-capturing group matching either a whitespace or end of string
(\s|$)    # A capturing group matching either a whitespace or end of string
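
As a rough sketch, the left-hand and right-hand boundaries can be combined to match GBP only as a whole whitespace-delimited token. The sample string is invented for illustration. The second half reflects my understanding that Python's `re` module only accepts fixed-width lookbehinds, so the `(?<=\s|^)` form is one for other regex flavors.

```python
import re

sample = 'GBP 5, GBP75 and (GBP) GBP'

# (?<!\S) and (?!\S) together require whitespace or a string edge
# on both sides of GBP.
tokens = re.findall(r'(?<!\S)GBP(?!\S)', sample)
print(len(tokens))  # 2

# The alternatives \s (width 1) and ^ (width 0) differ in width, which
# Python's re rejects inside a lookbehind.
try:
    re.compile(r'(?<=\s|^)GBP')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False
```

Only the first and last GBP match here; "GBP75" and "(GBP)" each have a non-whitespace neighbor.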

Method 5

I think you’re looking for '(^|\W)GBP([\W\d])'

Method 6

You can always trim leading and trailing whitespace from the token before you search if it’s not a matching/grouping situation that requires the full line.
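
One possible reading of this suggestion, sketched in Python 3: split the text into whitespace-delimited tokens first, so a plain `^` anchor is enough for each one. The `(?=\d|$)` lookahead is my own assumption about what should follow GBP. Rejoining with single spaces also collapses the original spacing.

```python
import re

text = '  GBP 5 Off when you spend GBP75.00  '

# str.split() with no arguments drops leading/trailing whitespace and
# splits on runs of whitespace.
tokens = text.split()

# With each token isolated, GBP only needs to be anchored at the token start.
converted = [re.sub(r'^GBP(?=\d|$)', '£', tok) for tok in tokens]
result = ' '.join(converted)
print(result)  # £ 5 Off when you spend £75.00
```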

Method 7

Yes, why not?

re.sub(u'^\W*GBP...

matches the start of the string, 0 or more non-word characters (such as whitespace), then GBP…

edit: Oh, I think you want alternation, use the |:

re.sub(u'(^|\W)GBP...
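
The two patterns above are truncated, so the following Python 3 sketch fills in the rest with the `([\W\d])` tail and replacement from the question. Treat that as an assumption rather than what this answer intended. It compares the anchored form with the alternation.

```python
import re

text = 'GBP 5 Off when you spend GBP75.00'

# Anchored form: only a GBP at the very start (after optional non-word
# characters) can match, so the second GBP is left as is.
start_only = re.sub(r'^(\W*)GBP([\W\d])', r'\g<1>£\g<2>', text)
print(start_only)  # £ 5 Off when you spend GBP75.00

# Alternation: start of string OR one non-word character before GBP.
both = re.sub(r'(^|\W)GBP([\W\d])', r'\g<1>£\g<2>', text)
print(both)  # £ 5 Off when you spend £75.00
```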

Method 8

It works in Perl:

$text = 'GBP 5 off when you spend GBP75';
$text =~ s/(\W|^)GBP([\W\d])/$1\$$2/g;
printf "$text\n";

The output is:

$ 5 off when you spend $75

Note that I stipulated that the match should be global, to get all occurrences.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
