Python regex: splitting on pattern match that is an empty string

With the re module, it seems that I am unable to split on pattern matches that are empty strings:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
['foobarbarbazbar']

In other words, even if a match is found, if it’s the empty string, even re.split cannot split the string.

The docs for re.split seem to support my results.

A “workaround” was easy enough to find for this particular case:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarbazbar').split('qux')
['foobar', 'barbaz', 'bar']

But this is an error-prone way of doing it because then I have to beware of strings that already contain the substring that I’m splitting on:

>>> re.sub(r'(?<!foo)(?=bar)', 'qux', 'foobarbarquxbar').split('qux')
['foobar', 'bar', '', 'bar']

Is there any better way to split on an empty pattern match with the re module? Additionally, why does re.split not allow me to do this in the first place? I know it’s possible with other split algorithms that work with regex; for example, I am able to do this with JavaScript’s built-in String.prototype.split().

Contents hide

Answers:

Method 1

Method 2

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

It is unfortunate that the split requires a non-zero-width match, but it hasn’t been to fixed yet, since quite a lot incorrect code depends on the current behaviour by using for example [something]*as the regex. Use of such patterns will now generate a FutureWarning and those that never can split anything, throw a ValueError from Python 3.5 onwards:

>>> re.split(r'(?<!foo)(?=bar)', 'foobarbarbazbar')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/re.py", line 212, in split
    return _compile(pattern, flags).split(string, maxsplit)
ValueError: split() requires a non-empty pattern match.

The idea is that after a certain period of warnings, the behaviour can be changed so that your regular expression would work again.

If you can’t use the regex module, you can write your own split function using re.finditer():

def megasplit(pattern, string):
    splits = list((m.start(), m.end()) for m in re.finditer(pattern, string))
    starts = [0] + [i[1] for i in splits]
    ends = [i[0] for i in splits] + [len(string)]
    return [string[start:end] for start, end in zip(starts, ends)]

print(megasplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))
print(megasplit(r'o', 'foobarbarbazbar'))

If you are sure that the matches are zero-width only, you can use the starts of the splits for easier code:

import re

def zerowidthsplit(pattern, string):
    splits = list(m.start() for m in re.finditer(pattern, string))
    starts = [0] + splits
    ends = splits + [ len(string) ]
    return [string[start:end] for start, end in zip(starts, ends)]

print(zerowidthsplit(r'(?<!foo)(?=bar)', 'foobarbarbazbar'))

Method 2

import regex
x="bazbarbarfoobar"
print regex.split(r"(?<!baz)(?=bar)",x,flags=regex.VERSION1)

You can use regex module here for this.

(.+?(?<!foo))(?=bar|$)|(.+?foo)$

Use re.findall .

See demo

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating