Split string with multiple delimiters in Python

I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.

I have a string that needs to be split by either a ‘;’ or ‘, ‘
That is, it has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched

Example string:

"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"

should be split into a list containing the following:

('b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]' , 'mesitylene [000108-67-8]', 'polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]')

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Luckily, Python has this built-in 🙂

import re
re.split('; |, ',str)

Update:
Following your comment:

>>> a='Beautiful, is; better*thannugly'
>>> import re
>>> re.split('; |, |*|n',a)
['Beautiful', 'is', 'better', 'than', 'ugly']

Method 2

Do a str.replace('; ', ', ') and then a str.split(', ')

Method 3

Here’s a safe way for any iterable of delimiters, using regular expressions:

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regexPattern = '|'.join(map(re.escape, delimiters))
>>> regexPattern
'a|\.\.\.|\(c\)'
>>> re.split(regexPattern, example)
['st', 'ckoverflow ', ' is ', 'wesome', " isn't it?"]

re.escape allows to build the pattern automatically and have the delimiters escaped nicely.

Here’s this solution as a function for your copy-pasting pleasure:

def split(delimiters, string, maxsplit=0):
    import re
    regexPattern = '|'.join(map(re.escape, delimiters))
    return re.split(regexPattern, string, maxsplit)

If you’re going to split often using the same delimiters, compile your regular expression beforehand like described and use RegexObject.split.


If you’d like to leave the original delimiters in the string, you can change the regex to use a lookbehind assertion instead:

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regexPattern = '|'.join('(?<={})'.format(re.escape(delim)) for delim in delimiters)
>>> regexPattern
'(?<=a)|(?<=\.\.\.)|(?<=\(c\))'
>>> re.split(regexPattern, example)
['sta', 'ckoverflow (c)', ' is a', 'wesome...', " isn't it?"]

(replace ?<= with ?= to attach the delimiters to the righthand side, instead of left)

Method 4

In response to Jonathan’s answer above, this only seems to work for certain delimiters. For example:

>>> a='Beautiful, is; better*thannugly'
>>> import re
>>> re.split('; |, |*|n',a)
['Beautiful', 'is', 'better', 'than', 'ugly']

>>> b='1999-05-03 10:37:00'
>>> re.split('- :', b)
['1999-05-03 10:37:00']

By putting the delimiters in square brackets it seems to work more effectively.

>>> re.split('[- :]', b)
['1999', '05', '03', '10', '37', '00']

Method 5

This is how the regex look like:

import re
# "semicolon or (a comma followed by a space)"
pattern = re.compile(r";|, ")

# "(semicolon or a comma) followed by a space"
pattern = re.compile(r"[;,] ")

print pattern.split(text)


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x