What is the pythonic way to split a string before the occurrences of a given set of characters?
For example, I want to split
'TheLongAndWindingRoad'
at any occurrence of an uppercase letter (possibly except the first), and obtain
['The', 'Long', 'And', 'Winding', 'Road'].
Edit: It should also split single occurrences, i.e.
from 'ABC' I’d like to obtain
['A', 'B', 'C'].
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Unfortunately it’s not possible to split on a zero-width match in Python. But you can use re.findall instead:
>>> import re
>>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][^A-Z]*', 'ABC')
['A', 'B', 'C']
Method 2
Here is an alternative regex solution. The problem can be reprased as “how do I insert a space before each uppercase letter, before doing the split”:
>>> s = "TheLongAndWindingRoad ABC A123B45" >>> re.sub( r"([A-Z])", r" 1", s).split() ['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']
This has the advantage of preserving all non-whitespace characters, which most other solutions do not.
Method 3
>>> import re
>>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][a-z]*', 'SplitAString')
['Split', 'A', 'String']
>>> re.findall('[A-Z][a-z]*', 'ABC')
['A', 'B', 'C']
If you want "It'sATest" to split to ["It's", 'A', 'Test'] change the rexeg to "[A-Z][a-z']*"
Method 4
Use a lookahead:
In Python 3.7, you can do this:
re.split('(?=[A-Z])', 'theLongAndWindingRoad')
And it yields:
['the', 'Long', 'And', 'Winding', 'Road']
Method 5
A variation on @ChristopheD ‘s solution
s = 'TheLongAndWindingRoad' pos = [i for i,e in enumerate(s+'A') if e.isupper()] parts = [s[pos[j]:pos[j+1]] for j in xrange(len(pos)-1)] print parts
Method 6
I think that a better answer might be to split the string up into words that do not end in a capital. This would handle the case where the string doesn’t start with a capital letter.
re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoad')
example:
>>> import re
>>> re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoadABC')
['about', 'The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C']
Method 7
import re
filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad"))
or
[s for s in re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad") if s]
Method 8
src = 'TheLongAndWindingRoad' glue = ' ' result = ''.join(glue + x if x.isupper() else x for x in src).strip(glue).split(glue)
Method 9
Another without regex and the ability to keep contiguous uppercase if wanted
def split_on_uppercase(s, keep_contiguous=False):
"""
Args:
s (str): string
keep_contiguous (bool): flag to indicate we want to
keep contiguous uppercase chars together
Returns:
"""
string_length = len(s)
is_lower_around = (lambda: s[i-1].islower() or
string_length > (i + 1) and s[i + 1].islower())
start = 0
parts = []
for i in range(1, string_length):
if s[i].isupper() and (not keep_contiguous or is_lower_around()):
parts.append(s[start: i])
start = i
parts.append(s[start:])
return parts
>>> split_on_uppercase('theLongWindingRoad')
['the', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWindingRoad')
['The', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWINDINGRoadT', True)
['The', 'Long', 'WINDING', 'Road', 'T']
>>> split_on_uppercase('ABC')
['A', 'B', 'C']
>>> split_on_uppercase('ABCD', True)
['ABCD']
>>> split_on_uppercase('')
['']
>>> split_on_uppercase('hello world')
['hello world']
Method 10
Pythonic way could be:
"".join([(" "+i if i.isupper() else i) for i in 'TheLongAndWindingRoad']).strip().split()
['The', 'Long', 'And', 'Winding', 'Road']
Works good for Unicode, avoiding re/re2.
"".join([(" "+i if i.isupper() else i) for i in 'СуперМаркетыПродажаКлиент']).strip().split()
['Супер', 'Маркеты', 'Продажа', 'Клиент']
Method 11
Alternative solution (if you dislike explicit regexes):
s = 'TheLongAndWindingRoad'
pos = [i for i,e in enumerate(s) if e.isupper()]
parts = []
for j in xrange(len(pos)):
try:
parts.append(s[pos[j]:pos[j+1]])
except IndexError:
parts.append(s[pos[j]:])
print parts
Method 12
An alternative way without using regex or enumerate:
word = 'TheLongAndWindingRoad'
list = [x for x in word]
for char in list:
if char != list[0] and char.isupper():
list[list.index(char)] = ' ' + char
fin_list = ''.join(list).split(' ')
I think it is clearer and simpler without chaining too many methods or using a long list comprehension that can be difficult to read.
Method 13
This is possible with the more_itertools.split_before tool.
import more_itertools as mit iterable = "TheLongAndWindingRoad" [ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())] # ['The', 'Long', 'And', 'Winding', 'Road']
It should also split single occurrences, i.e. from
'ABC'I’d like to obtain['A', 'B', 'C'].
iterable = "ABC" [ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())] # ['A', 'B', 'C']
more_itertools is a third-party package with 60+ useful tools including implementations for all of the original itertools recipes, which obviates their manual implementation.
Method 14
An alternate way using enumerate and isupper()
Code:
strs = 'TheLongAndWindingRoad'
ind =0
count =0
new_lst=[]
for index, val in enumerate(strs[1:],1):
if val.isupper():
new_lst.append(strs[ind:index])
ind=index
if ind<len(strs):
new_lst.append(strs[ind:])
print new_lst
Output:
['The', 'Long', 'And', 'Winding', 'Road']
Method 15
Sharing what came to mind when I read the post. Different from other posts.
strs = 'TheLongAndWindingRoad'
# grab index of uppercase letters in strs
start_idx = [i for i,j in enumerate(strs) if j.isupper()]
# create empty list
strs_list = []
# initiate counter
cnt = 1
for pos in start_idx:
start_pos = pos
# use counter to grab next positional element and overlook IndexeError
try:
end_pos = start_idx[cnt]
except IndexError:
continue
# append to empty list
strs_list.append(strs[start_pos:end_pos])
cnt += 1
Method 16
Replace every uppercase letter ‘L’ in the given with an empty space plus that letter ” L”. We can do this using list comprehension or we can define a function to do it as follows.
s = 'TheLongANDWindingRoad ABC A123B45' ''.join([char if (char.islower() or not char.isalpha()) else ' '+char for char in list(s)]).strip().split() >>> ['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']
If you choose to go by a function, here is how.
def splitAtUpperCase(text):
result = ""
for char in text:
if char.isupper():
result += " " + char
else:
result += char
return result.split()
In the case of the given example:
print(splitAtUpperCase('TheLongAndWindingRoad'))
>>>['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road']
But most of the time that we are splitting a sentence at upper case letters, it is usually the case that we want to maintain abbreviations that are typically a continuous stream of uppercase letters. The code below would help.
def splitAtUpperCase(s):
for i in range(len(s)-1)[::-1]:
if s[i].isupper() and s[i+1].islower():
s = s[:i]+' '+s[i:]
if s[i].isupper() and s[i-1].islower():
s = s[:i]+' '+s[i:]
return s.split()
splitAtUpperCase('TheLongANDWindingRoad')
>>> ['The', 'Long', 'AND', 'Winding', 'Road']
Thanks.
Method 17
You might also wanna do it this way
def camelcase(s):
words = []
for char in s:
if char.isupper():
words.append(':'+char)
else:
words.append(char)
words = ((''.join(words)).split(':'))
return len(words)
This will output as follows
s = 'oneTwoThree' print(camecase(s) //['one', 'Two', 'Three']
Method 18
def solution(s):
st = ''
for c in s:
if c == c.upper():
st += ' '
st += c
return st
Method 19
I’m using list
def split_by_upper(x):
i = 0
lis = list(x)
while True:
if i == len(lis)-1:
if lis[i].isupper():
lis.insert(i,",")
break
if lis[i].isupper() and i != 0:
lis.insert(i,",")
i+=1
i+=1
return "".join(lis).split(",")
OUTPUT:
data = "TheLongAndWindingRoad" print(split_by_upper(data))` >> ['The', 'Long', 'And', 'Winding', 'Road']
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0