Split a string at uppercase letters

What is the pythonic way to split a string before the occurrences of a given set of characters?

For example, I want to split
'TheLongAndWindingRoad'
at any occurrence of an uppercase letter (possibly except the first), and obtain
['The', 'Long', 'And', 'Winding', 'Road'].

Edit: It should also split single occurrences, i.e.
from 'ABC' I’d like to obtain
['A', 'B', 'C'].

Contents hide

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Unfortunately it’s not possible to split on a zero-width match in Python. But you can use re.findall instead:

>>> import re
>>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][^A-Z]*', 'ABC')
['A', 'B', 'C']

Method 2

Here is an alternative regex solution. The problem can be reprased as “how do I insert a space before each uppercase letter, before doing the split”:

>>> s = "TheLongAndWindingRoad ABC A123B45"
>>> re.sub( r"([A-Z])", r" 1", s).split()
['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

This has the advantage of preserving all non-whitespace characters, which most other solutions do not.

Method 3

>>> import re
>>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']

>>> re.findall('[A-Z][a-z]*', 'SplitAString')
['Split', 'A', 'String']

>>> re.findall('[A-Z][a-z]*', 'ABC')
['A', 'B', 'C']

If you want "It'sATest" to split to ["It's", 'A', 'Test'] change the rexeg to "[A-Z][a-z']*"

Method 4

Use a lookahead:

In Python 3.7, you can do this:

re.split('(?=[A-Z])', 'theLongAndWindingRoad')

And it yields:

['the', 'Long', 'And', 'Winding', 'Road']

Method 5

A variation on @ChristopheD ‘s solution

s = 'TheLongAndWindingRoad'

pos = [i for i,e in enumerate(s+'A') if e.isupper()]
parts = [s[pos[j]:pos[j+1]] for j in xrange(len(pos)-1)]

print parts

Method 6

I think that a better answer might be to split the string up into words that do not end in a capital. This would handle the case where the string doesn’t start with a capital letter.

 re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoad')

example:

>>> import re
>>> re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoadABC')
['about', 'The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C']

Method 7

import re
filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad"))

[s for s in re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad") if s]

Method 8

src = 'TheLongAndWindingRoad'
glue = ' '

result = ''.join(glue + x if x.isupper() else x for x in src).strip(glue).split(glue)

Method 9

Another without regex and the ability to keep contiguous uppercase if wanted

def split_on_uppercase(s, keep_contiguous=False):
    """

    Args:
        s (str): string
        keep_contiguous (bool): flag to indicate we want to 
                                keep contiguous uppercase chars together

    Returns:

    """

    string_length = len(s)
    is_lower_around = (lambda: s[i-1].islower() or 
                       string_length > (i + 1) and s[i + 1].islower())

    start = 0
    parts = []
    for i in range(1, string_length):
        if s[i].isupper() and (not keep_contiguous or is_lower_around()):
            parts.append(s[start: i])
            start = i
    parts.append(s[start:])

    return parts

>>> split_on_uppercase('theLongWindingRoad')
['the', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWindingRoad')
['The', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWINDINGRoadT', True)
['The', 'Long', 'WINDING', 'Road', 'T']
>>> split_on_uppercase('ABC')
['A', 'B', 'C']
>>> split_on_uppercase('ABCD', True)
['ABCD']
>>> split_on_uppercase('')
['']
>>> split_on_uppercase('hello world')
['hello world']

Method 10

Pythonic way could be:

"".join([(" "+i if i.isupper() else i) for i in 'TheLongAndWindingRoad']).strip().split()
['The', 'Long', 'And', 'Winding', 'Road']

Works good for Unicode, avoiding re/re2.

"".join([(" "+i if i.isupper() else i) for i in 'СуперМаркетыПродажаКлиент']).strip().split()
['Супер', 'Маркеты', 'Продажа', 'Клиент']

Method 11

Alternative solution (if you dislike explicit regexes):

s = 'TheLongAndWindingRoad'

pos = [i for i,e in enumerate(s) if e.isupper()]

parts = []
for j in xrange(len(pos)):
    try:
        parts.append(s[pos[j]:pos[j+1]])
    except IndexError:
        parts.append(s[pos[j]:])

print parts

Method 12

An alternative way without using regex or enumerate:

word = 'TheLongAndWindingRoad'
list = [x for x in word]

for char in list:
    if char != list[0] and char.isupper():
        list[list.index(char)] = ' ' + char

fin_list = ''.join(list).split(' ')

I think it is clearer and simpler without chaining too many methods or using a long list comprehension that can be difficult to read.

Method 13

This is possible with the more_itertools.split_before tool.

import more_itertools as mit


iterable = "TheLongAndWindingRoad"
[ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['The', 'Long', 'And', 'Winding', 'Road']

It should also split single occurrences, i.e. from 'ABC' I’d like to obtain ['A', 'B', 'C'].

iterable = "ABC"
[ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['A', 'B', 'C']

more_itertools is a third-party package with 60+ useful tools including implementations for all of the original itertools recipes, which obviates their manual implementation.

Method 14

An alternate way using enumerate and isupper()

Code:

strs = 'TheLongAndWindingRoad'
ind =0
count =0
new_lst=[]
for index, val in enumerate(strs[1:],1):
    if val.isupper():
        new_lst.append(strs[ind:index])
        ind=index
if ind<len(strs):
    new_lst.append(strs[ind:])
print new_lst

Output:

['The', 'Long', 'And', 'Winding', 'Road']

Method 15

Sharing what came to mind when I read the post. Different from other posts.

strs = 'TheLongAndWindingRoad'

# grab index of uppercase letters in strs
start_idx = [i for i,j in enumerate(strs) if j.isupper()]

# create empty list
strs_list = []

# initiate counter
cnt = 1

for pos in start_idx:
    start_pos = pos

    # use counter to grab next positional element and overlook IndexeError
    try:
        end_pos = start_idx[cnt]
    except IndexError:
        continue

    # append to empty list
    strs_list.append(strs[start_pos:end_pos])

    cnt += 1

Method 16

Replace every uppercase letter ‘L’ in the given with an empty space plus that letter ” L”. We can do this using list comprehension or we can define a function to do it as follows.

s = 'TheLongANDWindingRoad ABC A123B45'
''.join([char if (char.islower() or not char.isalpha()) else ' '+char for char in list(s)]).strip().split()
>>> ['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']

If you choose to go by a function, here is how.

def splitAtUpperCase(text):
    result = ""
    for char in text:
        if char.isupper():
            result += " " + char
        else:
            result += char
    return result.split()

In the case of the given example:

print(splitAtUpperCase('TheLongAndWindingRoad')) 
>>>['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road']

But most of the time that we are splitting a sentence at upper case letters, it is usually the case that we want to maintain abbreviations that are typically a continuous stream of uppercase letters. The code below would help.

def splitAtUpperCase(s):
    for i in range(len(s)-1)[::-1]:
        if s[i].isupper() and s[i+1].islower():
            s = s[:i]+' '+s[i:]
        if s[i].isupper() and s[i-1].islower():
            s = s[:i]+' '+s[i:]
    return s.split()

splitAtUpperCase('TheLongANDWindingRoad')

>>> ['The', 'Long', 'AND', 'Winding', 'Road']

Thanks.

Method 17

You might also wanna do it this way

def camelcase(s):
    
    words = []
    
    for char in s:
        if char.isupper():
            words.append(':'+char)
        else:
            words.append(char)
    words = ((''.join(words)).split(':'))
    
    return len(words)

This will output as follows

s = 'oneTwoThree'
print(camecase(s)
//['one', 'Two', 'Three']

Method 18

def solution(s):
   
    st = ''
    for c in s:
        if c == c.upper():
            st += ' '   
        st += c    
       
    return st

Method 19

I’m using list

def split_by_upper(x): 
i = 0       
lis = list(x)
while True:
    if i == len(lis)-1:
        if lis[i].isupper():
            lis.insert(i,",")
        break
    if lis[i].isupper() and i != 0:
        lis.insert(i,",")
        i+=1
    i+=1
return "".join(lis).split(",")

OUTPUT:

data = "TheLongAndWindingRoad"
print(split_by_upper(data))`
>> ['The', 'Long', 'And', 'Winding', 'Road']

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating