I have a list of product codes in a text file. Each line holds product codes that look like:
abcd2343 abw34324 abc3243-23A
So each code is letters followed by numbers and other characters.
I want to split on the first occurrence of a number.
Answers:
Method 1
import re
s = 'abcd2343 abw34324 abc3243-23A'
re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A']
Or, if you want to split on the first occurrence of a digit:
re.findall(r'\d*\D+', s)
> ['abcd', '2343 abw', '34324 abc', '3243-', '23A']
\d+ matches 1-or-more digits.
\d*\D+ matches 0-or-more digits followed by 1-or-more non-digits.
\d+|\D+ matches 1-or-more digits or 1-or-more non-digits.
Consult the docs for more about Python’s regex syntax.
re.split(pat, s) will split the string s using pat as the delimiter. If pat begins and ends with parentheses (so as to be a “capturing group”), then re.split will return the substrings matched by pat as well. For instance, compare:
re.split(r'\d+', s)
> ['abcd', ' abw', ' abc', '-', 'A'] # <-- just the non-matching parts
re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A'] # <-- both the non-matching parts and the captured groups
In contrast, re.findall(pat, s) returns only the parts of s that match pat:
re.findall(r'\d+', s)
> ['2343', '34324', '3243', '23']
Thus, if s ends with a digit, you can avoid a trailing empty string by using re.findall(r'\d+|\D+', s) instead of re.split(r'(\d+)', s):
s='abcd2343 abw34324 abc3243-23A 123'
re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123', '']
re.findall(r'\d+|\D+', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123']
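Applied to the data in the question, the same findall pattern can be run on each code separately rather than on the whole line. This is only a minimal sketch, assuming the codes are separated by whitespace; the name tokens is illustrative and not part of the original answer.

```python
import re

s = 'abcd2343 abw34324 abc3243-23A'

# Tokenise each whitespace-separated code into runs of digits and non-digits.
tokens = [re.findall(r'\d+|\D+', code) for code in s.split()]
print(tokens)
# [['abcd', '2343'], ['abw', '34324'], ['abc', '3243', '-', '23', 'A']]
```

Splitting on whitespace first keeps the separating spaces out of the tokens.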
Method 2
This function handles float and negative numbers as well.
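import re
def separate_number_chars(s):
    # Split on signed floats or signed integers, keeping the matched numbers.
    res = re.split(r'([-+]?\d+\.\d+)|([-+]?\d+)', s.strip())
    # Drop None (the group that did not match) and empty or whitespace-only pieces.
    res_f = [r.strip() for r in res if r is not None and r.strip() != '']
    return res_f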
For example:
separate_number_chars('-12.1grams')
> ['-12.1', 'grams']
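One thing to watch with this approach: because the pattern accepts an optional sign, a hyphen directly before digits is read as a minus sign. The sketch below repeats the function (with the regex escapes written out) and runs it on one of the product codes from the question, to show what that means in practice.

```python
import re

def separate_number_chars(s):
    # Same logic as the function above: split on signed floats or integers.
    res = re.split(r'([-+]?\d+\.\d+)|([-+]?\d+)', s.strip())
    return [r.strip() for r in res if r is not None and r.strip() != '']

print(separate_number_chars('abc3243-23A'))
# ['abc', '3243', '-23', 'A']
```

If the hyphen in a code is a separator rather than a sign, dropping [-+]? from the pattern may be the better fit.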
Method 3
import re
m = re.match(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$",input)
m.group('letters')
m.group('the_rest')
This covers your corner case of abc3243-23A: it gives abc for the letters group and 3243-23A for the_rest.
Since you said the codes are all on individual lines, you'll need to pass one line at a time as input.
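Feeding one line at a time could look like the sketch below. The helper name split_codes and the sample lines are assumptions made for illustration; in practice, lines could be an open file object, since iterating over a file yields one line at a time.

```python
import re

# Same pattern as above, compiled once for reuse.
CODE_RE = re.compile(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$")

def split_codes(lines):
    """Return (letters, the_rest) for each line that matches the pattern."""
    pairs = []
    for line in lines:
        m = CODE_RE.match(line.strip())
        if m is None:
            # No leading letters, or nothing after them: skip this line.
            continue
        pairs.append((m.group('letters'), m.group('the_rest')))
    return pairs

lines = ["abcd2343\n", "abw34324\n", "abc3243-23A\n", "\n"]
print(split_codes(lines))
# [('abcd', '2343'), ('abw', '34324'), ('abc', '3243-23A')]
```

Note that a line made up only of letters still matches, with its last letter ending up in the_rest, because .+ needs at least one character.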
Method 4
To partition on the first digit:
parts = re.split(r'(\d.*)', 'abcd2343') # => ['abcd', '2343', '']
parts = re.split(r'(\d.*)', 'abc3243-23A') # => ['abc', '3243-23A', '']
So the two parts are always parts[0] and parts[1].
Of course, you can apply this to multiple codes:
>>> s = "abcd2343 abw34324 abc3243-23A"
>>> results = [re.split(r'(\d.*)', pcode) for pcode in s.split(' ')]
>>> results
[['abcd', '2343', ''], ['abw', '34324', ''], ['abc', '3243-23A', '']]
If each code is on its own line, use s.splitlines() instead of s.split(' ').
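The splitlines() variant might be wrapped up as in the sketch below. The helper name partition_on_first_digit is hypothetical, and the handling of codes without any digit is an added assumption: in that case re.split returns a single-element list, so parts[1] would not exist.

```python
import re

def partition_on_first_digit(code):
    """Split code into (text before the first digit, first digit onward)."""
    parts = re.split(r'(\d.*)', code, maxsplit=1)
    head = parts[0]
    # With no digit in the code, re.split returns a single-element list.
    tail = parts[1] if len(parts) > 1 else ''
    return head, tail

text = "abcd2343\nabw34324\nabc3243-23A\nnodigits"
print([partition_on_first_digit(c) for c in text.splitlines()])
# [('abcd', '2343'), ('abw', '34324'), ('abc', '3243-23A'), ('nodigits', '')]
```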
Method 5
import re
def firstIntIndex(string):
    # Return the index of the first digit in string, or -1 if there is none.
    result = -1
    for k in range(0, len(string)):
        if re.match(r'\d', string[k]):
            result = k
            break
    return result
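The same index can also be found without an explicit loop by letting re.search locate the first digit, and then used to slice the code in two. This is a sketch of an alternative, not the answer's own code; the name first_digit_index is made up for the example.

```python
import re

def first_digit_index(string):
    """Return the index of the first digit in string, or -1 if there is none."""
    m = re.search(r'\d', string)
    return m.start() if m else -1

code = 'abc3243-23A'
i = first_digit_index(code)
if i != -1:
    # Slice at the index to get the two halves.
    print(code[:i], code[i:])  # abc 3243-23A
```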
Method 6
Try this code:
import re
text = "MARIA APARECIDA 99223-2000 / 98450-8026"
parts = re.split(r' (?=\d)', text, maxsplit=1)
print(parts)
Output:
['MARIA APARECIDA', '99223-2000 / 98450-8026']
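The lookahead (?=\d) checks that a digit follows without consuming it, which is why the digits stay in the second part. The product codes in the question have no space before the digits, so a variant that drops the space from the pattern may fit better. The sketch below assumes Python 3.7 or later, where re.split can split on an empty match.

```python
import re

for code in ['abcd2343', 'abc3243-23A']:
    # (?=\d) matches the empty position just before the first digit.
    print(re.split(r'(?=\d)', code, maxsplit=1))
# ['abcd', '2343']
# ['abc', '3243-23A']
```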
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.