What is an efficient way to check that a string s in Python consists of just one character, say 'A'? Something like all_equal(s, 'A') which would behave like this:
all_equal("AAAAA", "A") = True
all_equal("AAAAAAAAAAA", "A") = True
all_equal("AAAAAfAAAAA", "A") = False
Two seemingly inefficient ways would be to: first convert the string to a list and check each element, or second to use a regular expression. Are there more efficient ways or are these the best one can do in Python? Thanks.
Answers:
Method 1
This is by far the fastest, several times faster than even count(). You can time it yourself with mgilson’s excellent timing suite:
s == len(s) * s[0]
Here all the checking is done inside Python’s C code, which just:
- allocates len(s) characters;
- fills the space with the first character;
- compares two strings.
The longer the string, the greater the time savings. However, as mgilson writes, it creates a copy of the string, so if your string is many millions of symbols long, that may become a problem.
As the timing results show, the fastest ways to solve the task generally do not execute any Python code for each symbol. However, the set() solution also does all its work inside the C code of the Python library, yet it is still slow, probably because it operates on the string through the Python object interface.
UPD: Concerning the empty string case: what to do with it depends strongly on the task. If the task is “check if all the symbols in a string are the same”, s == len(s) * s[0] is a valid answer (no symbols means an error, and an exception is OK). If the task is “check if there is exactly one unique symbol”, the empty string should give us False, and the answer is s and s == len(s) * s[0], or bool(s) and s == len(s) * s[0] if you prefer receiving boolean values. Finally, if we understand the task as “check if there are no different symbols”, the result for the empty string is True, and the answer is not s or s == len(s) * s[0].
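The three readings of the task above can be sketched as three small functions. This is only an illustration written for Python 3; the function names are made up here and do not come from the original answer.

```python
def all_same_strict(s):
    """'All symbols are the same'; an empty string raises IndexError."""
    return s == len(s) * s[0]


def exactly_one_unique(s):
    """'Exactly one unique symbol'; an empty string gives False."""
    return bool(s) and s == len(s) * s[0]


def no_different_symbols(s):
    """'No different symbols'; an empty string gives True."""
    return not s or s == len(s) * s[0]


if __name__ == "__main__":
    for text in ("AAAAA", "AAAAAfAAAAA", ""):
        print(repr(text), exactly_one_unique(text), no_different_symbols(text))
```

Which variant is right depends entirely on how the caller wants the empty string treated.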
Method 2
>>> s = 'AAAAAAAAAAAAAAAAAAA'
>>> s.count(s[0]) == len(s)
True
This doesn’t short circuit. A version which does short-circuit would be:
>>> all(x == s[0] for x in s)
True
However, I have a feeling that due to the optimized C implementation, the non-short-circuiting version will probably perform better on some strings (depending on size, etc.).
Here’s a simple timeit script to test some of the other options posted:
import timeit
import re

# Note: this is Python 2 code (print statements, two-argument str.translate).

def test_regex(s, regex=re.compile(r'^(.)\1*$')):
    # \1 is a backreference to the first captured character
    return bool(regex.match(s))

def test_all(s):
    return all(x == s[0] for x in s)

def test_count(s):
    return s.count(s[0]) == len(s)

def test_set(s):
    return len(set(s)) == 1

def test_replace(s):
    return not s.replace(s[0], '')

def test_translate(s):
    return not s.translate(None, s[0])

def test_strmul(s):
    return s == s[0]*len(s)

tests = ('test_all','test_count','test_set','test_replace','test_translate','test_strmul','test_regex')

print "WITH ALL EQUAL"
for test in tests:
    print test, timeit.timeit('%s(s)'%test,'from __main__ import %s; s="AAAAAAAAAAAAAAAAA"'%test)
    if globals()[test]("AAAAAAAAAAAAAAAAA") != True:
        print globals()[test]("AAAAAAAAAAAAAAAAA")
        raise AssertionError

print
print "WITH FIRST NON-EQUAL"
for test in tests:
    print test, timeit.timeit('%s(s)'%test,'from __main__ import %s; s="FAAAAAAAAAAAAAAAA"'%test)
    if globals()[test]("FAAAAAAAAAAAAAAAA") != False:
        print globals()[test]("FAAAAAAAAAAAAAAAA")
        raise AssertionError
On my machine (OS-X 10.5.8, core2duo, python2.7.3) with these contrived (short) strings, str.count smokes set and all, and beats str.replace by a little, but is edged out by str.translate and strmul is currently in the lead by a good margin:
WITH ALL EQUAL
test_all 5.83863711357
test_count 0.947771072388
test_set 2.01028490067
test_replace 1.24682998657
test_translate 0.941282987595
test_strmul 0.629556179047
test_regex 2.52913498878

WITH FIRST NON-EQUAL
test_all 2.41147494316
test_count 0.942595005035
test_set 2.00480484962
test_replace 0.960338115692
test_translate 0.924381017685
test_strmul 0.622269153595
test_regex 1.36632800102
The timings could be slightly (or even significantly?) different between different systems and with different strings, so that would be worth looking into with an actual string you’re planning on passing.
Finally, if you hit the best case for all often enough (a mismatch near the start of the string), and your strings are long enough, you might want to consider that one. It’s a better algorithm, since it can stop at the first mismatch. I would avoid the set solution though, as I don’t see any case where it could possibly beat out the count solution.
If memory could be an issue, you’ll need to avoid str.translate, str.replace and strmul as those create a second string, but this isn’t usually a concern these days.
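If memory does matter, the checks that avoid building a second string can be wrapped to match the all_equal(s, 'A') signature from the question. This is a minimal Python 3 sketch; the function names and the single-character check are assumptions added here for illustration.

```python
def all_equal_count(s, ch):
    """Scan the whole string in C without building a second string."""
    if len(ch) != 1:
        raise ValueError("ch must be a single character")
    return s.count(ch) == len(s)


def all_equal_short_circuit(s, ch):
    """Stop at the first mismatch; runs Python-level code per character."""
    if len(ch) != 1:
        raise ValueError("ch must be a single character")
    return all(c == ch for c in s)


if __name__ == "__main__":
    print(all_equal_count("AAAAA", "A"))
    print(all_equal_short_circuit("AAAAAfAAAAA", "A"))
```

Both variants return True for an empty string, so add an explicit check if that case should be rejected.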
Method 3
You could convert to a set and check there is only one member:
len(set("AAAAAAAA")) == 1
Method 4
Try using the built-in function all:
all(c == 'A' for c in s)
Method 5
Adding another solution to this problem
>>> not "AAAAAA".translate(None,"A")
True
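The two-argument form of translate shown above only works on Python 2 byte strings. On Python 3, a deletion table built with str.maketrans gives the same effect. A small sketch follows; the function name all_equal_translate is made up for this example.

```python
def all_equal_translate(s, ch):
    """True if nothing is left after deleting every occurrence of ch."""
    # The third argument of str.maketrans lists the characters to delete.
    table = str.maketrans('', '', ch)
    return not s.translate(table)


if __name__ == "__main__":
    print(all_equal_translate("AAAAAA", "A"))
    print(all_equal_translate("AAAfAA", "A"))
```

Like the original one-liner, this builds a second string and treats an empty input as True.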
Method 6
If you need to check whether all the characters in the string are the same and equal to a given character, remove all duplicates and check whether the result equals the single character.
>>> set("AAAAA") == set("A")
True
If you only want to check whether all the characters are the same, whatever that character is, just check the length of the set:
>>> len(set("AAAAA")) == 1
True
Method 7
Interesting answers so far. Here’s another:
flag = True
for c in 'AAAAAAAfAAAA':
    if not c == 'A':
        flag = False
        break
The only advantage I can think of to mine is that it doesn’t need to traverse the entire string if it finds an inconsistent character.
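The same loop can be wrapped in a function, where an early return takes the place of the flag and the break. This is just a sketch of that idea; the name all_equal_loop is hypothetical.

```python
def all_equal_loop(s, ch):
    """Return False at the first character that differs from ch."""
    for c in s:
        if c != ch:
            return False  # stop early, the rest of the string is not read
    return True


if __name__ == "__main__":
    print(all_equal_loop("AAAAAAAfAAAA", "A"))
    print(all_equal_loop("AAAAAAAAAAAA", "A"))
```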
Method 8
not len("AAAAAAAAA".replace('A', ''))
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.