I have:
df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})
col1 col2
0 asdf 1
1 xy 2
2 q 3
I’d like to take the “combinatoric product” of each letter from the strings in col1, with each elementwise int in col2. I.e.:
col1 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3
Current method:
from itertools import product
pieces = []
for _, s in df.iterrows():
    letters = list(s.col1)
    prods = list(product(letters, [s.col2]))
    pieces.append(pd.DataFrame(prods))
pd.concat(pieces)
Any more efficient workarounds?
Answers:
Method 1
Using list + str.join and np.repeat –
pd.DataFrame(
{
'col1' : list(''.join(df.col1)),
'col2' : df.col2.values.repeat(df.col1.str.len(), axis=0)
})
col1 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3
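Why this works: joining all the strings and splitting the result into characters gives exactly sum(len) items, and repeating each col2 value len times lines them up one-to-one. Here is a self-contained sketch of the same idea written with np.repeat. It assumes every entry in col1 is a plain string (a NaN would break the join).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})

# Number of characters in each string (assumes no missing values in col1)
lengths = df['col1'].str.len().to_numpy()

result = pd.DataFrame({
    # One row per character, in the original row order
    'col1': list(''.join(df['col1'])),
    # Repeat each col2 value once per character of its string
    'col2': np.repeat(df['col2'].to_numpy(), lengths),
})
```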
A generalised solution for any number of columns is easily achievable, without much change to the solution –
i = list(''.join(df.col1))
j = df.drop(columns='col1').values.repeat(df.col1.str.len(), axis=0)
df = pd.DataFrame(j, columns=df.columns.drop('col1'))
df.insert(0, 'col1', i)
df
col1 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3
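One thing to watch with the .values approach is that mixed column types get upcast to object. A possible variant, sketched below, repeats whole rows by position with iloc so each column keeps its dtype. Note this swaps in a different technique from the one above, and expand_chars is a made-up helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

def expand_chars(df, col):
    """Hypothetical helper: one output row per character of df[col]."""
    lengths = df[col].str.len().to_numpy()
    # Repeat each row position once per character; iloc keeps column dtypes
    positions = np.repeat(np.arange(len(df)), lengths)
    out = df.iloc[positions].reset_index(drop=True)
    # Overwrite the string column with the individual characters
    out[col] = list(''.join(df[col]))
    return out

df3 = pd.DataFrame({'col1': ['ab', 'c'], 'col2': [1, 2], 'col3': [0.5, 1.5]})
expanded = expand_chars(df3, 'col1')
```

Because the rows are selected by position, this should also work when the original index has duplicates.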
Performance
df = pd.concat([df] * 100000, ignore_index=True)
# MaxU's solution
%%timeit
(df.col1.str.extractall(r'(.)')
   .reset_index(level=1, drop=True)
   .join(df['col2'])
   .reset_index(drop=True))
1 loop, best of 3: 1.98 s per loop
# piRSquared's solution
%%timeit
pd.DataFrame(
[[x] + b for a, *b in df.values for x in a],
columns=df.columns
)
1 loop, best of 3: 1.68 s per loop
# Wen's solution
%%timeit
v = df.col1.apply(list)
pd.DataFrame({'col1':np.concatenate(v.values),'col2':df.col2.repeat(v.apply(len))})
1 loop, best of 3: 835 ms per loop
# Alexander's solution
%%timeit
pd.DataFrame([(letter, i)
for letters, i in zip(df['col1'], df['col2'])
for letter in letters],
columns=df.columns)
1 loop, best of 3: 316 ms per loop
%%timeit
pd.DataFrame(
{
'col1' : list(''.join(df.col1)),
'col2' : df.col2.values.repeat(df.col1.str.len(), axis=0)
})
10 loops, best of 3: 124 ms per loop
I tried timing Vaishali’s, but it took too long on this dataset.
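The %%timeit cells above only run in IPython/Jupyter. If you want to reproduce the comparison in a plain script, something like the following sketch with the standard timeit module should do. The function names are mine, the frame is deliberately much smaller than the one used above, and the numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

small = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})
# Smaller than the benchmark above so the script finishes quickly
big = pd.concat([small] * 1000, ignore_index=True)

def join_repeat(df):
    # Method 1: join the strings, repeat col2 by string length
    return pd.DataFrame({
        'col1': list(''.join(df['col1'])),
        'col2': df['col2'].to_numpy().repeat(df['col1'].str.len().to_numpy()),
    })

def comprehension(df):
    # Method 2: nested list comprehension over zipped columns
    return pd.DataFrame(
        [(letter, i)
         for letters, i in zip(df['col1'], df['col2'])
         for letter in letters],
        columns=df.columns,
    )

# Total seconds for 3 runs of each function on the same frame
timings = {
    func.__name__: timeit.timeit(lambda func=func: func(big), number=3)
    for func in (join_repeat, comprehension)
}
print(timings)
```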
Method 2
pd.DataFrame([(letter, i)
for letters, i in zip(df['col1'], df['col2'])
for letter in letters],
columns=df.columns)
Method 3
Trick from the list 🙂
df.col1=df.col1.apply(list)
df
Out[489]:
col1 col2
0 [a, s, d, f] 1
1 [x, y] 2
2 [q] 3
pd.DataFrame({'col1':np.concatenate(df.col1.values),'col2':df.col2.repeat(df.col1.apply(len))})
Out[490]:
col1 col2
0 a 1
0 s 1
0 d 1
0 f 1
1 x 2
1 y 2
2 q 3
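Note that the result above keeps the original row labels (0, 0, 0, 0, 1, 1, 2), because col2.repeat carries its index along. If you want a fresh 0..n-1 index like the other answers, a reset_index(drop=True) at the end should be enough. A minimal sketch that also avoids overwriting df.col1:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})

# Lists of characters, kept in a separate Series so df is not modified
chars = df['col1'].apply(list)

result = pd.DataFrame({
    'col1': np.concatenate(chars.to_numpy()),
    'col2': df['col2'].repeat(chars.apply(len)),
}).reset_index(drop=True)  # replace the repeated labels with 0..n-1
```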
Method 4
In [86]: (df.col1.str.extractall(r'(.)')
    ...:    .reset_index(level=1, drop=True)
    ...:    .join(df['col2'])
    ...:    .reset_index(drop=True))
Out[86]:
0 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3
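As the output shows, extractall names the column after the capture group number, so you end up with a column called 0 instead of col1. One way to get the original name back is a rename at the end of the chain, as in this sketch. Keep in mind that "." does not match newline characters by default, so those would be silently dropped.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})

result = (
    df['col1'].str.extractall(r'(.)')      # one row per matched character
      .reset_index(level=1, drop=True)     # drop the per-row match counter
      .join(df['col2'])                    # bring col2 back via the row label
      .reset_index(drop=True)
      .rename(columns={0: 'col1'})         # capture group 0 -> original name
)
```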
Method 5
One more:)
(df.set_index('col2').col1.apply(lambda x: pd.Series(list(x))).stack()
   .reset_index(level=1, drop=True).reset_index(name='col1'))
col2 col1
0 1 a
1 1 s
2 1 d
3 1 f
4 2 x
5 2 y
6 3 q
Method 6
General solution with a list comprehension and clever unpacking:
pd.DataFrame(
[[x] + b for a, *b in df.values for x in a],
columns=df.columns
)
col1 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3
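To make the unpacking a bit more concrete: "a, *b" splits each row into its first value (the string) and a list holding the rest, so [x] + b rebuilds a full row for every character x. The sketch below runs the same comprehension on a frame with an extra column (col3 is my addition for illustration). It assumes the string column comes first.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['ab', 'c'], 'col2': [1, 2], 'col3': [0.5, 1.5]})

# For the first row: a == 'ab', b == [1, 0.5]
rows = [[x] + b for a, *b in df.to_numpy() for x in a]

result = pd.DataFrame(rows, columns=df.columns)
```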
Method 7
Using explode (pandas >= 0.25)
df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})
df.col1=df.col1.apply(list)
df = df.explode('col1')
Result:
col1 col2
0 a 1
0 s 1
0 d 1
0 f 1
1 x 2
1 y 2
2 q 3
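explode keeps the original row labels, which is why the index repeats. If you'd rather leave the original frame untouched and get a clean index, a chained version along these lines should work (a sketch, using only assign, explode and reset_index):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})

result = (
    df.assign(col1=df['col1'].apply(list))  # strings -> lists, on a copy
      .explode('col1')                      # one row per list element
      .reset_index(drop=True)               # 0..n-1 instead of repeated labels
)
```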
Method 8
You can also use the itertools.chain and itertools.repeat functions to achieve the same result.
For example:
import pandas as pd
from itertools import chain, repeat
d = {'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]}
expanded_d = {
    "col1": list(chain(*[list(item) for item in d["col1"]])),
    "col2": list(chain(*[
        list(repeat(d["col2"][idx], len(d["col1"][idx])))
        for idx in range(len(d["col1"]))
    ])),
}
result = pd.DataFrame(data=expanded_d)
col1 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3
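The same idea can probably be written a little more compactly with chain.from_iterable and zip, which avoids the index lookups and the intermediate lists. A sketch under the same assumptions (plain Python lists of equal length in d):

```python
import pandas as pd
from itertools import chain, repeat

d = {'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]}

expanded = {
    # A string is already an iterable of characters, so chain it directly
    'col1': list(chain.from_iterable(d['col1'])),
    # Repeat each number once per character of the matching string
    'col2': list(chain.from_iterable(
        repeat(n, len(s)) for s, n in zip(d['col1'], d['col2'])
    )),
}

result = pd.DataFrame(expanded)
```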
Hope it helps.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.