Unnest (explode) a Pandas Series

I have:

df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})

   col1  col2
0  asdf     1
1    xy     2
2     q     3

I’d like to pair each letter of the strings in col1 with the int in the same row of col2, i.e. take the “combinatoric product” of each row's letters with its col2 value:

  col1  col2
0    a    1
1    s    1
2    d    1
3    f    1
4    x    2
5    y    2
6    q    3

Current method:

from itertools import product

pieces = []
for _, s in df.iterrows():
    letters = list(s.col1)
    prods = list(product(letters, [s.col2]))
    pieces.append(pd.DataFrame(prods))

pd.concat(pieces)

Any more efficient workarounds?

Answers:


Method 1

Using list + str.join and np.repeat –

pd.DataFrame(
{
     'col1' : list(''.join(df.col1)), 
     'col2' : df.col2.values.repeat(df.col1.str.len(), axis=0)
})

  col1  col2
0    a     1
1    s     1
2    d     1
3    f     1
4    x     2
5    y     2
6    q     3

A generalised solution for any number of columns is easily achievable, without much change to the solution –

i = list(''.join(df.col1))
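# Pass the column by keyword: the positional axis argument is no longer accepted by recent pandas
j = df.drop(columns='col1').values.repeat(df.col1.str.len(), axis=0)

# Index.drop keeps the original column order (Index.difference would sort the names)
df = pd.DataFrame(j, columns=df.columns.drop('col1'))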
df.insert(0, 'col1', i)

df

  col1 col2
0    a    1
1    s    1
2    d    1
3    f    1
4    x    2
5    y    2
6    q    3
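The generalised approach above can also be wrapped in a small helper. The following is only a sketch: `explode_str_column` is a hypothetical name, not a pandas function, and it assumes the target column holds plain strings with no missing values. It repeats row positions rather than raw values, so the other columns keep their dtypes and order.

```python
import numpy as np
import pandas as pd


def explode_str_column(df, col):
    """Return a new frame with one row per character of df[col].

    Hypothetical helper (not part of pandas); assumes df[col] contains
    only strings, with no NaN values.
    """
    lengths = df[col].str.len().to_numpy()
    # Repeat each row's integer position once per character in its string.
    positions = np.repeat(np.arange(len(df)), lengths)
    out = df.iloc[positions].reset_index(drop=True)
    # Overwrite the string column with the individual characters.
    out[col] = list("".join(df[col]))
    return out


df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})
result = explode_str_column(df, 'col1')
print(result)
```

Because the rows are selected with `iloc`, the original `df` is left unchanged and `col2` stays an integer column.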

Performance

df = pd.concat([df] * 100000, ignore_index=True)

# MaxU's solution

%%timeit
(df.col1.str.extractall(r'(.)')
   .reset_index(level=1, drop=True)
   .join(df['col2'])
   .reset_index(drop=True))

1 loop, best of 3: 1.98 s per loop

# piRSquared's solution

%%timeit
pd.DataFrame(
     [[x] + b for a, *b in df.values for x in a],
     columns=df.columns
)

1 loop, best of 3: 1.68 s per loop

# Wen's solution

%%timeit
v = df.col1.apply(list)
pd.DataFrame({'col1':np.concatenate(v.values),'col2':df.col2.repeat(v.apply(len))})

1 loop, best of 3: 835 ms per loop

# Alexander's solution

%%timeit
pd.DataFrame([(letter, i) 
              for letters, i in zip(df['col1'], df['col2']) 
              for letter in letters],
             columns=df.columns)

1 loop, best of 3: 316 ms per loop

# This answer's solution (list + np.repeat)

%%timeit
pd.DataFrame(
{
     'col1' : list(''.join(df.col1)), 
     'col2' : df.col2.values.repeat(df.col1.str.len(), axis=0)
})

10 loops, best of 3: 124 ms per loop

I tried timing Vaishali’s, but it took too long on this dataset.
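The `%%timeit` cells above only run inside IPython or Jupyter. As a rough sketch of how two of the approaches could be compared from a plain Python script, the standard-library `timeit` module can be used instead. The function names `repeat_join` and `comprehension` and the smaller test frame are choices made for this illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})
# A smaller frame than the benchmark above, so the script finishes quickly.
big = pd.concat([base] * 1000, ignore_index=True)


def repeat_join(df):
    # list + str.join for the letters, ndarray.repeat for the other column
    return pd.DataFrame({
        'col1': list(''.join(df.col1)),
        'col2': df.col2.values.repeat(df.col1.str.len(), axis=0),
    })


def comprehension(df):
    # nested list comprehension over (string, value) pairs
    return pd.DataFrame(
        [(letter, i)
         for letters, i in zip(df['col1'], df['col2'])
         for letter in letters],
        columns=df.columns,
    )


for fn in (repeat_join, comprehension):
    # Best of 3 repeats, 5 calls per repeat, reported per call.
    best = min(timeit.repeat(lambda: fn(big), number=5, repeat=3)) / 5
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```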

Method 2

pd.DataFrame([(letter, i) 
              for letters, i in zip(df['col1'], df['col2']) 
              for letter in letters],
             columns=df.columns)

Method 3

Trick from the list 🙂

df.col1=df.col1.apply(list)
df
Out[489]: 
           col1  col2
0  [a, s, d, f]     1
1        [x, y]     2
2           [q]     3
pd.DataFrame({'col1':np.concatenate(df.col1.values),'col2':df.col2.repeat(df.col1.apply(len))})
Out[490]: 
  col1  col2
0    a     1
0    s     1
0    d     1
0    f     1
1    x     2
1    y     2
2    q     3

Method 4

In [86]: (df.col1.str.extractall(r'(.)')
    ...:    .reset_index(level=1, drop=True)
    ...:    .join(df['col2'])
    ...:    .reset_index(drop=True))
Out[86]:
   0  col2
0  a     1
1  s     1
2  d     1
3  f     1
4  x     2
5  y     2
6  q     3

Method 5

One more:)

(df.set_index('col2').col1.apply(lambda x: pd.Series(list(x))).stack()
   .reset_index(level=1, drop=True).reset_index(name='col1'))

    col2    col1
0   1       a
1   1       s
2   1       d
3   1       f
4   2       x
5   2       y
6   3       q

Method 6

General solution with a list comprehension and clever unpacking:

pd.DataFrame(
    [[x] + b for a, *b in df.values for x in a],
    columns=df.columns
)

  col1  col2
0    a     1
1    s     1
2    d     1
3    f     1
4    x     2
5    y     2
6    q     3
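To see what the unpacking does, here is the same comprehension written as explicit loops. This is only an illustrative sketch; like the one-liner, it assumes the string column is the first column of the frame.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})

rows = []
for row in df.values:
    # a is the first cell (the string); b is a list of the remaining cells.
    a, *b = row
    for x in a:
        # One output row per character, followed by the untouched cells.
        rows.append([x] + b)

expanded = pd.DataFrame(rows, columns=df.columns)
print(expanded)
```

Since `b` collects every remaining cell, the same code works unchanged when the frame has more than two columns.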

Method 7

Using Explode (pandas>=0.25)

df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})

df.col1=df.col1.apply(list)
df = df.explode('col1')

Result:

  col1  col2
0    a     1
0    s     1
0    d     1
0    f     1
1    x     2
1    y     2
2    q     3
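Note that `explode` repeats the original row labels in the index. If a fresh 0..n-1 index is wanted, a variant along these lines may help. It is a sketch that uses `assign` so the original `df` is not overwritten, and `exploded` is just a name chosen here.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]})

exploded = (
    df.assign(col1=df['col1'].apply(list))  # turn each string into a list of letters
      .explode('col1')                      # one row per list element
      .reset_index(drop=True)               # replace the repeated labels with 0..n-1
)
print(exploded)
```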

Method 8

You can also use the itertools.chain and itertools.repeat functions to achieve similar results.

An example would be

import pandas as pd
from itertools import chain, repeat

d = {'col1': ['asdf', 'xy', 'q'], 'col2': [1, 2, 3]}

expanded_d = {
    # flatten every string into its individual letters
    "col1": list(chain.from_iterable(d["col1"])),
    # repeat each col2 value once per letter of the matching string
    "col2": list(chain.from_iterable(
        repeat(n, len(s)) for s, n in zip(d["col1"], d["col2"])
    )),
}

result = pd.DataFrame(data=expanded_d)

  col1  col2
0    a     1
1    s     1
2    d     1
3    f     1
4    x     2
5    y     2
6    q     3

Hope it helps.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
