I have a Pandas DataFrame with one column:
df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
teams
0 [SF, NYG]
1 [SF, NYG]
2 [SF, NYG]
3 [SF, NYG]
4 [SF, NYG]
5 [SF, NYG]
6 [SF, NYG]
How can split this column of lists into two columns?
Desired result:
team1 team2 0 SF NYG 1 SF NYG 2 SF NYG 3 SF NYG 4 SF NYG 5 SF NYG 6 SF NYG
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
You can use the DataFrame constructor with lists created by to_list:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print (df2)
teams
0 [SF, NYG]
1 [SF, NYG]
2 [SF, NYG]
3 [SF, NYG]
4 [SF, NYG]
5 [SF, NYG]
6 [SF, NYG]
df2[['team1','team2']] = pd.DataFrame(df2.teams.tolist(), index= df2.index)
print (df2)
teams team1 team2
0 [SF, NYG] SF NYG
1 [SF, NYG] SF NYG
2 [SF, NYG] SF NYG
3 [SF, NYG] SF NYG
4 [SF, NYG] SF NYG
5 [SF, NYG] SF NYG
6 [SF, NYG] SF NYG
And for a new DataFrame:
df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2']) print (df3) team1 team2 0 SF NYG 1 SF NYG 2 SF NYG 3 SF NYG 4 SF NYG 5 SF NYG 6 SF NYG
A solution with apply(pd.Series) is very slow:
#7k rows df2 = pd.concat([df2]*1000).reset_index(drop=True) In [121]: %timeit df2['teams'].apply(pd.Series) 1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2']) 1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Method 2
Much simpler solution:
pd.DataFrame(df2["teams"].to_list(), columns=['team1', 'team2'])
Yields,
team1 team2 ------------- 0 SF NYG 1 SF NYG 2 SF NYG 3 SF NYG 4 SF NYG 5 SF NYG 6 SF NYG 7 SF NYG
If you wanted to split a column of delimited strings rather than lists, you could similarly do:
pd.DataFrame(df["teams"].str.split('<delim>', expand=True).values,
columns=['team1', 'team2'])
Method 3
This solution preserves the index of the df2 DataFrame, unlike any solution that uses tolist():
df3 = df2.teams.apply(pd.Series)
df3.columns = ['team1', 'team2']
Here’s the result:
team1 team2 0 SF NYG 1 SF NYG 2 SF NYG 3 SF NYG 4 SF NYG 5 SF NYG 6 SF NYG
Method 4
There seems to be a syntactically simpler way, and therefore easier to remember, as opposed to the proposed solutions. I’m assuming that the column is called ‘meta’ in a dataframe df:
df2 = pd.DataFrame(df['meta'].str.split().values.tolist())
Method 5
The previous solutions didn’t work for me since I have nan observations in my dataframe. In my case df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index) yields:
object of type 'float' has no len()
I solve this using a list comprehension. Here is the replicable example:
import pandas as pd
import numpy as np
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2.loc[2,'teams'] = np.nan
df2.loc[4,'teams'] = np.nan
df2
Output:
teams 0 [SF, NYG] 1 [SF, NYG] 2 NaN 3 [SF, NYG] 4 NaN 5 [SF, NYG] 6 [SF, NYG] df2['team1']=np.nan df2['team2']=np.nan
Solving with a list comprehension,
for i in [0,1]:
df2['team{}'.format(str(i+1))]=[k[i] if isinstance(k,list) else k for k in df2['teams']]
df2
yields:
teams team1 team2 0 [SF, NYG] SF NYG 1 [SF, NYG] SF NYG 2 NaN NaN NaN 3 [SF, NYG] SF NYG 4 NaN NaN NaN 5 [SF, NYG] SF NYG 6 [SF, NYG] SF NYG
Method 6
List comprehension
A simple implementation with list comprehension (my favorite)
df = pd.DataFrame([pd.Series(x) for x in df.teams])
df.columns = ['team_{}'.format(x+1) for x in df.columns]
Timing on output:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns Wall time: 2.71 ms
Output:
team_1 team_2 0 SF NYG 1 SF NYG 2 SF NYG 3 SF NYG 4 SF NYG 5 SF NYG 6 SF NYG
Method 7
Here’s another solution using df.transform and df.set_index:
>>> from operator import itemgetter
>>> df['teams'].transform({'item1': itemgetter(0), 'item2': itemgetter(1)})
team1 team2
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
Which of course can be generalized as:
>>> indices = range(len(df['teams'][0]))
>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})
team1 team2
0 SF NYG
1 SF NYG
2 SF NYG
3 SF NYG
4 SF NYG
5 SF NYG
6 SF NYG
This approach has the added benefit of extracting the desired indices:
>>> df
teams
0 [SF, NYG, XYZ, ABC]
1 [SF, NYG, XYZ, ABC]
2 [SF, NYG, XYZ, ABC]
3 [SF, NYG, XYZ, ABC]
4 [SF, NYG, XYZ, ABC]
5 [SF, NYG, XYZ, ABC]
6 [SF, NYG, XYZ, ABC]
>>> indices = [0, 2]
>>> df['teams'].transform({f'team{i+1}': itemgetter(i) for i in indices})
team1 team3
0 SF XYZ
1 SF XYZ
2 SF XYZ
3 SF XYZ
4 SF XYZ
5 SF XYZ
6 SF XYZ
Method 8
Based on the previous answers, here is another solution which returns the same result as df2.teams.apply(pd.Series) with a much faster run time:
pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)
Timings:
In [1]:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2 = pd.concat([df2]*1000).reset_index(drop=True)
In [2]: %timeit df2['teams'].apply(pd.Series)
8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)
35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method 9
I would like to recommend a more efficient and Pythonic way.
First define the DataFrame as original post:
df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
My solution:
%%timeit
df['team1'], df['team2'] = zip(*list(df['teams'].values))
>> 761 µs ± 8.35 µs per loop
In comparison, the most upvoted solution:
%%timeit
df[['team1','team2']] = pd.DataFrame(df.teams.tolist(), index=df.index)
df = pd.DataFrame(df['teams'].to_list(), columns=['team1','team2'])
>> 1.31 ms ± 11.2 µs per loop
My solution saves 40% time and is much shorter. The only thing you need to remember is how to unpack and reshape a two-dimension list by using zip(*list).
Method 10
you can try to use two times of apply to create new column ‘team1’ and ‘team2’ in your df
df = pd.DataFrame({"teams": [["SF", "NYG"] for _ in range(7)]})
df["team1"]=df['teams'].apply(lambda x: x[0] )
df["team2"]=df['teams'].apply(lambda x: x[1] )
df
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0