I am trying to split a column into multiple columns based on comma/space separation.
My dataframe currently looks like
   KEYS      1
0  FIT-4270  4000.0439
1  FIT-4269  4000.0420, 4000.0471
2  FIT-4268  4000.0419
3  FIT-4266  4000.0499
4  FIT-4265  4000.0490, 4000.0499, 4000.0500, 4000.0504,
I would like
   KEYS      1          2          3          4
0  FIT-4270  4000.0439
1  FIT-4269  4000.0420  4000.0471
2  FIT-4268  4000.0419
3  FIT-4266  4000.0499
4  FIT-4265  4000.0490  4000.0499  4000.0500  4000.0504
My code currently removes the KEYS column and I’m not sure why. Could anyone help fix the issue?
v = dfcleancsv[1]
#splits the columns by spaces into new columns but removes KEYS?
dfcleancsv = dfcleancsv[1].str.split(' ').apply(Series, 1)
Answers:
Method 1
In case someone else wants to split a single column (delimited by a value) into multiple columns, try this:
series.str.split(',', expand=True)
This answered the question I came here looking for.
Credit to EdChum’s code that includes adding the split columns back to the dataframe.
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
Note: the first argument, df[[0]], is a DataFrame.
The second argument, df[1].str.split(', ', expand=True), is the result of splitting the Series df[1]; with expand=True it is also a DataFrame.
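As a minimal sketch of the concat approach above, the snippet below rebuilds a small frame with integer column labels 0 and 1 (the sample values are taken from the question, the layout is an assumption). One thing to watch: the split result is labelled 0, 1, ..., so concatenating it next to df[[0]] would give two columns labelled 0. The sketch relabels the split columns first to avoid that.

```python
import pandas as pd

# Assumed sample data: column 0 holds the keys, column 1 the comma-separated values.
df = pd.DataFrame({
    0: ["FIT-4270", "FIT-4269", "FIT-4268"],
    1: ["4000.0439", "4000.0420, 4000.0471", "4000.0419"],
})

# expand=True returns a DataFrame with one column per piece (labelled 0, 1, ...).
parts = df[1].str.split(", ", expand=True)

# Relabel the pieces 1, 2, ... so they do not clash with the existing label 0.
parts.columns = range(1, parts.shape[1] + 1)

# Keep the key column by concatenating it with the split columns side by side.
result = pd.concat([df[[0]], parts], axis=1)
print(result)
```

Rows with fewer values than the widest row are padded with missing values.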
Method 2
Using EdChum’s answer:
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
I was able to solve it by substituting my variables.
dfcleancsv = pd.concat([dfcleancsv['KEYS'], dfcleancsv[1].str.split(', ', expand=True)], axis=1)
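For context, the original code lost KEYS because it assigned the split of column 1 alone back to dfcleancsv, replacing the whole frame. The sketch below is one possible variant of the fix, on a few rows from the question. It also handles the trailing comma in the last row, which a plain split on ', ' would leave attached to the final value. The stripping step and the regex separator are assumptions about how the data should be cleaned, not part of the original answer.

```python
import pandas as pd

# A few rows from the question; the last one ends with a stray comma.
dfcleancsv = pd.DataFrame({
    "KEYS": ["FIT-4270", "FIT-4269", "FIT-4265"],
    1: [
        "4000.0439",
        "4000.0420, 4000.0471",
        "4000.0490, 4000.0499, 4000.0500, 4000.0504,",
    ],
})

# Strip commas and spaces from both ends so no piece keeps a trailing comma.
cleaned = dfcleancsv[1].str.strip(", ")

# Split on a comma followed by optional whitespace (multi-character patterns
# are treated as regular expressions by default).
parts = cleaned.str.split(r",\s*", expand=True)

# Label the new columns 1, 2, 3, ... as in the desired output.
parts.columns = range(1, parts.shape[1] + 1)

# Concatenate with KEYS instead of overwriting the frame with the split alone.
result = pd.concat([dfcleancsv["KEYS"], parts], axis=1)
print(result)
```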
Method 3
The OP had a variable number of output columns.
In the particular case of a fixed number of output columns, another elegant solution that also names the resulting columns is multiple assignment.
Load a sample dataset and reshape it to long format to obtain a variable called organ_dimension.
import seaborn
iris = seaborn.load_dataset('iris')
df = iris.melt(id_vars='species', var_name='organ_dimension', value_name='value')
Split the organ_dimension variable into two variables, organ and dimension, based on the _ separator.
df[['organ', 'dimension']] = df['organ_dimension'].str.split('_', expand=True)
df.head()
Out[10]:
  species organ_dimension  value  organ dimension
0  setosa    sepal_length    5.1  sepal    length
1  setosa    sepal_length    4.9  sepal    length
2  setosa    sepal_length    4.7  sepal    length
3  setosa    sepal_length    4.6  sepal    length
4  setosa    sepal_length    5.0  sepal    length
Based on the answer to “How to split a column into two columns?”
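The same multiple-assignment idea can be tried without downloading a dataset. The sketch below uses a small made-up frame (the values are illustrative, not the iris data) and passes n=1 so that each value is split at most once. That guarantees no more than two output columns, which must match the two names on the left-hand side.

```python
import pandas as pd

# Made-up sample values in the same "organ_dimension" shape as the example above.
df = pd.DataFrame({
    "organ_dimension": ["sepal_length", "sepal_width", "petal_length"],
})

# n=1 limits the split to the first separator, so at most two pieces come back.
df[["organ", "dimension"]] = df["organ_dimension"].str.split("_", n=1, expand=True)
print(df)
```

If no value in the column contains the separator, only one column comes back and the assignment fails, so this pattern fits best when the format is known in advance.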
Method 4
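A simple way is to split the column into lists and expand each list with apply. Note that apply runs row by row, so this is not truly vectorized, and it returns only the split columns:
split_cols = df[1].str.split(', ').apply(pd.Series)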
Method 5
This may work, provided column 1 is split into lists first:
df = pd.concat([df['KEYS'], df[1].str.split(', ').apply(pd.Series)], axis=1)
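To see how the apply(pd.Series) route relates to str.split with expand=True, here is a small comparison on made-up strings. Both should give the same table once missing values are normalised. The apply version calls pd.Series once per row, so it tends to be slower on large frames.

```python
import pandas as pd

# Made-up sample with a varying number of comma-separated pieces per row.
s = pd.Series(["a, b", "c", "d, e, f"])

# Route 1: split into lists, then expand each list into a row of a DataFrame.
via_apply = s.str.split(", ").apply(pd.Series)

# Route 2: let str.split build the DataFrame directly.
via_expand = s.str.split(", ", expand=True)

# Normalise missing values to empty strings before comparing the contents.
same = via_apply.fillna("").values.tolist() == via_expand.fillna("").values.tolist()
print(same)
```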
Method 6
Check this out
Responder_id LanguagesWorkedWith
0 1 HTML/CSS;Java;JavaScript;Python
1 2 C++;HTML/CSS;Python
2 3 HTML/CSS
3 4 C;C++;C#;Python;SQL
4 5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA
... ... ...
87564 88182 HTML/CSS;Java;JavaScript
87565 88212 HTML/CSS;JavaScript;Python
87566 88282 Bash/Shell/PowerShell;Go;HTML/CSS;JavaScript;W...
87567 88377 HTML/CSS;JavaScript;Other(s):
87568 88863 Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...
Split the LanguagesWorkedWith column into multiple columns by using data = data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series).
import pandas as pd
data1 = pd.read_csv('data.csv', sep=',')
data1.set_index('Responder_id', inplace=True)
data1
data1.loc[1, :]
data = data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series)
data.head()
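For readers without the CSV file, the following sketch reproduces the idea on three rows copied from the table above. It uses expand=True instead of apply(pd.Series), and add_prefix to give the new columns readable names. The lang_ prefix is a hypothetical naming choice, not part of the original answer.

```python
import pandas as pd

# Three rows copied from the table above, indexed by Responder_id.
data1 = pd.DataFrame({
    "Responder_id": [1, 2, 3],
    "LanguagesWorkedWith": [
        "HTML/CSS;Java;JavaScript;Python",
        "C++;HTML/CSS;Python",
        "HTML/CSS",
    ],
}).set_index("Responder_id")

# Split on ';' into one column per language; the index is preserved.
data = data1["LanguagesWorkedWith"].str.split(";", expand=True)

# Rename columns 0, 1, ... to lang_0, lang_1, ... (prefix chosen for illustration).
data = data.add_prefix("lang_")
print(data)
```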
Method 7
You may also want to try datar, a package that ports dplyr, tidyr and related R packages to Python:
>>> df
i j A
<object> <int64> <object>
0 AR 5 Paris,Green
1 For 3 Moscow,Yellow
2 For 4 NewYork,Black
>>> from datar import f
>>> from datar.tidyr import separate
>>> separate(df, f.A, ['City', 'Color'])
i j City Color
<object> <int64> <object> <object>
0 AR 5 Paris Green
1 For 3 Moscow Yellow
2 For 4 NewYork Black
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.