Splitting a pandas dataframe column by delimiter

i have a small sample data:

import pandas as pd

df = {'ID': [3009, 129, 119, 120, 121, 122, 130, 3014, 266, 849, 174, 844],
  'V': ['IGHV7-B*01', 'IGHV7-B*01', 'IGHV6-A*01', 'GHV6-A*01', 'IGHV6-A*01',
        'IGHV6-A*01', 'IGHV4-L*03', 'IGHV4-L*03', 'IGHV5-A*01', 'IGHV5-A*04',
        'IGHV6-A*02','IGHV6-A*02'],
  'Prob': [1, 1, 0.8, 0.8056, 0.9, 0.805, 1, 1, 0.997, 0.401, 1, 1]}

df = pd.DataFrame(df)

looks like

df    

Out[25]: 
      ID    Prob           V
0    3009  1.0000  IGHV7-B*01
1     129  1.0000  IGHV7-B*01
2     119  0.8000  IGHV6-A*01
3     120  0.8056  IGHV6-A*01
4     121  0.9000  IGHV6-A*01
5     122  0.8050  IGHV6-A*01
6     130  1.0000  IGHV4-L*03
7    3014  1.0000  IGHV4-L*03
8     266  0.9970  IGHV5-A*01
9     849  0.4010  IGHV5-A*04
10    174  1.0000  IGHV6-A*02
11    844  1.0000  IGHV6-A*02

I want to split the column ‘V’ by the ‘-‘ delimiter and move it to another column named ‘allele’

    Out[25]: 
      ID    Prob      V    allele
0    3009  1.0000  IGHV7    B*01
1     129  1.0000  IGHV7    B*01
2     119  0.8000  IGHV6    A*01
3     120  0.8056  IGHV6    A*01
4     121  0.9000  IGHV6    A*01
5     122  0.8050  IGHV6    A*01
6     130  1.0000  IGHV4    L*03
7    3014  1.0000  IGHV4    L*03
8     266  0.9970  IGHV5    A*01
9     849  0.4010  IGHV5    A*04
10    174  1.0000  IGHV6    A*02
11    844  1.0000  IGHV6    A*02

the code i have tried so far is incomplete and didn’t work:

df1 = pd.DataFrame()
df1[['V']] = pd.DataFrame([ x.split('-') for x in df['V'].tolist() ])

or

df.add(Series, axis='columns', level = None, fill_value = None)
newdata = df.DataFrame({'V':df['V'].iloc[::2].values, 
                        'Allele': df['V'].iloc[1::2].values})

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Use vectoried str.split with expand=True:

In [42]:
df[['V','allele']] = df['V'].str.split('-',expand=True)
df

Out[42]:
      ID    Prob      V allele
0   3009  1.0000  IGHV7   B*01
1    129  1.0000  IGHV7   B*01
2    119  0.8000  IGHV6   A*01
3    120  0.8056   GHV6   A*01
4    121  0.9000  IGHV6   A*01
5    122  0.8050  IGHV6   A*01
6    130  1.0000  IGHV4   L*03
7   3014  1.0000  IGHV4   L*03
8    266  0.9970  IGHV5   A*01
9    849  0.4010  IGHV5   A*04
10   174  1.0000  IGHV6   A*02
11   844  1.0000  IGHV6   A*02

Method 2

For storing data into a new dataframe use the same approach, just with the new dataframe:

tmpDF = pd.DataFrame(columns=['A','B'])
tmpDF[['A','B']] = df['V'].str.split('-', expand=True)

Eventually (and more usefull for my purposes) if you would need get only a part of the string value (i.e. text before ‘-‘), you could use .str.split(…).str[idx] like:

df['V'] = df['V'].str.split('-').str[0]
df
    ID      V       Prob
0   3009    IGHV7   1.0000
1   129     IGHV7   1.0000
2   119     IGHV6   0.8000
3   120     GHV6    0.8056

– splits ‘V’ values into list according to separator ‘-‘ and stores 1st item back to the column

Method 3

Use the below:

df['allele'] = [x.split('-')[-1] for x in df['V']]

The above first part retains any values after the ‘-‘ sign

df['V'] = [x.split('-')[-0] for x in df['V']]

The above second part retains any values before the ‘-‘ sign and automatically replaces the main column

df.head(3)


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x