Remove a substring from a pandas dataframe column

I have a large (45K rows) dataset and I need to remove specific values from specific columns in a handful of cases. The dataset is large enough I’d like to avoid using apply if at all possible.

Here’s a sample dataset:

df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})

drops = pd.DataFrame({"ID": [40, 50],"Column": ["S", "S"],"Rule": ["Remove", "Remove"],"Override": ["p", "p"]})

My current solution is to use:

(
df.merge(
    drops.pivot(index="ID", columns="Column", values="Override").reset_index()[["ID", "S"]],
    how="left",
    on=["ID", "S"],
    indicator="_dropS",
).assign(
    S=lambda d_: d_.S.mask(d_._dropS == "both", np.nan)))

But this only successfully removes one of the entries. My general Python knowledge is telling me to split the column S by the delimiter “/”, remove the matching entry, and join the list back together again (there may be more than two entries in the S column), but I can’t seem to make that work within the DataFrame without using apply.

Edited to add goal state: Column S should have the entries: ‘n’, ‘o’, ”. The final could be NaN as well.

Is there a reasonable way to do this without a separate function call?

Contents hide

Answers:

Method 1

Method 2

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

IIUC here is one solution that gives the expected output, no idea about the perfomance. Would be interested in your feedback on that.

#from your sample data
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50],"Column": ["S", "S"],"Rule": ["Remove", "Remove"],"Override": ["p", "p"]})

pivoted_rules = drops.pivot(index="ID", columns="Column", values="Override").rename(columns={'S': 'compare_S'})

res = pd.concat([df.set_index('ID'),pivoted_rules],axis=1).fillna('fill_value')
res['S'] = [''.join([x for x in a if x!=b]) for a, b in zip(res['S'].str.split('/'), res['compare_S'])]
res = res.drop('compare_S', axis=1).reset_index()

print(res)

    ID  T   S
0   30  C   n
1   40  D   o
2   50  E

Didn’t use apply 🙂

Method 2

remove specific values from specific columns,you can use .str.replace

df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
df.loc[:,'S'] = df['S'].str.replace(r'[/p]','')

the result :

   ID  T  S
0  30  C  n
1  40  D  o
2  50  E

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating