So I have a DataFrame with 180000+ values and I need to (1) replace duplicate and certain values in cells by row and (2) rearrange. Here is my DataFrame, df:
key sellyr brand makrc item1 item2 item3 item4 item5 item6 0 da12 2013 imp apt furi apt nan nan nan nan 1 da32 2013 sa rye rye app nan nan nan nan 2 da14 2013 sa pro not pro pan fan nan nan ........
nan values represent np.nan. And forbidden string is ‘not’.
So what I need to do is check columns item1~6 replace strings that are contained in the makrc column with nan. As well, I also want to replace ‘not’s’ with nan’s. After replacing strings to np.nan, I need to rearrange the item1~6 to left justify non-nan data to the leftmost empty cell, as shown below, (expected output):
key sellyr brand makrc item1 item2 item3 item4 item5 item6 0 da12 2013 imp apt furi nan nan nan nan nan 1 da32 2013 sa rye app nan nan nan nan nan 2 da14 2013 sa pro pan fan nan nan nan nan ........
So as you can see in a first index, I have removed apt string in item2 and changed to np.nan because same string is in makrc column. In index 1, I have removed rye and replace with np.nan. But this time, I rearranged the ‘app’ string from item2 to item1 because np.nan values should come after the values. In index 2, I have replaced pro and not since I need to replace every ‘not’string in the item columns to np.nan. Also I have rearranged the items.
I’ve tried combining all item columns as a list and replacing it, but there are a few rows with only np.nan items. Can you guys recommend an ideal process to solve my problem? Thank you so much.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
First, extract a slice of columns beginning with item –
m = df.columns.str.contains('item')
i = df.iloc[:, m]
Mask all values which meet your criteria. Use isin –
j = i[~i.isin(df.makrc.tolist() + ['not'])]
Now. sort values based on NaNs and assign back –
df.loc[:, m] = j.apply(sorted, key=pd.isnull, axis=1)
df
key sellyr brand makrc item1 item2 item3 item4 item5 item6
0 da12 2013 imp apt furi NaN NaN NaN NaN NaN
1 da32 2013 sa rye app NaN NaN NaN NaN NaN
2 da14 2013 sa pro pan fan NaN NaN NaN NaN
Details
i item1 item2 item3 item4 item5 item6 0 furi apt NaN NaN NaN NaN 1 rye app NaN NaN NaN NaN 2 not pro pan fan NaN NaN
j item1 item2 item3 item4 item5 item6 0 furi NaN NaN NaN NaN NaN 1 NaN app NaN NaN NaN NaN 2 NaN NaN pan fan NaN NaN
Toward Better Performance
You could make use of a modified version of Divakar’s justified function that works on object arrays –
def justify(a, invalid_val=0, axis=1, side='left'):
"""
Justifies a 2D array
Parameters
----------
A : ndarray
Input array to be justified
axis : int
Axis along which justification is to be made
side : str
Direction of justification. It could be 'left', 'right', 'up', 'down'
It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
"""
if invalid_val is np.nan:
mask = pd.notnull(a)
else:
mask = a!=invalid_val
justified_mask = np.sort(mask,axis=axis)
if (side=='up') | (side=='left'):
justified_mask = np.flip(justified_mask,axis=axis)
out = np.full(a.shape, invalid_val, dtype=object)
if axis==1:
out[justified_mask] = a[mask]
else:
out.T[justified_mask.T] = a.T[mask.T]
return out
df.loc[:, m] = justify(j.values, invalid_val=np.nan, axis=1, side='left')
df
key sellyr brand makrc item1 item2 item3 item4 item5 item6
0 da12 2013 imp apt furi NaN NaN NaN NaN NaN
1 da32 2013 sa rye app NaN NaN NaN NaN NaN
2 da14 2013 sa pro pan fan NaN NaN NaN NaN
This should (hopefully) be faster than calling apply. You’ll especially see speed gains using the original version of the function that is optimised for numeric data.
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0