I need to find the quickest way to sort each row in a dataframe with millions of rows and around a hundred columns.
So something like this:
A B C D
3 4 8 1
9 2 7 2
Needs to become:
A B C D
8 4 3 1
9 7 2 2
Right now I'm applying sort to each row and building up a new dataframe row by row. I'm also doing a couple of extra, less important things to each row (which is why I'm using pandas and not numpy). Would it be quicker to build a list of lists first and then create the new dataframe in one go? Or do I need to resort to Cython?
Answers:
Method 1
I think I would do this in numpy:
In [11]: a = df.values
In [12]: a.sort(axis=1) # no ascending argument
In [13]: a = a[:, ::-1] # so reverse
In [14]: a
Out[14]:
array([[8, 4, 3, 1],
       [9, 7, 2, 2]])
In [15]: pd.DataFrame(a, df.index, df.columns)
Out[15]:
   A  B  C  D
0  8  4  3  1
1  9  7  2  2
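For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. The variable names (`ascending`, `descending`, `sorted_df`) are only illustrative, and it uses `np.sort` (which returns a copy) instead of the in-place `a.sort`, so the original frame is left alone.

```python
import numpy as np
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({"A": [3, 9], "B": [4, 2], "C": [8, 7], "D": [1, 2]})

# np.sort returns a sorted copy, so df itself is not modified.
# axis=1 sorts the values within each row.
ascending = np.sort(df.to_numpy(), axis=1)

# There is no descending option, so reverse each row instead.
descending = ascending[:, ::-1]

# Rebuild a frame with the original index and column labels.
sorted_df = pd.DataFrame(descending, index=df.index, columns=df.columns)
print(sorted_df)
```

Because everything happens in one vectorised call, there is no per-row Python loop, which is probably where most of the time goes in the row-by-row approach.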
I had thought this might work, but it sorts the column labels rather than the values in each row:
In [21]: df.sort(axis=1, ascending=False)
Out[21]:
   D  C  B  A
0  1  8  4  3
1  2  7  2  9
Passing the columns as well makes pandas raise:
In [22]: df.sort(df.columns, axis=1, ascending=False)
ValueError: When sorting by column, axis must be 0 (rows)
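A note for readers on a newer pandas: as far as I know, `DataFrame.sort` is no longer available in later releases, and `sort_index` is the call that reproduces the label-sorting behaviour shown above. A small sketch, assuming the example frame from the question (`by_label` is just an illustrative name):

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({"A": [3, 9], "B": [4, 2], "C": [8, 7], "D": [1, 2]})

# Reorders the column labels (D, C, B, A). The values inside each row
# travel with their column and are not sorted.
by_label = df.sort_index(axis=1, ascending=False)
print(by_label)
```

So this is still not a row-wise sort of the values, which is why dropping down to numpy is the simpler route.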
Method 2
To add to the answer given by @Andy-Hayden: you can do this in place on the whole frame. I'm not really sure why this works, but it does. There seems to be no control over the order.
In [97]: A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
In [98]: A
Out[98]:
   one  two  three  four  five
0   22   63     72    46    49
1   43   30     69    33    25
2   93   24     21    56    39
3    3   57     52    11    74
In [99]: A.values.sort
Out[99]: <function ndarray.sort>
In [100]: A
Out[100]:
   one  two  three  four  five
0   22   63     72    46    49
1   43   30     69    33    25
2   93   24     21    56    39
3    3   57     52    11    74
In [101]: A.values.sort()
In [102]: A
Out[102]:
   one  two  three  four  five
0   22   46     49    63    72
1   25   30     33    43    69
2   21   24     39    56    93
3    3   11     52    57    74
In [103]: A = A.iloc[:,::-1]
In [104]: A
Out[104]:
   five  four  three  two  one
0    72    63     49   46   22
1    69    43     33   30   25
2    93    56     39   24   21
3    74    57     52   11    3
I hope someone can explain the why of this, just happy that it works 8)
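A likely explanation (my reading, not something confirmed in this thread): when every column has the same dtype, `.values` can hand back a view of the frame's underlying array, and `ndarray.sort()` sorts in place along the last axis by default, which for a 2-D array means row by row. That view behaviour is not something I would rely on, and newer pandas versions may not allow writing through `.values` at all. A sketch that avoids depending on it, with illustrative names (`values`, `A_sorted`) and only two of the rows above:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(
    [[22, 63, 72, 46, 49], [43, 30, 69, 33, 25]],
    columns=["one", "two", "three", "four", "five"],
)

# Take an explicit copy so nothing depends on view semantics.
values = A.to_numpy(copy=True)

# ndarray.sort() works in place and defaults to axis=-1 (the last axis),
# so each row of the 2-D array is sorted independently.
values.sort()

# Reverse the values only; the column labels keep their original order,
# unlike A.iloc[:, ::-1], which reverses the labels as well.
A_sorted = pd.DataFrame(values[:, ::-1], index=A.index, columns=A.columns)
print(A_sorted)
```

Reversing the array rather than the frame is what gives control over the order here.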
Method 3
You could use DataFrame.apply.
E.g.:
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print(A)
   one  two  three  four  five
0    2   75     44    53    46
1   18   51     73    80    66
2   35   91     86    44    25
3   60   97     57    33    79
A = A.apply(np.sort, axis=1)
print(A)
   one  two  three  four  five
0    2   44     46    53    75
1   18   51     66    73    80
2   25   35     44    86    91
3   33   57     60    79    97
Since you want it in descending order, you can simply multiply the dataframe by -1, sort it, and multiply by -1 again.
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
A = A * -1
A = A.apply(np.sort, axis=1)
A = A * -1
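The negate, sort, negate trick also works directly on the numpy array, without going through `apply`. A minimal sketch with fixed data so the result is reproducible; `desc` and `A_desc` are illustrative names, and it assumes signed numeric values (unsigned integers would wrap around when negated):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(
    [[2, 75, 44, 53, 46], [18, 51, 73, 80, 66]],
    columns=["one", "two", "three", "four", "five"],
)

# Negate, sort each row ascending, then negate back:
# the largest original value ends up first in every row.
desc = -np.sort(-A.to_numpy(), axis=1)

# Put the result back into a frame with the original labels.
A_desc = pd.DataFrame(desc, index=A.index, columns=A.columns)
print(A_desc)
```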
Method 4
Instead of using the pd.DataFrame constructor, an easier way to assign the sorted values back is to use double brackets:
Original dataframe:
A B C D
3 4 8 1
9 2 7 2
df[['A', 'B', 'C', 'D']] = np.sort(df)[:, ::-1]
   A  B  C  D
0  8  4  3  1
1  9  7  2  2
This way you can also sort just a subset of the columns:
df[['B', 'C']] = np.sort(df[['B', 'C']])[:, ::-1]
   A  B  C  D
0  3  8  4  1
1  9  7  2  2
Method 5
One could try this approach, which keeps the result a DataFrame:
import pandas as pd
import numpy as np
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print(A)
print(type(A))
   one  two  three  four  five
0   85   27     64    50    55
1    3   90     65    22     8
2    0    7     64    66    82
3   58   21     42    27    30
<class 'pandas.core.frame.DataFrame'>
B = A.apply(lambda x: np.sort(x), axis=1, raw=True)
print(B)
print(type(B))
   one  two  three  four  five
0   27   50     55    64    85
1    3    8     22    65    90
2    0    7     64    66    82
3   21   27     30    42    58
<class 'pandas.core.frame.DataFrame'>
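Since the question is about speed, it may be worth timing the `apply` route against a single `np.sort` call on your own data. The sketch below is only a rough harness: the frame size, the seed, and the helper names `with_apply` and `with_numpy` are my own assumptions, and no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test frame; scale the row count up to match your data.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.integers(0, 100, size=(2_000, 100)))


def with_apply(frame):
    # Row-wise apply; raw=True passes plain ndarrays instead of Series.
    return frame.apply(np.sort, axis=1, raw=True)


def with_numpy(frame):
    # One vectorised sort over the whole array, then rebuild the frame.
    values = np.sort(frame.to_numpy(), axis=1)
    return pd.DataFrame(values, index=frame.index, columns=frame.columns)


# Both routes should produce the same values.
same = np.array_equal(with_apply(big).to_numpy(), with_numpy(big).to_numpy())
print("same result:", same)

# Time each approach over a few runs.
print("apply:", timeit.timeit(lambda: with_apply(big), number=3))
print("numpy:", timeit.timeit(lambda: with_numpy(big), number=3))
```

I would expect the single `np.sort` call to come out ahead, since it avoids calling a Python function once per row, but it is best to measure on the real frame.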
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.