Given a DataFrame with a column “BoolCol”, we want to find the index values of the rows in which “BoolCol” == True.
I currently do this by iterating, which works perfectly:
for i in range(100, 3000):
    if df.iloc[i]['BoolCol'] == True:
        print(i, df.iloc[i]['BoolCol'])
But this is not the idiomatic pandas way to do it.
After some research, I am currently using this code:
df[df['BoolCol'] == True].index.tolist()
This gives me a list of indexes, but they don't match when I check them with:
df.iloc[i]['BoolCol']
The result is actually False!!
Which would be the correct Pandas way to do this?
Answers:
Method 1
df.iloc[i] returns the ith row of df. Here i does not refer to the index label; it is a 0-based ordinal position.
In contrast, the index attribute returns the actual index labels, not numeric row positions:
df.index[df['BoolCol'] == True].tolist()
or equivalently,
df.index[df['BoolCol']].tolist()
You can see the difference quite clearly by playing with a DataFrame whose
non-default index does not equal each row's numerical position:
df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
                  index=[10,20,30,40,50])
In [53]: df
Out[53]:
   BoolCol
10    True
20   False
30   False
40    True
50    True
[5 rows x 1 columns]
In [54]: df.index[df['BoolCol']].tolist()
Out[54]: [10, 40, 50]
If you want to use the index,
In [56]: idx = df.index[df['BoolCol']]

In [57]: idx
Out[57]: Int64Index([10, 40, 50], dtype='int64')
then you can select the rows using loc instead of iloc:
In [58]: df.loc[idx]
Out[58]:
   BoolCol
10    True
40    True
50    True

[3 rows x 1 columns]
Note that loc can also accept boolean arrays:
In [55]: df.loc[df['BoolCol']]
Out[55]:
   BoolCol
10    True
40    True
50    True

[3 rows x 1 columns]
If you have a boolean array, mask, and need ordinal index values, you can compute them using np.flatnonzero:
In [110]: np.flatnonzero(df['BoolCol'])
Out[112]: array([0, 3, 4])
Use df.iloc to select rows by ordinal index:
In [113]: df.iloc[np.flatnonzero(df['BoolCol'])]
Out[113]:
   BoolCol
10    True
40    True
50    True
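To tie Method 1 together, here is a minimal self-contained sketch that reuses the example frame above and contrasts index labels with ordinal positions. The variable names labels, positions and same_rows are only illustrative and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
                  index=[10, 20, 30, 40, 50])

# Index labels of the True rows: use these with df.loc
labels = df.index[df['BoolCol']].tolist()

# Ordinal (0-based) positions of the True rows: use these with df.iloc
positions = np.flatnonzero(df['BoolCol'].to_numpy()).tolist()

print(labels)     # [10, 40, 50]
print(positions)  # [0, 3, 4]

# Both routes select the same rows
same_rows = df.loc[labels].equals(df.iloc[positions])
print(same_rows)  # True
```

Passing the labels to df.iloc instead would either raise an IndexError or silently pick unrelated rows, which is most likely the mismatch described in the question.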
Method 2
This can be done using NumPy's where() function:
import pandas as pd
import numpy as np
In [716]: df = pd.DataFrame({"gene_name": ['SLC45A1', 'NECAP2', 'CLIC4', 'ADC', 'AGBL4'],
                             "BoolCol": [False, True, False, True, True]},
                            index=list("abcde"))
In [717]: df
Out[717]:
BoolCol gene_name
a False SLC45A1
b True NECAP2
c False CLIC4
d True ADC
e True AGBL4
In [718]: np.where(df["BoolCol"] == True)
Out[718]: (array([1, 3, 4]),)
In [719]: select_indices = list(np.where(df["BoolCol"] == True)[0])
In [720]: df.iloc[select_indices]
Out[720]:
BoolCol gene_name
b True NECAP2
d True ADC
e True AGBL4
You don't always need the index for a match, but in case you do:
In [796]: df.iloc[select_indices].index
Out[796]: Index([u'b', u'd', u'e'], dtype='object')

In [797]: df.iloc[select_indices].index.tolist()
Out[797]: ['b', 'd', 'e']
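As a possible shortcut, the positions returned by np.where can be used to index the Index object directly, without selecting rows first. This is a sketch on the same example frame; the names positions and labels are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"gene_name": ['SLC45A1', 'NECAP2', 'CLIC4', 'ADC', 'AGBL4'],
     "BoolCol": [False, True, False, True, True]},
    index=list("abcde"))

# With a single argument, np.where returns a tuple holding one array per
# dimension, so [0] extracts the row positions
positions = np.where(df["BoolCol"].to_numpy())[0]
print(positions.tolist())  # [1, 3, 4]

# Indexing df.index with positions translates them into labels
labels = df.index[positions].tolist()
print(labels)  # ['b', 'd', 'e']
```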
Method 3
If you want to use your dataframe object only once, use:
df['BoolCol'].loc[lambda x: x==True].index
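The callable form is mostly useful in a method chain, where the intermediate object has no name to refer to. The sketch below assumes the example frame from Method 1; the Flipped column is a hypothetical addition used only to create an unnamed intermediate result.

```python
import pandas as pd

df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
                  index=[10, 20, 30, 40, 50])

# The lambda receives the Series that .loc is called on; since the Series
# is already boolean, it can be returned directly as the mask
result = df['BoolCol'].loc[lambda s: s].index.tolist()
print(result)  # [10, 40, 50]

# Same idea on an intermediate frame created inside the chain
chained = (df.assign(Flipped=lambda d: ~d['BoolCol'])
             .loc[lambda d: d['Flipped']]
             .index.tolist())
print(chained)  # [20, 30]
```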
Method 4
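A simple way is to reset the index of the DataFrame prior to filtering. Note that the resulting values are ordinal row positions (for use with iloc), not the original index labels:
df_reset = df.reset_index()
df_reset[df_reset['BoolCol']].index.tolist()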
Bit hacky, but it’s quick!
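A small sketch of what this returns, assuming the example frame from Method 1. Here drop=True is used so that the old index is discarded rather than kept as a column; that is a choice made for this illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
                  index=[10, 20, 30, 40, 50])

# After reset_index the index is 0, 1, 2, ... so the "labels" of the
# filtered rows coincide with their ordinal positions
df_reset = df.reset_index(drop=True)
positions = df_reset.index[df_reset['BoolCol']].tolist()
print(positions)  # [0, 3, 4]

# Positions are valid for iloc on the original frame
all_true = bool(df.iloc[positions]['BoolCol'].all())
print(all_true)  # True
```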
Method 5
First, you can use query when the target column is of type bool (see the pandas documentation for DataFrame.query for details on how to use it):
df.query('BoolCol')
Out[123]:
BoolCol
10 True
40 True
50 True
After filtering the original df by the Boolean column, we can pick the index.
df=df.query('BoolCol')
df.index
Out[125]: Int64Index([10, 40, 50], dtype='int64')
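pandas Series also had a nonzero method (deprecated in pandas 0.24 and removed in 1.0; df.BoolCol.to_numpy().nonzero() is the equivalent in newer versions). It gives the positions of the True rows, which we can use to slice the DataFrame or its index:
df.index[df.BoolCol.nonzero()[0]]
Out[128]: Int64Index([10, 40, 50], dtype='int64')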
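For reference, a minimal sketch of both routes that should also run on recent pandas versions, where Series.nonzero is no longer available. It assumes the example frame from Method 1 and goes through the NumPy array instead; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'BoolCol': [True, False, False, True, True]},
                  index=[10, 20, 30, 40, 50])

# Route 1: filter with query, then read the index of what is left
labels_query = df.query('BoolCol').index.tolist()
print(labels_query)  # [10, 40, 50]

# Route 2: positions of the True values via the underlying NumPy array
positions = df['BoolCol'].to_numpy().nonzero()[0]
labels_nonzero = df.index[positions].tolist()
print(labels_nonzero)  # [10, 40, 50]
```

Unlike the snippet above, this does not overwrite df with the filtered result, so the original frame stays available.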
Method 6
I extended this question to: how do you get the row, column, and value of every matching value?
Here is a solution:
import pandas as pd
import numpy as np
def search_coordinate(df_data: pd.DataFrame, search_set: set) -> list:
    nda_values = df_data.values
    # positions (row array, column array) of every cell found in search_set
    tuple_index = np.where(np.isin(nda_values, list(search_set)))
    return [(row, col, nda_values[row][col]) for row, col in zip(tuple_index[0], tuple_index[1])]


if __name__ == '__main__':
    test_datas = [['cat', 'dog', ''],
                  ['goldfish', '', 'kitten'],
                  ['Puppy', 'hamster', 'mouse']
                  ]
    df_data = pd.DataFrame(test_datas)
    print(df_data)
    result_list = search_coordinate(df_data, {'dog', 'Puppy'})
    print(f"\n\n{'row':<4} {'col':<4} {'name':>10}")
    for row, col, name in result_list:
        print(f"{row:<4} {col:<4} {name:>10}")
Output:
          0        1       2
0       cat      dog
1  goldfish           kitten
2     Puppy  hamster   mouse


row  col        name
0    1           dog
2    0         Puppy
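The function above reports ordinal positions. If the frame has non-default row or column labels, a variant along the following lines could report labels instead. This is only a sketch: search_labels is a hypothetical helper, and the 'a'..'c' and 'x'..'z' labels are made up for the illustration.

```python
import numpy as np
import pandas as pd


def search_labels(df_data: pd.DataFrame, search_set: set) -> list:
    """Return (row label, column label, value) for every matching cell."""
    values = df_data.to_numpy()
    # Boolean array marking the cells whose value is in search_set
    mask = df_data.isin(list(search_set)).to_numpy()
    # np.argwhere yields one (row position, column position) pair per hit,
    # in row-major order
    return [(df_data.index[r], df_data.columns[c], values[r, c])
            for r, c in np.argwhere(mask)]


df_data = pd.DataFrame([['cat', 'dog', ''],
                        ['goldfish', '', 'kitten'],
                        ['Puppy', 'hamster', 'mouse']],
                       index=['a', 'b', 'c'], columns=['x', 'y', 'z'])

result = search_labels(df_data, {'dog', 'Puppy'})
print(result)  # [('a', 'y', 'dog'), ('c', 'x', 'Puppy')]
```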
Method 7
When the candidate index values of interest are known in advance, a faster approach that avoids checking the whole column looks like this:
np.array(index_slice)[np.where(df.loc[index_slice]['column_name'] >= threshold)[0]]
Full comparison:
import pandas as pd
import numpy as np
index_slice = list(range(50,150)) # known index locations of interest
data = np.zeros(10000)
data[(index_slice)] = np.random.random(len(index_slice))
df = pd.DataFrame(
    {'column_name': data},
)
threshold = 0.5
%%timeit
np.array(index_slice)[np.where(df.loc[index_slice]['column_name'] >= threshold)[0]]
# 600 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
[i for i in index_slice if i in df.index[df['column_name'] >= threshold].tolist()]
# 22.5 ms ± 29.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The way it works is like this:
# build the Boolean condition only on the sliced rows of the column
df.loc[index_slice]['column_name'] >= threshold
# convert the Booleans to positions, counted from 0 within the slice
np.where(...)[0]
# map those positions back to the original index values
np.array(index_slice)[...]
Note:
Note that np.array(index_slice) can't be replaced by df.index, because np.where(...)[0] yields positions that start at 0 and increase by 1 within the slice; something like df.index[index_slice] would work instead. I think this is not worth the hassle if you only do it once on a small number of rows.
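As a sanity check, the sketch below compares the restricted lookup with a full-column scan on the same kind of data. It is an illustration under a few assumptions: a seeded generator replaces np.random.random so the run is repeatable, df.loc[index_slice, 'column_name'] is used instead of the chained df.loc[...][...] form, and the names subset, fast and full are made up here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # seeded only for repeatability
index_slice = list(range(50, 150))      # known index locations of interest
data = np.zeros(10000)
data[index_slice] = rng.random(len(index_slice))
df = pd.DataFrame({'column_name': data})
threshold = 0.5

# Restricted check: evaluate the condition on the candidate rows only.
# The subset keeps its original labels, so its index can be filtered directly.
subset = df.loc[index_slice, 'column_name']
fast = subset.index[subset >= threshold].tolist()

# Full scan: evaluate the condition on every row, then keep the candidates
candidates = set(index_slice)
full = [i for i in df.index[df['column_name'] >= threshold] if i in candidates]

print(fast == full)  # True
```

Filtering the subset's own index sidesteps the position-to-label mapping that np.array(index_slice)[...] performs in the answer above.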
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0.