I have a multi-index data frame with columns ‘A’ and ‘B’.
Is there is a way to select rows by filtering on one column of the multi-index without resetting the index to a single column index?
For Example.
# has multi-index (A,B) df #can I do this? I know this doesn't work because the index is multi-index so I need to specify a tuple df.ix[df.A ==1]
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
One way is to use the get_level_values
Index method:
In [11]: df Out[11]: 0 A B 1 4 1 2 5 2 3 6 3 In [12]: df.iloc[df.index.get_level_values('A') == 1] Out[12]: 0 A B 1 4 1
In 0.13 you’ll be able to use xs
with drop_level
argument:
df.xs(1, level='A', drop_level=False) # axis=1 if columns
Note: if this were column MultiIndex rather than index, you could use the same technique:
In [21]: df1 = df.T In [22]: df1.iloc[:, df1.columns.get_level_values('A') == 1] Out[22]: A 1 B 4 0 1
Method 2
You can also use query
which is very readable in my opinion and straightforward to use:
import pandas as pd df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 50, 80], 'C': [6, 7, 8, 9]}) df = df.set_index(['A', 'B']) C A B 1 10 6 2 20 7 3 50 8 4 80 9
For what you had in mind you can now simply do:
df.query('A == 1') C A B 1 10 6
You can also have more complex queries using and
df.query('A >= 1 and B >= 50') C A B 3 50 8 4 80 9
and or
df.query('A == 1 or B >= 50') C A B 1 10 6 3 50 8 4 80 9
You can also query on different index levels, e.g.
df.query('A == 1 or C >= 8')
will return
C A B 1 10 6 3 50 8 4 80 9
If you want to use variables inside your query, you can use @
:
b_threshold = 20 c_threshold = 8 df.query('B >= @b_threshold and C <= @c_threshold') C A B 2 20 7 3 50 8
Method 3
You can use DataFrame.xs()
:
In [36]: df = DataFrame(np.random.randn(10, 4)) In [37]: df.columns = [np.random.choice(['a', 'b'], size=4).tolist(), np.random.choice(['c', 'd'], size=4)] In [38]: df.columns.names = ['A', 'B'] In [39]: df Out[39]: A b a B d d d d 0 -1.406 0.548 -0.635 0.576 1 -0.212 -0.583 1.012 -1.377 2 0.951 -0.349 -0.477 -1.230 3 0.451 -0.168 0.949 0.545 4 -0.362 -0.855 1.676 -2.881 5 1.283 1.027 0.085 -1.282 6 0.583 -1.406 0.327 -0.146 7 -0.518 -0.480 0.139 0.851 8 -0.030 -0.630 -1.534 0.534 9 0.246 -1.558 -1.885 -1.543 In [40]: df.xs('a', level='A', axis=1) Out[40]: B d d 0 -0.635 0.576 1 1.012 -1.377 2 -0.477 -1.230 3 0.949 0.545 4 1.676 -2.881 5 0.085 -1.282 6 0.327 -0.146 7 0.139 0.851 8 -1.534 0.534 9 -1.885 -1.543
If you want to keep the A
level (the drop_level
keyword argument is only available starting from v0.13.0):
In [42]: df.xs('a', level='A', axis=1, drop_level=False) Out[42]: A a B d d 0 -0.635 0.576 1 1.012 -1.377 2 -0.477 -1.230 3 0.949 0.545 4 1.676 -2.881 5 0.085 -1.282 6 0.327 -0.146 7 0.139 0.851 8 -1.534 0.534 9 -1.885 -1.543
Method 4
Understanding how to access multi-indexed pandas DataFrame can help you with all kinds of task like that.
Copy paste this in your code to generate example:
# hierarchical indices and columns index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit']) columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type']) # mock some data data = np.round(np.random.randn(4, 6), 1) data[:, ::2] *= 10 data += 37 # create the DataFrame health_data = pd.DataFrame(data, index=index, columns=columns) health_data
Will give you table like this:
Standard access by column
health_data['Bob'] type HR Temp year visit 2013 1 22.0 38.6 2 52.0 38.3 2014 1 30.0 38.9 2 31.0 37.3 health_data['Bob']['HR'] year visit 2013 1 22.0 2 52.0 2014 1 30.0 2 31.0 Name: HR, dtype: float64 # filtering by column/subcolumn - your case: health_data['Bob']['HR']==22 year visit 2013 1 True 2 False 2014 1 False 2 False health_data['Bob']['HR'][2013] visit 1 22.0 2 52.0 Name: HR, dtype: float64 health_data['Bob']['HR'][2013][1] 22.0
Access by row
health_data.loc[2013] subject Bob Guido Sue type HR Temp HR Temp HR Temp visit 1 22.0 38.6 40.0 38.9 53.0 37.5 2 52.0 38.3 42.0 34.6 30.0 37.7 health_data.loc[2013,1] subject type Bob HR 22.0 Temp 38.6 Guido HR 40.0 Temp 38.9 Sue HR 53.0 Temp 37.5 Name: (2013, 1), dtype: float64 health_data.loc[2013,1]['Bob'] type HR 22.0 Temp 38.6 Name: (2013, 1), dtype: float64 health_data.loc[2013,1]['Bob']['HR'] 22.0
Slicing multi-index
idx=pd.IndexSlice health_data.loc[idx[:,1], idx[:,'HR']] subject Bob Guido Sue type HR HR HR year visit 2013 1 22.0 40.0 53.0 2014 1 30.0 52.0 45.0
Method 5
You can use DataFrame.loc
:
>>> df.loc[1]
Example
>>> print(df) result A B C 1 1 1 6 2 9 2 1 8 2 11 2 1 1 7 2 10 2 1 9 2 12 >>> print(df.loc[1]) result B C 1 1 6 2 9 2 1 8 2 11 >>> print(df.loc[2, 1]) result C 1 7 2 10
Method 6
Another option is:
filter1 = df.index.get_level_values('A') == 1 filter2 = df.index.get_level_values('B') == 4 df.iloc[filter1 & filter2] Out[11]: 0 A B 1 4 1
Method 7
You can use MultiIndex
slicing. For example:
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"], ["one", "two", "one", "two", "one", "two", "one", "two"]] tuples = list(zip(*arrays)) index = pd.MultiIndex.from_tuples(tuples, names=["A", "B"]) df = pd.DataFrame(np.random.randint(9, size=(8, 2)), index=index, columns=["col1", "col2"]) col1 col2 A B bar one 0 8 two 4 8 baz one 6 0 two 7 3 foo one 6 8 two 2 6 qux one 7 0 two 6 4
To select all from A
and two
from B
:
df.loc[(slice(None), 'two'), :]
Output:
col1 col2 A B bar two 4 8 baz two 7 3 foo two 2 6 qux two 6 4
To select bar
and baz
from A
and two
from B
:
df.loc[(['bar', 'baz'], 'two'), :]
Output:
col1 col2 A B bar two 4 8 baz two 7 3
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0