I have a multi-index data frame with columns ‘A’ and ‘B’.
Is there is a way to select rows by filtering on one column of the multi-index without resetting the index to a single column index?
For Example.
# has multi-index (A,B) df #can I do this? I know this doesn't work because the index is multi-index so I need to specify a tuple df.ix[df.A ==1]
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
One way is to use the get_level_values Index method:
In [11]: df
Out[11]:
0
A B
1 4 1
2 5 2
3 6 3
In [12]: df.iloc[df.index.get_level_values('A') == 1]
Out[12]:
0
A B
1 4 1
In 0.13 you’ll be able to use xs with drop_level argument:
df.xs(1, level='A', drop_level=False) # axis=1 if columns
Note: if this were column MultiIndex rather than index, you could use the same technique:
In [21]: df1 = df.T
In [22]: df1.iloc[:, df1.columns.get_level_values('A') == 1]
Out[22]:
A 1
B 4
0 1
Method 2
You can also use query which is very readable in my opinion and straightforward to use:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 50, 80], 'C': [6, 7, 8, 9]})
df = df.set_index(['A', 'B'])
C
A B
1 10 6
2 20 7
3 50 8
4 80 9
For what you had in mind you can now simply do:
df.query('A == 1')
C
A B
1 10 6
You can also have more complex queries using and
df.query('A >= 1 and B >= 50')
C
A B
3 50 8
4 80 9
and or
df.query('A == 1 or B >= 50')
C
A B
1 10 6
3 50 8
4 80 9
You can also query on different index levels, e.g.
df.query('A == 1 or C >= 8')
will return
C A B 1 10 6 3 50 8 4 80 9
If you want to use variables inside your query, you can use @:
b_threshold = 20
c_threshold = 8
df.query('B >= @b_threshold and C <= @c_threshold')
C
A B
2 20 7
3 50 8
Method 3
You can use DataFrame.xs():
In [36]: df = DataFrame(np.random.randn(10, 4))
In [37]: df.columns = [np.random.choice(['a', 'b'], size=4).tolist(), np.random.choice(['c', 'd'], size=4)]
In [38]: df.columns.names = ['A', 'B']
In [39]: df
Out[39]:
A b a
B d d d d
0 -1.406 0.548 -0.635 0.576
1 -0.212 -0.583 1.012 -1.377
2 0.951 -0.349 -0.477 -1.230
3 0.451 -0.168 0.949 0.545
4 -0.362 -0.855 1.676 -2.881
5 1.283 1.027 0.085 -1.282
6 0.583 -1.406 0.327 -0.146
7 -0.518 -0.480 0.139 0.851
8 -0.030 -0.630 -1.534 0.534
9 0.246 -1.558 -1.885 -1.543
In [40]: df.xs('a', level='A', axis=1)
Out[40]:
B d d
0 -0.635 0.576
1 1.012 -1.377
2 -0.477 -1.230
3 0.949 0.545
4 1.676 -2.881
5 0.085 -1.282
6 0.327 -0.146
7 0.139 0.851
8 -1.534 0.534
9 -1.885 -1.543
If you want to keep the A level (the drop_level keyword argument is only available starting from v0.13.0):
In [42]: df.xs('a', level='A', axis=1, drop_level=False)
Out[42]:
A a
B d d
0 -0.635 0.576
1 1.012 -1.377
2 -0.477 -1.230
3 0.949 0.545
4 1.676 -2.881
5 0.085 -1.282
6 0.327 -0.146
7 0.139 0.851
8 -1.534 0.534
9 -1.885 -1.543
Method 4
Understanding how to access multi-indexed pandas DataFrame can help you with all kinds of task like that.
Copy paste this in your code to generate example:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Will give you table like this:
Standard access by column
health_data['Bob']
type HR Temp
year visit
2013 1 22.0 38.6
2 52.0 38.3
2014 1 30.0 38.9
2 31.0 37.3
health_data['Bob']['HR']
year visit
2013 1 22.0
2 52.0
2014 1 30.0
2 31.0
Name: HR, dtype: float64
# filtering by column/subcolumn - your case:
health_data['Bob']['HR']==22
year visit
2013 1 True
2 False
2014 1 False
2 False
health_data['Bob']['HR'][2013]
visit
1 22.0
2 52.0
Name: HR, dtype: float64
health_data['Bob']['HR'][2013][1]
22.0
Access by row
health_data.loc[2013]
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
visit
1 22.0 38.6 40.0 38.9 53.0 37.5
2 52.0 38.3 42.0 34.6 30.0 37.7
health_data.loc[2013,1]
subject type
Bob HR 22.0
Temp 38.6
Guido HR 40.0
Temp 38.9
Sue HR 53.0
Temp 37.5
Name: (2013, 1), dtype: float64
health_data.loc[2013,1]['Bob']
type
HR 22.0
Temp 38.6
Name: (2013, 1), dtype: float64
health_data.loc[2013,1]['Bob']['HR']
22.0
Slicing multi-index
idx=pd.IndexSlice
health_data.loc[idx[:,1], idx[:,'HR']]
subject Bob Guido Sue
type HR HR HR
year visit
2013 1 22.0 40.0 53.0
2014 1 30.0 52.0 45.0
Method 5
You can use DataFrame.loc:
>>> df.loc[1]
Example
>>> print(df)
result
A B C
1 1 1 6
2 9
2 1 8
2 11
2 1 1 7
2 10
2 1 9
2 12
>>> print(df.loc[1])
result
B C
1 1 6
2 9
2 1 8
2 11
>>> print(df.loc[2, 1])
result
C
1 7
2 10
Method 6
Another option is:
filter1 = df.index.get_level_values('A') == 1
filter2 = df.index.get_level_values('B') == 4
df.iloc[filter1 & filter2]
Out[11]:
0
A B
1 4 1
Method 7
You can use MultiIndex slicing. For example:
arrays = [["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=["A", "B"])
df = pd.DataFrame(np.random.randint(9, size=(8, 2)), index=index, columns=["col1", "col2"])
col1 col2
A B
bar one 0 8
two 4 8
baz one 6 0
two 7 3
foo one 6 8
two 2 6
qux one 7 0
two 6 4
To select all from A and two from B:
df.loc[(slice(None), 'two'), :]
Output:
col1 col2 A B bar two 4 8 baz two 7 3 foo two 2 6 qux two 6 4
To select bar and baz from A and two from B:
df.loc[(['bar', 'baz'], 'two'), :]
Output:
col1 col2 A B bar two 4 8 baz two 7 3
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0
