What are the most common pandas ways to select/filter rows of a dataframe whose index is a MultiIndex?
- Slicing based on a single value/label
- Slicing based on multiple labels from one or more levels
- Filtering on boolean conditions and expressions
- Which methods are applicable in what circumstances
Assumptions for simplicity:
- input dataframe does not have duplicate index keys
- input dataframe below only has two levels. (Most solutions shown here generalize to N levels)
Example input:
mux = pd.MultiIndex.from_arrays([ list('aaaabbbbbccddddd'), list('tuvwtuvwtuvwtuvw') ], names=['one', 'two']) df = pd.DataFrame({'col': np.arange(len(mux))}, mux) col one two a t 0 u 1 v 2 w 3 b t 4 u 5 v 6 w 7 t 8 c u 9 v 10 d w 11 t 12 u 13 v 14 w 15
Question 1: Selecting a Single Item
How do I select rows having “a” in level “one”?
col one two a t 0 u 1 v 2 w 3
Additionally, how would I be able to drop level “one” in the output?
col two t 0 u 1 v 2 w 3
Question 1b
How do I slice all rows with value “t” on level “two”?
col one two a t 0 b t 4 t 8 d t 12
Question 2: Selecting Multiple Values in a Level
How can I select rows corresponding to items “b” and “d” in level “one”?
col one two b t 4 u 5 v 6 w 7 t 8 d w 11 t 12 u 13 v 14 w 15
Question 2b
How would I get all values corresponding to “t” and “w” in level “two”?
col one two a t 0 w 3 b t 4 w 7 t 8 d w 11 t 12 w 15
Question 3: Slicing a Single Cross Section (x, y)
How do I retrieve a cross section, i.e., a single row having a specific values for the index from df
? Specifically, how do I retrieve the cross section of ('c', 'u')
, given by
col one two c u 9
Question 4: Slicing Multiple Cross Sections [(a, b), (c, d), ...]
How do I select the two rows corresponding to ('c', 'u')
, and ('a', 'w')
?
col one two c u 9 a w 3
Question 5: One Item Sliced per Level
How can I retrieve all rows corresponding to “a” in level “one” or “t” in level “two”?
col one two a t 0 u 1 v 2 w 3 b t 4 t 8 d t 12
Question 6: Arbitrary Slicing
How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and for “d”, I would like to select rows with sub-level “w”.
col one two a u 1 v 2 b u 5 v 6 d w 11 w 15
Question 7 will use a unique setup consisting of a numeric level:
np.random.seed(0) mux2 = pd.MultiIndex.from_arrays([ list('aaaabbbbbccddddd'), np.random.choice(10, size=16) ], names=['one', 'two']) df2 = pd.DataFrame({'col': np.arange(len(mux2))}, mux2) col one two a 5 0 0 1 3 2 3 3 b 7 4 9 5 3 6 5 7 2 8 c 4 9 7 10 d 6 11 8 12 8 13 1 14 6 15
Question 7: Filtering by numeric inequality on individual levels of the multiindex
How do I get all rows where values in level “two” are greater than 5?
col one two b 7 4 9 5 c 7 10 d 6 11 8 12 8 13 6 15
Note: This post will not go through how to create MultiIndexes, how to perform assignment operations on them, or any performance related discussions (these are separate topics for another time).
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
MultiIndex / Advanced Indexing
Note
This post will be structured in the following manner:
- The questions put forth in the OP will be addressed, one by one
- For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.
Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details,
and other info cursory to the topic at hand. These notes have been
compiled through scouring the docs and uncovering various obscure
features, and from my own (admittedly limited) experience.All code samples have created and tested on pandas v0.23.4, python3.7. If something is not clear, or factually incorrect, or if you did not
find a solution applicable to your use case, please feel free to
suggest an edit, request clarification in the comments, or open a new
question, ….as applicable.
Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently re-visiting
-
DataFrame.loc
– A general solution for selection by label (+pd.IndexSlice
for more complex applications involving slices) -
DataFrame.xs
– Extract a particular cross section from a Series/DataFrame. -
DataFrame.query
– Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically. Is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes. -
Boolean indexing with a mask generated using
MultiIndex.get_level_values
(often in conjunction withIndex.isin
, especially when filtering with multiple values). This is also quite useful in some circumstances.
It will be beneficial to look at the various slicing and filtering problems in terms of the Four Idioms to gain a better understanding what can be applied to a given situation. It is very important to understand that not all of the idioms will work equally well (if at all) in every circumstance. If an idiom has not been listed as a potential solution to a problem below, that means that idiom cannot be applied to that problem effectively.
Question 1
How do I select rows having “a” in level “one”?
col one two a t 0 u 1 v 2 w 3
You can use loc
, as a general purpose solution applicable to most situations:
df.loc[['a']]
At this point, if you get
TypeError: Expected tuple, got str
That means you’re using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :]
.
Alternatively, you can use xs
here, since we are extracting a single cross section. Note the levels
and axis
arguments (reasonable defaults can be assumed here).
df.xs('a', level=0, axis=0, drop_level=False) # df.xs('a', drop_level=False)
Here, the drop_level=False
argument is needed to prevent xs
from dropping level “one” in the result (the level we sliced on).
Yet another option here is using query
:
df.query("one == 'a'")
If the index did not have a name, you would need to change your query string to be "ilevel_0 == 'a'"
.
Finally, using get_level_values
:
df[df.index.get_level_values('one') == 'a'] # If your levels are unnamed, or if you need to select by position (not label), # df[df.index.get_level_values(0) == 'a']
Additionally, how would I be able to drop level “one” in the output?
col two t 0 u 1 v 2 w 3
This can be easily done using either
df.loc['a'] # Notice the single string argument instead the list.
Or,
df.xs('a', level=0, axis=0, drop_level=True) # df.xs('a')
Notice that we can omit the drop_level
argument (it is assumed to be True
by default).
Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,v = df.loc[['a']] print(v) col one two a t 0 u 1 v 2 w 3 print(v.index) MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']], labels=[[0, 0, 0, 0], [0, 1, 2, 3]], names=['one', 'two'])You can get rid of these levels using
MultiIndex.remove_unused_levels
:v.index = v.index.remove_unused_levels()
print(v.index) MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']], labels=[[0, 0, 0, 0], [0, 1, 2, 3]], names=['one', 'two'])
Question 1b
How do I slice all rows with value “t” on level “two”?
col one two a t 0 b t 4 t 8 d t 12
Intuitively, you would want something involving slice()
:
df.loc[(slice(None), 't'), :]
It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice
API here.
idx = pd.IndexSlice df.loc[idx[:, 't'], :]
This is much, much cleaner.
Note
Why is the trailing slice:
across the columns required? This is because,loc
can be used to select and slice along both axes (axis=0
or
axis=1
). Without explicitly making it clear which axis the slicing
is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.If you want to remove any shade of ambiguity,
loc
accepts anaxis
parameter:df.loc(axis=0)[pd.IndexSlice[:, 't']]Without the
axis
parameter (i.e., just by doingdf.loc[pd.IndexSlice[:, 't']]
), slicing is assumed to be on the columns,
and aKeyError
will be raised in this circumstance.This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.
With xs
, it is
df.xs('t', axis=0, level=1, drop_level=False)
With query
, it is
df.query("two == 't'") # Or, if the first level has no name, # df.query("ilevel_1 == 't'")
And finally, with get_level_values
, you may do
df[df.index.get_level_values('two') == 't'] # Or, to perform selection by position/integer, # df[df.index.get_level_values(1) == 't']
All to the same effect.
Question 2
How can I select rows corresponding to items “b” and “d” in level “one”?
col one two b t 4 u 5 v 6 w 7 t 8 d w 11 t 12 u 13 v 14 w 15
Using loc, this is done in a similar fashion by specifying a list.
df.loc[['b', 'd']]
To solve the above problem of selecting “b” and “d”, you can also use query
:
items = ['b', 'd'] df.query("one in @items") # df.query("one == @items", parser='pandas') # df.query("one in ['b', 'd']") # df.query("one == ['b', 'd']", parser='pandas')
Note
Yes, the default parser is'pandas'
, but it is important to highlight this syntax isn’t conventionally python. The
Pandas parser generates a slightly different parse tree from the
expression. This is done to make some operations more intuitive to
specify. For more information, please read my post on
Dynamic Expression Evaluation in pandas using pd.eval().
And, with get_level_values
+ Index.isin
:
df[df.index.get_level_values("one").isin(['b', 'd'])]
Question 2b
How would I get all values corresponding to “t” and “w” in level “two”?
col one two a t 0 w 3 b t 4 w 7 t 8 d w 11 t 12 w 15
With loc
, this is possible only in conjuction with pd.IndexSlice
.
df.loc[pd.IndexSlice[:, ['t', 'w']], :]
The first colon :
in pd.IndexSlice[:, ['t', 'w']]
means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.
With query
, this is
items = ['t', 'w'] df.query("two in @items") # df.query("two == @items", parser='pandas') # df.query("two in ['t', 'w']") # df.query("two == ['t', 'w']", parser='pandas')
With get_level_values
and Index.isin
(similar to above):
df[df.index.get_level_values('two').isin(['t', 'w'])]
Question 3
How do I retrieve a cross section, i.e., a single row having a specific values
for the index fromdf
? Specifically, how do I retrieve the cross
section of('c', 'u')
, given bycol one two c u 9
Use loc
by specifying a tuple of keys:
df.loc[('c', 'u'), :]
Or,
df.loc[pd.IndexSlice[('c', 'u')]]
Note
At this point, you may run into aPerformanceWarning
that looks like this:PerformanceWarning: indexing past lexsort depth may impact performance.This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your
DataFrame in advance usingDataFrame.sort_index
. This is especially desirable from a performance standpoint if you plan on doing
multiple such queries in tandem:df_sort = df.sort_index() df_sort.loc[('c', 'u')]You can also use
MultiIndex.is_lexsorted()
to check whether the index
is sorted or not. This function returnsTrue
orFalse
accordingly.
You can call this function to determine whether an additional sorting
step is required or not.
With xs
, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:
df.xs(('c', 'u'))
With query
, things become a bit clunky:
df.query("one == 'c' and two == 'u'")
You can see now that this is going to be relatively difficult to generalize. But is still OK for this particular problem.
With accesses spanning multiple levels, get_level_values
can still be used, but is not recommended:
m1 = (df.index.get_level_values('one') == 'c') m2 = (df.index.get_level_values('two') == 'u') df[m1 & m2]
Question 4
How do I select the two rows corresponding to
('c', 'u')
, and('a', 'w')
?col one two c u 9 a w 3
With loc
, this is still as simple as:
df.loc[[('c', 'u'), ('a', 'w')]] # df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]
With query
, you will need to dynamically generate a query string by iterating over your cross sections and levels:
cses = [('c', 'u'), ('a', 'w')] levels = ['one', 'two'] # This is a useful check to make in advance. assert all(len(levels) == len(cs) for cs in cses) query = '(' + ') or ('.join([ ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)]) for cs in cses ]) + ')' print(query) # ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w')) df.query(query)
100% DO NOT RECOMMEND! But it is possible.
What if I have multiple levels?
One option in this scenario would be to use droplevel
to drop the levels you’re not checking, then use isin
to test membership, and then boolean index on the final result.
df[df.index.droplevel(unused_level).isin([('c', 'u'), ('a', 'w')])]
Question 5
How can I retrieve all rows corresponding to “a” in level “one” or
“t” in level “two”?col one two a t 0 u 1 v 2 w 3 b t 4 t 8 d t 12
This is actually very difficult to do with loc
while ensuring correctness and still maintaining code clarity. df.loc[pd.IndexSlice['a', 't']]
is incorrect, it is interpreted as df.loc[pd.IndexSlice[('a', 't')]]
(i.e., selecting a cross section). You may think of a solution with pd.concat
to handle each label separately:
pd.concat([ df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:] ]) col one two a t 0 u 1 v 2 w 3 t 0 # Does this look right to you? No, it isn't! b t 4 t 8 d t 12
But you’ll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do
v = pd.concat([ df.loc[['a'],:], df.loc[pd.IndexSlice[:, 't'],:] ]) v[~v.index.duplicated()]
But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.
With query
, this is stupidly simple:
df.query("one == 'a' or two == 't'")
With get_level_values
, this is still simple, but not as elegant:
m1 = (df.index.get_level_values('one') == 'a') m2 = (df.index.get_level_values('two') == 't') df[m1 | m2]
Question 6
How can I slice specific cross sections? For “a” and “b”, I would like to select all rows with sub-levels “u” and “v”, and
for “d”, I would like to select rows with sub-level “w”.col one two a u 1 v 2 b u 5 v 6 d w 11 w 15
This is a special case that I’ve added to help understand the applicability of the Four Idioms—this is one case where none of them will work effectively, since the slicing is very specific, and does not follow any real pattern.
Usually, slicing problems like this will require explicitly passing a list of keys to loc
. One way of doing this is with:
keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')] df.loc[keys, :]
If you want to save some typing, you will recognise that there is a pattern to slicing “a”, “b” and its sublevels, so we can separate the slicing task into two portions and concat
the result:
pd.concat([ df.loc[(('a', 'b'), ('u', 'v')), :], df.loc[('d', 'w'), :] ], axis=0)
Slicing specification for “a” and “b” is slightly cleaner (('a', 'b'), ('u', 'v'))
because the same sub-levels being indexed are the same for each level.
Question 7
How do I get all rows where values in level “two” are greater than 5?
col one two b 7 4 9 5 c 7 10 d 6 11 8 12 8 13 6 15
This can be done using query
,
df2.query("two > 5")
And get_level_values
.
df2[df2.index.get_level_values('two') > 5]
Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember thatloc
andxs
are specifically for label-based indexing, whilequery
and
get_level_values
are helpful for building general conditional masks
for filtering.
Bonus Question
What if I need to slice a
MultiIndex
column?
Actually, most solutions here are applicable to columns as well, with minor changes. Consider:
np.random.seed(0) mux3 = pd.MultiIndex.from_product([ list('ABCD'), list('efgh') ], names=['one','two']) df3 = pd.DataFrame(np.random.choice(10, (3, len(mux))), columns=mux3) print(df3) one A B C D two e f g h e f g h e f g h e f g h 0 5 0 3 3 7 9 3 5 2 4 7 6 8 8 1 6 1 7 7 8 1 5 9 8 9 4 3 0 3 5 0 2 3 2 8 1 3 3 3 7 0 1 9 9 0 4 7 3 2 7
These are the following changes you will need to make to the Four Idioms to have them working with columns.
-
To slice with
loc
, usedf3.loc[:, ....] # Notice how we slice across the index with `:`.
or,
df3.loc[:, pd.IndexSlice[...]]
-
To use
xs
as appropriate, just pass an argumentaxis=1
. -
You can access the column level values directly using
df.columns.get_level_values
. You will then need to do something likedf.loc[:, {condition}]
Where
{condition}
represents some condition built usingcolumns.get_level_values
. -
To use
query
, your only option is to transpose, query on the index, and transpose again:df3.T.query(...).T
Not recommended, use one of the other 3 options.
Method 2
Recently I came across a use case where I had a 3+ level multi-index dataframe in which I couldn’t make any of the solutions above produce the results I was looking for. It’s quite possible that the above solutions do of course work for my use case, and I tried several, however I was unable to get them to work with the time I had available.
I am far from expert, but I stumbled across a solution that was not listed in the comprehensive answers above. I offer no guarantee that the solutions are in any way optimal.
This is a different way to get a slightly different result to Question #6 above. (and likely other questions as well)
Specifically I was looking for:
- A way to choose two+ values from one level of the index and a single value from another level of the index, and
- A way to leave the index values from the previous operation in the dataframe output.
As a monkey wrench in the gears (however totally fixable):
- The indexes were unnamed.
On the toy dataframe below:
index = pd.MultiIndex.from_product([['a','b'], ['stock1','stock2','stock3'], ['price','volume','velocity']]) df = pd.DataFrame([1,2,3,4,5,6,7,8,9, 10,11,12,13,14,15,16,17,18], index) 0 a stock1 price 1 volume 2 velocity 3 stock2 price 4 volume 5 velocity 6 stock3 price 7 volume 8 velocity 9 b stock1 price 10 volume 11 velocity 12 stock2 price 13 volume 14 velocity 15 stock3 price 16 volume 17 velocity 18
Using the below works, of course:
df.xs(('stock1', 'velocity'), level=(1,2)) 0 a 3 b 12
But I wanted a different result, so my method to get that result was:
df.iloc[df.index.isin(['stock1'], level=1) & df.index.isin(['velocity'], level=2)] 0 a stock1 velocity 3 b stock1 velocity 12
And if I wanted two+ values from one level and a single (or 2+) value from another level:
df.iloc[df.index.isin(['stock1','stock3'], level=1) & df.index.isin(['velocity'], level=2)] 0 a stock1 velocity 3 stock3 velocity 9 b stock1 velocity 12 stock3 velocity 18
The above method is probably a bit clunky, however I found it filled my needs and as a bonus was easier for me to understand and read.
Method 3
This looks like a great case for dfsql
df.sql(<SQL select statement>)
https://github.com/mindsdb/dfsql
A complete article about it here:
Method 4
I have long used and appreciate this question, and @cs95‘s response, which is very thorough and handles all instances. Similar to @r-a‘s answer, I too wanted to find a way to work with multiple indices that contained multiple levels.
I finally found a way to obtain an arbitrary number of slices given a level or a named index, which is able to handle several of the questions proposed above. The major improvement here is not having to parse out slice(None)
or the :
with pd.IndexSlice
for multiple indexes, or slices.
import pandas as pd def slice_df_by(df_, slice_by=["Oman", "Nairobi",], slice_idx='country'): idxn = df_.index.names.index(slice_idx) return df_.loc[tuple([slice(None)]*idxn +[slice_by] ), :] gender = tuple(["male", "female"]*6) thrown = tuple(["rock", "scissors", "paper"]*4) country = tuple(["Nairobi", "Oman", "Djibouti", "Belize"]*3) names = tuple(["Chris", "Pat", "Michele", "Thomy", "Musa", "Casey"]*2) tuples = list(zip(gender, thrown, country, names)) idx = pd.MultiIndex.from_tuples(tuples, names=["gender", "thrown", "country", "name"]) df = pd.DataFrame({'Count A': [12., 70., 30., 20.]*3, 'Count B': [12., 70., 30., 20.]*3}, index=idx)
The benefit here is that you can add any combination of these calls to the function slice_df_by
to get more complicated slices while only using the index name and a list of values.
print(slice_df_by(df)) Count A Count B gender thrown country name female scissors Oman Pat 70.0 70.0 paper Oman Casey 70.0 70.0 rock Oman Thomy 70.0 70.0 male rock Nairobi Chris 12.0 12.0 scissors Nairobi Musa 12.0 12.0 paper Nairobi Michele 12.0 12.0
The catch, as @r-a pointed out is not having named indices. There are plenty of ways to satisfy this using the approach here, such as df.index.names = ["names", "for", "the", "indices"]
or some such method:
idxz = lambda ixln=4: [chr(i) for i in np.arange(ixln)+65] df.index.names = idxz(len(df.index.names)) print(idxz()) Out[132]: ['A', 'B', 'C', 'D']
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0