I have a Dataframe with a pandas MultiIndex:
In [1]: import pandas as pd
In [2]: multi_index = pd.MultiIndex.from_product([['CAN','USA'],['total']],names=['country','sex'])
In [3]: df = pd.DataFrame({'pop':[35,318]},index=multi_index)
In [4]: df
Out[4]:
pop
country sex
CAN total 35
USA total 318
Then I remove some rows from that DataFrame:
In [5]: df = df.query('pop > 100')
In [6]: df
Out[6]:
pop
country sex
USA total 318
But when I consult the MutliIndex, it still has both countries in its levels.
In [7]: df.index.levels[0] Out[7]: Index([u'CAN', u'USA'], dtype='object')
I can fix this myself in a rather strange way:
In [8]: idx_names = df.index.names
In [9]: df = df.reset_index(drop=False)
In [10]: df = df.set_index(idx_names)
In [11]: df
Out[11]:
pop
country sex
USA total 318
In [12]: df.index.levels[0]
Out[12]: Index([u'USA'], dtype='object')
But this seems rather messy. Is there a better way I’m missing?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
From version pandas 0.20.0+ use MultiIndex.remove_unused_levels:
print (df.index)
MultiIndex(levels=[['CAN', 'USA'], ['total']],
labels=[[1], [0]],
names=['country', 'sex'])
df.index = df.index.remove_unused_levels()
print (df.index)
MultiIndex(levels=[['USA'], ['total']],
labels=[[0], [0]],
names=['country', 'sex'])
Method 2
This is something that has bitten me before. Dropping columns or rows does NOT change the underlying MultiIndex, for performance and philosophical reasons, and this is officially not considered a bug (read more here). The short answer is that the developers say “that’s not what the MultiIndex is for”. If you need a list of the contents of a MultiIndex level after modification, for example for iteration or to check to see if something is included, you can use:
df.index.get_level_values(<levelname>)
This returns the current active values within that index level.
So I guess the “trick” here is that the API native way to do it is to use get_level_values instead of just .index or .columns
Method 3
I will be surprised if there is a more “built-in” way to eliminate the unused country than to re-create the index in the way you’re doing (or some similar way). If you look at your index before and after the slice:
In [165]: df.index
Out[165]:
MultiIndex(levels=[[u'CAN', u'USA'], [u'total']],
labels=[[0, 1], [0, 0]],
names=[u'country', u'sex'])
In [166]: df = df.query('pop > 100')
In [167]: df.index
Out[167]:
MultiIndex(levels=[[u'CAN', u'USA'], [u'total']],
labels=[[1], [0]],
names=[u'country', u'sex'])
you can see that the labels – which are indexes into the level values – have updated but not the level values. This may be an imperfect analogy, but it strikes me that the level values are analogous to an enumerated column in a database table, while the labels are analogous to the actual values of rows in the table. If you delete all the rows in a table with a value of “CAN”, it doesn’t change the fact that “CAN” is still a valid choice based on the column definition. To remove “CAN” from the enumeration, you have to alter the column definition; this is the equivalent of reindexing the dataframe in pandas.
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0