Code example:

```python
In [171]: A = np.array([1.1, 1.1, 3.3, 3.3, 5.5, 6.6])

In [172]: B = np.array([111, 222, 222, 333, 333, 777])

In [173]: C = randint(10, 99, 6)

In [174]: df = pd.DataFrame(zip(A, B, C), columns=['A', 'B', 'C'])

In [175]: df.set_index(['A', 'B'], inplace=True)

In [176]: df
Out[176]:
          C
A   B
1.1 111  20
    222  31
3.3 222  24
    333  65
5.5 333  22
6.6 777  74
```

Now, I want to retrieve A values:

**Q1**: in range [3.3, 6.6] – expected return value: [3.3, 5.5, 6.6] or [3.3, 3.3, 5.5, 6.6] if the upper bound is inclusive, and [3.3, 5.5] or [3.3, 3.3, 5.5] if not.

**Q2**: in range [2.0, 4.0] – expected return value: [3.3] or [3.3, 3.3]

Same for any other *MultiIndex* dimension, for example B values:

**Q3**: in range [111, 500] with repetitions, as number of data rows in range – expected return value: [111, 222, 222, 333, 333]

More formal:

Let us assume T is a table with columns A, B and C and *n* rows. The table cells are numbers, for example A holds doubles and B and C hold integers. Let's create a *DataFrame* from table T and name it DF. Let's set columns A and B as the index of DF without duplication (that is, A and B exist only as index levels, not also as data columns), so A and B form a *MultiIndex*.

Questions:

- How do I write a query on the index, for example on level A (or B), for the label interval [120.0, 540.0], when the labels 120.0 and 540.0 exist? To be clear, I am interested only in the list of index values as the response to the query!
- How do I do the same when the labels 120.0 and 540.0 do not exist, but there are labels lower than 120, between 120 and 540, or higher than 540?
- If the answers to the two questions above return unique index values, how do I get the same result with repetitions, one value per data row in the index range?

I know the answers to the above questions for columns that are not indexes. For indexes, however, I did not succeed, even after long research on the web and experimentation with the functionality of *pandas*. The only method (without additional programming) I see now is to keep a duplicate of A and B as data columns in addition to the index.

## Answers:


### Method 1

To query the *df* by the *MultiIndex* values, for example where *(A > 1.7) and (B < 666)*:

```python
In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]

In [537]: result_df
Out[537]:
          C
A   B
3.3 222  43
    333  59
5.5 333  56
```

Hence, to get for example the *‘A’* index values, if still required:

```python
In [538]: result_df.index.get_level_values('A')
Out[538]: Index([3.3, 3.3, 5.5], dtype=object)
```
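The same idea can be wrapped in a small helper that covers Q1 to Q3. This is only a sketch: `level_values_in_range` and its `closed` and `unique` parameters are hypothetical names introduced here, and the `C` values are copied from the question's output as fixed placeholders instead of calling `randint`.

```python
import pandas as pd

# Frame from the question, with fixed C values so the example is reproducible.
df = pd.DataFrame({
    "A": [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
    "B": [111, 222, 222, 333, 333, 777],
    "C": [20, 31, 24, 65, 22, 74],
}).set_index(["A", "B"])


def level_values_in_range(frame, level, low, high, closed=True, unique=False):
    """Hypothetical helper: labels of one index level between low and high."""
    values = frame.index.get_level_values(level)
    if closed:
        mask = (values >= low) & (values <= high)   # both endpoints included
    else:
        mask = (values > low) & (values < high)     # both endpoints excluded
    selected = values[mask]
    # With unique=True each label appears once; otherwise once per data row.
    return selected.unique().tolist() if unique else selected.tolist()


q1_rows = level_values_in_range(df, "A", 3.3, 6.6)
q1_unique = level_values_in_range(df, "A", 3.3, 6.6, unique=True)
q2_rows = level_values_in_range(df, "A", 2.0, 4.0)      # endpoints are not labels
q3_rows = level_values_in_range(df, "B", 111, 500)
print(q1_rows, q1_unique, q2_rows, q3_rows)
```

Because the comparison is done on values rather than on label lookups, it does not matter whether the endpoints exist in the index.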

The problem is that on large data frames, selection *by index* is about 10% slower than the same selection on regular columns of the sorted frame, and in repetitive work such as looping the delay accumulates. See this example:

```python
In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)

In [559]: len(df)
Out[559]: 12857

In [560]: df.sort(inplace=True)

In [561]: df_without_index = df.reset_index()

In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)]
1000 loops, best of 3: 562 µs per loop

In [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)]
1000 loops, best of 3: 507 µs per loop
```
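The store used above is not available, so the comparison can only be approximated. The sketch below rebuilds it on synthetic data: the row count and the `START_TIME`/`END_TIME` names come from the session above, while the random values, the `VALUE` column and the function names are assumptions. Timings will differ by machine and pandas version, so no particular result is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the stored frame (assumed data, same row count).
rng = np.random.default_rng(0)
n = 12857
start = rng.integers(0, 720000, size=n)
frame = pd.DataFrame({
    "START_TIME": start,
    "END_TIME": start + rng.integers(1, 3600, size=n),
    "VALUE": rng.random(n),
})

indexed = frame.set_index(["START_TIME", "END_TIME"]).sort_index()
flat = indexed.reset_index()   # same rows, index levels turned into columns


def select_by_index():
    idx = indexed.index
    return indexed.loc[(idx.get_level_values("END_TIME") > 358200)
                       & (idx.get_level_values("START_TIME") < 361680)]


def select_by_columns():
    return flat[(flat.END_TIME > 358200) & (flat.START_TIME < 361680)]


t_index = timeit.timeit(select_by_index, number=50)
t_columns = timeit.timeit(select_by_columns, number=50)
print(f"by index:   {t_index / 50 * 1e6:.0f} µs per call")
print(f"by columns: {t_columns / 50 * 1e6:.0f} µs per call")
```

Both functions select the same rows, so any difference in the printed numbers reflects only the cost of extracting the level values from the index.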

### Method 2

**For better readability**, we can simply use the `query()` method, to avoid the lengthy `df.index.get_level_values()` and `reset_index`/`set_index` to and fro.

Here is the target `DataFrame`:

```python
In [12]: df
Out[12]:
          C
A   B
1.1 111  68
    222  40
3.3 222  20
    333  11
5.5 333  80
6.6 777  51
```

Answer for **Q1** (`A` in range `[3.3, 6.6]`):

```python
In [13]: df.query('3.3 <= A <= 6.6')  # for closed interval
Out[13]:
          C
A   B
3.3 222  20
    333  11
5.5 333  80
6.6 777  51

In [14]: df.query('3.3 < A < 6.6')  # for open interval
Out[14]:
          C
A   B
5.5 333  80
```

and of course one can play around with `<`, `<=`, `>`, `>=` for any kind of inclusion.

Similarly, answer for **Q2** (`A` in range `[2.0, 4.0]`):

```python
In [15]: df.query('2.0 <= A <= 4.0')
Out[15]:
          C
A   B
3.3 222  20
    333  11
```

Answer for **Q3** (`B` in range `[111, 500]`):

```python
In [16]: df.query('111 <= B <= 500')
Out[16]:
          C
A   B
1.1 111  68
    222  40
3.3 222  20
    333  11
5.5 333  80
```

Moreover, you can **combine** the conditions on index levels `A` and `B` very naturally:

```python
In [17]: df.query('0 < A < 4 and 150 < B < 400')
Out[17]:
          C
A   B
1.1 222  40
3.3 222  20
    333  11
```
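Since the question asks only for the index values, not the rows, the `query()` result can be followed by `get_level_values`. The sketch below assumes the bounds arrive as Python variables and passes them with the `@` prefix. `query_level` is a hypothetical helper name, and the `C` values are copied from the frame shown above.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
    "B": [111, 222, 222, 333, 333, 777],
    "C": [68, 40, 20, 11, 80, 51],
}).set_index(["A", "B"])


def query_level(frame, level, low, high):
    """Hypothetical helper: labels of `level` in the closed interval [low, high]."""
    # `@low` and `@high` refer to the local variables of this function;
    # the level name is interpolated because query() resolves it by name.
    selected = frame.query(f"@low <= {level} <= @high")
    return selected.index.get_level_values(level)


q2 = query_level(df, "A", 2.0, 4.0)
print(q2.tolist())            # one label per matching row
print(q2.unique().tolist())   # each label once

q3 = query_level(df, "B", 111, 500)
print(q3.tolist())
```

Interpolating the level name into the expression is only safe when the name is trusted and is a valid Python identifier.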

### Method 3

With a float-like index, you always want to treat it as a column rather than index into it directly. The following all work whether or not the endpoints exist.

```python
In [11]: df
Out[11]:
          C
A   B
1.1 111  81
    222  45
3.3 222  98
    333  13
5.5 333  89
6.6 777  98

In [12]: x = df.reset_index()
```

Q1

```python
In [13]: x.loc[(x.A>=3.3)&(x.A<=6.6)]
Out[13]:
     A    B   C
2  3.3  222  98
3  3.3  333  13
4  5.5  333  89
5  6.6  777  98
```

Q2

```python
In [14]: x.loc[(x.A>=2.0)&(x.A<=4.0)]
Out[14]:
     A    B   C
2  3.3  222  98
3  3.3  333  13
```

Q3

```python
In [15]: x.loc[(x.B>=111.0)&(x.B<=500.0)]
Out[15]:
     A    B   C
0  1.1  111  81
1  1.1  222  45
2  3.3  222  98
3  3.3  333  13
4  5.5  333  89
```

If you want the indices back, just set them. This is a cheap operation.

```python
In [16]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B'])
Out[16]:
          C
A   B
1.1 111  81
    222  45
3.3 222  98
    333  13
5.5 333  89
```

If you REALLY want the actual index values

```python
In [5]: x.loc[(x.B>=111.0)&(x.B<=500.0)].set_index(['A','B']).index
Out[5]:
MultiIndex
[(1.1, 111), (1.1, 222), (3.3, 222), (3.3, 333), (5.5, 333)]
```
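A variation on the same column-based approach, shown here only as a sketch: `Series.between` replaces the pair of comparisons (it is inclusive on both ends by default), and `pd.MultiIndex.from_frame` rebuilds the index tuples without a `set_index` round trip. Both are swapped in for illustration and are not part of the answer above. The `C` values are copied from the frame shown earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.1, 1.1, 3.3, 3.3, 5.5, 6.6],
    "B": [111, 222, 222, 333, 333, 777],
    "C": [81, 45, 98, 13, 89, 98],
}).set_index(["A", "B"])

x = df.reset_index()                    # index levels become ordinary columns

# Q3: B in the closed interval [111, 500]
rows = x.loc[x["B"].between(111, 500)]

a_values = rows["A"].tolist()           # A labels, one per matching row
index_back = pd.MultiIndex.from_frame(rows[["A", "B"]])

print(a_values)
print(index_back.tolist())
```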

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.