I have a pandas dataframe with about 50 columns and >100 rows. I want to select columns 'col_x', 'col_y' where 'col_z' < m. Is there a simple way to do this, similar to df[df['col3'] < m] and df[['colx','coly']] but combined?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Let’s break down your problem. You want to
- Filter rows based on some boolean condition
- You want to select a subset of columns from the result.
For the first point, the condition you’d need is –
df["col_z"] < m
For the second requirement, you’d want to specify the list of columns that you need –
["col_x", "col_y"]
How would you combine these two to produce an expected output with pandas? The most straightforward way is using loc –
df.loc[df["col_z"] < m, ["col_x", "col_y"]]
The first argument selects rows, and the second argument selects columns.
More About loc
Think of this in terms of the relational algebra operations – selection and projection. If you’re from the SQL world, this would be a relatable equivalent. The above operation, in SQL syntax, would look like this –
SELECT col_x, col_y # projection on columns FROM df WHERE col_z < m # selection on rows
pandas loc allows you to specify index labels for selecting rows. For example, if you have a dataframe –
col_x col_y a 1 4 b 2 5 c 3 6
To select index a, and c, and col_x you’d use –
df.loc[['a', 'c'], ['col_x']] col_x a 1 c 3
Alternatively, for selecting by a boolean condition (using a series/array of bool values, as your original question asks), where all values in col_x are odd –
df.loc[(df.col_x % 2).ne(0), ['col_y']] col_y a 4 c 6
For details, df.col_x % 2 computes the modulus of each value with respect to 2. The ne(0) will then compare the value to 0, and return True if it isn’t (all odd numbers are selected like this). Here’s what that expression results in –
(df.col_x % 2).ne(0) a True b False c True Name: col_x, dtype: bool
Further Reading
- 10 Minutes to Pandas – Selection by Label
- Indexing and selecting data
- Selection with .loc in python
- pandas loc vs. iloc vs. ix vs. at vs. iat?
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0