Selecting Pandas Columns by dtype

I was wondering if there is an elegant and shorthand way in Pandas DataFrames to select columns by data type (dtype). i.e. Select only int64 columns from a DataFrame.

To elaborate, something along the lines of

df.select_columns(dtype=float64)

Thanks in advance for the help

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Method 5

Method 6

Method 7

Method 8

Method 9

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Since 0.14.1 there’s a select_dtypes method so you can do this more elegantly/generally.

In [11]: df = pd.DataFrame([[1, 2.2, 'three']], columns=['A', 'B', 'C'])

In [12]: df.select_dtypes(include=['int'])
Out[12]:
   A
0  1

To select all numeric types use the numpy dtype numpy.number

In [13]: df.select_dtypes(include=[np.number])
Out[13]:
   A    B
0  1  2.2

In [14]: df.select_dtypes(exclude=[object])
Out[14]:
   A    B
0  1  2.2

Method 2

df.loc[:, df.dtypes == np.float64]

Method 3

df.select_dtypes(include=[np.float64])

Method 4

I’d like to extend existing answer by adding options for selecting all floating dtypes or all integer dtypes:

Demo:

np.random.seed(1234)

df = pd.DataFrame({
        'a':np.random.rand(3), 
        'b':np.random.rand(3).astype('float32'), 
        'c':np.random.randint(10,size=(3)).astype('int16'),
        'd':np.arange(3).astype('int32'), 
        'e':np.random.randint(10**7,size=(3)).astype('int64'),
        'f':np.random.choice([True, False], 3),
        'g':pd.date_range('2000-01-01', periods=3)
     })

yields:

In [2]: df
Out[2]:
          a         b  c  d        e      f          g
0  0.191519  0.785359  6  0  7578569  False 2000-01-01
1  0.622109  0.779976  8  1  7981439   True 2000-01-02
2  0.437728  0.272593  0  2  2558462   True 2000-01-03

In [3]: df.dtypes
Out[3]:
a           float64
b           float32
c             int16
d             int32
e             int64
f              bool
g    datetime64[ns]
dtype: object

Selecting all floating number columns:

In [4]: df.select_dtypes(include=['floating'])
Out[4]:
          a         b
0  0.191519  0.785359
1  0.622109  0.779976
2  0.437728  0.272593

In [5]: df.select_dtypes(include=['floating']).dtypes
Out[5]:
a    float64
b    float32
dtype: object

Selecting all integer number columns:

In [6]: df.select_dtypes(include=['integer'])
Out[6]:
   c  d        e
0  6  0  7578569
1  8  1  7981439
2  0  2  2558462

In [7]: df.select_dtypes(include=['integer']).dtypes
Out[7]:
c    int16
d    int32
e    int64
dtype: object

Selecting all numeric columns:

In [8]: df.select_dtypes(include=['number'])
Out[8]:
          a         b  c  d        e
0  0.191519  0.785359  6  0  7578569
1  0.622109  0.779976  8  1  7981439
2  0.437728  0.272593  0  2  2558462

In [9]: df.select_dtypes(include=['number']).dtypes
Out[9]:
a    float64
b    float32
c      int16
d      int32
e      int64
dtype: object

Method 5

Multiple includes for selecting columns with list of types for example- float64 and int64

df_numeric = df.select_dtypes(include=[np.float64,np.int64])

Method 6

If you want to select int64 columns and then update “in place”, you can use:

int64_cols = [col for col in df.columns if is_int64_dtype(df[col].dtype)]
df[int64_cols]

For example, notice that I update all the int64 columns in df to zero below:

In [1]:

    import pandas as pd
    from pandas.api.types import is_int64_dtype

    df = pd.DataFrame({'a': [1, 2] * 3,
                       'b': [True, False] * 3,
                       'c': [1.0, 2.0] * 3,
                       'd': ['red','blue'] * 3,
                       'e': pd.Series(['red','blue'] * 3, dtype="category"),
                       'f': pd.Series([1, 2] * 3, dtype="int64")})

    int64_cols = [col for col in df.columns if is_int64_dtype(df[col].dtype)] 
    print('int64 Cols: ',int64_cols)

    print(df[int64_cols])

    df[int64_cols] = 0

    print(df[int64_cols]) 

Out [1]:

    int64 Cols:  ['a', 'f']

           a  f
        0  1  1
        1  2  2
        2  1  1
        3  2  2
        4  1  1
        5  2  2
           a  f
        0  0  0
        1  0  0
        2  0  0
        3  0  0
        4  0  0
        5  0  0

Just for completeness:

df.loc() and df.select_dtypes() are going to give a copy of a slice from the dataframe. This means that if you try to update values from df.select_dtypes(), you will get a SettingWithCopyWarning and no updates will happen to df in place.

For example, notice when I try to update df using .loc() or .select_dtypes() to select columns, nothing happens:

In [2]:

    df = pd.DataFrame({'a': [1, 2] * 3,
                       'b': [True, False] * 3,
                       'c': [1.0, 2.0] * 3,
                       'd': ['red','blue'] * 3,
                       'e': pd.Series(['red','blue'] * 3, dtype="category"),
                       'f': pd.Series([1, 2] * 3, dtype="int64")})

    df_bool = df.select_dtypes(include='bool')
    df_bool.b[0] = False

    print(df_bool.b[0])
    print(df.b[0])

    df.loc[:, df.dtypes == np.int64].a[0]=7
    print(df.a[0])

Out [2]:

    False
    True
    1

Method 7

select_dtypes(include=[np.int])

Method 8

Optionally if you don’t want to create a subset of the dataframe during the process, you can directly iterate through the column datatype.

I haven’t benchmarked the code below, assume it will be faster if you work on very large dataset.

[col for col in df.columns.tolist() if df[col].dtype not in ['object','<M8[ns]']]

Method 9

You can use :

for i in x.columns[x.dtypes == 'object']:
    print(i)

incase you just want to display only the column names of a particular dataframe rather than a sliced dataframe. Don’t know if any function as such exits for python.

PS : replace object with the datatype you want.

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating