I was wondering if there is an elegant, shorthand way in pandas to select DataFrame columns by data type (dtype), e.g. to select only the int64 columns from a DataFrame.
To elaborate, something along the lines of
df.select_columns(dtype=float64)
Thanks in advance for the help
Answers:
Method 1
Since pandas 0.14.1 there’s a select_dtypes method, so you can do this more elegantly and more generally.
In [11]: df = pd.DataFrame([[1, 2.2, 'three']], columns=['A', 'B', 'C'])
In [12]: df.select_dtypes(include=['int'])
Out[12]:
   A
0  1
To select all numeric types use the numpy dtype numpy.number
In [13]: df.select_dtypes(include=[np.number])
Out[13]:
   A    B
0  1  2.2
In [14]: df.select_dtypes(exclude=[object])
Out[14]:
   A    B
0  1  2.2
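As a small sketch of how include and exclude can be combined, and how to get just the column names out, something like the following should work. The frame and the extra bool column "D" are made up for illustration and are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Illustrative frame: same idea as above, plus a bool column "D" (an assumption).
df = pd.DataFrame({"A": [1], "B": [2.2], "C": ["three"], "D": [True]})

# Numeric columns only; bool is not a subtype of np.number, so "D" is left out.
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(numeric_cols)

# include and exclude together: numeric columns that are not floating point.
non_float_cols = df.select_dtypes(
    include=[np.number], exclude=["floating"]
).columns.tolist()
print(non_float_cols)
```

Calling .columns.tolist() on the result is handy when you only need the names and want to index the original frame afterwards.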
Method 2
df.loc[:, df.dtypes == np.float64]
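The same boolean-mask idea extends to several dtypes by OR-ing the comparisons. This is only a sketch on a made-up three-column frame; the column names are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with explicit dtypes so the result does not depend on platform.
df = pd.DataFrame({
    "a": np.array([0.1, 0.2], dtype="float64"),
    "b": np.array([0.3, 0.4], dtype="float32"),
    "c": np.array([1, 2], dtype="int64"),
})

# df.dtypes is a Series indexed by column name, so each comparison is a boolean mask.
mask = (df.dtypes == np.float64) | (df.dtypes == np.float32)

# Use the mask on the column axis of .loc to keep only the matching columns.
floats = df.loc[:, mask]
print(floats.columns.tolist())
```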
Method 3
df.select_dtypes(include=[np.float64])
Method 4
I’d like to extend the existing answers by adding options for selecting all floating-point dtypes or all integer dtypes:
Demo:
import numpy as np
import pandas as pd
np.random.seed(1234)
df = pd.DataFrame({
'a':np.random.rand(3),
'b':np.random.rand(3).astype('float32'),
'c':np.random.randint(10,size=(3)).astype('int16'),
'd':np.arange(3).astype('int32'),
'e':np.random.randint(10**7,size=(3)).astype('int64'),
'f':np.random.choice([True, False], 3),
'g':pd.date_range('2000-01-01', periods=3)
})
yields:
In [2]: df
Out[2]:
a b c d e f g
0 0.191519 0.785359 6 0 7578569 False 2000-01-01
1 0.622109 0.779976 8 1 7981439 True 2000-01-02
2 0.437728 0.272593 0 2 2558462 True 2000-01-03
In [3]: df.dtypes
Out[3]:
a float64
b float32
c int16
d int32
e int64
f bool
g datetime64[ns]
dtype: object
Selecting all floating number columns:
In [4]: df.select_dtypes(include=['floating'])
Out[4]:
a b
0 0.191519 0.785359
1 0.622109 0.779976
2 0.437728 0.272593
In [5]: df.select_dtypes(include=['floating']).dtypes
Out[5]:
a float64
b float32
dtype: object
Selecting all integer number columns:
In [6]: df.select_dtypes(include=['integer'])
Out[6]:
   c  d        e
0  6  0  7578569
1  8  1  7981439
2  0  2  2558462
In [7]: df.select_dtypes(include=['integer']).dtypes
Out[7]:
c    int16
d    int32
e    int64
dtype: object
Selecting all numeric columns:
In [8]: df.select_dtypes(include=['number'])
Out[8]:
a b c d e
0 0.191519 0.785359 6 0 7578569
1 0.622109 0.779976 8 1 7981439
2 0.437728 0.272593 0 2 2558462
In [9]: df.select_dtypes(include=['number']).dtypes
Out[9]:
a float64
b float32
c int16
d int32
e int64
dtype: object
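The same string-alias approach should also cover the non-numeric columns of a frame like the demo one. Below is a minimal sketch on a smaller, made-up frame (not the exact demo data), using the 'bool' and 'datetime' aliases and exclude.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the demo frame; values are illustrative only.
df = pd.DataFrame({
    "a": np.array([0.5, 1.5, 2.5]),
    "d": np.arange(3).astype("int32"),
    "f": [True, False, True],
    "g": pd.date_range("2000-01-01", periods=3),
})

# Boolean columns only.
bool_cols = df.select_dtypes(include=["bool"]).columns.tolist()

# Datetime columns only.
datetime_cols = df.select_dtypes(include=["datetime"]).columns.tolist()

# Everything that is not numeric (bool and datetime remain).
non_numeric_cols = df.select_dtypes(exclude=["number"]).columns.tolist()

print(bool_cols, datetime_cols, non_numeric_cols)
```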
Method 5
You can pass several dtypes to include to select columns matching any of a list of types, for example float64 and int64:
df_numeric = df.select_dtypes(include=[np.float64,np.int64])
Method 6
If you want to select int64 columns and then update “in place”, you can use:
int64_cols = [col for col in df.columns if is_int64_dtype(df[col].dtype)]
df[int64_cols]
For example, notice that I update all the int64 columns in df to zero below:
In [1]:
import pandas as pd
from pandas.api.types import is_int64_dtype
df = pd.DataFrame({'a': [1, 2] * 3,
'b': [True, False] * 3,
'c': [1.0, 2.0] * 3,
'd': ['red','blue'] * 3,
'e': pd.Series(['red','blue'] * 3, dtype="category"),
'f': pd.Series([1, 2] * 3, dtype="int64")})
int64_cols = [col for col in df.columns if is_int64_dtype(df[col].dtype)]
print('int64 Cols: ',int64_cols)
print(df[int64_cols])
df[int64_cols] = 0
print(df[int64_cols])
Out [1]:
int64 Cols: ['a', 'f']
a f
0 1 1
1 2 2
2 1 1
3 2 2
4 1 1
5 2 2
a f
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
Just for completeness:
df.loc[] with a boolean column mask and df.select_dtypes() both return a copy rather than a view of the DataFrame. This means that if you try to update values on the result of df.select_dtypes(), you may get a SettingWithCopyWarning and df itself will not be updated in place.
For example, notice that when I try to update df through a selection made with .loc[] or .select_dtypes(), nothing happens to df:
In [2]:
import numpy as np
df = pd.DataFrame({'a': [1, 2] * 3,
'b': [True, False] * 3,
'c': [1.0, 2.0] * 3,
'd': ['red','blue'] * 3,
'e': pd.Series(['red','blue'] * 3, dtype="category"),
'f': pd.Series([1, 2] * 3, dtype="int64")})
df_bool = df.select_dtypes(include='bool')
df_bool.b[0] = False
print(df_bool.b[0])
print(df.b[0])
df.loc[:, df.dtypes == np.int64].a[0]=7
print(df.a[0])
Out [2]:
False
True
1
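One way to combine the two ideas is to use select_dtypes only to find the column names and then assign through the original frame. This is a sketch with a made-up frame, not a definitive recipe.

```python
import pandas as pd

# Illustrative frame; "a" is forced to int64 so the example is platform independent.
df = pd.DataFrame({
    "a": pd.Series([1, 2, 1], dtype="int64"),
    "b": [True, False, True],
    "c": [1.0, 2.0, 1.0],
})

# select_dtypes returns a copy, so use it only to discover the column labels...
int64_cols = df.select_dtypes(include=["int64"]).columns

# ...and assign through df itself so the original frame is what gets modified.
df[int64_cols] = 0

print(df)
```

Because the assignment targets df directly, there is no intermediate copy whose changes could be lost.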
Method 7
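df.select_dtypes(include=[int])
Older answers write this as include=[np.int]. np.int was only an alias for the built-in int; it was deprecated in NumPy 1.20 and removed in NumPy 1.24, so it fails on current NumPy versions.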
Method 8
Alternatively, if you don’t want to create a subset of the DataFrame in the process, you can iterate over the column dtypes directly.
I haven’t benchmarked the code below, but I assume it will be faster if you work on a very large dataset.
[col for col in df.columns.tolist() if df[col].dtype not in ['object','<M8[ns]']]
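A variation on the same idea, assuming you want numeric columns rather than "everything except object and datetime", is to test each dtype with the helper pandas.api.types.is_numeric_dtype. The frame below is made up for the example, and this version is not benchmarked either.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame with one column of each kind.
df = pd.DataFrame({
    "n": [1, 2],
    "x": [0.5, 1.5],
    "s": ["red", "blue"],
    "t": pd.to_datetime(["2000-01-01", "2000-01-02"]),
})

# df.dtypes.items() yields (column name, dtype) pairs without slicing the frame.
numeric_cols = [col for col, dtype in df.dtypes.items() if is_numeric_dtype(dtype)]
print(numeric_cols)
```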
Method 9
You can use:
for i in x.columns[x.dtypes == 'object']:
    print(i)
in case you only want to display the column names of a particular DataFrame rather than a sliced DataFrame. I don’t know whether a built-in function for this exists in Python.
PS : replace object with the datatype you want.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0