Subclassing Pandas classes seems a common need, but I could not find references on the subject. (It seems that Pandas developers are still working on it: Easier subclassing #60.)
There are some SO questions on the subject, but I am hoping that someone here can provide a more systematic account of the current best way to subclass pandas.DataFrame so that it satisfies two general requirements:
- calling standard DataFrame methods on instances of MyDF should produce instances of MyDF
- calling standard DataFrame methods on instances of MyDF should leave all attributes still attached to the output
(And are there any significant differences for subclassing pandas.Series?)
Code for subclassing pd.DataFrame:
import numpy as np
import pandas as pd
class MyDF(pd.DataFrame):
    # how to subclass pandas DataFrame?
    pass
mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf)) # <class '__main__.MyDF'>
# Requirement 1: Instances of MyDF, when calling standard methods of DataFrame,
# should produce instances of MyDF.
mydf_sub = mydf[['A','C']]
print(type(mydf_sub)) # <class 'pandas.core.frame.DataFrame'>
# Requirement 2: Attributes attached to instances of MyDF, when calling standard
# methods of DataFrame, should still attach to the output.
mydf.myattr = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(hasattr(mydf_cp1, 'myattr')) # False
print(hasattr(mydf_cp2, 'myattr')) # False
Answers:
Method 1
There is now an official guide on how to subclass Pandas data structures, which includes DataFrame as well as Series.
The guide is available here: https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas
The guide mentions this subclassed DataFrame from the Geopandas project as a good example: https://github.com/geopandas/geopandas/blob/master/geopandas/geodataframe.py
As in HYRY’s answer, it seems there are two things you’re trying to accomplish:
- When calling methods on an instance of your class, return instances of the correct type (your type). For this, you can just add the _constructor property, which should return your type.
- Adding attributes which will be attached to copies of your object. To do this, you need to store the names of these attributes in a list, as the special _metadata attribute.
Here’s an example:
from pandas import DataFrame

class SubclassedDataFrame(DataFrame):
    _metadata = ['added_property']
    added_property = 1  # This will be passed to copies

    @property
    def _constructor(self):
        # pandas uses this to build the results of its own methods
        return SubclassedDataFrame
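To check this against the two requirements from the question, here is a minimal sketch. The column names and the value stored in added_property are made up for illustration, and the class-level default is left out so the value is clearly set per instance. It assumes a pandas version in which column selection and copy() pass their result through __finalize__, which is what copies the names listed in _metadata.

```python
import numpy as np
import pandas as pd


class SubclassedDataFrame(pd.DataFrame):
    # Attribute names listed here are copied onto results by __finalize__
    _metadata = ["added_property"]

    @property
    def _constructor(self):
        return SubclassedDataFrame


df = SubclassedDataFrame(np.random.randn(3, 4), columns=["A", "B", "C", "D"])
df.added_property = "some metadata"  # illustrative value

subset = df[["A", "C"]]  # Requirement 1: the result keeps the subclass type
copied = df.copy()       # Requirement 2: the attribute follows the copy

print(type(subset).__name__)
print(subset.added_property, copied.added_property)
```

Propagation depends on pandas calling __finalize__ on the result, so operations that combine several objects may not carry the attribute over. It is worth testing the specific methods you rely on.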
Method 2
For Requirement 1, just define _constructor:
import pandas as pd
import numpy as np
class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDF

mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))      # <class '__main__.MyDF'>
mydf_sub = mydf[['A','C']]
print(type(mydf_sub))  # <class '__main__.MyDF'>
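On the Series part of the question: _constructor only covers results with the same number of dimensions. Selecting a single column returns a Series, and turning a Series back into a frame returns a DataFrame, so the pandas subclassing guide pairs _constructor with _constructor_sliced and _constructor_expanddim. A minimal sketch, where MySeries is a hypothetical companion class that is not part of the original answer:

```python
import numpy as np
import pandas as pd


class MySeries(pd.Series):
    @property
    def _constructor(self):
        # Same-dimension results (e.g. slicing a Series) stay MySeries
        return MySeries

    @property
    def _constructor_expanddim(self):
        # Results that gain a dimension (e.g. to_frame) become MyDF
        return MyDF


class MyDF(pd.DataFrame):
    @property
    def _constructor(self):
        # Same-dimension results stay MyDF
        return MyDF

    @property
    def _constructor_sliced(self):
        # Results that lose a dimension (e.g. one column) become MySeries
        return MySeries


mydf = MyDF(np.random.randn(3, 4), columns=["A", "B", "C", "D"])
col = mydf["A"]
frame = col.to_frame()

print(type(col).__name__)
print(type(frame).__name__)
```

The two classes refer to each other only inside property bodies, so the order in which they are defined does not matter.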
I think there is no simple solution for Requirement 2. I think you need to define __init__, copy, or do something in _constructor, for example:
import pandas as pd
import numpy as np
class MyDF(pd.DataFrame):
    # comma-separated names of the custom attributes to carry over
    _attributes_ = "myattr1,myattr2"

    def __init__(self, *args, **kw):
        super(MyDF, self).__init__(*args, **kw)
        # MyDF(other_mydf): copy the custom attributes from the source
        if len(args) == 1 and isinstance(args[0], MyDF):
            args[0]._copy_attrs(self)

    def _copy_attrs(self, df):
        for attr in self._attributes_.split(","):
            df.__dict__[attr] = getattr(self, attr, None)

    @property
    def _constructor(self):
        # return a factory that builds a MyDF and copies the attributes onto it
        def f(*args, **kw):
            df = MyDF(*args, **kw)
            self._copy_attrs(df)
            return df
        return f

mydf = MyDF(np.random.randn(3,4), columns=['A','B','C','D'])
print(type(mydf))
mydf_sub = mydf[['A','C']]
print(type(mydf_sub))
mydf.myattr1 = 1
mydf_cp1 = MyDF(mydf)
mydf_cp2 = mydf.copy()
print(mydf_cp1.myattr1, mydf_cp2.myattr1)
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.