While unit testing some functions, I’m trying to establish the equality of two DataFrames using Python pandas:
ipdb> expect
1 2
2012-01-01 00:00:00+00:00 NaN 3
2013-05-14 12:00:00+00:00 3 NaN
ipdb> df
identifier 1 2
timestamp
2012-01-01 00:00:00+00:00 NaN 3
2013-05-14 12:00:00+00:00 3 NaN
ipdb> df[1][0]
nan
ipdb> df[1][0], expect[1][0]
(nan, nan)
ipdb> df[1][0] == expect[1][0]
False
ipdb> df[1][1] == expect[1][1]
True
ipdb> type(df[1][0])
<type 'numpy.float64'>
ipdb> type(expect[1][0])
<type 'numpy.float64'>
ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])
ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False
Given that I’m trying to test all of expect against all of df, including NaN positions, what am I doing wrong?
What is the simplest way to compare equality of Series/DataFrames including NaNs?
Answers:
Method 1
You can use assert_frame_equal with check_names=False (so that the names of the index and columns are not checked), which raises an AssertionError if the frames are not equal:
In [11]: from pandas.testing import assert_frame_equal
In [12]: assert_frame_equal(df, expected, check_names=False)
You can wrap this in a function with something like:
def frames_equal(df, expected):
    # Hypothetical wrapper name; returns a bool instead of raising.
    try:
        assert_frame_equal(df, expected, check_names=False)
        return True
    except AssertionError:
        return False
More recent versions of pandas also provide this as the .equals method, which treats NaNs in the same location as equal:
df.equals(expected)
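As a minimal sketch of how .equals behaves with NaNs, the snippet below uses made-up frames (not the asker's data). It is only meant to illustrate that NaNs in matching positions compare equal, while a plain == does not treat them that way.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for illustration only.
df = pd.DataFrame({"a": [np.nan, 3.0], "b": [3.0, np.nan]})
expected = df.copy()
different = pd.DataFrame({"a": [np.nan, 3.0], "b": [np.nan, np.nan]})

same = df.equals(expected)        # NaNs in matching positions count as equal
not_same = df.equals(different)   # a NaN present in only one frame breaks equality
elementwise = bool((df == expected).values.all())  # plain == treats NaN as unequal

print(same, not_same, elementwise)
```

Note that .equals also requires corresponding columns to have the same dtype, so it can return False for frames whose values look identical.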
Method 2
One of the properties of NaN is that NaN != NaN is True.
Check out this answer for a nice way to do this using numexpr.
(a == b) | ((a != a) & (b != b))
says this (in pseudocode):
a == b or (isnan(a) and isnan(b))
So, either a equals b, or both a and b are NaN.
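The expression above can be sketched in plain pandas (without numexpr) as follows. The helper name nan_aware_equal and the sample frames are assumptions made for illustration, and the sketch assumes both frames carry identical index and column labels.

```python
import numpy as np
import pandas as pd

def nan_aware_equal(a, b):
    """Element-wise equality that also counts NaN vs NaN as equal.

    Hypothetical helper; assumes a and b have identical labels.
    """
    # a != a is True only where a is NaN, and likewise for b.
    return (a == b) | ((a != a) & (b != b))

a = pd.DataFrame([[np.nan, 1.0], [2.0, np.nan]])
b = a.copy()
c = pd.DataFrame([[np.nan, 1.0], [np.nan, np.nan]])  # extra NaN at row 1, column 0

mask_ab = nan_aware_equal(a, b)
mask_ac = nan_aware_equal(a, c)

# Collapse each boolean frame to a single answer.
all_ab = bool(mask_ab.values.all())
all_ac = bool(mask_ac.values.all())
print(all_ab, all_ac)
```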
For small frames, assert_frame_equal is fine. For large frames (10M rows), however, it was impractically slow: I had to interrupt it because it was taking so long.
In [1]: df = DataFrame(rand(1e7, 15))
In [2]: df = df[df > 0.5]
In [3]: df2 = df.copy()
In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)
In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop
timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:
In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop
Method 3
Like @PhillipCloud's answer, but written out in more detail
In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])
In [27]: df2 = df1.copy()
They really are equivalent
In [28]: result = df1 == df2
In [29]: result[pd.isnull(df1) & pd.isnull(df2)] = True
In [30]: result
Out[30]:
0 1
0 True True
1 True True
A nan in df2 that doesn’t exist in df1
In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])
In [32]: result = df1 == df2
In [33]: result[pd.isnull(df1) & pd.isnull(df2)] = True
In [34]: result
Out[34]:
0 1
0 True True
1 False True
You can also fill with a value you know not to be in the frame
In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]:
0 1
0 True True
1 True True
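The fill-value approach depends on the sentinel really being absent from the data. The following sketch, using made-up frames and an assumed sentinel of -999, shows what can go wrong when that assumption fails and how an explicit NaN mask avoids it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: df_a really contains the sentinel, df_b has a NaN there.
SENTINEL = -999
df_a = pd.DataFrame({"x": [1.0, SENTINEL]})
df_b = pd.DataFrame({"x": [1.0, np.nan]})

# After filling, the genuine -999 and the filled NaN are indistinguishable,
# so the frames wrongly compare as equal.
filled_equal = bool((df_a.fillna(SENTINEL) == df_b.fillna(SENTINEL)).values.all())

# Comparing NaN positions explicitly keeps the two cases apart.
both_nan = df_a.isnull() & df_b.isnull()
mask_equal = bool(((df_a == df_b) | both_nan).values.all())

print(filled_equal, mask_equal)
```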
Method 4
Any equality comparison using == with np.nan is False; even np.nan == np.nan is False.
The simplest approach is df1.fillna('NULL') == df2.fillna('NULL'), provided 'NULL' is not a value in the original data.
To be safe, do the following:
Example a) Compare two dataframes with NaN values
bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()
Example b) Filter rows in df1 that do not match with df2
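bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]
(Note: using == instead of & in the mask is wrong. bools[pd.isnull(df1) == pd.isnull(df2)] = False would also clear every cell where both values are non-null, hiding real mismatches.)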
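To see why the == form of the mask noted above is wrong, here is a small sketch with made-up frames that differ in exactly one non-null cell. The variable names good and bad are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: row 0 matches (including the NaN), row 1 differs in column 0.
df1 = pd.DataFrame([[np.nan, 1.0], [2.0, 5.0]])
df2 = pd.DataFrame([[np.nan, 1.0], [3.0, 5.0]])

# Correct: only cells where BOTH frames are NaN are treated as equal.
good = (df1 == df2)
good[pd.isnull(df1) & pd.isnull(df2)] = True

# Incorrect: isnull(df1) == isnull(df2) is also True where both cells are
# non-null, so the real mismatch (2.0 vs 3.0) is overwritten with True.
bad = (df1 == df2)
bad[pd.isnull(df1) == pd.isnull(df2)] = True

good_all = bool(good.all().all())
bad_all = bool(bad.all().all())
print(good_all, bad_all)
```

The & mask reports the frames as different, while the == mask reports them as equal even though one value differs.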
Method 5
df.fillna(0) == df2.fillna(0)
You can use fillna(). Documentation here.
from pandas import DataFrame
# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df
# comparison fails!
print(df == df2)
# all is well (as long as 0 is not a real value in the data)
print(df.fillna(0) == df2.fillna(0))
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.