Pandas DataFrames with NaNs equality comparison

In the context of unit testing some functions, I’m trying to establish the equality of 2 DataFrames using python pandas:

ipdb> expect
                            1   2
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df
identifier                  1   2
timestamp
2012-01-01 00:00:00+00:00 NaN   3
2013-05-14 12:00:00+00:00   3 NaN

ipdb> df[1][0]
nan

ipdb> df[1][0], expect[1][0]
(nan, nan)

ipdb> df[1][0] == expect[1][0]
False

ipdb> df[1][1] == expect[1][1]
True

ipdb> type(df[1][0])
<type 'numpy.float64'>

ipdb> type(expect[1][0])
<type 'numpy.float64'>

ipdb> (list(df[1]), list(expect[1]))
([nan, 3.0], [nan, 3.0])

ipdb> df1, df2 = (list(df[1]), list(expect[1])) ;; df1 == df2
False

Given that I’m trying to test the entire of expect against the entire of df, including NaN positions, what am I doing wrong?

What is the simplest way to compare equality of Series/DataFrames including NaNs?

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

You can use assert_frame_equals with check_names=False (so as not to check the index/columns names), which will raise if they are not equal:

In [11]: from pandas.testing import assert_frame_equal

In [12]: assert_frame_equal(df, expected, check_names=False)

You can wrap this in a function with something like:

try:
    assert_frame_equal(df, expected, check_names=False)
    return True
except AssertionError:
    return False

In more recent pandas this functionality has been added as .equals:

df.equals(expected)

Method 2

One of the properties of NaN is that NaN != NaN is True.

Check out this answer for a nice way to do this using numexpr.

(a == b) | ((a != a) & (b != b))

says this (in pseudocode):

a == b or (isnan(a) and isnan(b))

So, either a equals b, or both a and b are NaN.

If you have small frames then assert_frame_equal will be okay. However, for large frames (10M rows) assert_frame_equal is pretty much useless. I had to interrupt it, it was taking so long.

In [1]: df = DataFrame(rand(1e7, 15))

In [2]: df = df[df > 0.5]

In [3]: df2 = df.copy()

In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000000 entries, 0 to 9999999
Columns: 15 entries, 0 to 14
dtypes: float64(15)

In [5]: timeit (df == df2) | ((df != df) & (df2 != df2))
1 loops, best of 3: 598 ms per loop

timeit of the (presumably) desired single bool indicating whether the two DataFrames are equal:

In [9]: timeit ((df == df2) | ((df != df) & (df2 != df2))).values.all()
1 loops, best of 3: 687 ms per loop

Method 3

Like @PhillipCloud answer, but more written out

In [26]: df1 = DataFrame([[np.nan,1],[2,np.nan]])

In [27]: df2 = df1.copy()

They really are equivalent

In [28]: result = df1 == df2

In [29]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [30]: result
Out[30]: 
      0     1
0  True  True
1  True  True

A nan in df2 that doesn’t exist in df1

In [31]: df2 = DataFrame([[np.nan,1],[np.nan,np.nan]])

In [32]: result = df1 == df2

In [33]: result[pd.isnull(df1) == pd.isnull(df2)] = True

In [34]: result
Out[34]: 
       0     1
0   True  True
1  False  True

You can also fill with a value you know not to be in the frame

In [38]: df1.fillna(-999) == df1.fillna(-999)
Out[38]: 
      0     1
0  True  True
1  True  True

Method 4

Any equality comparison using == with np.NaN is False, even np.NaN == np.NaN is False.

Simply, df1.fillna('NULL') == df2.fillna('NULL'), if ‘NULL’ is not a value in the original data.

To be safe, do the following:

Example a) Compare two dataframes with NaN values

bools = (df1 == df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = True
assert bools.all().all()

Example b) Filter rows in df1 that do not match with df2

bools = (df1 != df2)
bools[pd.isnull(df1) & pd.isnull(df2)] = False
df_outlier = df1[bools.all(axis=1)]

(Note: this is wrong – bools[pd.isnull(df1) == pd.isnull(df2)] = False)

Method 5

df.fillna(0) == df2.fillna(0)

You can use fillna(). Documenation here.

from pandas import DataFrame

# create a dataframe with NaNs
df = DataFrame([{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}])
df2 = df

# comparison fails!
print df == df2

# all is well 
print df.fillna(0) == df2.fillna(0)


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x