I’m working through “Python For Data Analysis” and I don’t understand one piece of behavior. Adding two pandas Series objects automatically aligns the data by index, but if a label is missing from one of the objects, the result for that label is NaN. For example, from the book:
from numpy import nan as NaN
from pandas import Series
a = Series([35000,71000,16000,5000],index=['Ohio','Texas','Oregon','Utah'])
b = Series([NaN,71000,16000,35000],index=['California', 'Texas', 'Oregon', 'Ohio'])
Result:
In [63]: a
Out[63]: Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
In [64]: b
Out[64]: California NaN
Texas 71000
Oregon 16000
Ohio 35000
When I add them together I get this…
In [65]: a+b
Out[65]: California NaN
Ohio 70000
Oregon 32000
Texas 142000
Utah NaN
So why is the Utah value NaN and not 5000? It seems that 5000+NaN should be 5000. What gives? I’m missing something, please explain.
Update:
In [92]: # fill NaN with zero
b = b.fillna(0)
b
Out[92]: California 0
Texas 71000
Oregon 16000
Ohio 35000
In [93]: a
Out[93]: Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
In [94]: # a is still good
a+b
Out[94]: California NaN
Ohio 70000
Oregon 32000
Texas 142000
Utah NaN
Answers:
Method 1
Pandas does not assume that 5000+NaN=5000, but it is easy to ask it to do that:
a.add(b, fill_value=0)
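As a minimal, self-contained sketch of that call on the Series from the question (the variable name `total` is only for illustration): `fill_value=0` stands in for a value that is missing on one side only, so a label that is missing on both sides still comes out as NaN.

```python
import numpy as np
import pandas as pd

a = pd.Series([35000, 71000, 16000, 5000],
              index=['Ohio', 'Texas', 'Oregon', 'Utah'])
b = pd.Series([np.nan, 71000, 16000, 35000],
              index=['California', 'Texas', 'Oregon', 'Ohio'])

# Utah exists only in a, so b's missing value is treated as 0.
# California is absent from a and NaN in b, so it stays NaN.
total = a.add(b, fill_value=0)
print(total)
```

Here Utah should come out as 5000.0, while California remains NaN because neither Series has a real value for it.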
Method 2
The default approach is to assume that any computation involving NaN gives NaN as the result. Anything plus NaN is NaN, anything divided by NaN is NaN, etc. If you want to fill the NaN with some value, you have to do that explicitly (as Method 1 does with fill_value).
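This is also why the fillna(0) attempt in the question's update changes nothing: 'Utah' is not a label in b at all, so there is no value in b to fill, and 'California' is likewise absent from a. The sketch below imitates the alignment step with reindex over the union of both indexes. That is my own illustration of the behavior, not a description of pandas internals, and names such as `labels` and `manual` are hypothetical.

```python
import numpy as np
import pandas as pd

a = pd.Series([35000, 71000, 16000, 5000],
              index=['Ohio', 'Texas', 'Oregon', 'Utah'])
b = pd.Series([np.nan, 71000, 16000, 35000],
              index=['California', 'Texas', 'Oregon', 'Ohio'])

print(5000 + np.nan)  # nan

# Reindex both Series to the union of their labels; labels missing
# from one Series are introduced as NaN at this point.
labels = a.index.union(b.index)
a_aligned = a.reindex(labels)   # gains California = NaN
b_aligned = b.reindex(labels)   # gains Utah = NaN
manual = a_aligned + b_aligned

# Filling b beforehand only touches labels b already has.
b_filled = b.fillna(0)
still_nan = a + b_filled
print(still_nan)
```

Both Utah and California should still be NaN in `still_nan`, because each label is missing from one of the two operands.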
Method 3
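If you need to add more than two Series, pd.concat() is a better fit, since it accepts any number of them. Summing across the columns with min_count=1 then skips NaN but still returns NaN for a row in which every value is missing.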
import pandas as pd
import numpy as np
a = pd.Series([35000,71000,16000,5000],index=['Ohio','Texas','Oregon','Utah'])
b = pd.Series([np.nan,71000,16000,35000],index=['California', 'Texas', 'Oregon', 'Ohio'])
pd.concat((a,b), axis=1).sum(1, min_count=1)
Output:
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah            5000.0
dtype: float64
Or with 3 series:
import pandas as pd
import numpy as np
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
a = pd.Series([1, np.nan, 4, 5])
b = pd.Series([3, np.nan, 5, np.nan])
c = pd.Series([np.nan, np.nan, np.nan, np.nan])
print(pd.concat((a,b,c), axis=1).sum(1, min_count=1))
#0 4.0
#1 NaN
#2 9.0
#3 5.0
#dtype: float64
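To see what min_count=1 contributes, here is a small sketch comparing it with the default (min_count=0) on two of the Series above. The names `table`, `default_sum` and `strict_sum` are only for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, np.nan, 4, 5])
b = pd.Series([3, np.nan, 5, np.nan])

table = pd.concat((a, b), axis=1)

# Default: NaN is skipped, so a row with no valid values sums to 0.0.
default_sum = table.sum(axis=1)

# min_count=1: a row needs at least one valid value, otherwise NaN.
strict_sum = table.sum(axis=1, min_count=1)

print(default_sum.loc[1], strict_sum.loc[1])
```

Row 1 has no valid values in either column, so it should be 0.0 with the default and NaN with min_count=1. Rows with at least one number are the same either way.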
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.