pandas concat generates NaN values

I am curious why a simple concatenation of two data frames in pandas:

shape: (66441, 1)
dtypes: prediction    int64
dtype: object
isnull().sum(): prediction    0
dtype: int64

shape: (66441, 1)
dtypes: CUSTOMER_ID    int64
dtype: object
isnull().sum(): CUSTOMER_ID    0
dtype: int64

of the same shape and both without NaN values

foo = pd.concat([initId, ypred], join='outer', axis=1)
print(foo.shape)
print(foo.isnull().sum())

can result in a lot of NaN values if joined.

(83384, 2)
CUSTOMER_ID    16943
prediction     16943

How can I fix this problem and prevent NaN values being introduced?

Trying to reproduce it with a small example like

import pandas as pd

aaa  = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'])
print(aaa)
bbb  = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
print(bbb)
pd.concat([aaa, bbb], axis=1)

failed, i.e. it worked just fine and no NaN values were introduced.

Answers:

Method 1

I think the problem is that the two DataFrames have different index values: concat with axis=1 aligns rows on the index, and wherever a label exists in only one DataFrame you get NaN:

aaa  = pd.DataFrame([0,1,0,1,0,0], columns=['prediction'], index=[4,5,8,7,10,12])
print(aaa)
    prediction
4            0
5            1
8            0
7            1
10           0
12           0

bbb  = pd.DataFrame([0,0,1,0,1,1], columns=['groundTruth'])
print(bbb)
   groundTruth
0            0
1            0
2            1
3            0
4            1
5            1

print(pd.concat([aaa, bbb], axis=1))
    prediction  groundTruth
0          NaN          0.0
1          NaN          0.0
2          NaN          1.0
3          NaN          0.0
4          0.0          1.0
5          1.0          1.0
7          1.0          NaN
8          0.0          NaN
10         0.0          NaN
12         0.0          NaN
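To check whether mismatched indexes are really the cause, it can help to compare the two indexes directly before concatenating. The sketch below reuses aaa and bbb from the example above; the names only_in_aaa, only_in_bbb and result are just illustrative. It assumes the index labels within each DataFrame are unique.

```python
import pandas as pd

aaa = pd.DataFrame([0, 1, 0, 1, 0, 0], columns=['prediction'],
                   index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame([0, 0, 1, 0, 1, 1], columns=['groundTruth'])

# False here means concat(axis=1) will have to align on differing labels
print(aaa.index.equals(bbb.index))

# Labels that exist in only one of the two DataFrames
only_in_aaa = aaa.index.difference(bbb.index)
only_in_bbb = bbb.index.difference(aaa.index)
print(len(only_in_aaa), len(only_in_bbb))

result = pd.concat([aaa, bbb], axis=1)

# The outer join has one row per label in the union of both indexes
print(len(result) == len(aaa.index.union(bbb.index)))

# Each column is NaN exactly for the labels its own DataFrame lacks
print(result.isnull().sum())
```

The numbers in the question fit the same pattern: 83384 - 66441 = 16943, which matches the 16943 NaN values reported per column, so each DataFrame most likely has 16943 index labels that the other one lacks. The NaN values are also why the integer columns show up as floats (0.0, 1.0) in the output above.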

The solution is reset_index, if the index values are not needed:

aaa.reset_index(drop=True, inplace=True)
bbb.reset_index(drop=True, inplace=True)

print(aaa)
   prediction
0           0
1           1
2           0
3           1
4           0
5           0

print(bbb)
   groundTruth
0            0
1            0
2            1
3            0
4            1
5            1

print(pd.concat([aaa, bbb], axis=1))
   prediction  groundTruth
0           0            0
1           1            0
2           0            1
3           1            0
4           0            1
5           0            1
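If you would rather not modify aaa and bbb in place, the same idea can be sketched without inplace=True by resetting the index only on the copies passed to concat. Note that this pairs rows purely by position, so it assumes both DataFrames are already in the same row order; the name result is illustrative.

```python
import pandas as pd

aaa = pd.DataFrame([0, 1, 0, 1, 0, 0], columns=['prediction'],
                   index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame([0, 0, 1, 0, 1, 1], columns=['groundTruth'])

# reset_index(drop=True) returns a new DataFrame with a fresh 0..n-1 index,
# so the originals keep their own index
result = pd.concat(
    [aaa.reset_index(drop=True), bbb.reset_index(drop=True)],
    axis=1,
)

print(result.shape)
print(result.isnull().sum().sum())  # total number of NaN values
print(list(aaa.index))              # aaa itself is unchanged
```

Because no NaN values are introduced, the columns also keep their integer dtype instead of being converted to float.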

EDIT: If you need to keep the index of aaa and both DataFrames have the same length, use:

bbb.index = aaa.index
print(pd.concat([aaa, bbb], axis=1))
    prediction  groundTruth
4            0            0
5            1            0
8            0            1
7            1            0
10           0            1
12           0            1
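A slightly more defensive variant of the same idea is sketched below: it checks the lengths first and uses set_index, which returns a new DataFrame, so bbb keeps its original index. The names aligned and result are illustrative, and as before this assumes the rows of both DataFrames correspond by position.

```python
import pandas as pd

aaa = pd.DataFrame([0, 1, 0, 1, 0, 0], columns=['prediction'],
                   index=[4, 5, 8, 7, 10, 12])
bbb = pd.DataFrame([0, 0, 1, 0, 1, 1], columns=['groundTruth'])

# Pairing rows by position only makes sense if the lengths match
if len(aaa) != len(bbb):
    raise ValueError("DataFrames must have the same number of rows")

# set_index with an existing Index returns a relabelled copy of bbb
aligned = bbb.set_index(aaa.index)

result = pd.concat([aaa, aligned], axis=1)
print(result)
print(list(bbb.index))  # bbb itself still has its 0..5 index
```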

Method 2

You can do something like this:

from pandas import concat

# Reset each index so that rows are paired by position, then join side by side.
# ignore_index=True along axis=1 replaces the column labels with 0..n-1,
# which is why the original names are restored below.
concatenated_dataframes = concat(
    [
        dataframe_1.reset_index(drop=True),
        dataframe_2.reset_index(drop=True),
        dataframe_3.reset_index(drop=True),
    ],
    axis=1,
    ignore_index=True,
)

# Collect the original column names in the same order as the frames above.
concatenated_dataframes.columns = [
    column
    for dataframe in (dataframe_1, dataframe_2, dataframe_3)
    for column in dataframe.columns
]

This concatenates multiple DataFrames while keeping the column names and avoiding NaN values.
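As a possible simplification: concat along axis=1 already carries over each DataFrame's column names when ignore_index is left at its default, so the renaming step may not be needed. The sketch below uses three small made-up DataFrames (dataframe_1 to dataframe_3 with hypothetical columns a, b and c) purely for illustration.

```python
import pandas as pd

# Hypothetical inputs with deliberately different indexes
dataframe_1 = pd.DataFrame({'a': [1, 2, 3]}, index=[10, 11, 12])
dataframe_2 = pd.DataFrame({'b': [4, 5, 6]}, index=[0, 1, 2])
dataframe_3 = pd.DataFrame({'c': [7, 8, 9]}, index=[5, 3, 1])

frames = [dataframe_1, dataframe_2, dataframe_3]

# Resetting every index pairs rows by position; column names are kept
concatenated_dataframes = pd.concat(
    [frame.reset_index(drop=True) for frame in frames],
    axis=1,
)

print(list(concatenated_dataframes.columns))
print(concatenated_dataframes.isnull().sum().sum())
```

If two of the DataFrames share a column name, the result will contain duplicate column labels, which is worth checking for before selecting columns by name.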

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
