I have two columns in a Pandas data frame that are dates.
I am looking to subtract one column from another and the result being the difference in numbers of days as an integer.
A peek at the data:
df_test.head(10) Out[20]: First_Date Second Date 0 2016-02-09 2015-11-19 1 2016-01-06 2015-11-30 2 NaT 2015-12-04 3 2016-01-06 2015-12-08 4 NaT 2015-12-09 5 2016-01-07 2015-12-11 6 NaT 2015-12-12 7 NaT 2015-12-14 8 2016-01-06 2015-12-14 9 NaT 2015-12-15
I have created a new column successfully with the difference:
df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0) df_test.head() Out[22]: First_Date Second Date Difference 0 2016-02-09 2015-11-19 82 days 1 2016-01-06 2015-11-30 37 days 2 NaT 2015-12-04 NaT 3 2016-01-06 2015-12-08 29 days 4 NaT 2015-12-09 NaT
However I am unable to get a numeric version of the result:
df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric) df_test.head() Out[25]: First_Date Second Date Difference 0 2016-02-09 2015-11-19 7.084800e+15 1 2016-01-06 2015-11-30 3.196800e+15 2 NaT 2015-12-04 NaN 3 2016-01-06 2015-12-08 2.505600e+15 4 NaT 2015-12-09 NaN
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
How about:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days
This will return difference as int if there are no missing values(NaT) and float if there is.
Pandas have a rich documentation on Time series / date functionality and Time deltas
Method 2
You can divide column of dtype timedelta by np.timedelta64(1, 'D'), but output is not int, but float, because NaN values:
df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D') print (df_test) First_Date Second Date Difference 0 2016-02-09 2015-11-19 82.0 1 2016-01-06 2015-11-30 37.0 2 NaT 2015-12-04 NaN 3 2016-01-06 2015-12-08 29.0 4 NaT 2015-12-09 NaN 5 2016-01-07 2015-12-11 27.0 6 NaT 2015-12-12 NaN 7 NaT 2015-12-14 NaN 8 2016-01-06 2015-12-14 23.0 9 NaT 2015-12-15 NaN
Method 3
You can use datetime module to help here. Also, as a side note, a simple date subtraction should work as below:
import datetime as dt import numpy as np import pandas as pd #Assume we have df_test: In [222]: df_test Out[222]: first_date second_date 0 2016-01-31 2015-11-19 1 2016-02-29 2015-11-20 2 2016-03-31 2015-11-21 3 2016-04-30 2015-11-22 4 2016-05-31 2015-11-23 5 2016-06-30 2015-11-24 6 NaT 2015-11-25 7 NaT 2015-11-26 8 2016-01-31 2015-11-27 9 NaT 2015-11-28 10 NaT 2015-11-29 11 NaT 2015-11-30 12 2016-04-30 2015-12-01 13 NaT 2015-12-02 14 NaT 2015-12-03 15 2016-04-30 2015-12-04 16 NaT 2015-12-05 17 NaT 2015-12-06 In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date'] In [224]: df_test Out[224]: first_date second_date Difference 0 2016-01-31 2015-11-19 73 days 1 2016-02-29 2015-11-20 101 days 2 2016-03-31 2015-11-21 131 days 3 2016-04-30 2015-11-22 160 days 4 2016-05-31 2015-11-23 190 days 5 2016-06-30 2015-11-24 219 days 6 NaT 2015-11-25 NaT 7 NaT 2015-11-26 NaT 8 2016-01-31 2015-11-27 65 days 9 NaT 2015-11-28 NaT 10 NaT 2015-11-29 NaT 11 NaT 2015-11-30 NaT 12 2016-04-30 2015-12-01 151 days 13 NaT 2015-12-02 NaT 14 NaT 2015-12-03 NaT 15 2016-04-30 2015-12-04 148 days 16 NaT 2015-12-05 NaT 17 NaT 2015-12-06 NaT
Now, change type to datetime.timedelta, and then use the .days method on valid timedelta objects.
In [226]: df_test['Diffference'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days) In [227]: df_test Out[227]: first_date second_date Difference Diffference 0 2016-01-31 2015-11-19 73 days 73 1 2016-02-29 2015-11-20 101 days 101 2 2016-03-31 2015-11-21 131 days 131 3 2016-04-30 2015-11-22 160 days 160 4 2016-05-31 2015-11-23 190 days 190 5 2016-06-30 2015-11-24 219 days 219 6 NaT 2015-11-25 NaT NaN 7 NaT 2015-11-26 NaT NaN 8 2016-01-31 2015-11-27 65 days 65 9 NaT 2015-11-28 NaT NaN 10 NaT 2015-11-29 NaT NaN 11 NaT 2015-11-30 NaT NaN 12 2016-04-30 2015-12-01 151 days 151 13 NaT 2015-12-02 NaT NaN 14 NaT 2015-12-03 NaT NaN 15 2016-04-30 2015-12-04 148 days 148 16 NaT 2015-12-05 NaT NaN 17 NaT 2015-12-06 NaT NaN
Hope that helps.
Method 4
I feel that the overall answer does not handle if the dates ‘wrap’ around a year. This would be useful in understanding proximity to a date being accurate by day of year. In order to do these row operations, I did the following. (I had this used in a business setting in renewing customer subscriptions).
def get_date_difference(row, x, y):
try:
# Calcuating the smallest date difference between the start and the close date
# There's some tricky logic in here to calculate for determining date difference
# the other way around (Dec -> Jan is 1 month rather than 11)
sub_start_date = int(row[x].strftime('%j')) # day of year (1-366)
close_date = int(row[y].strftime('%j')) # day of year (1-366)
later_date_of_year = max(sub_start_date, close_date)
earlier_date_of_year = min(sub_start_date, close_date)
days_diff = later_date_of_year - earlier_date_of_year
# Calculates the difference going across the next year (December -> Jan)
days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
return min(days_diff, days_diff_reversed)
except ValueError:
return None
Then the function could be:
dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x = 'customer_since_date', y = 'renewal_date', axis = 1)
Method 5
Create a vectorized method
def calc_xb_minus_xa(df):
time_dict = {
'<Minute>': 'm',
'<Hour>': 'h',
'<Day>': 'D',
'<Week>': 'W',
'<Month>': 'M',
'<Year>': 'Y'
}
time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
offset_base_name = str(to_offset(time_delta).base)
time_term = time_dict.get(offset_base_name)
result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
return result
Then in your df do:
df['x'] = calc_xb_minus_xa(df)
This will work for minutes, hours, days, weeks, month and Year.
open_time and end_time need to change according your df
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0