pandas to_datetime parsing wrong year

I’m coming across something that is almost certainly a stupid mistake on my part, but I can’t seem to figure out what’s going on.

Essentially, I have a series of dates as strings in the format "%d-%b-%y", such as 26-Sep-05. When I go to convert them to datetime, the year is sometimes correct, but sometimes it is not.

E.g.:

dates = ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']

pd.to_datetime(dates, format="%d-%b-%y")
DatetimeIndex(['2005-09-26', '2005-09-26', '1970-06-15', '1994-12-05',
               '2061-01-09', '2055-02-08'],
              dtype='datetime64[ns]', freq=None)

The last two entries, which get returned as 2061 and 2055 for the years, are wrong. But this works fine for the 15-Jun-70 entry. What’s going on here?

Answers:

Method 1

This appears to match the behavior of Python's datetime library. A quick test shows that the cutoff falls between 68 and 69:

>>> import datetime
>>> datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
datetime.date(2068, 12, 31)
>>> datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
datetime.date(1969, 1, 1)

Two digits year ambiguity

So it seems that any %y value below 69 is assigned to the 2000s, while 69 and above are assigned to the 1900s.

A two-digit %y value can only range from 00 to 99, so it becomes ambiguous as soon as the data spans more than one century.

If there is no overlap, you could manually process it and annotate the century (kill the ambiguity)

I suggest you process your data manually and specify the century, e.g. you can decide that anything in your data that has the year between 17 and 68 is attributed to 1917 – 1968 (instead of 2017 – 2068).
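A minimal sketch of that manual step, using only the standard library. The function name parse_with_pivot and the latest_year cutoff are made up for illustration, and the month abbreviations assume an English locale:

```python
from datetime import date, datetime


def parse_with_pivot(text, fmt="%d-%b-%y", latest_year=2016):
    """Parse a two-digit-year date, never returning a year after latest_year.

    latest_year is a hypothetical cutoff; pick one that fits your data.
    """
    parsed = datetime.strptime(text, fmt).date()
    if parsed.year > latest_year:
        # strptime chose 20xx, but the data cannot be that recent: use 19xx.
        # Note: 29-Feb-00 would raise here, since 1900 was not a leap year.
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed


samples = ["26-Sep-05", "15-Jun-70", "5-Dec-94", "9-Jan-61", "8-Feb-55"]
fixed = [parse_with_pivot(s) for s in samples]
```

With latest_year=2016, years 17 to 68 land in 1917 to 1968, while 00 to 16 stay in the 2000s.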

If you have overlap then you can’t process with insufficient year information, unless e.g. you have some ordered data and a reference

If you have overlap e.g. you have data from both 2016 and 1916 and both were logged as ’16’, that’s ambiguous and there isn’t sufficient information to parse this, unless the data is ordered by date in which case you can use heuristics to switch the century as you parse it.
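One possible version of such a heuristic is sketched below. It assumes the entries are in chronological order, that the century of the first entry is known, and that consecutive entries are less than 100 years apart. The name parse_ordered and its parameters are hypothetical:

```python
from datetime import date, datetime


def parse_ordered(texts, fmt="%d-%b-%y", start_century=1900):
    """Parse chronologically ordered dates that only carry a two-digit year."""
    century = start_century
    previous = None
    result = []
    for text in texts:
        parsed = datetime.strptime(text, fmt).date()
        # Ignore the century strptime picked and apply the current one.
        candidate = parsed.replace(year=century + parsed.year % 100)
        if previous is not None and candidate < previous:
            # Going backwards in an ordered log means a century rolled over.
            century += 100
            candidate = parsed.replace(year=century + parsed.year % 100)
        result.append(candidate)
        previous = candidate
    return result


log = ["1-Jan-16", "1-Jun-16", "3-Mar-98", "1-Jan-16"]
ordered = parse_ordered(log)
```

Here the last '16' entry is read as 2016 because it follows a 1998 entry. Unordered data, or gaps longer than a century, would defeat this approach.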

Method 2

From the documentation of Python's time module:

Year 2000 (Y2K) issues: Python depends on the platform’s C library,
which generally doesn’t have year 2000 issues, since all dates and
times are represented internally as seconds since the epoch. Function
strptime() can parse 2-digit years when given %y format code. When
2-digit years are parsed, they are converted according to the POSIX
and ISO C standards: values 69–99 are mapped to 1969–1999, and values
0–68 are mapped to 2000–2068.
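To make the quoted rule concrete, here is a small sketch that writes the mapping out by hand and checks it against datetime.strptime for every two-digit year. The helper name expand_two_digit_year is made up for this example:

```python
from datetime import datetime


def expand_two_digit_year(yy):
    """Apply the 69-99 -> 19xx, 0-68 -> 20xx rule quoted above."""
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return 1900 + yy if yy >= 69 else 2000 + yy


# Compare the hand-written rule with what strptime does for %y.
for yy in range(100):
    parsed = datetime.strptime(f"01-01-{yy:02d}", "%d-%m-%y")
    assert parsed.year == expand_two_digit_year(yy)
```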

Method 3

For anyone looking for a quick and dirty code snippet to fix these cases, this worked for me:

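import pandas as pd

col = 'date'
# Assumes df already exists and df[col] holds the date strings.
df[col] = pd.to_datetime(df[col])
# Anything after the threshold is treated as a wrongly assigned century.
future = df[col] > pd.Timestamp(year=2050, month=1, day=1)
# Shift those rows back by exactly 100 calendar years.
df.loc[future, col] = df.loc[future, col] - pd.DateOffset(years=100)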

You may need to tune the threshold date closer to the present depending on the earliest dates in your data.

Method 4

You can write a simple function to correct the wrongly parsed year, as shown below:

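import datetime

def fix_date(x):
    # Years after 1989 are assumed to be 100 years too late.
    # The cutoff depends on your data, so adjust it to your own date range.
    if x.year > 1989:
        year = x.year - 100
    else:
        year = x.year
    return datetime.date(year, x.month, x.day)

# df['date_column'] must already hold datetimes (e.g. from pd.to_datetime).
df['date_column'] = df['date_column'].apply(fix_date)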

Hope this helps..

Method 5

Another quick solution to the problem:

import numpy as np
import pandas as pd

dates = pd.DataFrame({"raw": ['26-Sep-05', '26-Sep-05', '15-Jun-70', '5-Dec-94', '9-Jan-61', '8-Feb-55']})

# Take the two-digit year and choose the century ourselves:
# 44-99 -> 19xx, 00-43 -> 20xx (the cutoff 44 is specific to this data).
yy = pd.to_numeric(dates["raw"].str[-2:])
century = np.where(yy >= 44, 1900, 2000)
full_year = (yy + century).astype(str)
# Everything before the year (e.g. "26-Sep-") plus the four-digit year.
dates["pddt"] = pd.to_datetime(dates["raw"].str[:-2] + full_year, format='%d-%b-%Y')
dates = dates.drop(columns="raw")

The output is shown below. The column has been converted from object (string) dtype to pandas datetime using pd.to_datetime:

    pddt
0   2005-09-26
1   2005-09-26
2   1970-06-15
3   1994-12-05
4   1961-01-09
5   1955-02-08

As mentioned in some of the other answers, this works best if there is no overlap between the dates of the two centuries.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
