I read a csv file containing 150,000 lines into a pandas dataframe. This dataframe has a field, Date, with the dates in yyyy-mm-dd format. I want to extract the month, day and year from it and copy into the dataframes’ columns, Month, Day and Year respectively. For a few hundred records the below two methods work ok, but for 150,000 records both take a ridiculously long time to execute. Is there a faster way to do this for 100,000+ records?
First method:
df = pandas.read_csv(filename)
for i in xrange(len(df)):
df.loc[i,'Day'] = int(df.loc[i,'Date'].split('-')[2])
Second method:
df = pandas.read_csv(filename) for i in xrange(len(df)): df.loc[i,'Day'] = datetime.strptime(df.loc[i,'Date'], '%Y-%m-%d').day
Thank you.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
In 0.15.0 you will be able to use the new .dt accessor to do this nice syntactically.
In [36]: df = DataFrame(date_range('20000101',periods=150000,freq='H'),columns=['Date'])
In [37]: df.head(5)
Out[37]:
Date
0 2000-01-01 00:00:00
1 2000-01-01 01:00:00
2 2000-01-01 02:00:00
3 2000-01-01 03:00:00
4 2000-01-01 04:00:00
[5 rows x 1 columns]
In [38]: %timeit f(df)
10 loops, best of 3: 22 ms per loop
In [39]: def f(df):
df = df.copy()
df['Year'] = DatetimeIndex(df['Date']).year
df['Month'] = DatetimeIndex(df['Date']).month
df['Day'] = DatetimeIndex(df['Date']).day
return df
....:
In [40]: f(df).head()
Out[40]:
Date Year Month Day
0 2000-01-01 00:00:00 2000 1 1
1 2000-01-01 01:00:00 2000 1 1
2 2000-01-01 02:00:00 2000 1 1
3 2000-01-01 03:00:00 2000 1 1
4 2000-01-01 04:00:00 2000 1 1
[5 rows x 4 columns]
From 0.15.0 on (release in end of Sept 2014), the following is now possible with the new .dt accessor:
df['Year'] = df['Date'].dt.year df['Month'] = df['Date'].dt.month df['Day'] = df['Date'].dt.day
Method 2
I use below code which works very well for me
df['Year']=[d.split('-')[0] for d in df.Date]
df['Month']=[d.split('-')[1] for d in df.Date]
df['Day']=[d.split('-')[2] for d in df.Date]
df.head(5)
Method 3
This is the cleanest answer I’ve found.
df = df.assign(**{t:getattr(df.data.dt,t) for t in nomtimes})
In [30]: df = pd.DataFrame({'data':pd.date_range(start, end)})
In [31]: df.head()
Out[31]:
data
0 2011-01-01
1 2011-01-02
2 2011-01-03
3 2011-01-04
4 2011-01-05
nomtimes = ["year", "hour", "month", "dayofweek"]
df = df.assign(**{t:getattr(df.data.dt,t) for t in nomtimes})
In [33]: df.head()
Out[33]:
data dayofweek hour month year
0 2011-01-01 5 0 1 2011
1 2011-01-02 6 0 1 2011
2 2011-01-03 0 0 1 2011
3 2011-01-04 1 0 1 2011
4 2011-01-05 2 0 1 2011
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0