Generating random dates within a given range in pandas

This is a self-answered post. A common problem is to randomly generate dates between a given start and end date.

There are two cases to consider:

  1. random dates with a time component, and
  2. random dates without time

For example, given some start date 2015-01-01 and an end date 2018-01-01, how can I sample N random dates between this range using pandas?

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Is converting to the unix timestamp acceptable?

def random_dates(start, end, n=10):

    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

Sample run:

start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
random_dates(start, end)

DatetimeIndex(['2016-10-08 07:34:13', '2015-11-15 06:12:48',
               '2015-01-24 10:11:04', '2015-03-26 16:23:53',
               '2017-04-01 00:38:21', '2015-05-15 03:47:54',
               '2015-06-24 07:32:32', '2015-11-10 20:39:36',
               '2016-07-25 05:48:09', '2015-03-19 16:05:19'],
              dtype='datetime64[ns]', freq=None)

EDIT:

As per the comment by @smci, I wrote a function to accommodate both 1 and 2 with a little explanation inside the function itself.

def random_datetimes_or_dates(start, end, out_format='datetime', n=10): 

    '''   
    unix timestamp is in ns by default. 
    I divide the unix time value by 10**9 to make it seconds (or 24*60*60*10**9 to make it days).
    The corresponding unit variable is passed to the pd.to_datetime function. 
    Values for the (divide_by, unit) pair to select is defined by the out_format parameter.
    for 1 -> out_format='datetime'
    for 2 -> out_format=anything else
    '''
    (divide_by, unit) = (10**9, 's') if out_format=='datetime' else (24*60*60*10**9, 'D')

    start_u = start.value//divide_by
    end_u = end.value//divide_by

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit=unit)

Sample run:

random_datetimes_or_dates(start, end, out_format='datetime')

DatetimeIndex(['2017-01-30 05:14:27', '2016-10-18 21:17:16',
               '2016-10-20 08:38:02', '2015-09-02 00:03:08',
               '2015-06-04 02:38:12', '2016-02-19 05:22:01',


                  '2015-11-06 10:37:10', '2017-12-17 03:26:02',
                   '2017-11-20 06:51:32', '2016-01-02 02:48:03'],
                  dtype='datetime64[ns]', freq=None)

random_datetimes_or_dates(start, end, out_format='not datetime')

DatetimeIndex(['2017-05-10', '2017-12-31', '2017-11-10', '2015-05-02',
               '2016-04-11', '2015-11-27', '2015-03-29', '2017-05-21',
               '2015-05-11', '2017-02-08'],
              dtype='datetime64[ns]', freq=None)

Method 2

np.random.randn + to_timedelta

This addresses Case (1). You can do this by generating a random array of timedelta objects and adding them to your start date.

def random_dates(start, end, n, unit='D', seed=None):
    if not seed:  # from piR's answer
        np.random.seed(0)

    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start
>>> np.random.seed(0)
>>> start = pd.to_datetime('2015-01-01')
>>> end = pd.to_datetime('2018-01-01')
>>> random_dates(start, end, 10)
DatetimeIndex([   '2016-08-25 01:09:42.969600',
                  '2017-02-23 13:30:20.304000',
                  '2016-10-23 05:33:15.033600',
               '2016-08-20 17:41:04.012799999',
               '2016-04-09 17:59:00.815999999',
                  '2016-12-09 13:06:00.748800',
                  '2016-04-25 00:47:45.974400',
                  '2017-09-05 06:35:58.444800',
                  '2017-11-23 03:18:47.347200',
                  '2016-02-25 15:14:53.894400'],
              dtype='datetime64[ns]', freq=None)

This will generate dates with a time component as well.

Sadly, rand does not support a replace=False, so if you want unique dates, you’ll need a two-step process of 1) generate the non-unique days component, and 2) generate the unique seconds/milliseconds component, then add the two together.


np.random.randint + to_timedelta

This addresses Case (2). You can modify random_dates above to generate random integers instead of random floats:

def random_dates2(start, end, n, unit='D', seed=None):
    if not seed:  # from piR's answer
        np.random.seed(0)

    ndays = (end - start).days + 1
    return start + pd.to_timedelta(
        np.random.randint(0, ndays, n), unit=unit
    )
>>> random_dates2(start, end, 10)
DatetimeIndex(['2016-11-15', '2016-07-13', '2017-04-15', '2017-02-02',
               '2017-10-30', '2015-10-05', '2016-08-22', '2017-12-30',
               '2016-08-23', '2015-11-11'],
              dtype='datetime64[ns]', freq=None)

To generate dates with other frequencies, the functions above can be called with a different value for unit. Additionally, you can add a parameter freq and tweak your function call as needed.

If you want unique random dates, you can use np.random.choice with replace=False:

def random_dates2_unique(start, end, n, unit='D', seed=None):
    if not seed:  # from piR's answer
        np.random.seed(0)

    ndays = (end - start).days + 1
    return start + pd.to_timedelta(
        np.random.choice(ndays, n, replace=False), unit=unit
    )

Performance

Going to benchmark just the methods that address Case (1), since Case (2) is really a special case which any method can get to using dt.floor.

Generating random dates within a given range in pandas
Functions

def cs(start, end, n):
    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit='D') + start

def akilat90(start, end, n):
    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

def piR(start, end, n):
    dr = pd.date_range(start, end, freq='H') # can't get better than this :-(
    return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))

def piR2(start, end, n):
    dr = pd.date_range(start, end, freq='H')
    a = np.arange(len(dr))
    b = np.sort(np.random.permutation(a)[:n])
    return dr[b]

Benchmarking Code

from timeit import timeit

import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['cs', 'akilat90', 'piR', 'piR2'],
       columns=[10, 20, 50, 100, 200, 500, 1000, 2000, 5000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        np.random.seed(0)

        start = pd.to_datetime('2015-01-01')
        end = pd.to_datetime('2018-01-01')

        stmt = '{}(start, end, c)'.format(f)
        setp = 'from __main__ import start, end, c, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=30)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

Method 3

We can speed up @akilat90’s approach about twofold (in @coldspeed’s benchmark) by using the fact that datetime64 is just a rebranded int64 hence we can view-cast:

def pp(start, end, n):
    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.DatetimeIndex((10**9*np.random.randint(start_u, end_u, n, dtype=np.int64)).view('M8[ns]'))

enter image description here

Method 4

numpy.random.choice

You can leverage Numpy’s random choice. choice may be problematic over large data_ranges. For example, too large will result in a MemoryError. It requires storing the entire thing in order to select random bits.

random_dates('2015-01-01', '2018-01-01', 10, 'ns', seed=[3, 1415])

MemoryError

Also, this requires a sort.

def random_dates(start, end, n, freq, seed=None):
    if seed is not None:
        np.random.seed(seed)

    dr = pd.date_range(start, end, freq=freq)
    return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))

random_dates('2015-01-01', '2018-01-01', 10, 'H', seed=[3, 1415])

DatetimeIndex(['2015-04-24 02:00:00', '2015-11-26 23:00:00',
               '2016-01-18 00:00:00', '2016-06-27 22:00:00',
               '2016-08-12 17:00:00', '2016-10-21 11:00:00',
               '2016-11-07 11:00:00', '2016-12-09 23:00:00',
               '2017-02-20 01:00:00', '2017-06-17 18:00:00'],
              dtype='datetime64[ns]', freq=None)

numpy.random.permutation

Similar to other answer. However, I like this answer as it slices the datetimeindex produced by date_range and automatically returns another datetimeindex.

def random_dates_2(start, end, n, freq, seed=None):
    if seed is not None:
        np.random.seed(seed)

    dr = pd.date_range(start, end, freq=freq)
    a = np.arange(len(dr))
    b = np.sort(np.random.permutation(a)[:n])
    return dr[b]

Method 5

Just my two cents, using date_range and sample:

def random_dates(start, end, n, seed=1, replace=False):
    dates = pd.date_range(start, end).to_series()
    return dates.sample(n, replace=replace, random_state=seed)

random_dates("20170101","20171223", 10, seed=1)
Out[29]: 
2017-10-01   2017-10-01
2017-08-23   2017-08-23
2017-11-30   2017-11-30
2017-06-15   2017-06-15
2017-11-18   2017-11-18
2017-10-31   2017-10-31
2017-07-31   2017-07-31
2017-03-07   2017-03-07
2017-09-09   2017-09-09
2017-10-15   2017-10-15
dtype: datetime64[ns]

Method 6

I found A new base library generated the range of the date, seems like on my side a bit faster than pandas.data_range , credit from this answer

from dateutil.rrule import rrule, DAILY
import datetime, random
def pick(start,end,n):
    return (random.sample(list(rrule(DAILY, dtstart=start,until=end)),n))


pick(datetime.datetime(2010, 2, 1, 0, 0),datetime.datetime(2010, 2, 5, 0, 0),2)
[datetime.datetime(2010, 2, 3, 0, 0), datetime.datetime(2010, 2, 2, 0, 0)]

Method 7

Thats some alternative way 😀 Maybe someone will need it.

from datetime import datetime
import random
import numpy as np
import pandas as pd

N = 10 #N-samples
dates = np.zeros([N,3])

for i in range(0,N):
    year = random.randint(1970, 2010) 
    month = random.randint(1, 12)
    day = random.randint(1, 28)
    #if you need to change it use variables :3
    birth_date = datetime(year, month, day)
    dates[i] = [year,month,day]

df = pd.DataFrame(dates.astype(int))
df.columns = ['year', 'month', 'day']
pd.to_datetime(df)

Result:

0   1999-08-22
1   1989-04-27
2   1978-10-01
3   1998-12-09
4   1979-04-19
5   1988-03-22
6   1992-03-02
7   1993-04-28
8   1978-10-04
9   1972-01-13
dtype: datetime64[ns]

Method 8

I think this is an easier solution for just creating a date field in a pandas DateFrame

list1 = []
for x in range(0,365):
    list1.append(x)
date = pd.DataFrame(pd.to_datetime(list1, unit='D',origin=pd.Timestamp('2018-01-01')))


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x