So I am wondering if there’s a more efficient solution in generating a 2-D array using np.random.choice where each row has unique values.
For example, for an array with shape (3,4), we expect an output of:
# Expected output given a shape (3,4)
array([[0, 1, 3, 2],
[2, 3, 1, 0],
[1, 3, 2, 0]])
This means that the values for each row must be unique with respect to the number of columns. So for each row in out, the integers should only fall between 0 to 3.
I know that I can achieve it by passing False to the replace argument. But I can only do it for each row and not for the whole matrix. For instance, I can do this:
>>> np.random.choice(4, size=(1,4), replace=False) array([[0,2,3,1]])
But when I try to do this:
>>> np.random.choice(4, size=(3,4), replace=False)
I get an error like this:
File "<stdin>", line 1, in <module> File "mtrand.pyx", line 1150, in mtrand.RandomState.choice (numpyrandommtrandmtrand.c:18113) ValueError: Cannot take a larger sample than population when 'replace=False'
I assume it’s because it’s trying to draw 3 x 4 = 12 samples due to the size of the matrix without replacement but I’m only giving it a limit of 4.
I know that I can solve it by using a for-loop:
>>> a = (np.random.choice(4,size=4,replace=False) for _ in range(3))
>>> np.vstack(a)
array([[3, 1, 2, 0],
[1, 2, 0, 3],
[2, 0, 3, 1]])
But I wanted to know if there’s a workaround without using any for-loops? (I’m kinda assuming that adding for-loops might make it slower if I have a number of rows greater than 1000. But as you can see I am actually creating a generator in a so I’m also not sure if it has an effect after all.)
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
One trick I have used often is generating a random array and using argsort to get unique indices as the required unique numbers. Thus, we could do –
def random_choice_noreplace(m,n, axis=-1):
# m, n are the number of rows, cols of output
return np.random.rand(m,n).argsort(axis=axis)
Sample runs –
In [98]: random_choice_noreplace(3,7)
Out[98]:
array([[0, 4, 3, 2, 6, 5, 1],
[5, 1, 4, 6, 0, 2, 3],
[6, 1, 0, 4, 5, 3, 2]])
In [99]: random_choice_noreplace(5,7, axis=0) # unique nums along cols
Out[99]:
array([[0, 2, 4, 4, 1, 0, 2],
[1, 4, 3, 2, 4, 1, 3],
[3, 1, 1, 3, 2, 3, 0],
[2, 3, 0, 0, 0, 2, 4],
[4, 0, 2, 1, 3, 4, 1]])
Runtime test –
# Original approach
def loopy_app(m,n):
a = (np.random.choice(n,size=n,replace=False) for _ in range(m))
return np.vstack(a)
Timings –
In [108]: %timeit loopy_app(1000,100) 10 loops, best of 3: 20.6 ms per loop In [109]: %timeit random_choice_noreplace(1000,100) 100 loops, best of 3: 3.66 ms per loop
Method 2
Here is my solution to repeated sampling without replacement, modified based on Divakar’s answer. In his comment section, he suggested slicing the result if no. of samples < length of array. However, this may not be the most efficient method if length of array is large but no. of samples is small, because argsort can take a long time. I suggest using argpartition instead.
def random_choice_noreplace2(l, n_sample, num_draw):
'''
l: 1-D array or list
n_sample: sample size for each draw
num_draw: number of draws
Intuition: Randomly generate numbers, get the index of the smallest n_sample number for each row.
'''
l = np.array(l)
return l[np.argpartition(np.random.rand(num_draw,len(l)), n_sample-1,axis=-1)[:,:n_sample]]
Timings –
def loopy_app(l, n_sample, num_draw):
l = np.array(l)
a = [np.random.choice(l,size=n_sample,replace=False) for _ in range(num_draw)]
return np.vstack(a)
def random_choice_noreplace(l, n_sample, num_draw):
# m, n are the number of rows, cols of output
l = np.array(l)
return l[np.random.rand(num_draw,len(l)).argsort(axis=-1)[:,:n_sample]]
In [2]: %timeit loopy_app(range(100),2,1000)
48.5 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [3]: %timeit random_choice_noreplace(range(100),2,1000)
7.8 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [4]: %timeit random_choice_noreplace2(range(100),2,1000)
2.71 ms ± 57.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Correctness –
In [5]: np.random.seed(42)
...: random_choice_noreplace(range(100),2,10)
Out[5]:
array([[72, 10],
[28, 71],
[ 8, 5],
[32, 71],
[ 7, 56],
[63, 15],
[40, 28],
[94, 64],
[21, 98],
[45, 36]])
In [6]: np.random.seed(42)
...: random_choice_noreplace2(range(100),2,10)
Out[6]:
array([[72, 10],
[28, 71],
[ 8, 5],
[32, 71],
[ 7, 56],
[63, 15],
[40, 28],
[94, 64],
[21, 98],
[45, 36]])
Method 3
In addition to Divakar’s nice answer, here is another alternative that is even faster by roughly 20% on my machine:
def random_choice_noreplace_2(m, n):
data = np.arange(m * n).reshape(n, m) % m
for row in data: np.random.shuffle(row)
return data
Timings:
In [3]: %timeit random_choice_noreplace(1000, 100) 3.85 ms ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) In [4]: %timeit random_choice_noreplace_2(1000, 100) 3.1 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0