What's the fastest way to generate a 1 GB text file containing random digits?

I tried a bash script, but it took too long to create even a simple 1 MB file. I think the answer lies in using /dev/random or /dev/urandom, but other posts here only show how to write all kinds of data to a file with them, whereas I want only digits.

So, is there a command that I can use to create a random file of size 1 GB containing only numbers between 0 and 9?

Edit:
I want the output to be something like this

0 1 4 7 ..... 9
8 7 5 8 ..... 8
....
....
8 7 5 3 ..... 3

The range is 0 – 9, meaning only the numbers 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. I also need them to be space separated, 100 per line, for n lines. I don’t care what n is; I want the final size to be 1 GB.

Edit:
I am using Ubuntu 16.04 LTS

Answers:

Method 1

This:

 LC_ALL=C tr '\0-\377' \
             '[0*25][1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][x*]' \
    < /dev/urandom |
    tr -d x |
    fold -w 1 |
    paste -sd "$(printf '%99s\\n')" - |
    head -c1G

(assuming a head implementation that supports -c) appears to be reasonably fast on my system.

tr translates the whole byte range (0 to 255, 0 to 0377 in octal): the first 25 byte values to 0, the next 25 to 1, and so on up to 25 values for 9, and the rest (250 to 255) to “x”, which we then discard (with tr -d x). We do that because we want a uniform distribution (assuming /dev/urandom has a uniform distribution itself) and so must not bias any digit.
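
For readers who find the tr table hard to picture, here is a minimal C sketch of the same byte-to-digit mapping. The function names byte_to_digit and bytes_to_digits are made up for this illustration and are not part of the answer's pipeline.

```c
#include <stddef.h>

/* Map one random byte to a digit 0..9, or return -1 to reject it.
 * Bytes 0..249 fall into ten groups of 25, so every digit is equally
 * likely; bytes 250..255 are rejected (the "x" that tr -d x removes). */
int byte_to_digit(unsigned char b)
{
    if (b >= 250)
        return -1;
    return b / 25;
}

/* Convert a buffer of random bytes into ASCII digits.
 * out must have room for n bytes. Returns the number of digits written. */
size_t bytes_to_digits(const unsigned char *in, size_t n, char *out)
{
    size_t i, used = 0;

    for (i = 0; i < n; i++) {
        int d = byte_to_digit(in[i]);
        if (d >= 0)
            out[used++] = (char)('0' + d);
    }
    return used;
}
```

Since 250 of the 256 byte values are accepted, a little under 98% of the input bytes end up as digits.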

That produces one digit for about 97% of the bytes of /dev/urandom. fold -w 1 makes it one digit per line. paste -s is called with a list of separators that consists of 99 space characters and one newline character, so as to have 100 space-separated digits on each line.
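
The cycling separator list can be sketched in C as follows. This is only an illustration of the layout, assuming a hypothetical helper named format_digits; unlike paste, it also writes a separator after a partial final line.

```c
#include <stddef.h>

/* Lay out ASCII digits with one separator after each digit: a space,
 * except after every per_line-th digit, where it is a newline. With
 * per_line = 100 this is the cycle of 99 spaces and one newline that
 * the paste -sd separator list describes.
 * out must have room for 2 * n bytes. Returns the bytes written. */
size_t format_digits(const char *digits, size_t n, size_t per_line, char *out)
{
    size_t i, used = 0;

    for (i = 0; i < n; i++) {
        out[used++] = digits[i];
        out[used++] = ((i + 1) % per_line == 0) ? '\n' : ' ';
    }
    return used;
}
```

Each digit therefore costs exactly two output bytes, which is why a line of 100 digits is 200 bytes long.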

head -c1G will get the first GiB (2³⁰ bytes) of that. Note that the last line will be truncated and undelimited. You could truncate to 2³⁰-1 and add the missing newline by hand, or truncate to 10⁹ bytes instead, which is 5 million of those 200-byte lines (head -n 5000000 would also make it a standard/portable command).
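
The size arithmetic is easy to get wrong by a factor of ten, so here is a small C sketch of it. The names BYTES_PER_LINE, full_lines and leftover_bytes are invented for this example.

```c
#include <stdint.h>

#define BYTES_PER_LINE 200u   /* 100 digits + 99 spaces + 1 newline */

/* Number of complete 200-byte lines that fit in target_bytes. */
uint64_t full_lines(uint64_t target_bytes)
{
    return target_bytes / BYTES_PER_LINE;
}

/* Bytes left over after the last complete line, i.e. the length of
 * the truncated line that a byte-count cut-off leaves behind. */
uint64_t leftover_bytes(uint64_t target_bytes)
{
    return target_bytes % BYTES_PER_LINE;
}
```

For 2³⁰ bytes this gives 5,368,709 complete lines plus a 24-byte fragment, while 10⁹ bytes is exactly 5,000,000 lines with nothing left over.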

These timings (obtained by zsh on a quad-core system), give an indication of where the CPU time is spent:

LC_ALL=C tr '\0-\377'  < /dev/urandom  0.61s user 31.28s system 99% cpu 31.904 total
tr -d x  1.00s user 0.27s system 3% cpu 31.903 total
fold -w 1  14.93s user 0.48s system 48% cpu 31.902 total
paste -sd "$(printf '%99s\\n')" -  7.23s user 0.08s system 22% cpu 31.899 total
head -c1G > /dev/null  0.49s user 1.21s system 5% cpu 31.898 total

The first tr is the bottleneck, with most of its time spent in the kernel (I suppose for the random number generation). The timing is roughly in line with the rate at which I can get bytes from /dev/urandom (about 19MiB/s; here we produce 2 bytes for each 0.97 byte of /dev/urandom, at a rate of 32MiB/s). fold seems to spend an unreasonable amount of CPU time (15s) just to insert a newline character after every byte, but that doesn’t affect the overall time as it runs on a different CPU in my case (adding the -b option makes it very slightly more efficient; dd cbs=1 conv=unblock seems like a better alternative).

You can do away with the head -c1G and shave off a few seconds by setting a limit on the file size (limit filesize 1024m with zsh or ulimit -f "$((1024*1024))" with most other shells (including zsh)) instead in a subshell.

That could be improved if we extracted 2 digits for each byte, but we would need a different approach for that. The above is very efficient because tr just looks up each byte in a 256-byte array. It can’t do that for 2 bytes at a time, and using things like hexdump -e '1/1 "%02u"', which computes the text representation of a byte using more complex algorithms, would be more expensive than the random number generation itself. Still, if like in my case you have CPU cores with time to spare, it may still manage to shave off a few seconds:

With:

< /dev/urandom LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -n250000000 -ve '500/1 "%02u" "\n"' |
  fold -w1 |
  paste -sd "$(printf '%99s\\n')" - > /dev/null

I get (note however that here it’s 1,000,000,000 bytes as opposed to 1,073,741,824):

LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' < /dev/urandom  0.32s user 18.83s system 70% cpu 27.001 total
tr -d x  2.17s user 0.09s system 8% cpu 27.000 total
hexdump -n250000000 -ve '500/1 "%02u" "\n"'  26.79s user 0.17s system 99% cpu 27.000 total
fold -w1  14.42s user 0.67s system 55% cpu 27.000 total
paste -sd "$(printf '%99s\\n')" - > /dev/null  8.00s user 0.23s system 30% cpu 26.998 total

More CPU time overall, but better distributed between my 4 CPU cores, so it ends up taking less wall-clock time. The bottleneck is now hexdump.
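
As a rough illustration of what the tr mapping plus hexdump's "%02u" compute together, the two-digits-per-byte idea can be sketched in C like this. The function name byte_to_two_digits is an assumption made for the example.

```c
#include <stddef.h>

/* Turn one random byte into two ASCII digits, or reject it.
 * Bytes 0..99 and 100..199 both map onto 0..99, so every two-digit
 * value has exactly two source bytes; 200..255 are rejected.
 * Returns 2 when out was filled, 0 when the byte was rejected. */
size_t byte_to_two_digits(unsigned char b, char out[2])
{
    unsigned v;

    if (b >= 200)
        return 0;
    v = b % 100u;
    out[0] = (char)('0' + v / 10);   /* tens digit, as "%02u" would print */
    out[1] = (char)('0' + v % 10);   /* units digit */
    return 2;
}
```

With 200 of the 256 byte values accepted and two digits per accepted byte, this yields about 1.56 digits per random byte instead of just under 1.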

If we use dd instead of the line-based fold, we can actually reduce the amount of work hexdump needs to do and improve the balance of work between CPUs:

< /dev/urandom LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -ve '"%02u"' |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

(here assuming GNU dd for its iflag=fullblock and status=none) which gives:

LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' < /dev/urandom  0.32s user 15.58s system 99% cpu 15.915 total
tr -d x  1.62s user 0.16s system 11% cpu 15.914 total
hexdump -ve '"%02u"'  10.90s user 0.32s system 70% cpu 15.911 total
dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock  5.44s user 0.19s system 35% cpu 15.909 total
paste -sd "$(printf '%99s\\n')" - > /dev/null  5.50s user 0.30s system 36% cpu 15.905 total

Back to the random-number generation being the bottleneck.

Now, as pointed out by @OleTange, if you have the openssl utility, you could use it to get a faster (especially on processors that have AES instructions) pseudo-random generator of bytes.

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom

on my system spews 15 times as many bytes per second than /dev/urandom. (I can’t comment on how it compares in terms of cryptographically secure source of randomness if that applies to your use case).

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom 2> /dev/null |
  LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  hexdump -ve '"%02u"' |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

Now gives:

openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom < /dev/zero 2>   1.13s user 0.16s system 12% cpu 10.174 total
LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]'  0.56s user 0.20s system 7% cpu 10.173 total
tr -d x  2.50s user 0.10s system 25% cpu 10.172 total
hexdump -ve '"%02u"'  9.96s user 0.19s system 99% cpu 10.172 total
dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock  4.38s user 0.20s system 45% cpu 10.171 total
paste -sd "$(printf '%99s\\n')" - > /dev/null

back to hexdump being the bottleneck.

As I still have CPUs to spare, I can run 3 of those hexdump in parallel.

</dev/zero openssl enc -aes-128-ctr -nosalt -pass file:/dev/urandom 2> /dev/null |
  LC_ALL=C tr '\0-\377' '\0-\143\0-\143[x*]' |
  tr -d x |
  (hexdump -ve '"%02u"' <&3 & hexdump -ve '"%02u"' <&3 & hexdump -ve '"%02u"') 3<&0 |
  dd bs=50000 count=10000 iflag=fullblock status=none cbs=1 conv=unblock |
  paste -sd "$(printf '%99s\\n')" -

(the <&3 is needed for shells other than zsh, which redirect the stdin of commands run in the background to /dev/null).

Now down to 6.2 seconds and my CPUs almost fully utilised.

Method 2

This is partially a tongue-in-cheek answer, because of the title of the question.

When you look for “the fastest way to …”, the answer is almost always some specialized tool. This “answer” shows one such tool, just so you can experiment.

This is not a serious answer, because you should not look into specialized tools for jobs you only do once, or very rarely. You see, you’ll end up spending more time looking for tools and learning about them, than actually doing stuff. Shells and utilities like bash and awk are not the fastest, but you can usually write a one-liner to achieve the job, spending only seconds. Better scripting languages like perl can also be used, although the learning curve for perl is steep, and I hesitate to recommend it for such purposes, because I’ve been traumatized by awful perl projects. python on the other hand is slightly handicapped by its rather slow I/O; it is only an issue when you filter or generate gigabytes of data, though.

In any case, the following C89 example program (which uses POSIX.1 for a higher-accuracy clock only if available) should achieve about 100 MB/s generation rate (tested in Linux on a laptop with an Intel i5-4200U processor, piping the output to /dev/null), using a pretty good pseudo-random number generator. (The output should pass all the BigCrush tests, except the MatrixRank test, as the code uses xorshift64* and the exclusion method to avoid biasing the digits.)

decimal-digits.c:

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

/* This program is licensed under the CC0 license,
       https://creativecommons.org/publicdomain/zero/1.0/
   In other words, this is dedicated to the public domain.
   There are no warranties either, so if something breaks,
   you only have yourself to blame.
*/

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    for (line = 0UL; line < lines; line++) {
        fputc('0' + digit(), stdout);
        for (col = 1UL; col < cols; col++) {
            fputc(' ', stdout);
            fputc('0' + digit(), stdout);
        }
        fputc('\n', stdout);

        /* Check for write errors. */
        if (ferror(stdout))
            return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

We can make it a lot faster, if we switch to a line buffer, and fwrite() it once instead of outputting each digit at a time. Note that we still keep the stream fully buffered, to avoid partial (non-power-of-two) writes if the output is a block device.

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    char         *oneline;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    /* Allocate memory for a full line. */
    oneline = malloc((size_t)(2 * cols + 1));
    if (!oneline) {
        fprintf(stderr, "Not enough memory for %lu column buffer.\n", cols);
        return EXIT_FAILURE;
    }

    /* Set spaces and terminating newline. */
    for (col = 0; col < cols; col++)
        oneline[2*col + 1] = ' ';
    oneline[2*cols-1] = '\n';

    /* Not needed, but in case a code modification treats it as a string. */
    oneline[2*cols] = '\0';

    for (line = 0UL; line < lines; line++) {
        for (col = 0UL; col < cols; col++)
            oneline[2*col] = '0' + digit();  /* store the ASCII digit, not its raw value */

        if (fwrite(oneline, 2*cols, 1, stdout) != 1)
            return EXIT_FAILURE; 
    }

    /* Check for write errors. */
    if (ferror(stdout))
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}

Note: both examples edited on 2016-11-18 to ensure uniform distribution of digits (zero is excluded; see e.g. here for comparison and details on various pseudo-random number generators).

Compile using for example

gcc -Wall -O2 decimal-digits.c -o decimal-digits

and optionally install system-wide to /usr/bin using

sudo install -o root -g root -m 0755 decimal-digits /usr/bin

It takes the number of digits per line, and the number of lines. Because 1000000000 / 100 / 2 = 5000000 (five million; total bytes divided by columns divided by 2), you can use

./decimal-digits 100 5000000 > digits.txt

to generate the gigabyte-sized digits.txt as desired by the OP.

Note that the program itself is written more with readability than efficiency in mind. My intent here is not to showcase the efficiency of the code — I’d use POSIX.1 and low-level I/O anyway, rather than generic C interfaces — but to let you easily see what kind of balance there is with effort spent in developing dedicated tools versus their performance, compared to one-liners or short shell or awk scriptlets.

Using the GNU C library, calling the fputc() function for every character output incurs a very small overhead (of an indirect function call, or conditionals — the FILE interface is actually pretty complex and versatile, you see). On this particular Intel Core i5-4200U laptop, redirecting the output to /dev/null, the first (fputc) version takes about 11 seconds, whereas the line-at-a-time version takes just 1.3 seconds.

I happen to often write such programs and generators only because I like to play with huge datasets. I’m weird that way. For example, I once wrote a program to print all finite positive IEEE-754 floating-point values into a text file, with sufficient precision to yield the exact same value when parsed. The file was a few gigabytes in size (perhaps 4G or so); there are not that many finite positive floats as one might think. I used this to compare implementations that read and parse such data.

For normal use cases, like the OP is having, shell scripts and scriptlets and one-liners are the better approach. Less time spent to accomplish the overall task. (Except if they need a different file every day or so, or there are many people who need a different file, in which — rare — case, a dedicated tool like above, might warrant the effort spent.)

Method 3

If you have shuf available (recent GNU coreutils does) you can do this:

time shuf -r -n $((512*1024*1024)) -i 0-9 | paste -sd "$(printf '%99s\\n')" -

On my VM, this is now a bit slower than Stéphane’s answer by about a 3:4 factor.

Method 4

If you don’t need very high quality randomness, and close-to-uniform distribution is good enough, you can go really fast, especially on a modern CPU with efficient SIMD integer vectors like x86 with SSE2 or AVX2.

This is like @NominalAnimal’s answer since we both had the same idea, but manually vectorized for x86. (And with worse quality random numbers, but still probably good enough for a lot of use-cases.) This runs about 15 or 30 times faster than @Nominal’s code, at ~13GB/s of ASCII output on a 2.5GHz Intel Haswell CPU with AVX2. That’s still less than theoretical max main memory bandwidth (dual channel DDR3-1600 is about 25.6GB/s), but I was timing writing to /dev/null so it’s actually just rewriting a buffer that stays hot in cache. Skylake should run this same code significantly faster than Haswell (see the bottom of this answer).

Assuming you actually bottleneck on I/O to disk or piping this somewhere, a fast implementation means your CPU doesn’t even have to clock higher than idle. It uses much less total energy to produce the result. (Battery life / heat / global warming.)

This is so fast that you probably don’t want to write it to disk. Just re-generate as needed (from the same seed if you want the same data again). Even if you want to feed it to a multi-threaded process that can use all CPUs, running this to pipe the data to it will leave it hot in L3 cache (and L2 cache on the core that wrote it), and use very little CPU time. (But note that piping adds a lot of overhead vs. writing to /dev/null. On a Skylake i7-6700k, piping to wc -c or another program that just reads + discards its input, it’s about 8x slower than writing to /dev/null, and only uses 70% of a CPU. But that’s still 4.0GB/s on a 3.9GHz CPU.)

Re-generating it is faster than re-reading it even from a fast PCIe-connected SSD, but IDK if it’s more power efficient (the vector-integer multiplier is kept pretty busy, and it’s probably pretty power-hungry, along with other AVX2 256b vector ALUs). OTOH, I don’t know how much CPU time reading from disk would take away from something that was maxing out all cores processing this input. I’d guess that a context-switch to re-generate in 128k chunks might be competitive with running filesystem / pagecache code and allocating pages to read data from disk. Of course, if it’s already hot in the pagecache, it’s just basically memcpy. OTOH, we already write about as fast as memcpy! (which has to split main memory bandwidth between reading and writing). (Also note that writing to memory that’s not already hot in cache usually triggers a read-for-ownership to maintain cache coherency, which can be avoided with non-temporal stores, or with x86’s rep movsb (optimized memcpy and memset in microcode, which avoids RFO, since Andy Glew’s implementation of it in P6 (Pentium Pro))).

So far this is only a proof of concept, and the newline handling is only approximately correct. It’s wrong around the ends of a power-of-2 buffer. With more development time, I’m confident I could find a more efficient way to insert newlines that’s also exactly correct, with overhead at least as low as this (compared to outputting only spaces). I think this is something like 10 to 20%. I’m only interested in knowing how fast we could make this run, not in actually having a polished version of it, so I’ll leave that part as an exercise for the reader, with comments describing some ideas.

On a Haswell i5 at its 2.5GHz max turbo, with DDR3-1600MHz RAM, timed producing 100GiB but scaled down. (Timed on cygwin64 on Win10 with gcc5.4 -O3 -march=native, omitted -funroll-loops since I was having a hard enough time getting decent timing runs on this borrowed laptop. Should have just booted Linux on a USB).

writing to /dev/null unless otherwise specified.

James Hollis’s: (not tested)
Nominal’s fwrite version: ~2.21s
this (SSE2): ~0.142s (unscaled times = real=14.232s, user=13.999s, sys=0.187s).
this (AVX-128): ~0.140s
this (AVX2): ~0.073s (unscaled: real=0m7.291s, user=0m7.125s, sys=0m0.155s).
this (AVX2) cygwin piping to wc -c, with 128kiB buffer size: 0.32s with CPU at 2.38GHz (max dual-core turbo). (unscaled times: real=32.466s user=11.468s sys=41.092s, including both this and wc). Only half the data was actually copied, though, because my silly program assumes that write does the full buffer, even though that’s not the case and cygwin write() only does 64k per call into a pipe.

So with SSE2 this is about 15 times faster than @Nominal Animal’s scalar code. With AVX2, it’s about 30 times faster. I didn’t try a version of Nominal’s code which just uses write() instead of fwrite(), but presumably for large buffers stdio mostly stays out of the way. If it is copying the data, that would account for a lot of slowdown.

Times to produce 1GB of data on a Core2Duo E6600 (Merom 2.4GHz, 32kiB private L1, 4MiB shared L2 caches), DDR2-533MHz in 64-bit Linux 4.2 (Ubuntu 15.10). Still using a 128kiB buffer size for write(), haven’t explored that dimension.

writing to /dev/null unless otherwise specified.

(SSE2) this with newline handling and 4 vectors of digits from each vector of random bytes: 0.183s (timed doing 100GiB in 18.3s, but similar results for 1GiB runs). 1.85 instructions per cycle.
(SSE2) this, piping to wc -c: 0.593s (unscaled: real=59.266s user=20.148s sys=1m6.548s, including wc’s CPU time). Same number of write() system calls as with cygwin, but actually piping all the data because Linux handles all 128k of a write() to a pipe.
NominalAnimal’s fwrite() version (gcc5.2 -O3 -march=native), run with ./decdig 100 $((1024*1024*1024/200)) > /dev/null: 3.19s +/- 0.1%, with 1.40 instruction per cycle. -funroll-loops made maybe a tiny difference. clang-3.8 -O3 -march=native: 3.42s +/- 0.1%
Nominal-fwrite piping to wc -c: real=3.980s user=3.176s sys=2.080s
James Hollis’s line-at-a-time version (clang++-3.8 -O3 -march=native): 22.885s +/- 0.07%, with 0.84 instructions per cycle. (g++5.2 was slightly slower: 22.98s). Writing only one line at a time probably hurt significantly.
Stéphane Chazelas’s tr < /dev/urandom | ...: real=41.430s user=26.832s sys=40.120s. tr was getting all of a CPU core to itself most of the time, spending nearly all its time in the kernel driver generating random bytes and copying them to a pipe. The other core on this dual core machine was running the rest of the pipeline.
time LC_ALL=C head -c512M </dev/urandom >/dev/null: i.e. just reading that much randomness with no piping: real=35.018s user=0.036s sys=34.940s.
Lưu Vĩnh Phúc’s perl program (perl v5.20.2 from Ubuntu15.10):
LANG=en_CA.UTF-8: real=4m32.634s user=4m3.288s sys=0m29.364s.
LC_ALL=C LANG=C: real=4m18.637s user=3m50.324s sys=0m29.356s. Still very slow.

(SSE2) this with no newline handling, and either 3 or 4 vectors of digits from each vector of random bytes (almost exactly the same speed: the dig3 = v%10 step is about break-even on this HW): 0.166s (1.82 instructions per cycle). This is basically the lower limit for what we can come close to with perfectly efficient newline handling.

(SSE2) Old version of this with no newline handling, but only getting one digit per uint16_t element using v%10, 0.222 seconds +/- 0.4%, 2.12 instructions per cycle. (Compiled with gcc5.2, -march=native -O3 -funroll-loops. Unroll loops does happen to help for this code on this hardware. Don’t use it blindly, especially for large programs).
(SSE2) Old version of this, writing to a file (on a RAID10f2 of 3 fast magnetic hard drives, not very optimized for writes): ~4 seconds. Could go faster by tweaking kernel I/O buffer settings to allow a lot more dirty data before write() blocks. “System” time is still ~1.0 seconds, much higher than “user” time. On this old system with slow DDR2-533 RAM, it takes ~4x longer for the kernel to memcpy the data into the pagecache and run XFS functions than it does for my loop to keep rewriting it in-place in a buffer that stays hot in cache.

How it’s done

A fast PRNG is obviously essential. xorshift128+ can be vectorized, so you have two or four 64-bit generators in parallel, in elements of a SIMD vector. Each step produces a full vector of random bytes. (256b AVX2 implementation here with Intel intrinsics). I picked it over Nominal’s choice of xorshift*, because 64-bit vector integer multiplication is only possible in SSE2/AVX2 with extended-precision techniques.
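
To make the "generators in parallel" idea concrete, here is a scalar C sketch that keeps four xorshift128+ states in a lane-wise layout. It is not the answer's intrinsics code: the struct and function names are invented, and the shift constants 23, 18 and 5 are those of a widely published xorshift128+ variant, which I am only assuming matches what the answer uses.

```c
#include <stdint.h>

#define LANES 4   /* four 64-bit lanes would fill one 256-bit vector */

/* State of LANES independent xorshift128+ generators, stored lane-wise
 * so that every lane goes through the same sequence of operations.
 * No lane may be seeded with both words zero. */
struct xs128p_lanes {
    uint64_t s0[LANES];
    uint64_t s1[LANES];
};

/* Advance every lane by one step and store one 64-bit result per lane.
 * Only shifts, XORs and an add are needed, no 64-bit multiply. */
void xs128p_step(struct xs128p_lanes *st, uint64_t out[LANES])
{
    int i;

    for (i = 0; i < LANES; i++) {
        uint64_t t = st->s0[i];
        const uint64_t s = st->s1[i];

        st->s0[i] = s;
        t ^= t << 23;
        t ^= t >> 18;
        t ^= s ^ (s >> 5);
        st->s1[i] = t;
        out[i] = t + s;
    }
}
```

Because the loop body has no dependency between lanes, each statement corresponds to one vector operation in a hand-vectorized version.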

Given a vector of random bytes, we can chop up each 16-bit element into multiple decimal digits. We produce multiple vectors of 16-bit elements that are each one ASCII digit + ASCII space. We store that directly into our output buffer.

My original version just used x / 6554 to get one random digit from every uint16_t element of a vector. It’s always between 0 and 9, inclusive. It’s biased away from 9, because (2^16 -1 ) / 6554 is only 9.99923. (6554 = ceil((2^16-1)/10), which ensures that the quotient is always < 10.)

x/6554 can be computed with one multiply by a “magic” constant (the fixed-point reciprocal) and a right shift of the high-half result. This is the best case for division by a constant; some divisors take more operations, and signed division takes extra work. x % 10 has a similar bias and isn’t as cheap to compute. (gcc’s asm output is equivalent to x - 10*(x/10), i.e. an extra multiply and subtract on top of the division using a modular multiplicative inverse.) Also, the lowest bit of xorshift128+ is not as high quality, so dividing to take entropy from high bits is better (for quality as well as speed) than modulo to take entropy from low bits.
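The size of both biases can be checked by brute force over all 65536 possible uint16_t values. This is only a verification sketch (the function name is hypothetical), not part of the generator:

```c
#include <assert.h>
#include <stdint.h>

// Tally which digit each possible uint16_t value maps to under the two schemes.
static void digit_histograms(unsigned by_div[10], unsigned by_mod[10])
{
    for (int d = 0; d < 10; d++)
        by_div[d] = by_mod[d] = 0;
    for (uint32_t x = 0; x <= UINT16_MAX; x++) {
        unsigned q = x / 6554;   // takes entropy from the high bits
        assert(q < 10);          // 6554 * 10 > 65535, so the quotient never reaches 10
        by_div[q]++;
        by_mod[x % 10]++;        // takes entropy from the low bits
    }
}
```

With x / 6554, digits 0 to 8 each get 6554 inputs and digit 9 gets 6550. With x % 10, digits 0 to 5 each get 6554 inputs and digits 6 to 9 get 6553. Either way the deviation from uniform is well under 0.1%.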

However, we can use more of the entropy in each uint16_t by looking at the low decimal digits, like @Nominal’s digit() function. For maximum performance, I decided to take the low 3 decimal digits and x/6554, to save one PMULLW and PSUBW (and probably some MOVDQA) vs. the higher quality option of taking the 4 low decimal digits. x/6554 is slightly affected by the low 3 decimal digits, so there is some correlation between digits from the same element (8 or 16 digits separation in the ASCII output, depending on vector width).
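A scalar sketch of that chopping step for a single uint16_t, assuming the same choice of outputs as the vector code (x/6554 plus the three low decimal digits). The helper names are hypothetical, and 0xCCCD with a shift of 19 is a commonly used fixed-point reciprocal for 16-bit division by 10, not necessarily the exact sequence a given compiler emits. In the real vector layout the four digits from one element land in four different output vectors rather than next to each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Divide by 10 with a multiply and a shift. 0xCCCD is ceil(2^19 / 10);
// the result is exact for every 16-bit input.
static inline uint16_t div10_u16(uint16_t x)
{
    return (uint16_t)(((uint32_t)x * 0xCCCDu) >> 19);
}

// Write four "digit, space" pairs (8 bytes) derived from one 16-bit random
// value: x/6554 first, then the three low decimal digits.
static char *store_digits_u16(uint16_t x, char *out)
{
    assert(out != NULL);
    uint16_t q1 = div10_u16(x);
    uint16_t q2 = div10_u16(q1);
    uint16_t q3 = div10_u16(q2);
    const uint16_t digits[4] = {
        (uint16_t)(x / 6554),       // high-bits digit
        (uint16_t)(x  - 10 * q1),   // x % 10
        (uint16_t)(q1 - 10 * q2),   // (x / 10) % 10
        (uint16_t)(q2 - 10 * q3),   // (x / 100) % 10
    };
    for (int i = 0; i < 4; i++) {
        *out++ = (char)('0' + digits[i]);   // same as '0' | d for d in 0..9
        *out++ = ' ';
    }
    return out;
}
```

For example, x = 12345 produces the eight bytes "1 5 4 3 ": 12345 / 6554 is 1, and the low decimal digits are 5, 4 and 3.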

I think gcc is dividing by 100 and by 1000, rather than a longer chain that successively divides by 10, so it’s probably not significantly shortening the length of the non-loop-carried dependency chain that produces 4 results from each PRNG output. port0 (vector multiply and shift) is the bottleneck because of the modular multiplicative inverses, and the shifts in xorshift+, so it’s definitely useful to save a vector-multiply.

xorshift+ is so fast that even using only ~3.3 bits of randomness from every 16 (i.e. 20% efficiency) is not a lot slower than chopping it up into multiple decimal digits. We only approximate the uniform distribution, because this answer is focused on speed as long as the quality isn’t too bad.

Any kind of conditional behaviour that keeps a variable number of elements would take much more work. (But could maybe still be done somewhat efficiently using SIMD left-packing techniques. However, that gets less efficient for small element sizes; giant shuffle-mask lookup tables are not viable, and there’s no AVX2 lane-crossing shuffle with smaller than 32-bit elements. A 128b PSHUFB version might still be able to generate a mask on the fly with BMI2 PEXT/PDEP, like you can for AVX2 with larger elements, but it’s tricky because a 64-bit integer only holds 8 bytes. The godbolt link on that answer has some code that might work for higher element counts.)

If latency of the RNG is a bottleneck, we could go even faster by running two vectors of generators in parallel, alternating which one we use. The compiler can still easily keep everything in registers in an unrolled loop, and that lets the two dependency chains run in parallel.

In the current version, chopping up the output of the PRNG, we actually bottleneck on port 0 throughput, not PRNG latency, so there’s no need for that.

The code: AVX2 version

Full version with more comments on the Godbolt compiler explorer.

Not very tidy, sorry I have to get to sleep and want to get this posted.

To get the SSE2 version, s/_mm256/_mm, s/256/128/, s/v16u/v8u/, and change vector_size(32) to 16. Also change the newline increment from 4*16 to 4*8. (Like I said, code is messy, and not well set up for compiling two versions. Didn’t originally plan on making an AVX2 version, but then I really wanted to test on a Haswell CPU I had access to.)

#include <immintrin.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
//#include <string.h>

// This would work equally fast 128b or 256b at a time (AVX2):
// https://stackoverflow.com/questions/24001930/avx-sse-version-of-xorshift128
struct rngstate256 {
    __m256i state0;
    __m256i state1;
};

static inline __m256i xorshift128plus_avx2(struct rngstate256 *sp)
{
    __m256i s1 = sp->state0;
    const __m256i s0 = sp->state1;
    sp->state0 = s0;
    s1 = _mm256_xor_si256(s1, _mm256_slli_epi64(s1, 23));
    __m256i state1new = _mm256_xor_si256(_mm256_xor_si256(_mm256_xor_si256(s1, s0),
                            _mm256_srli_epi64(s1, 18)),
                      _mm256_srli_epi64(s0, 5));
    sp->state1 = state1new;
    return _mm256_add_epi64(state1new, s0);
}



// GNU C native vectors let us get the compiler to do stuff like %10 each element
typedef unsigned short v16u __attribute__((vector_size(32)));

__m256i* vec_store_digit_and_space(__m256i vec, __m256i *restrict p)
{
    v16u v = (v16u)vec;
    v16u ten = (v16u)_mm256_set1_epi16(10);

    v16u divisor = (v16u)_mm256_set1_epi16(6554);  // ceil((2^16-1) / 10.0)
    v16u div6554 = v / divisor;      // Basically the entropy from the upper two decimal digits: 0..65.
    // Probably some correlation with the modulo-based values, especially dig3, but we do this instead of
    // dig4 for more ILP and fewer instructions total.

    v16u dig1 = v % ten;
    v /= ten;
    v16u dig2 = v % ten;
    v /= ten;
    v16u dig3 = v % ten;
    //  dig4 would overlap much of the randomness that div6554 gets

    const v16u ascii_digitspace = (v16u)_mm256_set1_epi16( (' '<<8) | '0');

    v16u *vecbuf = (v16u*)p;
    vecbuf[0] = div6554 | ascii_digitspace;
    vecbuf[1] = dig1    | ascii_digitspace;
    vecbuf[2] = dig2    | ascii_digitspace;
    vecbuf[3] = dig3    | ascii_digitspace;
    return p + 4;  // always a constant number of full vectors
}


void random_decimal_fill_buffer(char *restrict buf, size_t len, struct rngstate256 *restrict rngstate)
{
    buf = __builtin_assume_aligned(buf, 32);

    // copy to a local so clang can keep state in register, even in the non-inline version
    // restrict works for gcc, but apparently clang still thinks that *buf might alias *rngstate
    struct rngstate256 rng_local = *rngstate;

    __m256i *restrict p = (__m256i*restrict)buf;
    __m256i *restrict endbuf = (__m256i*)(buf+len);
    static unsigned newline_pos = 0;
    do {
        __m256i rvec = xorshift128plus_avx2(&rng_local);
        p = vec_store_digit_and_space(rvec, p);  // stores multiple ASCII vectors from the entropy in rvec

#if 1
        // this is buggy at the end or start of a power-of-2 buffer:
        // usually there's a too-short line, sometimes a too-long line
        const unsigned ncols = 100;
        newline_pos += 4*16;
        if (newline_pos >= ncols) {
            newline_pos -= ncols;
            char *cur_pos = (char*)p;
            *(cur_pos - newline_pos*2 - 1) = '\n';
        }
#endif
        // Turning every 100th space into a newline.
        // 1) With an overlapping 1B store to a location selected by a counter.  A down-counter would be more efficient
        // 2) Or by using a different constant for ascii_digitspace to put a newline in one element

        // lcm(200, 16) is 400 bytes, so unrolling the loop enough to produce two full lines makes a pattern of full vectors repeat
        // lcm(200, 32) is 800 bytes
        // a power-of-2 buffer size doesn't hold a whole number of lines :/
        // I'm pretty sure this can be solved with low overhead, like maybe 10% at worst.
    } while(p <= endbuf-3);

    *rngstate = rng_local;
}



#define BUFFER_SIZE (128 * 1024)
const static size_t bufsz = BUFFER_SIZE;
__attribute__((aligned(64))) static char static_buf[BUFFER_SIZE];

int main(int argc, char *argv[])
{
    // TODO: choose a seed properly.  (Doesn't affect the speed)
    struct rngstate256 xorshift_state = {
      _mm256_set_epi64x(123, 456, 0x123, 0x456),
      _mm256_set_epi64x(789, 101112, 0x789, 0x101112)
    };

    for (int i=0; i < 1024ULL*1024*1024 / bufsz * 100; i++) {
        random_decimal_fill_buffer(static_buf, bufsz, &xorshift_state);
        size_t written = write(1, static_buf, bufsz);
        (void)written;
        //fprintf(stderr, "wrote %#lx of %#lx\n", written, bufsz);
    }

}

Compile with gcc, clang, or ICC (or hopefully any other compiler that understands the GNU C dialect of C99, and Intel’s intrinsics). GNU C vector extensions are highly convenient to get the compiler to generate the magic numbers for division/modulo using modular multiplicative inverses, and occasional __attribute__s are useful.

This could be written portably, but it would take more code.

Performance notes:

The overlapping-store to insert newlines has significant overhead to decide where to place it (branch mispredictions, and frontend bottlenecks on Core2), but the store itself has no impact on performance. Commenting out just that store instruction in the compiler’s asm (leaving all the branching the same) left the performance on Core2 completely unchanged, with repeated runs giving the same time to +/- less than 1%. So I conclude that the store buffer / cache handle it just fine.

Still, using some kind of rotating window of ascii_digitspace with one element having a newline might be even faster, if we unroll enough that any counters/branching go away.
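One way to avoid the too-short or too-long lines at buffer boundaries is to do the newline placement as a separate pass that carries the column count from one buffer to the next. The following is a plain scalar sketch of that idea, with a hypothetical helper name. It is not the down-counter or rotating-constant approach described above, and it has not been benchmarked.

```c
#include <assert.h>
#include <stddef.h>

// Replace the space after every 100th digit with '\n' in a buffer of
// "digit, space" pairs. *col is the number of digits already emitted on the
// current line; it is carried across calls so lines can straddle buffers.
static void place_newlines(char *buf, size_t len, unsigned *col)
{
    const unsigned ncols = 100;
    assert(len % 2 == 0 && *col < ncols);
    // Byte offset of the separator that ends the current line.
    size_t pos = (size_t)(ncols - *col) * 2 - 1;
    for (; pos < len; pos += 2 * ncols)
        buf[pos] = '\n';
    *col = (unsigned)((*col + len / 2) % ncols);
}
```

With a 256-byte buffer and *col starting at 0, the first call puts a newline at offset 199 and leaves *col at 28; the next call on a fresh buffer puts one at offset 143. The column counter would need to live next to the RNG state so it survives between calls.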

Writing to /dev/null is basically a no-op, so the buffer probably stays hot in L2 cache (256kiB per core on Haswell). The perfect speedup from 128b vectors to 256b vectors is expected: there are no extra instructions, and everything (including the stores) happens with twice the width. The newline-insertion branch is taken twice as often, though. I unfortunately didn’t time on my Haswell cygwin setup with that part #ifdefed out.

2.5GHz * 32B / 13.7GB/s = 5.84 cycles per AVX2-store on Haswell. That’s pretty good, but could be faster. Maybe there’s more overhead in the cygwin system calls than I thought. I didn’t try commenting those out in the compiler’s asm output (which would ensure that nothing optimized away).

L1 cache can sustain one 32B store per clock, and L2 is not much lower bandwidth (higher latency, though).

When I looked at IACA a few versions ago (without the branching for newlines, but only getting one ASCII vector per RNG vector), it was predicting something like one 32B vector store per 4 or 5 clocks.

I was hoping to get more of a speedup from extracting more data from each RNG result, based on looking at the asm myself and considering Agner Fog’s guides and other optimization resources, which I’ve added links for in the SO x86 tag wiki.

Likely it would be significantly faster on Skylake, where vector integer multiply and shift can run on twice as many ports (p0 / p1) compared to Haswell (p0 only). xorshift and the digit extraction both use a lot of shifts and multiplies. (Update: Skylake runs it at 3.02 IPC, giving us 3.77 cycles per 32-byte AVX2 store, timed at 0.030s per 1GB iteration, writing to /dev/null on Linux 4.15 on i7-6700k at 3.9GHz.)

It doesn’t require 64-bit mode to work well. The SSE2 version is just as fast when compiled with -m32, because it doesn’t need very many vector registers, and all the 64-bit math is done in vectors, not general-purpose registers.

It’s actually slightly faster in 32-bit mode on Core2, because compare/branch macro-fusion only works in 32-bit mode, so there are fewer uops for the out-of-order core (18.3s (1.85 Instructions Per Clock) vs. 16.9s (2.0 IPC)). The smaller code-size from having no REX prefixes also helps Core2’s decoders.

Also, some reg-reg vector moves are replaced with loads, since not all the constants fit in vector regs anymore. Since load throughput from L1 cache isn’t a bottleneck, this actually helps. (e.g. multiplying by a constant vector of set1(10): movdqa xmm0, xmm10 / pmullw xmm0, xmm1 turns into movdqa xmm0, [constant] / pmullw xmm0, xmm1.) Since reg-reg MOVDQA requires an ALU port, it competes with the real work being done, but a MOVDQA load only competes for front-end decode bandwidth. (Having a 4-byte address inside many instructions cancels out a lot of the gain from saving REX prefixes.)

I wouldn’t be surprised if saving ALU MOVDQA uops is where the real gains are coming from, since the frontend should be keeping up with the average of 2.0 IPC pretty well.

All these differences disappear on Haswell, where the whole thing should run from the decoded-uop cache, if not the loopback buffer. ALU+branch macro-fusion works in both modes since Nehalem.

Method 5

Here is a solution I hope is simple to understand:

od -An -x /dev/urandom | tr -dc 0-9 | fold -w100 | awk NF=NF FS= | head -c1G

od creates a uniform stream of hexadecimal digits from /dev/urandom.
tr gets rid of letters, only keeping 0-9 digits
fold ensures there are 100 digits per line
awk inserts spaces inside lines
head truncates the input to 1 gigabyte
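The od and tr stages amount to rejection sampling on 4-bit values: every hex digit is equally likely, and throwing away a-f leaves 0-9 equally likely. A minimal C sketch of the same filter, with a hypothetical function name (od -x groups bytes into 16-bit words in host byte order, so the digit order differs from this sketch, but the distribution is the same):

```c
#include <assert.h>
#include <stddef.h>

// Keep only the hex digits 0-9 from a block of random bytes and write them
// as ASCII. Returns the number of digits written; out needs room for 2*n chars.
static size_t bytes_to_digits(const unsigned char *in, size_t n, char *out)
{
    assert(in != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned hi = in[i] >> 4, lo = in[i] & 0xF;
        if (hi < 10) out[k++] = (char)('0' + hi);   // values 10..15 (a-f) are rejected
        if (lo < 10) out[k++] = (char)('0' + lo);
    }
    return k;
}
```

Ten of the sixteen possible 4-bit values survive, so this yields 1.25 digits per random byte on average.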

Method 6

You can use the jot command for this:

jot -r 50000000 0 9 | fmt -w 200 > output.txt

Method 7

This is similar to Stéphane Chazelas’ method, but I read 64 bits at once to improve performance. The distribution is still uniform, but now you get 19 digits for each 8 bytes instead of only 8 in the best case like before.

perl -nle 'BEGIN{$/=\8; $,=" "}
           $n = unpack("Q");
           next if $n >= 10000000000000000000;
           $s = sprintf("%019u", $n);
           push @a, (split //, $s);
           if (@a >= 100) {print (splice @a, 0, 100);}' < /dev/urandom | head -c1G

On 32-bit platform 9 digits will be read each time instead of 19.
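The rejection step is what keeps the digits uniform: 2^64 is not a multiple of 10^19, so values at or above 10^19 are discarded and everything below is printed zero-padded to 19 digits. A C sketch of just that conversion, with a hypothetical function name and assuming the 8 random bytes have already been read into a uint64_t:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Turn one 64-bit random value into 19 decimal digits, or reject it.
// Returns 1 and fills out[0..18] plus a terminating NUL on success,
// 0 if the value must be discarded.
static int u64_to_19_digits(uint64_t n, char out[20])
{
    assert(out != NULL);
    if (n >= UINT64_C(10000000000000000000))   // 10^19: keeps all 19 digits uniform
        return 0;
    snprintf(out, 20, "%019" PRIu64, n);
    return 1;
}
```

About 54% of 64-bit values are below 10^19, so on average this gives a little over 10 digits per 8 bytes read.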

Method 8

I kind of agree with Nominal Animal in using a compiled programming language if you need speed. However, you do not have to write your own RNG code in C. C++11 offers the excellent Mersenne Twister as part of its standard library.

#include <time.h>
#include <random>
#include <iostream>
using namespace std;

int main() {
    mt19937 gen(time(0)); 
    uniform_int_distribution<> dist(0,9);

    for(int j=0; j<5000000; j++){
        for (int i = 0; i < 99; i++) {  
            cout << dist(gen) << " ";
        }  
        cout << dist(gen) << endl;
    }
    return 0;
}

The above code is reasonably simple and takes about a minute when I pipe the output to a file. We can go a lot faster by creating a string big enough for 100 digits and hacking the digits into it. This allows us to call cout every line rather than every digit.

#include <time.h>
#include <random>
#include <iostream>
using namespace std;

int main() {
    mt19937 gen(time(0)); 
    uniform_int_distribution<> dist(0,9);

    char line[201];
    for(int i=1; i<199; i++)
        line[i] = ' ';
    line[199] = '\n';
    line[200] = 0;

    for(int j=0; j<5000000; j++){
        for (int i = 0; i < 199; i += 2) {  
            line[i] = dist(gen)+'0';
        }  
        cout << line;
    }
    return 0;
}

This code takes my machine around six seconds. Remember it’s standard output, so pipe it to a file.

I have a couple of disclaimers. First, I’m writing this on a Windows PC. I think the libraries are all present on Linux, but if I’m wrong, be sure to point it out.

Also, it actually outputs exactly half a billion space separated digits, which is technically a gigabyte but maybe not exactly what you wanted. It outputs 5 million lines, 100 digits per line. If the difference is important, you can increase the number of lines. On my Windows box the file seems to be slightly larger than 10^9 bytes, which I think is something to do with extra newline characters.

Method 9

It depends on your definition of “random”. If you mean cryptographically random, you just have to get a good library and bite the bullet, wait for it to run.

If you just need something that looks pretty random, here’s an easy way:

Get a file that is several GB long. Your favorite movie will be good.
Gzip it, an easy way to squeeze out repeated patterns
Go through the file a nybble (half a byte) at a time. Each value will be between 0 and 15. Throw away any less than 1 or greater than 10. Subtract 1 from each of the first billion survivors and write it out as a digit.

It might take an hour to run on a slow machine; fast enough and random enough for most purposes.

Method 10

#!/bin/bash
FILE_CREAT='/tmp/testfile'
MAX_SIZE=$(( 1 * 1024 * 1024 ))
rm -rf ${FILE_CREAT}
while true
do
    STRING=''
    for (( i = 0 ; i < 100 ; i++ ))
    do
        NUM_RAN=$(cat /dev/urandom | tr -dc 0-9 | head -c 1)
        if [ $i -eq 0 ]
        then
            STRING=${NUM_RAN}
        else
            STRING=${STRING}' '${NUM_RAN}
        fi
    done
    echo ${STRING} >> $FILE_CREAT
    FILE_SIZE=$(du -s ${FILE_CREAT} | awk '{print $1}')
    if [ ${FILE_SIZE} -ge ${MAX_SIZE} ]
    then
        break
    fi
done
exit $1

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
