Fast way to build a test file with every second listed in YYYY-mm-dd HH:MM:SS format

I want to create a large test file with lines containg dates listed by the second, but my method is taking inordinately long… (or at least, that’s how it feels 🙂 … 43 minutes to create only 1051201 lines. 20.1 MB file….

I want to crate a much bigger file, with each line’s date being unique..
Is there a faster way than how I’ve approached it?:

# # BEGIN CREATE TEST DATA  ============ 
# # Create some dummy data.
  file=/tmp/$USER/junk
  ((secY2 =s3600*24*365*2))
  cnt=0
  secBeg=$(date --date="2010-01-01 00:00:00" +%s)
  secEnd=$((secBeg+secY2))
  ((sec=secBeg))
  while ((sec<=secEnd)) ; do
      date -d '1970-01-01 UTC '$sec' seconds' '+%Y-%m-%d %H:%M:%S' >>"$file" 
      ((sec+=1))
      ((cnt+=1))
  done
  ls -l "$file"
  echo Lines written: $cnt
# END CREATE TEST DATA  ============

Contents hide

Answers:

Method 1

Method 2

Method 3

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

I haven’t made any benchmark, but I see a few potential improvements.

You open and close the file for each call to date. This is a waste: just put the redirection around the whole loop.

while …; do …; done >"$file"

You’re making separate calls to date for each line. Unix is good at calling external programs quickly, but internal is still better. GNU date has a batch option: feed it dates on standard input, and it pretty-prints them. Furthermore, to enumerate a range of integers, use seq, it’s likely to be faster than interpreting the loop in the shell.

seq -f @%12.0f $secBeg $secEnd | date -f - '+%Y-%m-%d %H:%M:%S' >"$file"
cnt=$(($secY2 + 1))

Generally speaking, if your shell script is too slow, try to have the inner loop executed in a dedicated utility — here seq and date, but often sed or awk. If you can’t manage that, switch to a more advanced scripting language such as Perl or Python (but the dedicated utilities are typically faster, if you fit their use cases).

Method 2

We know it’s slow from running:

$ time ./junk.sh
Lines written: 14401
./junk.sh  2.27s user 3.31s system 21% cpu 25.798 total

(and that’s a version that only prints 4 hours, not 2 years.)

To get a better understanding of where bash is spending its time, we can use strace -c.

$ strace -c ./junk.sh
Lines written: 14401
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.01    0.128906           4     28806     14403 waitpid
 17.92    0.029241           2     14403           clone
  2.45    0.003999           0    158448           rt_sigprocmask
  0.33    0.000532           0     28815           rt_sigaction
  0.29    0.000479           0     14403           sigreturn

So we can see that the top two calls are waitpid and clone. They don’t take up much time on their own (only 0.128906 seconds and 0.029241 seconds), but we can see they are being called a lot, so we are suspecting the problem is the fact we are having to start a separate date command to echo each number.

So then I did some searching, and found out you can compile bash with gprof support by doing:

$ ./configure --enable-profiling --without-bash-malloc
$ make

Now using that:

$ ./bash-gprof junk.sh
Lines written: 14401
$ gprof ./bash-gprof gmon.out

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
  8.05      0.28     0.28    14403     0.00     0.00  make_child
  6.61      0.51     0.23                             __gconv_transform_utf8_internal
  5.75      0.71     0.20                             fork
  5.75      0.91     0.20   259446     0.00     0.00  hash_search
  5.17      1.09     0.18   129646     0.00     0.00  dispose_words

So assuming the function names are meaningful, it confirms that the problem is we are making bash fork and call an external command repeatedly.

If we move the >> to the end of the while loop, it barely makes a dent.

$ time ./junk2.sh
...
./junk2.sh  2.46s user 3.18s system 22% cpu 25.659 total

But Gilles’ answer finds a way to only run date once, and not surprisingly, it’s much faster:

$ time ./bash-gprof junk3.sh
Lines written: 14401
./bash-gprof junk3.sh  0.10s user 0.16s system 96% cpu 0.264 total

$ strace -c ./bash-gprof junk3.sh
Lines written: 14401
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.63    0.039538        5648         7         3 waitpid
  2.37    0.000961          37        26           writev
  0.00    0.000000           0         9           read
  ...
  0.00    0.000000           0         4           clone


$ gprof ./bash-gprof gmon.out 
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.00      0.00     0.00     1162     0.00     0.00  xmalloc
  0.00      0.00     0.00      782     0.00     0.00  mbschr
  0.00      0.00     0.00      373     0.00     0.00  shell_getc

7 waitpids and 4 clones compared to 28806 and 14403 in the original!

So the moral is: If you have to call an external command inside a loop that is repeated many times, you either need to find a way to move it out of the loop, or switch to a programming language that doesn’t have to call an external command to do the work.

As requested, a test based on Iain’s method (modified to use same variable names and looping):

#!/bin/bash
datein=junk.$$.datein
file=junk.$$
((secY2=3600*4))
cnt=0
secBeg=$(date --date="2010-01-01 00:00:00" +%s)
secEnd=$((secBeg+secY2))
((sec=secBeg))
while ((sec<=secEnd)) ; do
  echo @$sec >>"$datein"
  ((sec+=1))
  ((cnt+=1))
done
date --file="$datein" '+%Y-%m-%d %H:%M:%S' >>"$file"
ls -l "$file"
rm "$datein"
echo Lines written: $cnt

Results:

$ time ./bash-gprof ./junk4.sh 
Lines written: 14401
./bash-gprof ./junk4.sh  0.92s user 0.20s system 94% cpu 1.182 total

$ strace -c ./junk4.sh
Lines written: 14401
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.71    0.116007       14501         8         4 waitpid
  3.70    0.004684           0     14402           write
  1.54    0.001944           0     28813           close
  1.35    0.001707           0     72008         1 fcntl64
  0.88    0.001109           0     43253           rt_sigprocmask
  0.45    0.000566           0     28803           dup2
  0.36    0.000452           0     14410           open

$ gprof ./bash-gprof gmon.out 
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 22.06      0.15     0.15                             __gconv_transform_utf8_internal
 16.18      0.26     0.11                             mbrtowc
  7.35      0.31     0.05                             _int_malloc
  5.88      0.35     0.04                             __profile_frequency
  4.41      0.38     0.03   345659     0.00     0.00  readtok
  4.41      0.41     0.03                             _int_free
  2.94      0.43     0.02   230661     0.00     0.00  hash_search
  2.94      0.45     0.02    28809     0.00     0.00  stupidly_hack_special_variables
  1.47      0.46     0.01   187241     0.00     0.00  cprintf
  1.47      0.47     0.01   115232     0.00     0.00  do_redirections

So close and open are showing up.

Now Eelvex’s observation about >> per line versus > around the while loop starts to make a difference.

Let’s factor it out…

#!/bin/bash
datein=junk.$$.datein
file=junk.$$
((secY2=3600*4))
cnt=0
secBeg=$(date --date="2010-01-01 00:00:00" +%s)
secEnd=$((secBeg+secY2))
for ((sec=secBeg; sec<=secEnd; sec=sec+1)) ; do
  echo @$sec
  ((cnt+=1))
done >"$datein"
date --file="$datein" '+%Y-%m-%d %H:%M:%S' >>"$file"
ls -l "$file"
rm "$datein"
echo Lines written: $cnt

$ time ./junk6.sh
Lines written: 14401
./junk6.sh  0.58s user 0.14s system 95% cpu 0.747 total

$ strace -c junk6.sh
Lines written: 14401
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.41    0.092263       11533         8         4 waitpid
  2.06    0.001949           0     43252           rt_sigprocmask
  0.53    0.000506           0     14402           write
  0.00    0.000000           0        13           read
  0.00    0.000000           0        10           open
  0.00    0.000000           0        13           close
  0.00    0.000000           0         1           execve

$ gprof ./bash-gprof gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 10.00      0.05     0.05    72025     0.00     0.00  expand_word_internal
 10.00      0.10     0.05                             __gconv_transform_utf8_internal
  8.00      0.14     0.04                             __profile_frequency
  8.00      0.18     0.04                             _int_malloc
  4.00      0.20     0.02  1355024     0.00     0.00  xmalloc
  4.00      0.22     0.02   303217     0.00     0.00  mbschr

Which is also much, much faster than the original script, but slightly slower that Gilles’.

Method 3

This script generates a 10 million line 201Mb file in 7m50.0s on a VM I have handy. That’s about 1.5Gb/hr.

#!/bin/bash
Tstart=$(date +%s)
let Tend=$Tstart+100000000

[ -e datein.txt ] && rm datein.txt
[ -e logfile.log ] && rm logfile.log

for (( Tloop=Tstart; Tloop <=Tend; Tloop++ ))
do
    echo @$Tloop >> datein.txt
done

date --file=datein.txt '+%Y-%m-%d %H:%M:%S' >>logfile.log

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating