I want to create a large test file with lines containing dates listed by the second, but my method is taking inordinately long (or at least, that’s how it feels 🙂): 43 minutes to create only 1051201 lines, a 20.1 MB file.
I want to create a much bigger file, with each line’s date being unique.
Is there a faster way than how I’ve approached it?
# # BEGIN CREATE TEST DATA ============
# # Create some dummy data.
file=/tmp/$USER/junk
((secY2=3600*24*365*2))
cnt=0
secBeg=$(date --date="2010-01-01 00:00:00" +%s)
secEnd=$((secBeg+secY2))
((sec=secBeg))
while ((sec<=secEnd)) ; do
date -d '1970-01-01 UTC '$sec' seconds' '+%Y-%m-%d %H:%M:%S' >>"$file"
((sec+=1))
((cnt+=1))
done
ls -l "$file"
echo Lines written: $cnt
# END CREATE TEST DATA ============
Answers:
Method 1
I haven’t made any benchmark, but I see a few potential improvements.
You open and close the file for each call to date. This is a waste: just put the redirection around the whole loop.
while …; do …; done >"$file"
You’re making a separate call to date for each line. Unix is good at calling external programs quickly, but staying inside one process is still better. GNU date has a batch option: feed it dates on standard input, and it pretty-prints them. Furthermore, to enumerate a range of integers, use seq; it’s likely to be faster than interpreting the loop in the shell.
seq -f @%12.0f $secBeg $secEnd | date -f - '+%Y-%m-%d %H:%M:%S' >"$file"
cnt=$(($secY2 + 1))
Generally speaking, if your shell script is too slow, try to have the inner loop executed in a dedicated utility — here seq and date, but often sed or awk. If you can’t manage that, switch to a more advanced scripting language such as Perl or Python (but the dedicated utilities are typically faster, if you fit their use cases).
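The seq and date pipeline above can be tried on a small range before generating the full file. This is only a minimal sketch, assuming GNU seq and GNU date; the function name gen_dates is made up for illustration, it uses @%.0f rather than @%12.0f so that no padding is emitted, and it passes -u so the output does not depend on the local time zone.

```shell
#!/bin/bash
# gen_dates START COUNT: print COUNT consecutive timestamps, one per second,
# starting at epoch second START.
# Hypothetical helper; assumes GNU seq and GNU date.
gen_dates() {
    local start=$1 count=$2
    # seq emits "@<epoch>" lines; a single date process formats all of them.
    seq -f '@%.0f' "$start" "$((start + count - 1))" |
        date -u -f - '+%Y-%m-%d %H:%M:%S'
}

# Example: the first three seconds of 2010 (UTC).
gen_dates 1262304000 3
```

Redirecting the call once, as in gen_dates ... >"$file", keeps the output file open for the whole run instead of reopening it for every line.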
Method 2
We know it’s slow from running:
$ time ./junk.sh
Lines written: 14401
./junk.sh 2.27s user 3.31s system 21% cpu 25.798 total
(and that’s a version that only prints 4 hours, not 2 years.)
To get a better understanding of where bash is spending its time, we can use strace -c.
$ strace -c ./junk.sh
Lines written: 14401
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 79.01    0.128906           4     28806     14403 waitpid
 17.92    0.029241           2     14403           clone
  2.45    0.003999           0    158448           rt_sigprocmask
  0.33    0.000532           0     28815           rt_sigaction
  0.29    0.000479           0     14403           sigreturn
So we can see that the top two calls are waitpid and clone. They don’t take up much time on their own (only 0.128906 seconds and 0.029241 seconds), but they are called a lot, so we suspect the problem is that we have to start a separate date command to print each timestamp.
So then I did some searching, and found out you can compile bash with gprof support by doing:
$ ./configure --enable-profiling --without-bash-malloc
$ make
Now using that:
$ ./bash-gprof junk.sh
Lines written: 14401
$ gprof ./bash-gprof gmon.out
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
  8.05      0.28     0.28    14403     0.00     0.00  make_child
  6.61      0.51     0.23                             __gconv_transform_utf8_internal
  5.75      0.71     0.20                             fork
  5.75      0.91     0.20   259446     0.00     0.00  hash_search
  5.17      1.09     0.18   129646     0.00     0.00  dispose_words
So assuming the function names are meaningful, it confirms that the problem is we are making bash fork and call an external command repeatedly.
If we move the >> to the end of the while loop, it barely makes a dent.
$ time ./junk2.sh
...
./junk2.sh 2.46s user 3.18s system 22% cpu 25.659 total
But Gilles’ answer finds a way to only run date once, and not surprisingly, it’s much faster:
$ time ./bash-gprof junk3.sh
Lines written: 14401
./bash-gprof junk3.sh 0.10s user 0.16s system 96% cpu 0.264 total
$ strace -c ./bash-gprof junk3.sh
Lines written: 14401
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.63    0.039538        5648         7         3 waitpid
  2.37    0.000961          37        26           writev
  0.00    0.000000           0         9           read
...
  0.00    0.000000           0         4           clone
$ gprof ./bash-gprof gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
 no time accumulated
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00     1162     0.00     0.00  xmalloc
  0.00      0.00     0.00      782     0.00     0.00  mbschr
  0.00      0.00     0.00      373     0.00     0.00  shell_getc
7 waitpids and 4 clones compared to 28806 and 14403 in the original!
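To reproduce the difference on a small scale, the two approaches can be timed side by side. The following is a rough sketch rather than the exact scripts measured above: the function names per_line and batch are made up for illustration, GNU date is assumed, and the absolute timings will vary from machine to machine.

```shell
#!/bin/bash
# Hypothetical comparison of the two approaches on a small range.
# Assumes GNU date (for -d "@epoch" and -f -).

# per_line FIRST LAST: one date process per timestamp, as in the original script.
per_line() {
    local sec
    for ((sec = $1; sec <= $2; sec++)); do
        date -u -d "@$sec" '+%Y-%m-%d %H:%M:%S'
    done
}

# batch FIRST LAST: the loop only prints "@epoch" lines;
# a single date process formats the whole range.
batch() {
    local sec
    for ((sec = $1; sec <= $2; sec++)); do
        echo "@$sec"
    done | date -u -f - '+%Y-%m-%d %H:%M:%S'
}

# Time both on the same 100-second range; the output itself is discarded.
time per_line 1262304000 1262304099 >/dev/null
time batch    1262304000 1262304099 >/dev/null
```

Both functions print the same lines, so any difference in the reported times comes from how many date processes are started.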
So the moral is: If you have to call an external command inside a loop that is repeated many times, you either need to find a way to move it out of the loop, or switch to a programming language that doesn’t have to call an external command to do the work.
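A third option, not used in any of the answers here, is to stay in bash but avoid the external command entirely. Bash 4.2 and later can format an epoch timestamp with the printf builtin and its %(...)T format. A minimal sketch, assuming such a bash version; the function name gen_dates_builtin is made up for illustration.

```shell
#!/bin/bash
# Requires bash 4.2 or later for the %(...)T printf format.

# gen_dates_builtin START COUNT: print COUNT consecutive timestamps starting
# at epoch second START, without starting any external process.
# Hypothetical helper name.
gen_dates_builtin() {
    local start=$1 count=$2 sec
    for ((sec = start; sec < start + count; sec++)); do
        # printf is a shell builtin, so nothing is forked per line.
        printf '%(%Y-%m-%d %H:%M:%S)T\n' "$sec"
    done
}

# %(...)T formats in the local time zone; exporting TZ=UTC in a subshell
# makes the output reproducible without changing the caller's environment.
( export TZ=UTC; gen_dates_builtin 1262304000 3 )
```

The loop is still interpreted by the shell, so this is a sketch of the idea rather than a claim that it beats the seq and date pipeline.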
As requested, a test based on Iain’s method (modified to use same variable names and looping):
#!/bin/bash
datein=junk.$$.datein
file=junk.$$
((secY2=3600*4))
cnt=0
secBeg=$(date --date="2010-01-01 00:00:00" +%s)
secEnd=$((secBeg+secY2))
((sec=secBeg))
while ((sec<=secEnd)) ; do
echo @$sec >>"$datein"
((sec+=1))
((cnt+=1))
done
date --file="$datein" '+%Y-%m-%d %H:%M:%S' >>"$file"
ls -l "$file"
rm "$datein"
echo Lines written: $cnt
Results:
$ time ./bash-gprof ./junk4.sh
Lines written: 14401
./bash-gprof ./junk4.sh 0.92s user 0.20s system 94% cpu 1.182 total
$ strace -c ./junk4.sh
Lines written: 14401
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.71    0.116007       14501         8         4 waitpid
  3.70    0.004684           0     14402           write
  1.54    0.001944           0     28813           close
  1.35    0.001707           0     72008         1 fcntl64
  0.88    0.001109           0     43253           rt_sigprocmask
  0.45    0.000566           0     28803           dup2
  0.36    0.000452           0     14410           open
$ gprof ./bash-gprof gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 22.06      0.15     0.15                             __gconv_transform_utf8_internal
 16.18      0.26     0.11                             mbrtowc
  7.35      0.31     0.05                             _int_malloc
  5.88      0.35     0.04                             __profile_frequency
  4.41      0.38     0.03   345659     0.00     0.00  readtok
  4.41      0.41     0.03                             _int_free
  2.94      0.43     0.02   230661     0.00     0.00  hash_search
  2.94      0.45     0.02    28809     0.00     0.00  stupidly_hack_special_variables
  1.47      0.46     0.01   187241     0.00     0.00  cprintf
  1.47      0.47     0.01   115232     0.00     0.00  do_redirections
So close and open are showing up.
Now Eelvex’s observation about >> per line versus > around the while loop starts to make a difference.
Let’s factor it out…
#!/bin/bash
datein=junk.$$.datein
file=junk.$$
((secY2=3600*4))
cnt=0
secBeg=$(date --date="2010-01-01 00:00:00" +%s)
secEnd=$((secBeg+secY2))
for ((sec=secBeg; sec<=secEnd; sec=sec+1)) ; do
echo @$sec
((cnt+=1))
done >"$datein"
date --file="$datein" '+%Y-%m-%d %H:%M:%S' >>"$file"
ls -l "$file"
rm "$datein"
echo Lines written: $cnt
$ time ./junk6.sh
Lines written: 14401
./junk6.sh 0.58s user 0.14s system 95% cpu 0.747 total
$ strace -c junk6.sh
Lines written: 14401
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.41    0.092263       11533         8         4 waitpid
  2.06    0.001949           0     43252           rt_sigprocmask
  0.53    0.000506           0     14402           write
  0.00    0.000000           0        13           read
  0.00    0.000000           0        10           open
  0.00    0.000000           0        13           close
  0.00    0.000000           0         1           execve
$ gprof ./bash-gprof gmon.out
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 10.00      0.05     0.05    72025     0.00     0.00  expand_word_internal
 10.00      0.10     0.05                             __gconv_transform_utf8_internal
  8.00      0.14     0.04                             __profile_frequency
  8.00      0.18     0.04                             _int_malloc
  4.00      0.20     0.02  1355024     0.00     0.00  xmalloc
  4.00      0.22     0.02   303217     0.00     0.00  mbschr
This is also much, much faster than the original script, but slightly slower than Gilles’.
Method 3
This script generates a 10 million line, 201 MB file in 7m50.0s on a VM I have handy. That’s about 1.5 GB/hr.
#!/bin/bash
Tstart=$(date +%s)
let Tend=$Tstart+10000000
[ -e datein.txt ] && rm datein.txt
[ -e logfile.log ] && rm logfile.log
for (( Tloop=Tstart; Tloop <=Tend; Tloop++ ))
do
echo @$Tloop >> datein.txt
done
date --file=datein.txt '+%Y-%m-%d %H:%M:%S' >>logfile.log
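The file sizes quoted in this thread can be sanity-checked with a little arithmetic, because every output line has the same length. A small sketch; the helper name est_bytes is made up for illustration, and it assumes the '+%Y-%m-%d %H:%M:%S' format used throughout (19 characters plus a newline).

```shell
#!/bin/bash
# Each output line is "YYYY-MM-DD HH:MM:SS" (19 characters) plus a newline.
BYTES_PER_LINE=20

# est_bytes LINES: expected file size in bytes for LINES output lines.
# Hypothetical helper name.
est_bytes() {
    echo $(( $1 * BYTES_PER_LINE ))
}

est_bytes 1051201                          # line count quoted in the question
est_bytes 10000000                         # 10 million lines, as in Method 3
est_bytes $(( 3600 * 24 * 365 * 2 + 1 ))   # two years of seconds, inclusive
```

The first result, 21024020 bytes, is about 20.05 MiB, which is consistent with the 20.1 MB quoted in the question. The last one comes to roughly 1.26 GB for the full two-year range.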
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.