Suppose I have a file (call it sample.txt) that looks like this:
Row1,10 Row2,20 Row3,30 Row4,40
I want to be able to work on a stream from this file that is essentially the pairwise combination of all four rows (so we should end up with 16 in total). For instance, I’m looking for a streaming (ie efficient) command where the output is:
Row1,10 Row1,10 Row1,10 Row2,20 Row1,10 Row3,30 Row1,10 Row4,40 Row2,20 Row1,10 Row1,20 Row2,20 ... Row4,40 Row4,40
My use case is that I want to stream this output into another command (like awk) to compute some metric about this pairwise combination.
I have a way to do this in awk but my concern is that my use of the END{} block means that I’m basically storing the entire file in memory before I output. Example code:
awk '{arr[$1]=$1} END{for (a in arr){ for (a2 in arr) { print arr[a] " " arr[a2]}}}' samples/rows.txt
Row3,30 Row3,30
Row3,30 Row4,40
Row3,30 Row1,10
Row3,30 Row2,20
Row4,40 Row3,30
Row4,40 Row4,40
Row4,40 Row1,10
Row4,40 Row2,20
Row1,10 Row3,30
Row1,10 Row4,40
Row1,10 Row1,10
Row1,10 Row2,20
Row2,20 Row3,30
Row2,20 Row4,40
Row2,20 Row1,10
Row2,20 Row2,20
Is there an efficient streaming way to do this without having to essentially store the file in memory and then output in the END block?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Here’s how to do it in awk so that it doesn’t have to store the whole file in an array. This is basically the same algorithm as terdon’s.
If you like, you can even give it multiple filenames on the command line and it will process each file independently, concatenating the results together.
#!/usr/bin/awk -f
#Cartesian product of records
{
file = FILENAME
while ((getline line <file) > 0)
print $0, line
close(file)
}
On my system, this runs in about 2/3 the time of terdon’s perl solution.
Method 2
I’m not sure this is better than doing it in memory, but with a sed that reads out its infile for every line in its infile and another on the other side of a pipe alternating Hold space with input lines…
cat <<IN >/tmp/tmp Row1,10 Row2,20 Row3,30 Row4,40 IN </tmp/tmp sed -e 'i ' -e 'r /tmp/tmp' | sed -n '/./!n;h;N;/n$/D;G;s/n/ /;P;D'
OUTPUT
Row1,10 Row1,10 Row1,10 Row2,20 Row1,10 Row3,30 Row1,10 Row4,40 Row2,20 Row1,10 Row2,20 Row2,20 Row2,20 Row3,30 Row2,20 Row4,40 Row3,30 Row1,10 Row3,30 Row2,20 Row3,30 Row3,30 Row3,30 Row4,40 Row4,40 Row1,10 Row4,40 Row2,20 Row4,40 Row3,30 Row4,40 Row4,40
I did this another way. It does store some in memory – it stores a string like:
"$1" -
… for each line in the file.
pairs(){ [ -e "$1" ] || return
set -- "$1" "$(IFS=0 n=
case "${0%sh*}" in (ya|*s) n=-1;; (mk|po) n=+1;;esac
printf '"$1" - %s' $(printf "%.$(($(wc -l <"$1")$n))d" 0))"
eval "cat -- $2 </dev/null | paste -d ' n' -- $2"
}
It is very fast. It cat‘s the file as many times as there are lines in the file to a |pipe. On the other side of the pipe that input is merged with the file itself as many times as there are lines in the file.
The case stuff is just for portability – yash and zsh both add one element to the split, while mksh and posh both lose one. ksh, dash, busybox, and bash all split out to exactly as many fields as there are zeroes as printed by printf. As written the above renders the same results for every one of the above mentioned shells on my machine.
If the file is very long, there may be $ARGMAX issues with too many arguments in which case you would need to introduce xargs or similar as well.
Given the same input I used before the output is identical. But, if I were to go bigger…
seq 10 10 10000 | nl -s, >/tmp/tmp
That generates a file almost identical to what I used before (sans ‘Row’) – but at 1000 lines. You can see for yourself how fast it is:
time pairs /tmp/tmp |wc -l 1000000 pairs /tmp/tmp 0.20s user 0.07s system 110% cpu 0.239 total wc -l 0.05s user 0.03s system 32% cpu 0.238 total
At 1000 lines there is some slight variation in performance between shells – bash is invariably the slowest – but because the only work they do anyway is generate the arg string (1000 copies of filename -) the effect is minimal. The difference in performance between zsh – as above – and bash is 100th of a second here.
Here’s another version that should work for a file of any length:
pairs2()( [ -e "$1" ] || exit
rpt() until [ "$((n+=1))" -gt "$1" ]
do printf %s\n "$2"
done
[ -n "${1##*/*}" ] || cd -P -- "${1%/*}" || exit
: & set -- "$1" "/tmp/pairs$!.ln" "$(wc -l <"$1")"
ln -s "$PWD/${1##*/}" "$2" || exit
n=0 rpt "$3" "$2" | xargs cat | { exec 3<&0
n=0 rpt "$3" p | sed -nf - "$2" | paste - /dev/fd/3
}; rm "$2"
)
It creates a soft-link to its first arg in /tmp with a semi-random name so that it won’t get hung-up on weird filenames. That’s important because cat‘s args are fed to it over a pipe via xargs. cat‘s output is saved to <&3 while sed prints every line in the first arg as many times as there are lines in that file – and its script is also fed to it via a pipe. Again paste merges its input, but this time it takes only two arguments - again for its standard input and the link name /dev/fd/3.
That last – the /dev/fd/[num] link – should work on any linux system and many more besides, but if it doesn’t creating a named pipe with mkfifo and using that instead should work as well.
The last thing it does is rm the soft-link it creates before exiting.
This version is actually faster still on my system. I guess it is because that though it execs more applications, it starts handing them their arguments immediately – whereas before it stacked them all first.
time pairs2 /tmp/tmp | wc -l 1000000 pairs2 /tmp/tmp 0.30s user 0.09s system 178% cpu 0.218 total wc -l 0.03s user 0.02s system 26% cpu 0.218 total
Method 3
Well, you could always do it in your shell:
while read i; do
while read k; do echo "$i $k"; done < sample.txt
done < sample.txt
It is a good deal slower than your awk solution (on my machine, it took ~11 seconds for 1000 lines, versus ~0.3 seconds in awk) but at least it never holds more than a couple of lines in memory.
The loop above works for the very simple data you have in your example. It will choke on backslashes and it will eat trailing and leading spaces. A more robust version of the same thing is:
while IFS= read -r i; do
while IFS= read -r k; do printf "%s %sn" "$i" "$k"; done < sample.txt
done < sample.txt
Another choice is to use perl instead:
perl -lne '$line1=$_; open(A,"sample.txt");
while($line2=<A>){printf "$line1 $line2"} close(A)' sample.txt
The script above will read each line of the input file (-ln), save it as $l, open sample.txt again, and print each line along with $l. The result is all pairwise combinations while only 2 lines are ever stored in memory. On my system, that took only about 0.6 seconds on 1000 lines.
Method 4
With zsh:
a=( Row1,10 Row2,20 Row3,30 Row4,40 ) printf '%sn' $^a' '$^a
$^a on an array turns on brace-like expansion (like in {elt1,elt2}) for the array.
Method 5
You can compile this c++ code for quite quick results.
It completes in around 0.19 – 0.27 seconds on a 1000 line file.
It currently reads 10000 lines into memory(to speed up printing to screen) which if you had 1000 characters per line would use less than 10mb memory which i wouldn’t think would be a problem.
You can remove that section completely though and just print directly to screen if it does cause an issue though.
You can compile using g++ -o "NAME" "NAME.cpp"
Where NAME is the name of the File to save it to and NAME.cpp is the file this code is saved to
CTEST.cpp:
#include <iostream>
#include <string>
#include <fstream>
#include <iomanip>
#include <cstdlib>
#include <sstream>
int main(int argc,char *argv[])
{
if(argc != 2)
{
printf("You must provide at least one argumentn"); // Make sure only one arg
exit(0);
}
std::ifstream file(argv[1]),file2(argv[1]);
std::string line,line2;
std::stringstream ss;
int x=0;
while (file.good()){
file2.clear();
file2.seekg (0, file2.beg);
getline(file, line);
if(file.good()){
while ( file2.good() ){
getline(file2, line2);
if(file2.good())
ss << line <<" "<<line2 << "n";
x++;
if(x==10000){
std::cout << ss.rdbuf();
ss.clear();
ss.str(std::string());
}
}
}
}
std::cout << ss.rdbuf();
ss.clear();
ss.str(std::string());
}
Demonstration
$ g++ -o "Stream.exe" "CTEST.cpp" $ seq 10 10 10000 | nl -s, > testfile $ time ./Stream.exe testfile | wc -l 1000000 real 0m0.243s user 0m0.210s sys 0m0.033s
Method 6
join -j 2 file.txt file.txt | cut -c 2-
- join by a non existing field and remove the first space
Field 2 is empty and equal for all the element in file.txt so join will concatenate each element with all the others: it is in fact calculating the Cartesian product.
Method 7
One option with Python is to memory-map the file and take advantage of the fact that the Python regular expression library can work directly with memory-mapped files. Although this has the appearance of running nested loops over the file, memory mapping ensures that the OS brings available physical RAM optimally into play
import mmap
import re
with open('test.file', 'rt') as f1, open('test.file') as f2:
with mmap.mmap(f1.fileno(), 0, flags=mmap.MAP_SHARED, access=mmap.ACCESS_READ) as m1,
mmap.mmap(f2.fileno(), 0, flags=mmap.MAP_SHARED, access=mmap.ACCESS_READ) as m2:
for line1 in re.finditer(b'.*?n', m1):
for line2 in re.finditer(b'.*?n', m2):
print('{} {}'.format(line1.group().decode().rstrip(),
line2.group().decode().rstrip()))
m2.seek(0)
Alternately a quick solution in Python, although memory efficiency might still be a concern
from itertools import product
with open('test.file') as f:
for a, b in product(f, repeat=2):
print('{} {}'.format(a.rstrip(), b.rstrip()))
Row1,10 Row1,10
Row1,10 Row2,20
Row1,10 Row3,30
Row1,10 Row4,40
Row2,20 Row1,10
Row2,20 Row2,20
Row2,20 Row3,30
Row2,20 Row4,40
Row3,30 Row1,10
Row3,30 Row2,20
Row3,30 Row3,30
Row3,30 Row4,40
Row4,40 Row1,10
Row4,40 Row2,20
Row4,40 Row3,30
Row4,40 Row4,40
Method 8
In bash, ksh should work as well, using only shell built-ins:
#!/bin/bash
# we require array support
d=( $(< sample.txt) )
# quote arguments and
# build up brace expansion string
d=$(printf -- '%q,' "${d[@]}")
d=$(printf -- '%s' "{${d%,}}' '{${d%,}}")
eval printf -- '%s\n' "$d"
Note that while this holds the entire file in memory in a shell variable, it only needs a single read access to it.
Method 9
sed solution.
line_num=$(wc -l < input.txt)
sed 'r input.txt' input.txt | sed -re "1~$((line_num + 1)){h;d}" -e 'G;s/(.*)n(.*)/2 1/'
Explanation:
sed 'r file2' file1– read the all file contents of file2 for every line of file1.- Construction
1~imeans 1-th line, then 1+i line, 1+2*i, 1+3*i, etc. Therefore,1~$((line_num + 1)){h;d}meanshold pointed line to the buffer,delete pattern space and start new cycle. 'G;s/(.*)n(.*)/2 1/'– for all lines, except picked in the previous step, do next:Get line from hold buffer and append it to the current line. Then swap places of lines. Wascurrent_linenbuffer_linen, becamebuffer_linencurrent_linen
Output
Row1,10 Row1,10 Row1,10 Row2,20 Row1,10 Row3,30 Row1,10 Row4,40 Row2,20 Row1,10 Row2,20 Row2,20 Row2,20 Row3,30 Row2,20 Row4,40 Row3,30 Row1,10 Row3,30 Row2,20 Row3,30 Row3,30 Row3,30 Row4,40 Row4,40 Row1,10 Row4,40 Row2,20 Row4,40 Row3,30 Row4,40 Row4,40
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0