I am trying to find a way to determine whether one text file is a subset of another.
For example:
foo bar
is a subset of
foo bar pluto
While:
foo pluto
and
foo bar
are not subsets of each other.
Is there a way to do this with a command?
This check must be a cross check, and it has to return:
file1 subset of file2 : True
file2 subset of file1 : True
otherwise : False
Answers:
Method 1
If those file contents are called file1, file2 and file3 in order of appearance, then you can do it with the following one-liner:
# python -c "x=open('file1').read(); y=open('file2').read(); print(x in y or y in x)"
True
# python -c "x=open('file2').read(); y=open('file1').read(); print(x in y or y in x)"
True
# python -c "x=open('file1').read(); y=open('file3').read(); print(x in y or y in x)"
False
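For use outside a one-liner, the same substring test can be wrapped in a small function. This is only a minimal sketch, assuming Python 3 and files small enough to read whole; the name contains_either_way is hypothetical. Like the one-liner, it checks whether one file's contents occur as a contiguous run inside the other, which is not the same as a line-by-line set comparison.

```python
from pathlib import Path


def contains_either_way(path_a, path_b):
    """Return True if either file's bytes occur contiguously in the other."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    return a in b or b in a
```

Calling contains_either_way("file1", "file2") would then take the place of the first command above.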
Method 2
With perl:
if perl -0777 -e '$n = <>; $h = <>; exit(index($h,$n)<0)' needle.txt haystack.txt
then
echo needle.txt is found in haystack.txt
fi
-0octal defines the record delimiter. When that octal number is greater than 0377 (the maximum byte value), there is no delimiter, which is equivalent to setting $/ = undef. In that case, <> returns the full content of a single file; this is slurp mode.
Once the contents of the files are in the two variables $n and $h, we can use index() to determine whether one is found in the other.
However, this means both files are held entirely in memory, so the method won't work for very large files.
For mmappable files (usually includes regular files and most seekable files like block devices), that can be worked around by using mmap() on the files, like with the Sys::Mmap perl module:
if
perl -MSys::Mmap -le '
open N, "<", $ARGV[0] or die "$ARGV[0]: $!";
open H, "<", $ARGV[1] or die "$ARGV[1]: $!";
mmap($n, 0, PROT_READ, MAP_SHARED, N);
mmap($h, 0, PROT_READ, MAP_SHARED, H);
exit (index($h, $n) < 0)' needle.txt haystack.txt
then
echo needle.txt is found in haystack.txt
fi
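The same idea can be sketched with Python's standard mmap module. This is an illustration under stated assumptions, not a port of the perl code: the function name found_in is hypothetical, only the haystack is memory-mapped, and the needle is assumed to be small enough to read into memory.

```python
import mmap


def found_in(needle_path, haystack_path):
    """Return True if the needle file's bytes occur inside the haystack file."""
    with open(needle_path, "rb") as n:
        needle = n.read()  # assumption: the needle fits in memory
    if not needle:
        return True  # an empty file is trivially contained
    with open(haystack_path, "rb") as h:
        try:
            # length 0 maps the whole file, read-only
            mapped = mmap.mmap(h.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            return False  # an empty haystack cannot be mapped
        with mapped:
            return mapped.find(needle) != -1
```

Because the haystack is mapped rather than read, the operating system pages it in on demand instead of the script holding a full copy.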
Method 3
I found a solution thanks to this question
Basically I am testing two files a.txt and b.txt with this script:
#!/bin/bash
first_cmp=$(diff --unchanged-line-format= --old-line-format= --new-line-format='%L' "$1" "$2" | wc -l)
second_cmp=$(diff --unchanged-line-format= --old-line-format= --new-line-format='%L' "$2" "$1" | wc -l)
if [ "$first_cmp" -eq "0" -o "$second_cmp" -eq "0" ]
then
echo "Subset"
exit 0
else
echo "Not subset"
exit 1
fi
If one file is a subset of the other, the script exits with status 0 (true); otherwise it exits with status 1.
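A zero count from diff here roughly means that the second file's lines all appear in the first, in the same order. A rough Python sketch of that ordered check follows. It assumes diff finds a minimal edit script, which may not hold for every input, and the function names are hypothetical.

```python
def is_subsequence(small, big):
    """Return True if every item of small appears in big, in the same order."""
    remaining = iter(big)
    # `line in remaining` consumes the iterator up to the first match,
    # so later lines can only match further along in `big`
    return all(line in remaining for line in small)


def lines_of(path):
    """Read a file and return its lines without trailing newlines."""
    with open(path) as f:
        return f.read().splitlines()


def either_is_subsequence(path_a, path_b):
    """Cross check: True if either file's lines are an ordered subset of the other's."""
    a, b = lines_of(path_a), lines_of(path_b)
    return is_subsequence(a, b) or is_subsequence(b, a)
```

Unlike a pure set comparison, this treats the same lines in a different order as not being a subset.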
Method 4
If f1 is a subset of f2, then f1 – f2 is the empty set. Building on that, we can write a set_diff function and an is_subset function derived from it, as per Set difference between 2 text files:
# Write sorted, de-duplicated copies of both files next to the originals.
sort_files () {
    f1_sorted="$1.sorted"
    f2_sorted="$2.sorted"
    if [ ! -f "$f1_sorted" ]; then
        sort "$1" | uniq > "$f1_sorted"
    fi
    if [ ! -f "$f2_sorted" ]; then
        sort "$2" | uniq > "$f2_sorted"
    fi
}
remove_sorted_files () {
    f1_sorted="$1.sorted"
    f2_sorted="$2.sorted"
    rm -f "$f1_sorted"
    rm -f "$f2_sorted"
}
set_union () {
    sort_files "$1" "$2"
    cat "$1.sorted" "$2.sorted" | sort | uniq
    remove_sorted_files "$1" "$2"
}
# Lines in $1 but not in $2: listing $2 twice makes uniq -u drop all of its lines.
set_diff () {
    sort_files "$1" "$2"
    cat "$1.sorted" "$2.sorted" "$2.sorted" | sort | uniq -u
    remove_sorted_files "$1" "$2"
}
# Lines in $2 but not in $1.
rset_diff () {
    sort_files "$1" "$2"
    cat "$1.sorted" "$2.sorted" "$1.sorted" | sort | uniq -u
    remove_sorted_files "$1" "$2"
}
# Return 0 if $1 is a subset of $2, i.e. the set difference is empty.
is_subset () {
    sort_files "$1" "$2"
    output=$(set_diff "$1" "$2")
    remove_sorted_files "$1" "$2"
    if [ -z "$output" ]; then
        return 0
    else
        return 1
    fi
}
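For comparison, the same set reasoning can be sketched in Python, where the built-in set type already provides difference and subset tests. This assumes each line is one set element and that both files fit in memory; the function names are hypothetical.

```python
def line_set(path):
    """Return the set of distinct lines in a file, without trailing newlines."""
    with open(path) as f:
        return set(f.read().splitlines())


def is_subset(path_a, path_b):
    """True if every line of path_a also occurs in path_b (a - b is empty)."""
    return not (line_set(path_a) - line_set(path_b))


def cross_check(path_a, path_b):
    """True if either file is a subset of the other, False otherwise."""
    a, b = line_set(path_a), line_set(path_b)
    return a <= b or b <= a
```

cross_check matches the result the question asks for: True when either file is a subset of the other, False otherwise.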
Method 5
From http://www.catonmat.net/blog/set-operations-in-unix-shell/:
Comm compares two sorted files line by line. It may be run in such a way that it outputs only the lines that appear in the first specified file and not in the second. If the first file is a subset of the second, then all the lines in the first file also appear in the second, so no output is produced:
$ comm -23 <(sort subset | uniq) <(sort set | uniq) | head -1
# comm returns no output if subset ⊆ set
# comm outputs something if subset ⊈ set
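To illustrate what comm -23 does with its two sorted inputs, here is a small Python sketch of the merge. It is an approximation under stated assumptions: both inputs are already sorted in Python's own string order (comm uses the locale's collation) and de-duplicated, and the function names are hypothetical.

```python
def only_in_first(sorted_a, sorted_b):
    """Yield items of sorted_a that are absent from sorted_b, like `comm -23`."""
    b_iter = iter(sorted_b)
    b = next(b_iter, None)
    for a in sorted_a:
        # advance the second input until it catches up with `a`
        while b is not None and b < a:
            b = next(b_iter, None)
        if b is None or a != b:
            yield a


def is_subset_sorted(sorted_a, sorted_b):
    """True if no item is unique to sorted_a."""
    # mirrors `| head -1`: stop at the first line found only in the first input
    return next(only_in_first(sorted_a, sorted_b), None) is None
```

Both inputs are walked once in step, so no line has to be stored beyond the current pair being compared.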
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.