Using Diff on a specific column in a file

Is it possible to use diff on a specific column of a file?

file1

Something  123 item1
Something  456 item2
Something  768 item3
Something  353 item4

file2

Another   123 stuff1
Another   193 stuff2
Another   783 stuff3
Another   353 stuff4

output(Expected)

Something  456 item2
Something  768 item3
Another   193 stuff2
Another   783 stuff3

I want to diff the 2nd column of the two files. The result should contain the lines whose 2nd column differs, printed as whole lines.

Answers:

Method 1

awk is a better tool for comparing columns of files. See, for example, the answer to "compare two columns of different files and print if it matches"; there are similar answers out there for printing lines with matching columns.

Since you want to print lines that don’t match, we can create an awk command that prints the lines in file2 for which column 2 has not been seen in file1:

$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' file1 file2
Another   193 stuff2
Another   783 stuff3

As explained similarly by terdon in the above-mentioned question,

NR==FNR : NR is the current input line number and FNR the current file’s line number. The two will be equal only while the 1st file is being read.
c[$2]++; next : if this is the 1st file, save the 2nd field in the c array. Then, skip to the next line so that this is only applied on the 1st file.
c[$2] == 0 : this pattern is only reached for the second file, so we check whether field 2 of this line was never seen in the first file (c[$2]==0) and, if it was not, we print the line. In awk, the default action is to print the line, so if c[$2]==0 is true, the line will be printed.
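To see how NR and FNR behave across two files, here is a small trace that recreates the sample files from the question in a scratch directory (the `workdir` and `trace` names are only for this illustration) and prints both counters next to the key column:

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Something  123 item1' 'Something  456 item2' \
              'Something  768 item3' 'Something  353 item4' > "$workdir/file1"
printf '%s\n' 'Another   123 stuff1' 'Another   193 stuff2' \
              'Another   783 stuff3' 'Another   353 stuff4' > "$workdir/file2"

# Print the file name, NR, FNR and column 2 for every input line.
# NR keeps counting across files; FNR restarts at 1 for each file.
trace=$(cd "$workdir" && awk '{ print FILENAME, NR, FNR, $2 }' file1 file2)
printf '%s\n' "$trace"
```

For the first four lines the two counters agree. On the first line of file2 the trace shows `file2 5 1 123`, so NR==FNR is false from there on.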

But you also want the lines from file1 for which column 2 doesn’t match in file2. This you can get by simply exchanging their position in the same command:

$ awk 'NR==FNR{c[$2]++;next};c[$2] == 0' file2 file1
Something  456 item2
Something  768 item3

So now you can generate the output you want, by using awk twice. Perhaps someone with more awk expertise can get it done in one pass.
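For what it is worth, here is one possible single-invocation sketch. It is not from the original answer: the function name `diffcol2_onepass` and the array names are made up for illustration, and it assumes both files fit in memory and that the first file is not empty (otherwise NR==FNR would also be true for the second file).

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Something  123 item1' 'Something  456 item2' \
              'Something  768 item3' 'Something  353 item4' > "$workdir/file1"
printf '%s\n' 'Another   123 stuff1' 'Another   193 stuff2' \
              'Another   783 stuff3' 'Another   353 stuff4' > "$workdir/file2"

# Hypothetical helper: read each file once, in a single awk process.
diffcol2_onepass()
{
   awk '
      # First file: remember every line and its key (column 2), in order.
      NR == FNR { line1[++n] = $0; key1[n] = $2; in1[$2] = 1; next }
      # Second file: note each key; hold back lines whose key file1 lacks.
      { in2[$2] = 1 }
      !($2 in in1) { only2[++m] = $0 }
      END {
         # Lines of file1 with an unmatched key, then those of file2.
         for (i = 1; i <= n; i++) if (!(key1[i] in in2)) print line1[i]
         for (j = 1; j <= m; j++) print only2[j]
      }
   ' "$1" "$2"
}

onepass=$(diffcol2_onepass "$workdir/file1" "$workdir/file2")
printf '%s\n' "$onepass"
```

The lines of the first file have to be buffered because it is not known whether a key is missing from the second file until that file has been read completely.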

You tagged your question with ksh, so I’ll assume you are using the Korn shell. In ksh you can define a function for your diff, say diffcol2, to make your job easier:

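diffcol2()
{
   # Lines of the first file whose column 2 does not appear in the second file
   awk 'NR==FNR{c[$2]++;next};c[$2] == 0' "$2" "$1"
   # Lines of the second file whose column 2 does not appear in the first file
   awk 'NR==FNR{c[$2]++;next};c[$2] == 0' "$1" "$2"
}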

This has the behavior you desire:

$ diffcol2 file1 file2
Something  456 item2
Something  768 item3
Another   193 stuff2
Another   783 stuff3
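If you need the same comparison on a different column, the function could be generalised along these lines. This is only a sketch: `diffcol` and its optional third argument are hypothetical names, not part of the original answer, and the column number is passed to awk with the standard -v option.

```shell
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Something  123 item1' 'Something  456 item2' \
              'Something  768 item3' 'Something  353 item4' > "$workdir/file1"
printf '%s\n' 'Another   123 stuff1' 'Another   193 stuff2' \
              'Another   783 stuff3' 'Another   353 stuff4' > "$workdir/file2"

# Hypothetical variant of diffcol2: the optional third argument is the
# column number to compare (defaults to 2).
diffcol()
{
   # Lines of the first file whose key column is absent from the second file
   awk -v col="${3:-2}" 'NR==FNR{c[$col]++;next}; !($col in c)' "$2" "$1"
   # Lines of the second file whose key column is absent from the first file
   awk -v col="${3:-2}" 'NR==FNR{c[$col]++;next}; !($col in c)' "$1" "$2"
}

bycol2=$(diffcol "$workdir/file1" "$workdir/file2")
printf '%s\n' "$bycol2"
```

Called without a third argument it behaves like diffcol2 on the sample files. Called with 1 it compares the first column instead, which for these files reports every line, since "Something" and "Another" never match.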

Method 2

I don’t think diff (even in combination with cut) will be flexible enough to handle this. And it seems as though what you really want is keys in file1 that are not in file2 and vice versa – not strictly a line-by-line diff. If the input files are big, I would go with perl, but for small files this awk script works for the input provided:

%cat a.awk

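BEGIN {
  # Index every line of file1 by its 2nd field.
  # Testing "> 0" ends the loop at end of file and also on a read error.
  while ((getline line < "file1") > 0) {
    split(line, f, " ")
    f1[f[2]] = line
  }
  # Same for file2.
  while ((getline line < "file2") > 0) {
    split(line, f, " ")
    f2[f[2]] = line
  }
}
END {
  # Print lines whose key appears in only one of the files.
  # The iteration order of "for (c in ...)" is unspecified in awk.
  for (c in f1) {
    if (!(c in f2)) print f1[c]
  }
  for (c in f2) {
    if (!(c in f1)) print f2[c]
  }
}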

And this is how you run it (note the trick with /dev/null: because the script has an END block, awk would otherwise wait for input on standard input):

%awk -f a.awk /dev/null
Something  456 item2
Something  768 item3
Another   193 stuff2
Another   783 stuff3

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
