I have a large file in the following format:
2 1019 0 12 2 1019 3 0 2 1021 0 2 2 1021 2 0 2 1022 4 5 2 1030 0 1 2 1030 5 0 2 1031 4 4
If the values in column 2 match, I want to sum the values in column 3 and 4 of both lines, else just the sum of the values in the unique line.
So the output I am hoping for would look like this:
2 1019 15 2 1021 4 2 1022 9 2 1030 6 2 1031 8
I am able to sort files according to column 2 with awk or sort and sum the last columns with awk, but only for individual lines not for two lines where column 2 matches.
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
I would do this in Perl:
$ perl -lane '$k{"$F[0] $F[1]"}+=$F[2]+$F[3];
END{print "$_ $k{$_}" for keys(%k) }' file
2 1019 15
2 1021 4
2 1030 6
2 1031 8
2 1022 9
Or awk:
awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file
If you want the output sorted according to the second column you could just pipe to sort:
awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file | sort -k2
Note that both solutions include the 1st column as well. The idea is to use the first and second columns as keys to a hash (in perl) or an associative array (in awk). The key in each solution is column1 column2 so if two lines have the same column two but a different column one, they will be grouped separately:
$ cat file
2 1019 2 3
2 1019 4 1
3 1019 2 2
$ awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file
3 1019 4
2 1019 10
Method 2
Maybe this could help, but is column 1 always 2 and does results depend on it?
awk '{ map[$2] += $3 + $4; } END { for (i in map) { print "2", i, map[i] | "sort -t't'" } }' file
or as mentioned by glenn jackman in comments about sorting:
gawk '{ map[$2] += $3 + $4; } END { PROCINFO["sorted_in"] = "@ind_str_asc"; for (i in map) { print 2, i, map[i] } }' file
Method 3
You could pre-sort the data and let awk handle the details:
sort -n infile | awk 'NR>1 && p!=$2 {print p,s} {s+=$3+$4} {p=$2}'
You may want to reset the accumulator:
sort -n infile | awk 'NR>1 && p!=$2 {print p,s;s=0} {s+=$3+$4} {p=$2}'
Output:
1019 15 1021 19 1022 28 1030 34
If you really want the keep the first column, do something like this:
sort -n infile | awk 'NR>1 && p!=$1FS$2 {print p,s} {s+=$3+$4} {p=$1FS$2}'
Output:
2 1019 15 2 1021 19 2 1022 28 2 1030 34
Explanation
The p variable holds the $2 value of the previous line, or $1FS$2 in the second case above. This means that the {print p,s} is triggered when $2 of the previous line is not the same as the one on the current line (p!=$2).
Method 4
Using swiss army knife util mlr:
mlr --nidx put '$5=$3+$4' then stats1 -g 1,2 -f 5 -a sum infile
Output:
2 1019 15 2 1021 4 2 1022 9 2 1030 6 2 1031 8
Notes:
-
--nidxtellsmlrto use numeric field names. -
put '$5=$3+$4'makes a new 5th field, the sum of fields 3 and 4. -
The
stats1function (or “verb“) is a smaller swiss army knife
within the greater swiss army knife ofmlr, with several
accumulator based functions such assum,count,mean, etc.stats1 -g 1,2groups the data by columns 1 and 2, and-f 5then adds up those groups’ field 5.
-a sumstats1prints named fields only.
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0