sum pair of columns based on matching fields

I have a large file in the following format:

2 1019 0 12 
2 1019 3 0 
2 1021 0 2 
2 1021 2 0 
2 1022 4 5
2 1030 0 1 
2 1030 5 0 
2 1031 4 4

If two lines have the same value in column 2, I want to sum the values in columns 3 and 4 across both lines; otherwise I just want the sum of columns 3 and 4 for the unique line.

So the output I am hoping for would look like this:

2 1019 15 
2 1021 4 
2 1022 9 
2 1030 6 
2 1031 8

I am able to sort files according to column 2 with awk or sort and sum the last columns with awk, but only for individual lines, not for two lines where column 2 matches.

Answers:

Method 1

I would do this in Perl:

$ perl -lane '$k{"$F[0] $F[1]"}+=$F[2]+$F[3]; 
              END{print "$_ $k{$_}" for keys(%k) }' file 
2 1019 15
2 1021 4
2 1030 6
2 1031 8
2 1022 9

Or awk:

awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file

If you want the output sorted numerically by the second column, you can pipe it to sort:

awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file | sort -k2,2n

Note that both solutions include the 1st column as well. The idea is to use the first and second columns as keys to a hash (in perl) or an associative array (in awk). The key in each solution is column1 column2 so if two lines have the same column two but a different column one, they will be grouped separately:

$ cat file
2 1019 2 3
2 1019 4 1
3 1019 2 2

$ awk '{a[$1" "$2]+=$3+$4}END{for (i in a){print i,a[i]}}' file
3 1019 4
2 1019 10
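
The same associative-array idea can be sketched in a longer, commented form. This is only an illustration: the function name sum_pairs and the variable names are hypothetical, the sample data is copied from the question, and the final sort is added because awk does not guarantee any order for "for (key in array)".

```shell
# Sample data from the question, held in a variable so the sketch is self-contained.
data='2 1019 0 12
2 1019 3 0
2 1021 0 2
2 1021 2 0
2 1022 4 5
2 1030 0 1
2 1030 5 0
2 1031 4 4'

# sum_pairs is a hypothetical helper name, not part of any tool.
sum_pairs() {
    awk '
        {
            key = $1 " " $2          # group on columns 1 and 2 together
            total[key] += $3 + $4    # add both value columns to that group
        }
        END {
            for (key in total)       # iteration order is unspecified in awk
                print key, total[key]
        }
    ' | sort -k1,1n -k2,2n           # sort numerically for a stable order
}

result=$(printf '%s\n' "$data" | sum_pairs)
printf '%s\n' "$result"
```

Because the whole table is held in memory until END, this approach does not need sorted input, at the cost of one array entry per distinct key.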

Method 2

This might help, but is column 1 always 2, and does the result depend on it? If it is always 2, you can group on column 2 alone and print the 2 as a constant:

awk '{ map[$2] += $3 + $4; } END { for (i in map) { print "2", i, map[i] | "sort -k2,2n" } }' file

Or, as glenn jackman mentioned in the comments, gawk can do the sorting itself:

gawk '{ map[$2] += $3 + $4; } END { PROCINFO["sorted_in"] = "@ind_str_asc"; for (i in map) { print 2, i, map[i] } }' file

Method 3

You could pre-sort the data and let awk handle the details:

sort -n infile | awk 'NR>1 && p!=$2 {print p,s} {s+=$3+$4} {p=$2}'

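Without a reset, s keeps a running total across groups, so reset the accumulator whenever the key changes. The last group also needs an END block, because no later line triggers its print:

sort -n infile | awk 'NR>1 && p!=$2 {print p,s;s=0} {s+=$3+$4} {p=$2} END {print p,s}'

Output:

1019 15
1021 4
1022 9
1030 6
1031 8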

If you really want to keep the first column, do something like this:

sort -n infile | awk 'NR>1 && p!=$1FS$2 {print p,s;s=0} {s+=$3+$4} {p=$1FS$2} END {print p,s}'

Output:

2 1019 15
2 1021 4
2 1022 9
2 1030 6
2 1031 8

Explanation

The p variable holds the $2 value of the previous line, or $1FS$2 in the second case above. This means that the {print p,s} is triggered when $2 of the previous line is not the same as the one on the current line (p!=$2).
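
A minimal sketch of that previous-key comparison, spelled out with comments. The names sum_sorted, key, prev and sum are hypothetical, and the sample data is the question's input in shuffled order to show that the sort step is what brings matching lines together. The sketch resets the sum at every key change and prints the final group in an END block, since no later line exists to trigger that last print.

```shell
# The question's sample data, deliberately out of order.
data='2 1030 5 0
2 1019 0 12
2 1031 4 4
2 1021 2 0
2 1019 3 0
2 1022 4 5
2 1021 0 2
2 1030 0 1'

# sum_sorted is a hypothetical helper name.
sum_sorted() {
    sort -k1,1n -k2,2n | awk '
        { key = $1 FS $2 }                 # key of the current line
        NR > 1 && key != prev {            # key changed: previous group is done
            print prev, sum
            sum = 0                        # start the next group from zero
        }
        { sum += $3 + $4; prev = key }     # accumulate and remember the key
        END { if (NR > 0) print prev, sum } # flush the last group
    '
}

result=$(printf '%s\n' "$data" | sum_sorted)
printf '%s\n' "$result"
```

Unlike the associative-array approach, this keeps only one running sum in memory, which can matter for very large files, but it relies on the input being sorted so that equal keys are adjacent.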

Method 4

Using the Swiss Army knife utility mlr (Miller):

mlr --nidx   put '$5=$3+$4'   then   stats1 -g 1,2 -f 5 -a sum   infile

Output:

2   1019    15
2   1021    4
2   1022    9
2   1030    6
2   1031    8

Notes:

  • --nidx tells mlr to use numeric field names.
  • put '$5=$3+$4' makes a new 5th field, the sum of fields 3 and 4.
  • The stats1 function (or “verb”) is a smaller Swiss Army knife
    within the greater Swiss Army knife of mlr, with several
    accumulator-based functions such as sum, count, mean, etc.

    stats1 -g 1,2 groups the data by columns 1 and 2, and -f 5 -a sum
    then adds up field 5 within each group. stats1 prints named fields only.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
