Average rows with same first column

Given a file with two columns:

I need a way to coalesce all rows with the same ID into one that has an average height. In this case, (69 + 67 + 65 + 62 + 59) / 5 = 64.4 (about 64) and (29 + 26 + 21 + 20) / 4 = 24, so the output should be:

Id  Avg.ht
 510 64
 601 24

How can I do that using sed/awk/perl?

Answers:

Method 1

Using awk :

The input file

Awk in a shell (NR>1 skips the header line):

$ awk '
    NR>1{
        arr[$1]   += $2
        count[$1] += 1
    }
    END{
        for (a in arr) {
            print "id avg " a " = " arr[a] / count[a]
        }
    }
' FILE

Or with Perl in a shell:

$ perl -lane '
    END {
        foreach my $key (keys(%hash)) {
            print "id avg $key = " . $hash{$key} / $count{$key};
        }
    }
    if ($. > 1) {
        $hash{$F[0]}  += $F[1];
        $count{$F[0]} += 1;
    }
' FILE

Output:

id avg 601 = 24
id avg 510 = 64.4

And finally, just for fun, an obfuscated Perl one-liner =)

perl -lane'END{for(keys(%h)){print"$_:".$h{$_}/$c{$_}}}($.>1)&&do{$h{$F[0]}+=$F[1];$c{$F[0]}++}' FILE

Method 2

#!/usr/bin/perl
use strict;
use warnings;

my %sum_so_far;
my %count_so_far;
while ( <> ) {
    # Skip lines that don't start with a digit
    next if m/^[^\d]/;

    # Accumulate the sum and the count
    my @line = split();
    $sum_so_far{$line[0]}   += $line[1];
    $count_so_far{$line[0]} += 1;
}

# Dump the output
print "Id Avg.ht\n";
foreach my $id ( keys %count_so_far ) {
    my $avg = $sum_so_far{$id}/$count_so_far{$id};
    print " $id $avg\n";
}

Output:

$ perl make_average.pl input.txt
Id Avg.ht
 510 64.4
 601 24

Note that your sample output is wrong. There’s no way you can get an average of 52 when every value for that id is 59 or larger.

Also, you have a letter l in one of your columns, masquerading as the number 1…
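
To track down that kind of stray character, a quick check along these lines might help. It is only a sketch: find_non_numeric is a made-up helper name, and it assumes a one-line header followed by whitespace-separated rows whose second column should be a plain non-negative number.

```shell
# Hypothetical helper: list rows whose second field is not a plain number.
# Assumes line 1 is a header and fields are separated by whitespace.
find_non_numeric() {
    awk '
        NR > 1 && NF >= 2 && $2 !~ /^[0-9]+(\.[0-9]+)?$/ {
            print "line " NR ": " $0    # report the line number and the offending row
        }
    ' "$1"
}
```

Run it as find_non_numeric FILE; no output means every value in the second column looks numeric. awk on its own would quietly read a value such as 6l as 6, which is why a typo like this can go unnoticed in the averages.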

Method 3

With GNU datamash:

datamash -H -s -g 1 mean 2 <file

GroupBy(Id) mean()
510 64.4
601 24

This sorts and groups by the 1st field and calculates the mean of the 2nd field, preserving the headers. It assumes the fields are separated by a single tab. Use -W, --whitespace if they’re separated by multiple blanks, or -t, --field-separator= to define another field separator (space, comma etc.). Since datamash requires sorted input, the output will be sorted by the grouped column.
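
For instance, if the columns are separated by runs of spaces rather than a single tab, the call could look like the sketch below. The wrapper name mean_by_first_field is only for illustration, and it assumes GNU datamash is installed and that the file starts with a header line.

```shell
# Hypothetical wrapper around the command above, for whitespace-separated input.
# Assumes GNU datamash is installed and the first line of the file is a header.
mean_by_first_field() {
    # -H: the input has a header line; print one on output too
    # -W: treat runs of spaces and/or tabs as the field separator
    # -s: sort before grouping; -g 1: group by the first field
    datamash -H -W -s -g 1 mean 2 < "$1"
}
```

Call it as mean_by_first_field FILE. For comma-separated data, swapping -W for -t, should work the same way.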

Method 4

Take a look at what is done here: http://www.sugihartono.com/programming/group-by-count-and-sorting-using-perl-script/

The essential difficulty is doing a ‘group by’ operation.
The linked script does that using a hash.

The linked script calculates the sum, but getting the average is not much different.
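
As a rough sketch of that idea (not the linked script itself), the same hash-based grouping can keep a count next to each sum and divide at the end. The helper name avg_by_id and the one-line header are assumptions made for illustration.

```shell
# Hypothetical helper: group rows by the first column and print the average
# of the second column. Assumes line 1 is a header and fields are
# separated by whitespace.
avg_by_id() {
    awk '
        NR > 1 && NF >= 2 {          # skip the header and blank or short lines
            sum[$1]   += $2          # running total per ID (the "group by" hash)
            count[$1] += 1           # number of rows seen per ID
        }
        END {
            for (id in sum)
                printf "%s %g\n", id, sum[id] / count[id]
        }
    ' "$1" | sort -n                 # for-in order is unspecified in awk, so sort by ID
}
```

Called as avg_by_id FILE, this prints one "ID average" line per group. The only change from a plain group-by sum is the second array holding the row count.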

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
