Given a file with two columns:
Id ht
510 69
510 67
510 65
510 62
510 59
601 29
601 26
601 21
601 20
I need a way to coalesce all rows with the same Id into a single row that holds the average height. In this case, (69 + 67 + 65 + 62 + 59) / 5 = 64.4 and (29 + 26 + 21 + 20) / 4 = 24, so the output should be:
Id Avg.ht
510 64.4
601 24
How can I do that using sed/awk/perl?
Answers:
Method 1
Using awk:
The input file:
$ cat FILE
Id ht
510 69
510 67
510 65
510 62
510 59
601 29
601 26
601 21
601 20
awk in a shell:
$ awk '
    NR>1{
        arr[$1]   += $2
        count[$1] += 1
    }
    END{
        for (a in arr) {
            print "id avg " a " = " arr[a] / count[a]
        }
    }
' FILE
Or with Perl in a shell:
$ perl -lane '
    END {
        foreach my $key (keys(%hash)) {
            print "id avg $key = " . $hash{$key} / $count{$key};
        }
    }
    if ($. > 1) {
        $hash{$F[0]} += $F[1];
        $count{$F[0]} += 1;
    }
' FILE
Output is:
id avg 601 = 24
id avg 510 = 64.4
And finally, just for fun, a deliberately obfuscated Perl one-liner =)
perl -lane'END{for(keys(%h)){print"$_:".$h{$_}/$c{$_}}}($.>1)&&do{$h{$F[0]}+=$F[1];$c{$F[0]}++}' FILE
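The commands above print the averages in whatever order the hash yields them, and without the header line the question asks for. As a minimal sketch (the function name avg_by_id is made up for illustration, and the input is assumed to be whitespace-separated with a single header line), the same awk accumulation can be piped through sort to get the requested layout:

```shell
# avg_by_id is a hypothetical helper name, not part of the answer above.
# Assumes whitespace-separated "Id ht" rows on stdin with one header line.
avg_by_id() {
    awk '
        NR == 1 { next }                      # skip the header line
        NF >= 2 { sum[$1] += $2; n[$1]++ }    # per-Id running sum and row count
        END { for (id in sum) print id, sum[id] / n[id] }
    ' | sort -k1,1n | {
        echo "Id Avg.ht"    # emit the header first
        cat                 # then pass the sorted rows through unchanged
    }
}

# Demo on a small inline sample
avg_by_id <<'EOF'
Id ht
601 29
510 69
601 21
510 67
EOF
```

With the question's data saved as FILE, `avg_by_id < FILE` should print the header followed by `510 64.4` and `601 24`.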
Method 2
#!/usr/bin/perl
use strict;
use warnings;
my %sum_so_far;
my %count_so_far;
while ( <> ) {
    # Skip lines that don't start with a digit
    next if m/^[^\d]/;
    # Accumulate the sum and the count
    my @line = split();
    $sum_so_far{$line[0]} += $line[1];
    $count_so_far{$line[0]} += 1;
}
# Dump the output
print "Id Avg.ht\n";
foreach my $id ( keys %count_so_far ) {
    my $avg = $sum_so_far{$id}/$count_so_far{$id};
    print " $id $avg\n";
}
Output:
$ perl make_average.pl input.txt
Id Avg.ht
 510 64.4
 601 24
Note that your sample output is wrong. There’s no way you can get an average of 52 when every value for that id is 59 or larger.
Also, you have a letter l in one of your columns, masquerading as the number 1…
Method 3
With gnu datamash:
datamash -H -s -g 1 mean 2 <file
GroupBy(Id) mean(ht)
510 64.4
601 24
This sorts the input (-s), groups it by the 1st field (-g 1), and calculates the mean of the 2nd field, preserving the header line (-H). It assumes the fields are separated by a single tab. Use -W, --whitespace if they're separated by one or more blanks, or -t, --field-separator= to define another field separator (space, comma etc). Since datamash requires sorted input, the output will be sorted by the grouped column.
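To see why grouping tools want sorted input: once rows with the same key are adjacent, a group is finished as soon as the key changes, so only one running sum and one count have to be kept. The sketch below illustrates that idea with sort and awk. It is not how datamash is implemented, avg_sorted_stream is a hypothetical helper name, and the input is assumed to be whitespace-separated with one header line.

```shell
# avg_sorted_stream is a hypothetical helper name used only for illustration.
# Assumes whitespace-separated "Id ht" rows on stdin with one header line.
avg_sorted_stream() {
    tail -n +2 | sort -k1,1n | awk '
        # The key changed, so the previous group is complete: emit and reset.
        NR > 1 && $1 != prev { print prev, sum / n; sum = 0; n = 0 }
        # Every row adds to the running sum and count of the current group.
        { prev = $1; sum += $2; n++ }
        # Flush the last group, if any rows were read at all.
        END { if (n > 0) print prev, sum / n }
    '
}

# Demo on a small unsorted inline sample
avg_sorted_stream <<'EOF'
Id ht
601 29
510 69
601 21
510 67
EOF
```

Unlike the hash-based answers, this keeps no per-Id table in memory, at the cost of sorting the input first.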
Method 4
Take a look at what is done here: http://www.sugihartono.com/programming/group-by-count-and-sorting-using-perl-script/
The essential difficulty is the 'group by' operation.
The linked script does that using a hash.
That script calculates the sum, but getting the average is not much different: keep a count per key as well and divide the sum by it.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.