Is there a way to get the min, max, median, and average of a list of numbers in a single command?

I have a list of numbers in a file, one per line. How can I get the minimum, maximum, median and average values? I want to use the results in a bash script.

Although my immediate situation is for integers, a solution for floating-point numbers would be useful down the line, but a simple integer method is fine.

Answers:

Method 1

With GNU datamash:

$ printf '%s\n' 1 2 4 | datamash max 1 min 1 mean 1 median 1
4   1   2.3333333333333 2

Method 2

You can use the R programming language.

Here is a quick and dirty R script:

#! /usr/bin/env Rscript
d<-scan("stdin", quiet=TRUE)
cat(min(d), max(d), median(d), mean(d), sep="\n")

Note the "stdin" in scan which is a special filename to read from standard input (that means from pipes or redirections).

Now you can redirect your data over stdin to the R script:

$ cat datafile
1
2
4
$ ./mmmm.r < datafile
1
4
2
2.333333

Also works for floating points:

$ cat datafile2
1.1
2.2
4.4
$ ./mmmm.r < datafile2
1.1
4.4
2.2
2.566667

If you don’t want to write an R script file you can invoke a true one-liner (with linebreak only for readability) in the command line using Rscript:

$ Rscript -e 'd<-scan("stdin", quiet=TRUE)' \
          -e 'cat(min(d), max(d), median(d), mean(d), sep="\n")' < datafile
1
4
2
2.333333

Read the fine R manuals at http://cran.r-project.org/manuals.html.

Unfortunately the full reference is only available in PDF. Another way to read the reference is by typing ?topicname in the prompt of an interactive R session.


For completeness: there is an R command which outputs all the values you want and more. Unfortunately, it prints them in a human-friendly format that is hard to parse programmatically.

> summary(c(1,2,4))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.500   2.000   2.333   3.000   4.000

Method 3

I actually keep a little awk program around to give the sum, data count, minimum datum, maximum datum, mean and median of a single column of numeric data (including negative numbers):

#!/bin/sh
sort -n | awk '
  BEGIN {
    c = 0;
    sum = 0;
  }
  $1 ~ /^-?[0-9]*(\.[0-9]*)?$/ {
    a[c++] = $1;
    sum += $1;
  }
  END {
    ave = sum / c;
    if( (c % 2) == 1 ) {
      median = a[ int(c/2) ];
    } else {
      median = ( a[c/2] + a[c/2-1] ) / 2;
    }
    OFS="\t";
    print sum, c, ave, median, a[0], a[c-1];
  }
'

The above script reads from stdin and prints the sum, count, mean, median, minimum and maximum as tab-separated columns on a single line.

Method 4

Minimum:

jq -s min
awk 'NR==1||$0<x{x=$0}END{print x}'

Maximum:

jq -s max
awk 'NR==1||$0>x{x=$0}END{print x}'

Median:

jq -s 'sort|if length%2==1 then.[length/2|floor]else[.[length/2-1,length/2]]|add/2 end'
sort -n|awk '{a[NR]=$0}END{print(NR%2==1)?a[int(NR/2)+1]:(a[NR/2]+a[NR/2+1])/2}'

Average:

jq -s add/length
awk '{x+=$0}END{print x/NR}'

Combined to one command (modified from a comment):

$ seq 100|jq -s '{minimum:min,maximum:max,average:(add/length),median:(sort|if length%2==1 then.[length/2|floor]else[.[length/2-1,length/2]]|add/2 end)}'
{
  "minimum": 1,
  "maximum": 100,
  "average": 50.5,
  "median": 50.5
}

In jq, the -s (--slurp) option creates an array for the input lines after parsing each line as JSON, or as a number in this case.

Method 5

Min, max and average are pretty easy to get with awk:

% echo -e '6\n2\n4\n3\n1' | awk 'NR == 1 { max=$1; min=$1; sum=0 }
   { if ($1>max) max=$1; if ($1<min) min=$1; sum+=$1;}
   END {printf "Min: %d\tMax: %d\tAverage: %f\n", min, max, sum/NR}'
Min: 1  Max: 6  Average: 3,200000

Calculating the median is a bit trickier, since you need to sort the numbers and either store them all in memory for a while or read them twice (the first time to count them, the second to get the median value). Here is an example that stores all the numbers in memory:

% echo -e '6\n2\n4\n3\n1' | sort -n | awk '{arr[NR]=$1}
   END { if (NR%2==1) print arr[(NR+1)/2]; else print (arr[NR/2]+arr[NR/2+1])/2}' 
3
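
For comparison, the same store-everything approach can be sketched in Python. This is only an illustration, not part of the original answer: the function name median is my own, and it assumes the numbers are already parsed into a list.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)      # plays the role of `sort -n` in the pipeline
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty input is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # odd count: the single middle element
    # even count: mean of the two middle elements
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([6, 2, 4, 3, 1]))    # 3
print(median([6, 2, 4, 3]))       # 3.5
```

Like the awk version, this keeps every value in memory, which is fine for short lists.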

Method 6

pythonpy works well for this sort of thing:

cat file.txt | py --ji -l 'min(l), max(l), numpy.median(l), numpy.mean(l)'

Method 7

And a Perl one-(long)liner, including median:

cat numbers.txt \
| perl -M'List::Util qw(sum max min)' -MPOSIX -0777 -a -ne 'printf "%-7s : %d\n"x4, "Min", min(@F), "Max", max(@F), "Average", sum(@F)/@F,  "Median", sum( (sort {$a<=>$b} @F)[ int( $#F/2 ), ceil( $#F/2 ) ] )/2;'

The special options used are:

  • -0777 : read the whole file at once instead of line by line
  • -a : autosplit into the @F array

A more readable script version of the same thing would be :

#!/usr/bin/perl

use List::Util qw(sum max min);
use POSIX;

@F=<>;

printf "%-7s : %d\n" x 4,
    "Min", min(@F),
    "Max", max(@F),
    "Average", sum(@F)/@F,
    "Median", sum( (sort {$a<=>$b} @F)[ int( $#F/2 ), ceil( $#F/2 ) ] )/2;

If you want decimals, replace %d with something like %.2f.

Method 8

nums=$(<file.txt); 
list=(`for n in $nums; do printf "%015.06f\n" $n; done | sort -n`); 
echo min ${list[0]}; 
echo max ${list[${#list[*]}-1]}; 
echo median ${list[${#list[*]}/2]};

Method 9

Just for the sake of having a variety of options presented on this page, here are two more ways:

1: octave

  • GNU Octave is a high-level interpreted language, primarily intended for numerical computations. It provides capabilities for the numerical solution of linear and nonlinear problems, and for performing other numerical experiments.

Here is a quick octave example.

octave -q --eval 'A=1:10;
  printf ("# %f\t%f\t%f\t%f\n", min(A), max(A), median(A), mean(A));'  
# 1.000000        10.000000       5.500000        5.500000

2: bash + single-purpose tools.

For bash to handle floating-point numbers, this script uses numprocess and numaverage from package num-utils.

PS. I’ve also had a reasonable look at bc, but for this particular job, it doesn’t offer anything beyond what awk does. It is (as the ‘c’ in ‘bc’ states) a calculator, and one which requires as much programming as awk and this bash script.


arr=($(sort -n "LIST" |tee >(numaverage 2>/dev/null >stats.avg) ))
cnt=${#arr[@]}; ((cnt==0)) && { echo -e "0\t0\t0\t0\t0"; exit; }
mid=$((cnt/2)); 
if [[ ${cnt#${cnt%?}} == [02468] ]] 
   then med=$( echo -n "${arr[mid-1]}" |numprocess /+${arr[mid]},%2/ )
   else med=${arr[mid]}; 
fi     #  count   min       max           median        average
echo -ne "$cnt\t${arr[0]}\t${arr[cnt-1]}\t$med\t"; cat stats.avg

Method 10

Simple-r is the answer:

r summary file.txt
r -e 'min(d); max(d); median(d); mean(d)' file.txt

It uses R environment to simplify statistical analysis.

Method 11

num is a tiny awk wrapper which does exactly this and more, e.g.

$ echo "1 2 3 4 5 6 7 8 9" | num max
9
$ echo "1 2 3 4 5 6 7 8 9" | num min max median mean
..and so on

It saves you from reinventing the wheel in the ultra-portable awk.
The docs are given above, and the direct link here (check also the GitHub page).

Method 12

I’ll second lesmana’s choice of R and offer my first R program. It reads one number per line on standard input and writes four numbers (min, max, average, median) separated by spaces to standard output.

#!/usr/bin/env Rscript
a <- scan(file("stdin"), c(0), quiet=TRUE);
cat(min(a), max(a), mean(a), median(a), "\n");

Method 13

The below sort/awk tandem does it:

sort -n | awk '{a[i++]=$0;s+=$0}END{print a[0],a[i-1],(a[int(i/2)]+a[int((i-1)/2)])/2,s/i}'

(it calculates median as mean of the two central values if value count is even)

Method 14

Taking cues from Bruce’s code, here is a more efficient implementation
which does not keep the whole data in memory. 
As stated in the question,
it assumes that the input file has (at most) one number per line. 
It counts the lines in the input file that contain a qualifying number
and passes the count to the awk command
along with (preceding) the sorted data. 
So, for example, if the file contains

6.0
4.2
8.3
9.5
1.7

then the input to awk is actually

5
1.7
4.2
6.0
8.3
9.5

Then the awk script captures the data count in the NR==1 code block
and saves the middle value
(or the two middle values, which are averaged to yield the median)
when it sees them.

FILENAME="Salaries.csv"

(awk 'BEGIN {c=0} $1 ~ /^[-0-9]*(\.[0-9]*)?$/ {c=c+1;} END {print c;}' "$FILENAME"; 
        sort -n "$FILENAME") | awk '
  BEGIN {
    c = 0
    sum = 0
    med1_loc = 0
    med2_loc = 0
    med1_val = 0
    med2_val = 0
    min = 0
    max = 0
  }

  NR==1 {
    LINES = $1
    # We check whether numlines is even or odd so that we keep only
    # the locations in the array where the median might be.
    if (LINES%2==0) {med1_loc = LINES/2-1; med2_loc = med1_loc+1;}
    if (LINES%2!=0) {med1_loc = med2_loc = (LINES-1)/2;}
  }

  $1 ~ /^[-0-9]*(\.[0-9]*)?$/  &&  NR!=1 {
    # setting min value
    if (c==0) {min = $1;}
    # middle two values in array
    if (c==med1_loc) {med1_val = $1;}
    if (c==med2_loc) {med2_val = $1;}
    c++
    sum += $1
    max = $1
  }
  END {
    ave = sum / c
    median = (med1_val + med2_val ) / 2
    print "sum:" sum
    print "count:" c
    print "mean:" ave
    print "median:" median
    print "min:" min
    print "max:" max
  }
'
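
To make the bookkeeping in the awk script above easier to follow, here is a rough Python sketch of the same idea: know the count up front, then walk the sorted values once and remember only the one or two middle values. The name streaming_stats and its parameters are invented for illustration, and it assumes the values arrive already sorted in ascending order.

```python
def streaming_stats(count, sorted_values):
    """Summarise an ascending stream whose length `count` is known up front."""
    if count == 0:
        raise ValueError("no data")
    lo = (count - 1) // 2   # index of the lower middle value
    hi = count // 2         # index of the upper middle value (same as lo if count is odd)
    total = 0.0
    minimum = maximum = med_lo = med_hi = None
    for i, x in enumerate(sorted_values):
        if i == 0:
            minimum = x     # first value of a sorted stream is the minimum
        if i == lo:
            med_lo = x
        if i == hi:
            med_hi = x
        total += x
        maximum = x         # last value seen is the maximum
    return {
        "sum": total,
        "count": count,
        "mean": total / count,
        "median": (med_lo + med_hi) / 2,
        "min": minimum,
        "max": maximum,
    }


data = [6.0, 4.2, 8.3, 9.5, 1.7]
# An iterator is passed on purpose: the function never needs the whole list.
print(streaming_stats(len(data), iter(sorted(data))))
```

The sorting itself still has to happen somewhere (sort -n in the shell version); the saving is only that the statistics step does not hold the data.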

Method 15

With perl:

$ printf '%s\n' 1 2 4 |
   perl -MList::Util=min,max -MStatistics::Basic=mean,median -w -le '
     chomp(@l = <>); print for min(@l), max(@l), mean(@l), median(@l)'
1
4
2.33
2

Method 16

With an R one-liner:

R -q -e 'summary(as.numeric(read.table("your_single_col_file")[,1]))'

For example, for my file, I got such output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  550.4   628.3   733.1   706.5   778.4   832.9

Method 17

cat/python only solution – not empty-input proof!

cat data |  python3 -c "import fileinput as FI,statistics as STAT; i = [int(l) for l in FI.input()]; print('min:', min(i), ' max: ', max(i), ' avg: ', STAT.mean(i), ' median: ', STAT.median(i))"
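
If empty input matters, a slightly longer variant can guard against it. The following is a sketch only: the helper name summarize is hypothetical, and it parses with float instead of int so that decimal input also works.

```python
import statistics


def summarize(lines):
    """Return (min, max, mean, median) for text lines, or None if there are no numbers."""
    # Strip whitespace and skip blank lines before converting to float.
    nums = [float(s) for s in (line.strip() for line in lines) if s]
    if not nums:
        return None          # empty input: nothing to report
    return min(nums), max(nums), statistics.mean(nums), statistics.median(nums)


print(summarize(["1\n", "2\n", "4\n"]))
print(summarize([]))         # None
```

In a script you would probably pass sys.stdin instead of a list, since any iterable of lines works.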

Method 18

function median()
{
    declare -a nums=($(cat))
    printf '%s\n' "${nums[@]}" | sort -n | tail -n $((${#nums[@]} / 2 + 1)) | head -n 1
}

Method 19

Extending nisetama’s answer:

oneliner with jq

jq -s '{ min:min, max:max, sum:add, count:length, avg: (add/length), median: (sort|.[(length/2|floor)]) }'

Example:

echo 1 2 3 4 5 | jq -s '{ min:min, max:max, sum:add, count:length, avg: (add/length), median: (sort|.[(length/2|floor)]) }'

Gives you:

{
  "min": 1,
  "max": 5,
  "sum": 15,
  "count": 5,
  "avg": 3,
  "median": 3
}

Note: The median is not quite right when the number of items is even (it returns the upper of the two middle values instead of their average), but close enough IMHO.
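
To see how much the shortcut differs for an even count, here is a small Python comparison using the standard statistics module. The sample list is my own, chosen only for illustration.

```python
import statistics

data = [1, 2, 3, 4]

# Same indexing as sort|.[(length/2|floor)] in the jq filter.
upper = sorted(data)[len(data) // 2]

print(upper)                          # 3
print(statistics.median_high(data))   # 3
print(statistics.median_low(data))    # 2
print(statistics.median(data))        # 2.5
```

For an odd count all four agree, so the difference only shows up with an even number of items.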

Method 20

If you’re more interested in utility rather than being cool or clever, then perl is an easier choice than awk. By and large it will be on every *nix with consistent behaviour, and is easy and free to install on windows.
I think it’s also less cryptic than awk, and there will be some stats modules you could use if you wanted a halfway house between writing it yourself and something like R.
My fairly untested (in fact I know it has bugs but it works for my purposes) perl script took about a minute to write, and I’d guess the only cryptic part would be the while(<>), which is the very useful shorthand, meaning take the file(s) passed as command line arguments, read a line at a time and put that line in the special variable $_.
So you could put this in a file called count.pl and run it as perl count.pl myfile.
Apart from that it should be painfully obvious what’s going on.

$max = 0;
while (<>) {
 $sum = $sum + $_;
 $max = $_ if ($_ > $max);
 $count++;
}
$avg=$sum/$count;
print "$count numbers total=$sum max=$max mean=$avg\n";

Method 21

I wrote a perl script called ‘stats’ that does this and more.
(You can also select just the bits you want with options like ‘--sum’, ‘--median’, etc.)

$ ls -lR | grep $USER| scut -f=4 | stats 
Sum       1.22435e+08
Number    428
Mean      286064
Median    4135
Mode      0
NModes    4
Min       0
Max       8.47087e+07
Range     8.47087e+07
Variance  1.69384e+13
Std_Dev   4.11563e+06
SEM       198936
95% Conf  -103852 to 675979
          [for a normal distribution (ND) - see skew]
Quantiles (5)
        Index   Value
1       85      659
2       171     2196
3       256     11015
4       342     40210
Skew      20.3201
          [Skew=0 for a symmetric dist]
Std_Skew  171.621
Kurtosis  413.679
          [Kurtosis=3 for a ND]
PopKurt   0.975426
          [Pop'n Kurtosis is normalized to sample size; PK=0 for a ND]

It’s bundled with scut (a perlish cut/join thingy) at:
https://github.com/hjmangalam/scut

Method 22

I got tired of scratching my head over the most convenient and efficient approach to gathering basic statistics. Many of the other answers here are reasonably elegant, but I thought it would be nice to have a simple command that’s intuitive and flexible.

This is an early work-in-progress, but I’ll be adding more functionality as time permits and need arises.

https://github.com/auerlab/basic-stats

Method 23

If your list of numbers is short, and you don’t need the result programmatically, it’s worth noting that sometimes the best move is to convert the column of numbers into an array:

tr '\n' ',' | awk '{printf("a = [%s]\n", $1)}'

Then paste this into your interpreter of choice, e.g., the Python interpreter, and you can calculate min/max/mean/median/mode/etc. as desired.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
