Joining two files with unique identifier

I have two files with approximately 12900 and 4400 entries respectively, that I want to join. The files contain location information for all landbased weather observing stations around the globe.
The largest file is updated biweekly, and the smaller once a year or so. The original files can be found here (http://www.wmo.int/pages/prog/www/ois/volume-a/vola-home.htm and
http://weather.rap.ucar.edu/surface/stations.txt). The files I have are already manipulated by me with some mixed awk, sed, and bash script. I use the files to visualize data using the
GEMPAK package, that is freely available from Unidata. The largest file will work with GEMPAK, but only not with its full capability. For this a join is needed.

File 1 contains location information for weather observing stations, where the first 6 digits is the unique station identifier.
The different parameters (station number, station name, country code, latitude longitude, and station elevation) are defined only by its position in the line, i.e. no tabs.

         060090 AKRABERG FYR                        DN  6138   -666     101
         060100 VAGA FLOGHAVN                       DN  6205   -728      88
         060110 TORSHAVN                            DN  6201   -675      55
         060120 KIRKJA                              DN  6231   -631      55
         060130 KLAKSVIK HELIPORT                   DN  6221   -656      75
         060160 HORNS REV A                         DN  5550    786      21
         060170 HORNS REV B                         DN  5558    761      10
         060190 SILSTRUP                            DN  5691    863       0
         060210 HANSTHOLM                           DN  5711    858       0
         060220 TYRA OEST                           DN  5571    480      43
         060240 THISTED LUFTHAVN                    DN  5706    870       8
         060290 GROENLANDSHAVNEN                    DN  5703   1005       0
         060300 FLYVESTATION AALBORG                DN  5708    985      13
         060310 TYLSTRUP                            DN  5718    995       0
         060320 STENHOEJ                            DN  5736   1033      56
         060330 HIRTSHALS                           DN  5758    995       0
         060340 SINDAL FLYVEPLADS                   DN  5750   1021      28

File 2 contains the unique identifier in File 1 and a second, 4 character identifier (ICAO locator).

060100 EKVG
060220 EKGF
060240 EKTS
060300 EKYT
060340 EKSN
060480 EKHS
060540 EKHO
060600 EKKA
060620 EKSV
060660 EKVJ
060700 EKAH
060780 EKAT

I want to join the two files, so that the resulting file will have the 4 character identifier in the first 4 positions in the line, i.e. the identifier should replace the 4 spaces.

         060090 AKRABERG FYR                        DN  6138   -666     101
EKVG     060100 VAGA FLOGHAVN                       DN  6205   -728      88
         060110 TORSHAVN                            DN  6201   -675      55
         060120 KIRKJA                              DN  6231   -631      55
         060130 KLAKSVIK HELIPORT                   DN  6221   -656      75
         060160 HORNS REV A                         DN  5550    786      21
         060170 HORNS REV B                         DN  5558    761      10
         060190 SILSTRUP                            DN  5691    863       0
         060210 HANSTHOLM                           DN  5711    858       0
EKGF     060220 TYRA OEST                           DN  5571    480      43
EKTS     060240 THISTED LUFTHAVN                    DN  5706    870       8
         060290 GROENLANDSHAVNEN                    DN  5703   1005       0
EKYT     060300 FLYVESTATION AALBORG                DN  5708    985      13
         060310 TYLSTRUP                            DN  5718    995       0
         060320 STENHOEJ                            DN  5736   1033      56
         060330 HIRTSHALS                           DN  5758    995       0
EKSN     060340 SINDAL FLYVEPLADS                   DN  5750   1021      28

Is it possible to accomplish this task with some bash and/or awk script?

Contents hide

Answers:

Method 1

Method 2

The command

Example

Details

Outstanding issues

Method 3

Method 4

Method 5

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

awk 'BEGIN { while(getline < "file2" ) { codes[$1] = $2 } }
     { printf "%4s%sn", codes[$1], substr($0, 5) }' file1

Method 2

A couple of us wanted to see if we could solve this problem using join only. This is my attempt to do that. Since it partially works @Terdon owes me a dinner 8-).

The command

$ join -a1 -1 1 -2 1 -o 2.2 1.1 1.2 1.3 1.4 1.5 1.6 1.7 -e "N/A" 
     <(sort file1) <(sort file2)

Example

$ join -a1 -1 1 -2 1 -o 2.2 1.1 1.2 1.3 1.4 1.5 1.6 1.7 -e "N/A" <(sort file1) <(sort file2) | column -t
N/A   060090  AKRABERG          FYR         DN    6138  -666  101
EKVG  060100  VAGA              FLOGHAVN    DN    6205  -728  88
N/A   060110  TORSHAVN          DN          6201  -675  55    N/A
N/A   060120  KIRKJA            DN          6231  -631  55    N/A
N/A   060130  KLAKSVIK          HELIPORT    DN    6221  -656  75
N/A   060160  HORNS             REV         A     DN    5550  786
N/A   060170  HORNS             REV         B     DN    5558  761
N/A   060190  SILSTRUP          DN          5691  863   0     N/A
N/A   060210  HANSTHOLM         DN          5711  858   0     N/A
EKGF  060220  TYRA              OEST        DN    5571  480   43
EKTS  060240  THISTED           LUFTHAVN    DN    5706  870   8
N/A   060290  GROENLANDSHAVNEN  DN          5703  1005  0     N/A
EKYT  060300  FLYVESTATION      AALBORG     DN    5708  985   13
N/A   060310  TYLSTRUP          DN          5718  995   0     N/A
N/A   060320  STENHOEJ          DN          5736  1033  56    N/A
N/A   060330  HIRTSHALS         DN          5758  995   0     N/A
EKSN  060340  SINDAL            FLYVEPLADS  DN    5750  1021  28

Details

The above is making use of pretty much every option available to join which tells my gut that we’re using it wrong, as in some type of Frankenstein way, but we’re all learning here, so that’s OK…I guess.

The switch -a1 tells join to include any lines that don’t has a corresponding match from file2 in file1. So this is what’s driving these lines to get displayed:

N/A   060330  HIRTSHALS         DN          5758  995   0     N/A

The -1 1 and -2 1 are saying which columns to join the lines from the 2 files on, mainly their 1st columns. The -o ... is saying which columns from the 2 files to display and in which order.

The -e "N/A" says to use the string “N/A” as a place holder value to print for fields that are deemed empty by join.

The last 2 arguments are feeding the 2 files, file1 & file2 as sorted into the join command.

Please be kind, since this is a work in progress and we’re trying to demonstrate how one would solve this type of problem using the join command, since this would seem to be the type of problem it was meant for.

Outstanding issues

3rd column
The major one is how to contend with the 3rd column since it’s a mix of 1 word and 2 word values. This seems like a major stumbling block to join and I can’t figure out a way around it. Any guidance would be appreciated.
Spacing
All the original spacing is lost with join and I don’t see a way to keep it around either. So join might not be the right way to deal with these types of problems after all.
Seems to work though?
After much bending with the command line the general solution is there so this does seem like it can work at least partially, so this could be used at the core of a solution, and then make use of other tools such as awk and sed to clean it up. This begs the question though: “If you’re cleaning it up with awk & sed any way, then you might as well just use them directly?”.

Method 3

This should be possible using join but I can’t figure out how to make it print spaces and empty fields correctly. Anyway, this little Perl script will do the trick:

#!/usr/bin/env perl

## Open file2, the one that contains the codes
## it is expected to be the 1st argument given to the script.
open($a,"$ARGV[0]"); 

## Read the number<=>code pairs into a hash (an associative array)
## called 'k'
while (<$a>) {
    chomp; @f=split(/s+/); $k{$f[0]}=$f[1];
}

## Open file1, the one that contains the data
## it is expected to be the 2nd argument given to the script.
open($b,"$ARGV[1]"); 
## Go through the file
while (<$b>) {
    ## Split each line at white space into the array @f
    @f=split(/s+/);  

    ## $f[1] is the 6 digit number that defines the different stations.
    ## If this number has an entry in the hash %k, if it was found
    ## in file2, replace the first 4 spaces with its value from the hash.
    s/^s{4}/$k{$f[1]}/ if defined($k{$f[1]});

    ## Print each line of the file
    print; 
}

Save this as foo.pl and run as follows:

$ perl foo.pl file2 file1
         060090 AKRABERG FYR                        DN  6138   -666     101
EKVG     060100 VAGA FLOGHAVN                       DN  6205   -728      88
         060110 TORSHAVN                            DN  6201   -675      55
         060120 KIRKJA                              DN  6231   -631      55
         060130 KLAKSVIK HELIPORT                   DN  6221   -656      75
         060160 HORNS REV A                         DN  5550    786      21
         060170 HORNS REV B                         DN  5558    761      10
         060190 SILSTRUP                            DN  5691    863       0
         060210 HANSTHOLM                           DN  5711    858       0
EKGF     060220 TYRA OEST                           DN  5571    480      43
EKTS     060240 THISTED LUFTHAVN                    DN  5706    870       8
         060290 GROENLANDSHAVNEN                    DN  5703   1005       0
EKYT     060300 FLYVESTATION AALBORG                DN  5708    985      13
         060310 TYLSTRUP                            DN  5718    995       0
         060320 STENHOEJ                            DN  5736   1033      56
         060330 HIRTSHALS                           DN  5758    995       0
EKSN     060340 SINDAL FLYVEPLADS                   DN  5750   1021      28

Method 4

Bash will do.

#!/usr/bin/env bash

# ### create a psuedo hash of icao locator id's
# read each line into an array
while read -a line; do
  # set icao_nnnnnn variable to the value
  declare "icao_${line[0]}"=${line[1]}
done <file2


# ### match up icao id's from file1
# read in file line at a time
while IFS=$'n' read line; do
  # split the line into array
  read -a arr <<< "$line"
  # if the icao_nnnnnn variable exists, it will print out
  var="icao_${arr[0]}"
  printf "%-8s %sn" "${!var}" "$line"
done <file1

See this SO answer for the details of what is going on with the “hash”
Bash 4 supports associative array’s natively, but this should work in 3+4 (maybe 2?)

You might need to left trim the line from file1 to get your formatting.

Method 5

Here’s a simple way to do it with join (+ a couple more tools) and preserve the spacing. Both files appear to be sorted by station number so no additional sorting is needed:

join -j1 -a1 -o 2.2 -e "    " file1 file2 | paste -d' ' - <(cut -c6- file1)

The part before the pipe is very similar to what slm used in his answer so I won’t go over it again. The only difference is that I’m using -e " " – a four-spaces string as a replacement for missing input fields and -o 2.2 to output only the 2nd field of file2
So join -j1 -a1 -o 2.2 -e " " file1 file2 produces a four-char-wide column (it’s not visible below but there’s nothing after EK** and empty lines are actually four spaces):

EKVG







EKGF
EKTS

EKYT



EKSN

we then paste this (using space as a delimiter) to file1 from which we cut the first 5 characters | paste -d' ' - <(cut -c6- file1)
End result:

         060090 AKRABERG FYR                        DN  6138   -666     101
EKVG     060100 VAGA FLOGHAVN                       DN  6205   -728      88
         060110 TORSHAVN                            DN  6201   -675      55
         060120 KIRKJA                              DN  6231   -631      55
         060130 KLAKSVIK HELIPORT                   DN  6221   -656      75
         060160 HORNS REV A                         DN  5550    786      21
         060170 HORNS REV B                         DN  5558    761      10
         060190 SILSTRUP                            DN  5691    863       0
         060210 HANSTHOLM                           DN  5711    858       0
EKGF     060220 TYRA OEST                           DN  5571    480      43
EKTS     060240 THISTED LUFTHAVN                    DN  5706    870       8
         060290 GROENLANDSHAVNEN                    DN  5703   1005       0
EKYT     060300 FLYVESTATION AALBORG                DN  5708    985      13
         060310 TYLSTRUP                            DN  5718    995       0
         060320 STENHOEJ                            DN  5736   1033      56
         060330 HIRTSHALS                           DN  5758    995       0
EKSN     060340 SINDAL FLYVEPLADS                   DN  5750   1021      28

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating