Combine text files column-wise

I have two text files. The first one has content:

Languages
Recursively enumerable
Regular

while the second one has content:

Minimal automaton
Turing machine
Finite

I want to combine them into one file column-wise. So I tried paste 1 2 and its output is:

Languages   Minimal automaton
Recursively enumerable  Turing machine
Regular Finite

However I would like to have the columns aligned well such as

Languages               Minimal automaton
Recursively enumerable  Turing machine
Regular                 Finite

Is it possible to achieve that without padding the columns by hand?


Added:

Here is another example, where Bruce’s method almost nails it, except for a slight misalignment. Why does that happen?

$ cat 1
Chomsky hierarchy
Type-0
—

$ cat 2
Grammars
Unrestricted

$ paste 1 2 | pr -t -e20
Chomsky hierarchy   Grammars
Type-0              Unrestricted
—                    (no common name)

Answers:


Method 1

You just need the column command, telling it to use tabs to separate columns:

paste file1 file2 | column -s $'\t' -t

To address the “empty cell” controversy, we just need the -n option to column:

$ paste <(echo foo; echo; echo barbarbar) <(seq 3) | column -s $'\t' -t
foo        1
2
barbarbar  3

$ paste <(echo foo; echo; echo barbarbar) <(seq 3) | column -s $'\t' -tn
foo        1
           2
barbarbar  3

My column man page indicates -n is a “Debian GNU/Linux extension.” My Fedora system does not exhibit the empty cell problem: it appears to be derived from BSD and the man page says “Version 2.23 changed the -s option to be non-greedy”

Method 2

You’re looking for the handy dandy pr command:

paste file1 file2 | pr -t -e24

The “-e24” is “expand tab stops to 24 spaces”. Luckily, paste puts a tab-character between columns, so pr can expand it. I chose 24 by counting the characters in “Recursively enumerable” and adding 2.
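
Counting by hand gets tedious, so here is a minimal sketch that derives the tab stop from the left file. The function name paste_pr and the scratch file names are made up for illustration; it assumes an awk whose length counts the characters you care about (plain ASCII here) and a pr that accepts -e with a width, as used above.

```bash
#!/bin/bash
# Hypothetical helper (not part of the original answer): compute the tab stop
# from the longest line of the left file instead of counting by hand.
paste_pr() {
    local left=$1 right=$2 width
    # Longest line in the left file, plus 2 columns of spacing.
    width=$(awk '{ if (length($0) > n) n = length($0) } END { print n + 2 }' "$left")
    paste "$left" "$right" | pr -t -e"$width"
}

# Demo with the files from the question, created in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'Languages' 'Recursively enumerable' 'Regular' > "$dir/file1"
printf '%s\n' 'Minimal automaton' 'Turing machine' 'Finite'  > "$dir/file2"
paste_pr "$dir/file1" "$dir/file2"
rm -r "$dir"
```

Like the one-liner, this only sizes the first column, so it is meant for two files.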

Method 3

Update: Here is a much simpler script (than the one later in this answer) for tabulated output. Just pass filenames to it as you would to paste… It uses html to make the frame, so it is tweakable. It does preserve multiple spaces, and the column alignment is preserved when it encounters unicode characters. However, the way the editor or viewer renders the unicode is another matter entirely…

┌──────────────────────┬────────────────┬──────────┬────────────────────────────┐
│ Languages            │ Minimal        │ Chomsky  │ Unrestricted               │
├──────────────────────┼────────────────┼──────────┼────────────────────────────┤
│ Recursive            │ Turing machine │ Finite   │     space indented         │
├──────────────────────┼────────────────┼──────────┼────────────────────────────┤
│ Regular              │ Grammars       │          │ ➀ unicode may render oddly │
├──────────────────────┼────────────────┼──────────┼────────────────────────────┤
│ 1 2  3   4    spaces │                │ Symbol-& │ but the column count is ok │
├──────────────────────┼────────────────┼──────────┼────────────────────────────┤
│                      │                │          │ Context                    │
└──────────────────────┴────────────────┴──────────┴────────────────────────────┘

#!/bin/bash
{ echo -e "<html>\n<table border=1 cellpadding=0 cellspacing=0>"
  paste "$@" |sed -re 's#(.*)#\x09\1\x09#' -e 's#\x09# </pre></td>\n<td><pre> #g' -e 's#^ </pre></td>#<tr>#' -e 's#\n<td><pre> $#\n</tr>#'
  echo -e "</table>\n</html>"
} |w3m -dump -T 'text/html'

A synopsis of the tools presented in the answers (so far).
I’ve had a pretty close look at them; here is what I’ve found:

paste # This tool is common to all the answers presented so far
# It can handle multiple files; therefore multiple columns… Good!
# It delimits each column with a Tab… Good.
# Its output is not tabulated.

The tools below all remove this delimiter!… Bad if you need a delimiter.

column # It removes the Tab delimiter, so field identification is purely by column position, which it seems to handle quite well.
# I haven’t spotted anything awry…
# Aside from not having a unique delimiter, it works fine!

expand # Only has a single tab setting, so it is unpredictable beyond 2 columns.
# The alignment of columns is not accurate when handling unicode.
# It removes the Tab delimiter, so field identification is purely by column alignment.

pr # Only has a single tab setting, so it is unpredictable beyond 2 columns.
# The alignment of columns is not accurate when handling unicode.
# It removes the Tab delimiter, so field identification is purely by column alignment.
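
A likely reason for the unicode misalignment, assuming the pr or expand build in use counts bytes rather than characters (not every build does): the dash from the question’s second example is a single character but takes three bytes in UTF-8, so such a tool would pad that row two spaces too few. A quick way to see the difference:

```bash
# One character, three bytes (this file is assumed to be saved as UTF-8).
s='—'
printf '%s' "$s" | wc -c    # byte count: 3
printf '%s' "$s" | wc -m    # character count: 1 in a UTF-8 locale
```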

To me, column is the obvious best solution as a one-liner. If you want either the delimiter or an ASCII-art tabulation of your files, read on; otherwise, column is pretty darn good :)…


Here is a script which takes any number of files and creates an ASCII-art tabulated presentation. (Bear in mind that unicode may not render to the expected width, e.g. ௵, which is a single character. This is quite different to the column numbers being wrong, as is the case in some of the utilities mentioned above.) The script’s output, shown below, is from 4 input files, named F1 F2 F3 F4…

+------------------------+-------------------+-------------------+--------------+
| Languages              | Minimal automaton | Chomsky hierarchy | Grammars     |
| Recursively enumerable | Turing machine    | Type-0            | Unrestricted |
| Regular                | Finite            | —                 |              |
| Alphabet               |                   | Symbol            |              |
|                        |                   |                   | Context      |
+------------------------+-------------------+-------------------+--------------+

#!/bin/bash

# Note: The next line is for testing purposes only!
set F1 F2 F3 F4 # Simulate commandline filename args $1 $2 etc...

p=' '                                # The pad character
# Get line and column stats
cc=${#@}; lmax=                      # Count of columns (== input files)
for c in $(seq 1 $cc) ;do            # Filenames from the commandline 
  F[$c]="${!c}"        
  wc=($(wc -l -L <${F[$c]}))         # File length and width of longest line 
  l[$c]=${wc[0]}                     # File length  (per file)
  L[$c]=${wc[1]}                     # Longest line (per file) 
  ((lmax<${l[$c]})) && lmax=${l[$c]} # Length of longest file
done
# Determine line-count deficits  of shorter files
for c in $(seq 1 $cc) ;do  
  ((${l[$c]}<lmax)) && D[$c]=$((lmax-${l[$c]})) || D[$c]=0 
done
# Build '\n' strings to cater for short-file deficits
for c in $(seq 1 $cc) ;do
  for n in $(seq 1 ${D[$c]}) ;do
    N[$c]=${N[$c]}$'\n'
  done
done
# Build the command to suit the number of input files
source=$(mktemp)
>"$source" echo 'paste \'
for c in $(seq 1 $cc) ;do
    ((${L[$c]}==0)) && e="x" || e=":a -e \"s/^.{0,$((${L[$c]}-1))}$/&$p/;ta\""
    >>"$source" echo '<(sed -re '"$e"' <(cat "${F['$c']}"; echo -n "${N['$c']}")) \'
done
# include the ASCII-art Table framework
>>"$source" echo ' | sed  -e "s/.*/| & |/" -e "s/\t/ | /g" \'   # Add vertical frame lines
>>"$source" echo ' | sed -re "1 {h;s/[^|]/-/g;s/\|/+/g;p;g}" \' # Add top and bottom frame lines
>>"$source" echo '        -e "$ {p;s/[^|]/-/g;s/\|/+/g}"'
>>"$source" echo  
# Run the code
source "$source"
rm     "$source"
exit

Here is my original answer (trimmed a bit in lieu of the above script)

Use wc to get the column width, sed to right-pad with a visible character (a dot, just for this example), and then paste to join the two columns with a Tab char…

paste <(sed -re :a -e 's/^.{1,'"$(($(wc -L <F1)-1))"'}$/&./;ta' F1) F2

# output (No trailing whitespace)
Languages.............  Minimal automaton
Recursively enumerable  Turing machine
Regular...............  Finite

If you want to pad out the right column:

paste <( sed -re :a -e 's/^.{1,'"$(($(wc -L <F1)-1))"'}$/&./;ta' F1 ) \
      <( sed -re :a -e 's/^.{1,'"$(($(wc -L <F2)-1))"'}$/&./;ta' F2 )

# output (With trailing whitespace)
Languages.............  Minimal automaton
Recursively enumerable  Turing machine...
Regular...............  Finite...........

Method 4

You’re almost there. paste puts a tab character between each column, so all you need to do is expand the tabs. (I assume your files don’t contain tabs.) You do need to determine the width of the left column. With (recent enough) GNU utilities, wc -L shows the length of the longest line. On other systems, make a first pass with awk. The +1 is the amount of blank space you want between columns.

paste left.txt right.txt | expand -t $(($(wc -L <left.txt) + 1))
paste left.txt right.txt | expand -t $(awk 'n<length {n=length} END {print n+1}' left.txt)

If you have the BSD column utility, you can use it to determine the column width and expand the tabs in one go. (␉ is a literal tab character; under bash/ksh/zsh you can use $'\t' instead, and in any shell you can use "$(printf '\t')".)

paste left.txt right.txt | column -s '␉' -t

Method 5

I’m unable to comment on glenn jackman’s answer, so am adding this to address the issue of empty cells that Peter.O noted. Adding a null char prior to each tab eliminates the runs of delimiters that are treated as a single break and addresses the issue. (I originally used spaces, but using the null char eliminates the extra space between columns.)

paste file1 file2 | sed 's/\t/\0\t/g' | column -s $'\t' -t

If the null char causes problems for various reasons, try either:

paste file1 file2 | sed 's/\t/ \t/g' | column -s $'\t' -t

or

paste file1 file2 | sed $'s/\t/ \t/g' | column -s $'\t' -t

Both sed and column appear to vary in implementation across flavors and versions of Unix/Linux, especially BSD (and Mac OS X) vs. GNU/Linux.
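
If those differences bite, one possible variation is to let awk touch only the cells that are actually empty, so that no two delimiters are ever adjacent and the non-empty cells gain no extra space. This is a sketch, not something from the answers above; the function name fill_empty_cells is made up, and it assumes an awk that accepts '\t' for -F and -v OFS.

```bash
#!/bin/bash
# Hypothetical helper: put a single space into every empty tab-separated
# field, leaving all other fields untouched.
fill_empty_cells() {
    awk -F '\t' -v OFS='\t' '{
        for (i = 1; i <= NF; i++) if ($i == "") $i = " "
        print
    }'
}

# Demo: the middle row has an empty first cell. Skipped if column is missing.
if command -v column >/dev/null 2>&1; then
    printf 'foo\t1\n\t2\nbarbarbar\t3\n' | fill_empty_cells | column -s "$(printf '\t')" -t
fi
```

With real files this would be used as paste file1 file2 | fill_empty_cells | column -s $'\t' -t.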

Method 6

This is multi-step, so it’s non-optimal, but here goes.

1) Find the length of the longest line in file1.txt.

while read line
do
echo ${#line}
done < file1.txt | sort -n | tail -1

With your example, the longest line is 22.

2) Use awk to pad file1.txt, padding each line shorter than 22 characters up to 22 with the printf statement.

awk 'BEGIN {FS="---"} {printf "%-22s\n", $1}' < file1.txt > file1-pad.txt

Note: For FS, use a string that does not exist in file1.txt.

3) Use paste as you did before.

$ paste file1-pad.txt file2.txt
Languages               Minimal automaton
Recursively enumerable  Turing machine
Regular                 Finite

If this is something you do often, this can easily be turned into a script.
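
For instance, the three steps could be folded into one function along these lines. This is only a sketch: the name paste_padded is made up, it measures the longest line with awk instead of the while loop, and it pipes the padded lines straight into paste rather than writing file1-pad.txt.

```bash
#!/bin/bash
# Hypothetical wrapper around the three steps above.
paste_padded() {
    local left=$1 right=$2 width
    # Step 1: length of the longest line in the left file.
    width=$(awk '{ if (length($0) > n) n = length($0) } END { print n + 0 }' "$left")
    # Steps 2 and 3: pad every left-hand line to that width, then paste.
    awk -v w="$width" '{ printf "%-" w "s\n", $0 }' "$left" | paste - "$right"
}

# Demo with the files from the question, created in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'Languages' 'Recursively enumerable' 'Regular' > "$dir/file1.txt"
printf '%s\n' 'Minimal automaton' 'Turing machine' 'Finite'  > "$dir/file2.txt"
paste_padded "$dir/file1.txt" "$dir/file2.txt"
rm -r "$dir"
```

Note that rows where only the right file has a line are not padded, since paste supplies an empty left cell there.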

Method 7

Building on bahamat’s answer:
this can be done entirely in awk,
reading the files only once and not creating any temporary files. 
To solve the problem as stated, do

awk '
        NR==FNR { if (length > max_length) max_length = length
                  max_FNR = FNR
                  save[FNR] = $0
                  next
                }
                { printf "%-*s", max_length+2, save[FNR]
                  print
                }
        END     { if (FNR < max_FNR) {
                        for (i=FNR+1; i <= max_FNR; i++) print save[i]
                  }
                }
    '   file1 file2

As with many awk scripts of this ilk, the above first reads file1,
saving all the data in the save array
and simultaneously computing the maximum line length. 
Then it reads file2
and prints the saved (file1) data side-by-side
with the current (file2) data. 
Finally, if file1 is longer than file2 (has more lines),
we print the last few lines of file1
(the ones for which there is no corresponding line in the second column).

Regarding the printf format:

  • "%-nns" prints a string left-justified
    in a field nn characters wide.
  • "%-*s", nn does the same thing —
    the * tells it to take the field width from the next parameter.
  • By using max_length+2 for nn,
    we get two spaces between the columns. 
    Obviously the +2 can be adjusted.
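
A tiny demonstration of the two forms, plus a fallback in case an awk implementation does not accept the * (an assumption on my part; the awks I know of do): the width can also be spliced into the format string by concatenation. The function name pad_demo is just for illustration.

```bash
#!/bin/bash
# Prints "abc" left-justified in a 10-character field, twice.
pad_demo() {
    awk 'BEGIN {
        w = 10
        printf "[%-*s]\n", w, "abc"      # width taken from the argument list
        printf "[%-" w "s]\n", "abc"     # width spliced into the format string
    }'
}
pad_demo
```

The brackets are only there to make the trailing padding visible.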

The above script works only for two files. 
It can trivially be modified to handle three files,
or to handle four files, etc.,
but this would be tedious and is left as an exercise. 
However, it turns out not to be hard to modify it to handle
any number of files:

awk '
        FNR==1  { file_num++ }
                { if (length > max_length[file_num]) max_length[file_num] = length
                  max_FNR[file_num] = FNR
                  save[file_num,FNR] = $0
                }
        END     { for (j=1; j<=file_num; j++) {
                        if (max_FNR[j] > global_max_FNR) global_max_FNR = max_FNR[j]
                  }
                  for (i=1; i<=global_max_FNR; i++) {
                        for (j=1; j<file_num; j++) printf "%-*s", max_length[j]+2, save[j,i]
                        print save[file_num,i]
                  }
                }
    '   file*

This is very similar to my first script, except

  • It turns max_length into an array.
  • It turns max_FNR into an array.
  • It turns save into a two-dimensional array.
  • It reads all the files, saving all the contents. 
    Then it writes out all the output from the END block.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
