Concatenate multiple files with same header

I have multiple files that share the same header but have different data lines below it. I need to concatenate all of them, keeping the header only from the first file and dropping it from the others, since the headers are all identical.

For example:
file1.txt

<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
A
B 
C

file2.txt

<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
D
E 
F

I need the output to be

<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
A
B
C
D
E 
F

I could write a script in R, but I need to do it in the shell.

Answers:

Method 1

Another solution, similar to the “cat+grep” approach (Method 3 below), using tail and head:

  1. Write the header of the first file into the output:
    head -2 file1.txt > all.txt

    head -2 gets the first 2 lines of the file.

  2. Add the content of all the files:
    tail -n +3 -q file*.txt >> all.txt

    -n +3 makes tail print from the 3rd line to the end,
    -q tells it not to print the header with the file name (see man tail),
    >> appends to the file instead of overwriting it as > does.

You can of course put both commands on one line:

head -2 file1.txt > all.txt; tail -n +3 -q file*.txt >> all.txt

Or put && between them instead of ; so that the second command runs only if the first succeeds.
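
The two steps above can be tried end to end with a small sketch like the following. It recreates the question's sample files in a throwaway directory (the workdir variable and the use of mktemp are assumptions made for this illustration) and uses the && form. It also assumes a tail that supports -q, such as GNU tail.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two sample files from the question.
printf '%s\n' '<header>INFO=<ID=DP,Number=1,Type=Integer>' \
              '<header>INFO=<ID=DP4,Number=4,Type=Integer>' A B C > file1.txt
printf '%s\n' '<header>INFO=<ID=DP,Number=1,Type=Integer>' \
              '<header>INFO=<ID=DP4,Number=4,Type=Integer>' D E F > file2.txt

# Header from the first file, then the body (line 3 onward) of every file.
head -n 2 file1.txt > all.txt && tail -n +3 -q file*.txt >> all.txt

cat all.txt
```

The output name all.txt is deliberately not matched by file*.txt, so the result is never read back in as an input.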

Method 2

If you know how to do it in R, then by all means do it in R. With classical unix tools, this is most naturally done in awk.

awk '
    FNR==1 && NR!=1 { while (/^<header>/) getline; }
    1 {print}
' file*.txt >all.txt

The first line of the awk script matches the first line of a file (FNR==1) except if it’s also the first line across all files (NR==1). When these conditions are met, the expression while (/^<header>/) getline; is executed, which causes awk to keep reading another line (skipping the current one) as long as the current one matches the regexp ^<header>. The second line of the awk script prints everything except for the lines that were previously skipped.
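
The FNR/NR distinction described above can also be used without getline. The following is a different, simpler technique than the script in this answer, shown only as a sketch: it prints every line of the first file (where FNR equals NR) and, for later files, only the lines that do not start with <header>. It assumes the first file is non-empty and that no data line starts with <header>. The make_sample helper and the third sample file are made up for this illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical helper: make_sample NAME LINE...
# Writes the question's two-line header followed by the given data lines.
make_sample() {
    name=$1
    shift
    printf '%s\n' '<header>INFO=<ID=DP,Number=1,Type=Integer>' \
                  '<header>INFO=<ID=DP4,Number=4,Type=Integer>' "$@" > "$name"
}
make_sample file1.txt A B C
make_sample file2.txt D E F
make_sample file3.txt G H I

# FNR == NR holds only while awk is reading the first file, so that file is
# printed in full; in every later file, lines starting with <header> are dropped.
awk 'FNR == NR || !/^<header>/' file*.txt > all.txt

cat all.txt
```

Unlike the getline version, this drops matching lines anywhere in the later files, not only at the top, which is why it depends on data lines never starting with <header>.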

Method 3

Try this:

$ cat file1.txt; grep -v "^<header" file2.txt
<header>INFO=<ID=DP,Number=1,Type=Integer>
<header>INFO=<ID=DP4,Number=4,Type=Integer>
A
B 
C
D
E 
F

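NOTE

  • the -v flag inverts the match, so grep prints only the lines that do not match
  • ^ in a regex means the beginning of the line
  • if you have a bunch of files, you can do:

array=( file*.txt )
{ cat "${array[@]:0:1}"; grep -hv "^<header" "${array[@]:1}"; } > new_file.txt

This uses bash array slicing: ${array[@]:0:1} is the first file and ${array[@]:1} is every file after it. The -h flag stops grep from prefixing each line with the file name, which it otherwise does when given more than one file.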
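
Array slicing is a bash feature. Where only a plain POSIX sh is available, roughly the same split into "first file" and "the rest" can be sketched with the positional parameters instead. This is an illustration rather than part of the original answer: the sample files, the workdir variable and the first variable are made up, and grep -h is assumed to be supported (it is in GNU and BSD grep).

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample files with the question's two-line header (third file is illustrative).
for pair in 'file1.txt:A' 'file2.txt:B' 'file3.txt:C'; do
    printf '%s\n' '<header>INFO=<ID=DP,Number=1,Type=Integer>' \
                  '<header>INFO=<ID=DP4,Number=4,Type=Integer>' \
                  "${pair#*:}" > "${pair%%:*}"
done

set -- file*.txt    # the positional parameters now hold the file list
first=$1
shift               # "$@" is now every file after the first

{
    cat "$first"
    # Guard: with no remaining files, grep would wait on standard input.
    if [ "$#" -gt 0 ]; then
        grep -hv '^<header' "$@"
    fi
} > new_file.txt

cat new_file.txt
```

Note that set -- replaces any arguments the surrounding script was called with, so inside a larger script this is best kept in a function.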

Method 4

The tail command (on GNU, at least) has an option to skip a given number of initial lines. To print from the second line onward, i.e. skip a one-line header, do: tail -n+2 myfile

So, to keep the two-line header of the first file but not the second, in Bash:

cat file1.txt <(tail -n+3 file2.txt) > combined.txt

Or, for many files:

head -n2 file1.txt > combined.txt
# file*.txt rather than *.txt, so that combined.txt itself is not matched
for fname in file*.txt
do
    tail -n+3 "$fname" >> combined.txt
done

If a certain string is known to be present in all header lines but never in the rest of the input files, grep -v is a simpler approach, as shown in Method 3.

Method 5

array=( *.txt ); head -1 "${array[0]}" > all.txt; tail -n +2 -q "${array[@]}" >> all.txt

This assumes a folder of .txt files that share the same one-line header and need to be concatenated; it combines them into all.txt with a single header. The three commands are separated by semicolons. The first collects all the .txt files into an array, the second writes the header of the first file to all.txt, and the third appends the contents of every file from line 2 onward (that is, without the header) to all.txt. For the two-line header in the question, use head -2 and tail -n +3 instead. Note that all.txt itself matches *.txt, so remove it before running the command again.

Method 6

Shorter (not necessarily faster) with sed:

sed -e '3,${/^<header>/d' -e '}' file*.txt > all.txt

This deletes every line beginning with <header> from line 3 onward, so the first header is preserved and the other headers are removed. It works because sed numbers lines continuously across all the input files. If the header has a different number of lines, adjust the command accordingly (e.g. for a 6-line header use 7 instead of 3).
If the number of lines in the header is unknown, you could try this:

sed '1{
: again
n
/^<header>/b again
}
/^<header>/d
' file*.txt > all.txt
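
To see the unknown-length variant at work, here is a small sketch with a three-line header instead of the two-line one from the question. The header contents, file contents and workdir variable are invented for the illustration; the sed script is the same as above apart from writing the label as :again without a space.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two sample files with an (illustrative) three-line header.
printf '%s\n' '<header>one' '<header>two' '<header>three' A B > file1.txt
printf '%s\n' '<header>one' '<header>two' '<header>three' C D > file2.txt

# On line 1, keep printing and reading the next line while it is still a
# header line; after that, delete every remaining line starting with <header>.
sed '1{
:again
n
/^<header>/b again
}
/^<header>/d
' file*.txt > all.txt

cat all.txt
```

The loop on line 1 passes the whole first header through regardless of its length, so no line count needs to be hard-coded.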

Method 7

Here’s a lazy script to help with this. Not totally robust, but good enough.

function concat_with_header() {
  # Quoted suffix to pattern match for concatenation (e.g. '*.csv')
  local suffix="${1}"
  # Name of the output file
  local output="${2:-combined.out}"
  # Number of lines to use for the header
  local header_length="${3:-1}"
  # Collect the matching files; the first one supplies the header
  local files=( *$suffix )
  head -n "$header_length" "${files[0]}" > "$output"
  # Append every file from the first line after the header onward
  tail -n +"$((header_length + 1))" -q "${files[@]}" >> "$output"
}


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
