split file into two parts, at a pattern

How to split a large file into two parts, at a pattern?

Given an example file.txt:

ABC
EFG
XYZ
HIJ
KNL

I want to split this file at XYZ so that file1 contains the lines up to XYZ and file2 contains the remaining lines.

Answers:

Method 1

This is a job for csplit:

csplit -sf file -n 1 large_file /XYZ/

would silently split the file, creating pieces with prefix file and numbered using a single digit, e.g. file0 etc. Note that using /regex/ would split up to, but not including the line that matches regex. To split up to and including the line matching regex add a +1 offset:

csplit -sf file -n 1 large_file /XYZ/+1

This creates two files, file0 and file1. If you absolutely need them to be named file1 and file2 you could always add an empty pattern to the csplit command and remove the first file:

csplit -sf file -n 1 large_file // /XYZ/+1

creates file0, file1 and file2 but file0 is empty so you can safely remove it:

rm -f file0
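As a quick check, here is a sketch that runs the +1 variant on the sample from the question. It assumes GNU csplit; the scratch directory and the input name file.txt are only for illustration.

```shell
# Work in a throwaway directory (illustrative; any directory works).
cd "$(mktemp -d)"

# Recreate the sample file from the question.
printf '%s\n' ABC EFG XYZ HIJ KNL > file.txt

# Split right after the first line matching XYZ; pieces are named file0 and file1.
csplit -sf file -n 1 file.txt '/XYZ/+1'

cat file0   # ABC, EFG, XYZ
cat file1   # HIJ, KNL
```

Dropping the +1 offset would move the XYZ line to the start of file1 instead.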

Method 2

With awk you can do:

awk '{print >out}; /XYZ/{out="file2"}' out=file1 largefile

Explanation: The first awk argument (out=file1) defines a variable holding the name of the output file; it is set before the subsequent argument (largefile) is processed. The awk program prints every line to the file named by the variable out ({print >out}). When the pattern XYZ is found, out is redefined to point to the new file ({out="file2"}), which becomes the target for all subsequent lines. Because the print rule runs before the pattern rule, the matching line itself still goes to file1.
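The order of the two rules decides where the matching line lands. The sketch below runs the original command on the sample from the question, followed by two variations that are not part of the answer; the output names first/second and head/rest are made up for illustration.

```shell
cd "$(mktemp -d)"
printf '%s\n' ABC EFG XYZ HIJ KNL > file.txt

# Original order: print first, then switch, so XYZ ends up in file1.
awk '{print > out}; /XYZ/{out="file2"}' out=file1 file.txt

# Variation: switch first, then print, so XYZ becomes the first line of "second".
awk '/XYZ/{out="second"}; {print > out}' out=first file.txt

# Variation: switch and skip the matching line, so XYZ appears in neither file.
awk '/XYZ/{out="rest"; next}; {print > out}' out=head file.txt
```

Note that the last variation would also drop any later line matching XYZ, which may or may not be what you want.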

Method 3

With a modern ksh, here’s a shell variant (i.e. without sed) of one of the sed-based answers (see Method 4):

{ read in <##XYZ ; print "$in" ; cat >file2 ;} <largefile >file1

And another variant in ksh alone (i.e. also omitting the cat):

{ read in <##XYZ ; print "$in" ; { read <##"" ;} >file2 ;} <largefile >file1

(The pure ksh solution seems to be quite performant: on a 2.4 GB test file it needed 19-21 sec, compared to 39-47 sec with the sed/cat based approach.)
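The <## redirection is specific to ksh93. For other shells, a rough equivalent can be built from a plain read loop. This is only a sketch, assuming a POSIX sh, a regular (seekable) input file and text without NUL bytes; it is not from the answer, and a line-by-line shell loop will probably be far slower than the ksh version on large files.

```shell
cd "$(mktemp -d)"
printf '%s\n' ABC EFG XYZ HIJ KNL > largefile

{
  # Copy lines to file1 until the first line containing XYZ (inclusive).
  while IFS= read -r line; do
    printf '%s\n' "$line"
    case $line in *XYZ*) break ;; esac
  done > file1
  # The shared input descriptor is now positioned just after that line,
  # so cat copies only the remainder.
  cat > file2
} < largefile
```

As in the ksh version, both commands read from the same open file, so cat simply continues where the loop stopped.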

Method 4

{ sed '/XYZ/q' >file1; cat >file2; } <infile

With GNU sed you should use the -u (--unbuffered) switch. Most other seds should just work, though.

To leave XYZ out…

{ sed -n '/XYZ/q;p'; cat >file2; } <infile >file1
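Here is a small run of both forms on the sample from the question, as a sketch assuming GNU sed (hence -u; other seds may neither need nor accept it). The output names head and rest are illustrative.

```shell
cd "$(mktemp -d)"
printf '%s\n' ABC EFG XYZ HIJ KNL > infile

# Keep XYZ as the last line of file1; cat picks up the rest.
{ sed -u '/XYZ/q' > file1; cat > file2; } < infile

# Leave XYZ out of both pieces: quit on the match without printing it.
{ sed -u -n '/XYZ/q;p' > head; cat > rest; } < infile
```

The -u switch matters because sed must not read ahead past the matching line; otherwise cat would start too late in the file or find nothing left to copy.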

Method 5

Try this with GNU sed:

sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file
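To see what ends up where, the command can be run on the sample from the question. This sketch assumes GNU sed and reuses the file names from the command above.

```shell
cd "$(mktemp -d)"
printf '%s\n' ABC EFG XYZ HIJ KNL > large_file

# Lines 1 through the first XYZ are written to file1.
# From that XYZ to the end, XYZ lines are deleted and the rest is written to file2.
sed -n -e '1,/XYZ/w file1' -e '/XYZ/,${/XYZ/d;w file2' -e '}' large_file

cat file1   # ABC, EFG, XYZ
cat file2   # HIJ, KNL
```

The closing brace sits in its own -e because the w command treats the rest of its line as the file name. As written, any later line matching XYZ is also deleted from file2.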

Method 6

An easy hack is to print either to STDOUT or STDERR, depending on whether the target pattern has been matched. You can then use the shell’s redirection operators to redirect the output accordingly. For example, in Perl, assuming the input file is called f and the two output files f1 and f2:

  1. Discarding the line that matches the split pattern:
    perl -ne 'if(/XYZ/){$a=1; next} ; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2
  2. Including the matched line:
    perl -ne '$a=1 if /XYZ/; $a==1 ? print STDERR : print STDOUT;' f >f1 2>f2

Alternatively, print to different file handles:

  1. Discarding the line that matches the split pattern:
    perl -ne 'BEGIN{open($fh1,">","f1"); open($fh2,">","f2");}
              if(/XYZ/){$a=1; next} $a==1 ? print $fh2 "$_" : print $fh1 "$_";' f
  2. Including the matched line:
    perl -ne 'BEGIN{open($fh1,">","f1"); open($fh2,">","f2");}
              $a=1 if /XYZ/; $a==1 ? print $fh2 "$_" : print $fh1 "$_";' f


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
