How can I split a text file into multiple text files?

I have a text file called entry.txt that contains the following:

[ entry1 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3633 3634 3636 3690 3691 3693 3766
3767 3769 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5628 5629 5631
[ entry2 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4526
4527 4529 4583 4584 4586 4773 4774 4776 5153 5154
5156 5628 5629 5631
[ entry3 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4241
4242 4244 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5495 5496 5498 5628 5629 5631

I would like to split it into three text files: entry1.txt, entry2.txt, entry3.txt. Their contents are as follows.

entry1.txt:

[ entry1 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3633 3634 3636 3690 3691 3693 3766
3767 3769 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5628 5629 5631

entry2.txt:

[ entry2 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4526
4527 4529 4583 4584 4586 4773 4774 4776 5153 5154
5156 5628 5629 5631

entry3.txt:

[ entry3 ]
1239 1240 1242 1391 1392 1394 1486 1487 1489 1600
1601 1603 1657 1658 1660 2075 2076 2078 2322 2323
2325 2740 2741 2743 3082 3083 3085 3291 3292 3294
3481 3482 3484 3690 3691 3693 3766 3767 3769 4241
4242 4244 4526 4527 4529 4583 4584 4586 4773 4774
4776 5153 5154 5156 5495 5496 5498 5628 5629 5631

In other words, the [ character indicates a new file should begin. The entries ([ entry*], where * is an integer) are always in numerical order and are consecutive integers starting from 1 to N (in my actual input file, N = 200001).

Is there any way I can split the file automatically in bash? My actual input entry.txt contains 200,001 entries.

Answers:

Method 1

With csplit from GNU coreutils (non-embedded Linux, Cygwin):

csplit -f entry -b '%d.txt' entry.txt '/^\[ .* \]$/' '{*}'

You’ll end up with an extra empty file entry0.txt (containing the part before the first header).

Standard csplit lacks the {*} indefinite repeater and the -b option to specify the suffix format, so on other systems you’ll have to count the number of sections first and rename the output files afterwards.

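csplit -f entry -n 9 entry.txt '/^\[ .* \]$/' "{$(egrep -c '^\[ .* \]$' <entry.txt)}"
for x in entry?????????; do
  n=${x#entry}               # strip the prefix, leaving the 9-digit suffix
  y=$((1$n - 1000000000))    # drop leading zeros without triggering octal parsing
  mv "$x" "entry$y.txt"
done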

Method 2

And here’s a nice, simple gawk one-liner:

$ gawk '/^\[/{match($0, /^\[ (.+?) \]/, k)} {print >k[1]".txt" }' entry.txt

This will work for any file size, irrespective of the number of lines in each entry, as long as each entry header looks like [ blahblah blah blah ]. Notice the space just after the opening [ and just before the closing ].


EXPLANATION:

awk and gawk read an input file line by line. As each line is read, its contents are saved in the $0 variable. Here, we are telling gawk to match anything within square brackets, and save its match into the array k.

So, every time that regular expression is matched, that is, for every header in your file, k[1] will have the matched region of the line. Namely, “entry1”, “entry2” or “entry3” or “entryN”.

Finally, we print each line into a file called <whatever value k[1] currently has>.txt, i.e. entry1.txt, entry2.txt … entryN.txt.
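The same "remember the last header, send every line there" logic can be sketched in Python. This is only an illustration of the idea, not a translation of the gawk command: the names HEADER and group_by_header are made up for this example, the result is collected in memory rather than written to files, and lines before the first header are ignored.

```python
import re

# Same shape as the gawk pattern: "[ name ]" at the start of the line.
HEADER = re.compile(r"^\[ (.+) \]")

def group_by_header(lines):
    """Return {name: [lines]} keyed by the most recently seen header name."""
    groups = {}
    current = None
    for line in lines:
        m = HEADER.match(line)
        if m:
            # New header: switch the "current output" to its captured name.
            current = m.group(1)
            groups[current] = []
        if current is not None:
            groups[current].append(line)
    return groups
```

Each key corresponds to one output file name (minus the .txt suffix), and the header line itself is kept as the first line of its group.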

This method will be much faster than perl for larger files.

Method 3

In perl it can be done much more simply:

perl -ne 'open(F, ">", ($1).".txt") if /\[ (entry\d+) \]/; print F;' file

Method 4

Here’s a short awk one-liner:

awk '/^\[/ {ofn=$2 ".txt"} ofn {print > ofn}' input.txt

How does this work?

  • /^\[/ matches lines starting with a left square bracket, and
  • {ofn=$2 ".txt"} sets a variable to the second whitespace-delimited word as our output file name. Then,
  • ofn is a condition that evaluates to true if the variable is set (thus causing lines before your first header to be ignored)
  • {print > ofn} redirects the current line to the specified file.
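The steps in this list can be mirrored in a small Python sketch. The function names route_lines and write_routed are hypothetical, and the writer assumes each header name appears only once (as in the question), since reopening a name would truncate its file. Only one output file is open at a time, which matters when there are hundreds of thousands of entries.

```python
import os

def route_lines(lines):
    """Yield (filename, line) for every line at or after the first header."""
    ofn = None
    for line in lines:
        if line.startswith("["):
            # Second whitespace-delimited word, like awk's $2.
            ofn = line.split()[1] + ".txt"
        if ofn:  # lines before the first header are skipped
            yield ofn, line

def write_routed(pairs, out_dir="."):
    """Write each line to its file, keeping a single file open at a time."""
    current, out = None, None
    try:
        for name, line in pairs:
            if name != current:
                if out:
                    out.close()
                out = open(os.path.join(out_dir, name), "w")
                current = name
            out.write(line)
    finally:
        if out:
            out.close()
```

A possible call would be write_routed(route_lines(f)) with f being the opened input.txt.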

Note that all of the spaces in this awk script can be removed, if compactness makes you happy.

Note also that the above script really needs the section headers to have spaces around and not within them. If you wanted to be able to handle section headers like [foo] and [ this that ], you’d need ever so slightly more code:

awk '/^\[/ {h=$0; sub(/^\[ */,"",h); sub(/ *\] *$/,"",h); ofn=h ".txt"} ofn {print > ofn}' input.txt

This uses awk’s sub() function to strip leading and trailing square-brackets-plus-whitespace. Note that sub() edits the text as a whole rather than rebuilding it from its fields, so whitespace inside the header is kept exactly as written (i.e. [ this that ] is saved to "this that.txt"). If you would rather collapse runs of whitespace in your output filenames, you can do so with an extra gsub() call.
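As a rough illustration of the stripping step, here is the same pair of substitutions in Python. The helper name header_name is invented for this sketch; it returns None for lines that are not headers and leaves whitespace inside the name untouched.

```python
import re

def header_name(line):
    """Return the bare section name for a header line, or None otherwise."""
    if not line.startswith("["):
        return None
    name = line.rstrip("\n")
    name = re.sub(r"^\[ *", "", name)    # strip "[" plus following spaces
    name = re.sub(r" *\] *$", "", name)  # strip spaces, "]" and trailing spaces
    return name
```

With this, both [foo] and [ this that ] yield usable names ("foo" and "this that").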

Method 5

It can be done from the command line in python as:

python3 -c 'out = None
with open("entry.txt") as f:
    for line in f:
        if line[0] == "[":
            if out: out.close()
            out = open(line.split()[1] + ".txt", "w")
        if out: out.write(line)  # header lines are written too
if out: out.close()'

Method 6

This is a somewhat crude, but easily understood way to do it:
Use grep -n '^\[ entry' FILENAME to get the line numbers of the [ entry ] headers to split at.
Then use a combination of head and tail to extract the right pieces.

Like I said, it isn’t pretty, but it is easy to comprehend.
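The idea (find the header line numbers first, then cut out the ranges between them) can be sketched in Python as follows. The function names are hypothetical, and the whole input is assumed to fit in memory as a list of lines.

```python
def header_line_numbers(lines):
    """1-based numbers of the lines that start a new entry."""
    return [i for i, line in enumerate(lines, start=1)
            if line.startswith("[ entry")]

def cut_pieces(lines):
    """Slice the input into one list of lines per entry."""
    starts = header_line_numbers(lines)
    # Each piece ends just before the next header; the last one ends at EOF.
    ends = starts[1:] + [len(lines) + 1]
    return [lines[s - 1:e - 1] for s, e in zip(starts, ends)]
```

Each returned piece plays the role of one head/tail extraction.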

Method 7

What about using awk with [ as the record separator and space as the field separator? This easily gives us the data to put in the file as $0, where we have to put back the removed leading [, and the file name as $1 (with .txt appended). We then only have to handle the special case of the first record, which is empty. printf is used instead of print so that no extra newline is appended to each record. This gives us:

awk -v "RS=[" -F " " 'NF != 0 {printf "[%s", $0 > ($1 ".txt")}' entry.txt
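For comparison, a minimal Python sketch of the record-separator idea is shown below. The name split_on_bracket is made up, the result is returned as a dictionary instead of being written to files, and the sketch assumes that [ never occurs inside the data lines.

```python
def split_on_bracket(text):
    """Split text on "[" and key each record by its first word."""
    records = {}
    for chunk in text.split("["):
        fields = chunk.split()
        if not fields:
            continue  # the empty record before the first "[" (NF == 0)
        # Put back the "[" that the split removed.
        records[fields[0]] = "[" + chunk
    return records
```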

Method 8

terdon’s answer works for me but I needed to use gawk, not awk. The gawk manual ( search for ‘match(‘ ) explains that the array argument in match() is a gawk extension. Maybe it depends on your Linux install and your awk/nawk/gawk versions but on my Ubuntu machine only gawk ran terdon’s excellent answer:

$ gawk '{if(match($0, /^\[ (.+?) \]/, k)){name=k[1]}} {print >name".txt" }' entry.txt

Method 9

Here’s a perl solution. This script detects the [ entryN ] lines and changes the output file accordingly, but doesn’t validate, parse or process the data in each section; it just prints each input line to the output file.

#! /usr/bin/perl 

# default output file is /dev/null - i.e. dump any input before
# the first [ entryN ] line.

$outfile='/dev/null';
open(OUTFILE,">",$outfile) || die "couldn't open $outfile: $!";

while(<>) {
  # uncomment next two lines to optionally remove comments (starting with
  # '#') and skip blank lines.  Also removes leading and trailing
  # whitespace from each line.
  # s/#.*|^\s*|\s*$//g;
  # next if (/^$/);

  # if line begins with '[', extract the filename
  if (m/^\[/) {
    (undef,$outfile,undef) = split ;
    close(OUTFILE);
    open(OUTFILE,">","$outfile.txt") || die "couldn't open $outfile.txt: $!";
  }
  # print every line, including the header, to the current output file
  print OUTFILE;
}
close(OUTFILE);

Method 10

Hi, I wrote this simple script in Ruby to solve your problem:

#!ruby
# File Name: split.rb

fout = nil

while STDIN.gets
  line = $_
  if line.start_with? '['
    fout.close if fout
    fname = line.split(' ')[1] + '.txt'
    fout = File.new fname,'w'
  end
  fout.write line if fout
end

fout.close if fout

You can use it this way:

ruby split.rb < entry.txt

I have tested it, and it works fine.

Method 11

I prefer the csplit option but as an alternative here’s a GNU awk solution:

parse.awk

BEGIN { 
  RS="\\[ entry[0-9]+ \\]\n"  # Record separator
  ORS=""                      # Reduce whitespace on output
}
NR == 1 { f=RT }              # Entries are off-by-one relative to matched RS
NR  > 1 {
  split(f, a, " ")            # Assuming entries do not have spaces 
  print f  > a[2] ".txt"      # a[2] now holds the bare entry name
  print   >> a[2] ".txt"
  f = RT                      # Remember next entry name
}

Run it like this:

gawk -f parse.awk entry.txt
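The pairing of each matched separator with the record that follows it can also be illustrated in Python. This is a sketch under the same assumption as the awk script (entry names contain no spaces); SEP and split_records are names invented for the example, and the text before the first header is discarded.

```python
import re

# Header line used as the record separator, including its newline.
SEP = re.compile(r"\[ entry[0-9]+ \]\n")

def split_records(text):
    """Return {entry_name: header + body} for each header found in text."""
    headers = SEP.findall(text)
    # split() yields the text before the first header, then one body per header.
    bodies = SEP.split(text)[1:]
    return {h.split()[1]: h + b for h, b in zip(headers, bodies)}
```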


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.