I have a text file called entry.txt that contains the following:
[ entry1 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3633 3634 3636 3690 3691 3693 3766 3767 3769 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5628 5629 5631 [ entry2 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3690 3691 3693 3766 3767 3769 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5628 5629 5631 [ entry3 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3690 3691 3693 3766 3767 3769 4241 4242 4244 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5495 5496 5498 5628 5629 5631
I would like to split it into three text files: entry1.txt, entry2.txt, entry3.txt. Their contents are as follows.
entry1.txt:
[ entry1 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3633 3634 3636 3690 3691 3693 3766 3767 3769 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5628 5629 5631
entry2.txt:
[ entry2 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3690 3691 3693 3766 3767 3769 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5628 5629 5631
entry3.txt:
[ entry3 ] 1239 1240 1242 1391 1392 1394 1486 1487 1489 1600 1601 1603 1657 1658 1660 2075 2076 2078 2322 2323 2325 2740 2741 2743 3082 3083 3085 3291 3292 3294 3481 3482 3484 3690 3691 3693 3766 3767 3769 4241 4242 4244 4526 4527 4529 4583 4584 4586 4773 4774 4776 5153 5154 5156 5495 5496 5498 5628 5629 5631
In other words, the [ character indicates a new file should begin. The entries ([ entry*], where * is an integer) are always in numerical order and are consecutive integers starting from 1 to N (in my actual input file, N = 200001).
Is there a way to split the file automatically in bash? My actual entry.txt contains 200,001 entries.
Answers:
Method 1
With csplit from GNU coreutils (non-embedded Linux, Cygwin):
csplit -f entry -b '%d.txt' entry.txt '/^\[ .* \]$/' '{*}'
You’ll end up with an extra empty file entry0.txt (containing the part before the first header).
Standard csplit lacks the {*} indefinite repeater and the -b option to specify the suffix format, so on other systems you’ll have to count the number of sections first and rename the output files afterwards.
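# the pattern itself matches the first header, so repeat it N-1 more times
csplit -f entry -n 9 entry.txt '/^\[ .* \]$/' "{$(($(grep -c '^\[ .* \]$' entry.txt) - 1))}"
for x in entry?????????; do
  n=${x#entry}               # zero-padded suffix, e.g. 000000042
  y=$((1$n - 1000000000))    # the leading 1 avoids octal parsing
  mv "$x" "entry$y.txt"
done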
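The rename step relies on a small arithmetic trick: putting a 1 in front of the zero-padded number keeps the shell from reading it as octal, and subtracting 10^9 gives back the plain number. Here is a minimal Python sketch of the same mapping. The function name plain_name is made up for illustration, and the nine-digit width is assumed from the -n 9 option above.

```python
def plain_name(padded_name, prefix="entry", width=9):
    """Map a csplit-style name such as 'entry000000042' to 'entry42.txt'.

    Mirrors the shell arithmetic $((1<digits> - 1000000000)).
    `prefix` and `width` are assumptions taken from the csplit options.
    """
    digits = padded_name[len(prefix):]
    if not padded_name.startswith(prefix) or len(digits) != width or not digits.isdigit():
        raise ValueError(f"unexpected csplit output name: {padded_name!r}")
    number = int("1" + digits) - 10 ** width
    return f"{prefix}{number}.txt"
```

In Python the leading 1 is not strictly needed, since int("000000042") is already 42; it is kept here only to show what the shell expression computes.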
Method 2
And here’s a nice, simple, gawk one-liner :
$ gawk '/^\[/{match($0, /^\[ (.+?) \]/, k)} {print >k[1]".txt" }' entry.txt
This will work for any file size, irrespective of the number of lines in each entry, as long as each entry header looks like [ blahblah blah blah ]. Notice the space just after the opening [ and just before the closing ].
EXPLANATION:
awk and gawk read an input file line by line. As each line is read, its contents are saved in the $0 variable. Here, we are telling gawk to match anything within square brackets, and save its match into the array k.
So, every time that regular expression is matched, that is, for every header in your file, k[1] will have the matched region of the line. Namely, “entry1”, “entry2” or “entry3” or “entryN”.
Finally, we print each line into a file called <whatever value k currently has>.txt, ie entry1.txt, entry2.txt … entryN.txt.
This method will be much faster than perl for larger files.
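If you don't have gawk, the same line-by-line idea can be sketched in Python. This is only an illustration, not a port of the one-liner: the name group_by_header is made up, the header pattern assumes the [ name ] shape described above, and lines before the first header are dropped rather than written anywhere.

```python
import re

# Assumed header shape: "[ name ]" at the start of a line.
HEADER = re.compile(r"^\[ (.+) \]")

def group_by_header(lines):
    """Yield (filename, lines) pairs, one pair per header found in `lines`.

    Lines that appear before the first header are dropped.
    """
    name, chunk = None, []
    for line in lines:
        m = HEADER.match(line)
        if m:
            # A new header closes the previous section, if there was one.
            if name is not None:
                yield name, chunk
            name, chunk = m.group(1) + ".txt", []
        if name is not None:
            chunk.append(line)
    if name is not None:
        yield name, chunk
```

Each pair can then be written with something like open(filename, "w").writelines(chunk) while iterating over open("entry.txt"), so only one section sits in memory at a time.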
Method 3
In perl it can be done much more simply:
perl -ne 'open(F, ">", ($1).".txt") if /\[ (entry\d+) \]/; print F;' file
Method 4
Here’s a short awk one-liner:
awk '/^\[/ {ofn=$2 ".txt"} ofn {print > ofn}' input.txt
How does this work?
/^\[/ matches lines starting with a left square bracket, and {ofn=$2 ".txt"} sets a variable to the second whitespace-delimited word as our output file name. Then, ofn is a condition that evaluates to true if the variable is set (thus causing lines before your first header to be ignored), and {print > ofn} redirects the current line to the specified file.
Note that all of the spaces in this awk script can be removed, if compactness makes you happy.
Note also that the above script really needs the section headers to have spaces around and not within them. If you wanted to be able to handle section headers like [foo] and [ this that ], you’d need ever so slightly more code:
awk '/^\[/ {sub(/^\[ */,""); sub(/ *\] *$/,""); ofn=$0 ".txt"} ofn {print > ofn}' input.txt
This uses awk’s sub() function to strip leading and trailing square-brackets-plus-whitespace. Note that sub() edits $0 in place, so whitespace inside the name is kept exactly as written (i.e. [ this that ] is saved to "this that.txt"), and the header line written to the output file is the stripped name rather than the original bracketed line.
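To make the header handling concrete, here is a small Python sketch of just the name-extraction step for headers like [foo] and [ this that ]. The helper name header_to_filename is hypothetical, and the sketch assumes a header occupies the whole line; whitespace inside the name is left untouched.

```python
import re

# Assumed header shape: optional spaces just inside the brackets,
# nothing but whitespace after the closing bracket.
_HEADER = re.compile(r"^\[\s*(.*?)\s*\]\s*$")

def header_to_filename(line):
    """Return '<name>.txt' for a header line, or None for any other line."""
    m = _HEADER.match(line)
    if not m or not m.group(1):
        return None
    return m.group(1) + ".txt"
```

Returning None for non-header lines lets the caller keep writing to the current file, much like the ofn condition in the awk version.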
Method 5
It can be done from the command line in python as:
paddy$ python3 -c 'out = None
> with open("entry.txt") as f:
>     for line in f:
>         if line[0] == "[":
>             if out: out.close()
>             out = open(line.split()[1] + ".txt", "w")
>         if out: out.write(line)
> if out: out.close()'
Method 6
This is a somewhat crude, but easily understood way to do it:
Use grep -n '^\[ entry' FILENAME to get the line numbers of the [ entryN ] headers to split at.
Use a combination of head and tail to get the right pieces.
Like I said; it isn’t pretty, but is easy to comprehend.
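Since no code was given for this method, here is a rough Python sketch of the same two steps: find the header line numbers, then cut the file between them. The function names are made up for illustration, and the whole file is assumed to fit in memory as a list of lines.

```python
def header_line_numbers(lines, marker="[ entry"):
    """Return 1-based numbers of lines starting with `marker` (what grep -n reports)."""
    return [i for i, line in enumerate(lines, start=1) if line.startswith(marker)]

def slice_sections(lines):
    """Cut `lines` at each header line number (the head/tail step).

    Each section runs from its own header up to the line before the next one;
    the last section runs to the end of the input.
    """
    starts = header_line_numbers(lines)
    ends = starts[1:] + [len(lines) + 1]
    return [lines[s - 1:e - 1] for s, e in zip(starts, ends)]
```

Running head and tail once per section in a shell loop re-reads the file every time, so for 200,001 entries a single pass like the other methods is likely to be much faster.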
Method 7
What about using awk with [ as the record separator and space as the field separator? This easily gives us the data to put in the file as $0, where we have to put back the removed leading [, and the file name as $1 (plus a .txt extension). We then only have to handle the special case of the first record, which is empty. This gives us:
awk -v "RS=[" -F " " 'NF != 0 {print "[" $0 > ($1 ".txt")}' entry.txt
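The record-separator idea can also be sketched in Python by splitting the whole text on [. This is an illustration only: split_on_bracket is a made-up name, it reads the entire input into memory, and it assumes [ never appears inside the data.

```python
def split_on_bracket(text):
    """Split `text` on '[' and map each record's first word to its content.

    The '[' removed by the split is put back at the front of each record.
    """
    result = {}
    for record in text.split("["):
        fields = record.split()
        if not fields:  # the empty record before the first '['
            continue
        result[fields[0]] = "[" + record
    return result
```

Each key is the bare entry name, so writing the files is a matter of looping over the dictionary and adding whatever extension you want.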
Method 8
terdon’s answer works for me but I needed to use gawk, not awk. The gawk manual ( search for ‘match(‘ ) explains that the array argument in match() is a gawk extension. Maybe it depends on your Linux install and your awk/nawk/gawk versions but on my Ubuntu machine only gawk ran terdon’s excellent answer:
$ gawk '{if(match($0, /^\[ (.+?) \]/, k)){name=k[1]}} {print >name".txt" }' entry.txt
Method 9
Here’s a perl solution. This script detects the [ entryN ] lines and changes the output file accordingly. It doesn’t validate, parse or process the data in each section; it just prints each input line to the current output file.
#! /usr/bin/perl
# default output file is /dev/null - i.e. dump any input before
# the first [ entryN ] line.
$outfile='/dev/null';
open(OUTFILE,">",$outfile) || die "couldn't open $outfile: $!";
while(<>) {
  # uncomment next two lines to optionally remove comments (starting with
  # '#') and skip blank lines. Also removes leading and trailing
  # whitespace from each line.
  # s/#.*|^\s*|\s*$//g;
  # next if (/^$/);
  # if line begins with '[', extract the filename and switch output files
  if (m/^\[/) {
    (undef,$outfile,undef) = split ;
    close(OUTFILE);
    open(OUTFILE,">","$outfile.txt") || die "couldn't open $outfile.txt: $!";
  }
  # print every line, header included, to the current output file
  print OUTFILE;
}
close(OUTFILE);
Method 10
Hi, I wrote this simple Ruby script to solve your problem:
#!/usr/bin/env ruby
# File Name: split.rb
fout = nil
while STDIN.gets
  line = $_
  if line.start_with? '['
    fout.close if fout
    fname = line.split(' ')[1] + '.txt'
    fout = File.new fname,'w'
  end
  fout.write line if fout
end
fout.close if fout
You can use it this way:
ruby split.rb < entry.txt
I have tested it, and it works fine.
Method 11
I prefer the csplit option but as an alternative here’s a GNU awk solution:
parse.awk
BEGIN {
  RS="\\[ entry[0-9]+ \\]\n"  # Record separator
  ORS=""                      # Reduce whitespace on output
}
NR == 1 { f=RT }              # Entries are off-by-one relative to matched RS
NR > 1 {
  split(f, a, " ")            # Assuming entries do not have spaces
  print f > a[2] ".txt"       # a[2] now holds the bare entry name
  print >> a[2] ".txt"
  f = RT                      # Remember next entry name
}
Run it like this:
gawk -f parse.awk entry.txt
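The off-by-one bookkeeping with RT is easier to see in a Python sketch that splits on the same header pattern while keeping the matched separators. This is an illustration, not a translation of parse.awk: pair_headers_with_bodies is a made-up name and the whole input is assumed to be held in one string.

```python
import re

# Same separator as the awk RS; the capturing group makes re.split keep each match.
SEP = re.compile(r"(\[ entry[0-9]+ \]\n)")

def pair_headers_with_bodies(text):
    """Return (header, body) pairs for every header found in `text`."""
    parts = SEP.split(text)
    # parts == [text_before_first_header, header1, body1, header2, body2, ...]
    return [(parts[i], parts[i + 1]) for i in range(1, len(parts) - 1, 2)]
```

The text before the first header sits at index 0 and is skipped, which corresponds to the NR == 1 case in the awk script.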
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.