Splitting text files based on a regular expression

I have a text file which I want to split into 64 unequal parts, according to the 64 hexagrams of the Yi Jing. Since the passage for each hexagram begins with some digit(s), a period, and two newlines, the regex should be pretty easy to write.

But how do I actually split the text file into 64 new files according to this regex? It seems like more of a task for perl. But maybe there’s a more obvious way that I’m just totally missing.

Answers:

Method 1

This would be a job for csplit, except that its regex can only match within a single line. That limitation also makes sed awkward; I’d go with Perl or Python.

You could see if

csplit foo.txt '/^[0-9][0-9]*\.$/' '{64}'

is good enough for your purposes. (csplit requires a POSIX BRE, so it can’t use \d or +, among others.)
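Since the separator spans two lines, a whole-file regex in Python is one way around the single-line limit. The following is only a minimal sketch, assuming the header is a line of digits plus a period followed by an empty line, as the question describes; the names split_sections and write_sections and the hexagram_ file prefix are made up for illustration.

```python
import re

# Header: a line holding only digits and a period, followed by an empty line.
HEADER = re.compile(r"^(\d+)\.\n\n", re.MULTILINE)

def split_sections(text):
    """Return {number: section text}, each section starting at its header.

    Anything before the first header is dropped.
    """
    matches = list(HEADER.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs up to the next header, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[int(m.group(1))] = text[m.start():end]
    return sections

def write_sections(path, prefix="hexagram_"):
    """Write each section of the file at `path` to its own numbered file."""
    with open(path, encoding="utf-8") as f:
        sections = split_sections(f.read())
    for number, body in sections.items():
        with open(f"{prefix}{number:02d}.txt", "w", encoding="utf-8") as out:
            out.write(body)
```

Calling write_sections("foo.txt") would then produce hexagram_01.txt, hexagram_02.txt and so on in the current directory.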

Method 2

I think the best tools for this are awk and gawk.

awk

awk -F "([.] )|( / )" '/^[0-9]{1,3}[.]/{x="F"$1"("$2").txt";}{print >x;}' I_Ching_Wilhelm_Translation.txt

-F specifies the field separator for each line. It is a regex, and here it matches either of two separators: ". " and " / ". Thus a line like 1. Ch'ien / The Creative is split into 3 fields: 1, Ch'ien and The Creative. We can then refer to these fields as $1, $2 and $3. $0 is the entire line.

We then tell awk to match lines against the pattern ^[0-9]{1,3}[.] and, when a line matches, to assign a new value to x. The value of x is used as the file name for the print operation. In this example we use "F"$1"("$2").txt", so the line 1. Ch'ien / The Creative gives the file name F1(Ch'ien).txt
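To make the field splitting concrete, here is a rough Python equivalent of the -F step and the file name construction. It is an illustrative sketch, not part of the awk answer; header_filename is a hypothetical helper name, and a missing second field is treated as an empty string, as awk would.

```python
import re

# Same two separators as the awk -F argument: ". " or " / ".
FIELD_SEP = re.compile(r"[.] | / ")

def header_filename(line):
    """Return the output file name for a header line, or None for other lines."""
    if not re.match(r"[0-9]{1,3}[.]", line):
        return None
    fields = FIELD_SEP.split(line)
    # Pad so that a missing second field becomes "", like an unset $2 in awk.
    first, second = (fields + [""])[:2]
    return f"F{first}({second}).txt"

print(header_filename("1. Ch'ien / The Creative"))  # F1(Ch'ien).txt
```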

gawk

In gawk, we can also access captured groups, so we can simplify the command to:

gawk 'match($0, /^([0-9]{1,3})[.] (.*) \/ (.*)$/, ary){x="F"ary[1]"("ary[2]").txt";}{print >x;}' I_Ching_Wilhelm_Translation.txt

Here we use match to capture the groups and store them in the array ary. $0 is the entire line. ary[0] is everything that matched, and ary[1...n] are the individual groups.
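The same capture-group idea can be sketched in Python, where m.group(0) plays the role of ary[0] and m.group(1), m.group(2) correspond to ary[1], ary[2]. This is an illustration under assumptions: group_by_header is a made-up name, it collects lines in memory instead of writing files, and unlike the gawk one-liner it skips any lines that come before the first header.

```python
import re

HEADER = re.compile(r"^([0-9]{1,3})[.] (.*) / (.*)$")

def group_by_header(lines):
    """Map each output file name to the list of lines that belong in it."""
    files = {}
    current = None  # no file yet: lines before the first header are skipped
    for line in lines:
        m = HEADER.match(line.rstrip("\n"))
        if m:
            # group(1) is the number, group(2) the transliterated name.
            current = f"F{m.group(1)}({m.group(2)}).txt"
        if current is not None:
            files.setdefault(current, []).append(line)
    return files
```

Writing the result out is then a loop over files.items() that opens each name and calls writelines on the collected lines.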

perl

We can also do it with perl:

perl -ne 'if(/^([0-9]{1,3})[.] (.*) \/ (.*)$/) {close F; open F, ">", sprintf("F$1($2).txt");} print F' I_Ching_Wilhelm_Translation.txt

Results:

> ls F*
F10(Lü).txt         F22(Pi).txt       F34(Ta Chuang).txt  F46(Shêng).txt     F58(Tui).txt
F11(T'ai).txt       F23(Po).txt       F35(Chin).txt       F47(K'un).txt      F59(Huan).txt
F12(P'i).txt        F24(Fu).txt       F36(Ming I).txt     F48(Ching).txt     F5(Hsü).txt
F13(T'ung Jên).txt  F25(Wu Wang).txt  F37(Chia Jên).txt   F49(Ko).txt        F60(Chieh).txt
F14(Ta Yu).txt      F26(Ta Ch'u).txt  F38(K'uei).txt      F4(Mêng).txt       F61(Chung Fu).txt
F15(Ch'ien).txt     F27(I).txt        F39(Chien).txt      F50(Ting).txt      F62(Hsiao Kuo).txt
F16(Yü).txt         F28(Ta Kuo).txt   F3(Chun).txt        F51(Chên).txt      F63(Chi Chi).txt
F17(Sui).txt        F29(K'an).txt     F40(Hsieh).txt      F52(Kên).txt       F64(Wei Chi).txt
F18(Ku).txt         F2(K'un).txt      F41(Sun).txt        F53(Chien).txt     F6(Sung).txt
F19(Lin).txt        F30(Li).txt       F42(I).txt          F54(Kuei Mei).txt  F7(Shih).txt
F1(Ch'ien).txt      F31(Hsien).txt    F43(Kuai).txt       F55(Fêng).txt      F8(Pi).txt
F20(Kuan).txt       F32(Hêng).txt     F44(Kou).txt        F56(Lü).txt        F9(Hsiao Ch'u).txt
F21(Shih Ho).txt    F33(TUN).txt      F45(Ts'ui).txt      F57(Sun).txt

how to get the example file:

curl http://www2.unipr.it/~deyoung/I_Ching_Wilhelm_Translation.html|html2text -o I_Ching_Wilhelm_Translation.plain
sed 's|^[[:blank:]]*||g' I_Ching_Wilhelm_Translation.plain > I_Ching_Wilhelm_Translation.txt

Method 3

With GNU coreutils, you can use csplit to break a file into regexp-delimited pieces, as shown by geekosaur.

Here’s a portable awk script to break a file into pieces. It works by

calling getline to deal with the multiline (2-line) separator;
setting a variable outfile to the name of the file to print to, when a section header is encountered.

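BEGIN {outfile = "header.txt"}
{
    # A candidate header: digits followed by a literal period.
    while (/^[0-9]+\.$/) {
        prev = $0
        # Read the next line; stop cleanly if the file ends here.
        if ((getline) <= 0) {print prev >outfile; exit}
        # Only a header followed by an empty line starts a new file.
        if ($0 == "") outfile = prev "txt"
        print prev >outfile
    }
    print >outfile
}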

Method 4

On MacOS, split has a -p parameter that takes a regexp. Each time split encounters a line matching the regexp, a new output file is opened, starting with that line.

The MacOS manpage I have is dated Aug 21st 2005.
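For systems without that option, the behaviour described above (start a new piece at every matching line, with the matching line as its first line) can be approximated in a few lines of Python. This is a sketch of the described behaviour, not of split's implementation; split_on_pattern is a hypothetical name.

```python
import re

def split_on_pattern(lines, pattern):
    """Yield lists of lines; a new list begins at each line matching pattern."""
    regex = re.compile(pattern)
    chunk = []
    for line in lines:
        # A matching line closes the current chunk and opens the next one.
        if regex.search(line) and chunk:
            yield chunk
            chunk = []
        chunk.append(line)
    if chunk:
        yield chunk
```

Each yielded chunk can then be written to its own file, for example by enumerating the chunks and naming the files after the index.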

Method 5

If your reason for splitting is to process each block with a different command, GNU Parallel may be the tool of choice:

cat I_Ching_Wilhelm_Translation.txt |
  parallel -N1 --pipe --regexp --recend '\n' --recstart '[0-9]{1,3}[.] [^\n]+ /' 'cat > {#}'

Here you can replace cat > {#} with the command to run.

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.