How do I display all the characters between two specific strings?

I want to display all the characters in a file between the strings “xxx” and “yyy” (the quotes are not part of the delimiters). How can I do that? For example, given the input “Hello world xxx this is a file yyy”, the output should be “ this is a file ”.

Answers:


Method 1

You can use a capture group in sed's substitute command as follows:

echo "Hello world xxx this is a file yyy" | sed 's/.*xxx\(.*\)yyy/\1/'

So .*xxx matches everything from the beginning of the line up to and including xxx.

\1 is a back-reference: it stands for whatever the group \(.*\) matched, that is, everything after xxx up to, but not including, yyy.

Because the matched text is replaced by \1, the remembered string is what gets printed.
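
For comparison, the same capture-group idea can be sketched in Python. This is only an illustration and not part of the original answer: the helper name `between` is made up here, and it assumes both delimiters sit on the same line.

```python
import re

def between(line, start="xxx", end="yyy"):
    """Return the text between start and end on one line, or None."""
    # re.escape keeps the delimiters literal; (.*) is the remembered group.
    match = re.search(re.escape(start) + r"(.*)" + re.escape(end), line)
    return match.group(1) if match else None

print(repr(between("Hello world xxx this is a file yyy")))
# prints ' this is a file '
```

As with the sed command, the group is greedy, so it runs up to the last yyy on the line.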

Method 2

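This should do what you are trying to do:

sed -e 's/xxx\(.*\)yyy/\1/'

This assumes both delimiter strings are on the same line. Note that it only strips the delimiters themselves: any text before xxx or after yyy on that line is left in the output.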
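
The effect of leaving out the leading .* is easy to check with a small Python sketch. This is an illustration rather than part of the answer, and the variable names are arbitrary; `re.sub` plays the role of sed's s command here.

```python
import re

line = "Hello world xxx this is a file yyy"

# Unanchored: only the "xxx...yyy" span is replaced, so the prefix survives.
unanchored = re.sub(r"xxx(.*)yyy", r"\1", line)

# With .* on both sides the whole line is replaced by the group.
anchored = re.sub(r".*xxx(.*)yyy.*", r"\1", line)

print(repr(unanchored))  # 'Hello world  this is a file '
print(repr(anchored))    # ' this is a file '
```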

Method 3

The question is only interesting if the delimiters are not necessarily on the same line. It can be done several ways (even with sed), but awk is more flexible:

    #!/bin/sh
    awk '
    BEGIN { found = 0; }
    /xxx/ {
        if (!found) {
            found = 1;
            $0 = substr($0, index($0, "xxx") + 3);
        }
    }
    /yyy/ {
        if (found) {
            found = 2;
            $0 = substr($0, 1, index($0, "yyy") - 1);
        }
    }
    {
        if (found) {
            print;
            if (found == 2)
                found = 0;
        }
    }
    '

This is tested lightly for the cases where at most one substring is on a line, using this data:

    this is xxx yy
    first
    second yyy

    xxx.x
    yyy

    xxx#yyy

and this output (script is “foo”, data is “foo.in”):

    $ cat foo.in|./foo
     yy
    first
    second 
    .x

    #

The way it works is that each input line is held in $0, and awk tests the patterns xxx and yyy in sequence, so more than one rule can change $0 on its way to the last rule, where it is printed.
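
The same line-by-line state machine can be sketched in Python, which may make the flow easier to follow. This is a loose port for illustration, not the author's code: the function name `extract_between` and its parameters are invented here, and like the awk script it only looks at the first occurrence of each delimiter on a line.

```python
def extract_between(lines, start="xxx", end="yyy"):
    """Yield the pieces of each line that lie between start and end."""
    found = False
    for line in lines:
        line = line.rstrip("\n")
        if not found and start in line:
            found = True
            # Keep only what follows the first start delimiter.
            line = line[line.index(start) + len(start):]
        if found:
            if end in line:
                # Cut at the first end delimiter and leave the region.
                line = line[:line.index(end)]
                found = False
            yield line

data = ["this is xxx yy", "first", "second yyy", "",
        "xxx.x", "yyy", "", "xxx#yyy"]
for piece in extract_between(data):
    print(piece)
```

Run on the sample data above, it produces the same six output lines as the awk script.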

By the way, this example would not work for

xxxxHelloyyyxxxWorldyyy

since it checks only the first match. The Perl script will give different results, since it uses a greedy match rather than the index/substr which I used in the awk example. Perl, of course, can do the same — with a script.
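
To see how greedy and non-greedy matching differ on that input, here is a short Python sketch. It is an illustration only and does not correspond to any script in the answers; it assumes the delimiters are on a single line.

```python
import re

text = "xxxxHelloyyyxxxWorldyyy"

# Greedy: one match from the first xxx to the last yyy.
greedy = re.findall(r"xxx(.*)yyy", text)

# Non-greedy: each match stops at the nearest following yyy.
lazy = re.findall(r"xxx(.*?)yyy", text)

print(greedy)  # ['xHelloyyyxxxWorld']
print(lazy)    # ['xHello', 'World']
```

The leading x in the first result comes from the fourth x in the input, which follows the first xxx delimiter.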

Awk (like Perl) is free-format, so one could express the command as something like

awk 'BEGIN{found=0;}/xxx/{if(!found){found=1;$0=substr($0,index($0, "xxx")+3);}}/yyy/{if(found){found=2;$0=substr($0,1,index($0,"yyy")-1);}}{ if(found){print;if(found==2)found=0;}}'

but that is rarely done, except for the sake of example. Likewise, sed scripts (which are line-oriented) can be combined into a single line with some restrictions. Again, complex scripts in sed are rarely dealt with in that manner. Rather, they are treated like real programs.


Method 4

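Here is a solution with Python:

import re
import sys

# Read the whole file so a match may span several lines.
with open(sys.argv[1]) as f:
    text = f.read()

# (?:.|\n) matches any character, including a newline.
reg = re.compile(r"xxx((?:.|\n)*)yyy")
for match in reg.finditer(text):
    print(match.group(1))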

Save this script as a file “post.py” and launch it with:

python post.py your_file_to_search_in.txt

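The script compiles a regular expression and prints every match found in the text of the file. Because * is greedy, a file with several xxx/yyy pairs yields a single match running from the first xxx to the last yyy.

(?:.|\n) is a non-capturing group matching any character, including a newline.

Edit: solution improved thanks to 1_CR's tips: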

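import re
import sys

# Read the whole file so a match may span several lines.
with open(sys.argv[1]) as f:
    text = f.read()

# re.DOTALL lets "." match a newline as well.
reg = re.compile(r"xxx(.*)yyy", re.DOTALL)
for match in reg.finditer(text):
    print(match.group(1))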
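
The role of re.DOTALL can be checked on a small in-memory string instead of a file. The sample text below is made up for illustration; the sketch shows that without the flag the pattern cannot cross a newline, and that the flag and the (?:.|\n) alternation capture the same text.

```python
import re

text = "before xxx line one\nline two yyy after"

plain = re.search(r"xxx(.*)yyy", text)              # "." stops at "\n"
dotall = re.search(r"xxx(.*)yyy", text, re.DOTALL)  # "." also matches "\n"
alternation = re.search(r"xxx((?:.|\n)*)yyy", text)

print(plain)                                     # None
print(repr(dotall.group(1)))                     # ' line one\nline two '
print(dotall.group(1) == alternation.group(1))   # True
```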

Method 5

A solution that also works when xxx and yyy are not on the same line:

cat /tmp/xxx-to-yyy| perl -ne '(/xxx/../yyy/) && print' | perl -pe 's/.*(xxx.*)/$1/' | perl -pe 's/(.*yyy).*/$1/'

Not exactly pretty…

The -e switch to perl just supplies the script on the command line.
The -n and -p switches make perl loop over the input lines; with -p each line is printed after the script runs, with -n it isn't. So basically this just sends the file through three perl loops.

.. is the range operator: it returns false until the left condition becomes true, then stays true until the right condition becomes true, after which it is false again. So the first loop cuts the file down to the lines between the two strings (both included). The last two perl commands remove the text before xxx and after yyy.
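
The three-stage pipeline can be mimicked in Python to make the range logic explicit. This is a rough sketch under the assumption that lines arrive without their trailing newline; the function name `range_and_trim` is hypothetical, and like the Perl version it keeps the delimiters themselves in the output.

```python
import re

def range_and_trim(lines, start="xxx", end="yyy"):
    """Keep lines from a start line through an end line, then trim them."""
    inside = False
    for line in lines:
        if not inside and start in line:
            inside = True
        if not inside:
            continue
        # Like Perl's two-dot "..", the end test may fire on the start line.
        if end in line:
            inside = False
        # Stage 2: drop text before the delimiter on the start line.
        line = re.sub(r".*(" + re.escape(start) + r".*)", r"\1", line, count=1)
        # Stage 3: drop text after the delimiter on the end line.
        line = re.sub(r"(.*" + re.escape(end) + r").*", r"\1", line, count=1)
        yield line

sample = ["a", "b xxx c", "d", "e yyy f", "g"]
print(list(range_and_trim(sample)))
# ['xxx c', 'd', 'e yyy']
```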


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
