How to delete the words between two delimiters?

I have some noisy data, something like:

<@ """@$ FSDF >something something <more noise>

Now I just want to extract "something something".
Is there a way to delete the text between the two delimiters "<" and ">"?

Answers:


Method 1

Use regular expressions:

>>> import re
>>> s = '<@ """@$ FSDF >something something <more noise>'
>>> re.sub('<[^>]+>', '', s)
'something something '

[Update]
If you tried a pattern like <.+>, where the dot means any character and the plus sign means one or more, you will have found that it does not work:

>>> re.sub(r'<.+>', '', s)
''

Why? It happens because regular expressions are “greedy” by default: .+ matches as much as it can, so the match runs from the first < to the last > in the string, swallowing every > in between. That is not what we want. We want to match < and stop at the next >, so we use the [^x] pattern, which means “any character but x” (x being >).

Placing ? after a quantifier makes it “non-greedy”, so this has the same effect:

>>> re.sub(r'<.+?>', '', s)
'something something '

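The previous pattern is more explicit, this one is less typing. Be aware that ? has a second meaning: after an ordinary item, x? means zero or one occurrence of x; it only means non-greedy when it follows a quantifier such as + or *.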
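
To make the difference concrete, here is a small sketch comparing the three patterns on the string from the question. The second string, multi, is a made-up input (not from the question) that shows one case where the two working patterns differ: by default the dot does not match a newline, so <.+?> needs re.DOTALL when the noise spans several lines, while [^>] matches newlines without any flag.

```python
import re

s = '<@ """@$ FSDF >something something <more noise>'

greedy = re.sub(r'<.+>', '', s)       # one match, from the first < to the last >
lazy = re.sub(r'<.+?>', '', s)        # each match stops at the next >
negated = re.sub(r'<[^>]+>', '', s)   # same result as the lazy pattern here

# Hypothetical input where the delimited text contains a newline.
multi = 'a<x\ny>b'
lazy_multi = re.sub(r'<.+?>', '', multi)                    # dot stops at \n: no match
lazy_dotall = re.sub(r'<.+?>', '', multi, flags=re.DOTALL)  # dot now matches \n
negated_multi = re.sub(r'<[^>]+>', '', multi)               # [^>] matches \n anyway

print(repr(greedy), repr(lazy), repr(negated))
print(repr(lazy_multi), repr(lazy_dotall), repr(negated_multi))
```

If you also want to get rid of the trailing space, call .strip() on the result.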

Method 2

Of course, you can use regular expressions.

import re
s = '...'  # your string here
t = re.sub('<.*?>', '', s)

The above code should do it.

Method 3

First, thank you Paulo Scardine; I used your regular expression to do something great. The idea was to get a tag-free LibreOffice po file for printing purposes. I wrote the following script, which strips the tags from the help file to produce a smaller, easier-to-read one.

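import re

# Read the whole input file; "with" closes it automatically.
with open('a.csv') as f:
    text = f.read()

# Replace each <...> tag with a space so neighbouring words stay separated.
clean = re.sub('<[^>]+>', ' ', text)

with open('b.csv', 'w') as f:
    f.write(clean)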

Method 4

import re
my_str = '<@ """@$ FSDF >something something <more noise>'
re.sub('<.*?>', '', my_str)
# returns 'something something '

The re.sub function takes a regular expression and replaces all the matches in the string with the second parameter. In this case, we are searching for all characters between < and > ('<.*?>') and replacing them with nothing ('').

The ? after the * makes the search non-greedy, so each match stops at the first > it finds.
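
If your delimiters are not always < and >, the same idea can be wrapped in a small helper. This is only a sketch: remove_between is a name I made up, re.escape is there so that delimiters such as [ or ( are treated literally, and re.DOTALL reflects my assumption that the noise may span several lines.

```python
import re

def remove_between(text, start, end):
    """Delete every run that begins with `start` and ends at the next `end`.

    Hypothetical helper: the delimiters themselves are removed as well.
    """
    # Escape the delimiters so regex metacharacters like [ are taken literally.
    pattern = re.escape(start) + r'.*?' + re.escape(end)
    # DOTALL (an assumption) lets the delimited text contain newlines.
    return re.sub(pattern, '', text, flags=re.DOTALL)

s = '<@ """@$ FSDF >something something <more noise>'
cleaned = remove_between(s, '<', '>').strip()   # strip() drops the trailing space
bracketed = remove_between('keep [drop] this', '[', ']')

print(repr(cleaned))
print(repr(bracketed))
```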

More about the re module.


If the “noise” is actually HTML tags, I suggest you look into BeautifulSoup.
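
BeautifulSoup is a third-party package, so as a rough illustration of the same idea (let a parser find the tags instead of a regex), here is a sketch that uses html.parser from the standard library instead. TextExtractor and strip_tags are names made up for this example, and the sample markup is hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.parts.append(data)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

text = strip_tags('<p>something <b>something</b></p>')
print(repr(text))
```

Unlike the regex, a parser is built to cope with things like a > inside an attribute value, which is why it is the safer choice for real HTML.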

Method 5

Just for interest, you could write some code such as:

with open('blah.txt','w') as f:
    f.write("""<sdgsa>one<as<>asfd<asdf>
<asdf>two<asjkdgai><iasj>three<fasdlojk>""")

def filter_line(line):
    count=0
    ignore=False
    result=[]
    for c in line:
        if c==">" and count==1:
            count=0
            ignore=False
        if not ignore:
            result.append(c)
        if c=="<" and count==0:
            ignore=True
            count=1
    return "".join(result)

with open('blah.txt') as f:
    print("".join(map(filter_line, f.readlines())))

>>> 
<>one<>asfd<>
<>two<><>three<>
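
Note that filter_line keeps the < and > characters themselves, as the output shows. If you want them gone too, a variation along these lines should work. It is a sketch, not the original answer's code: strip_delimited is a made-up name, and like filter_line it treats a < that appears inside an already open pair as ordinary noise.

```python
def strip_delimited(line, start="<", end=">"):
    """Drop everything from `start` to the next `end`, delimiters included."""
    result = []
    inside = False  # True while we are between a start and its end
    for c in line:
        if not inside and c == start:
            inside = True       # opening delimiter: begin skipping
        elif inside and c == end:
            inside = False      # closing delimiter: stop skipping
        elif not inside:
            result.append(c)    # ordinary character outside any pair
    return "".join(result)

first = strip_delimited("<sdgsa>one<as<>asfd<asdf>")
second = strip_delimited("<asdf>two<asjkdgai><iasj>three<fasdlojk>")
print(first)
print(second)
```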


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
