This code almost does what I need it to:
for line in all_lines:
    s = line.split('>')
Except it removes all the ‘>’ delimiters.
So,
<html><head>
Turns into
['<html','<head']
Is there a way to use the split() method but keep the delimiter, instead of removing it?
I would like to get these results:
['<html>','<head>']
Answers:
Method 1
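d = ">"
for line in all_lines:
    # Re-append the delimiter to every non-empty piece. Note that text after
    # the last '>' (if any) also gets a '>' appended.
    s = [e + d for e in line.split(d) if e]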
Method 2
If you are parsing HTML with splits, you are most likely doing it wrong, except if you are writing a one-shot script aimed at a fixed and secure content file. If it is supposed to work on any HTML input, how will you handle something like <a title='growth > 8%' href='#something'>?
Anyway, the following works for me:
>>> import re
>>> re.split('(<[^>]*>)', '<body><table><tr><td>')[1::2]
['<body>', '<table>', '<tr>', '<td>']
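For input like the <a title='growth > 8%' ...> case above, a real HTML parser is the safer route. The following is only a minimal sketch using the standard library's html.parser module; the TagCollector class name and the sample string are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: collects the raw text of every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the source text of the start tag
        # that was opened most recently, attributes included.
        self.tags.append(self.get_starttag_text())


collector = TagCollector()
collector.feed("<html><head><a title='growth > 8%' href='#something'>")
collector.close()

for raw_tag in collector.tags:
    print(raw_tag)
```

Because the parser understands quoted attribute values, the '>' inside title='growth > 8%' should not end the tag early, so the anchor comes back as one item instead of being cut in two.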
Method 3
How about this:
import re
s = '<html><head>'
# Match each run of non-'>' characters together with the '>' that ends it.
re.findall(r'[^>]+>', s)  # ['<html>', '<head>']
Method 4
Just split it, then for each element in the array/list (apart from the last one) add a trailing “>” to it.
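The idea in Method 4 could be sketched as follows. The function name split_keep_delimiter and its parameters are assumptions chosen for illustration; the original answer gives no code.

```python
def split_keep_delimiter(line, delimiter=">"):
    """Split `line` on `delimiter` and put it back on every piece but the last."""
    parts = line.split(delimiter)
    # Every piece except the last one was followed by a delimiter in the input.
    pieces = [part + delimiter for part in parts[:-1]]
    # The last piece is whatever came after the final delimiter;
    # keep it only if it is not empty.
    if parts[-1]:
        pieces.append(parts[-1])
    return pieces


print(split_keep_delimiter("<html><head>"))  # ['<html>', '<head>']
```

Unlike Method 1, this leaves any trailing text after the last '>' untouched rather than appending a delimiter to it.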
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.