How can I use regex in Python to find words between tags?
s = """<person>John</person>went to<location>London</location>"""
......
.......
print 'person of name:' John
print 'location:' London
Answers:
Method 1
You can use BeautifulSoup to parse this HTML.
from bs4 import BeautifulSoup
input = """<person>John</person>went to<location>London</location>"""
soup = BeautifulSoup(input, "html.parser")
# find_all returns every matching tag; take the first and read its text
print(soup.find_all("person")[0].get_text())
print(soup.find_all("location")[0].get_text())
Also, it is not good practice to use str as a variable name in Python, because it shadows the built-in str type.
By the way, the regex can be:
import re
# (.*?) is non-greedy, so each match stops at the first closing tag
print(re.findall(r"<person>(.*?)</person>", input, re.DOTALL))
print(re.findall(r"<location>(.*?)</location>", input, re.DOTALL))
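To get output in the shape the question asks for, the regex approach above can be sketched as follows. This is only a minimal illustration using the question's sample string; the variable names person and location are chosen here for the example and are not part of the original answer.

```python
import re

s = """<person>John</person>went to<location>London</location>"""

# Non-greedy capture groups stop at the first matching closing tag.
person = re.search(r"<person>(.*?)</person>", s, re.DOTALL)
location = re.search(r"<location>(.*?)</location>", s, re.DOTALL)

# re.search returns None when the tag is missing, so guard before .group().
if person:
    print("person of name:", person.group(1))
if location:
    print("location:", location.group(1))
```

re.search is used instead of re.findall because only the first occurrence of each tag is wanted here; re.findall would be the better fit if a tag can appear several times.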
Method 2
import re
# simple example
pattern = r"<person>(.*?)</person>"
string = "<person>My name is Jo</person>"
print(re.findall(pattern, string, flags=0))
# multiline string example: re.DOTALL lets "." match the newline as well
string = "<person>My name is:\n Jo</person>"
print(re.findall(pattern, string, flags=re.DOTALL))
This example works for simple parsing only. Have a look at the official Python documentation on re.
To parse HTML, you should consider @sabuj-hassan's answer, but please remember to check this Stack Overflow gem as well.
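If the tag names are not known in advance, one possible variation is a single pattern with a backreference, so the closing tag must match the opening one. The helper name extract_tags and the constant TAG_PATTERN below are hypothetical, and the sketch assumes simple, non-nested tags without attributes; anything more complex is a job for a real parser.

```python
import re

# Hypothetical helper pattern: group 1 is the tag name, group 2 its content.
# \1 forces the closing tag to repeat the name captured by group 1.
TAG_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def extract_tags(text):
    """Return a list of (tag, content) pairs for simple, non-nested tags."""
    return TAG_PATTERN.findall(text)

pairs = extract_tags("<person>John</person>went to<location>London</location>")
for tag, content in pairs:
    print(tag + ":", content)
```

Because the pattern has two capture groups, re.findall returns tuples rather than plain strings, which keeps each value paired with the tag it came from.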
Method 3
You are probably looking for **XML tree and elements**, as described in the Python documentation for xml.etree.ElementTree (ET):
XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose - ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.
Parsing XML
We’ll be using the following XML document as the sample data for this section:
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
We have a number of ways to import the data. Reading the file from disk:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
Reading the data from a string:
root = ET.fromstring(country_data_as_string)
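Applied to the string from the question, the same ET.fromstring call could look roughly like this. Note the assumption: the question's fragment has several top-level elements plus loose text, so it is not a well-formed XML document by itself. The sketch wraps it in a made-up <root> element, which is not part of the original data.

```python
import xml.etree.ElementTree as ET

s = """<person>John</person>went to<location>London</location>"""

# Wrap the fragment in a hypothetical <root> element so it parses as one document.
root = ET.fromstring("<root>" + s + "</root>")

# find returns the first direct child with the given tag name (or None).
person = root.find("person")
location = root.find("location")

print("person of name:", person.text)
print("location:", location.text)

# Text that follows an element's closing tag is stored in that element's .tail.
print("text after person:", person.tail)
```

Unlike the regex versions, this raises xml.etree.ElementTree.ParseError on malformed input instead of silently returning nothing, which may or may not be what you want.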
Other Python XML and HTML parsers:
https://wiki.python.org/moin/PythonXml
http://docs.python.org/2/library/htmlparser.html
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.