I have an XML document which reads like this:
<xml> <web:Web> <web:Total>4000</web:Total> <web:Offset>0</web:Offset> </web:Web> </xml>
My question is: how do I access these elements using a library like BeautifulSoup in Python?
xmlDom.web["Web"].Total does not work.
Answers:
Method 1
BeautifulSoup isn’t a DOM library per se (it doesn’t implement the DOM APIs). To make matters more complicated, you’re using namespaces in that xml fragment. To parse that specific piece of XML, you’d use BeautifulSoup as follows:
from bs4 import BeautifulSoup
xml = """<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>"""
# html.parser lowercases tag names and keeps "web:total" as one literal name
doc = BeautifulSoup(xml, 'html.parser')
print(doc.find('web:total').string)
print(doc.find('web:offset').string)
If you weren’t using namespaces, the code could look like this:
from bs4 import BeautifulSoup
xml = """<xml>
<Web>
<Total>4000</Total>
<Offset>0</Offset>
</Web>
</xml>"""
# html.parser lowercases tag names, so attribute access uses lowercase names
doc = BeautifulSoup(xml, 'html.parser')
print(doc.xml.web.total.string)
print(doc.xml.web.offset.string)
The key here is that BeautifulSoup doesn’t know (or care) anything about namespaces. Thus web:Web is treated as a tag literally named web:web rather than as a Web tag belonging to the web namespace. While BeautifulSoup adds web:web to the xml element dictionary, Python syntax doesn’t recognize web:web as a single identifier, so it cannot be reached with dotted attribute access.
You can learn more about it by reading the documentation.
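The point about identifiers can be illustrated with a minimal sketch. It assumes the bs4 package (BeautifulSoup 4) and Python's built-in html.parser; the helper name text_of is hypothetical and only for illustration. Because a name containing a colon cannot be written as an attribute, the lookup passes the full name as a string to find():

```python
from bs4 import BeautifulSoup

xml = """<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>"""

# Assumption: html.parser (bundled with Python) is used, so tag names
# are lowercased and "web:total" is stored as one literal tag name.
doc = BeautifulSoup(xml, "html.parser")


def text_of(soup, tag_name):
    """Hypothetical helper: text of the first tag with this literal name, or None."""
    tag = soup.find(tag_name.lower())  # lowercase to match html.parser's names
    return tag.get_text(strip=True) if tag is not None else None


total = text_of(doc, "web:Total")
offset = text_of(doc, "web:Offset")
missing = text_of(doc, "web:Missing")
print(total, offset, missing)
```

Passing the name as a string avoids the syntax problem entirely, and returning None for a missing tag avoids an AttributeError on .string.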
Method 2
This is an old question, but it may not be widely known that at least BeautifulSoup 4 handles namespaces if you pass 'xml' as the second argument to the constructor:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>""", 'xml')
print(soup.prettify())
---
<?xml version="1.0" encoding="utf-8"?>
<xml>
<Web>
<Total>
4000
</Total>
<Offset>
0
</Offset>
</Web>
</xml>
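Going by the prettified output, the undeclared web: prefix is dropped by the XML parser, so the elements should be reachable by their local names. A minimal sketch under that assumption (it also assumes lxml is installed, which bs4 needs for the 'xml' feature):

```python
from bs4 import BeautifulSoup

xml = """<xml>
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>"""

# Assumption: the 'xml' feature is backed by lxml, which recovers from the
# undeclared "web" prefix by dropping it, as the output above suggests.
soup = BeautifulSoup(xml, "xml")

# Look the elements up by local name, since the prefix is not kept.
total = str(soup.find("Total").string)
offset = str(soup.find("Offset").string)
print(total, offset)
```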
Method 3
Environment
import bs4
bs4.__version__
---
'4.10.0'
import sys
print(sys.version)
---
3.8.10 (default, Nov 26 2021, 20:14:08)
[GCC 9.3.0]
BS4/XML Parser on XML with namespace definition
from bs4 import BeautifulSoup
xbrl_with_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl
xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31"
>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""
soup = BeautifulSoup(xbrl_with_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant.prettify())
---
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
BS4/XML Parser on XML without namespace definition
xbrl_without_namespace = """
<?xml version="1.0" encoding="UTF-8"?>
<xbrl>
<dei:EntityRegistrantName>
Hoge, Inc.
</dei:EntityRegistrantName>
</xbrl>
"""
soup = BeautifulSoup(xbrl_without_namespace, 'xml')
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None
BS4/HTML Parser on XML without namespace definition
The BS4 HTML parser treats <namespace>:<tag> as a single tag name, and it also converts the name to lowercase.
soup = BeautifulSoup(xbrl_without_namespace, 'html.parser')
registrant = soup.find("dei:EntityRegistrantName".lower())
print(registrant)
---
<dei:entityregistrantname>
Hoge, Inc.
</dei:entityregistrantname>
A search that uses capital letters does not match, because the tag names have been converted to lowercase.
registrant = soup.find("dei:EntityRegistrantName")
print(registrant)
---
None
Conclusion
- Provide the namespace definitions when using namespaces with the XML parser, OR
- Use the HTML parser and search with all-lowercase tag names.
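As a small sketch of the first option, the tag found by its prefixed name also exposes the prefix and the namespace URI it was bound to. This assumes bs4 with lxml installed; the variable names are illustrative only:

```python
from bs4 import BeautifulSoup

xbrl = """<xbrl xmlns:dei="http://xbrl.sec.gov/dei/2020-01-31">
  <dei:EntityRegistrantName>
    Hoge, Inc.
  </dei:EntityRegistrantName>
</xbrl>"""

# Assumption: lxml is installed so that the 'xml' feature is available.
soup = BeautifulSoup(xbrl, "xml")
registrant = soup.find("dei:EntityRegistrantName")

name_text = registrant.get_text(strip=True)  # surrounding whitespace removed
prefix = registrant.prefix                   # prefix declared on the root
namespace_uri = registrant.namespace         # URI the prefix maps to
print(name_text)
print(prefix, namespace_uri)
```

Checking the namespace URI rather than the prefix is usually the more robust test, since different documents may bind different prefixes to the same URI.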
Method 4
You should explicitly define your namespace on the root element using the xmlns:prefix="URI" syntax (see examples here), and then access your element via prefix:tag from BeautifulSoup. Keep in mind that you should also explicitly specify how BeautifulSoup should process your document, in this case:
xml = BeautifulSoup(xml_content, 'xml')
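Applied to the XML from the question, this approach might look like the sketch below. The URI http://example.com/web is a placeholder assumption, since the question does not say which namespace the web prefix belongs to, and lxml is assumed to be installed:

```python
from bs4 import BeautifulSoup

# Assumption: "http://example.com/web" is a placeholder namespace URI;
# replace it with the URI your document actually uses.
xml_content = """<xml xmlns:web="http://example.com/web">
<web:Web>
<web:Total>4000</web:Total>
<web:Offset>0</web:Offset>
</web:Web>
</xml>"""

soup = BeautifulSoup(xml_content, "xml")

# With the prefix declared on the root, prefix:tag lookups keep their case.
total = str(soup.find("web:Total").string)
offset = str(soup.find("web:Offset").string)
print(total, offset)
```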
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.