Python pretty XML printer with lxml

After reading from an existing file with ‘ugly’ XML and doing some modifications, pretty printing doesn’t work. I’ve tried etree.write(FILE_NAME, pretty_print=True).

I have the following XML:

<testsuites tests="14" failures="0" disabled="0" errors="0" time="0.306" name="AllTests">
    <testsuite name="AIR" tests="14" failures="0" disabled="0" errors="0" time="0.306">
....

And I use it like this:

tree = etree.parse('original.xml')
root = tree.getroot()

...    
# modifications
...

with open(FILE_NAME, "w") as f:
    tree.write(f, pretty_print=True)

Answers:

Method 1

For me, this issue was not solved until I noticed this little tidbit here:

http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output

Short version:

Read in the file with this command:

>>> parser = etree.XMLParser(remove_blank_text=True)
>>> tree = etree.parse(filename, parser)

That will “reset” the already existing indentation, allowing the output to generate its own indentation correctly. Then pretty_print as normal:

>>> tree.write(<output_file_name>, pretty_print=True)
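
As a minimal end-to-end sketch of the two steps above, the following parses an in-memory document rather than a file and serializes it to a string. The sample XML and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical sample input with inconsistent existing whitespace.
ugly = b"<root>\n<a>\n      <b/>\n</a>\n</root>"

# remove_blank_text=True drops whitespace-only text nodes while parsing,
# so the serializer is free to insert its own indentation.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(ugly, parser)

# tostring() returns bytes by default; decode it for display.
pretty = etree.tostring(root, pretty_print=True).decode()
print(pretty)
```

With the blank text removed, each nesting level should come out indented by two spaces, which is the serializer's default.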

Method 2

Well, according to the API docs, the lxml etree module itself has no “write” function; write() is a method of the ElementTree object that etree.parse() returns. You’ve got a couple of options for getting a pretty-printed XML string into a file. You can use the tostring function like so:

# tostring() returns bytes by default, so open the file in binary mode
f = open('doc.xml', 'wb')
f.write(etree.tostring(root, pretty_print=True))
f.close()

Or, if your input source is less than perfect and/or you want more knobs and buttons to configure your output, you could use one of the Python wrappers for the tidy library.

http://utidylib.berlios.de/

import tidy
f.write(tidy.parseString(your_xml_str, **{'output_xml':1, 'indent':1, 'input_xml':1}))

http://countergram.com/open-source/pytidylib

from tidylib import tidy_document
document, errors = tidy_document(your_xml_str, options={'output_xml':1, 'indent':1, 'input_xml':1})
f.write(document)

Method 3

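# file() exists only in Python 2; open() works in both Python 2 and 3
fp = open('out.txt', 'w')
# on Python 3, pass encoding="unicode" to tostring() so that a str rather than bytes is printed
print(etree.tostring(...), file=fp)
fp.close()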

Method 4

Here is an answer that is fixed to work with Python 3:

from lxml import etree
from sys import stdout
from io import BytesIO

parser = etree.XMLParser(remove_blank_text = True)
file_obj = BytesIO(text)
tree = etree.parse(file_obj, parser)
tree.write(stdout.buffer, pretty_print = True)

where text is the xml code as a sequence of bytes.
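
If the XML is held as a str rather than bytes, one option is to encode it first. The sketch below assumes a made-up sample document and writes to an in-memory buffer instead of stdout.buffer so the result can be inspected. The xml_declaration and encoding arguments are optional extras, not part of the answer above.

```python
from io import BytesIO

from lxml import etree

# Hypothetical input held as a str; encode it to get the bytes the answer expects.
xml_str = '<?xml version="1.0" encoding="UTF-8"?><root><child>text</child></root>'
text = xml_str.encode("utf-8")

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(BytesIO(text), parser)

# Write to a binary buffer; a file opened with mode "wb" works the same way.
out = BytesIO()
tree.write(out, pretty_print=True, xml_declaration=True, encoding="UTF-8")
result = out.getvalue()
print(result.decode("utf-8"))
```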

Method 5

I am not sure why other answers did not mention this. If you want to obtain the root of the xml there is a method called getroot(). I hope I answered your question (though a little late).

tree = et.parse(xmlFile)
root = tree.getroot()

Method 6

Of course – pretty print of lxml.etree is possible.

In my case, the old trick with remove_blank_text=True and pretty_print=True was not working as I expected (was too delicate), so I decided to write it by myself.

Here it is – a modern, forcible, native Pythonic way to correct lxml.etree.Element tree indentation.
This gives a nicely prettified XML string:

from typing import Optional

import lxml.etree


def indent_lxml(element: lxml.etree.Element, level: int = 0, is_last_child: bool = True) -> None:
    space = "    "
    indent_str = "\n" + level * space

    element.text = strip_or_null(element.text)
    if element.text:
        element.text = f"{indent_str}{space}{element.text}"

    num_children = len(element)
    if num_children:
        element.text = f"{element.text or ''}{indent_str}{space}"

        for index, child in enumerate(element.iterchildren()):
            is_last = index == num_children - 1
            indent_lxml(child, level + 1, is_last)

    elif element.text:
        element.text += indent_str

    tail_level = max(0, level - 1) if is_last_child else level
    tail_indent = "\n" + tail_level * space
    tail = strip_or_null(element.tail)
    element.tail = f"{indent_str}{tail}{tail_indent}" if tail else tail_indent


def strip_or_null(text: Optional[str]) -> Optional[str]:
    if text is not None:
        return text.strip() or None

It’s reasonably fast, because it doesn’t allocate any additional structures in memory and, while traversing the tree, it visits each node only once, giving the best possible O(N) computational complexity.

It rearranges all the existing indentation “in place” in the tree (the DOM) by correcting contents of Element.text and Element.tail attributes (affects white-spaces only).
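
To make the role of those two attributes concrete, here is a small sketch (the sample document is invented) showing where lxml stores the whitespace between tags. It is this stored whitespace that any indentation routine has to overwrite.

```python
from lxml import etree

# Hypothetical sample: one child element surrounded by whitespace.
root = etree.fromstring("<a>\n  <b/>\n</a>")
child = root[0]

# Whitespace before the first child is kept in the parent's .text.
print(repr(root.text))
# Whitespace after an element's closing tag is kept in that element's .tail.
print(repr(child.tail))
```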

Naturally, it also can be used with HTML parsed by lxml.

In order to use it, do something like this:

root = lxml.etree.parse("path/to/the_file.xml").getroot()
# or
root = lxml.etree.fromstring("<xml><body><leaf1/><leaf2/></body></xml>")

indent_lxml(root)  # corrects indentation "in place"

result = lxml.etree.tostring(root, encoding="unicode")
print(result)

Which prints:

<xml>
    <body>
        <leaf1/>
        <leaf2/>
    </body>
</xml>
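
For comparison, newer lxml releases (4.5 and later, to my knowledge) also ship a built-in etree.indent() helper that rewrites the whitespace in place in a similar spirit. This is a different function from indent_lxml above and may treat mixed text content differently, so take the following as a sketch to try rather than a drop-in replacement.

```python
from lxml import etree

root = etree.fromstring("<xml><body><leaf1/><leaf2/></body></xml>")

# Built-in helper (assumes lxml >= 4.5); space sets the per-level indentation string.
etree.indent(root, space="    ")

result = etree.tostring(root, encoding="unicode")
print(result)
```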

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
