ElementTree and lxml for pythonic XML processing in Python

ElementTree is a "pythonic" XML parser interface developed by Fredrik Lundh, which has been included in the Python standard library since version 2.5. It provides a simple and intuitive API for processing XML (well, much simpler and more intuitive than the usual parsers). lxml is a faster parser with a compatible interface. Here are some useful tips for using ElementTree and lxml.

ElementTree

The ElementTree documentation included in the Python 2.5 manual is far from complete. Fredrik Lundh's pages on his effbot website are still necessary to take advantage of all the useful ElementTree features.

How to import ElementTree:

To write portable code, it is necessary to support ElementTree both before and after Python 2.5. This can be done as follows:

try:
    # Python 2.5+: batteries included
    import xml.etree.ElementTree as ET
except ImportError:
    try:
        # Python <2.5: standalone ElementTree install
        import elementtree.ElementTree as ET
    except ImportError:
        raise ImportError("ElementTree is not installed, see http://effbot.org/zone/element-index.htm")

You may also replace ElementTree with cElementTree to get an optimized version of the parser written in C. See below for a performance comparison.
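Once the module is imported as ET, the same calls work whichever implementation was loaded. The following is a minimal sketch of the basic API, using the standard library module on a current Python; the sample document and the function names list_titles and add_book are hypothetical and only serve as illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, used only for this illustration.
SAMPLE = """<library>
  <book id="1"><title>First title</title></book>
  <book id="2"><title>Second title</title></book>
</library>"""


def list_titles(xml_text):
    """Return (id, title) pairs for each <book> child of the root element."""
    root = ET.fromstring(xml_text)
    return [(book.get("id"), book.findtext("title"))
            for book in root.findall("book")]


def add_book(xml_text, book_id, title):
    """Append a new <book> element and return the document as a string."""
    root = ET.fromstring(xml_text)
    book = ET.SubElement(root, "book", id=book_id)
    ET.SubElement(book, "title").text = title
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(list_titles(SAMPLE))
    print(add_book(SAMPLE, "3", "Third title"))
```

Elements behave much like lists of their children, and attributes are read with get(), which is most of what is needed for everyday parsing.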

lxml

lxml is another module providing an ElementTree-compatible API, with additional features thanks to the libxml2 and libxslt libraries:

  • full XPath support
  • XSLT
  • XML Schemas and Relax NG
  • canonicalization (C14N)
  • XInclude
  • namespace preservation
  • and many other features and subtleties

Official website: http://codespeak.net/lxml/

How to import lxml:

Switching from ElementTree to lxml only requires changing the import lines:

try:
    import lxml.etree as ET
except ImportError:
    raise ImportError("lxml is not installed, see http://codespeak.net/lxml/")
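To illustrate what "full XPath support" adds in practice, here is a small sketch comparing the two modules on the same query. It assumes lxml may or may not be installed, so the lxml variant returns None when the import fails. The sample document and the function names are hypothetical. ElementTree's path syntax has no numeric comparison, so the filter is done in Python there, while lxml can express it directly in XPath.

```python
import xml.etree.ElementTree as ET

try:
    from lxml import etree as LET  # third-party, optional in this sketch
except ImportError:
    LET = None

# Hypothetical sample data, used only for this illustration.
DOC = "<shop><item name='pen' price='2'/><item name='book' price='15'/></shop>"


def expensive_names_stdlib(xml_text, limit):
    """ElementTree: select the elements, then filter on price in Python."""
    root = ET.fromstring(xml_text)
    return [item.get("name") for item in root.findall(".//item")
            if float(item.get("price")) > limit]


def expensive_names_lxml(xml_text, limit):
    """lxml: let XPath do the numeric comparison; None if lxml is missing."""
    if LET is None:
        return None
    root = LET.fromstring(xml_text)
    # $limit is an XPath variable bound through the keyword argument.
    names = root.xpath("//item[@price > $limit]/@name", limit=limit)
    return [str(name) for name in names]


if __name__ == "__main__":
    print(expensive_names_stdlib(DOC, 10))
    print(expensive_names_lxml(DOC, 10))
```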

Performance comparison

When parsing large XML files, performance matters. For example, I parsed a large and complex 11 MB XML file using ElementTree, cElementTree and lxml, first in a normal environment and then with psyco enabled. Here are the results:

1) parsing with lxml...
   lxml: 1.231 s
2) parsing with cElementTree...
   cElementTree: 4.416 s
3) parsing with ElementTree...
   ElementTree: 15.927 s
same tests with psyco.full() enabled:
4) parsing with lxml...
   lxml: 4.486 s
5) parsing with cElementTree...
   cElementTree: 2.731 s
6) parsing with ElementTree...
   ElementTree: 14.419 s

This simple test may not be very representative, but it clearly shows two things:

  • without psyco, lxml is roughly 3.6 times faster than cElementTree and about 13 times faster than ElementTree when parsing a large XML file.
  • using psyco improves cElementTree performance (from 4.416 s to 2.731 s), but it slows down lxml (from 1.231 s to 4.486 s)!
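The script used for the timings above is not shown, so the following is only a sketch of how a similar comparison could be run on a current Python. It is not the original benchmark: the synthetic document, the function names and the best-of-N timing are all assumptions, psyco is left out, and lxml is included only if it can be imported.

```python
import io
import time
import xml.etree.ElementTree as ET


def make_sample_xml(n_records):
    """Build a synthetic XML document as bytes (a stand-in for a real file)."""
    root = ET.Element("records")
    for i in range(n_records):
        rec = ET.SubElement(root, "record", id=str(i))
        ET.SubElement(rec, "name").text = "item %d" % i
    return ET.tostring(root)


def time_parse(parse, data, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` parses."""
    best = None
    for _ in range(repeat):
        start = time.perf_counter()
        parse(io.BytesIO(data))
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def available_parsers():
    """Map a display name to a parse() function for each usable module."""
    parsers = {"ElementTree": ET.parse}
    try:
        import lxml.etree as LET  # third-party, optional
        parsers["lxml"] = LET.parse
    except ImportError:
        pass
    return parsers


if __name__ == "__main__":
    data = make_sample_xml(50000)
    for name, parse in available_parsers().items():
        print("%s: %.3f s" % (name, time_parse(parse, data)))
```

Results will depend heavily on the document and the machine, so such a harness is best run on a file that resembles the real workload.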

In conclusion, I would recommend lxml for most XML processing, with a fallback to cElementTree for portability, like this:

try:
    # lxml: best performance for XML processing
    import lxml.etree as ET
except ImportError:
    try:
        # Python 2.5+: batteries included
        import xml.etree.cElementTree as ET
    except ImportError:
        try:
            # Python <2.5: standalone ElementTree install
            import elementtree.cElementTree as ET
        except ImportError:
            raise ImportError("Neither lxml nor ElementTree is installed, "
                "see http://codespeak.net/lxml "
                "or http://effbot.org/zone/element-index.htm")
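On current Python versions the same idea needs fewer branches. Since Python 3.3, xml.etree.ElementTree uses the C accelerator automatically when it is available, and xml.etree.cElementTree was removed in Python 3.9. A minimal sketch of the fallback for Python 3 might look like this; load_etree, BACKEND and count_tags are hypothetical names used for illustration.

```python
def load_etree():
    """Return (module, name): lxml if installed, else the standard library."""
    try:
        import lxml.etree as etree  # third-party, preferred when present
        return etree, "lxml"
    except ImportError:
        import xml.etree.ElementTree as etree
        return etree, "ElementTree"


ET, BACKEND = load_etree()


def count_tags(xml_text):
    """Count how many times each tag occurs, using whichever backend loaded."""
    root = ET.fromstring(xml_text)
    counts = {}
    for elem in root.iter():
        counts[elem.tag] = counts.get(elem.tag, 0) + 1
    return counts


if __name__ == "__main__":
    print(BACKEND)
    print(count_tags("<a><b/><b/><c/></a>"))
```

Code written against the shared subset of the API, as count_tags is here, runs unchanged on either backend.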

To be continued...