pywordform - a Python module to parse MS Word forms (docx) to extract field values and tags

pywordform is a Python module that parses Microsoft Word forms in docx format and extracts all field values, with their tags, into a Python dictionary.

News

  • 2012-04-19 v0.02: added support for multiline text fields
  • 2012-03-04 v0.01: published initial version

Download:

The archive is available on the project page.

License

BSD, open-source. See LICENCE.txt for more info.

How to use this module:

Open the file sample_form.docx (provided with the source code) in MS Word and edit the field values. You may also add or edit fields, or create your own Word form (see below).

From the shell, you may use the module as a tool to extract all fields with tags:

> python pywordform.py sample_form.docx
field1 = "hello, world."
field2 = "hello,"
field3 = "value B"
field4 = "04-03-2012"

In a Python script, the parse_form function returns a dictionary of field values indexed by tag:

import pywordform
fields = pywordform.parse_form('sample_form.docx')
print(fields)

Output:

{'field2': 'hello,\nworld.', 'field3': 'value B', 'field1': 'hello, world.', 'field4': '04-03-2012'}

For more information, see the main program at the end of the module and the docstrings.
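As a small illustration of working with the returned dictionary, the sketch below formats it as one line per field, sorted by tag. The helper name format_fields is hypothetical and not part of pywordform, and the dictionary is copied from the sample output above so that the example runs without the module or a form file.

```python
def format_fields(fields):
    """Return one 'tag = "value"' line per field, sorted by tag.

    Hypothetical helper for illustration; it is not part of pywordform.
    """
    lines = []
    for tag in sorted(fields):
        # Show embedded newlines as a literal \n so each field stays on one line.
        value = fields[tag].replace('\n', '\\n')
        lines.append('%s = "%s"' % (tag, value))
    return lines


# Sample dictionary copied from the output above; in a real script it would
# come from pywordform.parse_form('sample_form.docx').
fields = {
    'field1': 'hello, world.',
    'field2': 'hello,\nworld.',
    'field3': 'value B',
    'field4': '04-03-2012',
}

for line in format_fields(fields):
    print(line)
```

Sorting by tag gives a stable order, which the plain dictionary output does not guarantee.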

How to create your own MS Word forms:

  1. In MS Word (2007 or later), go to the Developer tab (you may need to enable "Show Developer tab in the Ribbon" in Word Options).
  2. Enable design mode.
  3. Click on one of the icons, such as "Aa", to insert a field.
  4. With a field selected, click on the Properties button and set a unique identifier as its tag. pywordform uses this tag to index the field value when it parses the form.
  5. When done, disable design mode in order to enter values for the fields.
  6. You may also protect the document (in the Developer tab) so that users can only enter values into fields and cannot modify the rest of the document.
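To show where the tag set in step 4 ends up, here is a minimal sketch that reads tags and values straight from the XML inside a docx file, using only the standard library. It assumes the usual WordprocessingML layout, in which each new-style field is a w:sdt element with its tag in w:sdtPr/w:tag and its text under w:sdtContent. It is not pywordform's own code, the function name read_tagged_fields is hypothetical, and paragraph breaks inside a field are ignored for brevity. The demo builds a tiny in-memory archive rather than a real Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a docx archive.
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def read_tagged_fields(docx_file):
    """Map each tagged field to its text (a sketch, not pywordform's code)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    fields = {}
    for sdt in root.iter(W + 'sdt'):
        tag = sdt.find(W + 'sdtPr/' + W + 'tag')
        content = sdt.find(W + 'sdtContent')
        if tag is None or content is None:
            continue  # a field without a tag has nothing to be indexed by
        # Join all text runs; paragraph breaks are ignored in this sketch.
        text = ''.join(t.text or '' for t in content.iter(W + 't'))
        fields[tag.get(W + 'val')] = text
    return fields


# Minimal stand-in for a form: one field tagged "field1".
document_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:sdt>'
    '<w:sdtPr><w:tag w:val="field1"/></w:sdtPr>'
    '<w:sdtContent><w:p><w:r><w:t>hello, world.</w:t></w:r></w:p>'
    '</w:sdtContent></w:sdt></w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/document.xml', document_xml)
buffer.seek(0)

demo_fields = read_tagged_fields(buffer)
print(demo_fields)
```

Two fields that share a tag would overwrite each other in such a dictionary, which is why the tag should be unique.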

Known limitations:

  • Only the recent "docx" format is supported, not the legacy MS Word "doc" format.
  • Only new-style Word form fields are supported. Legacy form fields have a different structure, and this module does not parse them yet.
  • Multi-line text fields are only parsed correctly since v0.02 (see News).
  • It is not yet possible to edit field values and save the document.
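Because only docx is supported, a script may want to reject other files before parsing. The sketch below is one possible check, assuming that a docx file is a zip archive containing word/document.xml and that a legacy doc file is an OLE2 container rather than a zip. The function looks_like_docx is a hypothetical helper, not part of pywordform, and the demo uses in-memory stand-ins instead of real files.

```python
import io
import zipfile


def looks_like_docx(file_or_path):
    """Return True if the input is a zip archive holding word/document.xml."""
    if not zipfile.is_zipfile(file_or_path):
        return False  # for example a legacy .doc file, which is not a zip
    with zipfile.ZipFile(file_or_path) as archive:
        return 'word/document.xml' in archive.namelist()


# Stand-in for a docx file: a zip archive with the expected member.
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, 'w') as archive:
    archive.writestr('word/document.xml', '<document/>')

# Stand-in for a legacy doc file: the OLE2 signature followed by padding.
fake_doc = io.BytesIO(b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1' + b'\x00' * 504)

print(looks_like_docx(fake_docx))
print(looks_like_docx(fake_doc))
```

A check like this only confirms the container format; it does not tell new-style fields from legacy form fields.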

How to contribute:

The code is available in a Mercurial repository on Bitbucket. You may use it to submit enhancements or to report issues.

To report a bug, please use the issue reporting page, or send me an e-mail.