OleFileIO_PL - a Python module to read MS OLE2 files

OleFileIO_PL is a Python module to read/write Microsoft OLE2 files (also called Structured Storage, Compound File Binary Format or Compound Document File Format), such as Microsoft Office documents, Image Composer and FlashPix files, Outlook messages, ... This is my improved version of the OleFileIO module from PIL, the excellent Python Imaging Library, created and maintained by Fredrik Lundh. The API is still compatible with PIL, but I have improved the internal implementation significantly, with many bugfixes and a more robust design.

Quick links: Download - Documentation - Contact - Report issues - Updates on Twitter

As far as I know, this module is now the most complete and robust Python implementation to read MS OLE2 files, and it is portable across several operating systems. (Please tell me if you know of other similar Python modules.)

OleFileIO_PL can be used as an independent module or with PIL. It has been integrated into Pillow, the friendly fork of PIL.

OleFileIO_PL is mostly meant for developers. If you are looking for tools to analyze OLE files or to extract data, then please also check python-oletools, which is built upon OleFileIO_PL. It includes olebrowse, a graphical tool to browse and extract OLE streams, oleid to quickly identify characteristics of malicious documents, and pyxswf to extract Flash objects (SWF) from OLE files.

Features

  • Parse/read/write any OLE file such as Microsoft Office 97-2003 legacy document formats (Word .doc, Excel .xls, PowerPoint .ppt, Visio .vsd, Project .mpp), Image Composer and FlashPix files, Outlook messages, StickyNotes, Zeiss AxioVision ZVI files, ...
  • List all the streams and storages contained in an OLE file
  • Open streams as files
  • Parse and read property streams, containing metadata of the file
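To make the listing feature concrete: listdir() returns each stream as a list of path components, and the same entry can also be written as one slash-separated string. The sketch below does not read a real file. The sample listing is hypothetical data, and the helper name to_slash_path is an illustration only, not part of the module.

```python
# Hypothetical sample of what listdir() might return for a Word document:
# one list of path components per stream.
sample_listing = [
    ['WordDocument'],
    ['\x05SummaryInformation'],
    ['Macros', 'VBA', 'ThisDocument'],
]


def to_slash_path(entry):
    """Join the components of one listdir() entry into a slash-based path."""
    return '/'.join(entry)


slash_paths = [to_slash_path(entry) for entry in sample_listing]
for path in slash_paths:
    # repr() makes the non-printable \x05 prefix of property streams visible
    print(repr(path))
```

A path built this way should be usable wherever the module accepts a stream name, for example with exists() or openstream().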

Main improvements over the original version of OleFileIO in PIL:

  • Compatible with Python 3.x and 2.6+
  • Many bug fixes
  • Support for files larger than 6.8MB
  • Support for 64-bit platforms and big-endian CPUs
  • Robust: many checks to detect malformed files
  • Runtime option to choose if malformed files should be parsed or raise exceptions
  • Improved API
  • Metadata extraction, stream/storage timestamps (e.g. for document forensics)
  • Can open file-like objects
  • Added setup.py and install.bat to ease installation
  • More convenient slash-based syntax for stream paths
  • Write features (experimental for now)
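The 6.8 MB figure in the list above appears to come from the file format itself. The arithmetic below is a sketch under assumptions taken from the Compound File Binary Format layout, not from this page: the header stores the locations of the first 109 FAT sectors, sectors are 512 bytes, and each FAT entry is a 4-byte sector number. A reader that only follows the header table, without the extra DIFAT sectors, can address no more than the size computed here.

```python
# Assumed constants from the Compound File Binary Format layout.
HEADER_DIFAT_ENTRIES = 109   # FAT sector locations stored directly in the header
SECTOR_SIZE = 512            # bytes per sector (the common, non-4K case)
FAT_ENTRY_SIZE = 4           # each FAT entry is a 32-bit sector number

# How many sectors one FAT sector can describe.
entries_per_fat_sector = SECTOR_SIZE // FAT_ENTRY_SIZE

# Sectors reachable through the header table alone, and their total size.
addressable_sectors = HEADER_DIFAT_ENTRIES * entries_per_fat_sector
limit_bytes = addressable_sectors * SECTOR_SIZE
limit_mib = limit_bytes / (1024.0 * 1024.0)

print("%d bytes, about %.2f MiB" % (limit_bytes, limit_mib))
```

The usable size is slightly smaller in practice, because the FAT sectors themselves are counted among the addressable sectors.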

News

  • 2014-07-31 v0.32alpha: started adding experimental write features
  • 2014-07-27 v0.31: fixed support for large files with 4K sectors (ZVI, OIB), thanks to Niko Ehrenfeuchter, Martijn Berger and Dave Jones. Added test scripts from Pillow (by hugovk). Fixed setup for Python 3 (Martin Panter)
  • 2014-03-17: OleFileIO_PL 0.30 integrated into Pillow.
  • 2014-02-04 v0.30: now compatible with Python 3.x, thanks to Martin Panter who did most of the hard work. I also updated the documentation significantly.
  • 2013-07-24 v0.26: added methods to parse stream/storage timestamps, improved listdir to include storages, fixed parsing of direntry timestamps
  • 2013-05-27 v0.25: improved metadata extraction, properties parsing and exception handling, fixed issue #12
  • 2013-05-07 v0.24: new features to extract metadata (get_metadata method and OleMetadata class), improved getproperties to convert timestamps to Python datetime
  • 2012-10-09: published python-oletools, a package of analysis tools based on OleFileIO_PL
  • 2012-09-11 v0.23: added support for file-like objects, fixed issue #8
  • 2012-02-17 v0.22: fixed issues #7 (bug in getproperties) and #2 (added close method)
  • 2011-10-20: code hosted on bitbucket to ease contributions and bug tracking
  • 2010-01-24 v0.21: fixed support for big-endian CPUs, such as PowerPC Macs.
  • 2009-12-11 v0.20: small bugfix in OleFileIO.open when filename is not plain str.
  • 2009-12-10 v0.19: fixed support for 64-bit platforms (thanks to Ben G. and Martijn for reporting the bug)
  • see changelog in source code for more info.
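Several entries above concern timestamps (v0.24 and v0.26). As background, OLE files store times in the Windows FILETIME format: a 64-bit count of 100-nanosecond ticks since 1601-01-01 UTC. The following is a minimal standalone sketch of that conversion using only the standard library. The function name filetime_to_datetime is hypothetical, and this is not the module's own code.

```python
import datetime

# FILETIME counts 100-nanosecond ticks from this epoch (UTC).
FILETIME_EPOCH = datetime.datetime(1601, 1, 1)


def filetime_to_datetime(filetime):
    """Convert a 64-bit FILETIME tick count to a naive UTC datetime."""
    # 10 ticks of 100 ns make one microsecond; sub-microsecond ticks are dropped.
    return FILETIME_EPOCH + datetime.timedelta(microseconds=filetime // 10)


# 116444736000000000 ticks is the offset between the 1601 and 1970 epochs.
print(filetime_to_datetime(116444736000000000))
```

In normal use the module's own timestamp methods and metadata attributes should be preferred. The sketch only shows what the raw values mean.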

Download:

The archive is available on the project page.

License

OleFileIO_PL changes are Copyright (c) 2005-2014 by Philippe Lagadec.

The Python Imaging Library (PIL) is

- Copyright (c) 1997-2005 by Secret Labs AB

- Copyright (c) 1995-2005 by Fredrik Lundh

By obtaining, using, and/or copying this software and/or its associated documentation, you agree that you have read, understood, and will comply with the following terms and conditions:

Permission to use, copy, modify, and distribute this software and its associated documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appears in all copies, and that both that copyright notice and this permission notice appear in supporting documentation, and that the name of Secret Labs AB or the author not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

SECRET LABS AB AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL SECRET LABS AB OR THE AUTHOR BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

How to use this module:

See the overview page on Bitbucket for more complete and up-to-date documentation.

See sample code at the end of the module, and also docstrings.

Here are a few examples:

from __future__ import print_function  # lets the print() calls below run on Python 2.6+ and 3.x
import io
import OleFileIO_PL

# Test if a file is an OLE container:
assert OleFileIO_PL.isOleFile('myfile.doc')

# Open an OLE file:
ole = OleFileIO_PL.OleFileIO('myfile.doc')

# Get list of streams:
print(ole.listdir())

# Test if known streams/storages exist:
if ole.exists('worddocument'):
    print("This is a Word document.")
    print("size :", ole.get_size('worddocument'))
    if ole.exists('macros/vba'):
        print("This document seems to contain VBA macros.")

# Extract the "Pictures" stream from a PPT file:
if ole.exists('Pictures'):
    pics = ole.openstream('Pictures')
    data = pics.read()
    f = open('Pictures.bin', 'wb')  # binary mode: the stream is raw bytes
    f.write(data)
    f.close()

# Extract metadata (new in v0.24) - see source code for all attributes:
meta = ole.get_metadata()
print('Author:', meta.author)
print('Title:', meta.title)
print('Creation date:', meta.create_time)
# print all metadata:
meta.dump()

# Close the OLE file:
ole.close()

# Work with a file-like object (e.g. io.BytesIO) instead of a file on disk:
data = open('myfile.doc', 'rb').read()
f = io.BytesIO(data)
ole = OleFileIO_PL.OleFileIO(f)
print(ole.listdir())
ole.close()
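For readers curious about what the isOleFile test in the examples amounts to: every OLE2 container starts with the same 8-byte signature. The sketch below checks that signature with the standard library only. The names OLE_MAGIC, looks_like_ole and file_looks_like_ole are illustrative assumptions, not the module's API, and the module's own function should be preferred in real code.

```python
# The 8-byte signature found at the start of every OLE2 container.
OLE_MAGIC = b'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'


def looks_like_ole(data):
    """Return True if the given bytes start with the OLE2 signature."""
    return data[:len(OLE_MAGIC)] == OLE_MAGIC


def file_looks_like_ole(path):
    """Read only the first bytes of a file and test them for the signature."""
    with open(path, 'rb') as f:
        return looks_like_ole(f.read(len(OLE_MAGIC)))


# A fake header (signature plus padding) passes; a ZIP header does not.
print(looks_like_ole(OLE_MAGIC + b'\x00' * 504))
print(looks_like_ole(b'PK\x03\x04'))
```

A matching signature only says that the file claims to be an OLE container. Detecting malformed content still requires parsing it with OleFileIO_PL.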

OleFileIO_PL can also be used as a script from the command line to display the structure of an OLE file, for example:

OleFileIO_PL.py myfile.doc

A real-life example: using OleFileIO_PL for malware analysis and forensics.

See also this paper about python tools for forensics, which features OleFileIO_PL.

I have published python-oletools, a package of python tools to analyze OLE files based on OleFileIO_PL, mainly for malware analysis and debugging. It includes olebrowse, a graphical tool to browse and extract OLE streams, oleid to quickly identify characteristics of malicious documents, and pyxswf to extract Flash objects (SWF) from OLE files.

How to contribute:

The code is available in a Mercurial repository on bitbucket. You may use it to submit enhancements or to report any issue.

If you would like to help us improve this module, or simply provide feedback, you may also send an e-mail to decalage(at)laposte.net. You can help in many ways:

  • test this module on different platforms / Python versions
  • find and report bugs
  • improve documentation, code samples, docstrings
  • write unittest test cases
  • provide tricky malformed files

To report a bug, for example a normal file which is not parsed correctly, please use the issue reporting page, or send an e-mail with an attachment containing the debugging output of OleFileIO_PL.

To produce this output, run the following command:

OleFileIO_PL.py -d -c file >debug.txt 

Other projects using OleFileIO_PL

  • ExeFilter: to scan and clean active content in file formats
  • py-office-tools: to display records inside Excel and PowerPoint files
  • pyew: a malware analysis tool
  • pyOLEscanner: a malware analysis tool
  • PPTExtractor: to extract images from PowerPoint presentations
  • pyhwp: hwp file format python parser
  • python-oletools: a package of python tools to analyze OLE files based on OleFileIO_PL, mainly for malware analysis and debugging. It includes olebrowse, a graphical tool to browse and extract OLE streams, oleid to quickly identify characteristics of malicious documents, and pyxswf to extract Flash objects (SWF) from OLE files.
  • RC4-40-brute-office: a tool to crack MS Office files using RC4 40-bit encryption
  • punbup: a tool to extract files from McAfee antivirus quarantine files (.bup)
  • Viper: a framework to store, classify and investigate binary files of any sort for malware analysis (also includes code from oleid)
  • Pillow: the friendly fork of PIL, the Python Imaging Library
  • Ghiro: a digital image forensics tool

 

Comments

I did manage to extract embedded OLE objects using OleFileIO_PL alone:

import os
import OleFileIO_PL

# Stream names that usually hold an embedded document.
EMBEDDED_NAMES = ['Workbook', 'WordDocument', 'Package', 'VisioDocument',
                  'PowerPoint Document', 'Book', 'CONTENTS']

def extract_embedded_ole(fname):
    ole = OleFileIO_PL.OleFileIO(fname)
    i = 0
    for stream in ole.listdir():
        for s in stream:
            # Only consider entries nested inside a storage.
            if isinstance(stream, list) and len(stream) > 1:
                i += 1
                if ole.get_type(stream) == OleFileIO_PL.STGTY_STREAM and s in EMBEDDED_NAMES:
                    ole_stream = ole.openstream(stream)
                    out_dir = fname + ".embeddings/" + "/".join(stream[:-1])
                    try:
                        os.makedirs(out_dir)
                    except OSError:
                        pass  # the directory already exists

                    # Write out the stream
                    out_name = out_dir + "/" + os.path.split(fname)[1] + "-emb-" + s + "-" + str(i) + ".ole"
                    out_file = open(out_name, 'wb')
                    out_file.write(ole_stream.read())
                    out_file.close()
    ole.close()