olefile - a Python module to read/write MS OLE2 files

olefile (formerly OleFileIO_PL) is a Python module to read/write Microsoft OLE2 files (also called Structured Storage, Compound File Binary Format or Compound Document File Format), such as Microsoft Office 97-2003 documents, Image Composer and FlashPix files, Outlook messages, StickyNotes, several Microscopy file formats, McAfee antivirus quarantine files, etc.

Quick links: Download - Documentation - Report Issues/Suggestions/Questions - Contact the author - Repository - Updates on Twitter

olefile is based on the OleFileIO module from PIL, the excellent Python Imaging Library, created and maintained by Fredrik Lundh. The olefile API is still compatible with PIL, but since 2005 I have improved the internal implementation significantly, with new features, bugfixes and a more robust design. From 2005 to 2014 the project was called OleFileIO_PL, and in 2014 I changed its name to olefile to celebrate its 9 years and its new write features.

As far as I know, this module is the most complete and robust Python implementation to read MS OLE2 files, portable on several operating systems. Please tell me if you know of other similar Python modules.

Since 2014 olefile/OleFileIO_PL has been integrated into Pillow, the friendly fork of PIL. olefile will continue to be improved as a separate project, and new versions will be merged into Pillow regularly.

olefile can be used as an independent module or with PIL/Pillow.

olefile is mostly meant for developers. If you are looking for tools to analyze OLE files or to extract data (especially for security purposes such as malware analysis and forensics), then please also check my python-oletools, which are built upon olefile and provide a higher-level interface.

News

  • 2014-10-01 v0.40: renamed OleFileIO_PL to olefile, added initial write support for streams >4K, updated doc and license, improved the setup script.
  • 2014-07-31 v0.32alpha: started adding experimental write features
  • 2014-07-27 v0.31: fixed support for large files with 4K sectors (ZVI, OIB), thanks to Niko Ehrenfeuchter, Martijn Berger and Dave Jones. Added test scripts from Pillow (by hugovk). Fixed setup for Python 3 (Martin Panter)
  • 2014-03-17: OleFileIO_PL 0.30 integrated into Pillow.
  • 2014-02-04 v0.30: now compatible with Python 3.x, thanks to Martin Panter who did most of the hard work. I also updated the documentation significantly.
  • 2013-07-24 v0.26: added methods to parse stream/storage timestamps, improved listdir to include storages, fixed parsing of direntry timestamps
  • 2013-05-27 v0.25: improved metadata extraction, properties parsing and exception handling, fixed issue #12
  • 2013-05-07 v0.24: new features to extract metadata (get_metadata method and OleMetadata class), improved getproperties to convert timestamps to Python datetime
  • 2012-10-09: published python-oletools, a package of analysis tools based on OleFileIO_PL
  • 2012-09-11 v0.23: added support for file-like objects, fixed issue #8
  • 2012-02-17 v0.22: fixed issues #7 (bug in getproperties) and #2 (added close method)
  • 2011-10-20: code hosted on bitbucket to ease contributions and bug tracking
  • 2010-01-24 v0.21: fixed support for big-endian CPUs, such as PowerPC Macs.
  • 2009-12-11 v0.20: small bugfix in OleFileIO.open when filename is not plain str.
  • 2009-12-10 v0.19: fixed support for 64-bit platforms (thanks to Ben G. and Martijn for reporting the bug)
  • see changelog in source code for more info.
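
The timestamp features mentioned in the v0.24 and v0.26 entries deal with Windows FILETIME values, which count 100-nanosecond intervals since 1601-01-01 (UTC). The sketch below shows that conversion in plain Python. It is not olefile's own code, and filetime_to_datetime is a hypothetical helper name used only for illustration.

```python
from datetime import datetime, timedelta

# FILETIME epoch: 1601-01-01 00:00:00 (UTC), kept here as a naive datetime.
FILETIME_EPOCH = datetime(1601, 1, 1)


def filetime_to_datetime(filetime):
    """Convert a 64-bit FILETIME integer to a naive UTC datetime.

    FILETIME counts 100-nanosecond ticks; timedelta has microsecond
    resolution, so the value is divided by 10 (sub-microsecond ticks
    are dropped).
    """
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


if __name__ == "__main__":
    # 116444736000000000 ticks separate 1601-01-01 from 1970-01-01.
    print(filetime_to_datetime(116444736000000000))  # 1970-01-01 00:00:00
```

In olefile itself, the getproperties and get_metadata methods already return such timestamps as Python datetime objects, so a helper like this is only needed when working with raw values.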

Features

  • Parse/read/write any OLE file such as Microsoft Office 97-2003 legacy document formats (Word .doc, Excel .xls, PowerPoint .ppt, Visio .vsd, Project .mpp), Image Composer and FlashPix files, Outlook messages, StickyNotes, Zeiss AxioVision ZVI files, ...
  • List all the streams and storages contained in an OLE file
  • Open streams as files
  • Parse and read property streams, containing metadata of the file
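
As an illustration of the features listed above, here is a minimal sketch that lists the streams of an OLE file with their sizes and reads the content of one stream. The helper names summarize_ole and read_stream are hypothetical and not part of olefile. The calls used (isOleFile, OleFileIO, listdir, get_size, exists, openstream, close) come from the olefile API, but please check the API page of the documentation for the exact behaviour.

```python
def summarize_ole(path):
    """Return {stream path: size in bytes}, or None if path is not an OLE2 file."""
    # Imported inside the function so that this sketch can be loaded
    # even on a system where olefile is not installed yet.
    import olefile

    if not olefile.isOleFile(path):
        return None
    ole = olefile.OleFileIO(path)
    try:
        sizes = {}
        # listdir() returns each stream as a list of path components,
        # e.g. ['ObjectPool', '_1234', 'WordDocument'].
        for entry in ole.listdir():
            name = "/".join(entry)  # slash-based path syntax
            sizes[name] = ole.get_size(name)
        return sizes
    finally:
        ole.close()


def read_stream(path, stream_name):
    """Return the raw bytes of one stream, or None if it does not exist."""
    import olefile

    ole = olefile.OleFileIO(path)
    try:
        if not ole.exists(stream_name):
            return None
        return ole.openstream(stream_name).read()
    finally:
        ole.close()
```

For example, summarize_ole("report.doc") would be expected to return a dictionary with entries such as "WordDocument", and read_stream("report.doc", "WordDocument") the bytes of that stream ("report.doc" is a placeholder file name).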

Main improvements over the original version of OleFileIO in PIL:

  • Compatible with Python 3.x and 2.6+
  • Many bug fixes
  • Support for files larger than 6.8MB
  • Support for 64-bit platforms and big-endian CPUs
  • Robust: many checks to detect malformed files
  • Runtime option to choose if malformed files should be parsed or raise exceptions
  • Improved API
  • Metadata extraction, stream/storage timestamps (e.g. for document forensics)
  • Can open file-like objects
  • Added setup.py and install.bat to ease installation
  • More convenient slash-based syntax for stream paths
  • Write features
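
The 6.8 MB limit mentioned in the list above appears to match what can be addressed through the 109 FAT sector locations stored directly in the OLE header, without following additional DIFAT sectors. The arithmetic below is a sketch under that assumption, with the usual 512-byte sectors and 4-byte FAT entries. The function name is made up for illustration.

```python
def max_size_with_header_fat_only(sector_size=512, header_fat_slots=109):
    """Largest amount of data addressable using only the FAT sectors
    listed in the header (assumption: no extra DIFAT sectors are read)."""
    fat_entry_size = 4  # each FAT entry is a 32-bit sector number
    entries_per_fat_sector = sector_size // fat_entry_size
    addressable_sectors = header_fat_slots * entries_per_fat_sector
    return addressable_sectors * sector_size


if __name__ == "__main__":
    limit = max_size_with_header_fat_only()
    print(limit)                    # 7143424 bytes
    print(limit / (1024 * 1024))    # 6.8125 (MiB)
```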

 

Download and Install

 

See the Install page of the documentation.

The archive is available on the project page.

Documentation

Please see the online documentation for more information, especially the OLE overview and the API page which describe how to use olefile in Python applications. A copy of the same documentation is also provided in the doc subfolder of the olefile package.

License

See https://bitbucket.org/decalage/olefileio_pl/wiki/License

 

How to contribute

 

The code is available in a Mercurial repository on bitbucket. You may use it to submit enhancements or to report any issue.

If you would like to help us improve this module, or simply provide feedback, you may also send an e-mail to decalage(at)laposte.net. You can help in many ways:

  • test this module on different platforms / Python versions
  • find and report bugs
  • improve documentation, code samples, docstrings
  • write unittest test cases
  • provide tricky malformed files

To report a bug, for example a normal file which is not parsed correctly, please use the issue reporting page, or send an e-mail with an attachment containing the debugging output of OleFileIO_PL.

To produce this output, run the following command:

OleFileIO_PL.py -d -c file >debug.txt 

Other projects using olefile / OleFileIO_PL

  • ExeFilter: to scan and clean active content in file formats
  • py-office-tools: to display records inside Excel and PowerPoint files
  • pyew: a malware analysis tool
  • pyOLEscanner: a malware analysis tool
  • PPTExtractor: to extract images from PowerPoint presentations
  • pyhwp: hwp file format python parser
  • python-oletools: a package of python tools to analyze OLE files based on OleFileIO_PL, mainly for malware analysis and debugging. It includes olebrowse, a graphical tool to browse and extract OLE streams, oleid to quickly identify characteristics of malicious documents, and pyxswf to extract Flash objects (SWF) from OLE files.
  • RC4-40-brute-office: a tool to crack MS Office files using RC4 40-bit encryption
  • punbup: a tool to extract files from McAfee antivirus quarantine files (.bup)
  • Viper: a framework to store, classify and investigate binary files of any sort for malware analysis (also includes code from oleid)
  • Pillow: the friendly fork of PIL, the Python Imaging Library
  • Ghiro: a digital image forensics tool
  • Nightmare: a distributed fuzzing testing suite, using olefile to fuzz OLE streams and write them back to OLE files

 

Comments

I did manage to extract embedded objects using OleFileIO_PL alone:

import os
import OleFileIO_PL

# Names of the main streams that identify an embedded document
MAIN_STREAMS = ['Workbook', 'WordDocument', 'Package', 'VisioDocument',
                'PowerPoint Document', 'Book', 'CONTENTS']

def extract_embedded_ole(fname):
    ole = OleFileIO_PL.OleFileIO(fname)
    i = 0
    for stream in ole.listdir():
        # Embedded objects are nested inside storages, so skip top-level entries
        if len(stream) < 2:
            continue
        i += 1
        s = stream[-1]  # last path component = name of the stream itself
        # get_type() returns 2 for a stream (as opposed to a storage)
        if ole.get_type(stream) == 2 and s in MAIN_STREAMS:
            ole_stream = ole.openstream(stream)
            out_dir = fname + ".embeddings/" + "/".join(stream[:-1])
            try:
                os.makedirs(out_dir)
            except OSError:
                pass  # the directory already exists

            # Write out the stream
            out_name = out_dir + "/" + os.path.split(fname)[1] + "-emb-" + s + "-" + str(i) + ".ole"
            out_file = open(out_name, 'w+b')
            out_file.write(ole_stream.read())
            out_file.close()
    ole.close()