OleFileIO_PL - a Python module to read MS OLE2 files

OleFileIO_PL is a Python module to read Microsoft OLE2 files (also called Structured Storage or Compound Document File Format), such as Microsoft Office documents, Image Composer and FlashPix files, and Outlook messages. It is an improved version of the OleFileIO module from PIL, the excellent Python Imaging Library, created and maintained by Fredrik Lundh. The API is still compatible with PIL, but the internal implementation has been improved a lot, with bugfixes and a more robust design.

As far as I know, this module is now the most complete and robust Python implementation to read MS OLE2 files, and it is portable across several operating systems. (Please tell me if you know of other similar Python modules.)

WARNING: THIS IS (STILL) WORK IN PROGRESS.

Main improvements over PIL version:

  • Better compatibility with Python 2.4 up to 2.7
  • Support for files larger than 6.8MB
  • Robust: many checks to detect malformed files
  • Improved API
  • Added setup.py and install.bat to ease installation
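
The 6.8MB figure is consistent with the OLE2 layout when only the FAT sectors listed directly in the file header are used: the header has room for 109 FAT sector numbers, and each 512-byte FAT sector holds 128 four-byte entries. The short calculation below is only a sketch of that reasoning, assuming 512-byte sectors; the constant names are illustrative and are not taken from the module.

```python
# Assumption: 512-byte sectors, and only the 109 FAT sector slots of the header.
SECTOR_SIZE = 512
HEADER_FAT_SLOTS = 109
ENTRIES_PER_FAT_SECTOR = SECTOR_SIZE // 4  # one 4-byte entry per sector

max_sectors = HEADER_FAT_SLOTS * ENTRIES_PER_FAT_SECTOR
max_bytes = max_sectors * SECTOR_SIZE
print("%d bytes = %.2f MiB" % (max_bytes, max_bytes / (1024.0 * 1024.0)))
```

Larger files need additional FAT sector lists stored outside the header, which is presumably what the improved version handles.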

News

  • 2011-10-20: code hosted on bitbucket to ease contributions and bug tracking
  • 2010-01-24 v0.21: fixed support for big-endian CPUs, such as PowerPC Macs.
  • 2009-12-11 v0.20: small bugfix in OleFileIO.open when filename is not plain str.
  • 2009-12-10 v0.19: fixed support for 64-bit platforms (thanks to Ben G. and Martijn for reporting the bug)
  • See the changelog in the source code for more info.

Download:

The archive is available on the project page.

License

OleFileIO_PL changes are Copyright (c) 2005-2011 by Philippe Lagadec.

The Python Imaging Library (PIL) is

- Copyright (c) 1997-2005 by Secret Labs AB

- Copyright (c) 1995-2005 by Fredrik Lundh

By obtaining, using, and/or copying this software and/or its associated documentation, you agree that you have read, understood, and will comply with the following terms and conditions:

Permission to use, copy, modify, and distribute this software and its associated documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appears in all copies, and that both that copyright notice and this permission notice appear in supporting documentation, and that the name of Secret Labs AB or the author not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

SECRET LABS AB AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL SECRET LABS AB OR THE AUTHOR BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

How to use this module:

See sample code at the end of the module, and also docstrings.

Here are a few examples:

import OleFileIO_PL

# Test if a file is an OLE container:
assert OleFileIO_PL.isOleFile('myfile.doc')

# Open OLE file:
ole = OleFileIO_PL.OleFileIO('myfile.doc')

# Get list of streams:
print ole.listdir()

# Test if known streams/storages exist:
if ole.exists('worddocument'):
    print "This is a Word document."
    print "size :", ole.get_size('worddocument')
    if ole.exists('macros/vba'):
        print "This document seems to contain VBA macros."

# Extract the "Pictures" stream from a PPT file:
if ole.exists('Pictures'):
    pics = ole.openstream('Pictures')
    data = pics.read()
    # Binary mode is required, otherwise the data may be altered on Windows
    f = open('Pictures.bin', 'wb')
    f.write(data)
    f.close()

It can also be used as a script from the command-line to display the structure of an OLE file, for example:

OleFileIO_PL.py myfile.doc
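
For readers curious about what a test like isOleFile has to check: every OLE2 file starts with a fixed 8-byte signature. The sketch below uses only the standard library; the function name looks_like_ole is hypothetical, and this is not the module's actual code.

```python
# The 8-byte signature found at the start of every OLE2 file.
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'

def looks_like_ole(path):
    """Return True if the file starts with the OLE2 signature (hypothetical helper)."""
    with open(path, 'rb') as f:
        return f.read(len(OLE2_MAGIC)) == OLE2_MAGIC
```

A matching signature only shows that the file claims to be an OLE2 container; the rest of the structure may still be malformed.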

How to contribute:

The code is available in a Mercurial repository on bitbucket. You may use it to submit enhancements or to report any issue.

If you would like to help us improve this module, or simply provide feedback, you may also send an e-mail to decalage(at)laposte.net. You can help in many ways:

  • test this module on different platforms / Python versions
  • find and report bugs
  • improve documentation, code samples, docstrings
  • write unittest test cases
  • provide tricky malformed files

To report a bug, for example a normal file which is not parsed correctly, please use the issue reporting page, or send an e-mail with an attachment containing the debugging output of OleFileIO_PL.

For this, run the following command:

OleFileIO_PL.py -d -c file >debug.txt 

 

Attachment: OleFileIO_PL-0.21.zip (23.9 KB)

Comments

question

My question is: can I extract all images from MS OLE2 documents with OleFileIO_PL?

Extracting images from MS OLE2 documents

Not directly: images are not always stored the same way, and it also depends on the format.

For example, in PowerPoint presentations you may find a stream named "Pictures" when running "OleFileIO_PL yourfile.ppt". You may extract the stream by using the openstream() method of the OleFileIO object, but you will usually get a binary stream containing several picture files. You may also extract it manually using tools such as SSView (http://www.mitec.cz/ssv.html).

Then the only way I've found so far is to use file carving tools which are able to determine the beginning and the end of each picture in a binary file. These tools are not always easy to use but if you're interested have a look at http://pypi.python.org/pypi/hachoir-subfile and http://www.forensicswiki.org/wiki/Tools:Data_Recovery#Carving.

If you really need to automate the process then you have to study Microsoft specifications (at http://www.microsoft.com/interop/docs/officebinaryformats.mspx) and find the right way to parse MS Office documents...

A lot of people (including me) would be very interested if you find a solution! ;-)
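
To give an idea of what file carving means, here is a deliberately naive sketch that looks for PNG images in a block of binary data by searching for the PNG signature and the final IEND chunk. It is an illustration only, not a parser for the Office formats: the function name carve_png is hypothetical, it handles PNG only, and it can be fooled if the bytes "IEND" happen to appear inside image data.

```python
PNG_SIG = b'\x89PNG\r\n\x1a\n'  # signature at the start of every PNG file
PNG_END = b'IEND'               # type tag of the last chunk of a PNG file

def carve_png(data):
    """Return a list of byte strings, one per PNG image found in data."""
    images = []
    pos = 0
    while True:
        start = data.find(PNG_SIG, pos)
        if start < 0:
            break
        end = data.find(PNG_END, start)
        if end < 0:
            break
        end += len(PNG_END) + 4  # include the 4-byte CRC that follows the tag
        images.append(data[start:end])
        pos = end
    return images
```

The data argument could be the result of ole.openstream('Pictures').read(); each returned item can then be written to its own file opened in 'wb' mode.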

How can I extract documents embedded inside?

I am trying to extract the MS Office documents (xls, doc, ppt) embedded inside a file, as their original documents. How can I achieve this?

Here is what I get:

In [6]: ole.dumpdirectory()
'Root Entry' (root) 2816 bytes
{00020820-0000-0000-C000-000000000046}
'\x01CompObj' (stream) 114 bytes
'\x05DocumentSummaryInformation' (stream) 676 bytes
'\x05SummaryInformation' (stream) 200 bytes
'MBD0005263B' (storage)
{B801CA65-A1FC-11D0-85AD-444553540000}
'\x01CompObj' (stream) 93 bytes
'\x01Ole' (stream) 20 bytes
'CONTENTS' (stream) 66833 bytes
'MBD00053027' (storage)
{00020906-0000-0000-C000-000000000046}
'\x01CompObj' (stream) 121 bytes
'\x01Ole' (stream) 20 bytes
'\x05DocumentSummaryInformation' (stream) 5640 bytes
'\x05SummaryInformation' (stream) 384 bytes
'1Table' (stream) 8095 bytes
'Data' (stream) 4563 bytes
'ObjectPool' (storage)
'_1347688647' (storage)
{00020820-0000-0000-C000-000000000046}
'\x01CompObj' (stream) 114 bytes
'\x01Ole' (stream) 20 bytes
'\x03ObjInfo' (stream) 6 bytes
'\x05DocumentSummaryInformation' (stream) 244 bytes
'\x05SummaryInformation' (stream) 200 bytes
'MBD000465A6' (storage)
{B801CA65-A1FC-11D0-85AD-444553540000}
'\x01CompObj' (stream) 93 bytes
'\x01Ole' (stream) 20 bytes
'CONTENTS' (stream) 66833 bytes
'Workbook' (stream) 36816 bytes
'WordDocument' (stream) 15924 bytes
'Workbook' (stream) 175989 bytes

Embedded documents

Unfortunately there is currently no way to extract embedded MS Office documents with OleFileIO alone, because they are not stored as a single stream but as a collection of streams in a storage object (see the ones starting with "MBD" in your example). Extracting them therefore requires creating a new OLE document from scratch and rebuilding its structure with several streams.

There might be alternative solutions: see the message about Excel below, or try the pywin32 modules if your code runs on Windows (see pythoncom.StgOpenStorageEx and then maybe the EnumElements, OpenStorage and CopyTo methods of the PyIStorage object).
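
To make the "collection of streams" point concrete, the sketch below groups stream paths by the embedded storage that directly contains them. It works on a plain list of paths in the same shape as the listdir() output (lists of names), so it does not need the module; the function name group_by_storage and the reliance on the "MBD" prefix are assumptions made for illustration.

```python
def group_by_storage(paths, prefix='MBD'):
    """Map each embedded storage path to the names of the streams directly inside it.

    paths: stream paths as lists of names, in the shape returned by listdir().
    """
    groups = {}
    for path in paths:
        storage, stream = path[:-1], path[-1]
        # Keep only streams whose parent storage looks like an embedded object
        if storage and storage[-1].startswith(prefix):
            groups.setdefault('/'.join(storage), []).append(stream)
    return groups
```

Each resulting group lists the streams that would have to be copied together into a new OLE file to rebuild one embedded document.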

I did manage to extract embedded documents using OleFileIO_PL alone

import os
import OleFileIO_PL

def extract_embedded_ole(fname):
    ole = OleFileIO_PL.OleFileIO(fname)
    i = 0
    for stream in ole.listdir():
        for s in stream:
            # Only look at streams stored inside a storage (path length > 1)
            if type(stream) == type([]) and len(stream) > 1:
                i += 1
                # Type 2 is a stream; keep only well-known main stream names
                if ole.get_type(stream) == 2 and s in ['Workbook', 'WordDocument', 'Package', 'VisioDocument', 'PowerPoint Document', 'Book', 'CONTENTS']:
                    ole_stream = ole.openstream(stream)
                    ole_props = ole.getproperties(['\x05SummaryInformation'])
                    out_dir = fname + ".embeddings/" + "/".join(stream[:-1])
                    try:
                        os.makedirs(out_dir)
                    except OSError:
                        # The directory already exists
                        pass

                    # Write out the raw content of the stream
                    out_name = out_dir + "/" + os.path.split(fname)[1] + "-emb-" + s + "-" + str(i) + ".ole"
                    out_file = open(out_name, 'w+b')
                    out_file.write(ole_stream.read())
                    out_file.close()

array.array should use 'I' for 64-bit compatibility

On 64-bit systems where a C unsigned long is 8 bytes, array.array('L', ...) reads 8-byte items instead of 4-byte ones, so OleFileIO_PL doesn't work there.

The fix is to change all calls like array.array('L', ...) to array.array('I', ...).

Small bug

Hi,

Nice library. I did find a problem with it while using it on a 64-bit system. The construct

a = array.array("L", string)

is used often and doesn't work on a 64-bit system, where for some reason the above eats chunks of 8 bytes. Replacing all the occurrences with array.array("I", string) fixes the issue.

Works perfectly otherwise.

v0.19 fixed for 64-bit platforms

Thanks a lot Ben and Martijn for reporting that bug.

I have made the suggested change in v0.19. Please tell me if it works.

Philippe.

Tested ok

On the 64-bit systems I have access to it works fine, thanks.
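
The issue discussed in this thread comes from the fact that the item size of array type codes follows the platform's C types. As an illustration, the sketch below prints the item sizes on the current platform and shows an alternative based on the struct module, whose '<I' format is always 4 bytes and little-endian regardless of the CPU. This is not the fix applied in v0.19, and the function name read_uint32_le is hypothetical.

```python
import array
import struct

def read_uint32_le(data):
    """Decode a byte string as a list of little-endian 32-bit unsigned integers."""
    count = len(data) // 4
    return list(struct.unpack('<%dI' % count, data[:count * 4]))

# Item sizes of array type codes depend on the platform's C types:
print("array 'L' itemsize: %d" % array.array('L').itemsize)
print("array 'I' itemsize: %d" % array.array('I').itemsize)

# With the '<' prefix, struct uses standard sizes: 'I' is always 4 bytes.
print(read_uint32_le(b'\x01\x00\x00\x00\xfe\xff\xff\xff'))
```

Because the byte order is explicit, the same call should give the same result on big-endian CPUs, with no byte swapping needed.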

Reading MSGraph workbook data

Hi,

First, thanks for writing this, it is very helpful.

I need to get the data values (sheet) from MSGraph.
I did:
f = OleFileIO_PL.OleFileIO('mygraphfile')
f.listdir()
output: [['\x01CompObj'], ['\x01Ole'], ['Workbook']]

and then:
f.openstream('Workbook').read()
gave me a binary stream in which I recognized the data.
Is there a way to extract the data from the binary stream?

Thanks again,

Naor.

reading Excel data

Naor, OleFileIO is only meant to parse the OLE2 structure, not the binary streams inside, which are different for each application (MS Word, Excel, PowerPoint, etc). Here are a few potential solutions:

Extracting just the text from Doc files?

I'm interested in just extracting all the text for .doc files, for the purpose of building a search index. Any ideas on how to do this?

When I read a docfile and I go to print ole.openstream("WordDocument"), I get the text, as well as tons of other binary gibberish. Is there another format inside this stream I'd have to parse to just extract the text?

zvi file format

I am trying to use this plugin for reading in a ZVI file format for Zeiss Microscopy products, which is based upon OLE2.

In the process I discovered what I think is a bug based upon the assumption that the sectorsize is 512 bytes.

Line 1274 was
self.directory_fp = self._open(sect)
now I have
self.directory_fp = self._open(sect,sectorsize=self.SectorSize)

Line 1330 was
def _open(self, start, size = 0x7FFFFFFF, force_FAT=False):
now I have
def _open(self, start, size = 0x7FFFFFFF, force_FAT=False,sectorsize=512):

Lines 1359-1360 were
return _OleStream(self.fp, start, size, sectorsize,
512, self.fat)
now I have
return _OleStream(self.fp, start, size, sectorsize,
self.sectorsize, self.fat)

This made the basic test program given above go from failing to working on a test ZVI file, which has a 4096-byte sector size.

I'm still playing around with using it further, but I hope that the success of reading the directory structure means the rest will work as designed.

Forrest

sectorsize >512

Thanks a lot for reporting the bug and providing a solution, Forrest. I will publish an updated version soon, with other improvements. In the meantime, could you please send me sample ZVI files by e-mail, so that I check if everything works fine?
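
For reference, the sector size at the heart of this bug is not fixed: the OLE2 header stores it as a power of two (the "sector shift", a 16-bit little-endian value at byte offset 30), so 9 means 512 bytes and 12 means 4096 bytes. The sketch below reads it with the standard library only; the function names are hypothetical and this is not the patch that was applied to the module.

```python
import struct

def sector_size_from_header(header):
    """Return the sector size in bytes, given at least the first 32 bytes of an OLE2 file."""
    (sector_shift,) = struct.unpack('<H', header[30:32])
    return 1 << sector_shift

def sector_offset(sector_index, sector_size):
    """File offset of a sector: the header occupies one sector-sized block before sector 0."""
    return (sector_index + 1) * sector_size
```

With 4096-byte sectors, sector 0 therefore starts at offset 4096 rather than 512, which is why code that hard-codes 512 can fail on such files.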

listdir() gives empty list on Outlook MSG files

Hello decalage,

I want to detect whether an OLE file is an Outlook MSG file or not (in case of Outlook messages with changed extensions).

I do this:
ole = OleFileIO_PL.OleFileIO("./ol-msg.msg ")
ole.listdir()
>>[]

It gives an empty list.

What do I need to do to list the contents?

I tested with 7zip:

7z -l ol-msg.msg

and it prints out the contents fine:

Listing archive: ID0020.msg

--
Path = ID0020.msg
Type = Compound
Cluster Size = 4096
Sector Size = 64

Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2011-07-16 02:28:00 D.... __recip_version1.0_#00000003
2011-07-16 02:28:00 D.... __recip_version1.0_#00000002
2011-07-16 02:28:00 D.... __recip_version1.0_#00000001
...

Already dead?

Is this awesome project already dead or abandoned?

If you have abandoned it, can you host your code on github or bitbucket so that people can easily clone it and continue supporting it?

project code now on bitbucket

This project is not dead, but it's true that I haven't touched the code for a while. I just created a repository on bitbucket for it, so that it is easier to contribute: https://bitbucket.org/decalage/olefileio_pl

See the issues page for known bugs and enhancements that have not yet been fixed in the code. Please use it to report any other bug you might have found.