olefile - a Python module to read/write MS OLE2 files

olefile (formerly OleFileIO_PL) is a Python package to parse, read and write Microsoft OLE2 files (also called Structured Storage, Compound File Binary Format or Compound Document File Format), such as Microsoft Office 97-2003 documents, vbaProject.bin in MS Office 2007+ files, Image Composer and FlashPix files, Outlook MSG files, StickyNotes, several Microscopy file formats, McAfee antivirus quarantine files, etc.

News

  • 2018-09-09 v0.46: OleFileIO can now be used as a context manager (with…as), to close the file automatically (see doc). Improved handling of malformed files, fixed several bugs.
  • 2018-01-24 v0.45: olefile can now overwrite streams of any size, improved handling of malformed files, fixed several bugs, end of support for Python 2.6 and 3.3.
  • 2017-01-06 v0.44: several bugfixes, removed support for Python 2.5 (olefile2), added support for incomplete streams and incorrect directory entries (to read malformed documents), added getclsid, improved documentation with API reference.
  • 2017-01-04: moved the documentation to ReadTheDocs
  • 2016-05-20: moved olefile repository to GitHub
  • 2016-02-02 v0.43: fixed issues #26 and #27, better handling of malformed files, use Python logging.
  • 2015-01-25 v0.42: improved handling of special characters in stream/storage names on Python 2.x (using UTF-8 instead of Latin-1), fixed bug in listdir with empty storages.
  • 2014-11-25 v0.41: OleFileIO.open and isOleFile now support OLE files stored in byte strings, fixed installer for Python 3, added support for Jython (Niko Ehrenfeuchter).
  • 2014-10-01 v0.40: renamed OleFileIO_PL to olefile, added initial write support for streams >4K, updated doc and license, improved the setup script.

Download and Install

If you have pip or setuptools installed (pip is included in Python 2.7.9+), you may simply run pip install olefile or easy_install olefile for the first installation.

To update olefile, run pip install -U olefile.

Otherwise, see http://olefile.readthedocs.io/en/latest/Install.html

Features

  • Parse/read/write any OLE file such as Microsoft Office 97-2003 legacy document formats (Word .doc, Excel .xls, PowerPoint .ppt, Visio .vsd, Project .mpp), Image Composer and FlashPix files, Outlook messages, StickyNotes, Zeiss AxioVision ZVI files, ...
  • List all the streams and storages contained in an OLE file
  • Open streams as files
  • Parse and read property streams, containing metadata of the file
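
As a hedged illustration of the features listed above, the sketch below lists every stream with its size and reads one stream into memory. The helper names describe_streams, read_stream and print_streams are hypothetical (they are not part of olefile); they only rely on the listdir(), get_size() and openstream() methods of an OleFileIO object, and the file name passed to print_streams is a placeholder.

```python
def describe_streams(ole):
    """Return (path, size_in_bytes) for every stream of an opened OLE file.

    `ole` is expected to behave like olefile.OleFileIO: listdir() returns
    each stream as a list of path components, get_size() returns its size.
    """
    return [("/".join(entry), ole.get_size(entry)) for entry in ole.listdir()]


def read_stream(ole, path):
    """Read a whole stream; openstream() returns a file-like object."""
    return ole.openstream(path).read()


def print_streams(filename):
    """Open an OLE file (placeholder path) and print its streams."""
    import olefile  # third-party package: pip install olefile

    # The with...as form needs olefile 0.46 or later (see News).
    with olefile.OleFileIO(filename) as ole:
        for path, size in describe_streams(ole):
            print("%s: %d bytes" % (path, size))
```

Calling print_streams("sample.doc") on a real document would print one line per stream. With older versions that lack context-manager support, the file has to be closed explicitly with ole.close().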

olefile can be used as an independent module or with PIL/Pillow.

olefile is mostly meant for developers. If you are looking for tools to analyze OLE files or to extract data (especially for security purposes such as malware analysis and forensics), then please also check my python-oletools, which are built upon olefile and provide a higher-level interface.

Documentation

Please see the online documentation for more information.

License

See http://olefile.readthedocs.io/en/latest/License.html

Other projects using olefile / OleFileIO_PL

  • python-oletools: a package of python tools to analyze OLE files and MS Office documents, mainly for malware analysis and debugging. It includes olebrowse, a graphical tool to browse and extract OLE streams, oleid to quickly identify characteristics of malicious documents, olevba to detect/extract/analyze VBA macros, and pyxswf to extract Flash objects (SWF) from OLE files.
  • oledump: a tool to analyze malicious MS Office documents and extract VBA macros
  • ExeFilter: to scan and clean active content in file formats (e.g. MS Office VBA macros)
  • py-office-tools:  to display records inside Excel and PowerPoint files
  • pyew: a malware analysis tool
  • pyOLEscanner: a malware analysis tool
  • PPTExtractor: to extract images from PowerPoint presentations
  • msg-extractor: to parse MS Outlook MSG files
  • pyhwp: hwp file format python parser
  • RC4-40-brute-office: a tool to crack MS Office files using RC4 40-bit encryption
  • punbup: a tool to extract files from McAfee antivirus quarantine files (.bup)
  • Viper: a framework to store, classify and investigate binary files of any sort for malware analysis (also includes code from oleid)
  • Pillow: the friendly fork of PIL, the Python Image Library
  • Ghiro: a digital image forensics tool
  • Nightmare: a distributed fuzz testing suite, using olefile to fuzz OLE streams and write them back to OLE files.

Comments

question

My question is: Can I extract all images from MS OLE2 documents with OleFileIO_PL?

Extracting images from MS OLE2 documents

Not directly: images are not always stored the same way, and it also depends on the format.

For example, in PowerPoint presentations you may find a stream named "Pictures" when running "OleFileIO_PL yourfile.ppt". You can extract that stream with the openstream() method of the OleFileIO object, but you will usually get a binary stream containing several picture files. You can also extract it manually using tools such as SSView (http://www.mitec.cz/ssv.html).

Then the only way I've found so far is to use file carving tools which are able to determine the beginning and the end of each picture in a binary file. These tools are not always easy to use but if you're interested have a look at http://pypi.python.org/pypi/hachoir-subfile and http://www.forensicswiki.org/wiki/Tools:Data_Recovery#Carving.
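
To give an idea of what file carving means in practice, here is a minimal sketch that scans a binary blob for PNG and JPEG candidates using their well-known start and end markers. The function names carve and carve_images are hypothetical, and this is a naive approach rather than what the tools above do: the JPEG end marker can also appear inside embedded thumbnails, so results should be treated as candidates to verify.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
PNG_TRAILER = b"IEND\xaeB`\x82"  # IEND chunk type followed by its fixed CRC
JPEG_START = b"\xff\xd8\xff"     # start-of-image marker plus next marker prefix
JPEG_END = b"\xff\xd9"           # end-of-image marker


def carve(data, start_marker, end_marker):
    """Yield each byte span that begins with start_marker and ends with end_marker."""
    pos = 0
    while True:
        start = data.find(start_marker, pos)
        if start == -1:
            return
        end = data.find(end_marker, start + len(start_marker))
        if end == -1:
            return
        end += len(end_marker)
        yield data[start:end]
        pos = end  # continue searching after the carved span


def carve_images(data):
    """Return (kind, bytes) pairs for the PNG and JPEG candidates found in data."""
    found = [("png", blob) for blob in carve(data, PNG_SIGNATURE, PNG_TRAILER)]
    found += [("jpeg", blob) for blob in carve(data, JPEG_START, JPEG_END)]
    return found
```

The input would typically be the bytes returned by openstream(...).read() on the "Pictures" stream. Pictures stored in other formats (for example compressed metafiles) will not be found this way.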

If you really need to automate the process then you have to study Microsoft specifications (at http://www.microsoft.com/interop/docs/officebinaryformats.mspx) and find the right way to parse MS Office documents...

A lot of people (including me) would be very interested if you find a solution! ;-)

How can I extract embedded documents?

I am trying to extract MS documents (xls, doc, ppt) embedded inside an OLE file as their original documents. How can I achieve this?

Here is what I get:

In [6]: ole.dumpdirectory()
'Root Entry' (root) 2816 bytes
{00020820-0000-0000-C000-000000000046}
  '\x01CompObj' (stream) 114 bytes
  '\x05DocumentSummaryInformation' (stream) 676 bytes
  '\x05SummaryInformation' (stream) 200 bytes
  'MBD0005263B' (storage)
  {B801CA65-A1FC-11D0-85AD-444553540000}
    '\x01CompObj' (stream) 93 bytes
    '\x01Ole' (stream) 20 bytes
    'CONTENTS' (stream) 66833 bytes
  'MBD00053027' (storage)
  {00020906-0000-0000-C000-000000000046}
    '\x01CompObj' (stream) 121 bytes
    '\x01Ole' (stream) 20 bytes
    '\x05DocumentSummaryInformation' (stream) 5640 bytes
    '\x05SummaryInformation' (stream) 384 bytes
    '1Table' (stream) 8095 bytes
    'Data' (stream) 4563 bytes
    'ObjectPool' (storage)
      '_1347688647' (storage)
      {00020820-0000-0000-C000-000000000046}
        '\x01CompObj' (stream) 114 bytes
        '\x01Ole' (stream) 20 bytes
        '\x03ObjInfo' (stream) 6 bytes
        '\x05DocumentSummaryInformation' (stream) 244 bytes
        '\x05SummaryInformation' (stream) 200 bytes
        'MBD000465A6' (storage)
        {B801CA65-A1FC-11D0-85AD-444553540000}
          '\x01CompObj' (stream) 93 bytes
          '\x01Ole' (stream) 20 bytes
          'CONTENTS' (stream) 66833 bytes
        'Workbook' (stream) 36816 bytes
    'WordDocument' (stream) 15924 bytes
  'Workbook' (stream) 175989 bytes

Embedded documents

Unfortunately there is currently no way to extract embedded MS Office documents with OleFileIO alone, because they are not stored as a single stream but as a collection of streams in a storage object (see the ones starting with "MBD" in your example). So extracting them requires creating a new OLE document from scratch and rebuilding its structure with several streams.

There might be alternative solutions: see the message about Excel below, or try the pywin32 modules if your code runs on Windows (see pythoncom.StgOpenStorageEx and then maybe the EnumElements, OpenStorage and CopyTo methods of the PyIStorage object).

I did manage to extract embedded documents using OleFileIO_PL alone

import os
import OleFileIO_PL

# Names of the main streams that identify an embedded document
MAIN_STREAMS = ['Workbook', 'WordDocument', 'Package', 'VisioDocument',
                'PowerPoint Document', 'Book', 'CONTENTS']

def extract_embedded_ole(fname):
    ole = OleFileIO_PL.OleFileIO(fname)
    i = 0
    for stream in ole.listdir():
        # Only keep streams (type 2) nested inside a storage (path longer than 1)
        if len(stream) > 1 and ole.get_type(stream) == 2 and stream[-1] in MAIN_STREAMS:
            i += 1
            out_dir = fname + ".embeddings/" + "/".join(stream[:-1])
            try:
                os.makedirs(out_dir)
            except OSError:
                pass  # the directory already exists

            # Write out the raw content of the stream
            out_name = out_dir + "/" + os.path.split(fname)[1] + "-emb-" + stream[-1] + "-" + str(i) + ".ole"
            out_file = open(out_name, 'w+b')
            out_file.write(ole.openstream(stream).read())
            out_file.close()
    ole.close()

array.array should use 'I' for 64-bit compatibility

On 64-bit systems, array.array('L', ...) expects the buffer to be 64-bit aligned, so OleFileIO_PL doesn't work there.

The fix is to change all calls like array.array('L', ...) to array.array('I', ...).
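
The difference between the two typecodes can be checked directly. The sketch below (Python 3, with the hypothetical helper names load_sector_indexes and load_sector_indexes_struct) decodes a buffer of 32-bit little-endian integers with array typecode 'I', and shows a struct-based alternative whose '<I' format is 4 bytes on every platform. It is an illustration of the issue, not the code used in OleFileIO_PL.

```python
import array
import struct
import sys


def load_sector_indexes(raw):
    """Decode a buffer of little-endian 32-bit unsigned integers with array.

    Typecode 'I' is 4 bytes on common platforms, whereas 'L' is 8 bytes on
    64-bit Linux and macOS, so 'L' would consume the buffer in 8-byte chunks.
    """
    values = array.array("I")
    if values.itemsize != 4:
        raise RuntimeError("unsigned int is not 32 bits wide on this platform")
    values.frombytes(raw)
    if sys.byteorder == "big":
        values.byteswap()  # OLE files store integers in little-endian order
    return list(values)


def load_sector_indexes_struct(raw):
    """Same decoding with struct; trailing bytes that do not fill 4 bytes are ignored."""
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw[: count * 4]))
```

On a given machine, array.array("L").itemsize shows whether 'L' is 4 or 8 bytes wide there.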

Small bug

Hoi,

Nice library. Did find a problem with it while using it on a 64-bit system. The construct

a = array.array("L", string)

is used often and doesn't work on 64-bit systems, where for some reason it consumes the string in 8-byte chunks. Replacing all the occurrences with array.array("I", string) fixes the issue.

Works perfectly otherwise.

v0.19 fixed for 64 bits platforms

Thanks a lot Ben and Martijn for reporting that bug.

I have made the suggested change in v0.19. Please tell me if it works.

Philippe.

Tested ok

On the 64-bit systems I have access to it works fine, thanks.

Reading MSGraph workbook data

Hi,

First, thanks for writing this, it is very helpful.

I need to get the data values (sheet) from MSGraph.
I did:
f=OleFileIO_PL.OleFileIO('mygraphfile')
f.listdir()
output: [['\x01CompObj'], ['\x01Ole'], ['Workbook']]

and now:
f.openstream('Workbook').read()
gave me a binary stream in which I recognized the data.
Is there a way to extract the data from the binary stream?

Thanks again,

Naor.

reading Excel data

Naor, OleFileIO is only meant to parse the OLE2 structure, not the binary streams inside which are different for each application (MS Word, Excel, Powerpoint, etc). Here are a few potential solutions:

Extracting just the text from Doc files?

I'm interested in just extracting all the text for .doc files, for the purpose of building a search index. Any ideas on how to do this?

When I read a docfile and I go to print ole.openstream("WordDocument"), I get the text, as well as tons of other binary gibberish. Is there another format inside this stream I'd have to parse to just extract the text?

zvi file format

I am trying to use this plugin for reading in a ZVI file format for Zeiss Microscopy products, which is based upon OLE2.

In the process I discovered what I think is a bug, caused by the assumption that the sector size is 512 bytes.

Line 1274 was:
    self.directory_fp = self._open(sect)
Now I have:
    self.directory_fp = self._open(sect, sectorsize=self.SectorSize)

Line 1330 was:
    def _open(self, start, size = 0x7FFFFFFF, force_FAT=False):
Now I have:
    def _open(self, start, size = 0x7FFFFFFF, force_FAT=False, sectorsize=512):

Lines 1359-1360 were:
    return _OleStream(self.fp, start, size, sectorsize,
                      512, self.fat)
Now I have:
    return _OleStream(self.fp, start, size, sectorsize,
                      self.sectorsize, self.fat)

This change made the basic test program given above go from failing to working on a test ZVI file with a 4096-byte sector size.

I'm still playing around with using it further, but I hope that the success of reading the directory structure means the rest will work as designed.

Forrest
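
For readers who want to check the sector size of a file themselves, the sketch below reads it from the OLE header instead of assuming 512 bytes. It relies on the Compound File Binary header layout, where two little-endian 16-bit fields at offsets 30 and 32 store the sector size and mini-sector size as powers of two. The function name read_sector_sizes is hypothetical and is not part of OleFileIO_PL.

```python
import struct

OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def read_sector_sizes(header):
    """Return (sector_size, mini_sector_size) from the first bytes of an OLE file."""
    if len(header) < 34 or header[:8] != OLE_MAGIC:
        raise ValueError("not an OLE2 header")
    # Both sizes are stored as shift counts: size = 2 ** shift.
    sector_shift, mini_sector_shift = struct.unpack_from("<HH", header, 30)
    return 1 << sector_shift, 1 << mini_sector_shift
```

A typical call would pass the first 512 bytes of the file, for example read_sector_sizes(open(path, "rb").read(512)), where path is a placeholder. A shift of 9 gives the usual 512-byte sectors, and a shift of 12 gives 4096-byte sectors like the ZVI file described above.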

sectorsize >512

Thanks a lot for reporting the bug and providing a solution, Forrest. I will publish an updated version soon, with other improvements. In the meantime, could you please send me sample ZVI files by e-mail, so that I can check if everything works fine?

listdir() gives empty list on Outlook MSG files

Hello decalage,

I want to detect whether an OLE file is an Outlook MSG file or not (for MSG files whose extension has been changed).

I do this:
ole = OleFileIO_PL.OleFileIO("./ol-msg.msg ")
ole.listdir()
>>[]

It gives an empty list.

What do I need to do to list the contents?

I tested with 7-Zip:

7z -l ol-msg.msg

and it prints the contents fine:

Listing archive: ID0020.msg

--
Path = ID0020.msg
Type = Compound
Cluster Size = 4096
Sector Size = 64

Date Time Attr Size Compressed Name
------------------- ----- ------------ ------------ ------------------------
2011-07-16 02:28:00 D.... __recip_version1.0_#00000003
2011-07-16 02:28:00 D.... __recip_version1.0_#00000002
2011-07-16 02:28:00 D.... __recip_version1.0_#00000001
...

Already dead?

Is this awesome project already dead or abandoned?

If you have abandoned it, could you host your code on GitHub or Bitbucket so that people can clone it easily and continue supporting it?

project code now on bitbucket

This project is not dead, but it's true that I haven't touched the code for a while. I just created a repository on Bitbucket for it, so that it is easier to contribute: https://bitbucket.org/decalage/olefileio_pl

See the issues page for known bugs and enhancements that have not yet been fixed in the code. Please use it to report any other bug you might have found.

Inserting content

I'm looking to write some python code that picks up TAGs that I'll embed within a word document. These Tags will serve as placeholders to insert content. In my case, I'm trying to develop a "survey/Questionnaire" Python script, that takes as input a word document that serves as a "template" (has all the formatting I want). I want my python code to read the word document, find the tags, and then execute appropriate handlers. For instance, one handler will be simply to look up the associated content from a database, and then inserting that content into the file.

Can anyone show a snippet of code that would search for a string and replace it with another? Should I simply be using Win32Com instead?

(yes, I'm a newb - trying to quickly come up to speed)

I'd like to also manage more complex formatting from Python - such as creating tables - setting margins, etc. But that's down the road... but, can anyone comment if such a thing is possible?

Re: Inserting content

Unfortunately you can't do that with OleFileIO, because it is currently still a parser (no editing), and it only parses the OLE2 structure, not the specific Word content.

However, maybe you can achieve this using python-docx: https://github.com/mikemaccana/python-docx

You may also try win32com, by using OLE to control the MS Word application.

On a similar topic, I just published a new module to parse MS Word forms with tags, called pywordform.

passing file to

Hello,

I passed a file() to OleFileIO() and got an error in this line 979:
filesize = os.path.getsize(filename)
but in line 847 you check for a file object
if hasattr(filename, 'read'):

just for your information.
thanks for this module!

bug with file object

Thanks a lot for reporting this bug, indeed OleFileIO should support file-like objects.

I opened a ticket for this, will fix the code soon: https://bitbucket.org/decalage/olefileio_pl/issue/8/bug-with-file-object

Text from a Word doc

Your library makes it really easy to get up and running and look at the structure of a Word doc, but what I am mainly interested in is the actual text. I'd like to be able to get the 'WordDocument' portion of the stream and discard anything that isn't actual content, so I'm left with just a plain text version of the document.

Following the examples it is easy to get the document to parse but outputting pieces of the stream still includes binary data. Is there a way to get only the text?

I'm trying to parse uploaded documents to pull out key words for search indexing. I have PDF and DOCX working using (PyPDF and python-docx) but old school DOC files are troublesome and unfortunately still extremely commonly used. This library is one of the few I have found that will handle old DOC formats.

Any advice would be appreciated!

Sectors

Hey,

to replace data inside an OLE file I'd like to get a list of where each sector starts and ends for a specific stream of the OLE.

I started by collecting "offset + sectorsize * sect" in _OleStream under "for i in range(nb_sectors):", but that doesn't work for all OLE files, as apparently _OleStream is sometimes used to read somehow pre-processed data.

So, let's say the OLE includes a stream "example.txt", and it's 600 bytes long; I'd like to get a list that might look, for example, like this:
[
[2048, 2175], # 128 bytes
[2176, 2303], # 128 bytes
[2304, 2431], # 128 bytes
[2560, 2687], # 128 bytes
[17408, 17495] # 88 bytes
]

This would enable anyone to quite easily write new data into an OLE file, as long as the size and structure of things remain the same. I'm aware I might be breaking some checksums somewhere or something, but that's not an important issue in my case.

Can I get some help?
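
The list of byte ranges described above can be computed from a sector chain, as in the following sketch. The function names sector_chain and stream_byte_ranges are hypothetical, and the FAT is assumed to be available as a plain list of integers. Two caveats apply to real files: for regular streams the base offset is one sector size, because the header occupies the first sector of the file, and streams smaller than the mini-stream cutoff (4096 bytes) live in 64-byte mini-sectors inside the root entry's stream, so their offsets are relative to that stream rather than to the file. The second point is likely why collecting offsets in _OleStream does not work for all streams.

```python
ENDOFCHAIN = 0xFFFFFFFE


def sector_chain(fat, first_sector):
    """Follow the FAT from first_sector and return the list of sector indexes."""
    chain = []
    seen = set()
    sect = first_sector
    while sect != ENDOFCHAIN:
        if sect in seen or sect >= len(fat):
            raise ValueError("corrupt FAT chain")
        seen.add(sect)
        chain.append(sect)
        sect = fat[sect]
    return chain


def stream_byte_ranges(chain, stream_size, sector_size, base_offset):
    """Return inclusive [start, end] offsets covering stream_size bytes."""
    ranges = []
    remaining = stream_size
    for sect in chain:
        if remaining <= 0:
            break
        length = min(sector_size, remaining)  # the last sector may be partial
        start = base_offset + sect * sector_size
        ranges.append([start, start + length - 1])
        remaining -= length
    return ranges
```

With 128-byte sectors, a base offset of 0 and the chain 16, 17, 18, 20, 136, a 600-byte stream yields exactly the five ranges of the example list above.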

Write a stream back to disk

Hi Rudolf, adding the possibility to write sectors and streams back to an OLE file is something I have been planning to implement for a long time. I think it would be easier to provide methods to overwrite a single sector, and then to overwrite an existing stream with data of the same size. This is recorded in this ticket: https://bitbucket.org/decalage/olefileio_pl/issue/6/improve-olefileio_pl... - Would that cover your needs?