Origapy - a Python module to sanitize PDF files

Origapy is a Python interface to Origami, a PDF parser written in Ruby. It provides access to pdfclean.rb in order to sanitize PDF files by disabling all active content (JavaScript, launch actions, embedded files, etc.). Because Origami is a full PDF parser, it is much more effective than PDFiD at sanitizing (disarming) PDF files, but also quite a bit slower.

Origapy uses a simple Python/Ruby bridge based on pipes, as described on this page.
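
To illustrate the general idea of a pipe-based bridge, here is a minimal sketch. It is not Origapy's actual code or protocol: the function name ask_child and the one-line request/reply format are assumptions made for this example, and the child process is a Python one-liner used as a stand-in so the sketch runs without Ruby installed.

```python
import subprocess
import sys


def ask_child(argv, request):
    """Send one line to a child process over a pipe and return its reply."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text-mode pipes on both Python 2 and 3
    )
    # communicate() writes the request, closes stdin and waits for the child.
    out, _ = proc.communicate(request + "\n")
    if proc.returncode != 0:
        raise RuntimeError("child exited with status %d" % proc.returncode)
    return out.strip()


# Stand-in child (assumption): upper-cases the single line it receives.
CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.readline().upper())",
]

reply = ask_child(CHILD, "file.pdf")
```

With Ruby available, argv would instead name the Ruby interpreter and a script that reads its request from standard input and writes its answer to standard output.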

WARNING: this is still a work in progress. The current version of the Origami parser may trigger errors on some PDF files.

Changelog

2010-09-12 v0.09: updated Origami engine to v1.0.0-beta3
2009-10-02 v0.08: updated Origami engine to v1.0.0-beta1
2009-09-30 v0.07: detects when a file is clean or cleaned, raises an exception when an error occurs

License

Origapy and Origami are open-source, published under GPL v3.

Download

Pick the attached file below.

Requirements

Python 2.x
Ruby 1.8.x

Install

Unzip and run install.bat on Windows, or "python setup.py install" on other platforms.

Usage

import origapy
pc = origapy.PDF_Cleaner()
pc.clean('file.pdf', 'cleaned.pdf')
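
Since the cleaner raises an exception when an error occurs (see the v0.07 changelog entry), a batch job probably wants to catch failures per file. The following is only a sketch: clean_many is a hypothetical helper, not part of Origapy, and because the exact exception class is not stated on this page, it catches the generic Exception. The cleaner is passed in as a parameter, so any object with a clean(input, output) method will do.

```python
import os


def clean_many(cleaner, paths, suffix="_cleaned"):
    """Clean each PDF in paths; return {input_path: output_path or None}."""
    results = {}
    for path in paths:
        root, ext = os.path.splitext(path)
        out_path = root + suffix + ext
        try:
            cleaner.clean(path, out_path)
        except Exception as exc:  # assumption: exact exception class unknown
            print("could not clean %s: %s" % (path, exc))
            results[path] = None
        else:
            results[path] = out_path
    return results
```

With Origapy installed, this could be called as clean_many(origapy.PDF_Cleaner(), ['a.pdf', 'b.pdf']); files that failed to parse map to None in the returned dictionary.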

Alternatives

pdfid

Attachment	Size
origapy-0.09.zip	141.02 KB

Comments

Fri, 10/01/2010 - 09:34 — anhldbk (not verified)

Question

I really don't understand. PDFiD is not a tool for sanitizing PDF files. Primarily, it's a means for us to get a quick look at a particular PDF file to see if it is malicious. So what are the advantages of Origami over PDFiD?

Fri, 10/01/2010 - 19:21 — decalage

Origami vs. PDFiD

PDFiD is primarily an analysis tool, but it may also be used to disarm PDF files by disabling JavaScript, embedded files, launch actions, etc., with the "-d" option on the command line or the disarm=True parameter. This is the same as using the pdfclean.rb script provided with Origami.

The advantage of Origami over PDFiD is that it is a full PDF parser which can decode all complex PDF structures (such as indirect objects hidden inside object streams), whereas PDFiD is only meant to quickly find well-known keywords in the structure without analyzing the content of objects.

Sun, 01/30/2011 - 04:29 — anhldbk (not verified)

Ok!

Let me try it out. Anyway, thank you so much for this exciting tool! :)
