Origapy - a Python module to sanitize PDF files

Origapy is a Python interface to Origami, a PDF parser written in Ruby. It provides access to pdfclean.rb in order to sanitize PDF files by disabling all active content (JavaScript, launch actions, embedded files, etc.). Because Origami is a full PDF parser, it is much more effective than PDFiD at sanitizing (disarming) PDF files, but also considerably slower.

Origapy uses a simple Python/Ruby bridge based on pipes, as described on this page.
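
The exact protocol of Origapy's bridge is not described here, so the following is only a minimal sketch of a pipe-based bridge in general. It assumes the child process reads one line on stdin and answers with one line on stdout. The names ask_child and ECHO_CHILD are hypothetical, and the child is a small Python one-liner standing in for a Ruby script so that the sketch runs without Ruby installed.

```python
import subprocess
import sys


def ask_child(command, request):
    """Send one line to a child process over a pipe and return its one-line reply."""
    proc = subprocess.Popen(
        command,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # exchange text rather than bytes
    )
    # communicate() writes the request, closes stdin and waits for the child to exit
    reply, _ = proc.communicate(request + "\n")
    if proc.returncode != 0:
        raise RuntimeError("child exited with status %d" % proc.returncode)
    return reply.strip()


# Stand-in child (assumption): echoes the line it receives, prefixed with "got ".
# A real bridge would start the Ruby interpreter with a script instead.
ECHO_CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write('got ' + sys.stdin.readline())",
]
```

Calling ask_child(ECHO_CHILD, "file.pdf") returns the string "got file.pdf". Starting one process per request keeps the sketch short; a long-lived child that serves several requests would avoid paying the interpreter start-up cost each time.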

WARNING: this is still a work in progress. The current version of the Origami parser may trigger errors on some PDF files.

Changelog

  • 2010-09-12 v0.09: updated Origami engine to v1.0.0-beta3
  • 2009-10-02 v0.08: updated Origami engine to v1.0.0-beta1
  • 2009-09-30 v0.07: detects when a file is clean or cleaned, raises an exception when an error occurs

License

Origapy and Origami are open-source, published under GPL v3.

Download

Pick the attached file below.

Requirements

  • Python 2.x
  • Ruby 1.8.x

Install

Unzip and run install.bat on Windows, or "python setup.py install" on other platforms.

Usage

import origapy
pc = origapy.PDF_Cleaner()
pc.clean('file.pdf', 'cleaned.pdf')
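
Since the parser may fail on some files, a batch run probably should not stop at the first error. The sketch below wraps the clean(source, destination) call shown above in a loop. The helper name clean_all is hypothetical, and because the exception type raised by Origapy is not documented here, it catches Exception in general, which is an assumption.

```python
import os


def clean_all(cleaner, pdf_paths, out_dir):
    """Clean each PDF into out_dir; return (cleaned, failed) lists.

    cleaner is any object with a clean(source, destination) method,
    such as the PDF_Cleaner instance from the usage example.
    """
    cleaned, failed = [], []
    for src in pdf_paths:
        dst = os.path.join(out_dir, os.path.basename(src))
        try:
            cleaner.clean(src, dst)
        except Exception as exc:  # assumption: the exact exception type is unknown
            failed.append((src, str(exc)))
        else:
            cleaned.append(dst)
    return cleaned, failed
```

With Origapy installed, a call would look like clean_all(origapy.PDF_Cleaner(), ['a.pdf', 'b.pdf'], 'cleaned'), where the output directory is assumed to exist already. Files listed in failed can then be inspected by hand.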

Alternatives

Attachment         Size
origapy-0.09.zip   141.02 KB

Comments

Question

I really don't understand. PDFiD is not a tool for sanitizing PDF files. Primarily, it's a means for us to get a quick look at a particular PDF file to see whether it is malicious. So what are the advantages of Origami over PDFiD?

Origami vs. PDFiD

PDFiD is primarily an analysis tool, but it may also be used to disarm PDF files by disabling JavaScript, embedded files, launch actions, etc., with the "-d" option on the command line or the disarm=True parameter. This is the same as using the pdfclean.rb script provided with Origami.

The advantage of Origami over PDFiD is that it is a full PDF parser which can decode all complex PDF structures (such as indirect objects hidden inside object streams), whereas PDFiD is only meant to quickly find well-known keywords in the structure without analyzing the content of objects.
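
The difference can be illustrated with a toy example. This is not a valid PDF and not how either tool is implemented; it only assumes that the object containing the keyword is stored deflate-compressed, as objects inside an object stream usually are. The names keyword_scan and decoding_scan are hypothetical.

```python
import zlib

KEYWORD = b"/JavaScript"

# Toy stand-in for an object stream: the dictionary holding the keyword is
# stored compressed, so the keyword does not appear in the raw bytes.
hidden_object = b"<< /S /JavaScript /JS (app.alert(1)) >>"
raw_file = b"stream\n" + zlib.compress(hidden_object) + b"\nendstream"


def keyword_scan(data):
    """Search the raw bytes only, without decoding anything."""
    return KEYWORD in data


def decoding_scan(data):
    """Inflate the stream body first, then search the decoded bytes."""
    body = data[len(b"stream\n"):-len(b"\nendstream")]
    return KEYWORD in zlib.decompress(body)
```

Here keyword_scan(raw_file) finds nothing, while decoding_scan(raw_file) finds the keyword once the stream has been decoded.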

Ok!

Let me try it out. Anyway, thank you so much for this exciting tool! :)