Weaponized PDF - Payload Delivery Format

This article describes the PDF file format, related security issues and useful resources. [WORK IN PROGRESS]

The original location of this article is http://www.decalage.info/file_formats_security/pdf

Last update: 2017-11-10 (created 2010-02-13)

File format description

PDF (Portable Document Format) is a file format designed by Adobe. It is mainly used to publish final versions of documents on the Internet, by e-mail or on CD-ROMs. Its main purpose is to display or print documents with a fixed layout. The PDF format may also be used to create electronic forms.

More info: http://en.wikipedia.org/wiki/Portable_Document_Format

Main client applications

The main application used to open PDF files for display is Adobe Reader. Many alternative applications are also able to display PDF files, such as Preview on Mac OS X and Foxit Reader on Windows.

Adobe Acrobat is one of the applications which can create and edit PDF documents.

Main security issues

PDF is usually considered a static and safe format for document exchange, but this perception is wrong.

The PDF format is in fact very complex, and contains several features which may lead to security issues:

  • JavaScript: Adobe Reader (and possibly other readers) contains a JavaScript engine similar to the ones used by web browsers, but with a slightly different API to manipulate PDF content dynamically or to control some viewer features. Potentially dangerous features are restricted for obvious security reasons. However, this means that PDF documents are not purely static: for example, some actions may be used to fool a user (popups) or to send e-mails and HTTP requests automatically. Furthermore, experience shows that many recent vulnerabilities have been exploited using JavaScript in PDF.
  • Launch actions: a PDF file may launch any command on the operating system, after user confirmation (popup message). Different command lines may be specified for Windows, Unix and Mac. On Windows only, parameters can be provided for the command. Until Adobe Reader 9.3.2, the CVE-2010-1240 vulnerability made it possible to fool users by modifying the text of the popup message. Since Adobe Reader 9.3.3, a blacklist restricts the file formats that can be opened, blocking executable files by default. A way to bypass this blacklist was found, and it was finally fixed in v9.3.4.
  • Embedded files: a PDF file may contain attached files, which can be extracted and opened from the reader. This trick may be used to hide malicious executables in order to bypass some antivirus and content analysis engines. Fortunately, Adobe Reader refuses to open embedded files if their extension is part of a blacklist, such as EXE, BAT, CMD, etc. However, this blacklist is not perfect and formats such as HTML or Python scripts may be embedded in PDF and launched from Adobe Reader.
  • GoToE actions: a PDF file may be embedded inside another PDF file, and a GoToE action may be used so that Adobe Reader opens the embedded PDF file automatically without notifying the user. This feature may be used to hide a malicious PDF file within a normal PDF file, to fool many antivirus engines.
  • Embedded Flash applications: a PDF file may contain Flash applications (stored as embedded SWF files), which bring their own security issues, such as ActionScript content and Adobe Flash Player vulnerabilities. Adobe Reader contains its own Flash Player, independent from the one installed in web browsers. For example, the CVE-2010-1297 vulnerability was first patched in the Flash Player on 10 June 2010, whereas the Flash Player shipped with Adobe Reader was only patched on 29 June 2010.
  • Encryption: a PDF file may be encrypted with a password. However, if an empty password is used, Adobe Reader will open it directly without asking the user. This trick may be used to fool many antivirus and analysis engines that do not support decryption.
  • Parser "flexibility": PDF specifications, Adobe Reader and possibly other applications are very flexible about the structure of PDF files.
    • For example, most people think that PDF files have to start with the "%PDF" magic number, whereas the specifications only say this header has to be in the first 1024 bytes. See the Adobe PDF 1.7 Reference, Appendix H.3, page 1102: "Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file". It is therefore possible to insert around 1000 arbitrary bytes at the beginning of a PDF file. This trick may be used to bypass antivirus or content analysis engines that only check the first bytes of a file, because a fake file header (for example JPEG or HTML) can be inserted.
    • Another example is that the cross-reference table at the end of the file does not have to point exactly to each object: Adobe Reader is able to reconstruct malformed files even if some content has been inserted within or between PDF objects.
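
The header flexibility described above can be illustrated with a minimal Python sketch. It only looks for the "%PDF" marker within the first 1024 bytes, as the quoted Appendix H.3 note describes; the function and constant names are hypothetical, and a real reader applies further checks that are not modelled here.

```python
# Minimal sketch, assuming only the rule quoted above: the "%PDF" marker
# must appear somewhere within the first 1024 bytes of the file.
# PDF_MAGIC, HEADER_WINDOW and find_pdf_header are illustrative names.
PDF_MAGIC = b"%PDF"
HEADER_WINDOW = 1024


def find_pdf_header(data, window=HEADER_WINDOW):
    """Return the offset of the "%PDF" marker if it lies entirely within
    the first `window` bytes of `data`, or -1 if it does not."""
    return data.find(PDF_MAGIC, 0, window)


def describe_header(data):
    """Return a short human-readable description of the header position."""
    offset = find_pdf_header(data)
    if offset < 0:
        return "no %PDF marker in the first 1024 bytes"
    if offset == 0:
        return "%PDF marker at offset 0"
    return "%PDF marker at offset {} (preceded by other data)".format(offset)
```

A filter that only compares the first bytes of a file with "%PDF" would miss the second case, for example a file that starts with a fake JPEG header followed by the real PDF header.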

Potential Solutions

  • Disable JavaScript and Launch features on each client: protects against most current malware, but limits functionality on the client.
  • Convert all incoming PDF files to PDF/A (a subset of PDF without JavaScript, encryption, audio/video, external links, etc.): an interesting solution, but PDF/A requires all fonts to be embedded. Links to potential tools: pdfa.org, 3-Heights, gDoc, 7-pdf. However, most of the PDF/A tools have not been designed for security purposes.
  • Sanitize all incoming PDF files with a tool such as ExeFilter: covers most issues by disabling JavaScript, Launch actions, embedded files, etc. in incoming PDF files.
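
As a rough illustration of the sanitizing idea, the Python sketch below neutralizes a few active-content names by swapping their letter case. PDF names are case-sensitive, so a renamed key is no longer recognized, and the replacement has the same length, so byte offsets in the file do not move. This is a naive sketch and is not taken from ExeFilter or any other tool: the list of names is an assumption and is not exhaustive, names obfuscated with #xx escapes or stored inside compressed object streams are not handled, and binary stream data that happens to contain a matching byte sequence would be altered.

```python
import re

# Assumed, non-exhaustive list of PDF names linked to active content.
ACTIVE_NAMES = (
    b"JavaScript", b"JS", b"Launch", b"OpenAction", b"AA",
    b"EmbeddedFiles", b"EmbeddedFile",
)

# Match "/Name" only when followed by a PDF delimiter, whitespace or the
# end of the data, so that "/JS" does not match inside "/JSON".
_ACTIVE_PATTERN = re.compile(
    rb"/(?:" + b"|".join(ACTIVE_NAMES) + rb")(?=[\x00\s()<>\[\]{}/%]|\Z)"
)


def disarm(data):
    """Return a copy of `data` where each matched name has its letter case
    swapped (for example /JavaScript becomes /jAVAsCRIPT)."""
    return _ACTIVE_PATTERN.sub(lambda m: m.group(0).swapcase(), data)
```

For example, `disarm(open("in.pdf", "rb").read())` returns bytes of the same length that can be written to a new file; the file name is only a placeholder.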

Format specifications and technical information

Publications about PDF security issues

Examples of known vulnerabilities and exploits

Obfuscation techniques

Before analyzing malicious documents, it's good to know your enemy. Here are a few hand-picked blog posts and articles that explain known obfuscation and anti-analysis techniques:

Analysis techniques

Useful analysis tools

(listed in no particular order)

Command-line

  • pdfid: PDF analysis tool written in Python (basic parsing, useful to detect malware).
  • pdf-parser: PDF analysis tool written in Python (more complete parser).
  • Origami: PDF analysis framework written in Ruby (full parser/builder, includes many scripts and a GUI).
  • opaf: PDF analysis framework written in Python (full parser) - see also this blog post
  • pdf structazer (documentation)
  • pdftk: PDF manipulation tool, useful to analyze obfuscated PDFs
  • QPDF: another PDF manipulation tool to remove encryption, linearization or object streams
  • jsunpack-n: to extract JavaScript from various formats including PDF - an online version is also available.
  • pyew: a malware analysis tool with PDF analysis features
  • peepdf: malicious PDF analysis tool written in Python
  • caradoc: a parser and validator of PDF files written in OCaml
  • veraPDF: an open source PDF/A validator supported by the PDF industry and funded by the European Union’s PREFORMA project
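
To give an idea of what basic keyword-based triage looks like, here is a minimal Python sketch that counts a few PDF names related to the security issues listed earlier. It is only an approximation of the kind of report a tool such as pdfid produces, not its implementation: the keyword list is an assumption, and the #xx unescaping is applied to the whole file for simplicity, which is cruder than a real parser.

```python
import re

# Assumed keyword list, chosen to match the issues described in this article.
KEYWORDS = (
    b"/JS", b"/JavaScript", b"/AA", b"/OpenAction", b"/Launch",
    b"/EmbeddedFile", b"/GoToE", b"/RichMedia", b"/Encrypt", b"/ObjStm",
)


def decode_name_escapes(data):
    """Replace #xx hexadecimal escapes (used to obfuscate names such as
    /J#61vaScript) with the bytes they encode."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def count_keywords(data):
    """Return a dict mapping each keyword to its number of occurrences."""
    data = decode_name_escapes(data)
    counts = {}
    for keyword in KEYWORDS:
        # The negative lookahead avoids counting /JS inside /JSON, etc.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        counts[keyword.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

A non-zero count is only a hint that a file deserves a closer look with one of the tools listed here; names hidden inside compressed streams are not seen by this sketch.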

GUI

  • Origami: PDF analysis framework written in Ruby (full parser/builder, includes many scripts and a GUI).
  • PDF Dissector: a commercial tool to analyze malicious PDF files
  • pdfubar: an open-source GUI written in Python using pdf-parser, Yara and jsunpack-n to analyze PDF files (pretty basic for now but promising)
  • PDF Stream Dumper: malicious PDF analysis tool written in VB with a GUI

Linux distributions

  • REMnux and Mercury: Linux distributions with many malware analysis tools ready to use (including Origami, pdfid, pdf-parser, jsunpack-n, etc)

Online

  • jsunpack-n: to extract JavaScript from various formats including PDF - an online version is also available.
  • wepawet: online malware analysis supporting PDF
  • joedoc: online PDF exploit detection based on sandboxing and tracing
  • PDF Examiner: online PDF analysis tool
  • Gallus: online PDF analysis tool

Parsing tools and libraries

  • pdf-parser: PDF parser for Python
  • Origami: PDF parser and builder for Ruby
  • jsunpack-n: includes a PDF parser in Python
  • opaf: PDF analysis framework written in Python (full parser)
  • PDFbox: PDF parser and builder for Java
  • pyPdf: Python module to read and write PDF files
  • PDFMiner: PDF parser and analyzer written in Python
  • caradoc: a parser and validator of PDF files written in OCaml

Filtering tools and libraries