Tip: How to download thousands of MS Office files for testing

When developing tools related to MS Office files such as olefile and oletools, it is often necessary to test them on many different samples of various types and sizes. It is quite easy to find malicious samples using malwr.com, hybrid-analysis.com and VirusTotal, just to name a few (see my previous post about that topic). However, finding and downloading a large number of legitimate files is a different challenge. Here are some tips to do it:

Well-known corpora

There are a few well-known corpora that have been created and published for testing purposes, such as:

  • FUSE: around 250,000 Excel spreadsheets collected in 2013 and 2014 for academic research. Limited to the xls format, and files larger than 1 MB are truncated.
  • Enron Email Dataset: an e-mail dataset containing a large number of attachments.
  • Govdocs1: a dataset from 2010 containing around 1 million files of various formats, built for forensic purposes.
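For Govdocs1, the corpus is split into numbered zip archives, so a short script is enough to grab a subset. Here is a minimal Python sketch, assuming the files are still hosted as 000.zip to 999.zip under the digitalcorpora.org URL used below (adjust it if the corpus has moved):

```python
# Sketch: fetch Govdocs1 archives.
# Assumption: the corpus is distributed as numbered zip files
# (000.zip ... 999.zip) under the digitalcorpora.org URL below.
import urllib.request

BASE_URL = "https://downloads.digitalcorpora.org/corpora/files/govdocs1/zipfiles"

def govdocs1_zip_url(n):
    """Return the URL of the n-th Govdocs1 zip archive (0 <= n <= 999)."""
    return "%s/%03d.zip" % (BASE_URL, n)

def download_archive(n, dest_dir="."):
    """Download one archive into dest_dir (network access required)."""
    url = govdocs1_zip_url(n)
    dest = "%s/%03d.zip" % (dest_dir, n)
    urllib.request.urlretrieve(url, dest)
    return dest
```

For instance, download_archive(0) would fetch the first archive into the current directory.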

CommonCrawlDocumentDownload

CommonCrawlDocumentDownload is a tool that uses the index published by the CommonCrawl project to find document files of various formats (doc, docx, xls, xlsx, ppt, pptx, etc.) and download them.
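To get a feel for what the CommonCrawl index contains, it can also be queried directly through the public CDX API. Here is a minimal Python sketch; it assumes the index server at index.commoncrawl.org and its pywb-style query parameters (url, filter, output) work as shown, and it only illustrates the index — it is not necessarily how the tool works internally:

```python
# Sketch: query the CommonCrawl index (CDX API) for document captures.
# Assumptions: pywb-style parameters (url, filter, output) and the
# CC-MAIN-2016-50 crawl id used elsewhere in this post.
import json
import urllib.parse
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2016-50-index"

def cdx_query_url(domain, mime="application/msword"):
    """Build a CDX query URL listing captures of a domain with a given MIME type."""
    params = {
        "url": "%s/*" % domain,          # all pages under the domain
        "filter": "mime:%s" % mime,      # keep only the requested MIME type
        "output": "json",                # one JSON record per line
    }
    return "%s?%s" % (INDEX, urllib.parse.urlencode(params))

def fetch_records(domain, mime="application/msword"):
    """Download and parse the matching index records (network access required)."""
    with urllib.request.urlopen(cdx_query_url(domain, mime)) as resp:
        return [json.loads(line) for line in resp.read().splitlines() if line]
```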

After a few hours of processing, the tool had downloaded 40,000 files totalling 5GB.

How to install and use it, in a nutshell:

  1. Install git and the JDK (Java Development Kit), if you do not have them already.
  2. Open a shell/cmd, go to the folder where you would like to download files.
  3. Run the following command to download the tool: git clone https://github.com/centic9/CommonCrawlDocumentDownload
  4. Edit the following files if you want to change which file types will be downloaded:
    • CommonCrawlDocumentDownload/src/main/java/org/dstadler/commoncrawl/Extensions.java
    • CommonCrawlDocumentDownload/src/main/java/org/dstadler/commoncrawl/MimeTypes.java
    For example, here is how I modified those files to add RTF.
  5. Go to the directory CommonCrawlDocumentDownload
  6. Build the tool from the Java source code: ./gradlew check
  7. Collect the list of URLs matching the requested file types (this may take quite a few hours): ./gradlew lookupURLs
  8. You can monitor the results stored in the file commoncrawl-CC-MAIN-2016-50.txt
  9. Download the corresponding files (this may take quite a few hours): ./gradlew downloadDocuments
  10. You may also query the old index and download additional files this way: ./gradlew downloadOldIndex
  11. The URLs are stored in the file commonurls.txt