SkillAgentSearch skills...

Collisions

Hash collisions and exploitations

Install / Use

/learn @corkami/Collisions
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Hash collisions and exploitations

By Ange Albertini and Marc Stevens.

FAQ (TL;DR)

Q: Is it possible to make a file get an arbitrary MD2/MD4/MD5/MD6/SHA1/SHA2/SHA3, or the same hash as another file?<br/> A: No.

Q: Can one create 2 different files with the same hash?<br/> A: With MD5, in a few seconds on a standard computer. With SHA1, it's possible but not practical for end-users (Complexity: 2^61.2 Price: $11k).

Q: Can one make 2 different files get the same hash by appending stuff?<br/> A: With MD5, in a few hours on a standard computer. With SHA1, it's possible but not practical for end-users (Complexity: 2^63.4 Price: $45K)

Q: Will the 2 files remain valid?<br/> A: In general, yes, as most file formats tolerate appended data. OTOH files signatures will be likely broken.

Q: Can one make 2 different files with arbitrary contents and the same hash?<br/> A: Yes, it can be instant by relying on special file structures:<br/>

  1. a special format header (or pair) with tricks, acting as a switch between 2 contents (some formats won't allow such tricks).<br/>
  2. pre-computed collisions, based on the specific header(s).<br/>
  3. two contents of specific formats, both presents after the collision (added after the computation).

Q: Which formats can I get instant MD5-colliding files pair for?<br/> A: JPG, PNG, GIF, GZIP, Portable Executable, MP4, JPEG2000, PDF, DOCX/PPTX/XSLX, EPUB, 3MF, XPS. Just run the specific script.

Q: What about for SHA1?<br/> A: For SHA1, JPG in a PDF is computed and implemented.

Q: What about formats already supported for MD5 (JPG, PNG...), but for SHA1 instead?<br/> A: They're most likely supported with SHA1 too, but their collisions hasn't been computed.

Q: Are computations faster for similar (but different) contents?<br/> A: No. Any tiny difference requires a full computation.

Q: Which formats don't have such shortcut?<br/> A: ELF, Mach-O, Java Class, TAR, ZIP (among others...)

Q: Are classic collisions (in a few hours) still possible with these formats?<br/> A: Yes, as long as any amount of appended data is tolerated (ie likely not ZIP or Class).

Q: Do you provide examples of collisions?<br/> A: Yes.

Table of Contents

Introduction

The goal is to explore extensively existing attacks - and show on the way how weak MD5 is (instant collisions of any JPG, PNG, PDF, MP4, PE...) - and also explore in detail common file formats to determine how they can be exploited with present or with future attacks.

Indeed, the same file format trick can be used on several hashes (the same JPG tricks were used for MD5, malicious SHA-1 and SHA1), as long as the collisions follow the same byte patterns.

This document is not about new attacks (the most recent one was documented in 2012), but about new forms of exploitations of existing attacks.

Status

Current status of known attacks:

  • get a file to get another file's hash or a given hash: impossible

    • it's still even not practical with MD2 or MD4.
    • works for simpler hashes(*) <!-- Thanks Sven! -->
  • get two different files with the same MD5: instant

    • examples: 12
  • make two arbitrary files get the same MD5: a few hours (72 hours.core)

    • examples: 12
  • make two arbitrary files of specific file formats (PNG, JPG, PE...) get the same MD5: instant

    • read below
  • get two different files with the same SHA1: 6500 years.core

    • get two different PDFs with the same SHA-1 to show a different picture: instant (the prefixes are already computed)

(*) example with crypt - thanks Sven!

>>> import crypt
>>> crypt.crypt("5dUD&66", "br")
'brokenOz4KxMc'
>>> crypt.crypt("O!>',%$", "br")
'brokenOz4KxMc'

Attacks

MD5 and SHA1 work with blocks of 64 bytes.

If two contents A & B have the same hash, then appending the same contents C to both will keep the same hash.

hash(A) = hash(B) -> hash(A + C) = hash(B + C)

Collisions work by inserting at a block boundary a number of computed collision blocks that depends on what came before in the file. These collision blocks are very random-looking with some minor differences (that follow a specific pattern for each attack) and they will introduce tiny differences while eventually getting hashes the same value after these blocks.

These differences are abused to craft valid files with specific properties.

File formats also work top-down, and most of them work by byte-level chunks.

Some 'comment' chunks can be inserted to align file chunks to block boundaries, to align specific structures to collision blocks differences, to hide the rest of the collision blocks randomness from the file parsers, and to hide otherwise valid content from the parser (so that it will see another content).

These 'comment' chunks are often not officially real comments: they are just used as data containers that are ignored by the parser (for example, PNG chunks with a lowercase-starting ID are ancillary, not critical).

Most of the time, a difference in the collision blocks is used to modify the length of a comment chunk, which is typically declared just before the data of this chunk: in the gap between the smaller and the longer version of this chunk, another comment chunk is declared to jump over one file's content A. After this file content A, just append another file content B.

Since file formats usually define a terminator that will make parsers stop after it, A will terminate parsing, which will make the appended content B ignored.

So typically at least two comments are needed - often three:

  1. alignment
  2. hide collision blocks
  3. hide one file content (for re-usable collisions)

These common properties of file formats make it possible - they are not typically seen as weaknesses, but they can be detected or normalized out:

  • dummy chunks - used as comments
  • more than one comment
  • huge comments (lengths: 64b for MP4, 32b for PNG -> trivial collisions. 16b for JPG, 8b for GIF -> no generic collision for GIF, limited for JPG)
  • store any data in a comment (ASCII or UTF8 could be enforced)
  • store anything after the terminator (usually used only for malicious purposes) - can be avoided by using two comments finishing at the same offsets.
  • no integrity check. CRC32 in PNG are usually ignored. However they can be all correct since the collision blocks declare chunks of different lengths - so even if the chunk's data starts differently, the chunk lengths are different
  • flat structure: ASN.1 defines parent structure with the length of all the enclosed substructures, which prevents these constructs: you'd need to abuse a length, but also the length of the parent.
  • put a comment before the header - this makes generic re-usable collisions possible.

Identical prefix

  1. Define an arbitrary prefix - its content and length don't matter.
  2. The prefix is padded to the next 64-byte block.
  3. Collision block(s) are computed depending on the prefix and appended. Both sides are very random. The differences are predetermined by the attack.
  4. After this[these] block[s], the hash value is the same despite the file differences.
  5. Any arbitrary identical suffix c

Related Skills

View on GitHub
GitHub Stars3.4k
CategoryDevelopment
Updated5d ago
Forks209

Languages

Python

Security Score

85/100

Audited on Apr 3, 2026

No findings