About

Recursive Extractor is a Cross-Platform .NET Standard 2.0 Library and Command Line Program for parsing archive files and disk images, including nested archives and disk images.

Supported File Types

| | | |
|-|-|-|
| 7zip+ | ar | bzip2 |
| deb | dmg** | gzip |
| iso | rar^ | tar |
| vhd | vhdx | vmdk |
| wim* | xzip | zip+ |

<details> <summary>Details</summary>
<ul>
<li>* Windows only</li>
<li>+ Encryption supported</li>
<li>^ Encryption supported for Rar version 4 only</li>
<li>** Limited support. Unencrypted HFS+ volumes with certain compression schemes.</li>
</ul>
</details>

Variants

Command Line

Installing

1. Ensure you have the latest .NET SDK.
2. Run `dotnet tool install -g Microsoft.CST.RecursiveExtractor.Cli`

This adds RecursiveExtractor to your path so you can run it directly from your shell.

Running

Basic usage is: `RecursiveExtractor --input archive.ext --output outputDirectory`

<details> <summary>Detailed Usage</summary>
<ul>
<li>input: The path to the archive to extract.</li>
<li>output: The path to a directory to extract into.</li>
<li>passwords: A comma-separated list of passwords to use for archives.</li>
<li>allow-globs: A comma-separated list of glob patterns that each extracted file must match.</li>
<li>deny-globs: A comma-separated list of glob patterns that each extracted file must not match.</li>
<li>raw-extensions: A comma-separated list of file extensions to not recurse into.</li>
<li>no-recursion: Don't recurse into sub-archives.</li>
<li>single-thread: Don't attempt to parallelize extraction.</li>
<li>printnames: Output the name of each file extracted.</li>
</ul>

For example, to extract only ".cs" files:

RecursiveExtractor --input archive.ext --output outputDirectory --allow-globs "**/*.cs"

Run `RecursiveExtractor --help` for more details.

</details>

.NET Standard Library

Recursive Extractor is available on NuGet as Microsoft.CST.RecursiveExtractor. Recursive Extractor targets netstandard2.0+ and the latest .NET, currently .NET 6.0, .NET 7.0 and .NET 8.0.

Usage

The most basic usage is to enumerate through all the files in the archive provided and do something with their contents as a Stream.

using Microsoft.CST.RecursiveExtractor;

var path = "path/to/file";
var extractor = new Extractor();
foreach(var file in extractor.Extract(path))
{
    doSomething(file.Content); //Do Something with the file contents (a Stream)
}

<details> <summary>Extracting to Disk</summary> This code, adapted from the CLI, extracts the contents of the archive at `options.Input` to the directory at `options.Output`. With `ExtractSelfOnFail` set, an archive that fails to extract is written out as itself.

using Microsoft.CST.RecursiveExtractor;

var extractor = new Extractor();
var extractorOptions = new ExtractorOptions()
{
    ExtractSelfOnFail = true,
};
extractor.ExtractToDirectory(options.Output, options.Input, extractorOptions);

</details> <details> <summary>Async Usage</summary> This example uses the async API to print the full path of every file found in the archive at `path`.

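using System;
using System.Collections.Generic;
using Microsoft.CST.RecursiveExtractor;

var path = "/Path/To/Your/Archive";
var extractor = new Extractor();
try {
    // await foreach needs an IAsyncEnumerable, which ExtractAsync returns
    IAsyncEnumerable<FileEntry> results = extractor.ExtractAsync(path);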
    await foreach(var found in results)
    {
        Console.WriteLine(found.FullPath);
    }
}
catch(OverflowException)
{
    // This means Recursive Extractor has detected a Quine or Zip Bomb
}

</details> <details> <summary>The FileEntry Object</summary> The Extractor returns `FileEntry` objects. These objects contain a `Content` Stream of the file contents.

public Stream Content { get; }
public string FullPath { get; }
public string Name { get; }
public FileEntry? Parent { get; }
public string? ParentPath { get; }
public DateTime CreateTime { get; }
public DateTime ModifyTime { get; }
public DateTime AccessTime { get; }

</details> <details> <summary>Extracting Encrypted Archives</summary> You can provide passwords to use to decrypt archives, paired with a Regular Expression that will operate against the Name of the Archive to determine on which archives to try the passwords in each List.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using Microsoft.CST.RecursiveExtractor;

var path = "/Path/To/Your/Archive";
var extractor = new Extractor();
try {
    IEnumerable<FileEntry> results = extractor.Extract(path, new ExtractorOptions()
    {
        Passwords = new Dictionary<Regex, List<string>>()
        {
            // Verbatim strings (@"...") keep the backslash as a regex escape
            { new Regex(@"\.zip"), new List<string>(){ "PasswordForZipFiles" } },
            { new Regex(@"\.7z"), new List<string>(){ "PasswordFor7zFiles" } },
            { new Regex(".*"), new List<string>(){ "PasswordForAllFiles" } }
        }
    });
    foreach(var found in results)
    {
        Console.WriteLine(found.FullPath);
    }
}
catch(OverflowException)
{
    // This means Recursive Extractor has detected a Quine or Zip Bomb
}
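
To see which password lists a given archive name would select, the patterns can be tried against a few names using plain .NET. This is only a sketch of how the regular expressions behave. `CandidatesFor` is a hypothetical helper, not part of Recursive Extractor, and the order in which the library tries passwords is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Same shape as the Passwords option above
var passwords = new Dictionary<Regex, List<string>>
{
    { new Regex(@"\.zip"), new List<string> { "PasswordForZipFiles" } },
    { new Regex(@"\.7z"), new List<string> { "PasswordFor7zFiles" } },
    { new Regex(".*"), new List<string> { "PasswordForAllFiles" } },
};

// Hypothetical helper: collect the passwords of every pattern that matches the name
List<string> CandidatesFor(string archiveName) =>
    passwords.Where(pair => pair.Key.IsMatch(archiveName))
             .SelectMany(pair => pair.Value)
             .ToList();

foreach (var name in new[] { "backup.zip", "data.7z", "image.iso" })
{
    Console.WriteLine($"{name}: {string.Join(", ", CandidatesFor(name))}");
}
```

Because the patterns are unanchored, `\.zip` also matches a name such as `notes.zip.txt`. Adding `$` to the end of a pattern restricts it to names that end with that extension.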

</details> <details> <summary>Custom Extractors for Additional File Types</summary> You can extend RecursiveExtractor with custom extractors to support additional archive or file formats not natively supported. This is useful for formats like MSI, MSP, or other proprietary archive formats.

To create a custom extractor, implement the ICustomAsyncExtractor interface and register it with the extractor:

using Microsoft.CST.RecursiveExtractor;
using Microsoft.CST.RecursiveExtractor.Extractors;
using System.IO;
using System.Collections.Generic;
using System.Linq;

// Example: Custom extractor for a hypothetical archive format with magic bytes "MYARC"
public class MyCustomExtractor : ICustomAsyncExtractor
{
    private readonly Extractor context;
    private static readonly byte[] MAGIC_BYTES = System.Text.Encoding.ASCII.GetBytes("MYARC");

    public MyCustomExtractor(Extractor ctx)
    {
        context = ctx;
    }

    // Check if this extractor can handle the file based on binary signatures
    public bool CanExtract(Stream stream)
    {
        if (stream == null || !stream.CanRead || !stream.CanSeek || stream.Length < MAGIC_BYTES.Length)
        {
            return false;
        }

        var initialPosition = stream.Position;
        try
        {
            stream.Position = 0;
            var buffer = new byte[MAGIC_BYTES.Length];
            var bytesRead = stream.Read(buffer, 0, MAGIC_BYTES.Length);
            
            return bytesRead == MAGIC_BYTES.Length && buffer.SequenceEqual(MAGIC_BYTES);
        }
        finally
        {
            // Always restore the original position
            stream.Position = initialPosition;
        }
    }

    // Implement extraction logic
    public IEnumerable<FileEntry> Extract(FileEntry fileEntry, ExtractorOptions options, ResourceGovernor governor, bool topLevel = true)
    {
        // Your extraction logic here
        // For example, parse the archive and yield FileEntry objects for each contained file
        yield break;
    }

    public async IAsyncEnumerable<FileEntry> ExtractAsync(FileEntry fileEntry, ExtractorOptions options, ResourceGovernor governor, bool topLevel = true)
    {
        // Your async extraction logic here
        yield break;
    }
}

// Register the custom extractor via constructor
var customExtractor = new MyCustomExtractor(null);
var extractor = new Extractor(new[] { customExtractor });

// Now the extractor will use your custom extractor for files matching your CanExtract criteria
var results = extractor.Extract("path/to/custom/archive.myarc");

Key points:

<ul>
<li>The `CanExtract` method should check the stream's binary signature (like MiniMagic does) and return true if this extractor can handle the format</li>
<li>Always preserve the stream's original position in `CanExtract`</li>
<li>Custom extractors are provided via the constructor as an `IEnumerable<ICustomAsyncExtractor>`</li>
<li>Custom extractors are only checked when the file type is UNKNOWN (not recognized by built-in extractors)</li>
<li>Multiple custom extractors can be registered; they are checked in the order provided</li>
<li>Custom extractors are invoked for both synchronous and asynchronous extraction paths</li>
</ul>

</details>

Exceptions

RecursiveExtractor protects against ZipSlip, Quines, and Zip Bombs. Calls to `Extract` throw an `OverflowException` when a Quine or Zip Bomb is detected, and a `TimeoutException` if `EnableTiming` is set and the specified time period elapses before extraction completes.

Otherwise, invalid files found while crawling will emit a logger message and be skipped. You can also enable ExtractSelfOnFail to return the original archive file on an extraction failure.

Notes on Enumeration

Multiple Enumeration

Do not iterate the enumeration returned by `Extract` or `ExtractAsync` more than once. If you need several passes, convert the enumeration to an in-memory collection first.
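
The effect of enumerating a lazy sequence twice can be sketched without the library. Here `LazyEntries` is a hypothetical stand-in for an extraction result, and the counter shows how often the producer runs. It is an illustration of the general .NET behaviour, not of Recursive Extractor's internals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int produced = 0;

// Hypothetical stand-in for a lazy extraction result: the body re-runs on every enumeration
IEnumerable<string> LazyEntries()
{
    foreach (var name in new[] { "a.txt", "b.txt" })
    {
        produced++;
        yield return name;
    }
}

// Two passes over the lazy sequence run the producer twice
IEnumerable<string> lazy = LazyEntries();
int firstPass = lazy.Count();
int secondPass = lazy.Count();
int producedLazily = produced;

// Materialising once means later passes read from memory
produced = 0;
List<string> cached = LazyEntries().ToList();
int cachedFirst = cached.Count;
int cachedSecond = cached.Count;
int producedCached = produced;

Console.WriteLine($"lazy: {producedLazily} items produced, cached: {producedCached}");
// prints "lazy: 4 items produced, cached: 2"
```

The trade-off is memory: a materialised list keeps every entry alive at once, so for large archives a single streaming pass is usually the better fit.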

Parallel Enumeration

If you want to enumerate the output with parallelization you should use a batching mechanism, for example:

var extractedEnumeration = Extract(fileEntry, opts);
using var enumerator = extractedEnumeration.GetEnumerator();
ConcurrentBag<FileEntry> entryBatch = new();
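
One way to batch a lazy sequence for parallel work is sketched below in generic form. The `InBatches` helper, the batch size of 4, and the integer sequence standing in for extracted entries are all assumptions made for illustration; none of them come from Recursive Extractor. The point is that the source enumerator is advanced on a single thread while each completed batch is processed in parallel.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Stand-in for the lazy sequence returned by extraction (an assumption for this sketch)
IEnumerable<int> source = Enumerable.Range(1, 10);
var processed = new ConcurrentBag<int>();
var batchSizes = new List<int>();

foreach (List<int> batch in InBatches(source, 4))
{
    batchSizes.Add(batch.Count);
    // The source is only advanced on this thread; the batch itself is processed in parallel
    Parallel.ForEach(batch, item => processed.Add(item * 2));
}

Console.WriteLine($"{processed.Count} items in {batchSizes.Count} batches");
// prints "10 items in 3 batches"

// Hypothetical helper: pulls up to batchSize items at a time from a single enumerator
static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> items, int batchSize)
{
    if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
    using IEnumerator<T> enumerator = items.GetEnumerator();
    var batch = new List<T>(batchSize);
    while (enumerator.MoveNext())
    {
        batch.Add(enumerator.Current);
        if (batch.Count == batchSize)
        {
            yield return batch;
            batch = new List<T>(batchSize);
        }
    }
    // Emit the final, possibly shorter, batch
    if (batch.Count > 0) yield return batch;
}
```

Batching bounds how many items are held in memory at once, which matters when each item carries a content stream.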
