StreamRegex

A .NET Standard 2.1+ Library to perform string parsing operations on Streams and StreamReaders. Includes Extensions for Regex.

A .NET Library with Extension Methods for performing arbitrary checks on the string content of Streams and StreamReaders, including built-in extension methods for Regex.

The Extensions are available on Nuget: https://www.nuget.org/packages/StreamRegex.Extensions/

Auto-Generated API Documentation is hosted on GitHub Pages.

Motivation

Memory allocation is an expensive operation; in many cases it may consume more time than any other operation in your program. .NET provides an excellent zero-allocation regex implementation for strings and Spans (under the covers, the string path uses spans as well).

However, you may want to check many arbitrarily large files without reading each file into a string, which is an allocation-heavy operation. Using the extension methods here, you can check your Stream or StreamReader directly with minimal allocations. For a 400MB file on .NET 7, allocations can be reduced from 1.5GB to ~4MB; see Benchmarks.

To use Regex

Here is some simple sample code to get started:

StreamReader

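// Include these for StreamReader, LINQ (Any) and Regex
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
// Include this for the extension methods
using StreamRegex.Extensions.RegexExtensions;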

// Construct your regex as normal
Regex myRegex = new Regex(expression);

// Create your stream reader
StreamReader reader = new StreamReader(stream);

// Get matches
SlidingBufferMatchCollection<StreamRegexMatch> matchCollection = myRegex.GetMatchCollection(reader);
if (matchCollection.Any())
{
    foreach(StreamRegexMatch match in matchCollection)
    {
        // Do something with matches.
    }
}
else
{
    // No match
}

Alternatively, get only the first match. Note that these methods do not reset the position of the Stream or StreamReader, so ensure the position is where you want parsing to start before calling them.

// Get only the first match
StreamRegexMatch match = myRegex.GetFirstMatch(reader);
if (match.Success)
{
    // A match was found
}
else
{
    // No matches
}

Or you can just check whether there is any match, without getting details on the match:

// Check if there is any match
if (myRegex.IsMatch(reader))
{
    // At least one match
}
else
{
    // No matches
}

Stream

You can also call the methods on a Stream directly. If you do so, a StreamReader will be created to read it with leaveOpen = true, reading from the current position of the Stream. The Stream will not have its position reset after reading, and will not be closed or disposed.

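// Include these for Stream, LINQ (Any) and Regex
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
// Include this for the extension methods
using StreamRegex.Extensions.RegexExtensions;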

// This stream contains the content you want to check
Stream stream;

// Construct your regex as normal
Regex myRegex = new Regex(expression);

// Get matches
SlidingBufferMatchCollection<StreamRegexMatch> matchCollection = myRegex.GetMatchCollection(stream);
if (matchCollection.Any())
{
    foreach(StreamRegexMatch match in matchCollection)
    {
        // Do something with matches.
    }
}
else
{
    // No match
}
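
The leaveOpen behavior described above comes from the standard StreamReader constructor. The following is a minimal sketch using only the base class library, not the StreamRegex code itself; the helper name CountCharsLeavingStreamOpen and the sample text are made up for illustration. It shows that a temporary StreamReader created with leaveOpen set to true leaves the Stream usable after the reader is disposed, and that the Stream position stays wherever reading stopped.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of StreamRegex): reads all remaining text
// through a temporary StreamReader without taking ownership of the Stream.
static int CountCharsLeavingStreamOpen(Stream source)
{
    // leaveOpen: true means disposing the reader does not close 'source'.
    using var reader = new StreamReader(source, Encoding.UTF8,
        detectEncodingFromByteOrderMarks: true, bufferSize: 1024, leaveOpen: true);

    int count = 0;
    char[] buffer = new char[256];
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        count += read;
    }
    return count;
}

var input = new MemoryStream(Encoding.UTF8.GetBytes("a racecar passes"));
int chars = CountCharsLeavingStreamOpen(input);

Console.WriteLine(chars);          // 16
Console.WriteLine(input.CanRead);  // True: the stream was not closed
Console.WriteLine(input.Position); // 16: the position was not reset

// Rewind explicitly if the same stream should be checked again from the start.
input.Position = 0;
```

If you reuse a Stream for several checks, rewinding it yourself as in the last line is the caller's responsibility.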

Options

You can adjust the internal buffer and overlap sizes, and capture the value that was matched as well as the Index using SlidingBufferOptions.

The BufferSize is the size of the internal Span<char> used for checking.
The OverlapSize is the number of characters from the previous buffer to include at the start of the next to guarantee matches across boundaries.
If CaptureValues is set to true, SlidingBufferMatch objects (including StreamRegexMatch objects) will contain the value of the actual match in addition to its Index and Length. If false, match objects will only contain the Index and Length of the match.

// Include these for StreamReader and Regex
using System.IO;
using System.Text.RegularExpressions;
// Include this for the extension methods
using StreamRegex.Extensions.RegexExtensions;
// Include this for options objects
using StreamRegex;

// Construct your regex as normal
Regex myRegex = new Regex(expression);

var bufferOptions = new SlidingBufferOptions()
{
    BufferSize = 8192, // The number of characters to check at a time, default 4096
    OverlapSize = 512, // Must be at least as long as your longest desired match, default 256
    DelegateOptions = new DelegateOptions()
    {
        CaptureValues = true // Whether the actual value matched by the Regex should be included in the SlidingBufferMatch, default false.
                             // When set to true, memory will be allocated to store the captured values.
    }
};

// Create your stream reader
StreamReader reader = new StreamReader(stream);

StreamRegexMatch match = myRegex.GetFirstMatch(reader, bufferOptions);

To use Custom Method

You can provide your own custom methods for both boolean matches and match metadata.

For Boolean Matches

Implement the IsMatch delegate.

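// Include these for StringComparison and StreamReader
using System;
using System.IO;
// Include this for the extension methods
using StreamRegex.Extensions.Core;

// Create your stream reader
StreamReader reader = new StreamReader(stream);

bool YourMethod(ReadOnlySpan<char> chunk)
{
    // Your logic here; for example, an ordinal substring check
    return chunk.Contains("racecar", StringComparison.Ordinal);
}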

if (reader.IsMatch(YourMethod))
{
    // Your method matched some chunk of the Stream
}
else
{
    // Your method did not match any chunk of the Stream
}

For Value Data

Implement the GetFirstMatch delegate.

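// Include these for StringComparison and StreamReader
using System;
using System.IO;
// Include this for the extension methods
using StreamRegex.Extensions.Core;

// Create your stream reader
StreamReader reader = new StreamReader(stream);

// Return the index of the target string relative to the chunk. 
// It will be adjusted to the correct relative position for the Stream automatically.
SlidingBufferMatch YourMethod(ReadOnlySpan<char> chunk)
{
    // Example check: find an illustrative target string in the chunk
    const string target = "racecar";
    int idx = chunk.IndexOf(target, StringComparison.Ordinal);
    if (idx != -1)
    {
        return new SlidingBufferMatch(true, idx, target.Length);
    }

    // No match in this chunk
    return new SlidingBufferMatch();
}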

var match = reader.GetFirstMatch(YourMethod);
if(match.Success)
{
    // Your method matched some chunk of the Stream
}
else
{
    // Your method did not match any chunk of the Stream
}

For a collection

Implement the GetMatchCollection delegate.

// Include this for the extension methods
using StreamRegex.Extensions.Core;

// Create your stream reader
StreamReader reader = new StreamReader(stream);
// Your arbitrary engine that can generate multiple matches
YourEngineHolder matchingEngine = new YourEngineHolder();

var collection = reader.GetMatchCollection(matchingEngine.YourMethod);

// Type declarations must come after any top-level statements
public class YourEngineHolder
{
    private YourMatchingEngine _internalEngine;
    
    public YourEngineHolder()
    {
        _internalEngine = new YourMatchingEngine();
    }
    
    public SlidingBufferMatchCollection<SlidingBufferMatch> YourMethod(ReadOnlySpan<char> arg)
    {
        SlidingBufferMatchCollection<SlidingBufferMatch> matchCollection = new SlidingBufferMatchCollection<SlidingBufferMatch>();
        foreach(var match in _internalEngine.MakeMatches(arg))
        {
            matchCollection.Add(match);
        }
        return matchCollection;
    }
}

How it works

A sliding buffer is moved across the stream. Each new buffer begins with the last OverlapSize characters of the previous one, so that a match spanning a buffer boundary is not missed. Always ensure that OverlapSize is at least as long as the longest match you want to find.

https://github.com/gfs/StreamRegex/blob/fce9cdbbe5bdcf3629ece9547a4c5230b941d072/StreamRegex.Extensions/SlidingBufferExtensions.cs#L206-L245
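
To make the sliding-buffer idea concrete, here is a minimal standalone sketch. It is not the library's implementation (that is the code linked above); the helper name SlidingIndexOf and its parameters are assumptions chosen for illustration, and it relies on Regex.EnumerateMatches, which requires .NET 7 or later. It scans a TextReader in fixed-size windows, carries the tail of each window into the next, and translates the window-relative match index into a position in the whole input.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, not the StreamRegex implementation: returns the position
// of the first match in the reader's remaining text, or -1 if there is none.
static long SlidingIndexOf(TextReader reader, Regex regex, int bufferSize, int overlapSize)
{
    if (overlapSize < 0 || overlapSize >= bufferSize)
    {
        throw new ArgumentException("overlapSize must be >= 0 and smaller than bufferSize.");
    }

    char[] buffer = new char[bufferSize];
    long windowStart = 0; // position in the input of buffer[0]
    int carried = 0;      // characters kept from the previous window

    while (true)
    {
        // Fill the window after the carried-over characters.
        int requested = bufferSize - carried;
        int read = reader.ReadBlock(buffer, carried, requested);
        int filled = carried + read;

        // Check the window as a span, without building a string.
        foreach (ValueMatch m in regex.EnumerateMatches(buffer.AsSpan(0, filled)))
        {
            // Translate the window-relative index to a position in the input.
            return windowStart + m.Index;
        }

        // ReadBlock returns fewer characters than requested only at end of input.
        if (read < requested)
        {
            return -1;
        }

        // Slide: keep the last overlapSize characters as the start of the next window.
        Array.Copy(buffer, filled - overlapSize, buffer, 0, overlapSize);
        windowStart += filled - overlapSize;
        carried = overlapSize;
    }
}

string sample = new string('a', 10) + "racecar" + new string('b', 10);
var racecar = new Regex("racecar");

// With enough overlap the match is found even though it straddles a window boundary.
Console.WriteLine(SlidingIndexOf(new StringReader(sample), racecar, 8, 7)); // 10

// With no overlap the match is split across two windows and is missed.
Console.WriteLine(SlidingIndexOf(new StringReader(sample), racecar, 8, 0)); // -1
```

The second call shows why the overlap has to cover the longest match you want to find: without it, neither window ever contains the full text of the match.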

Benchmarks

The benchmark results below are a selection of the results from the Benchmarks project in the repository.

Performance on Large Files

A Stream is generated of length paddingSegmentLength * numberPaddingSegmentsBefore + paddingSegmentLength * numberPaddingSegmentsAfter + the length of a target string. There is only one match for the target operation in the Stream.
The query used for both regex and string matching was racecar - no regex operators.
The JustReadTheStreamToString benchmark reads the full contents of the Stream into a string.
The Enumerate benchmark uses the EnumerateMatches method of a Regex on a Span<char> over the contents of the Stream, stopping after the first match is found. The cost of converting the Stream into a string beforehand is included.
The RegexExtension benchmark uses the IsMatch extension method of a Regex on a StreamReader stopping after the first match is found.
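
As a rough illustration of the stream layout described above, the sketch below builds a small padded stream with a single target in the middle. It is not the code from the Benchmarks project; the helper name BuildPaddedStream, its parameters and the tiny sizes are assumptions for illustration only.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not taken from the Benchmarks project: writes
// 'segmentsBefore' copies of the padding, then the target, then
// 'segmentsAfter' copies of the padding into an in-memory stream.
static MemoryStream BuildPaddedStream(string paddingSegment, int segmentsBefore, int segmentsAfter, string target)
{
    var stream = new MemoryStream();
    // UTF-8 without a byte order mark, so ASCII text is one byte per character.
    using (var writer = new StreamWriter(stream, new UTF8Encoding(false), bufferSize: 1024, leaveOpen: true))
    {
        for (int i = 0; i < segmentsBefore; i++)
        {
            writer.Write(paddingSegment);
        }
        writer.Write(target);
        for (int i = 0; i < segmentsAfter; i++)
        {
            writer.Write(paddingSegment);
        }
    }
    stream.Position = 0; // rewind so a reader starts at the beginning
    return stream;
}

MemoryStream padded = BuildPaddedStream("0123456789", 3, 2, "racecar");

// paddingSegmentLength * before + paddingSegmentLength * after + target length
Console.WriteLine(padded.Length); // 10 * 3 + 10 * 2 + 7 = 57
```

In this small example the target starts at offset 30, immediately after the three leading padding segments.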

This benchmark iteration finds the only instance of racecar, located 200MB into a 400MB Stream. Using the extension method is roughly 12 times faster and allocates about 0.2% of the memory. Memory usage depends on the options, and may vary with different buffer/overlap parameters or when CaptureValues is set to true.

Comparing against the JustReadTheStreamToString benchmark, we find that the majority of the operation time is spent reading the full Stream into a string before the regex runs.

| Method | Mean | Error | StdDev | Median | Ratio | Allocated | Alloc Ratio |
|--------------------------:|---------------:|---------------:|---------------:|---------------:|-------:|--------------:|------------:|
| JustReadTheStreamToString | 464,216.947 us | 9,237.0810 us | 19,684.9728 us | 462,662.900 us | 0.95 | 1566069.56 KB | 1.000 |
| CompiledRegexWithSpan | 487,368.232 us | 9,735.8353 us | 22,757.2012 us | 483,413.500 us | 1.00 | 1566069.56 KB | 1.000 |
| RegexExtension | 39,002.165 us | 709.5091 us | 1,261.1516 us | 38,862.950 us | 0.08 | 3446.34 KB | 0.002 |

Complete run details
