SkillAgentSearch skills...

PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

Install / Use

/learn @UglyToad/PdfPig

README

<image src="https://raw.githubusercontent.com/UglyToad/Pdf/master/documentation/pdfpig.png" width="128px" height="128px"/>

PdfPig

nuget Build and test Build and test [MacOS]

PdfPig supports reading text and content from PDF files. It also supports basic PDF file creation.

Installation

The package is available via the releases tab or from Nuget:

https://www.nuget.org/packages/PdfPig/

Or from the package manager console:

> Install-Package PdfPig

While the version is below 1.0.0 minor versions will change the public API without warning (SemVer will not be followed until 1.0.0 is reached).

Get Started

See the wiki for more examples

Reading text from a PDF

The simplest usage at this stage is to open a document, reading the words from every page:

// using UglyToad.PdfPig.DocumentLayoutAnalysis.TextExtractor;
// using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;
using (PdfDocument document = PdfDocument.Open(@"C:\Documents\document.pdf"))
{
    foreach (Page page in document.GetPages())
    {
        string text = ContentOrderTextExtractor.GetText(page);
        IEnumerable<Word> words = page.GetWords(NearestNeighbourWordExtractor.Instance);
    }
}

You should not use page.Text directly, unless you know what you're doing. The Text property preserves the internal content order which is rarely ever the text in the order you want.

These layout analysis tools should get you the text you want in most cases.

Create PDF Document

To create documents use the class PdfDocumentBuilder. The Standard 14 fonts provide a quick way to get started:

PdfDocumentBuilder builder = new PdfDocumentBuilder();

PdfPageBuilder page = builder.AddPage(PageSize.A4);

// Fonts must be registered with the document builder prior to use to prevent duplication.
PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);

page.AddText("Hello World!", 12, new PdfPoint(25, 700), font);

byte[] documentBytes = builder.Build();

File.WriteAllBytes(@"C:\git\newPdf.pdf", documentBytes);

The output is a 1 page PDF document with the text "Hello World!" in Helvetica near the top of the page:

Image shows a PDF document in Google Chrome's PDF viewer. The text "Hello World!" is visible

Each font must be registered with the PdfDocumentBuilder prior to use enable pages to share the font resources. Only Standard 14 fonts and TrueType fonts (.ttf) are supported.

Document creation supports very limited changes to existing PDF documents. However it does not support any of the following:

  • Editing forms
  • Copying or changing annotations, metadata or document structure data
  • Adding or removing text with existing fonts

Advanced Document Extraction

In this example a more advanced document extraction is performed. PdfDocumentBuilder is used to create a copy of the pdf with debug information (bounding boxes and reading order) added.

//using UglyToad.PdfPig;
//using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
//using UglyToad.PdfPig.DocumentLayoutAnalysis.ReadingOrderDetector;
//using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;
//using UglyToad.PdfPig.Fonts.Standard14Fonts;
//using UglyToad.PdfPig.Writer;


var sourcePdfPath = "";
var outputPath = "";
var pageNumber = 1;
using (var document = PdfDocument.Open(sourcePdfPath))
{
    var builder = new PdfDocumentBuilder { };
    PdfDocumentBuilder.AddedFont font = builder.AddStandard14Font(Standard14Font.Helvetica);
    var pageBuilder = builder.AddPage(document, pageNumber);
    pageBuilder.SetStrokeColor(0, 255, 0);
    var page = document.GetPage(pageNumber);

    var letters = page.Letters; // no preprocessing

    // 1. Extract words
    var wordExtractor = NearestNeighbourWordExtractor.Instance;

    var words = wordExtractor.GetWords(letters);

    // 2. Segment page
    var pageSegmenter = DocstrumBoundingBoxes.Instance;

    var textBlocks = pageSegmenter.GetBlocks(words);

    // 3. Postprocessing
    var readingOrder = UnsupervisedReadingOrderDetector.Instance;
    var orderedTextBlocks = readingOrder.Get(textBlocks);

    // 4. Add debug info - Bounding boxes and reading order
    foreach (var block in orderedTextBlocks)
    {
        var bbox = block.BoundingBox;
        pageBuilder.DrawRectangle(bbox.BottomLeft, bbox.Width, bbox.Height);
        pageBuilder.AddText(block.ReadingOrder.ToString(), 8, bbox.TopLeft, font);
    }

    // 5. Write result to a file
    byte[] fileBytes = builder.Build();
    File.WriteAllBytes(outputPath, fileBytes); // save to file
}

Image shows a PDF document created by the above code block with the bounding boxes and reading order of the words displayed

See Document Layout Analysis for more information on advanced document analysing.

See Export for more advanced tooling to analyse document layouts.

Usage

PdfDocument

The PdfDocument class provides access to the contents of a document loaded either from file or passed in as bytes. To open from a file use the PdfDocument.Open static method:

using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

using (PdfDocument document = PdfDocument.Open(@"C:\my-file.pdf"))
{
	int pageCount = document.NumberOfPages;

	// Page number starts from 1, not 0.
	Page page = document.GetPage(1);

	decimal widthInPoints = page.Width;
	decimal heightInPoints = page.Height;

	string text = page.Text;
}

PdfDocument should only be used in a using statement since it implements IDisposable (unless the consumer disposes of it elsewhere).

Encrypted documents can be opened by PdfPig. To provide an owner or user password provide the optional ParsingOptions when calling Open with the Password property defined. For example:

using (PdfDocument document = PdfDocument.Open(@"C:\my-file.pdf",  new ParsingOptions { Password = "password here" }))

You can also provide a list of passwords to try:

using (PdfDocument document = PdfDocument.Open(@"C:\file.pdf", new ParsingOptions
{
	Passwords = new List<string> { "One", "Two" }
}))

The document contains the version of the PDF specification it complies with, accessed by document.Version:

decimal version = document.Version;

Document Creation

The PdfDocumentBuilder creates a new document with no pages or content.

For text content, a font must be registered with the builder. This library supports Standard 14 fonts provided by Adobe by default and TrueType format fonts.

To add a Standard 14 font use:

public AddedFont AddStandard14Font(Standard14Font type)

Or for a TrueType font use:

AddedFont AddTrueTypeFont(IReadOnlyList<byte> fontFileBytes)

Passing in the bytes of a TrueType file (.ttf). You can check the suitability of a TrueType file for embedding in a PDF document using:

bool CanUseTrueTypeFont(IReadOnlyList<byte> fontFileBytes, out IReadOnlyList<string> reasons)

Which provides a list of reasons why the font cannot be used if the check fails. You should check the license for a TrueType font prior to use, since the compressed font file is embedded in, and distributed with, the resultant document.

The AddedFont class represents a key to the font stored on the document builder. This must be provided when adding text content to pages. To add a page to a document use:

PdfPageBuilder AddPage(PageSize size, bool isPortrait = true)

This creates a new PdfPageBuilder with the specified size. The first added page is page number 1, then 2, then 3, etc. The page builder supports adding text, drawing lines and rectangles and measuring the size of text prior to drawing.

To draw lines and rectangles use the methods:

void DrawLine(PdfPoint from, PdfPoint to, decimal lineWidth = 1)
void DrawRectangle(PdfPoint position, decimal width, decimal height, decimal lineWidth = 1)

The line width can be varied and defaults to 1. Rectangles are unfilled and the fill color cannot be changed at present.

To write text to the page you must have a reference to an AddedFont from the methods on PdfDocumentBuilder as described above. You can then draw the text to the page using:

IReadOnlyList<Letter> AddText(string text, decimal fontSize, PdfPoint position, PdfDocumentBuilder.AddedFont font)

Where position is the baseline of the text to draw. Currently only ASCII text is supported. You can also measure the resulting size of text prior to drawing using the method:

IReadOnlyList<Letter> MeasureText(string text, decimal fontSize, PdfPoint position, PdfDocumentBuilder.AddedFont font)

Which does not change the state of the page, unlike AddText.

Changing the RGB color of text, lines and rectangles is supported using:

void SetStrokeColor(byte r, byte g, byte b)
void SetTextAndFillColor(byte r, byte g, byte b)

Which take RGB values between 0 and 255. The color will remain active for all operations called after these methods until reset is called using:

void ResetColor()

Which resets the color for stroke, fill and text drawing to black.

Document Information

The PdfDocument provides access to the document metadata as DocumentInformation defined in the PDF file. These tend not to be provided therefore most of these entries wil

Related Skills

View on GitHub
GitHub Stars2.4k
CategoryContent
Updated1d ago
Forks311

Languages

C#

Security Score

100/100

Audited on Mar 20, 2026

No findings