pdf2html

Convert PDF files to HTML, extract text, generate thumbnails, extract images, and extract metadata using Apache Tika and PDFBox

🚀 Features

PDF to HTML conversion - Maintains formatting and structure
Text extraction - Extract plain text content from PDFs
Page-by-page processing - Process PDFs page by page
Metadata extraction - Extract author, title, creation date, and more
Thumbnail generation - Generate preview images from PDF pages
Image extraction - Extract all embedded images from PDFs
Buffer support - Process PDFs from memory buffers or file paths
TypeScript support - Full type definitions included
Async/Promise based - Modern async API
Configurable - Extensive options for customization

📋 Prerequisites

Node.js >= 12.0.0
Java Runtime Environment (JRE) >= 8
- Required for Apache Tika and PDFBox
- Download Java
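
Because the library depends on a Java runtime, it can help to confirm that `java` is reachable before converting anything. The following is a minimal sketch using only Node's built-in `child_process` module; `isJavaAvailable` is a hypothetical helper name and is not part of pdf2html.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper (not part of pdf2html): returns true when a `java`
// executable can be launched from PATH and exits successfully.
function isJavaAvailable() {
    const result = spawnSync('java', ['-version'], { encoding: 'utf8' });
    // result.error is set when the executable cannot be spawned (e.g. ENOENT)
    return !result.error && result.status === 0;
}

if (!isJavaAvailable()) {
    console.error('Java was not found on PATH; install a JRE (version 8 or later) first.');
}
```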

📦 Installation

Using npm:

npm install pdf2html

Using yarn:

yarn add pdf2html

Using pnpm:

pnpm add pdf2html

The installation process will automatically download the required Apache Tika and PDFBox JAR files. You'll see a progress indicator during the download.

🔧 Basic Usage

Convert PDF to HTML

const pdf2html = require('pdf2html');
const fs = require('fs');

// Note: `await` must run inside an async function

// From file path
const htmlFromPath = await pdf2html.html('path/to/document.pdf');
console.log(htmlFromPath);

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const htmlFromBuffer = await pdf2html.html(pdfBuffer);
console.log(htmlFromBuffer);

// With options
const htmlWithOptions = await pdf2html.html(pdfBuffer, {
    maxBuffer: 1024 * 1024 * 10, // 10MB buffer
});

Extract Text

// From file path
const textFromPath = await pdf2html.text('path/to/document.pdf');

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const text = await pdf2html.text(pdfBuffer);
console.log(text);

Process Pages Individually

// From file path
const htmlPagesFromPath = await pdf2html.pages('path/to/document.pdf');

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const htmlPages = await pdf2html.pages(pdfBuffer);
htmlPages.forEach((page, index) => {
    console.log(`Page ${index + 1}:`, page);
});

// Get text for each page
const textPages = await pdf2html.pages(pdfBuffer, {
    text: true,
});
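
Once the per-page text is available as an array of strings, ordinary array code can be used on it. The sketch below looks up which pages mention a term; `findPagesContaining` is a hypothetical helper, and the sample array only stands in for the result of `pdf2html.pages(input, { text: true })`.

```javascript
// Hypothetical helper (not part of pdf2html): given the array of per-page
// strings, return the 1-based page numbers whose text contains `term`
// (case-insensitive).
function findPagesContaining(textPages, term) {
    const needle = term.toLowerCase();
    const matches = [];
    textPages.forEach((pageText, index) => {
        if (pageText.toLowerCase().includes(needle)) {
            matches.push(index + 1);
        }
    });
    return matches;
}

// Sample data standing in for the result of pdf2html.pages(input, { text: true })
const samplePages = ['Introduction and scope', 'Methods', 'Scope of future work'];
console.log(findPagesContaining(samplePages, 'scope')); // [ 1, 3 ]
```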

Extract Metadata

// From file path or buffer
const metadata = await pdf2html.meta(pdfBuffer);
console.log(metadata);
// Output: {
//   title: 'Document Title',
//   author: 'John Doe',
//   subject: 'Document Subject',
//   keywords: 'pdf, conversion',
//   creator: 'Microsoft Word',
//   producer: 'Adobe PDF Library',
//   creationDate: '2023-01-01T00:00:00Z',
//   modificationDate: '2023-01-02T00:00:00Z',
//   pages: 10
// }
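
The keys present in the metadata object can differ between documents, and the type definitions in this README list prefixed keys such as `'dc:title'` and `'dc:creator'`. The sketch below assumes that either a plain or a prefixed key may appear and reads whichever is present; `pickMeta` is a hypothetical helper, and the sample object only stands in for the result of `pdf2html.meta(input)`.

```javascript
// Hypothetical helper (not part of pdf2html): return the first defined value
// among several candidate keys, or undefined when none is present.
function pickMeta(metadata, candidateKeys) {
    for (const key of candidateKeys) {
        if (metadata[key] !== undefined && metadata[key] !== null) {
            return metadata[key];
        }
    }
    return undefined;
}

// Sample object standing in for the result of pdf2html.meta(input)
const sampleMeta = { 'dc:title': 'Document Title', 'pdf:producer': 'Adobe PDF Library' };
const title = pickMeta(sampleMeta, ['title', 'dc:title']);
const author = pickMeta(sampleMeta, ['author', 'dc:creator']);
console.log(title); // Document Title
console.log(author); // undefined
```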

Generate Thumbnails

// From file path
const thumbnailFromPath = await pdf2html.thumbnail('path/to/document.pdf');

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const thumbnailPath = await pdf2html.thumbnail(pdfBuffer);
console.log('Thumbnail saved to:', thumbnailPath);

// Custom thumbnail options
const customThumbnailPath = await pdf2html.thumbnail(pdfBuffer, {
    page: 1, // Page number (default: 1)
    imageType: 'png', // 'png' or 'jpg' (default: 'png')
    width: 300, // Width in pixels (default: 160)
    height: 400, // Height in pixels (default: 226)
});

Extract Images

// From file path
const imagePaths = await pdf2html.extractImages('path/to/document.pdf');
console.log('Extracted images:', imagePaths);
// Output: ['/absolute/path/to/files/image/document1.jpg', '/absolute/path/to/files/image/document2.png', ...]

// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const imagePathsFromBuffer = await pdf2html.extractImages(pdfBuffer);

// With custom output directory
const imagePathsInCustomDir = await pdf2html.extractImages(pdfBuffer, {
    outputDirectory: './extracted-images', // Custom output directory
});

// With custom buffer size for large PDFs
const imagePathsFromLargePdf = await pdf2html.extractImages('large-document.pdf', {
    outputDirectory: './output',
    maxBuffer: 1024 * 1024 * 10, // 10MB buffer
});
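
Since the result is an array of file paths, it can be post-processed with Node's built-in `path` module. The sketch below groups paths by file extension; `groupByExtension` is a hypothetical helper, and the sample paths are made up for illustration.

```javascript
const path = require('path');

// Hypothetical helper (not part of pdf2html): group file paths by their
// lower-cased extension, e.g. { '.jpg': [...], '.png': [...] }.
function groupByExtension(imagePaths) {
    const groups = {};
    for (const imagePath of imagePaths) {
        const ext = path.extname(imagePath).toLowerCase() || '(none)';
        (groups[ext] = groups[ext] || []).push(imagePath);
    }
    return groups;
}

// Sample data standing in for the result of pdf2html.extractImages(input)
const samplePaths = ['/tmp/images/document1.jpg', '/tmp/images/document2.png', '/tmp/images/document3.JPG'];
console.log(Object.keys(groupByExtension(samplePaths))); // [ '.jpg', '.png' ]
```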

💻 TypeScript Support

This package includes TypeScript type definitions out of the box. No need to install @types/pdf2html.

Basic TypeScript Usage

import * as pdf2html from 'pdf2html';
// or
import { html, text, pages, meta, thumbnail, extractImages, PDFMetadata, PDFProcessingError } from 'pdf2html';

async function convertPDF() {
    try {
        // All methods accept string paths or Buffers
        const htmlContent: string = await pdf2html.html('document.pdf');
        // pdfData: PDF bytes you already hold in memory (defined elsewhere)
        const textContent: string = await pdf2html.text(Buffer.from(pdfData));

        // Full type safety for options
        const thumbnailPath = await pdf2html.thumbnail('document.pdf', {
            page: 1, // number
            imageType: 'png', // 'png' | 'jpg'
            width: 300, // number
            height: 400, // number
        });

        // TypeScript knows the shape of metadata
        const metadata: PDFMetadata = await pdf2html.meta('document.pdf');
        console.log(metadata['pdf:producer']); // string | undefined
        console.log(metadata.resourceName); // string | undefined
    } catch (error) {
        if (error instanceof pdf2html.PDFProcessingError) {
            console.error('PDF processing failed:', error.message);
            console.error('Exit code:', error.exitCode);
        }
    }
}

Type Definitions

// Input types - all methods accept either file paths or Buffers
type PDFInput = string | Buffer;

// Options interfaces
interface ProcessingOptions {
    maxBuffer?: number; // Maximum buffer size in bytes
}

interface PageOptions extends ProcessingOptions {
    text?: boolean; // Extract text instead of HTML
}

interface ThumbnailOptions extends ProcessingOptions {
    page?: number; // Page number (default: 1)
    imageType?: 'png' | 'jpg'; // Image format (default: 'png')
    width?: number; // Width in pixels (default: 160)
    height?: number; // Height in pixels (default: 226)
}

// Metadata structure with common fields
interface PDFMetadata {
    'pdf:PDFVersion'?: string;
    'pdf:producer'?: string;
    'xmp:CreatorTool'?: string;
    'dc:title'?: string;
    'dc:creator'?: string;
    resourceName?: string;
    [key: string]: any; // Allows additional properties
}

// Error class
class PDFProcessingError extends Error {
    command?: string; // The command that failed
    exitCode?: number; // The process exit code
}

IntelliSense Support

Full IntelliSense support in VS Code and other TypeScript-aware editors:

Auto-completion for all methods and options
Inline documentation on hover
Type checking at compile time
Catch errors before runtime

Advanced TypeScript Usage

import * as pdf2html from 'pdf2html';
import { PDFProcessor, utils } from 'pdf2html';

// Using the PDFProcessor class directly
const html = await PDFProcessor.toHTML('document.pdf');

// Using utility classes
const { FileManager, HTMLParser } = utils;
await FileManager.ensureDirectories();

// Type guards
function isPDFProcessingError(error: unknown): error is pdf2html.PDFProcessingError {
    return error instanceof pdf2html.PDFProcessingError;
}

// Generic helper with proper typing
async function processPDFSafely<T>(operation: () => Promise<T>, fallback: T): Promise<T> {
    try {
        return await operation();
    } catch (error) {
        if (isPDFProcessingError(error)) {
            console.error(`PDF operation failed: ${error.message}`);
        }
        return fallback;
    }
}

// Usage
const pages = await processPDFSafely(
    () => pdf2html.pages('document.pdf', { text: true }),
    [] // fallback to empty array
);

⚙️ Advanced Configuration

Buffer Size Configuration

By default, the maximum buffer size is 2MB. For large PDFs, you may need to increase this:

const options = {
    maxBuffer: 1024 * 1024 * 50, // 50MB buffer
};

// Apply to any method
await pdf2html.html('large-file.pdf', options);
await pdf2html.text('large-file.pdf', options);
await pdf2html.pages('large-file.pdf', options);
await pdf2html.meta('large-file.pdf', options);
await pdf2html.thumbnail('large-file.pdf', options);
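
When the right buffer size is not known in advance, one option is to retry with a larger `maxBuffer`. The sketch below is only an illustration under a stated assumption: it treats every failure as a possible buffer overflow and doubles the size until a limit is reached. `retryWithLargerBuffer` and its parameters are hypothetical names, not part of pdf2html, and the operation is passed in as a function such as `(options) => pdf2html.html('large-file.pdf', options)`.

```javascript
const MB = 1024 * 1024;

// Hypothetical helper (not part of pdf2html). `run` is any function that takes
// an options object and returns a promise.
// Assumption: a failure may be caused by too small a buffer, so every failure
// is retried with a doubled maxBuffer until `limit` would be exceeded.
async function retryWithLargerBuffer(run, start = 2 * MB, limit = 64 * MB) {
    let lastError;
    for (let maxBuffer = start; maxBuffer <= limit; maxBuffer *= 2) {
        try {
            return await run({ maxBuffer });
        } catch (error) {
            lastError = error; // remember the failure and try a larger buffer
        }
    }
    throw lastError;
}

// Stand-in operation that only succeeds once the buffer reaches 8MB
const fakeRun = async ({ maxBuffer }) => {
    if (maxBuffer < 8 * MB) throw new Error('output exceeded buffer');
    return maxBuffer / MB;
};
retryWithLargerBuffer(fakeRun).then((usedMb) => console.log(usedMb)); // 8
```

In real use you would probably inspect the error and rethrow unrelated failures (for example a missing file) immediately instead of retrying them.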

Error Handling

Always wrap your calls in try-catch blocks for proper error handling:

try {
    const html = await pdf2html.html('document.pdf');
    // Process HTML
} catch (error) {
    if (error.code === 'ENOENT') {
        console.error('PDF file not found');
    } else if (error.message.includes('Java')) {
        console.error('Java is not installed or not in PATH');
    } else {
        console.error('PDF processing failed:', error.message);
    }
}
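
When several files are converted in one run, per-file failures can be collected instead of aborting on the first error. The sketch below uses `Promise.allSettled` (available from Node.js 12.9); `convertAll` is a hypothetical helper, not part of pdf2html, and the conversion function is passed in, for example `(file) => pdf2html.html(file)`.

```javascript
// Hypothetical helper (not part of pdf2html). `convert` is any function that
// takes one input and returns a promise.
async function convertAll(inputs, convert) {
    const settled = await Promise.allSettled(inputs.map((input) => convert(input)));
    const succeeded = [];
    const failed = [];
    settled.forEach((outcome, index) => {
        if (outcome.status === 'fulfilled') {
            succeeded.push({ input: inputs[index], output: outcome.value });
        } else {
            failed.push({ input: inputs[index], reason: outcome.reason });
        }
    });
    return { succeeded, failed };
}

// Stand-in converter: rejects for one input to show how failures are reported
const fakeConvert = async (file) => {
    if (file === 'missing.pdf') throw new Error('PDF file not found');
    return `<html>${file}</html>`;
};
convertAll(['a.pdf', 'missing.pdf'], fakeConvert).then(({ succeeded, failed }) => {
    console.log(succeeded.length, failed.length); // 1 1
});
```

Note that this starts all conversions at once; for a large number of files a concurrency limit may be preferable.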

🏗️ API Reference

`pdf2html.html(input, [options])`

Converts PDF to HTML format.

input string | Buffer - Path to the PDF file or PDF buffer
options object (optional)
- maxBuffer number - Maximum buffer size in bytes (default: 2MB)
Returns: Promise<string> - HTML content

`pdf2html.text(input, [options])`

Extracts text from PDF.

input string | Buffer - Path to the PDF file or PDF buffer
options object (optional)
- maxBuffer number - Maximum buffer size in bytes (default: 2MB)
Returns: Promise<string> - Extracted text content
