pdfplumber
Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.
Works best on machine-generated, rather than scanned, PDFs. Built on pdfminer.six.
Currently tested on Python 3.10, 3.11, 3.12, 3.13, 3.14.
Translations of this document are available in: Chinese (by @hbh112233abc).
To report a bug or request a feature, please file an issue. To ask a question or request assistance with a specific PDF, please use the discussions forum.
Table of Contents
- Installation
- Command line interface
- Python library
- Visual debugging
- Extracting text
- Extracting tables
- Extracting form values
- Demonstrations
- Comparison to other libraries
- Acknowledgments / Contributors
- Contributing
Installation
pip install pdfplumber
Command line interface
Basic example
curl "https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf" > background-checks.pdf
pdfplumber background-checks.pdf > background-checks.csv
The output will be a CSV containing info about every character, line, and rectangle in the PDF.
Options
| Argument | Description |
|----------|-------------|
|--format [format]| csv, json, or text. The csv and json formats return information about each object. Of those two, the json format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes. The text option returns a plain-text representation of the PDF, using Page.extract_text(layout=True).|
|--pages [list of pages]| A space-delimited, 1-indexed list of pages or hyphenated page ranges. E.g., 1 11-15, which would return data for pages 1, 11, 12, 13, 14, and 15.|
|--types [list of object types to extract]| Choices are char, rect, line, curve, image, annot, et cetera. Defaults to all available.|
|--laparams| A JSON-formatted string (e.g., '{"detect_vertical": true}') to pass to pdfplumber.open(..., laparams=...).|
|--precision [integer]| The number of decimal places to round floating-point numbers. Defaults to no rounding.|
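As a rough illustration of how a --pages value such as 1 11-15 expands into individual page numbers, here is a minimal sketch. The helper name expand_page_spec is hypothetical and is not part of pdfplumber; the command line tool's own parsing may differ in its details.

```python
from typing import Iterable, List


def expand_page_spec(tokens: Iterable[str]) -> List[int]:
    """Expand tokens like "1" or "11-15" into a flat list of 1-indexed pages.

    Hypothetical helper for illustration only; not pdfplumber's own parser.
    """
    pages: List[int] = []
    for token in tokens:
        if "-" in token:
            # A hyphenated range is inclusive at both ends.
            start, end = (int(part) for part in token.split("-", 1))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(token))
    return pages


# The shell splits "--pages 1 11-15" into the tokens "1" and "11-15".
print(expand_page_spec(["1", "11-15"]))  # [1, 11, 12, 13, 14, 15]
```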
Python library
Basic example
import pdfplumber
with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.chars[0])
Loading a PDF
To start working with a PDF, call pdfplumber.open(x), where x can be a:
- path to your PDF file
- file object, loaded as bytes
- file-like object, loaded as bytes
The open method returns an instance of the pdfplumber.PDF class.
To load a password-protected PDF, pass the password keyword argument, e.g., pdfplumber.open("file.pdf", password = "test").
To set layout analysis parameters to pdfminer.six's layout engine, pass the laparams keyword argument, e.g., pdfplumber.open("file.pdf", laparams = { "line_overlap": 0.7 }).
To pre-normalize Unicode text, pass unicode_norm=..., where ... is one of the four Unicode normalization forms: "NFC", "NFD", "NFKC", or "NFKD".
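To get a feel for what the four forms do to text that often appears in PDFs, the sketch below uses the standard library's unicodedata.normalize. It assumes that unicode_norm has the same effect on extracted text as calling that function on it; that is an assumption about pdfplumber's behavior, not something this README states.

```python
import unicodedata

# "\ufb01" is the single-character "fi" ligature; "e\u0301" is the letter "e"
# followed by a combining acute accent. Both are common in PDF text.
ligature = "\ufb01"
decomposed_e = "e\u0301"

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    results[form] = (
        unicodedata.normalize(form, ligature),
        unicodedata.normalize(form, decomposed_e),
    )

for form, (lig, accented) in results.items():
    # The compatibility forms (NFKC, NFKD) split the ligature into "f" + "i";
    # the composed forms (NFC, NFKC) merge "e" + accent into one code point.
    print(form, repr(lig), len(accented))
```

If searching extracted text for words such as "file" fails because of ligatures, one of the compatibility forms is probably the one to try first.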
Invalid metadata values are treated as a warning by default. If that is not intended, pass strict_metadata=True to the open method and pdfplumber.open will raise an exception if it is unable to parse the metadata.
The pdfplumber.PDF class
The top-level pdfplumber.PDF class represents a single PDF and has two main properties:
| Property | Description |
|----------|-------------|
|.metadata| A dictionary of metadata key/value pairs, drawn from the PDF's Info trailers. Typically includes "CreationDate," "ModDate," "Producer," et cetera.|
|.pages| A list containing one pdfplumber.Page instance per page loaded.|
... and also has the following method:
| Method | Description |
|--------|-------------|
|.close()| Calling this method calls Page.close() on each page, and also closes the file stream (except in cases when the stream is external, i.e., already opened and passed directly to pdfplumber). |
The pdfplumber.Page class
The pdfplumber.Page class is at the core of pdfplumber. Most things you'll do with pdfplumber will revolve around this class. It has these main properties:
| Property | Description |
|----------|-------------|
|.page_number| The sequential page number, starting with 1 for the first page, 2 for the second, and so on.|
|.width| The page's width.|
|.height| The page's height.|
|.objects / .chars / .lines / .rects / .curves / .images| Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see "Objects" below.|
... and these main methods:
| Method | Description |
|--------|-------------|
|.crop(bounding_box, relative=False, strict=True)| Returns a version of the page cropped to the bounding box, which should be expressed as a 4-tuple with the values (x0, top, x1, bottom). Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If relative=True, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See Issue #245 for a visual example and explanation.) When strict=True (the default), the crop's bounding box must fall entirely within the page's bounding box.|
|.within_bbox(bounding_box, relative=False, strict=True)| Similar to .crop, but only retains objects that fall entirely within the bounding box.|
|.outside_bbox(bounding_box, relative=False, strict=True)| Similar to .crop and .within_bbox, but only retains objects that fall entirely outside the bounding box.|
|.filter(test_function)| Returns a version of the page with only the .objects for which test_function(obj) returns True.|
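The difference between "at least partly within", "entirely within", and "entirely outside" can be sketched with plain dictionaries. This is an illustrative model of the geometry only: the function names are hypothetical, and pdfplumber's own handling of objects that merely touch a box edge may differ.

```python
from typing import Dict, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def to_absolute(bbox: BBox, page_bbox: BBox) -> BBox:
    """Treat bbox as an offset from the page's top-left corner (relative=True)."""
    x0, top, x1, bottom = bbox
    page_x0, page_top = page_bbox[0], page_bbox[1]
    return (x0 + page_x0, top + page_top, x1 + page_x0, bottom + page_top)


def overlaps(obj: Dict[str, float], bbox: BBox) -> bool:
    """At least partly within bbox: the kind of object .crop retains."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] < x1 and obj["x1"] > x0
            and obj["top"] < bottom and obj["bottom"] > top)


def is_within(obj: Dict[str, float], bbox: BBox) -> bool:
    """Entirely within bbox: the kind of object .within_bbox retains."""
    x0, top, x1, bottom = bbox
    return (obj["x0"] >= x0 and obj["x1"] <= x1
            and obj["top"] >= top and obj["bottom"] <= bottom)


def is_outside(obj: Dict[str, float], bbox: BBox) -> bool:
    """Entirely outside bbox: the kind of object .outside_bbox retains."""
    return not overlaps(obj, bbox)


def clip(obj: Dict[str, float], bbox: BBox) -> Dict[str, float]:
    """Slice a partly overlapping object's edges so it fits inside bbox."""
    x0, top, x1, bottom = bbox
    clipped = dict(obj)
    clipped["x0"] = max(obj["x0"], x0)
    clipped["x1"] = min(obj["x1"], x1)
    clipped["top"] = max(obj["top"], top)
    clipped["bottom"] = min(obj["bottom"], bottom)
    return clipped


box = (0, 0, 100, 100)
inside = {"x0": 10, "top": 10, "x1": 20, "bottom": 20}
straddling = {"x0": 90, "top": 50, "x1": 110, "bottom": 60}
far_away = {"x0": 150, "top": 150, "x1": 160, "bottom": 160}

print(is_within(inside, box), overlaps(straddling, box), is_outside(far_away, box))
print(clip(straddling, box)["x1"])  # right edge sliced back to the box edge
```

On a real page, the equivalent calls would be page.crop(box), page.within_bbox(box), and page.outside_bbox(box).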
... and also has the following method:
| Method | Description |
|--------|-------------|
|.close()| By default, Page objects cache their layout and object information to avoid having to reprocess it. When parsing large PDFs, however, these cached properties can require a lot of memory. You can use this method to flush the cache and release the memory.|
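When iterating over a large PDF, one way to keep memory use down is to close each page as soon as you are done with it. The following is a minimal sketch; count_chars_per_page is a hypothetical helper name, and it only assumes that each page exposes .chars and .close() as described above.

```python
from typing import Any, List


def count_chars_per_page(pdf: Any) -> List[int]:
    """Count characters on each page, flushing each page's cache afterwards.

    `pdf` is expected to behave like an open pdfplumber.PDF: it needs a
    `.pages` list whose items expose `.chars` and `.close()`.
    """
    counts: List[int] = []
    for page in pdf.pages:
        counts.append(len(page.chars))
        # Release the cached layout and objects before moving on.
        page.close()
    return counts


# Typical use (requires pdfplumber and a real file):
#
#     import pdfplumber
#     with pdfplumber.open("path/to/file.pdf") as pdf:
#         print(count_chars_per_page(pdf))
```

Collecting only the small result you need (here, an integer per page) before calling .close() is what lets the cached objects be released.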
Additional methods are described in the sections below:
Objects
Each instance of pdfplumber.PDF and pdfplumber.Page provides access to several types of PDF objects, all derived from pdfminer.six PDF parsing. The following properties each return a Python list of the matching objects:
- .chars, each representing a single text character.
- .lines, each representing a single 1-dimensional line.
- .rects, each representing a single 2-dimensional rectangle.
- .curves, each representing any series of connected points that pdfminer.six does not recognize as a line or rectangle.
- .images, each representing an image.
- .annots, each representing a single PDF annotation (cf. Section 8.4 of the official PDF specification for details).
- .hyperlinks, each representing a single PDF annotation of the subtype Link and having a URI action attribute.
Each object is represented as a simple Python dict, with the following properties:
char properties
| Property | Description |
|----------|-------------|
|page_number| Page number on which this character was found.|
|text| E.g., "z", or "Z" or " ".|
|fontname| Name of the character's font face.|
|size| Font size.|
|adv| Equal to text width * the font size * scaling factor.|
|upright| Whether the character is upright.|
|height| Height of the character.|
|width| Width of the character.|
|x0| Distance of left side of character from left side of page.|
|x1| Distance of right side of character from left side of page.|
|y0| Distance of bottom of character from bottom of page.|
|y1| Distance of top of character from bottom of page.|
|top| Distance of top of character from top of page.|
|bottom| Distance of bottom of the character from top of page.|
|doctop| Distance of top of character from top of document.|
|matrix| The "current transformation matrix" for this character. (See below for details.)|
|mcid| The marked content section ID for this character if any (otherwise None). Experimental attribute.|
|tag| The marked content section tag for this character if any (otherwise None). Experimental attribute.|
|ncs|TKTK|
|stroking_pattern|TKTK|
|non_stroking_pattern|TKTK|
|stroking_color|The co
