Docx2python

Extract docx headers, footers, (formatted) text, footnotes, endnotes, properties, and images.

New in docx2python Version 3

  • Better type hints for DocxOutput properties. You should never get an "or" or "Any" type hint for the nested lists returned by Docx2Python.
  • Support for "strict" namespaces. Word uses a superset of the standard Office Open XML format. Word can restrict itself to the standard by saving with the "strict" namespace. This is now supported.
  • Tables exported as nested lists are now always nxm (n rows, m columns). This will simplify converting tables to markdown or other data types. Where duplicate_merged_cells is True, the table will be filled to nxm with content from adjacent cells. Where False, the table will be filled to nxm with empty cells.
  • Tables can now be identified without guessing games (see Par Type).
  • Word's paragraph styles are now exposed (e.g., Heading 2, Subtitle, Subtle Emphasis - see Par Type). If html=True, these will be exported as html tags where an obvious mapping exists (e.g., Heading 1 -> h1).
  • A paragraph's position in a nested list is now exposed (see Par Type).
  • Input boolean arguments 'html' (False) and 'duplicate_merged_cells' (True) are now keyword only.
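Because every table is now n x m, converting one to markdown needs no special handling for ragged rows. The following is a minimal sketch, not part of docx2python: table_to_markdown is a hypothetical helper, it works on a plain nested list (rows, then cells, then paragraph strings), and it assumes the first row is the header.

```python
from typing import List

Table = List[List[List[str]]]  # rows -> cells -> paragraph strings


def table_to_markdown(table: Table) -> str:
    """Render an n x m nested-list table as a markdown table.

    Hypothetical helper; treats the first row as the header (an assumption).
    """

    def cell_text(cell: List[str]) -> str:
        # Join a cell's non-empty paragraphs; escape pipes so they
        # are not read as column separators.
        joined = " ".join(par.strip() for par in cell if par.strip())
        return joined.replace("|", "\\|")

    rows = [[cell_text(cell) for cell in row] for row in table]
    if not rows:
        return ""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |"]
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


# Literal stand-in for one table taken from a depth-4 nested list.
example_table: Table = [
    [["Name"], ["Qty"]],
    [["apple"], ["3"]],
    [["pear", "(ripe)"], [""]],
]
markdown = table_to_markdown(example_table)
print(markdown)
```

The repository's own example in tests/test_tables_to_markdown.py is the better reference for use with real documents.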

Par Type

New in Docx2Python Version 3, the Par type captures some paragraph properties.

elem: lxml.etree._Element

A pointer to the xml element from which the paragraph was extracted. This is useful for fishing around in the xml from a known location. See tests/test_content_control_block_properties.py for an example of how this can be used.

html_style: list[str]

A list of html tags that will be applied to the paragraph if html=True.

style: str

The MS Word paragraph style (e.g., Heading 2, Subtitle, Subtle Emphasis), if any. This will facilitate finding headings, etc.

lineage: ("document", str | None, str | None, str | None, str | None)

Docx2Python partially flattens the xml spaghetti so that a paragraph is always at depth 4. This often means building structure where none exists, so the lineage [ostensibly (great-great-grandparent, great-grandparent, grandparent, parent, self)] is not always straightforward. But there are some patterns you can depend on. The most requested is that paragraphs in table cells will always have a lineage of ("document", "tbl", something, something, "p"). Use iter_tables and is_tbl from the docx2python.iterators module to find tables in your document. There is an example in tests/test_tables_to_markdown.py.
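The table pattern described above can be checked directly on a lineage tuple. This is only a sketch on plain tuples: is_table_lineage is a hypothetical name, and the "tr" and "tc" values are illustrative placeholders for the two "something" slots. For real documents, prefer iter_tables and is_tbl from docx2python.iterators.

```python
from typing import Optional, Tuple

Lineage = Tuple[Optional[str], ...]


def is_table_lineage(lineage: Lineage) -> bool:
    """Return True if a lineage looks like ("document", "tbl", ?, ?, "p")."""
    return len(lineage) == 5 and lineage[1] == "tbl" and lineage[-1] == "p"


# Illustrative lineages; the middle values are placeholders, not guaranteed names.
table_par = ("document", "tbl", "tr", "tc", "p")
loose_par = ("document", None, None, None, "p")

print(is_table_lineage(table_par))
print(is_table_lineage(loose_par))
```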

runs: list[Run]

A list of Run instances. Each Run instance has html_style and text attributes. This will facilitate finding and extracting text with specific formatting.
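Finding text with specific formatting then comes down to filtering runs by html_style. The sketch below uses RunStub, a hypothetical stand-in with only the two attributes named above, so it runs without a docx file. It also assumes html_style holds bare tag names such as "b"; check the values in your own output before relying on that.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class RunStub:
    """Hypothetical stand-in for a Run: only html_style and text."""

    html_style: List[str] = field(default_factory=list)
    text: str = ""


def text_with_style(runs: Iterable[RunStub], tag: str) -> List[str]:
    """Collect the text of every run whose html_style includes tag."""
    return [run.text for run in runs if tag in run.html_style]


runs = [
    RunStub([], "Plain, "),
    RunStub(["b"], "bold, "),
    RunStub(["b", "i"], "bold italic"),
]
bold_text = text_with_style(runs, "b")
print(bold_text)
```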

list_position: tuple[str | None, list[int]]

The address of a paragraph in a nested list. The first item in the tuple is a string identifier for the list. These are extracted from Word, and may look like indices, but they are not. List "2" might come before list "1" in the document. The second item is a list of indices to show where you are in that list.

1. paragraph  # list_position = ("list_id", [0])
2. paragraph  # list_position = ("list_id", [1])
   a. paragraph  # list_position = ("list_id", [1, 0])
      i. paragraph  # list_position = ("list_id", [1, 0, 0])
   b. paragraph  # list_position = ("list_id", [1, 1])
3. paragraph  # list_position = ("list_id", [2])
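As a sketch of how these addresses might be used, the hypothetical helper below turns the zero-based indices into a one-based dotted label. It operates on plain tuples shaped like the examples above, not on real Par instances.

```python
from typing import List, Optional, Tuple

ListPosition = Tuple[Optional[str], List[int]]


def outline_number(list_position: ListPosition) -> str:
    """Turn zero-based indices into a one-based dotted label.

    Returns an empty string when the index list is empty.
    """
    _list_id, indices = list_position
    return ".".join(str(index + 1) for index in indices)


# The six list_position values from the example above.
positions: List[ListPosition] = [
    ("list_id", [0]),
    ("list_id", [1]),
    ("list_id", [1, 0]),
    ("list_id", [1, 0, 0]),
    ("list_id", [1, 1]),
    ("list_id", [2]),
]
labels = [outline_number(position) for position in positions]
print(labels)
```

The length of the index list gives the nesting depth, so len(list_position[1]) is 1 for a top-level item and 3 for the "i." item.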

docx2python

Extract docx headers, footers, text, footnotes, endnotes, properties, comments, and images to a Python object.

README_DOCX_FILE_STRUCTURE.md may help if you'd like to extend docx2python.

For a summary of what's new in docx2python 2, scroll down to New in docx2python Version 2

For a summary of what's new in docx2python 3, scroll up to New in docx2python Version 3

The code began as an expansion/contraction of python-docx2txt (Copyright (c) 2015 Ankush Shah). The original code is mostly gone, but some of the bones may still be here.

shared features:

  • extracts text from docx files
  • extracts images from docx files

additions:

  • extracts footnotes and endnotes
  • converts bullets and numbered lists to ascii with indentation
  • converts hyperlinks to <a href="http:/...">link text</a>
  • retains some structure of the original file (more below)
  • extracts document properties (creator, lastModifiedBy, etc.)
  • inserts image placeholders in text ('----image1.jpg----')
  • inserts plain text footnote and endnote references in text ('----footnote1----')
  • (optionally) retains font size, font color, bold, italics, and underline as html
  • extracts math equations
  • extracts user selections from checkboxes and dropdown menus
  • extracts comments
  • extracts some paragraph properties (e.g., Heading 1)
  • tracks location within numbered lists

subtractions:

  • no command-line interface
  • will only work with Python 3.8+

Installation

pip install docx2python

Use

docx2python opens a zipfile object and (lazily) reads it. Use context management (with ... as) to close this zipfile object or explicitly close with docx_content.close().

from docx2python import docx2python

# extract docx content
with docx2python('path/to/file.docx') as docx_content:
    print(docx_content.text)

# or, without a context manager, close the zipfile explicitly
docx_content = docx2python('path/to/file.docx')
print(docx_content.text)
docx_content.close()

# extract docx content, write images to image_directory
with docx2python('path/to/file.docx', 'path/to/image_directory') as docx_content:
    print(docx_content.text)

# extract docx content with basic font styles converted to html
with docx2python('path/to/file.docx', html=True) as docx_content:
    print(docx_content.text)

Note on html feature:

  • supports <i>italic, <b>bold, <u>underline, <s>strike, <sup>superscript, <sub>subscript, <span style="font-variant: small-caps">small caps, <span style="text-transform:uppercase">all caps, <span style="background-color: yellow">highlighted, <span style="font-size:32">font size, <span style="color:#ff0000">colored text.
  • hyperlinks will always be exported as html (<a href="http:/...">link text</a>), even if html=False, because I couldn't think of a more canonical representation.
  • every tag open in a paragraph will be closed in that paragraph (and, where appropriate, reopened in the next paragraph). If two subsequent paragraphs are bold, they will be returned as <b>paragraph a</b>, <b>paragraph b</b>. This is intentional to make each paragraph its own entity.
  • if you specify html=True, &, > and < in your docx text will be encoded as &amp;, &gt; and &lt;
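The last two notes can be illustrated with the standard library. This is a sketch of the described behavior, not docx2python's implementation: wrap_paragraphs is a hypothetical helper that escapes each paragraph with html.escape and then opens and closes every tag inside that paragraph.

```python
import html
from typing import List


def wrap_paragraphs(paragraphs: List[str], tags: List[str]) -> List[str]:
    """Escape each paragraph, then open and close every tag within it."""
    opening = "".join(f"<{tag}>" for tag in tags)
    # Close in reverse order so tags nest properly; a closing tag keeps
    # only the tag name, not its attributes.
    closing = "".join(f"</{tag.split()[0]}>" for tag in reversed(tags))
    # quote=False escapes only &, < and >.
    return [opening + html.escape(par, quote=False) + closing for par in paragraphs]


wrapped = wrap_paragraphs(["R&D", "a < b"], ["b", "i"])
print(wrapped)
```

Each paragraph comes back as a self-contained string, so any one of them can be used on its own without unbalanced tags.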

Return Value

Function docx2python returns a DocxContent instance with several attributes.

header (_runs, _pars) - contents of the docx headers in the return format described herein

footer (_runs, _pars) - contents of the docx footers in the return format described herein

body (_runs, _pars) - contents of the docx body in the return format described herein

footnotes (_runs, _pars) - contents of the docx footnotes in the return format described herein

endnotes (_runs, _pars) - contents of the docx endnotes in the return format described herein

document (_runs, _pars) - header + body + footer (read only)

text - all docx text as one string, similar to what you'd get from python-docx2txt

properties - docx property names mapped to values (e.g., {"lastModifiedBy": "Shay Hill"})

images - image names mapped to images in binary format. Write to filesystem with

for name, image in docx_content.images.items():
    # each image is a bytes object; write it in binary mode
    with open(name, 'wb') as image_destination:
        image_destination.write(image)

# or

with docx2python('path/to/file.docx', 'path/to/image/directory') as docx_content:
    ...

# or

with docx2python('path/to/file.docx') as docx_content:
    docx_content.save_images('path/to/image/directory')

docx_reader - a DocxReader (see docx_reader.py) instance with several methods for extracting xml portions.

Arguments

def docx2python(
    docx_filename: str | os.PathLike[str] | BytesIO,
    image_folder: str | os.PathLike[str] | None = None,
    *,
    html: bool = False,
    duplicate_merged_cells: bool = True
) -> DocxContent:
    """
    Unzip a docx file and extract contents.

    :param docx_filename: path to a docx file
    :param image_folder: optionally specify an image folder
        (images in docx will be copied to this folder)
    :param html: bool, extract some formatting as html
    :param duplicate_merged_cells: bool, duplicate merged cells to return a mxn
        nested list for each table (default True)
    :return: DocxContent object
    """

Return Format

(header, footer, body, footnotes, endnotes, document)

Some structure will be maintained. Text will be returned in a nested list, with paragraphs always at depth 4 (i.e., output.body[i][j][k][l] will be a paragraph).

If your docx has no tables, output.body will appear as one table with all content in one cell:

[  # document
    [  # table
        [  # row
            [  # cell
                "Paragraph 1",
                "Paragraph 2",
                "-- bulleted list",
                "-- continuing bulleted list",
                "1)  numbered list",
                "2)  continuing numbered list",
                "    a)  sublist",
                "        i)  sublist of sublist",
                "3)  keeps track of indention levels",
                "    a)  resets sublist counters"
            ]
        ]
    ]
]

Table cells will appear as table cells. Text outside tables will appear as table cells.

A docx document can be tables within tables within tables. Docx2Python flattens most of this to more easily navigate within the content.
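Since paragraphs always sit at depth 4, walking the structure takes four nested loops. The sketch below uses a literal nested list in place of output.body; flatten_paragraphs is a hypothetical helper written for illustration, not one of the library's own iterators.

```python
from typing import Iterator, List

Body = List[List[List[List[str]]]]  # tables -> rows -> cells -> paragraphs


def flatten_paragraphs(body: Body) -> Iterator[str]:
    """Yield every depth-4 paragraph string in document order."""
    for table in body:
        for row in table:
            for cell in row:
                yield from cell


# Literal stand-in for output.body: loose text, then a 2 x 2 table.
body: Body = [
    [[["Paragraph 1", "Paragraph 2"]]],    # text outside tables: one cell
    [[["A1"], ["B1"]], [["A2"], ["B2"]]],  # a table with two rows
]
paragraphs = list(flatten_paragraphs(body))
print(paragraphs)
```

Indexing follows the same path: body[1][0][1][0] is the first paragraph of the second cell in the first row of the second table.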

(header_runs, footer_runs, body_runs, footnotes_runs, endnotes_runs, document_runs)

Version 2 introduced _runs attributes. Instead of a string for each paragraph, each run is a string.
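Under that description, a _runs attribute is one level deeper than its plain counterpart: each paragraph is a list of run strings. The sketch below assumes that joining a paragraph's runs recovers the paragraph text. join_runs is a hypothetical helper and the data is a literal stand-in for body_runs.

```python
from typing import List

BodyRuns = List[List[List[List[List[str]]]]]  # ... -> paragraphs -> runs
Body = List[List[List[List[str]]]]


def join_runs(body_runs: BodyRuns) -> Body:
    """Collapse each paragraph's list of run strings into one string."""
    return [
        [[["".join(par) for par in cell] for cell in row] for row in table]
        for table in body_runs
    ]


# One table, one row, one cell, two paragraphs.
body_runs: BodyRuns = [[[[["Plain ", "bold", " plain."], ["Second paragraph."]]]]]
joined = join_runs(body_runs)
print(joined)
```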
