HTML FOR DOCX

Convert HTML to DOCX. This project is a fork of the discontinued pqzx/html2docx.

How to install

pip install html-for-docx

Usage

The basic usage

Add HTML-formatted content to an existing .docx document

from html4docx import HtmlToDocx

parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, filename_docx)

You can use python-docx to manipulate the file directly. Here is an example:

from docx import Document
from html4docx import HtmlToDocx

document = Document()
parser = HtmlToDocx()

html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)

document.save('your_file_name.docx')

Or incrementally add new HTML to the document and save it when finished. New content is always appended at the end:

from docx import Document
from html4docx import HtmlToDocx

document = Document()
parser = HtmlToDocx()

for part in ['First', 'Second', 'Third']:
    parser.add_html_to_document(f'<h1>{part} Part</h1>', document)

parser.save('your_file_name.docx')

When you pass a Document object, you can use either document.save() from python-docx or parser.save() from html4docx. Both work well.

Both support saving in memory, using BytesIO.

from io import BytesIO
from docx import Document
from html4docx import HtmlToDocx

buffer = BytesIO()
document = Document()
parser = HtmlToDocx()

html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)

# Save the document to the in-memory buffer
parser.save(buffer)

# If you need to read from the buffer again after saving,
# you might need to reset its position to the beginning
buffer.seek(0)
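
A .docx file is a ZIP package, so one quick way to sanity-check an in-memory buffer after saving is to open it with the standard library's zipfile module. The sketch below is only an illustration and is not part of html4docx. The helper name looks_like_docx is hypothetical.

```python
import zipfile
from io import BytesIO


def looks_like_docx(buffer: BytesIO) -> bool:
    """Return True if the buffer holds a ZIP package with the main DOCX parts."""
    buffer.seek(0)  # rewind, since saving leaves the position at the end
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as package:
        names = set(package.namelist())
    buffer.seek(0)  # leave the buffer ready for the next reader
    # Every WordprocessingML package carries these two parts
    return {"[Content_Types].xml", "word/document.xml"} <= names
```

Calling looks_like_docx(buffer) right after parser.save(buffer) should return True for a valid document, and the buffer is rewound afterwards so it can be passed straight to an upload or HTTP response.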

Convert files directly

from html4docx import HtmlToDocx

parser = HtmlToDocx()
parser.parse_html_file(input_html_file_path, output_docx_file_path)
# You can also specify an encoding; the default is utf-8
parser.parse_html_file(input_html_file_path, output_docx_file_path, 'utf-8')

Convert files from a string

from html4docx import HtmlToDocx

parser = HtmlToDocx()
docx = parser.parse_html_string(input_html_file_string)

Change table styles

Tables are not styled by default. Set the table_style attribute on the parser before converting the HTML. The style is applied to all tables.

from html4docx import HtmlToDocx

parser = HtmlToDocx()
parser.table_style = 'Light Shading Accent 4'
docx = parser.parse_html_string(input_html_file_string)

To add borders to tables, use the Table Grid style:

parser.table_style = 'Table Grid'

All table styles we support can be found here.

Options

There are 7 options that you can use to customize the conversion:

  • Disable Images: Ignore all images.
  • Disable Tables: Render only the raw table content instead of tables.
  • Disable Styles: Ignore all CSS styles. Also disables Style-Map.
  • Disable Fix-HTML: Skip using BeautifulSoup to fix possibly missing HTML tags.
  • Disable Style-Map: Ignore the mapping of CSS classes to Word styles.
  • Disable Tag-Override: Ignore HTML tag overrides.
  • Disable HTML-Comments: Ignore all "<!-- ... -->" comments from HTML.

This is how you could disable them if you want:

from html4docx import HtmlToDocx

parser = HtmlToDocx()
parser.options['images'] = False # Default True
parser.options['tables'] = False # Default True
parser.options['styles'] = False # Default True
parser.options['fix-html'] = False # Default True
parser.options['html-comments'] = False # Default False
parser.options['style-map'] = False # Default True
parser.options['tag-override'] = False # Default True
docx = parser.parse_html_string(input_html_file_string)

Extended Styling Features

CSS Class to Word Style Mapping

Map HTML CSS classes to Word document styles:

from html4docx import HtmlToDocx

style_map = {
    'code-block': 'Code Block',
    'numbered-heading-1': 'Heading 1 Numbered',
    'finding-critical': 'Finding Critical'
}

parser = HtmlToDocx(style_map=style_map)
parser.add_html_to_document(html, document)

Tag Style Overrides

Override default tag-to-style mappings:

tag_overrides = {
    'h1': 'Custom Heading 1',  # All <h1> use this style
    'pre': 'Code Block'        # All <pre> use this style
}

parser = HtmlToDocx(tag_style_overrides=tag_overrides)

Custom styles from a Word template: Use a document created from a .docx that already defines the styles (e.g. "Code Block", "Custom Markdown"). Pass that same document to the parser and save it so the custom styles are preserved:

from docx import Document
from html4docx import HtmlToDocx

doc = Document("path/to/template.docx")  # template has Code Block, Custom Markdown, etc.
parser = HtmlToDocx(tag_style_overrides={"code": "Custom Markdown", "pre": "Code Block"})
parser.add_html_to_document(html, doc)
doc.save("output.docx")  # save the template-based doc so custom styles are preserved

If you save a different document (for example, by creating a new Document() instead of loading your template), the output file will not contain the template’s custom styles.

If a referenced custom style does not exist in the document at generation time, a warning will be logged to help you detect the missing style.
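
To catch missing styles before conversion instead of relying on the logged warning, you could compare the style names you plan to use against the ones the document defines. This is a minimal sketch, not part of html4docx. The helper find_missing_styles is hypothetical, and it assumes only that document.styles is iterable and that each style has a name attribute, as in python-docx. A stand-in object replaces the real Document so that the example runs without a template file.

```python
from types import SimpleNamespace


def find_missing_styles(document, style_names):
    """Return the requested style names that the document does not define."""
    # python-docx exposes document.styles as an iterable of objects with .name
    available = {style.name for style in document.styles}
    return sorted(name for name in set(style_names) if name not in available)


# Stand-in for Document("path/to/template.docx"), for illustration only
fake_document = SimpleNamespace(
    styles=[SimpleNamespace(name="Normal"), SimpleNamespace(name="Code Block")]
)

missing = find_missing_styles(fake_document, ["Code Block", "Custom Markdown"])
print(missing)  # ['Custom Markdown']
```

With a real template, pass the loaded Document and the values of your style_map or tag_style_overrides dictionaries.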

Default Paragraph Style

Set custom default paragraph style:

# Use 'Body' as default (default behavior)
parser = HtmlToDocx(default_paragraph_style='Body')

# Use Word's default 'Normal' style
parser = HtmlToDocx(default_paragraph_style=None)

Inline CSS Styles

Full support for inline CSS styles on any element:

<p style="color: red; font-size: 14pt">Red 14pt paragraph</p>
<span style="font-weight: bold; color: blue">Bold blue text</span>

Supported CSS properties:

  • color
  • font-size
  • font-weight (bold)
  • font-style (italic)
  • text-decoration (underline, line-through)
  • font-family
  • text-align
  • background-color
  • border (for tables)
  • vertical-align (for tables)

!important Flag Support

Proper CSS precedence with !important:

<span style="color: gray">
  Gray text with <span style="color: red !important">red important</span>.
</span>

The !important flag ensures highest priority.
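
To make the idea concrete, here is a simplified sketch of how a style attribute can be split into declarations while recording the !important flag. It does not reproduce html4docx's actual parser. The function name parse_inline_style is hypothetical, and splitting on ";" is a simplification that would break on values that themselves contain semicolons.

```python
def parse_inline_style(style: str) -> dict:
    """Split a style attribute into {property: (value, is_important)}."""
    declarations = {}
    for chunk in style.split(";"):
        if ":" not in chunk:
            continue  # skip empty or malformed pieces
        # Split on the first colon only, so values containing ':' survive
        prop, value = chunk.split(":", 1)
        value = value.strip()
        important = value.lower().endswith("!important")
        if important:
            # Drop the flag and keep only the plain value
            value = value[: -len("!important")].strip()
        declarations[prop.strip().lower()] = (value, important)
    return declarations


parsed = parse_inline_style("color: red !important; font-size: 14pt")
print(parsed)
```

Here color is recorded as ("red", True) and font-size as ("14pt", False), so a later step can let flagged declarations win over unflagged ones.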

Style Precedence Order

Styles are applied in this order (lowest to highest priority):

  1. Base HTML tag styles (<b>, <em>, <code>)
  2. Parent span styles
  3. CSS class-based styles (from style_map)
  4. Inline CSS styles (from style attribute)
  5. !important inline CSS styles (highest priority)
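
The order above can be pictured as layered dictionaries, where each higher-priority layer overwrites the properties set by the ones before it. This sketch is an illustration of the ordering only, not the library's implementation. The function resolve_styles and its parameters are hypothetical.

```python
def resolve_styles(tag, parent, css_class, inline, important_inline):
    """Merge five style layers, given from lowest to highest priority."""
    resolved = {}
    for layer in (tag, parent, css_class, inline, important_inline):
        # A later layer overwrites any property set by an earlier one
        resolved.update(layer)
    return resolved


result = resolve_styles(
    tag={"font-weight": "bold"},                                # 1. base tag, e.g. <b>
    parent={"color": "gray"},                                   # 2. parent span
    css_class={"color": "green", "font-family": "Consolas"},    # 3. style_map class
    inline={"color": "blue"},                                   # 4. style attribute
    important_inline={"color": "red"},                          # 5. !important
)
print(result)
```

In this example color ends up as red because the !important layer is applied last, while font-weight and font-family survive from the lower layers since nothing above them overrides those properties.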

Metadata

You're able to read or set docx metadata:

from docx import Document
from html4docx import HtmlToDocx

document = Document()
parser = HtmlToDocx()
parser.set_initial_attrs(document)
metadata = parser.metadata

# You can get the metadata as a dict
metadata_json = metadata.get_metadata()
print(metadata_json['author'])  # the document's current author
# or print all metadata if you want
metadata.get_metadata(print_result=True)

# Set new metadata
metadata.set_metadata(author="Jane", created="2025-07-18T09:30:00")
document.save('your_file_name.docx')

You can find all available metadata attributes here.

Why

My goal in forking and fixing/updating this package was to complete my current task at work, which involves converting HTML to DOCX. The original package lacked a few features and had some bugs, preventing me from completing the task. Instead of creating a new package from scratch, I preferred to update this one.

Differences (fixes and new features)

Fixes

  • Fix table_style not working | Dfop02 from Issue
  • Handle missing run for leading br tag | dashingdove from PR
  • Fix base64 images | djplaner from Issue
  • Handle img tag without src attribute | johnjor from PR
  • Fix bug when any style has !important | Dfop02
  • Fix 'style lookup by style_id is deprecated.' | Dfop02
  • Fix background-color not working | Dfop02
  • Fix crashes when img or bookmark is created without paragraph | Dfop02
  • Fix Ordered and Unordered Lists | TaylorN15 from PR
  • Fixed styles only being applied to span tags. | Dfop02 from Issue
  • Fixed bug in style parsing when a style contains multiple colons. | Dfop02
  • Fixed highlighting a single word | Lynuxen
  • Fix color parsing failing due to invalid colors, falling back to black. | dfop02 from Issue

New Features

  • Add Width/Height style to images | maifeeulasad from PR
  • Support px, cm, pt, in, rem, em,