Html4docx
Convert html to docx
HTML FOR DOCX
Convert HTML to DOCX. This project is a fork of the discontinued pqzx/html2docx.
How to install
pip install html-for-docx
Usage
The basic usage
Add HTML-formatted content to an existing .docx document
from html4docx import HtmlToDocx
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, filename_docx)
You can also use python-docx to manipulate the document directly. Here is an example:
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)
document.save('your_file_name.docx')
Or add new HTML to the document incrementally and save it when finished. New content is always appended at the end:
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
for part in ['First', 'Second', 'Third']:
    parser.add_html_to_document(f'<h1>{part} Part</h1>', document)
parser.save('your_file_name.docx')
When you pass a Document object, you can save with either document.save() from python-docx or parser.save() from html4docx; both work.
Both also support saving in memory, using BytesIO.
from io import BytesIO
from docx import Document
from html4docx import HtmlToDocx
buffer = BytesIO()
document = Document()
parser = HtmlToDocx()
html_string = '<h1>Hello world</h1>'
parser.add_html_to_document(html_string, document)
# Save the document to the in-memory buffer
parser.save(buffer)
# If you need to read from the buffer again after saving,
# you might need to reset its position to the beginning
buffer.seek(0)
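The reason for buffer.seek(0) can be seen with a plain BytesIO, independent of html4docx: writing leaves the position at the end of the buffer, so a following read() returns nothing until the position is reset. A minimal sketch, using placeholder bytes rather than a real .docx:

```python
from io import BytesIO

buffer = BytesIO()
buffer.write(b"placeholder bytes standing in for a saved .docx")

# After writing, the position sits at the end of the buffer.
at_end = buffer.read()

# Rewind before handing the buffer to code that reads from the current position.
buffer.seek(0)
from_start = buffer.read()

# getvalue() returns the full contents regardless of the current position.
everything = buffer.getvalue()
```

If the consumer only needs the raw bytes (for example, to attach them to an HTTP response), getvalue() avoids the rewind entirely.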
Convert files directly
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.parse_html_file(input_html_file_path, output_docx_file_path)
# You can also specify an encoding; the default is utf-8
parser.parse_html_file(input_html_file_path, output_docx_file_path, 'utf-8')
Convert HTML from a string
from html4docx import HtmlToDocx
parser = HtmlToDocx()
docx = parser.parse_html_string(input_html_file_string)
Change table styles
Tables are not styled by default. Set the table_style attribute on the parser before converting the HTML. The style is applied to all tables.
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.table_style = 'Light Shading Accent 4'
docx = parser.parse_html_string(input_html_file_string)
To add borders to tables, use the Table Grid style:
parser.table_style = 'Table Grid'
All table styles we support can be found here.
Options
There are 7 options that you can use to customize the conversion:
- Disable Images: Ignore all images.
- Disable Tables: Render only the raw content of tables.
- Disable Styles: Ignore all CSS styles. Also disables Style-Map.
- Disable Fix-HTML: Skip the BeautifulSoup pass that fixes possibly missing HTML tags.
- Disable Style-Map: Ignore the mapping of CSS classes to Word styles.
- Disable Tag-Override: Ignore HTML tag overrides.
- Disable HTML-Comments: Ignore all "<!-- ... -->" comments from HTML.
This is how you could disable them if you want:
from html4docx import HtmlToDocx
parser = HtmlToDocx()
parser.options['images'] = False # Default True
parser.options['tables'] = False # Default True
parser.options['styles'] = False # Default True
parser.options['fix-html'] = False # Default True
parser.options['html-comments'] = False # Default False
parser.options['style-map'] = False # Default True
parser.options['tag-override'] = False # Default True
docx = parser.parse_html_string(input_html_file_string)
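When several options are switched off at once, a small helper can catch misspelled option names early. The sketch below is not part of html4docx: disable_options is a hypothetical helper, and it assumes parser.options is a plain dict that already contains the keys listed above. A stand-in object is used so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def disable_options(parser, *names):
    """Set each named option to False, rejecting names the parser does not know."""
    unknown = [name for name in names if name not in parser.options]
    if unknown:
        raise KeyError(f"Unknown option(s): {', '.join(unknown)}")
    for name in names:
        parser.options[name] = False
    return parser


# Stand-in with the option keys and defaults listed above; with the real
# library you would pass an HtmlToDocx() instance instead (assumption).
fake_parser = SimpleNamespace(options={
    'images': True, 'tables': True, 'styles': True, 'fix-html': True,
    'html-comments': False, 'style-map': True, 'tag-override': True,
})
disable_options(fake_parser, 'images', 'tables')
```

A misspelled key such as 'image' raises KeyError here instead of silently adding a new entry to the dict.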
Extended Styling Features
CSS Class to Word Style Mapping
Map HTML CSS classes to Word document styles:
from html4docx import HtmlToDocx
style_map = {
    'code-block': 'Code Block',
    'numbered-heading-1': 'Heading 1 Numbered',
    'finding-critical': 'Finding Critical'
}
parser = HtmlToDocx(style_map=style_map)
parser.add_html_to_document(html, document)
Tag Style Overrides
Override default tag-to-style mappings:
tag_overrides = {
    'h1': 'Custom Heading 1',  # All <h1> use this style
    'pre': 'Code Block'        # All <pre> use this style
}
parser = HtmlToDocx(tag_style_overrides=tag_overrides)
Custom styles from a Word template: Use a document created from a .docx that already defines the styles (e.g. "Code Block", "Custom Markdown"). Pass that same document to the parser and save it so the custom styles are preserved:
from docx import Document
from html4docx import HtmlToDocx
doc = Document("path/to/template.docx") # template has Code Block, Custom Markdown, etc.
parser = HtmlToDocx(tag_style_overrides={"code": "Custom Markdown", "pre": "Code Block"})
parser.add_html_to_document(html, doc)
doc.save("output.docx") # save the template-based doc so custom styles are preserved
If you save a different document (for example, by creating a new Document() instead of loading your template), the output file will not contain the template’s custom styles.
If a referenced custom style does not exist in the document at generation time, a warning will be logged to help you detect the missing style.
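To detect missing styles before conversion rather than from the logged warning, the style names can be compared against the template up front. This is a sketch under assumptions: missing_styles is a hypothetical helper, and it relies only on each style object exposing a .name attribute, as the items of python-docx's doc.styles do. Stand-in objects are used so the snippet runs on its own.

```python
from types import SimpleNamespace


def missing_styles(available_styles, wanted_names):
    """Return the wanted style names that are absent, preserving their order."""
    available = {style.name for style in available_styles}
    return [name for name in wanted_names if name not in available]


# Stand-ins for style objects; with python-docx you would pass doc.styles.
template_styles = [SimpleNamespace(name="Normal"), SimpleNamespace(name="Code Block")]
tag_overrides = {"code": "Custom Markdown", "pre": "Code Block"}

absent = missing_styles(template_styles, tag_overrides.values())
print(absent)  # ['Custom Markdown']
```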
Default Paragraph Style
Set a custom default paragraph style:
# Use 'Body' as default (default behavior)
parser = HtmlToDocx(default_paragraph_style='Body')
# Use Word's default 'Normal' style
parser = HtmlToDocx(default_paragraph_style=None)
Inline CSS Styles
Full support for inline CSS styles on any element:
<p style="color: red; font-size: 14pt">Red 14pt paragraph</p>
<span style="font-weight: bold; color: blue">Bold blue text</span>
Supported CSS properties:
- color
- font-size
- font-weight (bold)
- font-style (italic)
- text-decoration (underline, line-through)
- font-family
- text-align
- background-color
- border (for tables)
- vertical-align (for tables)
!important Flag Support
Proper CSS precedence with !important:
<span style="color: gray">
Gray text with <span style="color: red !important">red important</span>.
</span>
The !important flag ensures highest priority.
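As an illustration of what recognizing the flag involves, the sketch below parses an inline style string into properties and marks which declarations carry !important. It is not html4docx's actual parser; parse_inline_style is a hypothetical name, and the splitting on ';' is a simplification.

```python
def parse_inline_style(style):
    """Parse 'prop: value; ...' into {prop: (value, is_important)}."""
    declarations = {}
    for chunk in style.split(';'):
        if ':' not in chunk:
            continue  # skip empty or malformed pieces
        # Split on the first colon only, so values containing colons stay whole.
        prop, value = chunk.split(':', 1)
        value = value.strip()
        important = value.lower().endswith('!important')
        if important:
            value = value[:-len('!important')].strip()
        declarations[prop.strip().lower()] = (value, important)
    return declarations


parsed = parse_inline_style("color: red !important; font-size: 14pt")
print(parsed)  # {'color': ('red', True), 'font-size': ('14pt', False)}
```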
Style Precedence Order
Styles are applied in this order (lowest to highest priority):
- Base HTML tag styles (<b>, <em>, <code>)
- Parent span styles
- CSS class-based styles (from style_map)
- Inline CSS styles (from style attribute)
- !important inline CSS styles (highest priority)
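The precedence order above can be modeled as layers applied from lowest to highest priority, where each later layer overwrites properties set by earlier ones. This is a conceptual sketch only, not the library's implementation; the level names and resolve_styles are hypothetical.

```python
# Levels from lowest to highest priority, mirroring the list above.
PRECEDENCE = ['tag', 'parent_span', 'class', 'inline', 'inline_important']


def resolve_styles(sources):
    """Merge per-level style dicts; higher-priority levels overwrite lower ones."""
    resolved = {}
    for level in PRECEDENCE:
        resolved.update(sources.get(level, {}))
    return resolved


resolved = resolve_styles({
    'tag': {'font-weight': 'bold'},                   # e.g. from <b>
    'class': {'color': 'navy', 'font-size': '11pt'},  # via style_map
    'inline': {'color': 'red'},                       # style attribute
    'inline_important': {'font-size': '14pt'},        # style attribute with !important
})
print(resolved)  # {'font-weight': 'bold', 'color': 'red', 'font-size': '14pt'}
```

The inline color replaces the class color, and the !important font size replaces the class font size, while the untouched font-weight from the tag level survives.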
Metadata
You're able to read or set docx metadata:
from docx import Document
from html4docx import HtmlToDocx
document = Document()
parser = HtmlToDocx()
parser.set_initial_attrs(document)
metadata = parser.metadata
# You can get metadata as dict
metadata_json = metadata.get_metadata()
print(metadata_json['author']) # Jane
# or just print all metadata if you want
metadata.get_metadata(print_result=True)
# Set new metadata
metadata.set_metadata(author="Jane", created="2025-07-18T09:30:00")
document.save('your_file_name.docx')
You can find all available metadata attributes here.
Why
My goal in forking and fixing/updating this package was to complete my current task at work, which involves converting HTML to DOCX. The original package lacked a few features and had some bugs, preventing me from completing the task. Instead of creating a new package from scratch, I preferred to update this one.
Differences (fixes and new features)
Fixes
- Fix table_style not working | Dfop02 from Issue
- Handle missing run for leading br tag | dashingdove from PR
- Fix base64 images | djplaner from Issue
- Handle img tag without src attribute | johnjor from PR
- Fix bug when any style has !important | Dfop02
- Fix 'style lookup by style_id is deprecated.' | Dfop02
- Fix background-color not working | Dfop02
- Fix crashes when img or bookmark is created without paragraph | Dfop02
- Fix Ordered and Unordered Lists | TaylorN15 from PR
- Fixed styles only being applied to span tags. | Dfop02 from Issue
- Fixed bug in style parsing when a style contains multiple colons. | Dfop02
- Fixed highlighting a single word | Lynuxen
- Fix color parsing failing due to invalid colors, falling back to black. | dfop02 from Issue
New Features
- Add Width/Height style to images | maifeeulasad from PR
- Support px, cm, pt, in, rem, em,
