file2txt

Overview

Another tool of ours, txt2stix, takes a .txt file input and then extracts IoCs (indicators of compromise) and TTPs (tactics, techniques and procedures).

However, the file a user wants to process is often not in plain text format (e.g. it is a pdf, docx, etc.).

These files also commonly contain images with text that is useful to extract too.

file2txt is a Python library that takes common file formats and turns them into plain text (a .md file) with Markdown styling to make it as nice as possible to read.

In addition to the printed text, file2txt can also extract text from images found in the input file.

We use file2txt to produce text output that can be scanned for IoCs and TTPs (by txt2stix), but it could be used for a variety of other use-cases as you see fit.

The general flow of file2txt is as follows:

https://miro.com/app/board/uXjVKZXyIxA=/

Download and Install

# clone the latest code
git clone https://github.com/dogesec/file2txt
# create a venv
cd file2txt
python3 -m venv file2txt-venv
source file2txt-venv/bin/activate
# install requirements
pip3 install -r requirements.txt

Configuration options

file2txt has various settings that are defined in an .env file.

To create a template for the file:

cp .env.example .env

To see more information about how to set the variables, and what they do, read the .env.markdown file.
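
If you want to check which variables your .env file actually sets before running the script, a minimal sketch like the one below can help. It is not part of file2txt: the helper names `parse_env_lines` and `read_env_file` are hypothetical, and it assumes the file holds simple KEY=VALUE lines.

```python
from pathlib import Path


def parse_env_lines(lines):
    """Parse simple KEY=VALUE lines into a dict (hypothetical helper)."""
    values = {}
    for raw_line in lines:
        line = raw_line.strip()
        # Skip blank lines, comments, and anything without an "=".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def read_env_file(path=".env"):
    """Read an env file from disk; returns an empty dict if it is missing."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    return parse_env_lines(env_path.read_text(encoding="utf-8").splitlines())


if __name__ == "__main__":
    for name, value in read_env_file().items():
        # Print only whether each variable is set, never the secret itself.
        print(f"{name}: {'set' if value else 'empty'}")
```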

Optional: Add Marker API Key (`MARKER_API_KEY`)

file2txt uses the Marker API to process the following filetypes:

PDF .pdf (application/pdf)
Word .doc (application/msword), .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
PowerPoint .ppt (application/vnd.ms-powerpoint), .pptx (application/vnd.openxmlformats-officedocument.presentationml.presentation)

You only need a Marker API key if you intend to convert one of the above filetypes.

Get your Marker API key here.

Once it's generated, add your API key using the MARKER_API_KEY variable in the .env file.

You do not need a Marker API key if you only intend to convert the following file types:

HTML html (text/html)
HTML Article html_article (text/html)
CSV csv (text/csv)
Image jpg (image/jpg), .jpeg (image/jpeg), .png (image/png), .webp (image/webp)

Optional: Add Google's Cloud Vision API Key (`GOOGLE_VISION_API_KEY`)

file2txt uses Cloud Vision to extract text from images found in the input documents. This feature is optional. If you do not set a Cloud Vision key, you will not be able to use the extract_text_from_image feature.

If you want to use this feature you must set your Cloud Vision credentials as follows...

The project name can be anything you want. It will only be visible to you in the GCP Console.

This app requires the following Google APIs to work:

Cloud Vision API

Go to APIs and Services and create a new API Key. It's a good idea to limit the key's scope to the Cloud Vision API.

Once it's generated, add your API key using the GOOGLE_VISION_API_KEY variable in the .env file.

You do not need a Google API key if you don't want to convert images to text.
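
Taken together, the two optional keys are only needed for certain runs. The sketch below expresses that rule in Python. It is an illustration of the rules described in this README, not file2txt's own code: the function `missing_keys` and the constant `MARKER_MODES` are hypothetical names.

```python
import os

# Modes this README says require a Marker API key.
MARKER_MODES = {"pdf", "word", "powerpoint"}


def missing_keys(mode, extract_text_from_image, env=None):
    """Return the names of API keys a run would need but that are not set.

    Hypothetical helper: `env` defaults to the process environment, which
    assumes the .env values have already been loaded into it.
    """
    env = os.environ if env is None else env
    missing = []
    if mode in MARKER_MODES and not env.get("MARKER_API_KEY"):
        missing.append("MARKER_API_KEY")
    if extract_text_from_image and not env.get("GOOGLE_VISION_API_KEY"):
        missing.append("GOOGLE_VISION_API_KEY")
    return missing


if __name__ == "__main__":
    # A CSV run without image extraction should need neither key.
    print(missing_keys("csv", False, env={}))
    # A PDF run with image extraction needs both.
    print(missing_keys("pdf", True, env={}))
```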

Run

python3 file2txt.py \
	--mode mode \
	--file path/to/file.extension \
	--output my_document \
	--defang boolean \
	--extract_text_from_image boolean

The following flags are used to convert a file to text:

--mode (required, dictionary): must be a supported mode. The mode must support the filetype being used, else an error will be returned.
  - txt
  - md
  - image
  - csv
  - excel
  - html
  - html_article
  - pdf (requires Marker API key)
  - word (requires Marker API key)
  - powerpoint (requires Marker API key)
--file (required, string): path to the file to be converted to text. Note, if the filetype and mimetype of the document submitted do not match one of those supported by file2txt (and set for mode), an error will be returned.
--output (optional, string): name of the output directory. Default is {input_filename}/.
--defang (optional, boolean): if the output should be defanged. Default is true.
--extract_text_from_image (optional, boolean, requires Google Vision API key): if images should be converted to text using OCR. Default is true. This flag MUST be false with csv mode and MUST be true with image mode.
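
The flag rules above can be sketched with argparse. This is only an illustration of the documented behaviour, not how file2txt.py is actually written: `build_parser`, `str_to_bool` and `check_constraints` are hypothetical names, and the error messages are made up.

```python
import argparse

# Modes named in this README ("excel" is used in the Examples section).
MODES = ["txt", "md", "image", "csv", "excel", "html", "html_article",
         "pdf", "word", "powerpoint"]


def str_to_bool(value):
    """Turn the strings 'true' / 'false' into real booleans."""
    lowered = value.lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected true or false, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(prog="file2txt-sketch")
    parser.add_argument("--mode", required=True, choices=MODES)
    parser.add_argument("--file", required=True)
    parser.add_argument("--output", default=None)
    parser.add_argument("--defang", type=str_to_bool, default=True)
    parser.add_argument("--extract_text_from_image", type=str_to_bool, default=True)
    return parser


def check_constraints(args):
    """Return an error message for an invalid flag combination, else None."""
    if args.mode == "csv" and args.extract_text_from_image:
        return "--extract_text_from_image must be false with csv mode"
    if args.mode == "image" and not args.extract_text_from_image:
        return "--extract_text_from_image must be true with image mode"
    return None


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    problem = check_constraints(parsed)
    print(problem or "flags look valid")
```

Note that because extract_text_from_image defaults to true, a csv run under these rules has to pass `--extract_text_from_image false` explicitly.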

The script will output all files to the output/ directory in the following structure:

output
├── {input_filename}
│   ├── {input_filename}.md
│   ├── EXTRACTED_IMAGE_1.FORMAT
│   └── EXTRACTED_IMAGE_2.FORMAT

To ensure images are not lost (in modes that support images), the script also extracts and stores a copy of all identified images in the directory created for the input file.
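
To inspect a finished run from Python, something like the sketch below can list what was written. It assumes the layout shown above (one .md file plus extracted images in the same directory) and treats every non-.md file as an image, which is a simplification. `list_output` is a hypothetical helper, not part of file2txt.

```python
from pathlib import Path


def list_output(run_dir):
    """Split the files in one run directory into markdown files and the rest.

    Assumption: anything that is not a .md file is an extracted image.
    """
    run_dir = Path(run_dir)
    files = sorted(p for p in run_dir.iterdir() if p.is_file())
    markdown = [p for p in files if p.suffix.lower() == ".md"]
    images = [p for p in files if p.suffix.lower() != ".md"]
    return markdown, images


if __name__ == "__main__":
    # "output/my_document" is a placeholder path for illustration only.
    target = Path("output/my_document")
    if target.is_dir():
        md_files, image_files = list_output(target)
        print("markdown:", [p.name for p in md_files])
        print("images:", [p.name for p in image_files])
```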

Examples

You can see the output from the commands below in the examples/ directory of this repository.

If you want to try with the same files I used, read how to download them in tests/README.md.

Turn a CSV into a markdown table:

python3 file2txt.py \
  --mode csv \
  --file tests/files/csv/csv-test.csv \
  --output examples/csv_input \
  --defang true \
  --extract_text_from_image false

And a spreadsheet:

python3 file2txt.py \
  --mode excel \
  --file tests/files/xls/fanged_data.xlsx \
  --output examples/xls_input \
  --defang true \
  --extract_text_from_image false

Convert a PDF document to human-friendly markdown, extract text from images, and defang the text (the most common use-case for cyber-security):

python3 file2txt.py \
  --mode pdf \
  --file tests/files/pdf-real/bitdefender-rdstealer.pdf \
  --output examples/pdf_input \
  --defang true \
  --extract_text_from_image true

Convert only the main article text of a webpage into markdown, extract text from images, and defang the text:

python3 file2txt.py \
  --mode html_article \
  --file tests/files/html-real/unit42-Fighting-Ursa-Luring-Targets-With-Car-for-Sale.html \
  --output examples/html_article_input \
  --defang true \
  --extract_text_from_image true

Now convert the entire HTML content, not just the article:

python3 file2txt.py \
  --mode html \
  --file tests/files/html-real/unit42-Fighting-Ursa-Luring-Targets-With-Car-for-Sale.html \
  --output examples/html_input \
  --defang true \
  --extract_text_from_image true

Do not defang this Word file:

python3 file2txt.py \
  --mode word \
  --file tests/files/doc/fanged_data.docx \
  --output examples/word_input_defang_f \
  --defang false \
  --extract_text_from_image true

Defang this Word file:

python3 file2txt.py \
  --mode word \
  --file tests/files/doc/fanged_data.docx \
  --output examples/word_input_defang_t \
  --defang true \
  --extract_text_from_image true
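
For readers new to the term, defanging rewrites indicators such as URLs and IP addresses so they cannot be clicked or resolved by accident. The sketch below shows one common convention (hxxp for http, [.] for dots). It is an assumption about the general idea only: the exact rules file2txt applies may differ, and `defang` here is a hypothetical function.

```python
import re


def defang(text):
    """Apply a simple, common defanging convention to a string.

    This is deliberately crude: it brackets every dot that sits between
    two word characters, so it also changes things like file names.
    """
    # Rewrite URL schemes first so "http://" and "https://" become "hxxp(s)://".
    text = re.sub(r"\bhttp(s?)://", r"hxxp\1://", text, flags=re.IGNORECASE)
    # Bracket dots inside domain names and IPv4 addresses.
    text = re.sub(r"(?<=\w)\.(?=\w)", "[.]", text)
    return text


if __name__ == "__main__":
    print(defang("http://evil.example.com/a"))
    print(defang("1.2.3.4"))
```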

Now try a PowerPoint:

python3 file2txt.py \
  --mode powerpoint \
  --file tests/files/ppt/fanged_data.pptx \
  --output examples/ppt_input \
  --defang true \
  --extract_text_from_image true

Extract data from a PNG image:

python3 file2txt.py \
  --mode image \
  --file tests/files/image/example-1.png \
  --output examples/image_input \
  --defang true \
  --extract_text_from_image true

See how file2txt deals with markdown inputs:

python3 file2txt.py \
  --mode md \
  --file tests/files/markdown/threat-report.md \
  --output examples/markdown_input \
  --defang true \
  --extract_text_from_image true

Tests

For more examples, you can also run our automated scripts to generate files.

python3 -m unittest tests/test_1_output_file_generation.py

You will need a Google Vision API key in your .env file when running this test.

This script generates output files using a combination of file2txt settings.

You need to check the output manually to ensure it matches expectations.

python3 -m unittest tests/test_2_negative_tests.py

This will test invalid file input settings. All tests are expected to fail.

Debugging

If the script is failing, you can examine the log file printed in logs/ to try and resolve any issues. Each run has its own log, named using execution time (e.g. file2txt_20231127-205228_846248.log).
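
When logs/ fills up, it can be handy to jump straight to the most recent run. The sketch below parses the timestamp out of the log name and picks the newest file. It assumes the name follows the example above and that the trailing digits are microseconds, which this README does not state. `log_timestamp` and `latest_log` are hypothetical helpers.

```python
from datetime import datetime
from pathlib import Path


def log_timestamp(log_name):
    """Parse the execution time from a name like file2txt_20231127-205228_846248.log."""
    stem = Path(log_name).stem        # drop the ".log" extension
    stamp = stem.split("_", 1)[1]     # keep everything after "file2txt_"
    # Assumption: the final group of digits is microseconds.
    return datetime.strptime(stamp, "%Y%m%d-%H%M%S_%f")


def latest_log(log_dir="logs"):
    """Return the newest log file in log_dir, or None if there are none."""
    logs = sorted(Path(log_dir).glob("file2txt_*.log"),
                  key=lambda p: log_timestamp(p.name))
    return logs[-1] if logs else None


if __name__ == "__main__":
    newest = latest_log()
    print(newest if newest else "no logs found")
```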

File types and Input types

You can upload a range of filetypes to file2txt.

File extensions and mimetypes are validated on input for security. If they are not supported, an error is returned.

The input file type determines how the files should be handled.
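
The validation idea can be sketched as a lookup from mode to allowed extensions and mimetypes. The table below only covers the four modes detailed in this section, and the mimetype is passed in by the caller because this README does not say how file2txt detects it. `SUPPORTED` and `validate_input` are hypothetical names, not file2txt's API.

```python
from pathlib import Path

# Extension -> mimetype pairs for the modes detailed in this section.
SUPPORTED = {
    "txt": {".txt": "text/plain"},
    "md": {".md": "text/markdown", ".markdown": "text/markdown"},
    "image": {
        ".jpg": "image/jpg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".webp": "image/webp",
    },
    "csv": {".csv": "text/csv"},
}


def validate_input(mode, file_path, detected_mimetype):
    """Raise ValueError unless the file's extension and mimetype fit the mode."""
    if mode not in SUPPORTED:
        raise ValueError(f"unsupported mode: {mode}")
    extension = Path(file_path).suffix.lower()
    expected = SUPPORTED[mode].get(extension)
    if expected is None:
        raise ValueError(f"{extension or 'no extension'} is not valid for mode {mode}")
    if detected_mimetype != expected:
        raise ValueError(f"mimetype {detected_mimetype} does not match {expected}")


if __name__ == "__main__":
    validate_input("csv", "tests/files/csv/csv-test.csv", "text/csv")
    print("csv input accepted")
```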

Text (mode: `txt`)

Filetypes supported (mime-type): .txt (text/plain)
Embedded images processed using image mode and stored locally: FALSE
Supports paging: FALSE
Python library used for conversion to markdown: n/a

Text (mode: `md`)

Filetypes supported (mime-type): .md (text/markdown), .markdown (text/markdown)
Embedded images processed using image mode and stored locally: TRUE
Supports paging: FALSE
Python library used for conversion to markdown: n/a

Image (mode: `image`)

Filetypes supported (mime-type): .jpg (image/jpg), .jpeg (image/jpeg), .png (image/png), .webp (image/webp)
Embedded images processed using image mode and stored locally: TRUE
Supports paging: FALSE
Python library used for conversion to markdown: n/a

CSV (mode: `csv`)

Filetypes supported (mime-type): .csv (text/csv)
Embedded images processed using image mode and stored locally: FALSE
Supports paging: FALSE
Python library used for conversion to markdown: pandas
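
To give a feel for what the csv mode produces, here is a rough sketch of a CSV to markdown table conversion. file2txt uses pandas for this step. The version below swaps in the standard library csv module so it runs without extra dependencies, and its output formatting may not match file2txt's. `csv_to_markdown` is a hypothetical function.

```python
import csv
import io


def csv_to_markdown(csv_text):
    """Render CSV text as a markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]

    def format_row(cells):
        # Pad short rows to the header width and escape literal pipes.
        cells = list(cells) + [""] * (len(header) - len(cells))
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [format_row(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(format_row(row) for row in body)
    return "\n".join(lines)


if __name__ == "__main__":
    print(csv_to_markdown("indicator,type\n1.2.3.4,ipv4\n"))
```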
