SkillAgentSearch skills...

LLMCleanPDFReader

This NLP project leverages a quantised LLM to read and correct text extracted from PDFs. Ideal for students, professionals, and data scientists, it helps clean up and organize text data from various documents. Built to run even on small GPUs with 8GB VRAM, it's a fun learning project aimed at making PDF text extraction smarter and cleaner.

Install / Use

/learn @uallende/LLMCleanPDFReader
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

LLMCleanPDFReader

Overview

LLMCleanPDFReader is a project aimed at cleaning up and correcting the text extracted from PDF documents. It utilizes a large language model (Mistral 7B - int4) to correct grammatical errors and separate words that might have been stitched together during the PDF parsing process. This is a learning project and is designed to work efficiently on a small GPU with 8GB VRAM.

Features

  • PDF text extraction
  • Text correction using language models
  • Text chunking for efficient processing
  • Command line interface for easy parameter adjustment

Installation

Clone this repository to your local machine and navigate to the project directory.

git clone https://github.com/uallende/LLMCleanPDFReader.git
cd LLMCleanPDFReader

Install the required Python packages.

pip install -r requirements.txt

Usage

Run the main script and pass in the required arguments.

python main_script.py --doc_path=path/to/document --chunk_size=150 --overlap=0

Contributing

Feel free to fork the project, open a pull request, or report any issues you encounter.

View on GitHub
GitHub Stars4
CategoryEducation
Updated1y ago
Forks0

Languages

Jupyter Notebook

Security Score

55/100

Audited on Apr 23, 2024

No findings