ScaleDP
ScaleDP is an Open-Source extension of Apache Spark for Document Processing
Source Code: <a href="https://github.com/StabRise/ScaleDP/" target="_blank">https://github.com/StabRise/ScaleDP</a>
Quickstart: <a href="https://colab.research.google.com/github/StabRise/scaledp-tutorials/blob/master/1.QuickStart.ipynb" target="_blank">1.QuickStart.ipynb</a>
Documentation: <a href="https://scaledp.stabrise.com/en/latest/" target="_blank">https://scaledp.stabrise.com</a>
Tutorials: <a href="https://github.com/StabRise/ScaleDP-Tutorials/" target="_blank">https://github.com/StabRise/ScaleDP-Tutorials</a>
Welcome to the ScaleDP library
ScaleDP is a library that lets you process documents with AI/ML capabilities and scale that processing using Apache Spark.
LLMs (Large Language Models) and VLMs (Vision Language Models) are used to extract data from text and images, in combination with OCR engines.
Discover pre-trained models for your projects or play with the thousands of models hosted on the Hugging Face Hub.
Key features
Document processing:
- ✅ Loading PDF documents/images into a Spark DataFrame (using the Spark PDF Datasource or as binaryFile)
- ✅ Extracting text and images from PDF documents/images
- ✅ Zero-shot extraction of structured data from text/images using LLM and ML models
- ✅ Can run as a REST API service without a Spark session, for minimal processing latency
- ✅ Streaming mode for processing documents in real time
LLM:
Supports OpenAI-compatible APIs for calling LLM/VLM models (GPT, Gemini, Groq, etc.):
- OCR images/PDF documents using vision LLM models
- Extract data from images using vision LLM models
- Extract data from text/images using LLM models
- Extract data using the DSPy framework
- NER using LLMs
- Visualize results
NLP:
- Extract data from text/images using NLP models from the Hugging Face Hub
- NER using classical ML models
OCR:
Supports various open-source OCR engines:
- Tesseract OCR
- Easy OCR
- Surya OCR
- DocTR
- Vision LLM models
CV:
- Object detection on images using YOLO models
- Text detection on images
Installation
Prerequisites
- Python 3.10 or higher
- Apache Spark 3.5 or higher
- Java 8
Installation using pip
Install the ScaleDP package with pip:
pip install scaledp
Installation using Docker
Build image:
docker build -t scaledp .
Run container:
docker run -p 8888:8888 scaledp:latest
Open Jupyter Notebook in your browser:
http://localhost:8888
Quickstart
Start a Spark session with ScaleDP:
from scaledp import *
spark = ScaleDPSession()
spark
Read example image file:
image_example = files('resources/images/Invoice.png')
df = spark.read.format("binaryFile") \
    .load(image_example)
df.show_image("content")
Output:
<img src="https://github.com/StabRise/ScaleDP/blob/master/images/ImageOutput.png?raw=true" width="400">

Zero-shot data extraction from the image:
from pydantic import BaseModel
import json
class Items(BaseModel):
    date: str
    item: str
    note: str
    debit: str

class InvoiceSchema(BaseModel):
    hospital: str
    tax_id: str
    address: str
    email: str
    phone: str
    items: list[Items]
    total: str
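The extractor is driven by the JSON schema of these models. As a quick sanity check (plain Pydantic v2, no Spark session needed), you can inspect that schema directly; the models are repeated below so the snippet runs standalone:

```python
from pydantic import BaseModel

# Same models as above, repeated so this snippet is self-contained.
class Items(BaseModel):
    date: str
    item: str
    note: str
    debit: str

class InvoiceSchema(BaseModel):
    hospital: str
    tax_id: str
    address: str
    email: str
    phone: str
    items: list[Items]
    total: str

# The JSON schema handed to LLMVisualExtractor via json.dumps(...).
schema = InvoiceSchema.model_json_schema()

# Field names the LLM will be asked to fill:
print(sorted(schema["properties"]))
# ['address', 'email', 'hospital', 'items', 'phone', 'tax_id', 'total']
```

Nested models such as `Items` end up under the schema's `$defs` section, which is how the extractor learns the per-row structure of the `items` list.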
pipeline = PipelineModel(stages=[
    DataToImage(
        inputCol="content",
        outputCol="image"
    ),
    LLMVisualExtractor(
        inputCol="image",
        outputCol="invoice",
        model="gemini-1.5-flash",
        apiKey="",
        apiBase="https://generativelanguage.googleapis.com/v1beta/",
        schema=json.dumps(InvoiceSchema.model_json_schema())
    )
])
result = pipeline.transform(df).cache()
Show the extracted json:
result.show_json("invoice")
<img src="https://github.com/StabRise/ScaleDP/blob/master/images/LLMVisualExtractorJson.png?raw=true" width="400">
Let's show the invoice as structured data in a DataFrame:
result.select("invoice.data.*").show()
Output:
+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+
| hospital| tax_id| address| email| phone| items| total|
+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+
|Hope Haven Hospital|26-123123|855 Howard Street...|hopedutton@hopeha...|(123) 456-1238|[{10/21/2022, App...|1024.50|
+-------------------+---------+--------------------+--------------------+--------------+--------------------+-------+
Schema:
result.printSchema()
root
|-- path: string (nullable = true)
|-- modificationTime: timestamp (nullable = true)
|-- length: long (nullable = true)
|-- image: struct (nullable = true)
| |-- path: string (nullable = false)
| |-- resolution: integer (nullable = false)
| |-- data: binary (nullable = false)
| |-- imageType: string (nullable = false)
| |-- exception: string (nullable = false)
| |-- height: integer (nullable = false)
| |-- width: integer (nullable = false)
|-- invoice: struct (nullable = true)
| |-- path: string (nullable = false)
| |-- json_data: string (nullable = true)
| |-- type: string (nullable = false)
| |-- exception: string (nullable = false)
| |-- processing_time: double (nullable = false)
| |-- data: struct (nullable = true)
| | |-- hospital: string (nullable = false)
| | |-- tax_id: string (nullable = false)
| | |-- address: string (nullable = false)
| | |-- email: string (nullable = false)
| | |-- phone: string (nullable = false)
| | |-- items: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- date: string (nullable = false)
| | | | |-- item: string (nullable = false)
| | | | |-- note: string (nullable = false)
| | | | |-- debit: string (nullable = false)
| | |-- total: string (nullable = false)
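The `json_data` field holds the raw model output as a JSON string, so the same Pydantic schema can validate it back into typed objects off-cluster. A minimal, self-contained sketch (Pydantic v2 only, no Spark); the payload below is illustrative, shaped like the row shown above, with placeholders where the table output is truncated:

```python
from pydantic import BaseModel

# Same schema as in the pipeline above, repeated for self-containment.
class Items(BaseModel):
    date: str
    item: str
    note: str
    debit: str

class InvoiceSchema(BaseModel):
    hospital: str
    tax_id: str
    address: str
    email: str
    phone: str
    items: list[Items]
    total: str

# Illustrative payload shaped like the invoice.json_data column.
# email/item/debit are placeholders (truncated in the table above).
raw = """{
  "hospital": "Hope Haven Hospital",
  "tax_id": "26-123123",
  "address": "855 Howard Street",
  "email": "hopedutton@example.com",
  "phone": "(123) 456-1238",
  "items": [{"date": "10/21/2022", "item": "Appointment", "note": "", "debit": "500.00"}],
  "total": "1024.50"
}"""

invoice = InvoiceSchema.model_validate_json(raw)
print(invoice.hospital, invoice.total)  # Hope Haven Hospital 1024.50
```

Validation fails loudly on malformed output, which makes this a convenient downstream check after collecting extraction results.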
NER using a model from the Hugging Face Hub
Define a pipeline to extract text from the image and run NER:
pipeline = PipelineModel(stages=[
    DataToImage(inputCol="content", outputCol="image"),
    TesseractOcr(inputCol="image", outputCol="text", psm=PSM.AUTO, keepInputData=True),
    Ner(model="obi/deid_bert_i2b2", inputCol="text", outputCol="ner", keepInputData=True),
    ImageDrawBoxes(inputCols=["image", "ner"], outputCol="image_with_boxes", lineWidth=3,
                   padding=5, displayDataList=['entity_group'])
])
result = pipeline.transform(df).cache()
result.show_text("text")
Output:
<img src="https://github.com/StabRise/ScaleDP/blob/master/images/TextOutput.png?raw=true" width="400">

Show NER results:
result.show_ner(limit=20)
Output:
+------------+-------------------+----------+-----+---+--------------------+
|entity_group| score| word|start|end| boxes|
+------------+-------------------+----------+-----+---+--------------------+
| HOSP| 0.991257905960083| Hospital| 0| 8|[{Hospital:, 0.94...|
| LOC| 0.999171257019043| Dutton| 10| 16|[{Dutton,, 0.9609...|
| LOC| 0.9992585778236389| MI| 18| 20|[{MI, 0.93335297,...|
| ID| 0.6838774085044861| 26| 29| 31|[{26-123123, 0.90...|
| PHONE| 0.4669836759567261| -| 31| 32|[{26-123123, 0.90...|
| PHONE| 0.7790696024894714| 123123| 32| 38|[{26-123123, 0.90...|
| HOSP|0.37445762753486633| HOPE| 39| 43|[{HOPE, 0.9525460...|
| HOSP| 0.9503226280212402| HAVEN| 44| 49|[{HAVEN, 0.952546...|
| LOC| 0.9975488185882568|855 Howard| 59| 69|[{855, 0.94682700...|
| LOC| 0.9984399676322937| Street| 70| 76|[{Street, 0.95823...|
| HOSP| 0.3670221269130707| HOSPITAL| 77| 85|[{HOSPITAL, 0.959.