
<div bgcolor="white" style="display: flex;"> <img src="https://hls-eng-data-public.s3.amazonaws.com/img/Databricks_HLS.png" width="380px" align="center"> <img width="800" height="333" alt="Pixels Logo" src="https://github.com/user-attachments/assets/bee10938-caf3-424f-9941-a53ccf27e546" /> </div>

pixels Solution Accelerator

✅ Ingest and index DICOM image metadata (.dcm files, including from zip archives)
✅ Analyze DICOM image metadata with SQL and machine learning
✅ View, segment, and label DICOM images with the OHIF viewer, integrated into Lakehouse Apps and the Databricks security model
✅ Launch model training from the OHIF viewer with one button push
✅ NVIDIA MONAI integration: AI to automatically segment medical images and train custom models
✅ Leverage Databricks Model Serving with serverless GPU-enabled clusters for real-time segmentation


Secure Lakehouse integrated DICOM Viewer powered by OHIF

<img src="https://github.com/databricks-industry-solutions/pixels/blob/main/images/LHA_AUTOSEG.gif?raw=true" alt="MONAI_AUTOSEG"/>


Run SQL queries over DICOM metadata



Build Dashboards over DICOM metadata

Add any extracted features, too! [Dashboard screenshot]


DICOM data ingestion is easy

# import the Pixels Catalog (indexer) and DICOM transformers & utilities
from dbx.pixels import Catalog
from dbx.pixels.dicom import *

# catalog all of your files
catalog = Catalog(spark)
catalog_df = catalog.catalog(<path>)

# extract the DICOM metadata
meta_df = DicomMetaExtractor(catalog).transform(catalog_df)

# save your work for SQL access
catalog.save(meta_df)

You'll find this example in the 01-dcm-demo notebook.


Architecture


The image depicts the Pixels Reference Solution Architecture, which outlines a data processing and analytics framework designed for healthcare or imaging applications. Here's a breakdown of its components:

Key Functional Areas

  1. AI/BI Analytics: Supports cohort building and natural language-based analysis.

  2. Lakehouse Apps: Includes an OHIF Viewer for labeling and customer-specific applications.

  3. Deep Learning: Facilitates active learning and customer model training.

  4. Realtime Inferencing: Implements MONAI (Medical Open Network for AI) segmentation, integrated with the OHIF viewer. Customer-provided proprietary models can be plugged in easily.

Data Flow: Batch, Incremental, Streaming Lakeflow

The architecture processes data in stages:

  1. Acquire: Reads data from ADLS, S3, or GCS cloud storage, governed by Unity Catalog (UC) Volumes. Because of the composable nature of the solution accelerator, sources such as VNA, PACS, CIFS, and AWS HealthImaging can be added based on customer demand.

  2. Ingest: All DICOM files are ingested. Ingesting and producing NIfTI file formats is currently on the roadmap.

  3. Extract & Index: Unzips files, storing the extracted DICOM files into a UC volume. All of the DICOM metadata tags are extracted and stored in Databricks Data Intelligence Platform tables.

  4. Protect – Metadata: Applies PHI (Protected Health Information) redaction via format preserving encryption to all necessary tags.

  5. Protect – Image: Ensures PHI redaction for pixel-level data. This is under active integration based on work Databricks has done in previous solution accelerators.

  6. Inferencing: Utilizes industry-standard, pre-trained, open-source MONAI models sponsored by NVIDIA. Similarly, customers can fine-tune the MONAI models or bring their own segmentation or featurization models.

Supporting Layers

  • Governance Layer: Unity Catalog provides data access controls and automatic capture of data lineage (including models).

  • Customer’s Cloud Storage: Stores object indexes, folders, and ML models in open formats in the customer's account.

  • Open Access: Provides APIs, SQL connections, Spark integration, and credential vending via Delta Sharing.

This architecture is designed to handle healthcare imaging data securely while enabling advanced analytics and AI-driven insights.

DICOMweb Apps Reference

For the Databricks Apps architecture and operations guide (viewer app, gateway app, QIDO/WADO/STOW implementation, caching, metrics, and config reference), see README_DICOMWEB.md.

The notebook-driven OHIF/MONAI sections in this README remain valid for interactive workspace workflows. For production DICOMweb deployments with the split dicom_web + dicom_web_gateway Databricks Apps architecture, use README_DICOMWEB.md as the source of truth.


Getting started

  1. To run this accelerator, clone this repo into a Databricks workspace.

  2. Attach a notebook to Serverless Compute or a cluster (>=DBR 14.3 LTS)

  3. Run config/setup.py from the notebook. This will install the pixels package into your workspace.

  4. If you need additional libraries to decode or encode DICOM pixel data, use the pydicom guidance to pick the right optional codec package(s): pydicom pixel data decompression guide.

Run pipeline as a job

  1. Attach the RUNME notebook to Serverless Compute or a cluster (>=DBR 14.3 LTS).
  2. Execute the notebook via Run-All.

You can configure the notebook tasks run by the job in job_json. A multi-step job describing the accelerator pipeline will be created, and the link will be provided. The cost associated with running the accelerator is the user's responsibility.

Incremental processing

Pixels allows you to ingest DICOM files in a streaming fashion using Auto Loader. To enable incremental processing, set streaming and streamCheckpointBasePath as follows:

catalog_df = catalog.catalog(path, streaming=True, streamCheckpointBasePath=<checkpointPath>)

Optional: managed file events with Auto Loader

For higher scalability, you can enable managed file events for discovery instead of directory listing.

catalog_df = catalog.catalog(
  path,
  streaming=True,
  streamCheckpointBasePath=<checkpointPath>,
  useManagedFileEvents=True,
  includeExistingFiles=True,
  allowOverwrites=False,
  maxFileAge="90 days"
)

Best practices:

  • Use Unity Catalog Volumes or external locations governed by Unity Catalog.
  • Ensure the stream runs at least once every 7 days to keep file events warm.
  • Keep allowOverwrites=False unless upstream systems can overwrite files.
  • Use maxFileAge to bound discovery windows for large/high-churn landing zones.
  • Reuse a stable checkpoint path across runs to avoid reprocessing.

Built-in unzip

Automatically extracts zip files in the defined volume path. If extractZip is not enabled, zip files are ignored. To enable the unzip capability, set extractZip. The extractZipBasePath parameter is optional; the default path is the volume path plus /unzipped/.

catalog_df = catalog.catalog(path, extractZip=True, extractZipBasePath=<unzipPath>)

Metadata Anonymization

Pixels provides a feature to anonymize DICOM metadata to ensure patient privacy and compliance with regulations. This feature can be enabled during the cataloging process. An example can be explored in the 03-Metadata-DeIdentification notebook.

To enable metadata anonymization, you can use the following extractor:

metadata_df = DicomMetaAnonymizerExtractor(
   catalog,
   anonym_mode="METADATA",
   fp_key=<fp_key>, #ONLY HEX STRING ALLOWED - 128, 192 or 256 bits
   fp_tweak=<fp_tweak>,   #ONLY HEX STRING ALLOWED - 64 bits
   anonymization_base_path=<anonym_path>
).transform(catalog_df)

fp_key is the format preserving encryption key used to ensure that the anonymization process is consistent across different runs. This key is used to generate pseudonyms for sensitive data fields, ensuring that the same input value always maps to the same pseudonym. This is useful for maintaining the ability to link records across datasets without revealing the original sensitive information.

fp_tweak is an optional parameter that can be used to add an additional layer of randomness to the pseudonymization process. This can be useful for further enhancing privacy.
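Per the comments in the snippet above, fp_key must be a hex string of 128, 192, or 256 bits, and fp_tweak a 64-bit hex string. A minimal sketch for generating suitable values with Python's standard library (the variable names are illustrative; pass the values to DicomMetaAnonymizerExtractor as shown above):

```python
import secrets

# 256-bit key -> 32 random bytes -> 64 hex characters
fp_key = secrets.token_hex(32)

# 64-bit tweak -> 8 random bytes -> 16 hex characters
fp_tweak = secrets.token_hex(8)

# both are hex strings of the expected length
assert len(fp_key) == 64 and len(fp_tweak) == 16
```

In practice you would generate these once and store them securely (for example, in a Databricks secret scope), since the key must remain stable across runs for the pseudonyms to stay consistent, as noted above.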

By setting the anonym_mode parameter to "METADATA", the DICOM metadata will be anonymized during the ingestion process. This ensures that sensitive patient information is not stored in the catalog. The default configuration will save the anonymized DICOM files under anonymization_base_path property's path.

Remove UN Tags

DICOM files can contain elements with Value Representation UN (Unknown), which are tags that could not be resolved to a specific VR during parsing. These tags often carry unstructured or proprietary data that can bloat the extracted metadata, cause serialization issues, or introduce noise in downstream analytics.

Pixels provides a built-in option to strip all UN VR elements from the dataset before metadata extraction. The removal is recursive, so UN elements nested inside sequences are removed as well.
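The recursive removal can be sketched in plain Python over a simplified representation: a dataset as a dict mapping tag to a (VR, value) pair, where SQ values hold lists of nested datasets. This illustrates the logic only; it is not the pixels implementation, which operates on parsed DICOM datasets.

```python
def strip_un_elements(dataset: dict) -> dict:
    """Recursively drop elements whose VR is UN, descending into SQ items."""
    cleaned = {}
    for tag, (vr, value) in dataset.items():
        if vr == "UN":
            continue  # unknown VR: drop the element entirely
        if vr == "SQ":
            # sequence items are themselves datasets; clean each one
            value = [strip_un_elements(item) for item in value]
        cleaned[tag] = (vr, value)
    return cleaned

# toy dataset: one known tag, one UN tag, and a sequence containing a nested UN tag
ds = {
    "00100010": ("PN", "DOE^JANE"),
    "00110001": ("UN", b"\x00\x01"),
    "00081115": ("SQ", [{"0008103E": ("LO", "Series"), "00110002": ("UN", b"")}]),
}
clean = strip_un_elements(ds)
# both UN elements are gone, including the one nested in the sequence
```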
