PDF Analyzer & Markdown Converter

Project Brief & Development Requirements

🎯 Project Overview

Build a comprehensive PDF analysis application that extracts text, images, and tables from uploaded PDFs and outputs everything in clean Markdown format. The system uses Appwrite as the backend service for authentication, storage, and database management.

🏆 Value Proposition

Problem Solved: Manual PDF content extraction is time-consuming and error-prone
Target Users: Researchers, students, content creators, document processors
Unique Features: Complete PDF reconstruction with tables and smart Markdown formatting
Hackathon Appeal: Showcases full-stack development with real-time processing and modern web technologies

🔧 Technical Architecture

Core Stack

Frontend: React 18+ with TypeScript
Backend: Appwrite (BaaS) + Python Functions
Storage: Appwrite Storage
Database: Appwrite Database
Styling: Tailwind CSS
State Management: React Context + useReducer

System Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   React App     │───▶│   Appwrite       │───▶│  Python         │
│   (Frontend)    │    │   (Backend)      │    │  Functions      │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                              │                          │
                              ▼                          ▼
                       ┌──────────────┐         ┌──────────────┐
                       │   Storage &  │         │   PDF        │
                       │   Database   │         │ Processing   │
                       └──────────────┘         └──────────────┘

📋 Functional Requirements

Core Features (MVP)

User Authentication
- Email/password registration and login
- Protected routes for authenticated users
- User session management
PDF Upload & Management
- Drag-and-drop PDF upload interface
- File validation (PDF only, size limits)
- Upload progress indicators
- File management dashboard
PDF Processing Pipeline
- Text extraction from PDF documents
- Image detection and extraction
- Table identification and reconstruction
Markdown Generation
- Convert extracted content to structured Markdown
- Preserve document hierarchy (headers, lists, etc.)
- Include extracted images with proper references
- Format tables in Markdown table syntax
Real-time Processing Status
- Live updates on processing progress
- Error handling and user feedback
- Processing queue management
Results Display & Export
- Preview extracted Markdown content
- Download processed Markdown files
- View extracted images separately
- Copy to clipboard functionality
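
The "Format tables in Markdown table syntax" requirement above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function name `rows_to_markdown_table`, the helper `_escape_cell`, and the choice to treat the first row as the header are all assumptions.

```python
from typing import List, Sequence


def _escape_cell(value: object) -> str:
    """Render one cell: stringify, escape pipes, and flatten newlines."""
    text = "" if value is None else str(value)
    return text.replace("|", "\\|").replace("\n", " ").strip()


def rows_to_markdown_table(rows: Sequence[Sequence[object]]) -> str:
    """Format extracted rows as a Markdown table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    # Pad short rows so every line has the same number of columns.
    padded: List[List[str]] = [
        [_escape_cell(cell) for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    header, body = padded[0], padded[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines.extend("| " + " | ".join(row) + " |" for row in body)
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_markdown_table([["Name", "Qty"], ["Widget", 2], ["Gadget", 5]]))
```

Escaping the pipe character matters because an unescaped `|` inside a cell would otherwise be read as a column separator.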

Advanced Features (Future Improvements)

OCR processing for image content using the Google Cloud Vision API
Batch PDF processing
Document comparison tools
Advanced table formatting options
Export to multiple formats (HTML, DOCX)
Document search and indexing

🏗️ Development Requirements & Guidelines

Code Quality Standards

SOLID Principles: Apply Single Responsibility, Open/Closed, Liskov Substitution, Interface Segregation, and Dependency Inversion
DRY (Don't Repeat Yourself): Extract common logic into reusable functions/components
KISS (Keep It Simple, Stupid): Prefer simple, readable solutions over complex ones
Component Reusability: Extract a reusable component once the same UI pattern is used two or more times

File Structure

src/
├── components/           # Reusable UI components
│   ├── common/          # Generic components (Button, Modal, etc.)
│   ├── forms/           # Form-specific components
│   └── layout/          # Layout components
├── pages/               # Page components
├── services/            # API and external service integrations
├── hooks/               # Custom React hooks
├── contexts/            # React Context providers
├── utils/               # Utility functions
├── types/               # TypeScript type definitions
└── constants/           # Application constants

Component Architecture Guidelines

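// Example: reusable component structure
import React from "react";

interface ComponentProps {
  // Declare the component's props here; `label` is only an illustration
  label: string;
}

const Component: React.FC<ComponentProps> = ({ label }) => {
  // Component logic (hooks, derived values, handlers) goes here
  return <span>{label}</span>;
};

export default Component;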

🔌 API Integration Requirements

Appwrite Configuration

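// appwrite.config.js
import { Client, Account, Databases, Functions, Storage } from "appwrite";

// Endpoint and project ID come from the build-time environment
const client = new Client()
  .setEndpoint(process.env.REACT_APP_APPWRITE_ENDPOINT)
  .setProject(process.env.REACT_APP_APPWRITE_PROJECT_ID);

// One shared instance of each service, all bound to the same client
export const databases = new Databases(client);
export const storage = new Storage(client);
export const functions = new Functions(client);
export const account = new Account(client);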

Database Schema Design

// Collections
const collections = {
  documents: {
    userId: "string",
    originalFileName: "string",
    fileId: "string",  // IMPORTANT: This must be the Appwrite file ID (e.g., "68baf465002390a5d863"), NOT the filename
    status: "pending|processing|completed|failed",
    processingStarted: "datetime",
    processingCompleted: "datetime",
    extractedText: "string",
    markdownContent: "string",
    imageIds: "string[]",  // Array of Appwrite file IDs for extracted images
    tableCount: "integer",
    ocrEnabled: "boolean",
    errorMessage: "string",
    metadata: "object"
  },

  images: {
    documentId: "string",
    fileId: "string",
    originalName: "string",
    ocrText: "string",
    position: "integer",
    boundingBox: "object"
  },

  tables: {
    documentId: "string",
    position: "integer",
    rows: "integer",
    columns: "integer",
    data: "object",
    markdownTable: "string"
  }
};
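
The `status` field above lists four values. One way to keep them consistent on the function side is a small state machine. The transition table below is an assumption (the brief names the states but not the allowed moves), and `DocumentStatus`, `ALLOWED_TRANSITIONS`, and `next_status` are hypothetical names.

```python
from enum import Enum


class DocumentStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed lifecycle: pending -> processing -> completed | failed.
# A pending document may also fail directly (e.g. the file cannot be read).
ALLOWED_TRANSITIONS = {
    DocumentStatus.PENDING: {DocumentStatus.PROCESSING, DocumentStatus.FAILED},
    DocumentStatus.PROCESSING: {DocumentStatus.COMPLETED, DocumentStatus.FAILED},
    DocumentStatus.COMPLETED: set(),
    DocumentStatus.FAILED: set(),
}


def next_status(current: str, target: str) -> str:
    """Return the target status if the move is allowed, else raise ValueError."""
    cur, tgt = DocumentStatus(current), DocumentStatus(target)
    if tgt not in ALLOWED_TRANSITIONS[cur]:
        raise ValueError(f"illegal status transition: {cur.value} -> {tgt.value}")
    return tgt.value
```

Calling `next_status` before every database update would reject, for example, moving a completed document back to processing.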

File Upload API Endpoint

Endpoint: `/upload`

Method: POST
Content-Type: multipart/form-data

Uploads a PDF file for processing. The endpoint creates the storage bucket if it does not already exist, stores the file, and creates the matching database record.

Request Format

Form Data:

Key: file
Type: File
Value: Select your PDF file (e.g., C5_W4.pdf)

Headers:

x-user-id: your-user-id          # Required: User identifier
x-filename: C5_W4.pdf           # Optional: Override filename (if not detected from file)
x-bucket-id: pdf-files          # Optional: Storage bucket (default: pdf-files)
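
A possible way to read these headers inside the function is sketched below. It assumes the headers arrive as a plain string mapping; `UploadHeaders` and `parse_upload_headers` are hypothetical names, and rejecting a missing `x-user-id` follows the "Required" note above rather than any confirmed behaviour.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

DEFAULT_BUCKET_ID = "pdf-files"  # default bucket named in the header list


@dataclass(frozen=True)
class UploadHeaders:
    user_id: str
    filename: Optional[str]
    bucket_id: str


def parse_upload_headers(headers: Mapping[str, str]) -> UploadHeaders:
    """Extract the upload headers, applying the documented defaults."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value.strip() for key, value in headers.items()}
    user_id = lowered.get("x-user-id", "")
    if not user_id:
        raise ValueError("missing required header: x-user-id")
    return UploadHeaders(
        user_id=user_id,
        filename=lowered.get("x-filename") or None,
        bucket_id=lowered.get("x-bucket-id") or DEFAULT_BUCKET_ID,
    )
```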

Postman Setup

Method: POST
URL: https://your-function-url/upload
Body:
- Select "form-data"
- Add key file as File type
- Select your PDF file
Headers:
- x-user-id: your-user-id (required)
- x-filename: C5_W4.pdf (optional, if needed)

Success Response

{
  "success": true,
  "message": "PDF uploaded successfully",
  "documentId": "67b8f1a5002c8e9d1f2a",
  "fileId": "67b8f1a6002c8e9d1f2b",
  "fileName": "default-user_1757340749_C5_W4.pdf",
  "bucketId": "pdf-files"
}
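
The `fileName` in the sample response looks like a user ID, a Unix timestamp, and the original filename joined by underscores. That pattern is inferred from the single example above, not documented, so the sketch below is only an assumption about how such a name could be built; `build_storage_name` is a hypothetical helper.

```python
import time
from typing import Optional


def build_storage_name(
    user_id: str, original_name: str, timestamp: Optional[int] = None
) -> str:
    """Build '<user_id>_<unix_timestamp>_<filename>' (pattern inferred, not confirmed)."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Keep only the final path component so a client-supplied path
    # cannot leak directory segments into the stored name.
    safe_name = original_name.replace("\\", "/").split("/")[-1]
    return f"{user_id}_{ts}_{safe_name}"
```

The timestamp keeps two uploads of the same file by the same user from colliding, provided they do not land in the same second.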

Error Responses

// Missing file
{
  "error": "No file data found in request",
  "usage": "Send PDF file as multipart/form-data with key 'file'"
}

// Storage bucket error
{
  "error": "Failed to create or access bucket 'pdf-files'",
  "available_buckets": ["existing-bucket-1"],
  "suggestion": "Create bucket manually or use existing bucket"
}

// General error
{
  "error": "File upload failed",
  "message": "Detailed error message",
  "type": "ExceptionType"
}
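
To keep these three payload shapes consistent across the handler, they could be produced by small builder functions. The sketch below mirrors the JSON above field for field; the function names are hypothetical and no HTTP status codes are implied, since the brief does not specify them.

```python
from typing import Any, Dict, List


def missing_file_error() -> Dict[str, Any]:
    """Payload for a request that carried no file part."""
    return {
        "error": "No file data found in request",
        "usage": "Send PDF file as multipart/form-data with key 'file'",
    }


def bucket_error(bucket_id: str, available_buckets: List[str]) -> Dict[str, Any]:
    """Payload for a bucket that could not be created or accessed."""
    return {
        "error": f"Failed to create or access bucket '{bucket_id}'",
        "available_buckets": list(available_buckets),
        "suggestion": "Create bucket manually or use existing bucket",
    }


def general_error(exc: BaseException) -> Dict[str, Any]:
    """Payload for any other failure, carrying the exception text and type name."""
    return {
        "error": "File upload failed",
        "message": str(exc),
        "type": type(exc).__name__,
    }
```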

Features

✅ Automatic bucket creation - Creates storage bucket if it doesn't exist
✅ File validation - Validates PDF format and file integrity
✅ Database integration - Creates document records automatically
✅ Error recovery - Cleans up files if database operations fail
✅ Flexible headers - Supports custom user IDs and bucket names
✅ Comprehensive logging - Detailed logs for debugging
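
The "File validation" feature could start with a cheap byte-level check before any parsing library is involved. The sketch below is an assumption about what such a check might look like: the 10 MiB limit, the 1024-byte search windows, and the name `validate_pdf` are all illustrative, and passing this check does not prove the file is a well-formed PDF.

```python
PDF_MAGIC = b"%PDF-"
EOF_MARKER = b"%%EOF"
MAX_SIZE_BYTES = 10 * 1024 * 1024  # assumed limit; the brief does not give one


def validate_pdf(data: bytes, max_size: int = MAX_SIZE_BYTES) -> None:
    """Raise ValueError if the payload is clearly not a usable PDF."""
    if not data:
        raise ValueError("file is empty")
    if len(data) > max_size:
        raise ValueError(f"file exceeds size limit of {max_size} bytes")
    # Look for the header near the start rather than at byte 0 only,
    # to tolerate a small amount of leading junk.
    if PDF_MAGIC not in data[:1024]:
        raise ValueError("not a PDF: missing %PDF- header")
    # A missing end-of-file marker usually means a truncated upload.
    if EOF_MARKER not in data[-1024:]:
        raise ValueError("PDF appears truncated: missing %%EOF marker")
```

Running this before the file is written to storage avoids creating a bucket entry and a database record for input that would fail processing anyway.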

Python Function Structure

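# functions/src/main.py
from .upload_handler import UploadHandler


def main(context):
    # Route upload requests to the handler
    if context.req.path == "/upload":
        upload_handler = UploadHandler(context)
        result = upload_handler.handle_upload(context.req)
        return context.res.json(result)

    # Answer every other path explicitly instead of returning None
    return context.res.json({"error": "Not found"}, 404)


# functions/src/upload_handler.py
from typing import Any, Dict


class UploadHandler:
    def __init__(self, context):
        # Keep the function context for logging and request access
        self.context = context

    def handle_upload(self, request) -> Dict[str, Any]:
        # Extract file data and metadata
        # Validate and process upload
        # Return structured response
        raise NotImplementedError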
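
The "Error recovery" behaviour listed under Features (clean up the file if the database operation fails) could be structured as below. This is a sketch under stated assumptions: the three callables stand in for the real storage and database calls, which the brief does not show, and `upload_with_rollback` is a hypothetical name.

```python
from typing import Any, Callable, Dict


def upload_with_rollback(
    store_file: Callable[[], str],
    create_record: Callable[[str], str],
    delete_file: Callable[[str], None],
) -> Dict[str, Any]:
    """Store the file, then create its record; delete the file if the record fails."""
    file_id = store_file()
    try:
        document_id = create_record(file_id)
    except Exception as exc:
        # Best-effort cleanup so a failed database write leaves no orphaned file.
        try:
            delete_file(file_id)
        except Exception:
            pass  # a cleanup failure must not mask the original error
        return {
            "error": "File upload failed",
            "message": str(exc),
            "type": type(exc).__name__,
        }
    return {
        "success": True,
        "message": "PDF uploaded successfully",
        "documentId": document_id,
        "fileId": file_id,
    }
```

Passing the operations in as callables keeps the rollback logic testable without a live Appwrite project.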

🎨 UI/UX Requirements

Design Principles

Clean & Modern: Minimalist interface with intuitive navigation
Responsive: Mobile-first design approach
Accessible: WCAG 2.1 AA compliance
Performance: Fast loading and smooth interactions

Key UI Components (Reusable)

FileUploader - Drag & drop with progress
ProcessingStatus - Real-time status updates with separate dialogs for upload vs processing
MarkdownPreview - Syntax-highlighted preview
ProgressIndicator - Processing progress with step-by-step indicators
ErrorBoundary - Error handling wrapper
LoadingSpinner - Loading states
Toast - Notifications system
Modal - Dialog wrapper
NavigationBar - Always visible, with user info and logout
ResultsView - Clean markdown-only results display

Color Scheme & Theming

Primary: Blue/Indigo for actions and links
Secondary: Gray for neutral elements
Success: Green for completed states
Warning: Yellow for processing states
Error: Red for error states
Background: Light gray with white cards

📦 Dependencies & Setup

Frontend Dependencies

{
  "dependencies": {
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-router-dom": "^6.8.0",
    "appwrite": "^13.0.0",
    "typescript": "^4.9.0",
    "tailwindcss": "^3.2.0",
    "react-markdown": "^8.0.0",
    "react-syntax-highlighter": "^15.5.0",
    "react-d

