Harvester
Intelligent data acquisition framework for GitHub and web sources
Install / Use
/learn @wzdnzd/Harvester
Category: Development & Engineering
Harvester - Universal Data Acquisition Framework
Harvester is a universal, adaptive data acquisition framework that collects information from multiple sources: GitHub, network mapping platforms (FOFA, Shodan), and arbitrary web endpoints. The current implementation focuses on AI service provider key discovery as a practical example, but the framework is designed to be extended to other data acquisition scenarios.
⭐⭐⭐ If this project helps you, please give it a star! Your support motivates us to keep improving and adding new features.
Project Goals
The project aims to provide a universal data acquisition framework that primarily targets:
- GitHub: Code repositories, issues, commits, and API endpoints
- Network Mapping Platforms: FOFA, Shodan
- Arbitrary Web Endpoints: Custom APIs, web services, and data sources
- Extensible Architecture: Plugin-based system for easy integration of new data sources
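The plugin-based design mentioned in the last item could look roughly like the sketch below. It is an illustration only: `DataSource`, `register`, `create`, and `EchoSource` are hypothetical names, not identifiers taken from the Harvester code base.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Type


class DataSource(ABC):
    """Hypothetical base class for a pluggable data source."""

    name: str = ""

    @abstractmethod
    def fetch(self, query: str) -> Iterator[str]:
        """Yield raw result records for a query."""


# Maps a source name to the class that implements it.
_REGISTRY: Dict[str, Type[DataSource]] = {}


def register(cls: Type[DataSource]) -> Type[DataSource]:
    """Class decorator that adds a data source to the registry."""
    if not cls.name:
        raise ValueError(f"{cls.__name__} must define a non-empty name")
    if cls.name in _REGISTRY:
        raise ValueError(f"duplicate data source: {cls.name}")
    _REGISTRY[cls.name] = cls
    return cls


def create(name: str) -> DataSource:
    """Instantiate a registered data source by name."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise KeyError(f"unknown data source: {name}") from None


@register
class EchoSource(DataSource):
    """Toy source used only to demonstrate registration."""

    name = "echo"

    def fetch(self, query: str) -> Iterator[str]:
        yield f"echo:{query}"
```

With a registry like this, adding a new source means writing one decorated class; callers look it up by name, for example `create("echo").fetch("abc")`.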
Current Data Source Support
| Data Source | Status | Description |
| ----------- | ------------- | --------------------------------------- |
| GitHub API | ✅ Implemented | Full API integration with rate limiting |
| GitHub Web | ✅ Implemented | Web scraping with intelligent parsing |
| FOFA | 🚧 Planned | Cyberspace asset discovery integration |
| Shodan | 🚧 Planned | IoT and network device enumeration |
| Custom APIs | 🚧 Planned | Generic REST/GraphQL API adapter |
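The table mentions rate limiting for the GitHub API integration without describing the mechanism. One common approach is a token bucket; the sketch below shows that technique under the assumption that requests are gated one at a time. `TokenBucket` and its parameters are hypothetical and do not describe Harvester's actual limiter.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical token-bucket limiter: `rate` tokens per second, up to `capacity`."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the caller should wait."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

The injectable `clock` keeps the limiter testable without sleeping; production code would normally leave it at the default `time.monotonic`.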
Architecture
Layered Architecture
graph TB
%% Entry Layer
subgraph Entry["Entry Layer"]
CLI["CLI Interface<br/>(main.py)"]
App["Application Core<br/>(main.py)"]
end
%% Management Layer
subgraph Management["Management Layer"]
TaskMgr["Task Manager<br/>(manager/task.py)"]
Pipeline["Pipeline Manager<br/>(manager/pipeline.py)"]
WorkerMgr["Worker Manager<br/>(manager/worker.py)"]
QueueMgr["Queue Manager<br/>(manager/queue.py)"]
StatusMgr["Status Manager<br/>(manager/status.py)"]
Shutdown["Shutdown Coordinator<br/>(manager/shutdown.py)"]
end
%% Processing Layer
subgraph Processing["Processing Layer"]
StageBase["Stage Framework<br/>(stage/base.py)"]
StageImpl["Stage Implementations<br/>(stage/definition.py)"]
StageReg["Stage Registry<br/>(stage/registry.py)"]
StageFactory["Stage Factory<br/>(stage/factory.py)"]
StageResolver["Dependency Resolver<br/>(stage/resolver.py)"]
end
%% Service Layer
subgraph Service["Service Layer"]
SearchSvc["Search Service<br/>(search/client.py)"]
SearchProviders["Search Providers<br/>(search/provider/)"]
RefineSvc["Query Refinement<br/>(refine/)"]
RefineEngine["Refine Engine<br/>(refine/engine.py)"]
RefineOptimizer["Query Optimizer<br/>(refine/optimizer.py)"]
end
%% Core Domain Layer
subgraph Core["Core Domain Layer"]
Models["Domain Models & Tasks<br/>(core/models.py)"]
Types["Type System<br/>(core/types.py)"]
Enums["Enumerations<br/>(core/enums.py)"]
Metrics["Metrics<br/>(core/metrics.py)"]
Auth["Authentication<br/>(core/auth.py)"]
end
%% Infrastructure Layer
subgraph Infrastructure["Infrastructure Layer"]
Config["Configuration<br/>(config/)"]
Tools["Tools & Utilities<br/>(tools/)"]
Constants["Constants<br/>(constant/)"]
Storage["Storage & Persistence<br/>(storage/)"]
end
%% State Management Layer
subgraph StateLayer["State Management Layer"]
StateCollector["State Collector<br/>(state/collector.py)"]
StateDisplay["Display Engine<br/>(state/display.py)"]
StateBuilder["Status Builder<br/>(state/builder.py)"]
StateModels["State Models<br/>(state/models.py)"]
StateMonitor["State Monitor<br/>(state/monitor.py)"]
StateEnums["State Enums<br/>(state/enums.py)"]
StateTypes["State Types<br/>(state/types.py)"]
end
%% External Systems
subgraph External["External Systems"]
GitHub["GitHub<br/>(API + Web)"]
AIServices["AI Service<br/>Providers"]
FileSystem["File System<br/>(Local Storage)"]
end
%% Dependencies (Top-down)
Entry --> Management
Management --> Processing
Processing --> Service
Service --> Core
%% Infrastructure dependencies
Entry -.-> Infrastructure
Management -.-> Infrastructure
Processing -.-> Infrastructure
Service -.-> Infrastructure
Core -.-> Infrastructure
%% State management dependencies
Entry -.-> StateLayer
Management -.-> StateLayer
%% External dependencies
Service --> External
Infrastructure --> External
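The Processing Layer above lists a Dependency Resolver (stage/resolver.py) alongside the Stage Registry and Stage Factory. A minimal sketch of what dependency resolution between stages might involve is a topological sort, shown here with the standard library's `graphlib` (Python 3.9+). The function name `resolve_order` and the dictionary layout are assumptions; the stage names follow the search, gather, check, inspect pipeline from the system architecture overview.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def resolve_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return stage names so that every stage comes after its dependencies."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular stage dependency: {exc.args[1]}") from exc


# Each stage maps to the set of stages that must run before it.
STAGES: Dict[str, Set[str]] = {
    "search": set(),
    "gather": {"search"},
    "check": {"gather"},
    "inspect": {"check"},
}

if __name__ == "__main__":
    print(resolve_order(STAGES))
```

Rejecting cycles at resolution time means a misconfigured stage graph fails at startup rather than stalling workers later.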
System Architecture Overview
graph TB
%% User Interface Layer
subgraph UserLayer["User Interface Layer"]
User[User]
CLI[Command Line Interface]
ConfigMgmt[Configuration Management]
end
%% Application Management Layer
subgraph AppLayer["Application Management Layer"]
MainApp[Main Application]
TaskManager[Task Manager]
StatusManager[Status Manager]
ResourceManager[Resource Manager]
ShutdownManager[Shutdown Manager]
end
%% Core Pipeline Engine
subgraph PipelineCore["Pipeline Engine"]
%% Stage Management System
subgraph StageSystem["Stage Management System"]
StageRegistry[Stage Registry]
DependencyResolver[Dependency Resolver]
StageFactory[Stage Factory]
end
%% Queue Management System
subgraph QueueSystem["Queue Management System"]
QueueManager[Queue Manager]
WorkerManager[Worker Manager]
MonitoringSystem[System Monitor]
end
%% Processing Stages
subgraph ProcessingStages["Processing Stages"]
SearchStage[Search Stage]
GatherStage[Gather Stage]
CheckStage[Check Stage]
InspectStage[Inspect Stage]
end
end
%% Search Provider Ecosystem
subgraph ProviderEcosystem["Search Provider Ecosystem"]
ProviderRegistry[Provider Registry]
BaseProvider[Base Provider]
OpenAIProvider[OpenAI-like Provider]
CustomProviders[Custom Providers]
end
%% Advanced Processing Engines
subgraph ProcessingEngines["Processing Engines"]
SearchClient[Search Client]
%% Query Optimization Engine
subgraph QueryOptimizer["Query Optimization Engine"]
RefineEngine[Refine Engine]
RegexParser[Regex Parser]
SplittabilityAnalyzer[Splittability Analyzer]
EnumerationOptimizer[Enumeration Optimizer]
QueryGenerator[Query Generator]
OptimizationStrategies[Optimization Strategies]
%% Internal Flow
RefineEngine --> RegexParser
RegexParser --> SplittabilityAnalyzer
SplittabilityAnalyzer --> EnumerationOptimizer
EnumerationOptimizer --> OptimizationStrategies
OptimizationStrategies --> QueryGenerator
end
ValidationEngine[API Key Validation]
RecoveryEngine[Task Recovery]
end
%% State & Data Management
subgraph StateManagement["State & Data Management"]
StateCollector[State Collector]
DisplayEngine[Display Engine]
StatusBuilder[Status Builder]
StateMonitor[State Monitor]
PersistenceLayer[Persistence Layer]
SnapshotManager[Snapshot Manager]
ResultManager[Result Manager]
end
%% Infrastructure Services
subgraph Infrastructure["Infrastructure Services"]
RateLimiting[Rate Limiting]
CredentialMgmt[Credential Management]
AgentRotation[User Agent Rotation]
LoggingSystem[Logging System]
RetryFramework[Retry Framework]
ResourcePool[Resource Pool]
end
%% External Systems
subgraph External["External Systems"]
GitHubAPI[GitHub API]
GitHubWeb[GitHub Web Interface]
AIServiceAPIs[AI Service APIs]
FileSystem[Local File System]
end
%% User Interactions
User --> CLI
User --> ConfigMgmt
CLI --> MainApp
ConfigMgmt --> MainApp
%% Application Flow
MainApp --> TaskManager
MainApp --> StatusManager
MainApp --> ResourceManager
MainApp --> ShutdownManager
TaskManager --> StageRegistry
TaskManager --> QueueManager
%% Stage Management Flow
StageRegistry --> DependencyResolver
StageRegistry --> StageFactory
DependencyResolver --> ProcessingStages
StageFactory --> ProcessingStages
%% Queue Management Flow
QueueManager --> WorkerManager
QueueManager --> MonitoringSystem
WorkerManager --> ProcessingStages
%% Stage Dependencies (Pipeline)
SearchStage --> GatherStage
GatherStage --> CheckStage
CheckStage --> InspectStage
%% Processing Engine Integration
SearchStage --> SearchClient
SearchStage --> QueryOptimizer
CheckStage --> ValidationEngine
ProcessingStages --> RecoveryEngine
%% Provider Integration
SearchClient --> ProviderRegistry
ProviderRegistry --> BaseProvider
BaseProvider --> OpenAIProvider
BaseProvider --> CustomProviders
%% State Management Integration
ProcessingStages --> StateCollector
QueueManager --> StateCollector
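The Query Optimization Engine in the diagram chains a regex parser, a splittability analyzer, an enumeration optimizer, and a query generator. The source does not describe how these work, so the sketch below only illustrates the general idea of splitting one broad pattern into several narrower literal queries by enumerating its first character class. `expand_class`, `split_query`, and the `max_queries` limit are hypothetical, and the sketch assumes the text before the first `[` is a plain literal.

```python
import re
from typing import List

# Literal prefix followed by the first character class, e.g. "id-[a-c]".
_CLASS_RE = re.compile(r"^(?P<prefix>[^\[\]]*)\[(?P<cls>[^\]]+)\]")


def expand_class(cls: str) -> List[str]:
    """Expand a simple class body such as 'a-c0-1' into its characters."""
    chars: List[str] = []
    i = 0
    while i < len(cls):
        if i + 2 < len(cls) and cls[i + 1] == "-":
            # Range such as 'a-c': include both endpoints.
            chars.extend(chr(c) for c in range(ord(cls[i]), ord(cls[i + 2]) + 1))
            i += 3
        else:
            chars.append(cls[i])
            i += 1
    return chars


def split_query(pattern: str, max_queries: int = 64) -> List[str]:
    """Return one literal prefix per class member, or [pattern] if not splittable."""
    match = _CLASS_RE.match(pattern)
    if match is None or match["cls"].startswith("^"):
        return [pattern]  # no leading class, or a negated class we do not enumerate
    chars = list(dict.fromkeys(expand_class(match["cls"])))  # de-duplicate, keep order
    if len(chars) > max_queries:
        return [pattern]  # enumeration would produce too many queries
    return [match["prefix"] + ch for ch in chars]
```

For example, `split_query("id-[a-c]{4}")` yields the three prefixes `id-a`, `id-b`, and `id-c`, each of which can be issued as a separate, narrower search.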
