Kosmos
An implementation and adaptation of "Kosmos: An AI Scientist for Autonomous Discovery" (https://arxiv.org/abs/2511.02824), designed to be driven by Claude Code or the API.
An autonomous AI scientist for scientific discovery, implementing the architecture described in Lu et al. (2024).
What is Kosmos?
Kosmos is an open-source implementation of an autonomous AI scientist that can:
- Generate hypotheses from literature and data analysis
- Design experiments to test those hypotheses
- Execute code in sandboxed Docker containers
- Validate discoveries using an 8-dimension quality framework
- Build knowledge graphs to track relationships between concepts
The system runs autonomous research cycles, generating tasks, executing analyses, and synthesizing findings into validated discoveries.
Quick Start
Requirements
- Python 3.11+
- Anthropic API key or OpenAI API key
- Docker (recommended for code execution)
Without Docker, code runs via exec() with static validation. See "Code Execution Security" below.
Installation
git clone https://github.com/jimmc414/Kosmos.git
cd Kosmos
pip install -e .
cp .env.example .env
# Edit .env and set ANTHROPIC_API_KEY or OPENAI_API_KEY
Verify Installation
# Run smoke tests
python scripts/smoke_test.py
# Run unit tests
pytest tests/unit/ -v --tb=short
Run Research Workflow
import asyncio
from kosmos.workflow.research_loop import ResearchWorkflow

async def run():
    workflow = ResearchWorkflow(
        research_objective="Your research question here",
        artifacts_dir="./artifacts"
    )
    result = await workflow.run(num_cycles=5, tasks_per_cycle=10)
    report = await workflow.generate_report()
    print(report)

asyncio.run(run())
CLI Usage
# Run research with default settings
kosmos run "What metabolic pathways differ between cancer and normal cells?" --domain biology
# With budget limit
kosmos run "How can perovskite solar cell efficiency be optimized?" --domain materials --budget 50
# Interactive mode (recommended for first time)
kosmos run --interactive
# Maximum verbosity
kosmos run "Your question" --domain biology --trace
# Real-time streaming display
kosmos run "Your question" --stream
# Streaming with token display disabled
kosmos run "Your question" --stream --no-stream-tokens
# Show system information
kosmos info
# Run diagnostics
kosmos doctor
Features
Core Capabilities
| Feature | Description | Status |
|---------|-------------|--------|
| Research Loop | Multi-cycle autonomous research with hypothesis generation | Complete |
| Literature Search | ArXiv, PubMed, Semantic Scholar integration | Complete |
| Code Execution | Docker-sandboxed Jupyter notebooks | Complete |
| Knowledge Graph | Neo4j-based relationship storage (optional) | Complete |
| Context Compression | Query-based hierarchical compression (20:1 ratio) | Complete |
| Discovery Validation | 8-dimension ScholarEval quality framework | Complete |
| Multi-Provider LLM | Anthropic, OpenAI, LiteLLM (100+ providers) | Complete |
| Budget Enforcement | Cost tracking with configurable limits and enforcement | Complete |
| Error Recovery | Exponential backoff with circuit breaker | Complete |
| Debug Mode | 4-level verbosity with stage tracking | Complete |
| Real-time Streaming | SSE/WebSocket events, CLI --stream flag | Complete |
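The error-recovery behavior listed above can be sketched as exponential backoff with jitter. This is a minimal illustration of the pattern, not the project's API; the circuit-breaker layer is omitted and all names here are assumptions:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() with exponential backoff plus small random jitter.

    Illustrative only: the real implementation also wraps calls in a
    circuit breaker so repeated failures stop hitting the provider.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            # 1s, 2s, 4s, ... plus up to 100ms of jitter
            time.sleep(base_delay * 2 ** attempt + random.random() * 0.1)
```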
Code Execution Security
AI-generated code runs in isolated Docker containers:
| Layer | Implementation |
|-------|---------------|
| Container Isolation | --cap-drop=ALL, no privileged access |
| Network | Disabled (--network=none) |
| Filesystem | Read-only root, tmpfs for scratch |
| Resources | CPU: 2 cores, Memory: 2GB, Timeout: 300s |
| Pooling | Pre-warmed containers reduce cold start |
See: kosmos/execution/sandbox.py, docker_manager.py
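As a sketch, the isolation table above maps onto docker run flags roughly as follows. The helper names, image choice, and invocation are illustrative assumptions, not the project's actual API (that lives in kosmos/execution/sandbox.py):

```python
import subprocess

def build_sandbox_cmd(code, image="python:3.11-slim"):
    """Assemble a docker run command mirroring the isolation layers above."""
    return [
        "docker", "run", "--rm",
        "--cap-drop=ALL",              # drop all Linux capabilities
        "--network=none",              # no network access
        "--read-only",                 # read-only root filesystem
        "--tmpfs", "/tmp",             # writable scratch space only
        "--cpus", "2", "--memory", "2g",
        image, "python", "-c", code,
    ]

def run_sandboxed(code, timeout=300):
    """Execute untrusted code in the container; return its stdout."""
    result = subprocess.run(
        build_sandbox_cmd(code), capture_output=True, text=True, timeout=timeout
    )
    return result.stdout
```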
Without Docker, the system falls back to CodeValidator static analysis + exec(). This fallback is not recommended for untrusted inputs.
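A minimal sketch of what the static-validation fallback could look like, assuming an AST-based import denylist before exec(). The names and the denylist are illustrative; the real CodeValidator is more thorough:

```python
import ast

BLOCKED = {"os", "subprocess", "socket", "shutil"}  # illustrative denylist

def validate(code):
    """Reject code that imports denylisted modules before it ever runs."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = [a.name.split(".")[0] for a in node.names]
            if isinstance(node, ast.ImportFrom) and node.module:
                names.append(node.module.split(".")[0])
            if BLOCKED.intersection(names):
                raise ValueError(f"blocked import: {sorted(names)}")

def run_unsandboxed(code):
    """Last-resort execution path: static checks, then plain exec()."""
    validate(code)
    scope = {}
    exec(code, scope)  # no isolation -- this is why Docker is recommended
    return scope
```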
Agent Architecture
| Agent | Role |
|-------|------|
| Research Director | Master orchestrator coordinating all agents |
| Hypothesis Generator | Generates testable hypotheses from literature |
| Experiment Designer | Creates experimental protocols |
| Data Analyst | Analyzes results and interprets findings |
| Literature Analyzer | Searches and synthesizes papers |
| Plan Creator/Reviewer | Strategic task generation with 70/30 exploration/exploitation |
How Context Compression Works
The system processes literature in batches, not bulk:
- Relevance Sorting: Papers ranked by query relevance before processing
- Batch Size: Top 10 papers per batch
- Statistics Extraction: Regex-based extraction of p-values, sample sizes, effect sizes
- Tiered Summarization:
  - Task: ~42K lines of code to a 2-line summary + extracted stats
  - Cycle: 10 task summaries to a cycle overview
  - Synthesis: 20 cycles to the final narrative
- Detail: Full content lazy-loaded when needed
Effective ratio: ~20:1. See kosmos/compression/compressor.py.
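The statistics-extraction step can be sketched with regexes like these. The patterns are illustrative assumptions; the project's actual patterns live in kosmos/compression/compressor.py:

```python
import re

# Illustrative patterns for common reporting styles ("p < 0.05", "n = 48").
P_VALUE = re.compile(r"p\s*[<=]\s*(0?\.\d+)", re.IGNORECASE)
SAMPLE_SIZE = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)

def extract_stats(text):
    """Pull p-values and sample sizes out of free text."""
    return {
        "p_values": [float(m) for m in P_VALUE.findall(text)],
        "sample_sizes": [int(m) for m in SAMPLE_SIZE.findall(text)],
    }
```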
Configuration
All configuration via environment variables. See .env.example for the full list.
LLM Provider
# Anthropic (default)
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
# OpenAI
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
# LiteLLM (supports 100+ providers including local models)
LLM_PROVIDER=litellm
LITELLM_MODEL=ollama/llama3.1:8b
LITELLM_API_BASE=http://localhost:11434
Budget Control
BUDGET_ENABLED=true
BUDGET_LIMIT_USD=10.00
Budget enforcement raises BudgetExceededError when the limit is reached, gracefully transitioning the research to completion.
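The enforcement pattern can be sketched as a simple cost tracker. Class and method names here are assumptions for illustration, not the project's API:

```python
class BudgetExceededError(Exception):
    """Raised when cumulative spend reaches the configured limit."""

class BudgetTracker:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def record(self, cost_usd):
        """Add one call's cost; raise once the limit is hit."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.limit_usd:
            raise BudgetExceededError(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f}"
            )
```

The caller catches BudgetExceededError and moves to final synthesis rather than aborting mid-cycle.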
Concurrency
Three independent limits in kosmos/config.py:
| Setting | Default | Range |
|---------|---------|-------|
| max_parallel_hypotheses | 3 | 1-10 |
| max_concurrent_experiments | 10 | 1-16 |
| max_concurrent_llm_calls | 5 | 1-20 |
The paper describes 10 parallel tasks; the max_concurrent_experiments default now matches that specification.
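Each of these limits can be enforced with an asyncio.Semaphore. A minimal sketch of the pattern for the LLM-call limit (the coroutine names are illustrative, and the sleep stands in for a provider call):

```python
import asyncio

async def call_llm(prompt, sem):
    async with sem:                  # at most N calls in flight at once
        await asyncio.sleep(0.01)    # stand-in for a real provider call
        return len(prompt)

async def main():
    sem = asyncio.Semaphore(5)       # mirrors max_concurrent_llm_calls
    return await asyncio.gather(
        *(call_llm(f"prompt {i}", sem) for i in range(20))
    )

results = asyncio.run(main())
```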
Optional Services
# Neo4j (optional, for knowledge graph features)
NEO4J_URI=bolt://localhost:7687
NEO4J_PASSWORD=your-password
# Redis (optional, for distributed caching)
REDIS_URL=redis://localhost:6379
Docker Setup for Optional Services
Start Neo4j, Redis, and PostgreSQL with Docker Compose:
# Start all optional services (Neo4j, Redis, PostgreSQL)
docker compose --profile dev up -d
# Or start individual services
docker compose up -d neo4j
docker compose up -d redis
docker compose up -d postgres
# Stop services
docker compose --profile dev down
Service URLs when running via Docker:
- Neo4j Browser: http://localhost:7474 (user: neo4j, password: kosmos-password)
- PostgreSQL: localhost:5432 (user: kosmos, password: kosmos-dev-password)
- Redis: localhost:6379
Semantic Scholar API
Literature search via Semantic Scholar works without authentication. An API key is optional but increases rate limits:
# Optional: Get API key from https://www.semanticscholar.org/product/api
SEMANTIC_SCHOLAR_API_KEY=your-key-here
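For illustration, here is a minimal stdlib-only client against the public Semantic Scholar Graph API. This is a sketch of how the optional key raises rate limits, not the project's literature client; the function names are assumptions:

```python
import json
import os
import urllib.parse
import urllib.request

API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"

def build_request(query, limit=5, api_key=None):
    """Build the search request; attach the API key header only if set."""
    params = urllib.parse.urlencode(
        {"query": query, "limit": limit, "fields": "title,year,abstract"}
    )
    req = urllib.request.Request(f"{API_URL}?{params}")
    key = api_key or os.getenv("SEMANTIC_SCHOLAR_API_KEY")
    if key:
        req.add_header("x-api-key", key)  # optional: raises rate limits
    return req

def search_papers(query, limit=5):
    with urllib.request.urlopen(build_request(query, limit), timeout=30) as resp:
        return json.load(resp).get("data", [])
```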
Debug Mode
# Enable debug mode with level 1-3
DEBUG_MODE=true
DEBUG_LEVEL=2
# Or use CLI flag for maximum verbosity
kosmos run "Your research question" --trace
See docs/DEBUG_MODE.md for comprehensive debug documentation.
Architecture
kosmos/
├── agents/ # Research agents (director, hypothesis, experiment, etc.)
├── compression/ # Context compression (20:1 ratio)
├── core/ # LLM providers, metrics, configuration
│ └── providers/ # Anthropic, OpenAI, LiteLLM with async support
├── execution/ # Docker-based sandboxed code execution
├── knowledge/ # Neo4j knowledge graph (1,025 lines)
├── literature/ # ArXiv, PubMed, Semantic Scholar clients
├── orchestration/ # Plan creation/review, task delegation
├── validation/ # ScholarEval 8-dimension quality framework
├── workflow/ # Main research loop integration
└── world_model/ # State management, JSON artifacts
Project Status
Implementation Completeness
| Category | Percentage | Description |
|----------|------------|-------------|
| Paper gaps | 100% | All 17 paper implementation gaps complete |
| Ready for user testing | 95% | Core research loop, agents, LLM providers, validation |
| Deferred | 5% | Phase 4 production mode (polyglot persistence) |
Fixed Issues (Recent)
| Issue | Description | Status |
|-------|-------------|--------|
| #66 | CLI deadlock - async refactor | ✅ Fixed |
| #67 | SkillLoader domain mapping | ✅ Fixed |
| #68 | Pydantic V2 migration | ✅ Fixed |
| #54-#58 | Critical paper gaps | ✅ Fixed |
| #59 | h5ad/Parquet data formats | ✅ Fixed |
| #69 | R language execution | ✅ Fixed |
| #60 | Figure generation | ✅ Fixed |
| #61 | Jupyter notebook generation | ✅ Fixed |
| #70 | Null model statistical validation | ✅ Fixed |
| #63 | Failure mode detection | ✅ Fixed |
| #62 | Code line provenance | ✅ Fixed |
| #64 | Multi-run convergence framework | ✅ Fixed |
| [#65](https://github.com/
