SkillAgentSearch skills...

Onefilellm

Specify a github or local repo, github pull request, arXiv or Sci-Hub paper, Youtube transcript or documentation URL on the web and scrape into a text file and clipboard for easier LLM ingestion

Install / Use

/learn @jimmc414/Onefilellm

README

OneFileLLM

Content Aggregator for LLMs - Aggregate and structure multi-source data into a single XML file for LLM context.

Description

OneFileLLM is a command-line tool that automates data aggregation from various sources (local files, GitHub repos, web pages, PDFs, YouTube transcripts, etc.) and combines them into a single, structured XML output that's automatically copied to your clipboard for use with Large Language Models.

Installation

git clone https://github.com/jimmc414/onefilellm.git
cd onefilellm
pip install -r requirements.txt

Pip install

OneFileLLM is also available as a pip package. You can install it directly and use both the CLI and Python API without cloning the repository:

pip install onefilellm

Command-Line Interface (CLI)

This project can also be installed as a command-line tool, which allows you to run onefilellm directly from your terminal.

CLI Installation

To install the CLI, run the following command in the project's root directory:

pip install -e .

This will install the package in "editable" mode, meaning any changes you make to the source code will be immediately available to the command-line tool.

CLI Usage

Once installed, you can use the onefilellm command instead of python onefilellm.py.

Synopsis: onefilellm [OPTIONS] [INPUT_SOURCES...]

Example:

onefilellm ./docs/ https://github.com/user/project/issues/123

All other command-line arguments and options work the same as the script-based approach.

For GitHub API access (recommended):

export GITHUB_TOKEN="your_personal_access_token"

Python API

After installing via pip, OneFileLLM can be invoked directly from Python code.

from onefilellm import run

# Process inputs programmatically
run(["./docs/"])

Command Help

usage: onefilellm.py [-h] [-c]
                     [-f {text,markdown,json,html,yaml,doculing,markitdown}]
                     [--alias-add NAME [COMMAND_STRING ...]]
                     [--alias-remove NAME] [--alias-list] [--alias-list-core]
                     [--crawl-max-depth CRAWL_MAX_DEPTH]
                     [--crawl-max-pages CRAWL_MAX_PAGES]
                     [--crawl-user-agent CRAWL_USER_AGENT]
                     [--crawl-delay CRAWL_DELAY]
                     [--crawl-include-pattern CRAWL_INCLUDE_PATTERN]
                     [--crawl-exclude-pattern CRAWL_EXCLUDE_PATTERN]
                     [--crawl-timeout CRAWL_TIMEOUT] [--crawl-include-images]
                     [--crawl-no-include-code] [--crawl-no-extract-headings]
                     [--crawl-follow-links] [--crawl-no-clean-html]
                     [--crawl-no-strip-js] [--crawl-no-strip-css]
                     [--crawl-no-strip-comments] [--crawl-respect-robots]
                     [--crawl-concurrency CRAWL_CONCURRENCY]
                     [--crawl-restrict-path] [--crawl-no-include-pdfs]
                     [--crawl-no-ignore-epubs] [--help-topic [TOPIC]]
                     [inputs ...]

OneFileLLM - Content Aggregator for LLMs

positional arguments:
  inputs                Input paths, URLs, or aliases to process

options:
  -h, --help            show this help message and exit
  -c, --clipboard       Process text from clipboard
  -f {text,markdown,json,html,yaml,doculing,markitdown}, --format {text,markdown,json,html,yaml,doculing,markitdown}
                        Override format detection for text input
  --help-topic [TOPIC]  Show help for specific topic (basic, aliases,
                        crawling, pipelines, examples, config)

## Quick Start Examples

### Local Files and Directories
```bash
python onefilellm.py research_paper.pdf config.yaml src/
python onefilellm.py *.py requirements.txt docs/ README.md
python onefilellm.py notebook.ipynb --format json
python onefilellm.py large_dataset.csv logs/ --format text

GitHub Repositories and Issues

python onefilellm.py https://github.com/microsoft/vscode
python onefilellm.py https://github.com/openai/whisper/tree/main/whisper
python onefilellm.py https://github.com/microsoft/vscode/pull/12345
python onefilellm.py https://github.com/kubernetes/kubernetes/issues?state=all
python onefilellm.py https://github.com/kubernetes/kubernetes/issues?state=open
python onefilellm.py https://github.com/kubernetes/kubernetes/issues?state=closed

You can retrieve issues for a repository by specifying the state query parameter. Use state=all (default) to fetch all issues, state=open for open issues only, or state=closed for closed issues.

Using Specific Branches or Tags

Is it possible to use this tool with different branches on a GitHub repository?

Yes. When you supply a GitHub URL that includes a branch (e.g., https://github.com/openai/whisper/tree/main/whisper), the tool parses the tree/ portion and sends the request with a ref parameter so that the specified branch or tag is retrieved.

Web Documentation and APIs

python onefilellm.py https://docs.python.org/3/tutorial/
python onefilellm.py https://react.dev/learn/thinking-in-react
python onefilellm.py https://docs.stripe.com/api
python onefilellm.py https://kubernetes.io/docs/concepts/

Multimedia and Academic Sources

python onefilellm.py https://www.youtube.com/watch?v=dQw4w9WgXcQ
python onefilellm.py https://arxiv.org/abs/2103.00020
python onefilellm.py arxiv:1706.03762 PMID:35177773
python onefilellm.py doi:10.1038/s41586-021-03819-2

Multiple Inputs

python onefilellm.py https://github.com/jimmc414/hey-claude https://modelcontextprotocol.io/llms-full.txt https://github.com/anthropics/anthropic-sdk-python https://github.com/anthropics/anthropic-cookbook
python onefilellm.py https://github.com/openai/whisper/tree/main/whisper https://www.youtube.com/watch?v=dQw4w9WgXcQ ALIAS_MCP
python onefilellm.py https://github.com/microsoft/vscode/pull/12345 https://arxiv.org/abs/2103.00020 
python onefilellm.py https://github.com/kubernetes/kubernetes/issues https://pytorch.org/docs

Input Streams

python onefilellm.py --clipboard --format markdown
cat large_dataset.json | python onefilellm.py - --format json
curl -s https://api.github.com/repos/microsoft/vscode | python onefilellm.py -
echo 'Quick analysis task' | python onefilellm.py -

Alias System

Create Simple and Complex Aliases

python onefilellm.py --alias-add mcp "https://github.com/anthropics/mcp"
python onefilellm.py --alias-add modern-web \
  "https://github.com/facebook/react https://reactjs.org/docs/ https://github.com/vercel/next.js"

Dynamic Placeholders

# Create placeholders with {}
python onefilellm.py --alias-add gh-search "https://github.com/search?q={}"
python onefilellm.py --alias-add gh-user "https://github.com/{}"
python onefilellm.py --alias-add arxiv-search "https://arxiv.org/search/?query={}"

# Use placeholders dynamically
python onefilellm.py gh-search "machine learning transformers"
python onefilellm.py gh-user "microsoft"
python onefilellm.py arxiv-search "attention mechanisms"

Complex Ecosystem Aliases

python onefilellm.py --alias-add ai-research \
  "arxiv:1706.03762 https://github.com/huggingface/transformers https://pytorch.org/docs"
python onefilellm.py --alias-add k8s-ecosystem \
  "https://github.com/kubernetes/kubernetes https://kubernetes.io/docs/ https://github.com/istio/istio"

# Combine multiple aliases with live sources
python onefilellm.py ai-research k8s-ecosystem modern-web \
  conference_notes.pdf local_experiments/

Alias Management

python onefilellm.py --alias-list              # Show all aliases
python onefilellm.py --alias-list-core         # Show core aliases only
python onefilellm.py --alias-remove old-alias  # Remove user alias
cat ~/.onefilellm_aliases/aliases.json         # View raw JSON

  --alias-add NAME [COMMAND_STRING ...]
                        Add or update a user-defined alias. Multiple arguments
                        after NAME will be joined as COMMAND_STRING.
  --alias-remove NAME   Remove a user-defined alias.
  --alias-list          List all effective aliases (user-defined aliases
                        override core aliases).
  --alias-list-core     List only pre-shipped (core) aliases.

Web Crawler Options:
  --crawl-max-depth CRAWL_MAX_DEPTH
                        Maximum crawl depth (default: 3)
  --crawl-max-pages CRAWL_MAX_PAGES
                        Maximum pages to crawl (default: 1000)
  --crawl-user-agent CRAWL_USER_AGENT
                        User agent for web requests (default:
                        OneFileLLMCrawler/1.1)
  --crawl-delay CRAWL_DELAY
                        Delay between requests in seconds (default: 0.25)
  --crawl-include-pattern CRAWL_INCLUDE_PATTERN
                        Regex pattern for URLs to include
  --crawl-exclude-pattern CRAWL_EXCLUDE_PATTERN
                        Regex pattern for URLs to exclude
  --crawl-timeout CRAWL_TIMEOUT
                        Request timeout in seconds (default: 20)
  --crawl-include-images
                        Include image URLs in output
  --crawl-no-include-code
                        Exclude code blocks from output
  --crawl-no-extract-headings
                        Exclude heading extraction
  --crawl-follow-links  Follow links to external domains
  --crawl-no-clean-html
                        Disable readability cleaning
  --crawl-no-strip-js   Keep JavaScript code
  --crawl-no-strip-css  Keep CSS styles
  --crawl-no-strip-comments
                        Keep HTML comments
  --crawl-respect-robots
                        Respect robots.txt (default: ignore for backward
                        compatibility)
  --crawl-concurrency CRAWL_CONCURRENCY
                        Number of concurrent requests (default: 3)
  --crawl-restrict-path
                        Restrict crawl to paths under start URL
  --crawl-no-incl

Related Skills

View on GitHub
GitHub Stars1.9k
CategoryProduct
Updated18h ago
Forks174

Languages

Python

Security Score

100/100

Audited on Mar 30, 2026

No findings