# OneFileLLM

Content Aggregator for LLMs - Aggregate and structure multi-source data into a single XML file for LLM context.

## Description

OneFileLLM is a command-line tool that automates data aggregation from various sources (local files, GitHub repos, web pages, PDFs, YouTube transcripts, etc.) and combines them into a single, structured XML output that is automatically copied to your clipboard for use with Large Language Models.
## Installation

```bash
git clone https://github.com/jimmc414/onefilellm.git
cd onefilellm
pip install -r requirements.txt
```
### Pip install

OneFileLLM is also available as a pip package. You can install it directly and use both the CLI and Python API without cloning the repository:

```bash
pip install onefilellm
```
## Command-Line Interface (CLI)

This project can also be installed as a command-line tool, which allows you to run `onefilellm` directly from your terminal.

### CLI Installation

To install the CLI, run the following command in the project's root directory:

```bash
pip install -e .
```

This installs the package in "editable" mode, meaning any changes you make to the source code are immediately available to the command-line tool.
### CLI Usage

Once installed, you can use the `onefilellm` command instead of `python onefilellm.py`.

Synopsis:

```bash
onefilellm [OPTIONS] [INPUT_SOURCES...]
```

Example:

```bash
onefilellm ./docs/ https://github.com/user/project/issues/123
```

All other command-line arguments and options work the same as the script-based approach.

For GitHub API access (recommended):

```bash
export GITHUB_TOKEN="your_personal_access_token"
```
## Python API

After installing via pip, OneFileLLM can be invoked directly from Python code:

```python
from onefilellm import run

# Process inputs programmatically
run(["./docs/"])
```
## Command Help

```text
usage: onefilellm.py [-h] [-c]
                     [-f {text,markdown,json,html,yaml,doculing,markitdown}]
                     [--alias-add NAME [COMMAND_STRING ...]]
                     [--alias-remove NAME] [--alias-list] [--alias-list-core]
                     [--crawl-max-depth CRAWL_MAX_DEPTH]
                     [--crawl-max-pages CRAWL_MAX_PAGES]
                     [--crawl-user-agent CRAWL_USER_AGENT]
                     [--crawl-delay CRAWL_DELAY]
                     [--crawl-include-pattern CRAWL_INCLUDE_PATTERN]
                     [--crawl-exclude-pattern CRAWL_EXCLUDE_PATTERN]
                     [--crawl-timeout CRAWL_TIMEOUT] [--crawl-include-images]
                     [--crawl-no-include-code] [--crawl-no-extract-headings]
                     [--crawl-follow-links] [--crawl-no-clean-html]
                     [--crawl-no-strip-js] [--crawl-no-strip-css]
                     [--crawl-no-strip-comments] [--crawl-respect-robots]
                     [--crawl-concurrency CRAWL_CONCURRENCY]
                     [--crawl-restrict-path] [--crawl-no-include-pdfs]
                     [--crawl-no-ignore-epubs] [--help-topic [TOPIC]]
                     [inputs ...]

OneFileLLM - Content Aggregator for LLMs

positional arguments:
  inputs                Input paths, URLs, or aliases to process

options:
  -h, --help            show this help message and exit
  -c, --clipboard       Process text from clipboard
  -f {text,markdown,json,html,yaml,doculing,markitdown}, --format {text,markdown,json,html,yaml,doculing,markitdown}
                        Override format detection for text input
  --help-topic [TOPIC]  Show help for specific topic (basic, aliases,
                        crawling, pipelines, examples, config)
```
## Quick Start Examples
### Local Files and Directories
```bash
python onefilellm.py research_paper.pdf config.yaml src/
python onefilellm.py *.py requirements.txt docs/ README.md
python onefilellm.py notebook.ipynb --format json
python onefilellm.py large_dataset.csv logs/ --format text
```

### GitHub Repositories and Issues

```bash
python onefilellm.py https://github.com/microsoft/vscode
python onefilellm.py https://github.com/openai/whisper/tree/main/whisper
python onefilellm.py https://github.com/microsoft/vscode/pull/12345
python onefilellm.py https://github.com/kubernetes/kubernetes/issues?state=all
python onefilellm.py https://github.com/kubernetes/kubernetes/issues?state=open
python onefilellm.py https://github.com/kubernetes/kubernetes/issues?state=closed
```

You can retrieve issues for a repository by specifying the `state` query parameter. Use `state=all` (the default) to fetch all issues, `state=open` for open issues only, or `state=closed` for closed issues.
### Using Specific Branches or Tags

The tool works with specific branches and tags on a GitHub repository. When you supply a GitHub URL that includes a branch (e.g., `https://github.com/openai/whisper/tree/main/whisper`), the tool parses the `tree/` portion of the URL and sends the request with a `ref` parameter so that the specified branch or tag is retrieved.
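The URL parsing described above can be sketched as follows. This is an illustrative stand-in, not OneFileLLM's actual parser, and it assumes single-segment refs (branch or tag names without slashes):

```python
from urllib.parse import urlparse

def parse_github_tree_url(url: str):
    """Split a GitHub /tree/ URL into (owner, repo, ref, subpath).

    Illustrative sketch only -- the tool's real parser may differ.
    """
    parts = urlparse(url).path.strip("/").split("/")
    owner, repo = parts[0], parts[1]
    if len(parts) > 3 and parts[2] == "tree":
        ref = parts[3]                 # branch or tag, e.g. "main"
        subpath = "/".join(parts[4:])  # optional directory inside the repo
    else:
        ref, subpath = None, ""        # plain repo URL: use the default branch
    return owner, repo, ref, subpath

print(parse_github_tree_url("https://github.com/openai/whisper/tree/main/whisper"))
# → ('openai', 'whisper', 'main', 'whisper')
```

The extracted `ref` then maps directly onto the `ref` query parameter of GitHub's contents API.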
### Web Documentation and APIs

```bash
python onefilellm.py https://docs.python.org/3/tutorial/
python onefilellm.py https://react.dev/learn/thinking-in-react
python onefilellm.py https://docs.stripe.com/api
python onefilellm.py https://kubernetes.io/docs/concepts/
```
### Multimedia and Academic Sources

```bash
python onefilellm.py https://www.youtube.com/watch?v=dQw4w9WgXcQ
python onefilellm.py https://arxiv.org/abs/2103.00020
python onefilellm.py arxiv:1706.03762 PMID:35177773
python onefilellm.py doi:10.1038/s41586-021-03819-2
```
### Multiple Inputs

```bash
python onefilellm.py https://github.com/jimmc414/hey-claude https://modelcontextprotocol.io/llms-full.txt https://github.com/anthropics/anthropic-sdk-python https://github.com/anthropics/anthropic-cookbook
python onefilellm.py https://github.com/openai/whisper/tree/main/whisper https://www.youtube.com/watch?v=dQw4w9WgXcQ ALIAS_MCP
python onefilellm.py https://github.com/microsoft/vscode/pull/12345 https://arxiv.org/abs/2103.00020
python onefilellm.py https://github.com/kubernetes/kubernetes/issues https://pytorch.org/docs
```
### Input Streams

```bash
python onefilellm.py --clipboard --format markdown
cat large_dataset.json | python onefilellm.py - --format json
curl -s https://api.github.com/repos/microsoft/vscode | python onefilellm.py -
echo 'Quick analysis task' | python onefilellm.py -
```
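When reading from stdin or the clipboard, a format is detected unless `--format` overrides it. A crude sniffing heuristic along these lines gives the idea; this is an assumption about the general approach, not OneFileLLM's actual detector:

```python
import json

def guess_format(text: str) -> str:
    """Guess the format of raw text input -- illustrative only."""
    stripped = text.strip()
    try:
        json.loads(stripped)      # valid JSON document?
        return "json"
    except ValueError:
        pass
    if stripped.startswith("<"):  # looks like markup
        return "html"
    if any(line.startswith(("#", "- ", "* ")) for line in stripped.splitlines()):
        return "markdown"
    return "text"

print(guess_format('{"repo": "microsoft/vscode"}'))
# → json
```

Passing `--format json` (or any other value) would simply bypass a heuristic like this.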
## Alias System

### Create Simple and Complex Aliases

```bash
python onefilellm.py --alias-add mcp "https://github.com/anthropics/mcp"
python onefilellm.py --alias-add modern-web \
  "https://github.com/facebook/react https://reactjs.org/docs/ https://github.com/vercel/next.js"
```
### Dynamic Placeholders

```bash
# Create placeholders with {}
python onefilellm.py --alias-add gh-search "https://github.com/search?q={}"
python onefilellm.py --alias-add gh-user "https://github.com/{}"
python onefilellm.py --alias-add arxiv-search "https://arxiv.org/search/?query={}"

# Use placeholders dynamically
python onefilellm.py gh-search "machine learning transformers"
python onefilellm.py gh-user "microsoft"
python onefilellm.py arxiv-search "attention mechanisms"
```
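Placeholder expansion amounts to substituting the runtime argument into the stored command string and splitting the result into individual sources. A minimal sketch of that behavior (assumed semantics, not the tool's exact code):

```python
def expand_alias(command_string: str, arg: str = "") -> list[str]:
    """Expand a stored alias into concrete input sources.

    Illustrative sketch: every '{}' in the stored command string is
    replaced by the runtime argument, then the string is split on
    whitespace into separate inputs.
    """
    if "{}" in command_string:
        command_string = command_string.replace("{}", arg)
    return command_string.split()

print(expand_alias("https://github.com/{}", "microsoft"))
# → ['https://github.com/microsoft']
```

Multi-source aliases expand the same way: a command string holding several URLs simply splits into several inputs.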
### Complex Ecosystem Aliases

```bash
python onefilellm.py --alias-add ai-research \
  "arxiv:1706.03762 https://github.com/huggingface/transformers https://pytorch.org/docs"
python onefilellm.py --alias-add k8s-ecosystem \
  "https://github.com/kubernetes/kubernetes https://kubernetes.io/docs/ https://github.com/istio/istio"

# Combine multiple aliases with live sources
python onefilellm.py ai-research k8s-ecosystem modern-web \
  conference_notes.pdf local_experiments/
```
### Alias Management

```bash
python onefilellm.py --alias-list              # Show all aliases
python onefilellm.py --alias-list-core         # Show core aliases only
python onefilellm.py --alias-remove old-alias  # Remove user alias
cat ~/.onefilellm_aliases/aliases.json         # View raw JSON
```

```text
--alias-add NAME [COMMAND_STRING ...]
                      Add or update a user-defined alias. Multiple arguments
                      after NAME will be joined as COMMAND_STRING.
--alias-remove NAME   Remove a user-defined alias.
--alias-list          List all effective aliases (user-defined aliases
                      override core aliases).
--alias-list-core     List only pre-shipped (core) aliases.
```
```text
Web Crawler Options:
  --crawl-max-depth CRAWL_MAX_DEPTH
                        Maximum crawl depth (default: 3)
  --crawl-max-pages CRAWL_MAX_PAGES
                        Maximum pages to crawl (default: 1000)
  --crawl-user-agent CRAWL_USER_AGENT
                        User agent for web requests (default:
                        OneFileLLMCrawler/1.1)
  --crawl-delay CRAWL_DELAY
                        Delay between requests in seconds (default: 0.25)
  --crawl-include-pattern CRAWL_INCLUDE_PATTERN
                        Regex pattern for URLs to include
  --crawl-exclude-pattern CRAWL_EXCLUDE_PATTERN
                        Regex pattern for URLs to exclude
  --crawl-timeout CRAWL_TIMEOUT
                        Request timeout in seconds (default: 20)
  --crawl-include-images
                        Include image URLs in output
  --crawl-no-include-code
                        Exclude code blocks from output
  --crawl-no-extract-headings
                        Exclude heading extraction
  --crawl-follow-links  Follow links to external domains
  --crawl-no-clean-html
                        Disable readability cleaning
  --crawl-no-strip-js   Keep JavaScript code
  --crawl-no-strip-css  Keep CSS styles
  --crawl-no-strip-comments
                        Keep HTML comments
  --crawl-respect-robots
                        Respect robots.txt (default: ignore for backward
                        compatibility)
  --crawl-concurrency CRAWL_CONCURRENCY
                        Number of concurrent requests (default: 3)
  --crawl-restrict-path
                        Restrict crawl to paths under start URL
  --crawl-no-include-pdfs
  --crawl-no-ignore-epubs
```
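The depth, page-count, and pattern flags jointly decide whether a discovered URL gets fetched. The gating logic is roughly as follows; this is an illustrative sketch of the flag semantics, not the crawler's actual code:

```python
import re

def should_crawl(url: str, depth: int, pages_fetched: int,
                 max_depth: int = 3, max_pages: int = 1000,
                 include: str = "", exclude: str = "") -> bool:
    """Apply --crawl-max-depth/--crawl-max-pages limits and the
    --crawl-include-pattern/--crawl-exclude-pattern regex filters
    to one candidate URL (defaults mirror the documented flags)."""
    if depth > max_depth or pages_fetched >= max_pages:
        return False
    if include and not re.search(include, url):
        return False
    if exclude and re.search(exclude, url):
        return False
    return True

print(should_crawl("https://docs.python.org/3/tutorial/intro.html",
                   depth=2, pages_fetched=10, include=r"/tutorial/"))
# → True
```

Between accepted fetches, the crawler would additionally sleep for `--crawl-delay` seconds and cap in-flight requests at `--crawl-concurrency`.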