Websight
Vision-first browser agents based on Websight-7B, a custom 7B-parameter multimodal computer-use model.
Installation
pip install websight
# or
uv add websight
Quickstart
Call the model directly on an image:
from websight import websight_call

action = websight_call(
    prompt="Click the Login button",
    history=[],  # prior (reasoning, action) pairs if you have them
    image_base64="data:image/png;base64,<...>",
)
print(action.action)  # e.g., "click"
print(action.args)    # e.g., {"x": "175", "y": "514"}
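The image_base64 argument takes a PNG screenshot encoded as a data URI. A minimal sketch of building that string from raw PNG bytes (the helper name png_to_data_uri and the placeholder bytes are illustrative, not part of the library):

```python
import base64

def png_to_data_uri(png_bytes: bytes) -> str:
    # Encode the raw PNG bytes and prepend the standard data-URI prefix
    # that the quickstart example above expects.
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"

# Placeholder bytes for illustration; in practice, pass the bytes of a
# real screenshot, e.g. open("screenshot.png", "rb").read().
uri = png_to_data_uri(b"\x89PNG...")
print(uri[:22])  # data:image/png;base64,
```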
Reference
- websight.websight_call
def websight_call(
    prompt: str,
    history: list[tuple[str, str]],
    image_base64: str,
    console: rich.console.Console | None = None,
    max_new_tokens: int = 1000,
) -> Action
Calls the Websight VLM with a screenshot and instruction, returning a structured Action.
- websight.Action
class Action(BaseModel):
    action: str           # e.g. "click", "drag", "type", "scroll", ...
    args: dict[str, str]  # e.g. {"x": "175", "y": "514"}
    reasoning: str        # model rationale
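A returned Action can be routed to concrete browser calls with a small dispatch table. A sketch under stated assumptions: FakePage stands in for a real Playwright page so the example runs standalone, and the coordinate coercion reflects the string-valued args shown above.

```python
class FakePage:
    """Stand-in for a Playwright page; records calls instead of driving a browser."""
    def __init__(self):
        self.log = []
    def click(self, x, y):
        self.log.append(("click", x, y))
    def type_text(self, text):
        self.log.append(("type", text))

def execute(page, action_name: str, args: dict[str, str]) -> None:
    # args values are strings per the Action schema, so coerce coordinates to int.
    if action_name == "click":
        page.click(int(args["x"]), int(args["y"]))
    elif action_name == "type":
        page.type_text(args["text"])
    else:
        raise ValueError(f"unhandled action: {action_name}")

page = FakePage()
execute(page, "click", {"x": "175", "y": "514"})
print(page.log)  # [('click', 175, 514)]
```

With a real Playwright page you would swap FakePage for the live page object and map each action name onto the corresponding mouse/keyboard call.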
- websight.Agent
from websight.agent import Agent
agent = Agent(show_browser=False)
result = agent.run("Open https://example.com and search for 'websight'", max_iterations=10)
Agent implements a basic loop on top of Playwright: take a screenshot, call websight_call, parse and execute the predicted action, then repeat until the model emits finished(...) or max_iterations is reached.
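The loop described above can be sketched as follows. All three helpers are stubs standing in for Playwright capture, websight_call, and action dispatch; none of them is the library's real implementation.

```python
def screenshot(step: int) -> str:
    # Stub for a Playwright screenshot; returns a label instead of image bytes.
    return f"image-{step}"

def call_model(image: str, history: list[tuple[str, str]]) -> tuple[str, str]:
    # Stub for websight_call; pretends the model finishes after two actions.
    name = "click" if len(history) < 2 else "finished"
    return name, f"reasoning for {image}"

def execute(action_name: str) -> None:
    pass  # stub for dispatching the action to the browser

def run(max_iterations: int = 10) -> str:
    history: list[tuple[str, str]] = []
    for step in range(max_iterations):
        image = screenshot(step)
        action_name, reasoning = call_model(image, history)
        if action_name == "finished":
            return f"done after {step} actions"
        execute(action_name)
        # Record this step so the next call sees prior (reasoning, action) pairs.
        history.append((reasoning, action_name))
    return "hit max_iterations"

print(run())  # done after 2 actions
```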
