Pywinassistant
The first open-source Artificial Narrow Intelligence generalist agentic framework Computer-Using-Agent that fully operates graphical user interfaces (GUIs) using only natural language. It uses Visualization-of-Thought and Chain-of-Thought reasoning to elicit spatial reasoning and perception, and it emulates, plans, and simulates synthetic HID interactions.
Install / Use
/learn @a-real-ai/Pywinassistant
README
PyWinAssistant: An artificial assistant – MIT Licensed | Public Release: December 31, 2023 | Complies with federal coordinations AI Standards for Complex Adaptive Systems, Asilomar AI Principles and IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems.
PyWinAssistant is the first open-source Artificial Narrow Intelligence to elicit spatial reasoning and perception as a generalist agentic framework Computer-Using-Agent that fully operates graphical-user-interfaces (GUIs) for Windows 10/11 through direct OS-native semantic interaction. It functions as a Computer-Using-Agent / Large-Action-Model, forming the foundation for a pure symbolic spatial cognition framework that enables artificial operation of a computer using only natural language, without relying on computer vision, OCR, or pixel-level imaging. PyWinAssistant emulates, plans, and simulates synthetic Human-Interface-Device (HID) interactions through native Windows Accessibility APIs, eliciting human-like abstraction across geometric, hierarchical, and temporal dimensions at an Operating-System level. This OS-integrated approach simulating spatial utilization of a computer provides a future-proof, generalized, modular, and dynamic ANI orchestration framework for multi-agent-driven automation, marking an important step in symbolic reasoning towards AGI.
Key Features:
- No Dependence on an Imaging Pipeline: Operates through Windows UI Automation (UIA) and programmatic GUI semantics rather than pixel-level imaging, enabling universal workflow orchestration.
- Symbolic Spatial Mapping: Hierarchical element tracking via OS-native parent/child relationships and coordinate systems.
- Non-Visual Perception: Real-time interface understanding through direct metadata extraction (control types, states, positions).
- Visual Perception: A single screenshot can elicit comprehension and detailed perception by visualizing goal intent and environment changes in space over time. It can be fine-tuned to look for visual cues, bugs, causal reasoning bugs, static, semantic grounding, errors, corruption...
- Unified Automation: Automatic element detection. Combines GUI, system, and web automation under one Python API. Eliminates context-switching between tools.
- AI-Powered Script Generation: Translates natural language or demonstrations into any kind of code inside any IDE or text edit areas.
- Self-Healing Workflows: Auto-adjusts to UI changes (e.g., element ID shifts), reducing maintenance overhead and making PyWinAssistant's algorithm future-proof.
- AI/ML Integration: Using NLP to generate scripts (e.g., “Automate Application” → plan of test execution steps in JSON) with self-correcting selectors.
- Cross-Context Automation: Seamlessly combining GUI, web, and API workflows in a Pythonic way, unifying disjointed automation methods (GUI, API, web) into a single framework.
- Accessibility: Enhancing accessibility for users with different needs, enabling voice or simple text commands to control complex actions.
- Generalization: Elicits spatial cognition to understand and execute a wide range of commands in a natural, intuitive manner.
- Small and compact: PyWinAssistant functions as an example algorithm of a modular and generalized computer assistant framework that elicits spatial cognition.
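As a rough illustration of the symbolic spatial mapping and non-visual perception items in the list above, the sketch below models a UI tree in plain Python. The `UIElement` class, the `walk` and `describe` helpers, and the sample Spotify window are hypothetical stand-ins, not PyWinAssistant's actual data structures; a real run would read these fields from Windows UI Automation instead of hard-coding them.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class UIElement:
    """Hypothetical stand-in for one node of an accessibility tree."""
    name: str
    control_type: str
    rect: Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels
    enabled: bool = True
    children: List["UIElement"] = field(default_factory=list)

    def center(self) -> Tuple[int, int]:
        """Return the integer midpoint of the bounding rectangle."""
        left, top, right, bottom = self.rect
        return ((left + right) // 2, (top + bottom) // 2)


def walk(node: UIElement, depth: int = 0) -> Iterator[Tuple[int, UIElement]]:
    """Depth-first traversal that yields (depth, element) pairs."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def describe(root: UIElement) -> str:
    """Flatten the tree into indented text that could be handed to an LLM."""
    lines = []
    for depth, el in walk(root):
        x, y = el.center()
        state = "enabled" if el.enabled else "disabled"
        lines.append(f"{'  ' * depth}{el.control_type} '{el.name}' center=({x}, {y}) {state}")
    return "\n".join(lines)


# Made-up sample data for illustration only.
window = UIElement("Spotify", "Window", (0, 0, 1200, 800), children=[
    UIElement("Search", "Edit", (100, 20, 500, 60)),
    UIElement("Play", "Button", (560, 720, 640, 780)),
])
print(describe(window))
```

The text produced by `describe` carries control types, states, and positions without any screenshot, which is the kind of metadata the feature list refers to.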
PyWinAssistant has its own set of reasoning agents, utilizing Visualization-of-Thought (VoT) and Chain-of-Thought (CoT) to enhance generalization, dynamically simulating actions through abstract GUI semantic dimensions rather than visual processing, making it future-proof for next-generation LLM models. By visualizing interface contents to dynamically simulate and plan actions over abstract GUI semantic dimensions, concepts, and differentials, PyWinAssistant redefines computer vision automation, enabling high-efficiency visual processing at a fraction of traditional computational costs. PyWinAssistant has achieved real-time spatial perception at an Operating-System level, allowing for memorization of visual cues and tracking of on-screen changes over time.
Released before key breakthroughs in AI for Spatial Reasoning, it predates:
- Microsoft’s Visualization-of-Thought research paper (April 4, 2024)
- Anthropic Claude’s Computer-Use Agent (October 22, 2024)
- OpenAI ChatGPT’s Operator Computer-Using Agent (CUA) (January 23, 2025)
PyWinAssistant represents a major paradigm shift in AI and automation by pioneering pure symbolic computer interaction bridging human intent with GUI automation at an OS level through these breakthroughs:
- First Agent to bypass OCR/imaging for Computer-Using-Agent GUI automation.
- First Framework using Windows UIA as the primary spatial perception channel.
- First System demonstrating OS-native hierarchical-temporal reasoning.
1. Unified Natural Language → GUI Automation
Traditional Approach:
Automation tools require scripting (e.g., AutoHotkey) or API integration (e.g., Selenium).
PyWinAssistant Breakthrough:
# True generalization for natural language directly driving UI actions
assistant("Play Daft Punk on Spotify and email the lyrics to my friend")
# The agent chooses a fitting item according to the related context to comply with user intent.
Mechanism: Combines UIAutomation’s GUI control detection with LLMs to:
- Parse intent ("play", "email lyrics")
- Map to UI elements (Spotify play button, Outlook compose window)
- Generate adaptive workflows
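The three mechanism steps above can be sketched roughly as follows. This is a toy illustration only: a hand-written verb table stands in for the LLM, and `ELEMENTS`, `VERB_TO_ELEMENT`, `parse_intent`, and `plan` are hypothetical names that are not part of PyWinAssistant.

```python
import json
from typing import Dict, List

# Hypothetical flat listing of elements discovered in the open applications.
ELEMENTS = [
    {"app": "Spotify", "name": "Play", "type": "Button"},
    {"app": "Spotify", "name": "Search", "type": "Edit"},
    {"app": "Outlook", "name": "New mail", "type": "Button"},
]

# Toy verb table standing in for LLM-based intent parsing (an assumption).
VERB_TO_ELEMENT = {"play": ("Spotify", "Play"), "email": ("Outlook", "New mail")}


def parse_intent(command: str) -> List[str]:
    """Return the known verbs found in the command, in order of appearance."""
    words = [w.strip(".,!?").lower() for w in command.split()]
    return [w for w in words if w in VERB_TO_ELEMENT]


def plan(command: str) -> List[Dict[str, str]]:
    """Map each parsed verb to a discovered UI element and emit one step per verb."""
    steps = []
    for verb in parse_intent(command):
        app, name = VERB_TO_ELEMENT[verb]
        found = any(e["app"] == app and e["name"] == name for e in ELEMENTS)
        if found:  # only plan steps for elements that actually exist
            steps.append({"intent": verb, "app": app, "element": name, "action": "click"})
    return steps


steps = plan("Play Daft Punk on Spotify and email the lyrics to my friend")
print(json.dumps(steps, indent=2))
```

In the real project the parsing and mapping are delegated to the model; the point here is only the shape of the pipeline, from free text to a list of element-level steps.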
PyWinAssistant Innovation: Eliminates the need for:
- Predefined API integrations
- XPath/CSS selector knowledge
- Manual error handling
2. Cross-Application State Awareness
Traditional Limitation:
Tools operate in app silos (e.g., Power Automate connectors).
PyWinAssistant Innovation:
# Notes:
# The Assistant's full step generation works flawlessly, but the in-step modifier and memory-content retrieval were purposely disabled and commented out in the code at [def act()](https://github.com/a-real-ai/pywinassistant/blob/6aae4e514a0dc661f7ed640181663f483972bc1e/core/driver.py#L648C1-L648C8)
# to comply with federal coordinations AI Standards for Complex Adaptive Systems, Asilomar AI Principles and IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems.
# Accurately maintains context and intent across apps using the UIA tree and spatial memory (example for further development):
assistant("Find the best and cheapest flight to Mexico, also look for local hotels, and suggest the best cultural options to me in new tabs")
assistant("Look for various pizza coupons for anything but pineapple, fill in the details to order and show me the results")
# PyWinAssistant is highly modular (example):
def workflow():
    song = assistant(goal="get the current track")  # UIA
    write_action(f"Review '{song}': Great bassline!", app="Notepad")  # Win32
    assistant(goal="Post on twitter the written text from notepad")  # Web
# The previous set of actions can be also executed by simply using natural language:
assistant("Get the current song playing and in notepad put the title as Review song name: Great bassline, and write about why it is a great bassline, then post it on twitter", assistant_identity="You're an expert music critic")
Key Advancements:
- Unified Control Graph: Treats all apps as nodes in a single UIA-accessible graph
- State Transfer: Passes data between apps via clipboard/UIA properties
- Semantic Transfer: Passes semantics of goal intent across all steps
- Error Recovery: Uses agentic reasoning systems to avoid failing actions
Impact: Enables workflows previously requiring custom middleware.
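To make the state transfer, semantic transfer, and error recovery ideas above more concrete, here is a minimal sketch under stated assumptions. It uses an in-memory dictionary instead of the clipboard or UIA properties, a simple retry in place of agentic reasoning, and hypothetical names throughout (`run_workflow`, `read_track`, `write_review`, `post_note`); none of these are PyWinAssistant functions.

```python
from typing import Callable, Dict, List

Context = Dict[str, str]
Step = Callable[[Context], None]


def run_workflow(goal: str, steps: List[Step], max_retries: int = 1) -> Context:
    """Run steps in order with one shared context; retry a failed step up to max_retries times."""
    context: Context = {"goal": goal}  # the original goal travels with every step
    for step in steps:
        for attempt in range(max_retries + 1):
            try:
                step(context)
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
    return context


def read_track(ctx: Context) -> None:
    ctx["song"] = "Around the World"  # placeholder for a value read from the player UI


def write_review(ctx: Context) -> None:
    ctx["note"] = f"Review '{ctx['song']}': Great bassline!"


post_attempts = {"count": 0}


def post_note(ctx: Context) -> None:
    """Fails once on purpose to exercise the retry path."""
    post_attempts["count"] += 1
    if post_attempts["count"] == 1:
        raise RuntimeError("simulated transient UI failure")
    ctx["posted"] = ctx["note"]


result = run_workflow("Review the current song and post it",
                      [read_track, write_review, post_note])
print(result["posted"])
```

Each step reads what earlier steps wrote and can always see the original goal, which mirrors the data and intent handoff described in the list.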
3. Probabilistic Automation Engine
Traditional Model:
Deterministic scripts fail on UI changes.
PyWinAssistant’s Solution:
# Adaptive element discovery
def fast_action(goal):
    speaker("Clicking onto the element without visioning context. No imaging is required.")
    analyzed_ui = analyze_app(application=ai_choosen_app, additional_search_options=generated_keywords)
    gen_coordinates = [{"role": "assistant",
                        "content": "You are an AI Windows Mouse Agent that can interact with the mouse. Only respond with the "
                                   "predicted coordinates of the mouse click position to the center of the element object "
                                   "\"x=, y=\" to achieve the goal."},
                       {"role": "system", "content": f"Goal: {single_step}\n\nContext:{original_goal}\n{analyzed_ui}"}]
    coordinates = api_call(gen_coordinates, model_name="gpt-4-1106-preview", max_tokens=100, temperature=0.0)
    print(f"AI decision coordinates: '{coordinates}'")
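The prompt above asks the model to answer in the form "x=, y=", so the reply still has to be turned into numbers before a click can be issued. The snippet below is one possible way to do that, offered as a sketch; `parse_coordinates` and `inside` are hypothetical helpers and the source does not show how the project itself parses the reply.

```python
import re
from typing import Optional, Tuple

# Accepts replies such as "x=152, y=311" or "X = 152 y = 311".
# The leading \b keeps words like "max=5" from matching as an x coordinate.
_COORD_RE = re.compile(r"\bx\s*=\s*(-?\d+)\s*,?\s*y\s*=\s*(-?\d+)", re.IGNORECASE)


def parse_coordinates(reply: str) -> Optional[Tuple[int, int]]:
    """Extract integer (x, y) from a model reply, or None if no pair is found."""
    match = _COORD_RE.search(reply)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def inside(point: Tuple[int, int], rect: Tuple[int, int, int, int]) -> bool:
    """Check that a point lies within a (left, top, right, bottom) rectangle."""
    x, y = point
    left, top, right, bottom = rect
    return left <= x < right and top <= y < bottom


reply = "x=152, y=311"
point = parse_coordinates(reply)
if point is not None and inside(point, (0, 0, 1920, 1080)):
    print(f"Would click at {point}")
else:
    print("Reply rejected; no click issued")
```

Rejecting replies that do not parse, or that fall outside the target window, is a cheap guard against acting on a malformed model answer.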
Revolutionary Features:
- Semantic Search by thinking: Example: synonyms("download") → ["save", "export", "↓ icon"]
- Spatial Probability: Prioritizes elements by utilizing sets of self-reasoning agents for the synthetic operation of the actions
- Spatial-Prevention: Senses and prevents possible bad actions or misaligned step execution by utilizing sets of self-reasoning agents
- Self-Healing: Automatically chooses the perfect plan to execute without failing its step reasoning, by utilizing sets of self-reasoning agents
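The semantic search item above can be approximated with a small fallback lookup. This is only a sketch: in the project the alternative terms come from reasoning agents, whereas here a fixed `SYNONYMS` table (copied from the example in the list) and a hypothetical `find_element` helper stand in for that step.

```python
from typing import Dict, List, Optional

# Fixed table for illustration; the source describes these terms as generated by the model.
SYNONYMS: Dict[str, List[str]] = {"download": ["save", "export", "↓ icon"]}


def find_element(goal_word: str, element_names: List[str]) -> Optional[str]:
    """Return the first element whose name contains the goal word or, failing that, a synonym."""
    key = goal_word.lower()
    terms = [key] + SYNONYMS.get(key, [])
    for term in terms:  # the exact goal word is tried first, then synonyms in listed order
        for name in element_names:
            if term.lower() in name.lower():
                return name
    return None


# The label changed from "Download" to "Export as PDF", yet the lookup still resolves.
print(find_element("download", ["File", "Export as PDF", "Help"]))
```

Because the lookup falls back to related terms, a renamed button does not immediately break the step, which is the self-healing behavior the list describes in a much simplified form.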
4. Democratized Accessibility
Task: Automate saving a song in the Spotify GUI.
Before:
Automation required:
WinWait, Spotify
ControlClick, x152 y311, Spotify  ; Fragile coordinates
Now: Only 1 natural lan
