Voot – LLM-powered Live Translation for HarmonyOS
Voot, standing for "Voice On Top," is an intelligent simultaneous-interpretation & text translation app for HarmonyOS, powered by your own LLM / translation APIs.
It is designed with three core principles: security, privacy, and simplicity.
[!NOTE] Voot does not provide or resell any LLM/translation service.
You bring your own API keys (OpenAI, DeepL, Ollama, 豆包, etc.).
Huawei AppGallery
Voot is available on Huawei AppGallery (Overseas) (note: you need an overseas network environment to access it). Releases will also remain available on GitHub for sideloading, but we strongly recommend following the AppGallery listing for the latest version.
Table of Contents
- Features
- Architecture
- Screenshots
- Getting Started
- Install Hap
- Configuration
- Usage
- Security & Privacy
- Roadmap
- Blueprints
- Contributing
- Model Performance
- Known Issues
- Acknowledgements
- License
- Disclaimer
Features
- 🔐 Secure by design
- No built-in or hosted model – you must configure your own API keys.
- API keys are stored only in the HarmonyOS sandbox, protected by face / biometric unlock.
- No third-party analytics SDKs.
- 🕵️ Privacy-first
- Audio is processed locally on-device for capture & pre-processing.
- Recorded audio for translation is not uploaded and is destroyed after processing.
- Only the minimal text required for translation is sent directly to the provider you configure.
- 🧩 Multi-provider support
- OpenAI (GPT-style chat / translation)
- DeepL
- Ollama (local LLM gateway)
- 豆包 / other custom endpoints (via configurable URL & API key)
- 🗣️ Simultaneous interpretation
- One-tap start/stop of “live” translation.
- Clear split between original text and translated text.
- 🔄 Device Continuation
- Seamlessly transfer your active translation session to another HarmonyOS device (e.g., from Phone to Tablet).
- Keeps your current transcription and translation context intact.
- 🖼️ Subtitles
- Floating subtitle window that works over other apps.
- Resizable and movable overlay for seamless multitasking.
- 📱 Desktop Widgets
- Control Card: Start/stop subtitle and interpretation directly from the home screen.
- Token Card: Monitor your API token usage without opening the app.
- 💨 Air Gestures
- Control translation start/stop without touching the screen.
- Ideal for hands-free operation during presentations or cooking.
- ✨ Text Polishing
- Improve the quality and tone of translated text.
- Refine rough translations into more natural and professional language.
- 📷 Scan & Translate
- Scan text from physical documents or screens using the camera.
- Instantly translate scanned text with save functionality.
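The Token Card above only needs a small amount of per-provider bookkeeping. A minimal TypeScript sketch of such an accumulator (the names `TokenUsage`, `recordUsage`, and `totalTokens` are illustrative, not Voot's actual store API):

```typescript
// Illustrative sketch of the Token Card's usage accounting; not Voot's real code.
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

// Per-provider running totals, as a desktop widget might read them.
const usageByProvider = new Map<string, TokenUsage>();

function recordUsage(provider: string, usage: TokenUsage): void {
  const prev = usageByProvider.get(provider) ?? { promptTokens: 0, completionTokens: 0 };
  usageByProvider.set(provider, {
    promptTokens: prev.promptTokens + usage.promptTokens,
    completionTokens: prev.completionTokens + usage.completionTokens,
  });
}

function totalTokens(provider: string): number {
  const u = usageByProvider.get(provider);
  return u ? u.promptTokens + u.completionTokens : 0;
}

// Example: two API calls against the same provider accumulate.
recordUsage("openai", { promptTokens: 120, completionTokens: 45 });
recordUsage("openai", { promptTokens: 80, completionTokens: 30 });
```

In the real app the numbers would come from each provider's usage field in its API response.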
Architecture
Voot/
├─ entry/
│ ├─ src/main/ets/
│ │ ├─ pages/ # ArkUI pages (Index, Configuration, Translation, Settings, etc.)
│ │ ├─ services/ # Mic + ASR services (SherpaWhisperMicService, PipSubtitleManager)
│ │ ├─ storage/ # Preference-backed stores (API config, TokenUsage, etc.)
│ │ ├─ components/ # Shared UI builders (PolicySheet, TokenUsageChart, etc.)
│ │ ├─ widget/ # Service Cards (Desktop Widgets)
│ │ ├─ entryformability/ # Widget lifecycle management
│ │ └─ workers/ # Background ASR workers for long-running capture
│ ├─ src/main/resources/ # Raw HTML, media assets, Sherpa models
│ ├─ oh-package*.json5 # Module package definitions
│ └─ build-profile.json5 # Entry module build settings
├─ AppScope/ # Application-level configuration and assets
├─ hvigorfile.ts # Workspace hvigor build script
└─ build-profile.json5 # Global build profile
Screenshots
<div align="center"> <img src="https://github.com/YANGZX22/Voot/blob/main/entry/src/main/resources/base/media/screenshot.jpg"> </div>

Getting Started
Prerequisites
- HarmonyOS toolchain:
  - DevEco Studio with ArkTS support
  - HarmonyOS SDK (version matching the project, current: 6.0.1(21))
- A HarmonyOS device or emulator
- One or more API keys, for example:
  - OpenAI API key
  - DeepL API key
  - Ollama endpoint running locally or on LAN
  - 豆包 (Doubao) / other compatible HTTP API
Clone
git clone https://github.com/YANGZX22/Voot.git
cd Voot
Open the project in DevEco Studio.
Run
- Connect a HarmonyOS device or start an emulator.
- In DevEco Studio, select the run configuration corresponding to the app.
- Click Run to build and deploy.
Install Hap
Download the HAP from GitHub Releases and sideload it. Alternatively, you can use Auto-installer or DevEco Testing for installation.
[!IMPORTANT] Huawei's signing servers block IP addresses outside mainland China, so sideloading software for HarmonyOS NEXT from countries/regions outside mainland China requires a mainland-China network environment during signing.
[!NOTE] Apps sideloaded via self-signing on HarmonyOS NEXT have a default validity period of 14 days. Completing Developer Real-Name Authentication extends this period to 180 days.
Configuration
API Providers
In the “配置 API / Configure API” tab:

- Choose the current provider (e.g. OpenAI, DeepL, Ollama, 豆包).
- Tap “配置 API / Configure API”.
- For each provider, fill in:
  - API URL (e.g. https://api.openai.com/v1/chat/completions, https://api-free.deepl.com/v2/translate, or your Ollama endpoint)
  - API Key / Token
  - Optional: a custom prompt / system message used for translation.

The configuration is stored locally in the app sandbox; accessing or modifying it requires face / biometric verification.
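The stored configuration is essentially everything needed to assemble a translation request to an OpenAI-style chat endpoint. A minimal sketch, assuming a simple config record (the `ProviderConfig` shape and `buildChatRequest` are illustrative, not Voot's real code):

```typescript
// Illustrative only: turning a configured provider entry into an HTTP request.
// Field names (apiUrl, apiKey, systemPrompt) are assumptions, not Voot's schema.
interface ProviderConfig {
  apiUrl: string;        // e.g. https://api.openai.com/v1/chat/completions
  apiKey: string;        // kept in the HarmonyOS sandbox in the real app
  systemPrompt?: string; // optional custom translation instruction
}

function buildChatRequest(cfg: ProviderConfig, text: string, targetLang: string) {
  return {
    url: cfg.apiUrl,
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${cfg.apiKey}`,
    },
    body: JSON.stringify({
      messages: [
        {
          role: "system",
          content: cfg.systemPrompt ?? `Translate the user's text into ${targetLang}.`,
        },
        { role: "user", content: text },
      ],
    }),
  };
}

const req = buildChatRequest(
  { apiUrl: "https://api.openai.com/v1/chat/completions", apiKey: "sk-..." },
  "Hello",
  "中文",
);
```

Note that the key leaves the device only inside this Authorization header, sent directly to the provider you configured.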
Target Languages
In the “目标语言 / Target language” section:
- Select your default output language (e.g. 中文, English, etc.).
- The chosen target language is used for all translation APIs by default.
Glossary / Terminology
In the “术语库 / Glossary” menu:

- Enter term pairs in the format Original = Translation (one per line). Example:
  HarmonyOS = 鸿蒙
  AI = 人工智能
- These terms are automatically appended to the system prompt, instructing the LLM to strictly follow your terminology.
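A minimal sketch of how such term pairs could be parsed and appended to the system prompt (`parseGlossary` and `applyGlossary` are hypothetical names; Voot's actual implementation may differ):

```typescript
// Parse "Original = Translation" lines into term pairs, skipping malformed lines.
function parseGlossary(raw: string): Array<[string, string]> {
  return raw
    .split("\n")
    .map(line => line.split("="))
    .filter(parts => parts.length === 2)
    .map(([src, dst]) => [src.trim(), dst.trim()] as [string, string]);
}

// Append the terminology rules to an existing system prompt.
function applyGlossary(systemPrompt: string, raw: string): string {
  const pairs = parseGlossary(raw);
  if (pairs.length === 0) return systemPrompt;
  const rules = pairs.map(([s, d]) => `"${s}" must be translated as "${d}"`).join("; ");
  return `${systemPrompt} Strictly follow this terminology: ${rules}.`;
}

const prompt = applyGlossary(
  "Translate into Chinese.",
  "HarmonyOS = 鸿蒙\nAI = 人工智能",
);
```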
Usage
1. Launch Voot on your HarmonyOS device.
2. Configure API:
   - Go to the first tab, Configuration.
   - Select an API provider (OpenAI, DeepL, etc.) and enter your API Key/URL.
   - Set your Target Language.
3. Live Translation (翻译):
   - Switch to the Translation tab.
   - Tap “开启麦克风 / Enable microphone” to start capturing audio.
   - Speak in the source language; the app transcribes and translates in real time.
   - Air Gestures: wave your hand above the front camera to start/stop translation without touching the screen.
   - Device Continuation: tap the Transfer (流转) icon to move the session to another HarmonyOS device.
4. Text Polishing (润色):
   - Switch to the Polishing tab.
   - Input or paste text that needs refinement.
   - The AI improves the tone, grammar, and clarity of the text.
5. Scan & Translate (扫描):
   - Switch to the Scan tab.
   - Point the camera at a document or screen.
   - The app recognizes the text and provides an instant translation.
   - You can save the scanned results to History.
Security & Privacy
Short summary (see in-app privacy policy / privacy.html for details):
- Audio:
  - Recorded only on-device for the current translation session.
  - Not uploaded to our servers (we have none).
  - Discarded after processing.
- API Keys:
  - Stored in the app sandbox.
  - Protected with HarmonyOS face/biometric mechanisms.
  - Never transmitted to any server except the provider you configured.
- Data Flow:
  - Text is sent only to your chosen provider (OpenAI / DeepL / etc.).
  - No central logging, analytics, or telemetry from the developer.
Roadmap
Finished / planned / possible steps:
- Subtitle (Done ✅)
- Live Window on HarmonyOS (Done ✅)
- Desktop Widgets (Done ✅)
- Token usage analytics (Done ✅)
- Glossary / Terminology Support (Done ✅)
- Device Continuation (Done ✅)
- History & Favorites (Done ✅)
- Air Gestures (Done ✅)
- Text Polishing (Done ✅)
- Scan & Translate (Done ✅)
- Pose Detection Button Dialog (Done ✅)
- Support for more LLM / translation APIs (e.g. Google Translate)
- Enhanced ASR and cutoff logic
- More supported original languages
Feel free to open issues or PRs with feature requests.
Blueprints
1. Audio-Direct Multimodal
Moving beyond the lossy "speech → text → translation" pipeline:
- Direct Audio Input: sending VAD-filtered audio segments directly to multimodal models (e.g., GPT-4o Audio, Gemini 1.5 Pro).
- Nuance Capture: preserving tone, emotion (sarcasm, urgency), and speaker identity, which are often lost in ASR.
- Feedback Loop: using the rich understanding from the multimodal engine to feed back into the frontend, correcting previous ASR errors or updating the context for the Fast Track.
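To make the "VAD-filtered audio segments" idea concrete, here is a toy energy-threshold VAD over PCM frames (the threshold and frame size are arbitrary illustration values, not Voot's actual parameters; production VADs are far more robust):

```typescript
// A frame is voiced if its mean-square energy exceeds a threshold.
function isVoiced(frame: number[], threshold = 0.01): boolean {
  const energy = frame.reduce((sum, s) => sum + s * s, 0) / frame.length;
  return energy >= threshold;
}

// Group a frame stream into contiguous voiced segments; each segment is
// what would be forwarded to a multimodal model as one audio chunk.
function voicedSegments(frames: number[][], threshold = 0.01): number[][][] {
  const segments: number[][][] = [];
  let current: number[][] = [];
  for (const frame of frames) {
    if (isVoiced(frame, threshold)) {
      current.push(frame);
    } else if (current.length > 0) {
      segments.push(current);
      current = [];
    }
  }
  if (current.length > 0) segments.push(current);
  return segments;
}

// Synthetic example: silence, two speech frames, silence, one speech frame.
const silence = new Array(160).fill(0);
const speech = new Array(160).fill(0.5);
const segs = voicedSegments([silence, speech, speech, silence, speech]);
```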
2. Confidence Scoring & Visual Feedback
- Implement a confidence scoring system that visually highlights translation segments the model is less certain about, so users know which parts to double-check.
