feat: add initial browser voice input prototype

2026-05-11 22:01:23 -04:00
parent 67bd22b27d
commit 3edc41e86c
10 changed files with 10660 additions and 15 deletions
@@ -0,0 +1,281 @@
+# sherpa-onnx Moonshine v2 Migration Plan
+
+## Goal
+
+Replace the current `@moonshine-ai/moonshine-js` browser integration with a `sherpa-onnx` WebAssembly integration that can run Moonshine v2 locally in the browser and feed recognized text into the existing terminal stdin path.
+
+The target behavior is:
+
+- user clicks a voice button in the terminal UI
+- browser captures microphone audio
+- VAD segments speech locally
+- Moonshine v2 recognition runs locally in WASM
+- committed transcript is sent into the existing terminal input path
+
+## Why We Are Changing Direction
+
+The current MoonshineJS integration is the wrong runtime for the `medium-streaming-en` bundle we want to use.
+
+Current blockers:
+
+- `moonshine-js` expects old-style ONNX assets such as:
+  - `quantized/encoder_model.onnx`
+  - `quantized/decoder_model_merged.onnx`
+- the downloaded `medium-streaming-en.zip` contains a newer Moonshine v2 layout:
+  - `frontend.ort`
+  - `encoder.ort`
+  - `decoder_kv.ort`
+  - `adapter.ort`
+  - `cross_kv.ort`
+  - `tokenizer.bin`
+  - `streaming_config.json`
+- self-hosting those files alone does not make the existing `moonshine-js` loader compatible
+
+## Constraints
+
+- keep transcription fully local in the browser
+- keep the current terminal WebSocket/stdin path unchanged if possible
+- preserve the small voice button UX already added to `webterm/static/js/terminal.ts`
+- prefer self-hosted assets served by `webterm/static/`
+- do not depend on third-party model/CDN availability at runtime
+
+## Repo Touchpoints
+
+Expected files/modules to change:
+
+- `webterm/static/js/terminal.ts`
+- `webterm/static/js/terminal.js`
+- `package.json`
+- `bun.lock`
+- `package-lock.json`
+- `webterm/assets_embed.go`
+- `webterm/static/...` for new WASM/model assets
+- `README.md`
+
+Possible new files:
+
+- `webterm/static/js/sherpa-voice.ts` or similar
+- `webterm/static/js/sherpa-onnx.d.ts`
+- `webterm/static/models/moonshine-v2-medium/...`
+- `scripts/install-sherpa-model.sh` or similar helper
+
+## Proposed Architecture
+
+### 1. Frontend runtime
+
+Use `sherpa-onnx` JavaScript/WebAssembly in the browser instead of `moonshine-js`.
+
+Primary responsibilities:
+
+- initialize sherpa WASM runtime
+- load self-hosted Moonshine v2 model files
+- initialize VAD + ASR pipeline
+- expose simple start/stop API to terminal UI
+- emit committed transcript strings
+
+### 2. Audio flow
+
+Use the VAD + offline (non-streaming) ASR flow described by sherpa-onnx:
+
+- microphone input
+- VAD detects speech boundaries
+- finalized speech segment is sent to recognizer
+- recognizer returns transcript
+- transcript is injected into terminal stdin
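+
+The flow above can be sketched as a small function with injected stages. This is only an illustration of the data flow: `PipelineStages`, `detectSegments`, `recognize`, and `processAudio` are hypothetical names, not sherpa-onnx APIs, and the real VAD and recognizer calls still have to be confirmed against the chosen release.
+
+```typescript
+// Hypothetical stage signatures; none of these are sherpa-onnx APIs.
+interface PipelineStages {
+  // Stand-in for VAD: splits raw samples into finalized speech segments.
+  detectSegments(samples: Float32Array): Float32Array[];
+  // Stand-in for the Moonshine v2 recognizer.
+  recognize(segment: Float32Array): string;
+  // Stand-in for the existing sendStdin(...) path in WebTerminal.
+  sendStdin(text: string): void;
+}
+
+// Runs one chunk of microphone audio through VAD -> ASR -> stdin and
+// returns the transcripts that were actually sent.
+function processAudio(samples: Float32Array, stages: PipelineStages): string[] {
+  const sent: string[] = [];
+  for (const segment of stages.detectSegments(samples)) {
+    const text = stages.recognize(segment).trim();
+    if (text.length === 0) continue; // skip segments with no recognized text
+    stages.sendStdin(text);
+    sent.push(text);
+  }
+  return sent;
+}
+```
+
+Keeping the stages injectable means the terminal wiring can be exercised with fakes before the WASM runtime is integrated.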
+
+### 3. Terminal integration
+
+Keep the existing terminal send path:
+
+- reuse `sendStdin(...)` in `WebTerminal`
+- keep current voice button/status UI shell
+- replace only the recognition backend
+
+## Implementation Phases
+
+### Phase 1. Remove the current MoonshineJS dependency path
+
+- remove `@moonshine-ai/moonshine-js` import and startup logic
+- remove `moonshine-js.d.ts`
+- remove MoonshineJS-specific error capture code
+- keep the voice UI container, button, and status area
+
+Deliverable:
+
+- terminal still builds
+- voice button exists but uses a new backend abstraction
+
+### Phase 2. Stage sherpa-onnx assets locally
+
+- choose exact sherpa-onnx JS/WASM package/version
+- download/copy the required runtime assets into `webterm/static/`
+- unpack `~/medium-streaming-en.zip` into repo-managed static model directory
+- verify final model path layout expected by sherpa-onnx
+- ensure static assets are available both in dev mode and embedded mode
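+
+One way to check that the staged model is reachable in both dev and embedded mode is a small probe that issues `HEAD` requests for each file. This is a sketch under assumptions: the file names come from the `medium-streaming-en.zip` listing above, while `findMissingModelFiles`, `HeadFetch`, and any base URL passed in are hypothetical. Whether sherpa-onnx expects exactly these names is still an open question.
+
+```typescript
+// File names taken from the medium-streaming-en.zip listing in this plan.
+const MODEL_FILES: readonly string[] = [
+  "frontend.ort",
+  "encoder.ort",
+  "decoder_kv.ort",
+  "adapter.ort",
+  "cross_kv.ort",
+  "tokenizer.bin",
+  "streaming_config.json",
+];
+
+// Minimal shape of the fetch function needed here; the browser's fetch
+// can be passed in directly.
+type HeadFetch = (url: string, init: { method: "HEAD" }) => Promise<{ ok: boolean }>;
+
+// Returns the model files that could not be fetched from baseUrl.
+async function findMissingModelFiles(baseUrl: string, fetchFn: HeadFetch): Promise<string[]> {
+  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
+  const results = await Promise.all(
+    MODEL_FILES.map(async (name): Promise<string | null> => {
+      try {
+        const res = await fetchFn(base + name, { method: "HEAD" });
+        return res.ok ? null : name;
+      } catch {
+        return name; // a network failure counts as missing
+      }
+    }),
+  );
+  return results.filter((name): name is string => name !== null);
+}
+```
+
+In the browser this could be called as `findMissingModelFiles("/static/models/moonshine-v2-medium", fetch)`, assuming that is where the bundle ends up.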
+
+Deliverable:
+
+- all WASM/model files are served locally from `/static/...`
+
+### Phase 3. Build a thin browser voice adapter
+
+- create a dedicated module that wraps sherpa-onnx initialization
+- define a minimal interface:
+  - `start()`
+  - `stop()`
+  - `isActive()`
+  - callbacks for:
+    - status updates
+    - final transcript
+    - detailed errors
+- keep terminal code from depending directly on low-level sherpa objects
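+
+A minimal sketch of what that interface could look like follows. Every name here (`VoiceAdapter`, `VoiceAdapterCallbacks`, `VoiceStatus`, `StubVoiceAdapter`) is hypothetical, and making `start()` return a promise is an assumption based on model loading being asynchronous. The status values mirror the states listed under Phase 4. The stub backend exists only so the terminal UI can be wired up before sherpa-onnx is integrated.
+
+```typescript
+// All names are hypothetical; nothing below is a sherpa-onnx API.
+type VoiceStatus =
+  | "ready"
+  | "loading-runtime"
+  | "loading-model"
+  | "listening"
+  | "processing"
+  | "sent"
+  | "error";
+
+interface VoiceAdapterCallbacks {
+  onStatus(status: VoiceStatus): void;
+  onFinalTranscript(text: string): void;
+  onError(error: Error): void;
+}
+
+interface VoiceAdapter {
+  start(): Promise<void>; // assumed async: runtime and model loading take time
+  stop(): void;
+  isActive(): boolean;
+}
+
+// Stub backend with no audio or recognition, for wiring up the UI early.
+class StubVoiceAdapter implements VoiceAdapter {
+  private active = false;
+  private readonly callbacks: VoiceAdapterCallbacks;
+
+  constructor(callbacks: VoiceAdapterCallbacks) {
+    this.callbacks = callbacks;
+  }
+
+  async start(): Promise<void> {
+    if (this.active) return; // guard against duplicate starts
+    this.active = true;
+    this.callbacks.onStatus("listening");
+  }
+
+  stop(): void {
+    if (!this.active) return;
+    this.active = false;
+    this.callbacks.onStatus("ready");
+  }
+
+  isActive(): boolean {
+    return this.active;
+  }
+
+  // Test hook standing in for a real recognition result.
+  simulateTranscript(text: string): void {
+    if (this.active) this.callbacks.onFinalTranscript(text);
+  }
+}
+```
+
+With this shape, `terminal.ts` only needs the `VoiceAdapter` interface and the callbacks, so swapping the stub for the real sherpa-onnx backend should not touch the UI code.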
+
+Deliverable:
+
+- one self-contained browser voice adapter module
+
+### Phase 4. Wire VAD + recognition into the terminal UI
+
+- connect voice button click to adapter start/stop
+- show clear states:
+  - ready
+  - loading runtime
+  - loading model
+  - listening
+  - processing speech
+  - final transcript sent
+  - detailed error
+- push final transcript into `sendStdin(...)`
+- decide whether to append Enter (`\r`) automatically
+
+Open question:
+
+- should transcript be inserted as raw text only, or raw text plus `\r`?
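+
+Whichever answer we pick, the choice can be isolated in one small function so it is easy to change or make configurable later. The sketch below is illustrative only: `CommitMode` and `buildStdinPayload` are hypothetical names, and collapsing line breaks inside the transcript is an assumption made so that recognized text cannot submit a command by accident.
+
+```typescript
+// Hypothetical commit modes matching the two options in the open question.
+type CommitMode = "text" | "text-plus-enter";
+
+// Builds the string to hand to sendStdin(...) for one final transcript.
+function buildStdinPayload(transcript: string, mode: CommitMode): string {
+  // Collapse embedded line breaks so only the explicit mode can send Enter.
+  const text = transcript.replace(/[\r\n]+/g, " ").trim();
+  if (text.length === 0) return ""; // nothing recognized, send nothing
+  return mode === "text-plus-enter" ? text + "\r" : text;
+}
+```
+
+For example, `buildStdinPayload("  ls -la\n", "text")` yields `ls -la`, and the `"text-plus-enter"` mode yields the same text followed by `\r`.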
+
+Deliverable:
+
+- end-to-end browser speech to terminal input using sherpa-onnx
+
+### Phase 5. Performance and caching
+
+- confirm browser HTTP caching behavior for WASM and model files
+- add long-lived cache headers if needed for self-hosted static assets
+- optionally add a service worker pre-cache later
+- measure first-load vs repeat-load experience
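+
+To compare first-load and repeat-load behavior, the browser's Resource Timing entries can be summarized per asset. This is a rough sketch: `summarizeAssetLoads`, `ResourceTimingLike`, and `AssetLoadReport` are hypothetical names. It relies on `transferSize` being `0` for same-origin resources served from the local cache, which should be verified in each target browser.
+
+```typescript
+// Subset of PerformanceResourceTiming used by this sketch.
+interface ResourceTimingLike {
+  name: string; // absolute resource URL
+  transferSize: number; // bytes over the network; 0 when served from cache
+  duration: number; // milliseconds
+}
+
+interface AssetLoadReport {
+  url: string;
+  fromCache: boolean;
+  durationMs: number;
+}
+
+// Reports cache status and load time for resources under pathPrefix.
+function summarizeAssetLoads(
+  entries: ResourceTimingLike[],
+  pathPrefix: string,
+): AssetLoadReport[] {
+  return entries
+    .filter((e) => new URL(e.name).pathname.startsWith(pathPrefix))
+    .map((e) => ({
+      url: e.name,
+      fromCache: e.transferSize === 0,
+      durationMs: Math.round(e.duration),
+    }));
+}
+
+// In the browser, the entries would come from:
+//   performance.getEntriesByType("resource")
+```
+
+Running this once on a cold load and once after a reload gives a quick view of which WASM and model files were re-downloaded.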
+
+Deliverable:
+
+- repeat visits avoid re-downloading large model assets where possible
+
+### Phase 6. Documentation and deploy
+
+- document required model files and where they live
+- document browser requirements
+- document secure-origin requirement for microphone access
+- document how to update the model bundle in future
+
+Deliverable:
+
+- README instructions for install, deploy, and troubleshooting
+
+## Technical Questions To Resolve Early
+
+1. Which sherpa-onnx JS/WASM distribution should we use?
+
+- npm package
+- vendored release bundle
+- custom copied example assets
+
+2. Which exact browser API shape should we target?
+
+- direct sherpa recognizer API
+- sherpa VAD + non-streaming ASR helper
+- example-derived wrapper from sherpa demos
+
+3. What is the expected asset layout for the Moonshine v2 medium zip?
+
+- whether files can be served exactly as unzipped
+- whether any extra config or renamed paths are required
+
+4. What transcript commit behavior do we want?
+
+- send text only
+- send text plus Enter
+- configurable mode
+
+## Risks
+
+### Runtime/API mismatch
+
+Risk:
+
+- sherpa-onnx JS APIs may differ from the examples we choose
+
+Mitigation:
+
+- lock to one verified release
+- copy a known working browser example shape before adapting
+
+### Large asset size
+
+Risk:
+
+- medium model load time may be high on first use
+
+Mitigation:
+
+- self-host locally
+- keep caching aggressive
+- consider offering a smaller model as a fallback later
+
+### Mobile/browser compatibility
+
+Risk:
+
+- the WASM + large model + microphone flow may perform poorly on low-powered devices and on browsers with weak WASM or audio support
+
+Mitigation:
+
+- treat desktop Chromium/Firefox as first target
+- gate unsupported browsers with explicit errors
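+
+Gating could be done with a plain capability check that returns human-readable reasons for the status area. The sketch below is an assumption about how we might structure it: `BrowserEnv` and `unsupportedReasons` are hypothetical names, and the capability list is a starting point rather than a verified set of sherpa-onnx requirements.
+
+```typescript
+// Capabilities gathered from the browser; kept as plain data so the
+// check itself is easy to test.
+interface BrowserEnv {
+  isSecureContext: boolean;
+  hasWebAssembly: boolean;
+  hasGetUserMedia: boolean;
+  hasAudioContext: boolean;
+}
+
+// Returns one message per missing capability; an empty array means the
+// browser passes this basic gate.
+function unsupportedReasons(env: BrowserEnv): string[] {
+  const reasons: string[] = [];
+  if (!env.isSecureContext) {
+    reasons.push("Microphone access requires HTTPS or localhost.");
+  }
+  if (!env.hasWebAssembly) {
+    reasons.push("This browser does not support WebAssembly.");
+  }
+  if (!env.hasGetUserMedia) {
+    reasons.push("This browser does not expose microphone capture.");
+  }
+  if (!env.hasAudioContext) {
+    reasons.push("This browser does not support the Web Audio API.");
+  }
+  return reasons;
+}
+
+// In the browser, the env could be built roughly like this:
+//   isSecureContext: window.isSecureContext
+//   hasWebAssembly:  typeof WebAssembly === "object"
+//   hasGetUserMedia: typeof navigator.mediaDevices?.getUserMedia === "function"
+//   hasAudioContext: typeof AudioContext === "function"
+```
+
+Showing these messages before loading the model avoids downloading large assets in a browser that cannot use them.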
+
+### Current worktree noise
+
+Risk:
+
+- repo already has ongoing voice-related frontend edits
+
+Mitigation:
+
+- isolate the new adapter into its own file/module
+- keep migration incremental
+
+## Validation Checklist
+
+- `bun run typecheck` passes
+- frontend bundle builds
+- `./update.sh` deploys successfully
+- remote HTTPS origin still permits microphone access
+- first transcription works end-to-end
+- repeated transcriptions do not leak memory or duplicate recognizers
+- page reload reuses cached assets when possible
+- detailed runtime errors are visible in UI and console
+
+## Recommended Execution Order
+
+1. Pick and verify one sherpa-onnx browser example for Moonshine v2
+2. Vendor the required WASM/runtime assets
+3. Unpack and serve the local model bundle
+4. Build a dedicated browser adapter module
+5. Rewire the existing voice UI to that adapter
+6. Validate microphone -> transcript -> terminal stdin flow
+7. Improve caching and docs
+
+## Definition of Done
+
+- no `moonshine-js` dependency remains in the browser path
+- voice input uses sherpa-onnx + locally hosted Moonshine v2 assets
+- transcripts reach terminal stdin reliably
+- remote HTTPS access works
+- runtime/model errors are understandable
+- setup is documented and reproducible