# sherpa-onnx Moonshine v2 Migration Plan

## Goal

Replace the current `@moonshine-ai/moonshine-js` browser integration with a `sherpa-onnx` WebAssembly integration that can run Moonshine v2 locally in the browser and feed recognized text into the existing terminal stdin path.

The target behavior is:

- user clicks a voice button in the terminal UI
- browser captures microphone audio
- VAD segments speech locally
- Moonshine v2 recognition runs locally in WASM
- committed transcript is sent into the existing terminal input path

## Why We Are Changing Direction

The current MoonshineJS integration is the wrong runtime for the `medium-streaming-en` bundle we want to use.

Current blockers:

- `moonshine-js` expects old-style ONNX assets such as:
  - `quantized/encoder_model.onnx`
  - `quantized/decoder_model_merged.onnx`
- the downloaded `medium-streaming-en.zip` contains a newer Moonshine v2 layout:
  - `frontend.ort`
  - `encoder.ort`
  - `decoder_kv.ort`
  - `adapter.ort`
  - `cross_kv.ort`
  - `tokenizer.bin`
  - `streaming_config.json`
- self-hosting those files alone does not make the existing `moonshine-js` loader compatible

## Constraints

- keep transcription fully local in the browser
- keep the current terminal WebSocket/stdin path unchanged if possible
- preserve the small voice button UX already added to `webterm/static/js/terminal.ts`
- prefer self-hosted assets served by `webterm/static/`
- do not depend on third-party model/CDN availability at runtime

## Repo Touchpoints

Expected files/modules to change:

- `webterm/static/js/terminal.ts`
- `webterm/static/js/terminal.js`
- `package.json`
- `bun.lock`
- `package-lock.json`
- `webterm/assets_embed.go`
- `webterm/static/...` for new WASM/model assets
- `README.md`

Possible new files:

- `webterm/static/js/sherpa-voice.ts` or similar
- `webterm/static/js/sherpa-onnx.d.ts`
- `webterm/static/models/moonshine-v2-medium/...`
- `scripts/install-sherpa-model.sh` or similar helper

## Proposed Architecture

### 1. Frontend runtime

Use `sherpa-onnx` JavaScript/WebAssembly in the browser instead of `moonshine-js`.

Primary responsibilities:

- initialize sherpa WASM runtime
- load self-hosted Moonshine v2 model files
- initialize VAD + ASR pipeline
- expose simple start/stop API to terminal UI
- emit committed transcript strings

### 2. Audio flow

Use the VAD + offline ASR flow described by sherpa-onnx:

- microphone input
- VAD detects speech boundaries
- finalized speech segment is sent to recognizer
- recognizer returns transcript
- transcript is injected into terminal stdin

### 3. Terminal integration

Keep the existing terminal send path:

- reuse `sendStdin(...)` in `WebTerminal`
- keep current voice button/status UI shell
- replace only the recognition backend

## Implementation Phases

### Phase 1. Remove the current MoonshineJS dependency path

- remove `@moonshine-ai/moonshine-js` import and startup logic
- remove `moonshine-js.d.ts`
- remove MoonshineJS-specific error capture code
- keep the voice UI container, button, and status area

Deliverable:

- terminal still builds
- voice button exists but uses a new backend abstraction

### Phase 2. Stage sherpa-onnx assets locally

- choose exact sherpa-onnx JS/WASM package/version
- download/copy the required runtime assets into `webterm/static/`
- unpack `~/medium-streaming-en.zip` into repo-managed static model directory
- verify final model path layout expected by sherpa-onnx
- ensure static assets are available both in dev mode and embedded mode

Deliverable:

- all WASM/model files are served locally from `/static/...`

### Phase 3. Build a thin browser voice adapter

- create a dedicated module that wraps sherpa-onnx initialization
- define a minimal interface:
  - `start()`
  - `stop()`
  - `isActive()`
  - callbacks for:
    - status updates
    - final transcript
    - detailed errors
- keep terminal code from depending directly on low-level sherpa objects

Deliverable:

- one self-contained browser voice adapter module

### Phase 4. Wire VAD + recognition into the terminal UI

- connect voice button click to adapter start/stop
- show clear states:
  - ready
  - loading runtime
  - loading model
  - listening
  - processing speech
  - final transcript sent
  - detailed error
- push final transcript into `sendStdin(...)`
- decide whether to append newline automatically

Open question:

- should transcript be inserted as raw text only, or raw text plus `\r`?

Deliverable:

- end-to-end browser speech to terminal input using sherpa-onnx

### Phase 5. Performance and caching

- confirm browser HTTP caching behavior for WASM and model files
- add long-lived cache headers if needed for self-hosted static assets
- optionally add a service worker pre-cache later
- measure first-load vs repeat-load experience

Deliverable:

- repeat visits avoid re-downloading large model assets where possible

### Phase 6. Documentation and deploy

- document required model files and where they live
- document browser requirements
- document secure-origin requirement for microphone access
- document how to update the model bundle in future

Deliverable:

- README instructions for install, deploy, and troubleshooting

## Technical Questions To Resolve Early

1. Which sherpa-onnx JS/WASM distribution should we use?
   - npm package
   - vendored release bundle
   - custom copied example assets
2. Which exact browser API shape should we target?
   - direct sherpa recognizer API
   - sherpa VAD + non-streaming ASR helper
   - example-derived wrapper from sherpa demos
3. What is the expected asset layout for the Moonshine v2 medium zip?
   - whether files can be served exactly as unzipped
   - whether any extra config or renamed paths are required
4. What transcript commit behavior do we want?
   - send text only
   - send text plus Enter
   - configurable mode

## Risks

### Runtime/API mismatch

Risk:

- sherpa-onnx JS APIs may differ from the examples we choose

Mitigation:

- lock to one verified release
- copy a known working browser example shape before adapting

### Large asset size

Risk:

- medium model load time may be high on first use

Mitigation:

- self-host locally
- keep caching aggressive
- consider fallback option for smaller model later

### Mobile/browser compatibility

Risk:

- WASM + large model + microphone flow may be poor on weaker browsers

Mitigation:

- treat desktop Chromium/Firefox as first target
- gate unsupported browsers with explicit errors

### Current worktree noise

Risk:

- repo already has ongoing voice-related frontend edits

Mitigation:

- isolate the new adapter into its own file/module
- keep migration incremental

## Validation Checklist

- `bun run typecheck` passes
- frontend bundle builds
- `./update.sh` deploys successfully
- remote HTTPS origin still permits microphone access
- first transcription works end-to-end
- repeated transcriptions do not leak memory or duplicate recognizers
- page reload reuses cached assets when possible
- detailed runtime errors are visible in UI and console

## Recommended Execution Order

1. Pick and verify one sherpa-onnx browser example for Moonshine v2
2. Vendor the required WASM/runtime assets
3. Unpack and serve the local model bundle
4. Build a dedicated browser adapter module
5. Rewire the existing voice UI to that adapter
6. Validate microphone -> transcript -> terminal stdin flow
7. Improve caching and docs

## Definition of Done

- no `moonshine-js` dependency remains in the browser path
- voice input uses sherpa-onnx + locally hosted Moonshine v2 assets
- transcripts reach terminal stdin reliably
- remote HTTPS access works
- runtime/model errors are understandable
- setup is documented and reproducible
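
## Appendix: Adapter Interface Sketch

The adapter boundary from Phase 3 and the commit-mode question from Phase 4 can be sketched roughly as below. This is only an illustration under stated assumptions: `VoiceAdapter`, `SpeechBackend`, `createVoiceAdapter`, `CommitMode`, and `toStdin` are hypothetical names, and `SpeechBackend` stands in for whatever sherpa-onnx objects end up being used. No real sherpa-onnx API appears here, since the exact API shape is still an open question in this plan.

```typescript
// Sketch only: every name below is hypothetical; none of it is a sherpa-onnx API.

// Status strings mirror the states listed in Phase 4.
type VoiceStatus =
  | "ready" | "loading runtime" | "loading model"
  | "listening" | "processing speech" | "final transcript sent" | "error";

interface VoiceCallbacks {
  onStatus(status: VoiceStatus): void;
  onTranscript(text: string): void; // one committed transcript per speech segment
  onError(detail: string): void;
}

// The only surface the terminal code would depend on.
interface VoiceAdapter {
  start(): void;
  stop(): void;
  isActive(): boolean;
}

// Assumed seam around the low-level runtime. A real implementation would wrap
// the WASM module, VAD, and recognizer behind these two calls.
interface SpeechBackend {
  open(events: {
    status(s: VoiceStatus): void;
    segmentText(text: string): void;
    failure(detail: string): void;
  }): void;
  close(): void;
}

function createVoiceAdapter(backend: SpeechBackend, cb: VoiceCallbacks): VoiceAdapter {
  let active = false;
  return {
    start() {
      if (active) return; // guard against duplicate recognizers
      active = true;
      backend.open({
        status: (s) => { if (active) cb.onStatus(s); },
        segmentText: (text) => {
          if (!active) return; // drop results that arrive after stop()
          const trimmed = text.trim();
          if (trimmed.length === 0) return; // skip empty segments
          cb.onTranscript(trimmed);
          cb.onStatus("final transcript sent");
        },
        failure: (detail) => {
          if (!active) return;
          active = false;
          backend.close();
          cb.onError(detail);
          cb.onStatus("error");
        },
      });
    },
    stop() {
      if (!active) return;
      active = false;
      backend.close();
      cb.onStatus("ready");
    },
    isActive: () => active,
  };
}

// Phase 4 open question: send raw text only, or raw text plus "\r".
type CommitMode = "text" | "text-enter";

function toStdin(text: string, mode: CommitMode): string {
  return mode === "text-enter" ? text + "\r" : text;
}
```

In the terminal, `onTranscript` would presumably forward `toStdin(text, mode)` to the existing `sendStdin(...)`, so the commit mode stays a one-line decision and the terminal never touches backend objects directly.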