sherpa-onnx Moonshine v2 Migration Plan
Goal
Replace the current @moonshine-ai/moonshine-js browser integration with a sherpa-onnx WebAssembly integration that can run Moonshine v2 locally in the browser and feed recognized text into the existing terminal stdin path.
The target behavior is:
- user clicks a voice button in the terminal UI
- browser captures microphone audio
- VAD segments speech locally
- Moonshine v2 recognition runs locally in WASM
- committed transcript is sent into the existing terminal input path
Why We Are Changing Direction
The current MoonshineJS integration is the wrong runtime for the medium-streaming-en bundle we want to use.
Current blockers:
- moonshine-js expects old-style ONNX assets such as:
  - quantized/encoder_model.onnx
  - quantized/decoder_model_merged.onnx
- the downloaded medium-streaming-en.zip contains a newer Moonshine v2 layout:
  - frontend.ort
  - encoder.ort
  - decoder_kv.ort
  - adapter.ort
  - cross_kv.ort
  - tokenizer.bin
  - streaming_config.json
- self-hosting those files alone does not make the existing moonshine-js loader compatible
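The layout mismatch above can be detected mechanically before any loader is pointed at a bundle. The following is a minimal sketch, not part of either library: the file names come from the two layouts listed above, while `detectBundleLayout` and the `BundleLayout` labels are hypothetical names introduced for illustration.

```typescript
// File names taken from the two layouts listed in the blockers above.
const LEGACY_ONNX_FILES = [
  "quantized/encoder_model.onnx",
  "quantized/decoder_model_merged.onnx",
] as const;

const MOONSHINE_V2_FILES = [
  "frontend.ort",
  "encoder.ort",
  "decoder_kv.ort",
  "adapter.ort",
  "cross_kv.ort",
  "tokenizer.bin",
  "streaming_config.json",
] as const;

// Hypothetical labels, used only by this sketch.
type BundleLayout = "legacy-onnx" | "moonshine-v2" | "unknown";

// Classify a bundle by the relative paths it contains.
// A layout matches only when every one of its files is present.
function detectBundleLayout(paths: readonly string[]): BundleLayout {
  const present = new Set(paths);
  const hasAll = (required: readonly string[]): boolean =>
    required.every((name) => present.has(name));
  if (hasAll(MOONSHINE_V2_FILES)) return "moonshine-v2";
  if (hasAll(LEGACY_ONNX_FILES)) return "legacy-onnx";
  return "unknown";
}
```

A check like this could run in an install script or at adapter startup, so a wrong bundle produces an explicit message instead of a loader failure deep inside the runtime.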
Constraints
- keep transcription fully local in the browser
- keep the current terminal WebSocket/stdin path unchanged if possible
- preserve the small voice button UX already added to webterm/static/js/terminal.ts
- prefer self-hosted assets served by webterm/static/
- do not depend on third-party model/CDN availability at runtime
Repo Touchpoints
Expected files/modules to change:
- webterm/static/js/terminal.ts
- webterm/static/js/terminal.js
- package.json
- bun.lock
- package-lock.json
- webterm/assets_embed.go
- webterm/static/... for new WASM/model assets
- README.md
Possible new files:
- webterm/static/js/sherpa-voice.ts or similar
- webterm/static/js/sherpa-onnx.d.ts
- webterm/static/models/moonshine-v2-medium/...
- scripts/install-sherpa-model.sh or similar helper
Proposed Architecture
1. Frontend runtime
Use sherpa-onnx JavaScript/WebAssembly in the browser instead of moonshine-js.
Primary responsibilities:
- initialize sherpa WASM runtime
- load self-hosted Moonshine v2 model files
- initialize VAD + ASR pipeline
- expose simple start/stop API to terminal UI
- emit committed transcript strings
2. Audio flow
Use the VAD + offline ASR flow described by sherpa-onnx:
- microphone input
- VAD detects speech boundaries
- finalized speech segment is sent to recognizer
- recognizer returns transcript
- transcript is injected into terminal stdin
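The flow above can be sketched as one function that handles a chunk of microphone samples. This is an illustration under stated assumptions: `VadLike` and `RecognizerLike` are hypothetical interfaces standing in for whatever objects the chosen sherpa-onnx build exposes, and the `sendStdin` callback signature is assumed rather than taken from the existing terminal code.

```typescript
// Hypothetical stand-in for a VAD object; the real sherpa-onnx API may differ.
interface VadLike {
  // Accepts mono PCM samples and returns the speech segments that this
  // chunk finalized (possibly none).
  accept(samples: Float32Array): Float32Array[];
}

// Hypothetical stand-in for a non-streaming recognizer.
interface RecognizerLike {
  // Transcribes one finalized speech segment.
  transcribe(segment: Float32Array): string;
}

// Runs one chunk through VAD, recognizes each finalized segment, and
// forwards non-empty transcripts. Returns the number of transcripts sent.
function processChunk(
  samples: Float32Array,
  vad: VadLike,
  recognizer: RecognizerLike,
  sendStdin: (text: string) => void,
): number {
  let committed = 0;
  for (const segment of vad.accept(samples)) {
    const text = recognizer.transcribe(segment).trim();
    if (text.length === 0) continue; // nothing recognized (silence, noise)
    sendStdin(text);
    committed += 1;
  }
  return committed;
}
```

Keeping the VAD and recognizer behind small interfaces lets the flow be exercised with fakes before the real WASM runtime is wired in.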
3. Terminal integration
Keep the existing terminal send path:
- reuse sendStdin(...) in WebTerminal
- keep current voice button/status UI shell
- replace only the recognition backend
Implementation Phases
Phase 1. Remove the current MoonshineJS dependency path
- remove @moonshine-ai/moonshine-js import and startup logic
- remove moonshine-js.d.ts
- remove MoonshineJS-specific error capture code
- keep the voice UI container, button, and status area
Deliverable:
- terminal still builds
- voice button exists but uses a new backend abstraction
Phase 2. Stage sherpa-onnx assets locally
- choose exact sherpa-onnx JS/WASM package/version
- download/copy the required runtime assets into webterm/static/
- unpack ~/medium-streaming-en.zip into repo-managed static model directory
- verify final model path layout expected by sherpa-onnx
- ensure static assets are available both in dev mode and embedded mode
Deliverable:
- all WASM/model files are served locally from /static/...
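One way to verify the Phase 2 deliverable is to probe every expected model URL and report the ones that are not served. The sketch below assumes the files are served exactly as unzipped, which is still an open question in this plan; `modelAssetUrls`, `findMissingAssets`, and the `HeadFn` type are hypothetical helpers, and the base path is the candidate directory named under "Possible new files".

```typescript
// Model file names as listed for the Moonshine v2 bundle.
const V2_MODEL_FILES: readonly string[] = [
  "frontend.ort",
  "encoder.ort",
  "decoder_kv.ort",
  "adapter.ort",
  "cross_kv.ort",
  "tokenizer.bin",
  "streaming_config.json",
];

// Builds one URL per model file under the given base path.
function modelAssetUrls(
  baseUrl: string,
  files: readonly string[] = V2_MODEL_FILES,
): string[] {
  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  return files.map((file) => base + file);
}

// Injected probe so the check can run outside a browser with a fake.
type HeadFn = (url: string) => Promise<{ ok: boolean }>;

// Returns the URLs whose probe did not succeed.
async function findMissingAssets(
  urls: readonly string[],
  head: HeadFn,
): Promise<string[]> {
  const results = await Promise.all(
    urls.map(async (url) => ({ url, ok: (await head(url)).ok })),
  );
  return results.filter((r) => !r.ok).map((r) => r.url);
}
```

In the browser, `head` could be `(url) => fetch(url, { method: "HEAD" })`, with the URLs built from `/static/models/moonshine-v2-medium`. Running the same check in dev mode and embedded mode would cover both serving paths.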
Phase 3. Build a thin browser voice adapter
- create a dedicated module that wraps sherpa-onnx initialization
- define a minimal interface:
  - start()
  - stop()
  - isActive()
  - callbacks for:
    - status updates
    - final transcript
    - detailed errors
- keep terminal code from depending directly on low-level sherpa objects
Deliverable:
- one self-contained browser voice adapter module
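The minimal interface from this phase might look like the sketch below. It is one possible shape, not a settled design: `VoiceEngine` is a hypothetical wrapper around the sherpa-onnx runtime (no real sherpa-onnx calls appear here), and `createVoiceAdapter`, `VoiceCallbacks`, and the status strings are illustrative names.

```typescript
interface VoiceCallbacks {
  onStatus(status: string): void;
  onTranscript(text: string): void;
  onError(error: Error): void;
}

interface VoiceAdapter {
  start(): Promise<void>;
  stop(): void;
  isActive(): boolean;
}

// Hypothetical low-level backend; the terminal never sees this type.
interface VoiceEngine {
  // Loads runtime/model if needed and starts delivering transcripts.
  open(onSegmentText: (text: string) => void): Promise<void>;
  close(): void;
}

function createVoiceAdapter(
  engine: VoiceEngine,
  callbacks: VoiceCallbacks,
): VoiceAdapter {
  let phase: "idle" | "starting" | "active" = "idle";
  return {
    async start(): Promise<void> {
      if (phase !== "idle") return; // ignore repeated clicks while loading
      phase = "starting";
      callbacks.onStatus("loading");
      try {
        await engine.open((text) => callbacks.onTranscript(text));
        phase = "active";
        callbacks.onStatus("listening");
      } catch (err) {
        phase = "idle";
        callbacks.onError(err instanceof Error ? err : new Error(String(err)));
      }
    },
    stop(): void {
      if (phase !== "active") return; // a stop during loading is ignored here
      engine.close();
      phase = "idle";
      callbacks.onStatus("ready");
    },
    isActive: (): boolean => phase === "active",
  };
}
```

With this split, terminal code depends only on `VoiceAdapter` and `VoiceCallbacks`, so the engine can be swapped or faked without touching the UI.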
Phase 4. Wire VAD + recognition into the terminal UI
- connect voice button click to adapter start/stop
- show clear states:
  - ready
  - loading runtime
  - loading model
  - listening
  - processing speech
  - final transcript sent
  - detailed error
- push final transcript into sendStdin(...)
- decide whether to append newline automatically
Open question:
- should transcript be inserted as raw text only, or raw text plus \r?
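Until that question is settled, both behaviors can sit behind a single switch. A minimal sketch, assuming `sendStdin` accepts one string (its real signature in WebTerminal is not shown in this plan); `CommitMode` and `commitTranscript` are hypothetical names.

```typescript
// "text": insert the transcript only.
// "text-enter": insert the transcript followed by a carriage return.
type CommitMode = "text" | "text-enter";

// Sends a trimmed transcript through the given stdin callback.
// Returns false when there was nothing to send.
function commitTranscript(
  transcript: string,
  mode: CommitMode,
  sendStdin: (data: string) => void,
): boolean {
  const text = transcript.trim();
  if (text.length === 0) return false; // never send a bare "\r"
  sendStdin(mode === "text-enter" ? text + "\r" : text);
  return true;
}
```

Starting with "text" as the default is the more conservative choice, because a misrecognized command is not executed until the user presses Enter.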
Deliverable:
- end-to-end browser speech to terminal input using sherpa-onnx
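The UI states listed in this phase can be modeled as a closed union so that the status area cannot show a state the adapter does not define. This is a sketch; the state identifiers and label strings are illustrative, not existing code.

```typescript
// One identifier per state listed in Phase 4.
type VoiceUiState =
  | "ready"
  | "loading-runtime"
  | "loading-model"
  | "listening"
  | "processing"
  | "sent"
  | "error";

// Record<...> makes the compiler reject a missing or misspelled state.
const VOICE_STATE_LABELS: Record<VoiceUiState, string> = {
  "ready": "Ready",
  "loading-runtime": "Loading runtime",
  "loading-model": "Loading model",
  "listening": "Listening",
  "processing": "Processing speech",
  "sent": "Transcript sent",
  "error": "Error",
};

// Builds the status text; the optional detail carries error specifics.
function describeVoiceState(state: VoiceUiState, detail?: string): string {
  const label = VOICE_STATE_LABELS[state];
  return detail ? `${label}: ${detail}` : label;
}
```

For example, `describeVoiceState("error", "encoder.ort returned 404")` yields a message specific enough to satisfy the "detailed error" requirement.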
Phase 5. Performance and caching
- confirm browser HTTP caching behavior for WASM and model files
- add long-lived cache headers if needed for self-hosted static assets
- optionally add a service worker pre-cache later
- measure first-load vs repeat-load experience
Deliverable:
- repeat visits avoid re-downloading large model assets where possible
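If HTTP caching alone proves insufficient, the optional pre-cache step could use the browser Cache API. The sketch below is written against a small structural interface so it can be tried with a fake; `CacheLike` and `precacheAssets` are hypothetical names, while `match` and `add` mirror methods of the standard `Cache` object.

```typescript
// The subset of the Cache API this sketch relies on.
interface CacheLike {
  match(url: string): Promise<unknown>;
  add(url: string): Promise<void>;
}

// Adds each URL that is not cached yet and returns the URLs it fetched.
// Assets are fetched one at a time to avoid many large parallel downloads.
async function precacheAssets(
  cache: CacheLike,
  urls: readonly string[],
): Promise<string[]> {
  const fetched: string[] = [];
  for (const url of urls) {
    if ((await cache.match(url)) !== undefined) continue; // already cached
    await cache.add(url);
    fetched.push(url);
  }
  return fetched;
}
```

In a browser this could be called as `precacheAssets(await caches.open("voice-assets-v1"), urls)`, where the cache name is an arbitrary example. An empty result on a repeat visit is a quick signal that large model files were not downloaded again.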
Phase 6. Documentation and deploy
- document required model files and where they live
- document browser requirements
- document secure-origin requirement for microphone access
- document how to update the model bundle in future
Deliverable:
- README instructions for install, deploy, and troubleshooting
Technical Questions To Resolve Early
- Which sherpa-onnx JS/WASM distribution should we use?
  - npm package
  - vendored release bundle
  - custom copied example assets
- Which exact browser API shape should we target?
  - direct sherpa recognizer API
  - sherpa VAD + non-streaming ASR helper
  - example-derived wrapper from sherpa demos
- What is the expected asset layout for the Moonshine v2 medium zip?
  - whether files can be served exactly as unzipped
  - whether any extra config or renamed paths are required
- What transcript commit behavior do we want?
  - send text only
  - send text plus Enter
  - configurable mode
Risks
Runtime/API mismatch
Risk:
- sherpa-onnx JS APIs may differ from the examples we choose
Mitigation:
- lock to one verified release
- copy a known working browser example shape before adapting
Large asset size
Risk:
- medium model load time may be high on first use
Mitigation:
- self-host locally
- keep caching aggressive
- consider fallback option for smaller model later
Mobile/browser compatibility
Risk:
- WASM + large model + microphone flow may be poor on weaker browsers
Mitigation:
- treat desktop Chromium/Firefox as first target
- gate unsupported browsers with explicit errors
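Gating unsupported browsers can start with three capability checks: secure context, microphone API, and WebAssembly. The sketch below separates probing from decision-making so the messages can be tested without a browser; `VoiceEnvironment`, `probeVoiceEnvironment`, and `unsupportedReasons` are hypothetical names, and the message wording is only an example.

```typescript
interface VoiceEnvironment {
  isSecureContext: boolean;
  hasGetUserMedia: boolean;
  hasWebAssembly: boolean;
}

// Reads capabilities from the global object without assuming DOM typings.
function probeVoiceEnvironment(): VoiceEnvironment {
  const g = globalThis as unknown as {
    isSecureContext?: boolean;
    navigator?: { mediaDevices?: { getUserMedia?: unknown } };
    WebAssembly?: unknown;
  };
  return {
    isSecureContext: g.isSecureContext === true,
    hasGetUserMedia:
      typeof g.navigator?.mediaDevices?.getUserMedia === "function",
    hasWebAssembly: typeof g.WebAssembly === "object",
  };
}

// Returns one explicit message per missing capability (empty when supported).
function unsupportedReasons(env: VoiceEnvironment): string[] {
  const reasons: string[] = [];
  if (!env.isSecureContext) {
    reasons.push("Microphone access requires a secure origin (HTTPS or localhost).");
  }
  if (!env.hasGetUserMedia) {
    reasons.push("This browser does not expose navigator.mediaDevices.getUserMedia.");
  }
  if (!env.hasWebAssembly) {
    reasons.push("This browser does not support WebAssembly.");
  }
  return reasons;
}
```

The voice button could call `unsupportedReasons(probeVoiceEnvironment())` before loading any assets and show the returned messages in the status area. Passing these checks does not guarantee acceptable performance with a medium model on weaker devices.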
Current worktree noise
Risk:
- repo already has ongoing voice-related frontend edits
Mitigation:
- isolate the new adapter into its own file/module
- keep migration incremental
Validation Checklist
- bun run typecheck passes
- frontend bundle builds
- ./update.sh deploys successfully
- remote HTTPS origin still permits microphone access
- first transcription works end-to-end
- repeated transcriptions do not leak memory or duplicate recognizers
- page reload reuses cached assets when possible
- detailed runtime errors are visible in UI and console
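The "duplicate recognizers" item is easier to satisfy if recognizer creation goes through a single memoized loader. A minimal sketch of that pattern, where `loadOnce` is a hypothetical helper and the factory stands in for whatever creates the sherpa-onnx recognizer:

```typescript
// Wraps an async factory so that concurrent and repeated calls share one
// instance. A failed load clears the memo so a later call can retry.
function loadOnce<T>(factory: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return (): Promise<T> => {
    if (pending === undefined) {
      pending = factory().catch((err: unknown) => {
        pending = undefined; // allow a retry after a failed load
        throw err;
      });
    }
    return pending;
  };
}
```

Every start/stop cycle of the voice button would then call the same getter, so repeated use reuses the loaded model rather than allocating a new one each time.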
Recommended Execution Order
- Pick and verify one sherpa-onnx browser example for Moonshine v2
- Vendor the required WASM/runtime assets
- Unpack and serve the local model bundle
- Build a dedicated browser adapter module
- Rewire the existing voice UI to that adapter
- Validate microphone -> transcript -> terminal stdin flow
- Improve caching and docs
Definition of Done
- no moonshine-js dependency remains in the browser path
- voice input uses sherpa-onnx + locally hosted Moonshine v2 assets
- transcripts reach terminal stdin reliably
- remote HTTPS access works
- runtime/model errors are understandable
- setup is documented and reproducible