sherpa-onnx Moonshine v2 Migration Plan

Goal

Replace the current @moonshine-ai/moonshine-js browser integration with a sherpa-onnx WebAssembly integration that can run Moonshine v2 locally in the browser and feed recognized text into the existing terminal stdin path.

The target behavior is:

  • user clicks a voice button in the terminal UI
  • browser captures microphone audio
  • VAD segments speech locally
  • Moonshine v2 recognition runs locally in WASM
  • committed transcript is sent into the existing terminal input path

Why We Are Changing Direction

The current MoonshineJS integration is the wrong runtime for the medium-streaming-en bundle we want to use.

Current blockers:

  • moonshine-js expects old-style ONNX assets such as:
    • quantized/encoder_model.onnx
    • quantized/decoder_model_merged.onnx
  • the downloaded medium-streaming-en.zip contains a newer Moonshine v2 layout:
    • frontend.ort
    • encoder.ort
    • decoder_kv.ort
    • adapter.ort
    • cross_kv.ort
    • tokenizer.bin
    • streaming_config.json
  • self-hosting those files alone does not make the existing moonshine-js loader compatible
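The layout mismatch above can be checked mechanically before any loader is wired up. The following is a minimal sketch, not part of either library: `detectBundleLayout` and `missingV2Files` are hypothetical helper names, and the file lists are copied from the two layouts described above.

```typescript
// File names copied from the two layouts listed in this plan.
const LEGACY_FILES: readonly string[] = [
  "quantized/encoder_model.onnx",
  "quantized/decoder_model_merged.onnx",
];
const V2_FILES: readonly string[] = [
  "frontend.ort",
  "encoder.ort",
  "decoder_kv.ort",
  "adapter.ort",
  "cross_kv.ort",
  "tokenizer.bin",
  "streaming_config.json",
];

type BundleLayout = "moonshine-js" | "moonshine-v2" | "unknown";

// Hypothetical helper: classify a directory listing (paths relative to the bundle root).
function detectBundleLayout(files: readonly string[]): BundleLayout {
  const present = new Set(files);
  const hasAll = (required: readonly string[]): boolean =>
    required.every((name) => present.has(name));
  if (hasAll(V2_FILES)) return "moonshine-v2";
  if (hasAll(LEGACY_FILES)) return "moonshine-js";
  return "unknown";
}

// Hypothetical helper: report which v2 files are absent, useful for error messages.
function missingV2Files(files: readonly string[]): string[] {
  const present = new Set(files);
  return V2_FILES.filter((name) => !present.has(name));
}
```

Running this against the unzipped bundle directory gives an early, explicit signal of which runtime the assets are meant for.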

Constraints

  • keep transcription fully local in the browser
  • keep the current terminal WebSocket/stdin path unchanged if possible
  • preserve the small voice button UX already added to webterm/static/js/terminal.ts
  • prefer self-hosted assets served by webterm/static/
  • do not depend on third-party model/CDN availability at runtime

Repo Touchpoints

Expected files/modules to change:

  • webterm/static/js/terminal.ts
  • webterm/static/js/terminal.js
  • package.json
  • bun.lock
  • package-lock.json
  • webterm/assets_embed.go
  • webterm/static/... for new WASM/model assets
  • README.md

Possible new files:

  • webterm/static/js/sherpa-voice.ts or similar
  • webterm/static/js/sherpa-onnx.d.ts
  • webterm/static/models/moonshine-v2-medium/...
  • scripts/install-sherpa-model.sh or similar helper

Proposed Architecture

1. Frontend runtime

Use sherpa-onnx JavaScript/WebAssembly in the browser instead of moonshine-js.

Primary responsibilities:

  • initialize sherpa WASM runtime
  • load self-hosted Moonshine v2 model files
  • initialize VAD + ASR pipeline
  • expose simple start/stop API to terminal UI
  • emit committed transcript strings

2. Audio flow

Use the VAD + offline (non-streaming) ASR flow described by sherpa-onnx:

  • microphone input
  • VAD detects speech boundaries
  • finalized speech segment is sent to recognizer
  • recognizer returns transcript
  • transcript is injected into terminal stdin
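The flow above can be sketched as a small pipeline class. This is an illustration under stated assumptions: `SpeechSegmenter` and `SegmentRecognizer` are hypothetical interfaces standing in for whatever VAD and recognizer objects the chosen sherpa-onnx build exposes, and they are not the library's actual API.

```typescript
// Hypothetical stand-in for a VAD: accepts mono PCM chunks and returns
// any speech segments that this chunk finalized (often none).
interface SpeechSegmenter {
  accept(samples: Float32Array): Float32Array[];
}

// Hypothetical stand-in for a non-streaming recognizer.
interface SegmentRecognizer {
  transcribe(segment: Float32Array): Promise<string>;
}

class VoicePipeline {
  private readonly segmenter: SpeechSegmenter;
  private readonly recognizer: SegmentRecognizer;
  private readonly onTranscript: (text: string) => void;
  private readonly onError: (error: unknown) => void;
  // Segments are decoded one at a time so transcripts arrive in spoken order.
  private queue: Promise<void> = Promise.resolve();

  constructor(
    segmenter: SpeechSegmenter,
    recognizer: SegmentRecognizer,
    onTranscript: (text: string) => void,
    onError: (error: unknown) => void,
  ) {
    this.segmenter = segmenter;
    this.recognizer = recognizer;
    this.onTranscript = onTranscript;
    this.onError = onError;
  }

  // Feed one microphone chunk; resolves once all queued segments are decoded.
  pushAudio(samples: Float32Array): Promise<void> {
    for (const segment of this.segmenter.accept(samples)) {
      this.queue = this.queue
        .then(async () => {
          const text = (await this.recognizer.transcribe(segment)).trim();
          if (text.length > 0) this.onTranscript(text);
        })
        .catch((error) => this.onError(error)); // keep the queue usable after a failure
    }
    return this.queue;
  }
}
```

In the terminal, `onTranscript` would be the callback that forwards text to stdin; the pipeline itself stays unaware of the terminal.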

3. Terminal integration

Keep the existing terminal send path:

  • reuse sendStdin(...) in WebTerminal
  • keep current voice button/status UI shell
  • replace only the recognition backend

Implementation Phases

Phase 1. Remove the current MoonshineJS dependency path

  • remove @moonshine-ai/moonshine-js import and startup logic
  • remove moonshine-js.d.ts
  • remove MoonshineJS-specific error capture code
  • keep the voice UI container, button, and status area

Deliverable:

  • terminal still builds
  • voice button exists but uses a new backend abstraction

Phase 2. Stage sherpa-onnx assets locally

  • choose exact sherpa-onnx JS/WASM package/version
  • download/copy the required runtime assets into webterm/static/
  • unpack ~/medium-streaming-en.zip into repo-managed static model directory
  • verify final model path layout expected by sherpa-onnx
  • ensure static assets are available both in dev mode and embedded mode

Deliverable:

  • all WASM/model files are served locally from /static/...

Phase 3. Build a thin browser voice adapter

  • create a dedicated module that wraps sherpa-onnx initialization
  • define a minimal interface:
    • start()
    • stop()
    • isActive()
    • callbacks for:
      • status updates
      • final transcript
      • detailed errors
  • keep terminal code from depending directly on low-level sherpa objects

Deliverable:

  • one self-contained browser voice adapter module
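One possible shape for that interface is sketched below. All names here (`VoiceAdapter`, `VoiceBackend`, `createVoiceAdapter`, the status strings) are hypothetical; `VoiceBackend` in particular is a placeholder for the sherpa-onnx wrapper, not its real API. The status values loosely mirror the UI states planned for Phase 4.

```typescript
type VoiceStatus =
  | "ready" | "loading-runtime" | "loading-model"
  | "listening" | "processing" | "sent" | "error";

interface VoiceAdapterCallbacks {
  onStatus(status: VoiceStatus): void;
  onTranscript(text: string): void;
  onError(error: Error): void;
}

interface VoiceAdapter {
  start(): Promise<void>;
  stop(): void;
  isActive(): boolean;
}

// Hypothetical low-level backend; the only place that would touch sherpa objects.
interface VoiceBackend {
  load(): Promise<void>;
  beginCapture(onTranscript: (text: string) => void): Promise<void>;
  endCapture(): void;
}

function createVoiceAdapter(
  backend: VoiceBackend,
  callbacks: VoiceAdapterCallbacks,
): VoiceAdapter {
  let active = false;
  let starting = false; // guards against double clicks while loading

  return {
    async start(): Promise<void> {
      if (active || starting) return;
      starting = true;
      try {
        callbacks.onStatus("loading-model");
        await backend.load();
        await backend.beginCapture((text) => callbacks.onTranscript(text));
        active = true;
        callbacks.onStatus("listening");
      } catch (err) {
        callbacks.onStatus("error");
        callbacks.onError(err instanceof Error ? err : new Error(String(err)));
      } finally {
        starting = false;
      }
    },
    stop(): void {
      if (!active) return;
      backend.endCapture();
      active = false;
      callbacks.onStatus("ready");
    },
    isActive: (): boolean => active,
  };
}
```

Because the terminal only sees `VoiceAdapter` and the callbacks, the backend can be swapped or mocked without touching terminal code.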

Phase 4. Wire VAD + recognition into the terminal UI

  • connect voice button click to adapter start/stop
  • show clear states:
    • ready
    • loading runtime
    • loading model
    • listening
    • processing speech
    • final transcript sent
    • detailed error
  • push final transcript into sendStdin(...)
  • decide whether to append newline automatically

Open question:

  • should transcript be inserted as raw text only, or raw text plus \r?
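Until that question is settled, the choice can be isolated behind a single function so that switching modes is a one-line change. The sketch below is illustrative: `CommitMode` and `formatTranscriptForStdin` are hypothetical names, and collapsing internal whitespace is an assumption about what is safe to send to a shell.

```typescript
type CommitMode = "text-only" | "text-plus-enter";

// Hypothetical helper: normalize recognizer output before it reaches stdin.
function formatTranscriptForStdin(transcript: string, mode: CommitMode): string {
  // Collapse all whitespace (including newlines) to single spaces so the
  // recognizer output cannot inject an unintended Enter on its own.
  const text = transcript.replace(/\s+/g, " ").trim();
  if (text.length === 0) return ""; // nothing recognized: send nothing
  return mode === "text-plus-enter" ? text + "\r" : text;
}
```

Assuming `sendStdin(...)` accepts a string, the call site would then look like `this.sendStdin(formatTranscriptForStdin(text, mode))`, skipped when the result is empty.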

Deliverable:

  • end-to-end browser speech to terminal input using sherpa-onnx

Phase 5. Performance and caching

  • confirm browser HTTP caching behavior for WASM and model files
  • add long-lived cache headers if needed for self-hosted static assets
  • optionally add a service worker pre-cache later
  • measure first-load vs repeat-load experience

Deliverable:

  • repeat visits avoid re-downloading large model assets where possible
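A caching policy along these lines could be expressed as a small decision function. It is written in TypeScript here for consistency with the frontend, although the headers would actually be set by the server. The extension list, the one-year `max-age`, and the `versioned` flag are assumptions, not settled decisions; `immutable` is only safe when the asset URL changes whenever its content changes.

```typescript
// Assumption: these extensions cover the large WASM/model assets in this plan.
const LARGE_ASSET_EXTENSIONS: readonly string[] = [".wasm", ".ort", ".bin"];

// Hypothetical helper: pick a Cache-Control value for a static asset path.
// `versioned` means the URL embeds a version or content hash.
function cacheControlFor(path: string, versioned: boolean): string {
  const isLargeAsset = LARGE_ASSET_EXTENSIONS.some((ext) => path.endsWith(ext));
  if (isLargeAsset && versioned) {
    // Safe to cache for a year: a new model ships under a new URL.
    return "public, max-age=31536000, immutable";
  }
  // Everything else may be stored but must be revalidated; an unchanged
  // file then costs a 304 response instead of a full download, provided
  // the server sends an ETag or Last-Modified header.
  return "no-cache";
}
```

Even the conservative `no-cache` branch avoids re-downloading unchanged model files, which covers the deliverable above when versioned URLs are not available.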

Phase 6. Documentation and deploy

  • document required model files and where they live
  • document browser requirements
  • document secure-origin requirement for microphone access
  • document how to update the model bundle in future

Deliverable:

  • README instructions for install, deploy, and troubleshooting

Technical Questions To Resolve Early

  1. Which sherpa-onnx JS/WASM distribution should we use?
     • npm package
     • vendored release bundle
     • custom copied example assets
  2. Which exact browser API shape should we target?
     • direct sherpa recognizer API
     • sherpa VAD + non-streaming ASR helper
     • example-derived wrapper from sherpa demos
  3. What is the expected asset layout for the Moonshine v2 medium zip?
     • whether files can be served exactly as unzipped
     • whether any extra config or renamed paths are required
  4. What transcript commit behavior do we want?
     • send text only
     • send text plus Enter
     • configurable mode

Risks

Runtime/API mismatch

Risk:

  • sherpa-onnx JS APIs may differ from the examples we choose

Mitigation:

  • lock to one verified release
  • copy a known working browser example shape before adapting

Large asset size

Risk:

  • medium model load time may be high on first use

Mitigation:

  • self-host locally
  • keep caching aggressive
  • consider offering a smaller model as a fallback later

Mobile/browser compatibility

Risk:

  • WASM + large model + microphone flow may perform poorly on weaker devices and less capable browsers

Mitigation:

  • treat desktop Chromium/Firefox as first target
  • gate unsupported browsers with explicit errors
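The explicit gate could look like the sketch below. `VoiceEnvironment` and `findVoiceBlockers` are hypothetical names; the environment is passed in as plain booleans so the check stays testable outside a browser.

```typescript
// Hypothetical snapshot of the browser capabilities the voice path needs.
interface VoiceEnvironment {
  isSecureContext: boolean; // microphone access requires HTTPS or localhost
  hasWebAssembly: boolean;
  hasGetUserMedia: boolean;
}

// Returns human-readable reasons voice input cannot start; empty means OK.
function findVoiceBlockers(env: VoiceEnvironment): string[] {
  const blockers: string[] = [];
  if (!env.isSecureContext) {
    blockers.push("Microphone access requires a secure origin (HTTPS or localhost).");
  }
  if (!env.hasWebAssembly) {
    blockers.push("This browser does not support WebAssembly.");
  }
  if (!env.hasGetUserMedia) {
    blockers.push("This browser does not expose microphone capture (getUserMedia).");
  }
  return blockers;
}

// In the browser, the snapshot could be built like this (left as a comment
// so the block does not depend on DOM globals):
//   const env: VoiceEnvironment = {
//     isSecureContext: window.isSecureContext,
//     hasWebAssembly: typeof WebAssembly === "object",
//     hasGetUserMedia: !!navigator.mediaDevices?.getUserMedia,
//   };
```

Showing every blocker at once, rather than failing on the first, makes the error state in the voice UI more actionable.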

Current worktree noise

Risk:

  • repo already has ongoing voice-related frontend edits

Mitigation:

  • isolate the new adapter into its own file/module
  • keep migration incremental

Validation Checklist

  • bun run typecheck passes
  • frontend bundle builds
  • ./update.sh deploys successfully
  • remote HTTPS origin still permits microphone access
  • first transcription works end-to-end
  • repeated transcriptions do not leak memory or duplicate recognizers
  • page reload reuses cached assets when possible
  • detailed runtime errors are visible in UI and console
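For the "duplicate recognizers" item in the checklist above, one common guard is to create the recognizer lazily and share a single in-flight promise. This is a generic sketch; `createLazyLoader` is a hypothetical helper and `load` stands in for whatever actually builds the recognizer.

```typescript
// Wraps an expensive async factory so it runs at most once at a time and
// its result is shared by every caller.
function createLazyLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return (): Promise<T> => {
    if (pending === null) {
      pending = load().catch((error) => {
        pending = null; // forget the failure so a later click can retry
        throw error;
      });
    }
    return pending;
  };
}
```

Every click on the voice button would call the returned function; only the first call (or the first after a failure) triggers a real load.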

Suggested Order of Work

  1. Pick and verify one sherpa-onnx browser example for Moonshine v2
  2. Vendor the required WASM/runtime assets
  3. Unpack and serve the local model bundle
  4. Build a dedicated browser adapter module
  5. Rewire the existing voice UI to that adapter
  6. Validate microphone -> transcript -> terminal stdin flow
  7. Improve caching and docs

Definition of Done

  • no moonshine-js dependency remains in the browser path
  • voice input uses sherpa-onnx + locally hosted Moonshine v2 assets
  • transcripts reach terminal stdin reliably
  • remote HTTPS access works
  • runtime/model errors are understandable
  • setup is documented and reproducible