# sherpa-onnx Moonshine v2 Migration Plan
## Goal
Replace the current `@moonshine-ai/moonshine-js` browser integration with a `sherpa-onnx` WebAssembly integration that can run Moonshine v2 locally in the browser and feed recognized text into the existing terminal stdin path.
The target behavior is:
- user clicks a voice button in the terminal UI
- browser captures microphone audio
- VAD segments speech locally
- Moonshine v2 recognition runs locally in WASM
- committed transcript is sent into the existing terminal input path
## Why We Are Changing Direction
The current MoonshineJS integration is the wrong runtime for the `medium-streaming-en` bundle we want to use.
Current blockers:
- `moonshine-js` expects old-style ONNX assets such as:
  - `quantized/encoder_model.onnx`
  - `quantized/decoder_model_merged.onnx`
- the downloaded `medium-streaming-en.zip` contains a newer Moonshine v2 layout:
  - `frontend.ort`
  - `encoder.ort`
  - `decoder_kv.ort`
  - `adapter.ort`
  - `cross_kv.ort`
  - `tokenizer.bin`
  - `streaming_config.json`
- self-hosting those files alone does not make the existing `moonshine-js` loader compatible
## Constraints
- keep transcription fully local in the browser
- keep the current terminal WebSocket/stdin path unchanged if possible
- preserve the small voice button UX already added to `webterm/static/js/terminal.ts`
- prefer self-hosted assets served by `webterm/static/`
- do not depend on third-party model/CDN availability at runtime
## Repo Touchpoints
Expected files/modules to change:
- `webterm/static/js/terminal.ts`
- `webterm/static/js/terminal.js`
- `package.json`
- `bun.lock`
- `package-lock.json`
- `webterm/assets_embed.go`
- `webterm/static/...` for new WASM/model assets
- `README.md`
Possible new files:
- `webterm/static/js/sherpa-voice.ts` or similar
- `webterm/static/js/sherpa-onnx.d.ts`
- `webterm/static/models/moonshine-v2-medium/...`
- `scripts/install-sherpa-model.sh` or similar helper
## Proposed Architecture
### 1. Frontend runtime
Use `sherpa-onnx` JavaScript/WebAssembly in the browser instead of `moonshine-js`.
Primary responsibilities:
- initialize sherpa WASM runtime
- load self-hosted Moonshine v2 model files
- initialize VAD + ASR pipeline
- expose simple start/stop API to terminal UI
- emit committed transcript strings
### 2. Audio flow
Use the VAD + offline ASR flow described by sherpa-onnx:
- microphone input
- VAD detects speech boundaries
- finalized speech segment is sent to recognizer
- recognizer returns transcript
- transcript is injected into terminal stdin
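
The flow above can be sketched as a single per-chunk function. This is a minimal sketch under stated assumptions: `SpeechSegmenter`, `SegmentRecognizer`, and `processChunk` are hypothetical names invented for illustration, standing in for whatever VAD and recognizer objects the chosen sherpa-onnx build actually exposes.

```typescript
// Hypothetical stand-ins for the real sherpa-onnx VAD and recognizer objects.
interface SpeechSegmenter {
  // Accepts a chunk of mono PCM samples and returns any speech segments
  // that were finalized by this chunk (possibly none).
  accept(samples: Float32Array): Float32Array[];
}

interface SegmentRecognizer {
  // Runs recognition on one finalized segment and returns its transcript.
  recognize(segment: Float32Array): string;
}

// Feeds one microphone chunk through VAD, recognizes each finalized segment,
// and forwards non-empty transcripts. Returns how many transcripts were sent.
function processChunk(
  chunk: Float32Array,
  vad: SpeechSegmenter,
  asr: SegmentRecognizer,
  sendStdin: (text: string) => void,
): number {
  let sent = 0;
  for (const segment of vad.accept(chunk)) {
    const text = asr.recognize(segment).trim();
    if (text.length === 0) continue; // skip silence or empty recognitions
    sendStdin(text);
    sent += 1;
  }
  return sent;
}
```

Keeping the function free of browser and sherpa types means the wiring can be exercised with fakes before the real runtime is chosen.
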
### 3. Terminal integration
Keep the existing terminal send path:
- reuse `sendStdin(...)` in `WebTerminal`
- keep current voice button/status UI shell
- replace only the recognition backend
## Implementation Phases
### Phase 1. Remove the current MoonshineJS dependency path
- remove `@moonshine-ai/moonshine-js` import and startup logic
- remove `moonshine-js.d.ts`
- remove MoonshineJS-specific error capture code
- keep the voice UI container, button, and status area
Deliverable:
- terminal still builds
- voice button exists but uses a new backend abstraction
### Phase 2. Stage sherpa-onnx assets locally
- choose exact sherpa-onnx JS/WASM package/version
- download/copy the required runtime assets into `webterm/static/`
- unpack `~/medium-streaming-en.zip` into a repo-managed static model directory
- verify final model path layout expected by sherpa-onnx
- ensure static assets are available both in dev mode and embedded mode
Deliverable:
- all WASM/model files are served locally from `/static/...`
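
Checking that the unpacked bundle is complete could be scripted along these lines. This is a sketch only: the file names come from the zip listing earlier in this plan, while `REQUIRED_V2_FILES` and `missingModelFiles` are hypothetical names. It confirms that the bundle is complete, not that sherpa-onnx accepts this layout, which is still an open question.

```typescript
// File names taken from the medium-streaming-en.zip listing in this plan.
const REQUIRED_V2_FILES: readonly string[] = [
  "frontend.ort",
  "encoder.ort",
  "decoder_kv.ort",
  "adapter.ort",
  "cross_kv.ort",
  "tokenizer.bin",
  "streaming_config.json",
];

// Returns the required files that are absent from a directory listing.
function missingModelFiles(present: readonly string[]): string[] {
  const have = new Set(present);
  return REQUIRED_V2_FILES.filter((name) => !have.has(name));
}
```

An empty result means every listed file is present in the staged directory.
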
### Phase 3. Build a thin browser voice adapter
- create a dedicated module that wraps sherpa-onnx initialization
- define a minimal interface:
  - `start()`
  - `stop()`
  - `isActive()`
  - callbacks for:
    - status updates
    - final transcript
    - detailed errors
- keep terminal code from depending directly on low-level sherpa objects
Deliverable:
- one self-contained browser voice adapter module
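
One possible shape for that interface is sketched below. Only `start()`, `stop()`, and `isActive()` come from this plan. `VoiceBackend`, `VoiceAdapterCallbacks`, and `createVoiceAdapter` are hypothetical names, and the backend stands in for the sherpa-onnx wiring that has not been chosen yet.

```typescript
interface VoiceAdapterCallbacks {
  onStatus(status: string): void;
  onTranscript(text: string): void;
  onError(error: Error): void;
}

interface VoiceAdapter {
  start(): Promise<void>;
  stop(): void;
  isActive(): boolean;
}

// Hypothetical: hides runtime loading, microphone capture, VAD, and ASR.
interface VoiceBackend {
  open(onText: (text: string) => void): Promise<void>;
  close(): void;
}

function createVoiceAdapter(
  backend: VoiceBackend,
  cb: VoiceAdapterCallbacks,
): VoiceAdapter {
  let state: "idle" | "starting" | "active" = "idle";
  return {
    async start(): Promise<void> {
      if (state !== "idle") return; // avoid opening a second recognizer
      state = "starting";
      cb.onStatus("loading");
      try {
        await backend.open((text) => cb.onTranscript(text));
        state = "active";
        cb.onStatus("listening");
      } catch (err) {
        state = "idle";
        cb.onError(err instanceof Error ? err : new Error(String(err)));
      }
    },
    stop(): void {
      // A stop requested while still starting is ignored in this sketch.
      if (state !== "active") return;
      backend.close();
      state = "idle";
      cb.onStatus("ready");
    },
    isActive: () => state === "active",
  };
}
```

With this split, `terminal.ts` would only see `VoiceAdapter` and the callbacks, never the low-level sherpa objects.
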
### Phase 4. Wire VAD + recognition into the terminal UI
- connect voice button click to adapter start/stop
- show clear states:
  - ready
  - loading runtime
  - loading model
  - listening
  - processing speech
  - final transcript sent
  - detailed error
- push final transcript into `sendStdin(...)`
- decide whether to append newline automatically
Open question:
- should transcript be inserted as raw text only, or raw text plus `\r`?
Deliverable:
- end-to-end browser speech to terminal input using sherpa-onnx
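
The UI states and the open newline question could be modelled as below. This is a sketch rather than a decision: `VoiceStatus`, `STATUS_LABELS`, `CommitMode`, and `formatTranscript` are hypothetical names, and the `"text-enter"` mode simply illustrates the "raw text plus `\r`" option.

```typescript
// One member per UI state listed in this phase.
type VoiceStatus =
  | "ready"
  | "loading-runtime"
  | "loading-model"
  | "listening"
  | "processing"
  | "sent"
  | "error";

const STATUS_LABELS: Record<VoiceStatus, string> = {
  "ready": "Ready",
  "loading-runtime": "Loading runtime",
  "loading-model": "Loading model",
  "listening": "Listening",
  "processing": "Processing speech",
  "sent": "Final transcript sent",
  "error": "Error",
};

// "text" sends the transcript only; "text-enter" appends a carriage return.
type CommitMode = "text" | "text-enter";

// Produces the string to pass to sendStdin(...); empty input yields "".
function formatTranscript(text: string, mode: CommitMode): string {
  const clean = text.trim();
  if (clean.length === 0) return "";
  return mode === "text-enter" ? clean + "\r" : clean;
}
```

Making the mode an explicit parameter keeps the "configurable mode" option open without changing the call site later.
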
### Phase 5. Performance and caching
- confirm browser HTTP caching behavior for WASM and model files
- add long-lived cache headers if needed for self-hosted static assets
- optionally add a service worker pre-cache later
- measure first-load vs repeat-load experience
Deliverable:
- repeat visits avoid re-downloading large model assets where possible
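
Measuring first-load against repeat-load could use the browser's Resource Timing entries. The sketch below is an assumption-laden illustration: `ResourceSample`, `AssetLoadSummary`, and `summarizeAssetLoads` are hypothetical names, and the suffix list is a guess at which assets matter. For same-origin resources, a `transferSize` of 0 with a non-zero `decodedBodySize` usually indicates a cache hit, while a revalidated response typically reports a small non-zero `transferSize`.

```typescript
// Subset of PerformanceResourceTiming fields used by this sketch.
interface ResourceSample {
  name: string; // resource URL
  transferSize: number; // bytes fetched over the network
  decodedBodySize: number; // size of the body after decoding
}

interface AssetLoadSummary {
  total: number; // matching assets seen
  cached: number; // assets that appear to come from the local cache
  networkBytes: number; // bytes transferred for matching assets
}

// Summarizes how many large voice assets were served without network transfer.
// URLs carrying query strings would need normalizing before the suffix match.
function summarizeAssetLoads(
  entries: readonly ResourceSample[],
  suffixes: readonly string[] = [".wasm", ".ort", ".bin"],
): AssetLoadSummary {
  const relevant = entries.filter((e) =>
    suffixes.some((s) => e.name.endsWith(s)),
  );
  const cached = relevant.filter(
    (e) => e.transferSize === 0 && e.decodedBodySize > 0,
  ).length;
  const networkBytes = relevant.reduce((sum, e) => sum + e.transferSize, 0);
  return { total: relevant.length, cached, networkBytes };
}
```

In the browser, the input would come from `performance.getEntriesByType("resource")`, whose entries carry these fields.
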
### Phase 6. Documentation and deploy
- document required model files and where they live
- document browser requirements
- document secure-origin requirement for microphone access
- document how to update the model bundle in future
Deliverable:
- README instructions for install, deploy, and troubleshooting
## Technical Questions To Resolve Early
1. Which sherpa-onnx JS/WASM distribution should we use?
   - npm package
   - vendored release bundle
   - custom copied example assets
2. Which exact browser API shape should we target?
   - direct sherpa recognizer API
   - sherpa VAD + non-streaming ASR helper
   - example-derived wrapper from sherpa demos
3. What is the expected asset layout for the Moonshine v2 medium zip?
   - whether files can be served exactly as unzipped
   - whether any extra config or renamed paths are required
4. What transcript commit behavior do we want?
   - send text only
   - send text plus Enter
   - configurable mode
## Risks
### Runtime/API mismatch
Risk:
- sherpa-onnx JS APIs may differ from the examples we choose
Mitigation:
- lock to one verified release
- copy a known working browser example shape before adapting
### Large asset size
Risk:
- medium model load time may be high on first use
Mitigation:
- self-host locally
- keep caching aggressive
- consider fallback option for smaller model later
### Mobile/browser compatibility
Risk:
- WASM + large model + microphone flow may be poor on weaker browsers
Mitigation:
- treat desktop Chromium/Firefox as first target
- gate unsupported browsers with explicit errors
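
Gating unsupported browsers with explicit errors might look like the sketch below. `VoiceEnvironment` and `unsupportedReasons` are hypothetical names, and the set of checks is an assumption about what the final pipeline will need.

```typescript
// Capabilities the voice pipeline is assumed to need.
interface VoiceEnvironment {
  isSecureContext: boolean; // microphone access requires a secure origin
  hasGetUserMedia: boolean; // navigator.mediaDevices.getUserMedia is present
  hasWebAssembly: boolean; // the WebAssembly global is present
  hasAudioContext: boolean; // AudioContext is present
}

// Returns one human-readable reason per missing capability.
function unsupportedReasons(env: VoiceEnvironment): string[] {
  const reasons: string[] = [];
  if (!env.isSecureContext) {
    reasons.push("Microphone access requires HTTPS or localhost.");
  }
  if (!env.hasGetUserMedia) {
    reasons.push("This browser does not expose microphone capture.");
  }
  if (!env.hasWebAssembly) {
    reasons.push("This browser does not support WebAssembly.");
  }
  if (!env.hasAudioContext) {
    reasons.push("This browser does not support the Web Audio API.");
  }
  return reasons;
}
```

In the page, the fields would be filled from `window.isSecureContext`, `navigator.mediaDevices?.getUserMedia`, `typeof WebAssembly`, and `typeof AudioContext`. An empty result means the voice button can be enabled.
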
### Current worktree noise
Risk:
- repo already has ongoing voice-related frontend edits
Mitigation:
- isolate the new adapter into its own file/module
- keep migration incremental
## Validation Checklist
- `bun run typecheck` passes
- frontend bundle builds
- `./update.sh` deploys successfully
- remote HTTPS origin still permits microphone access
- first transcription works end-to-end
- repeated transcriptions do not leak memory or duplicate recognizers
- page reload reuses cached assets when possible
- detailed runtime errors are visible in UI and console
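
The "duplicate recognizers" item is easier to satisfy if runtime and model initialization go through a single memoized loader. A minimal sketch, assuming initialization is asynchronous; `lazyOnce` is a hypothetical helper, not part of sherpa-onnx.

```typescript
// Wraps an async loader so it runs at most once at a time. Concurrent and
// later callers share the same promise. A failed load clears the cached
// promise so that a later call can retry.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = load().catch((err) => {
        pending = undefined;
        throw err;
      });
    }
    return pending;
  };
}
```

Repeated clicks on the voice button would then reuse one recognizer instance instead of constructing a new one per transcription.
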
## Recommended Execution Order
1. Pick and verify one sherpa-onnx browser example for Moonshine v2
2. Vendor the required WASM/runtime assets
3. Unpack and serve the local model bundle
4. Build a dedicated browser adapter module
5. Rewire the existing voice UI to that adapter
6. Validate microphone -> transcript -> terminal stdin flow
7. Improve caching and docs
## Definition of Done
- no `moonshine-js` dependency remains in the browser path
- voice input uses sherpa-onnx + locally hosted Moonshine v2 assets
- transcripts reach terminal stdin reliably
- remote HTTPS access works
- runtime/model errors are understandable
- setup is documented and reproducible