# sherpa-onnx Moonshine v2 Migration Plan

## Goal

Replace the current `@moonshine-ai/moonshine-js` browser integration with a `sherpa-onnx` WebAssembly integration that can run Moonshine v2 locally in the browser and feed recognized text into the existing terminal stdin path.

The target behavior is:

- user clicks a voice button in the terminal UI
- browser captures microphone audio
- VAD segments speech locally
- Moonshine v2 recognition runs locally in WASM
- committed transcript is sent into the existing terminal input path

## Why We Are Changing Direction

The current MoonshineJS integration is the wrong runtime for the `medium-streaming-en` bundle we want to use.

Current blockers:

- `moonshine-js` expects old-style ONNX assets such as:
  - `quantized/encoder_model.onnx`
  - `quantized/decoder_model_merged.onnx`
- the downloaded `medium-streaming-en.zip` contains a newer Moonshine v2 layout:
  - `frontend.ort`
  - `encoder.ort`
  - `decoder_kv.ort`
  - `adapter.ort`
  - `cross_kv.ort`
  - `tokenizer.bin`
  - `streaming_config.json`
- self-hosting those files alone does not make the existing `moonshine-js` loader compatible

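The layout mismatch described above can be checked mechanically before any runtime work starts. The following is a minimal sketch, not part of the existing code: `detectBundleLayout` and its types are hypothetical names, the file lists are copied from the bullets above, and the legacy list is only partial because the plan names those two files as examples. Whether sherpa-onnx accepts the v2 files exactly as unzipped is still an open question.

```typescript
// File names copied from the blocker list. LEGACY_FILES is partial:
// the plan only names these two assets as examples.
const V2_FILES: readonly string[] = [
  "frontend.ort",
  "encoder.ort",
  "decoder_kv.ort",
  "adapter.ort",
  "cross_kv.ort",
  "tokenizer.bin",
  "streaming_config.json",
];
const LEGACY_FILES: readonly string[] = [
  "quantized/encoder_model.onnx",
  "quantized/decoder_model_merged.onnx",
];

type BundleLayout = "moonshine-v2" | "legacy-moonshine-js" | "unknown";

interface LayoutReport {
  layout: BundleLayout;
  // v2 files that were not found, useful for error messages
  missingV2Files: string[];
}

// Hypothetical helper: classify a bundle by the relative paths it contains.
function detectBundleLayout(paths: readonly string[]): LayoutReport {
  const present = new Set(paths);
  const missingV2Files = V2_FILES.filter((name) => !present.has(name));
  if (missingV2Files.length === 0) {
    return { layout: "moonshine-v2", missingV2Files };
  }
  const looksLegacy = LEGACY_FILES.every((name) => present.has(name));
  return {
    layout: looksLegacy ? "legacy-moonshine-js" : "unknown",
    missingV2Files,
  };
}
```

A check like this could run in an install script or at adapter startup, so that a wrong bundle produces a readable error instead of a loader failure.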
## Constraints

- keep transcription fully local in the browser
- keep the current terminal WebSocket/stdin path unchanged if possible
- preserve the small voice button UX already added to `webterm/static/js/terminal.ts`
- prefer self-hosted assets served by `webterm/static/`
- do not depend on third-party model/CDN availability at runtime

## Repo Touchpoints

Expected files/modules to change:

- `webterm/static/js/terminal.ts`
- `webterm/static/js/terminal.js`
- `package.json`
- `bun.lock`
- `package-lock.json`
- `webterm/assets_embed.go`
- `webterm/static/...` for new WASM/model assets
- `README.md`

Possible new files:

- `webterm/static/js/sherpa-voice.ts` or similar
- `webterm/static/js/sherpa-onnx.d.ts`
- `webterm/static/models/moonshine-v2-medium/...`
- `scripts/install-sherpa-model.sh` or similar helper

## Proposed Architecture

### 1. Frontend runtime

Use `sherpa-onnx` JavaScript/WebAssembly in the browser instead of `moonshine-js`.

Primary responsibilities:

- initialize sherpa WASM runtime
- load self-hosted Moonshine v2 model files
- initialize VAD + ASR pipeline
- expose simple start/stop API to terminal UI
- emit committed transcript strings

### 2. Audio flow

Use the VAD + offline ASR flow described by sherpa-onnx:

- microphone input
- VAD detects speech boundaries
- finalized speech segment is sent to recognizer
- recognizer returns transcript
- transcript is injected into terminal stdin

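The flow above can be sketched as a small pipeline with injected dependencies. This is an illustration under stated assumptions: `SpeechSegmenter`, `SegmentRecognizer`, and `createVoicePipeline` are hypothetical stand-ins and are not the sherpa-onnx API, and the synchronous method shapes are a simplification that the real runtime may not match.

```typescript
// Hypothetical shapes. These only model the data flow of the plan;
// the real sherpa-onnx VAD and recognizer objects will differ.
interface SpeechSegment {
  samples: Float32Array;
  sampleRate: number;
}

interface SpeechSegmenter {
  // Accepts a chunk of microphone audio and returns any segments
  // that were finalized by this chunk (possibly none).
  accept(chunk: Float32Array): SpeechSegment[];
}

interface SegmentRecognizer {
  // Decodes one finalized segment into text.
  transcribe(segment: SpeechSegment): string;
}

interface VoicePipeline {
  push(chunk: Float32Array): void;
}

function createVoicePipeline(
  segmenter: SpeechSegmenter,
  recognizer: SegmentRecognizer,
  onTranscript: (text: string) => void,
): VoicePipeline {
  return {
    push(chunk: Float32Array): void {
      for (const segment of segmenter.accept(chunk)) {
        const text = recognizer.transcribe(segment).trim();
        // Skip empty results so silence never reaches the terminal.
        if (text.length > 0) {
          onTranscript(text);
        }
      }
    },
  };
}
```

In the terminal, `onTranscript` would forward the text to the existing `sendStdin(...)` path, which keeps the pipeline itself free of UI concerns.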
### 3. Terminal integration

Keep the existing terminal send path:

- reuse `sendStdin(...)` in `WebTerminal`
- keep current voice button/status UI shell
- replace only the recognition backend

## Implementation Phases

### Phase 1. Remove the current MoonshineJS dependency path

- remove `@moonshine-ai/moonshine-js` import and startup logic
- remove `moonshine-js.d.ts`
- remove MoonshineJS-specific error capture code
- keep the voice UI container, button, and status area

Deliverable:

- terminal still builds
- voice button exists but uses a new backend abstraction

### Phase 2. Stage sherpa-onnx assets locally

- choose the exact sherpa-onnx JS/WASM package and version
- download/copy the required runtime assets into `webterm/static/`
- unpack `~/medium-streaming-en.zip` into a repo-managed static model directory
- verify that the final model path layout matches what sherpa-onnx expects
- ensure static assets are available both in dev mode and embedded mode

Deliverable:

- all WASM/model files are served locally from `/static/...`

### Phase 3. Build a thin browser voice adapter

- create a dedicated module that wraps sherpa-onnx initialization
- define a minimal interface:
  - `start()`
  - `stop()`
  - `isActive()`
- callbacks for:
  - status updates
  - final transcript
  - detailed errors
- keep terminal code from depending directly on low-level sherpa objects

Deliverable:

- one self-contained browser voice adapter module

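One possible shape for the adapter interface listed above is sketched here. Only the method names `start()`, `stop()`, and `isActive()` come from the plan; the return types, the callback names, and the `FakeVoiceAdapter` class are assumptions made for illustration. The fake contains no sherpa-onnx code and exists so terminal wiring can be exercised before the real backend is ready.

```typescript
// Callback names are hypothetical; the plan only lists the three kinds.
interface VoiceAdapterCallbacks {
  onStatus(status: string): void;
  onTranscript(text: string): void;
  onError(error: Error): void;
}

// start() is assumed to be async because loading WASM and model
// files takes time; the plan does not specify this.
interface VoiceAdapter {
  start(): Promise<void>;
  stop(): void;
  isActive(): boolean;
}

// In-memory stand-in with no real audio or recognition.
class FakeVoiceAdapter implements VoiceAdapter {
  private active = false;
  private readonly callbacks: VoiceAdapterCallbacks;

  constructor(callbacks: VoiceAdapterCallbacks) {
    this.callbacks = callbacks;
  }

  async start(): Promise<void> {
    if (this.active) return;
    this.active = true;
    this.callbacks.onStatus("listening");
  }

  stop(): void {
    if (!this.active) return;
    this.active = false;
    this.callbacks.onStatus("ready");
  }

  isActive(): boolean {
    return this.active;
  }

  // Test hook: pretend the recognizer committed a transcript.
  simulateTranscript(text: string): void {
    if (this.active) {
      this.callbacks.onTranscript(text);
    }
  }
}
```

Because the terminal only sees `VoiceAdapter` and its callbacks, the sherpa-backed implementation can later replace the fake without touching `terminal.ts` call sites.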
### Phase 4. Wire VAD + recognition into the terminal UI

- connect voice button click to adapter start/stop
- show clear states:
  - ready
  - loading runtime
  - loading model
  - listening
  - processing speech
  - final transcript sent
  - detailed error
- push final transcript into `sendStdin(...)`
- decide whether to append newline automatically

Open question:

- should transcript be inserted as raw text only, or raw text plus `\r`?

Deliverable:

- end-to-end browser speech to terminal input using sherpa-onnx

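The UI states listed in this phase could be modeled as a small state machine so that invalid jumps surface as errors during development. This is a sketch under assumptions: the state identifiers and the allowed transitions are one plausible reading of the list above, not a decided design.

```typescript
// One identifier per state named in the plan.
type VoiceUiState =
  | "ready"
  | "loading-runtime"
  | "loading-model"
  | "listening"
  | "processing"
  | "sent"
  | "error";

// Assumed transition table. "ready" may go straight to "listening"
// on a repeat start, when runtime and model are already loaded.
const NEXT_STATES: Record<VoiceUiState, readonly VoiceUiState[]> = {
  "ready": ["loading-runtime", "listening", "error"],
  "loading-runtime": ["loading-model", "error"],
  "loading-model": ["listening", "error"],
  "listening": ["processing", "ready", "error"],
  "processing": ["sent", "listening", "error"],
  "sent": ["listening", "ready"],
  "error": ["ready"],
};

function canTransition(from: VoiceUiState, to: VoiceUiState): boolean {
  return NEXT_STATES[from].indexOf(to) !== -1;
}

// Returns the new state, or throws on a transition the table forbids.
function transition(from: VoiceUiState, to: VoiceUiState): VoiceUiState {
  if (!canTransition(from, to)) {
    throw new Error(`invalid voice state change: ${from} -> ${to}`);
  }
  return to;
}
```

The status area in the terminal UI would then render a label per state, and the adapter's status callback would be the only place that calls `transition`.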
### Phase 5. Performance and caching

- confirm browser HTTP caching behavior for WASM and model files
- add long-lived cache headers if needed for self-hosted static assets
- optionally add a service worker pre-cache later
- measure first-load vs repeat-load experience

Deliverable:

- repeat visits avoid re-downloading large model assets where possible

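A caching policy for the large assets could look like the sketch below. It is written in TypeScript only to state the policy; the headers themselves would be set by whatever serves `/static/...`. The function name, the `/static/models/` prefix, and the rule that only versioned URLs get `immutable` are assumptions. The `Cache-Control` values themselves are standard HTTP directives.

```typescript
const ONE_YEAR_SECONDS = 365 * 24 * 60 * 60;

// Hypothetical policy helper. Returns the Cache-Control value for a
// URL path, or null to leave the server default untouched.
function cacheControlFor(urlPath: string, hasVersionInUrl: boolean): string | null {
  const isLargeAsset =
    urlPath.startsWith("/static/models/") || urlPath.endsWith(".wasm");
  if (!isLargeAsset) {
    return null;
  }
  if (hasVersionInUrl) {
    // Safe only when a new model or runtime release changes the URL.
    return `public, max-age=${ONE_YEAR_SECONDS}, immutable`;
  }
  // Unversioned URLs: the browser may store the file but must
  // revalidate it, so an unchanged file costs a 304 rather than a
  // full download (given the server sends ETag or Last-Modified).
  return "no-cache";
}
```

With this split, swapping the model bundle under a stable path is still picked up on the next visit, while versioned paths skip revalidation entirely.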
### Phase 6. Documentation and deploy

- document required model files and where they live
- document browser requirements
- document secure-origin requirement for microphone access
- document how to update the model bundle in the future

Deliverable:

- README instructions for install, deploy, and troubleshooting

## Technical Questions To Resolve Early

1. Which sherpa-onnx JS/WASM distribution should we use?

- npm package
- vendored release bundle
- custom copied example assets

2. Which exact browser API shape should we target?

- direct sherpa recognizer API
- sherpa VAD + non-streaming ASR helper
- example-derived wrapper from sherpa demos

3. What is the expected asset layout for the Moonshine v2 medium zip?

- whether files can be served exactly as unzipped
- whether any extra config or renamed paths are required

4. What transcript commit behavior do we want?

- send text only
- send text plus Enter
- configurable mode

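For question 4, the "configurable mode" option can be kept very small. The sketch below assumes a hypothetical `CommitMode` setting and a `formatTranscriptForStdin` helper; neither exists in the code today. It collapses internal whitespace so a stray newline in the recognizer output cannot submit a command by accident, which is a design choice rather than a requirement from the plan.

```typescript
// Hypothetical setting covering the first two options above.
type CommitMode = "text" | "text-enter";

// Prepares a committed transcript for the terminal stdin path.
// Returns an empty string when there is nothing worth sending.
function formatTranscriptForStdin(text: string, mode: CommitMode): string {
  // Collapse runs of whitespace (including newlines) to single spaces.
  const cleaned = text.replace(/\s+/g, " ").trim();
  if (cleaned.length === 0) {
    return "";
  }
  // "\r" is what the plan proposes for Enter.
  return mode === "text-enter" ? cleaned + "\r" : cleaned;
}
```

A caller would pass the result to `sendStdin(...)` only when it is non-empty, so the default can start as `"text"` and be switched later without changing the adapter.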
## Risks

### Runtime/API mismatch

Risk:

- sherpa-onnx JS APIs may differ from the examples we choose

Mitigation:

- lock to one verified release
- copy a known working browser example shape before adapting

### Large asset size

Risk:

- medium model load time may be high on first use

Mitigation:

- self-host locally
- keep caching aggressive
- consider falling back to a smaller model later

### Mobile/browser compatibility

Risk:

- the WASM + large model + microphone flow may perform poorly on mobile devices and weaker browsers

Mitigation:

- treat desktop Chromium/Firefox as first target
- gate unsupported browsers with explicit errors

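Gating unsupported browsers with explicit errors might be done with a capability probe like the one sketched here. The names `VoiceEnvironment`, `probeVoiceEnvironment`, and `unsupportedReasons` are hypothetical, and the three checks are an assumed minimum: a secure context (which `getUserMedia` requires), WebAssembly support, and the `navigator.mediaDevices.getUserMedia` API. The real adapter may need more checks once the sherpa-onnx build is chosen.

```typescript
interface VoiceEnvironment {
  secureContext: boolean;
  webAssembly: boolean;
  getUserMedia: boolean;
}

// Reads capabilities from the global scope. Written against
// globalThis so it also loads outside a browser (all checks false).
function probeVoiceEnvironment(): VoiceEnvironment {
  const g = globalThis as any;
  return {
    secureContext: g.isSecureContext === true,
    webAssembly: typeof g.WebAssembly?.instantiate === "function",
    getUserMedia: typeof g.navigator?.mediaDevices?.getUserMedia === "function",
  };
}

// Pure function: turns a probe result into user-facing messages.
function unsupportedReasons(env: VoiceEnvironment): string[] {
  const reasons: string[] = [];
  if (!env.secureContext) {
    reasons.push("Voice input needs a secure origin (HTTPS or localhost).");
  }
  if (!env.webAssembly) {
    reasons.push("This browser does not support WebAssembly.");
  }
  if (!env.getUserMedia) {
    reasons.push("This browser does not expose microphone capture.");
  }
  return reasons;
}
```

Running `unsupportedReasons(probeVoiceEnvironment())` before loading any model files lets the voice button show a specific message instead of failing partway through a large download.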
### Current worktree noise

Risk:

- repo already has ongoing voice-related frontend edits

Mitigation:

- isolate the new adapter into its own file/module
- keep migration incremental

## Validation Checklist

- `bun run typecheck` passes
- frontend bundle builds
- `./update.sh` deploys successfully
- remote HTTPS origin still permits microphone access
- first transcription works end-to-end
- repeated transcriptions do not leak memory or duplicate recognizers
- page reload reuses cached assets when possible
- detailed runtime errors are visible in UI and console

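The "do not duplicate recognizers" item is easier to satisfy if recognizer creation goes through a memoized initializer. The sketch below is a generic pattern, not sherpa-onnx code: `once` is a hypothetical helper, and the factory passed to it would be whatever function ends up loading the runtime and model.

```typescript
// Wraps an async factory so it runs at most once at a time.
// Concurrent callers share the same promise. If the factory fails,
// the cached promise is cleared so a later call can retry.
function once<T>(factory: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (pending === undefined) {
      pending = factory().catch((error: unknown) => {
        pending = undefined;
        throw error;
      });
    }
    return pending;
  };
}
```

With the recognizer obtained this way, repeated clicks on the voice button during a slow first load reuse the in-flight initialization instead of starting a second one.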
## Recommended Execution Order

1. Pick and verify one sherpa-onnx browser example for Moonshine v2
2. Vendor the required WASM/runtime assets
3. Unpack and serve the local model bundle
4. Build a dedicated browser adapter module
5. Rewire the existing voice UI to that adapter
6. Validate microphone -> transcript -> terminal stdin flow
7. Improve caching and docs

## Definition of Done

- no `moonshine-js` dependency remains in the browser path
- voice input uses sherpa-onnx + locally hosted Moonshine v2 assets
- transcripts reach terminal stdin reliably
- remote HTTPS access works
- runtime/model errors are understandable
- setup is documented and reproducible