feat: add initial browser voice input prototype
@@ -0,0 +1,281 @@
# sherpa-onnx Moonshine v2 Migration Plan

## Goal

Replace the current `@moonshine-ai/moonshine-js` browser integration with a `sherpa-onnx` WebAssembly integration that can run Moonshine v2 locally in the browser and feed recognized text into the existing terminal stdin path.

The target behavior is:

- user clicks a voice button in the terminal UI
- browser captures microphone audio
- VAD segments speech locally
- Moonshine v2 recognition runs locally in WASM
- committed transcript is sent into the existing terminal input path

## Why We Are Changing Direction

The current MoonshineJS integration is the wrong runtime for the `medium-streaming-en` bundle we want to use.

Current blockers:

- `moonshine-js` expects old-style ONNX assets such as:
  - `quantized/encoder_model.onnx`
  - `quantized/decoder_model_merged.onnx`
- the downloaded `medium-streaming-en.zip` contains a newer Moonshine v2 layout:
  - `frontend.ort`
  - `encoder.ort`
  - `decoder_kv.ort`
  - `adapter.ort`
  - `cross_kv.ort`
  - `tokenizer.bin`
  - `streaming_config.json`
- self-hosting those files alone does not make the existing `moonshine-js` loader compatible
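
The layout mismatch above can be checked mechanically before any loader is wired up. The following is a minimal sketch, assuming only the file names listed in this section; the type and function names (`BundleLayout`, `detectBundleLayout`, `missingFiles`) are hypothetical and not part of either library.

```typescript
// File names are taken from the lists above; everything else is illustrative.
type BundleLayout = "legacy-onnx" | "moonshine-v2-ort" | "unknown";

const LEGACY_FILES: readonly string[] = [
  "quantized/encoder_model.onnx",
  "quantized/decoder_model_merged.onnx",
];

const V2_FILES: readonly string[] = [
  "frontend.ort",
  "encoder.ort",
  "decoder_kv.ort",
  "adapter.ort",
  "cross_kv.ort",
  "tokenizer.bin",
  "streaming_config.json",
];

// Classify a bundle by the relative paths it contains.
function detectBundleLayout(files: readonly string[]): BundleLayout {
  const present = new Set(files);
  const hasAll = (names: readonly string[]): boolean =>
    names.every((name) => present.has(name));
  if (hasAll(V2_FILES)) return "moonshine-v2-ort";
  if (hasAll(LEGACY_FILES)) return "legacy-onnx";
  return "unknown";
}

// List which expected files are absent, for a readable error message.
function missingFiles(
  files: readonly string[],
  expected: readonly string[],
): string[] {
  const present = new Set(files);
  return expected.filter((name) => !present.has(name));
}
```

A helper like this could run in an install script or at startup, so that a half-unpacked bundle fails with a list of missing files rather than an opaque loader error.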

## Constraints

- keep transcription fully local in the browser
- keep the current terminal WebSocket/stdin path unchanged if possible
- preserve the small voice button UX already added to `webterm/static/js/terminal.ts`
- prefer self-hosted assets served by `webterm/static/`
- do not depend on third-party model/CDN availability at runtime

## Repo Touchpoints

Expected files/modules to change:

- `webterm/static/js/terminal.ts`
- `webterm/static/js/terminal.js`
- `package.json`
- `bun.lock`
- `package-lock.json`
- `webterm/assets_embed.go`
- `webterm/static/...` for new WASM/model assets
- `README.md`

Possible new files:

- `webterm/static/js/sherpa-voice.ts` or similar
- `webterm/static/js/sherpa-onnx.d.ts`
- `webterm/static/models/moonshine-v2-medium/...`
- `scripts/install-sherpa-model.sh` or similar helper

## Proposed Architecture

### 1. Frontend runtime

Use `sherpa-onnx` JavaScript/WebAssembly in the browser instead of `moonshine-js`.

Primary responsibilities:

- initialize sherpa WASM runtime
- load self-hosted Moonshine v2 model files
- initialize VAD + ASR pipeline
- expose simple start/stop API to terminal UI
- emit committed transcript strings

### 2. Audio flow

Use the VAD + offline ASR flow described by sherpa-onnx:

- microphone input
- VAD detects speech boundaries
- finalized speech segment is sent to recognizer
- recognizer returns transcript
- transcript is injected into terminal stdin
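
The flow above can be sketched independently of the real runtime. This is a hedged outline, not the sherpa-onnx API: `SpeechSegmenter`, `SegmentRecognizer`, and `createVoicePipeline` are hypothetical names standing in for whatever VAD and recognizer objects the chosen sherpa-onnx build exposes.

```typescript
// Hypothetical stand-ins for the VAD and recognizer; not sherpa-onnx types.
interface SpeechSegmenter {
  // Accepts a chunk of mono PCM samples and returns any speech segments
  // that this chunk finalized (usually none).
  accept(samples: Float32Array): Float32Array[];
}

interface SegmentRecognizer {
  transcribe(segment: Float32Array): Promise<string>;
}

function createVoicePipeline(
  vad: SpeechSegmenter,
  asr: SegmentRecognizer,
  sendStdin: (text: string) => void,
  onError: (error: unknown) => void,
) {
  // Segments are recognized one at a time so transcripts reach the
  // terminal in the order they were spoken.
  let queue: Promise<void> = Promise.resolve();

  return {
    pushAudio(samples: Float32Array): Promise<void> {
      for (const segment of vad.accept(samples)) {
        queue = queue.then(async () => {
          try {
            const text = (await asr.transcribe(segment)).trim();
            if (text.length > 0) sendStdin(text);
          } catch (error) {
            // Report and continue so one bad segment does not stall the queue.
            onError(error);
          }
        });
      }
      return queue;
    },
  };
}
```

Chaining on a single promise keeps recognition sequential without a separate worker queue; whether the real recognizer call is synchronous or asynchronous is one of the details to confirm against the chosen sherpa-onnx example.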

### 3. Terminal integration

Keep the existing terminal send path:

- reuse `sendStdin(...)` in `WebTerminal`
- keep current voice button/status UI shell
- replace only the recognition backend

## Implementation Phases

### Phase 1. Remove the current MoonshineJS dependency path

- remove `@moonshine-ai/moonshine-js` import and startup logic
- remove `moonshine-js.d.ts`
- remove MoonshineJS-specific error capture code
- keep the voice UI container, button, and status area

Deliverable:

- terminal still builds
- voice button exists but uses a new backend abstraction

### Phase 2. Stage sherpa-onnx assets locally

- choose exact sherpa-onnx JS/WASM package/version
- download/copy the required runtime assets into `webterm/static/`
- unpack `~/medium-streaming-en.zip` into repo-managed static model directory
- verify final model path layout expected by sherpa-onnx
- ensure static assets are available both in dev mode and embedded mode

Deliverable:

- all WASM/model files are served locally from `/static/...`

### Phase 3. Build a thin browser voice adapter

- create a dedicated module that wraps sherpa-onnx initialization
- define a minimal interface:
  - `start()`
  - `stop()`
  - `isActive()`
  - callbacks for:
    - status updates
    - final transcript
    - detailed errors
- keep terminal code from depending directly on low-level sherpa objects
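
One possible shape for that interface is sketched below. Only `start()`, `stop()`, `isActive()`, and the three callback kinds come from the list above; the type names, the promise-returning signatures, and the `StubVoiceAdapter` class are assumptions made for illustration. The stub contains no sherpa-onnx code, which is the point: the terminal can be wired against the interface before the real backend exists.

```typescript
// Callback names are hypothetical; the three kinds come from the plan.
interface VoiceAdapterCallbacks {
  onStatus(status: string): void;
  onFinalTranscript(text: string): void;
  onError(error: Error): void;
}

// Assumed to be async because loading WASM and model files takes time.
interface VoiceAdapter {
  start(): Promise<void>;
  stop(): Promise<void>;
  isActive(): boolean;
}

// Placeholder backend with no recognition logic, useful while the
// terminal UI is being rewired.
class StubVoiceAdapter implements VoiceAdapter {
  private active = false;
  private readonly callbacks: VoiceAdapterCallbacks;

  constructor(callbacks: VoiceAdapterCallbacks) {
    this.callbacks = callbacks;
  }

  async start(): Promise<void> {
    if (this.active) return; // start() is idempotent
    this.active = true;
    this.callbacks.onStatus("listening");
  }

  async stop(): Promise<void> {
    if (!this.active) return; // stop() is idempotent
    this.active = false;
    this.callbacks.onStatus("ready");
  }

  isActive(): boolean {
    return this.active;
  }

  // Test hook: pretend the backend finalized an utterance.
  simulateTranscript(text: string): void {
    if (this.active) this.callbacks.onFinalTranscript(text);
  }
}
```

A sherpa-backed class implementing the same `VoiceAdapter` interface could later replace the stub without touching terminal code.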

Deliverable:

- one self-contained browser voice adapter module

### Phase 4. Wire VAD + recognition into the terminal UI

- connect voice button click to adapter start/stop
- show clear states:
  - ready
  - loading runtime
  - loading model
  - listening
  - processing speech
  - final transcript sent
  - detailed error
- push final transcript into `sendStdin(...)`
- decide whether to append newline automatically
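
The states listed above could be modelled as a small state machine so the status area never shows an impossible sequence. The state names mirror the list; the identifiers and the transition table are assumptions for illustration and would need to match how the adapter actually reports progress.

```typescript
type VoiceUiState =
  | "ready"
  | "loading-runtime"
  | "loading-model"
  | "listening"
  | "processing"
  | "sent"
  | "error";

// Labels shown in the status area, taken from the list above.
const STATUS_LABELS: Record<VoiceUiState, string> = {
  ready: "ready",
  "loading-runtime": "loading runtime",
  "loading-model": "loading model",
  listening: "listening",
  processing: "processing speech",
  sent: "final transcript sent",
  error: "detailed error",
};

// Assumed transitions: the first click loads the runtime and model,
// later clicks go straight from ready to listening.
const ALLOWED: Record<VoiceUiState, readonly VoiceUiState[]> = {
  ready: ["loading-runtime", "listening", "error"],
  "loading-runtime": ["loading-model", "error"],
  "loading-model": ["listening", "error"],
  listening: ["processing", "ready", "error"],
  processing: ["sent", "listening", "ready", "error"],
  sent: ["listening", "ready", "error"],
  error: ["ready"],
};

// Returns the requested state if the transition is allowed,
// otherwise keeps the current state.
function nextState(
  current: VoiceUiState,
  requested: VoiceUiState,
): VoiceUiState {
  return ALLOWED[current].includes(requested) ? requested : current;
}
```

For example, `nextState("ready", "sent")` stays at `"ready"`, because a transcript cannot be sent before anything was heard.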

Open question:

- should transcript be inserted as raw text only, or raw text plus `\r`?
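
Until that question is settled, both behaviors can sit behind one switch. The sketch below is an assumption about how the choice might be expressed; `CommitMode` and `formatForStdin` are hypothetical names, and collapsing whitespace is a defensive choice rather than something the plan requires.

```typescript
type CommitMode = "text-only" | "text-plus-enter";

// Prepare a recognized transcript for the terminal stdin path.
function formatForStdin(transcript: string, mode: CommitMode): string {
  // Collapse whitespace runs (including newlines) into single spaces so
  // the transcript itself can never submit a command halfway through.
  const text = transcript.replace(/\s+/g, " ").trim();
  if (text.length === 0) return ""; // nothing recognized, send nothing
  return mode === "text-plus-enter" ? text + "\r" : text;
}
```

With `"text-only"` the user reviews the text and presses Enter manually; with `"text-plus-enter"` the command runs as soon as the utterance ends, which is faster but riskier in a shell.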

Deliverable:

- end-to-end browser speech to terminal input using sherpa-onnx

### Phase 5. Performance and caching

- confirm browser HTTP caching behavior for WASM and model files
- add long-lived cache headers if needed for self-hosted static assets
- optionally add a service worker pre-cache later
- measure first-load vs repeat-load experience
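
The header policy itself is small enough to write down. The sketch below expresses it in TypeScript for consistency with the frontend code, although the headers would be set by whatever serves `webterm/static/`. It assumes that long-lived caching is only safe when the asset URL changes with each release (a versioned path or query string); the function name and the `versioned` flag are hypothetical.

```typescript
// One year, marked immutable: the browser will not revalidate at all.
const LONG_LIVED = "public, max-age=31536000, immutable";
// Stored, but revalidated on every use (cheap when the server answers 304).
const REVALIDATE = "no-cache";

// Pick a Cache-Control value for a static asset path.
function cacheControlFor(path: string, versioned: boolean): string {
  const isLargeBinary = /\.(wasm|ort|onnx|bin)$/.test(path);
  // Only large binaries with release-specific URLs get the long lifetime;
  // everything else stays revalidated so updates are picked up promptly.
  return isLargeBinary && versioned ? LONG_LIVED : REVALIDATE;
}
```

With this policy a repeat visit would reuse the cached model files without any request, while an unversioned URL would still avoid a full re-download as long as the server supports conditional requests.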

Deliverable:

- repeat visits avoid re-downloading large model assets where possible

### Phase 6. Documentation and deploy

- document required model files and where they live
- document browser requirements
- document secure-origin requirement for microphone access
- document how to update the model bundle in future

Deliverable:

- README instructions for install, deploy, and troubleshooting

## Technical Questions To Resolve Early

1. Which sherpa-onnx JS/WASM distribution should we use?

   - npm package
   - vendored release bundle
   - custom copied example assets

2. Which exact browser API shape should we target?

   - direct sherpa recognizer API
   - sherpa VAD + non-streaming ASR helper
   - example-derived wrapper from sherpa demos

3. What is the expected asset layout for the Moonshine v2 medium zip?

   - whether files can be served exactly as unzipped
   - whether any extra config or renamed paths are required

4. What transcript commit behavior do we want?

   - send text only
   - send text plus Enter
   - configurable mode

## Risks

### Runtime/API mismatch

Risk:

- sherpa-onnx JS APIs may differ from the examples we choose

Mitigation:

- lock to one verified release
- copy a known working browser example shape before adapting

### Large asset size

Risk:

- medium model load time may be high on first use

Mitigation:

- self-host locally
- keep caching aggressive
- consider fallback option for smaller model later

### Mobile/browser compatibility

Risk:

- WASM + large model + microphone flow may be poor on weaker browsers

Mitigation:

- treat desktop Chromium/Firefox as first target
- gate unsupported browsers with explicit errors
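
Gating can be done with a few capability checks before any model download starts. In the sketch below, `isSecureContext`, `navigator.mediaDevices.getUserMedia`, and `WebAssembly` are standard browser globals; the `VoiceEnvironment` type, the function names, and the message wording are hypothetical. The checks take a plain object so they can be unit-tested outside a browser.

```typescript
interface VoiceEnvironment {
  isSecureContext: boolean;
  hasGetUserMedia: boolean;
  hasWebAssembly: boolean;
}

// Returns one human-readable reason per missing capability;
// an empty array means voice input can be attempted.
function unsupportedReasons(env: VoiceEnvironment): string[] {
  const reasons: string[] = [];
  if (!env.isSecureContext) {
    reasons.push("Microphone access needs a secure origin (HTTPS or localhost).");
  }
  if (!env.hasGetUserMedia) {
    reasons.push("This browser does not expose navigator.mediaDevices.getUserMedia.");
  }
  if (!env.hasWebAssembly) {
    reasons.push("This browser does not support WebAssembly.");
  }
  return reasons;
}

// Reads the capabilities from the current global scope.
function probeEnvironment(): VoiceEnvironment {
  const g = globalThis as any;
  return {
    isSecureContext: g.isSecureContext === true,
    hasGetUserMedia:
      typeof g.navigator?.mediaDevices?.getUserMedia === "function",
    hasWebAssembly: typeof g.WebAssembly === "object",
  };
}
```

The voice button handler could call `unsupportedReasons(probeEnvironment())` and show the reasons in the status area instead of starting the adapter.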

### Current worktree noise

Risk:

- repo already has ongoing voice-related frontend edits

Mitigation:

- isolate the new adapter into its own file/module
- keep migration incremental

## Validation Checklist

- `bun run typecheck` passes
- frontend bundle builds
- `./update.sh` deploys successfully
- remote HTTPS origin still permits microphone access
- first transcription works end-to-end
- repeated transcriptions do not leak memory or duplicate recognizers
- page reload reuses cached assets when possible
- detailed runtime errors are visible in UI and console
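
The "do not duplicate recognizers" item is easier to verify if creation goes through a single memoized entry point. The following sketch shows one way to do that; `createLazySingleton` is a hypothetical helper, and the `init` and `dispose` callbacks stand in for whatever the chosen sherpa-onnx build uses to create and free a recognizer.

```typescript
// Wraps an expensive async initializer so that concurrent and repeated
// callers share one instance until it is explicitly released.
function createLazySingleton<T>(
  init: () => Promise<T>,
  dispose: (value: T) => void,
) {
  let pending: Promise<T> | null = null;

  return {
    get(): Promise<T> {
      if (pending === null) {
        const attempt: Promise<T> = init().catch((error) => {
          // Forget a failed attempt so a later click can retry.
          if (pending === attempt) pending = null;
          throw error;
        });
        pending = attempt;
      }
      return pending;
    },

    async release(): Promise<void> {
      if (pending === null) return;
      const current = pending;
      pending = null;
      let value: T;
      try {
        value = await current;
      } catch {
        return; // initialization failed, nothing to dispose
      }
      dispose(value);
    },
  };
}
```

Clicking the voice button repeatedly would then call `get()` many times but run `init` once, and a single `release()` on teardown frees the instance.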

## Recommended Execution Order

1. Pick and verify one sherpa-onnx browser example for Moonshine v2
2. Vendor the required WASM/runtime assets
3. Unpack and serve the local model bundle
4. Build a dedicated browser adapter module
5. Rewire the existing voice UI to that adapter
6. Validate microphone -> transcript -> terminal stdin flow
7. Improve caching and docs

## Definition of Done

- no `moonshine-js` dependency remains in the browser path
- voice input uses sherpa-onnx + locally hosted Moonshine v2 assets
- transcripts reach terminal stdin reliably
- remote HTTPS access works
- runtime/model errors are understandable
- setup is documented and reproducible