izackp/webterm

izackp 3edc41e86c feat: add initial browser voice input prototype

2026-05-11 22:01:23 -04:00

sherpa-onnx Moonshine v2 Migration Plan

Goal

Replace the current @moonshine-ai/moonshine-js browser integration with a sherpa-onnx WebAssembly integration that can run Moonshine v2 locally in the browser and feed recognized text into the existing terminal stdin path.

The target behavior is:

user clicks a voice button in the terminal UI
browser captures microphone audio
VAD segments speech locally
Moonshine v2 recognition runs locally in WASM
committed transcript is sent into the existing terminal input path

Why We Are Changing Direction

The current MoonshineJS integration is the wrong runtime for the medium-streaming-en bundle we want to use.

Current blockers:

moonshine-js expects old-style ONNX assets such as:
- quantized/encoder_model.onnx
- quantized/decoder_model_merged.onnx
the downloaded medium-streaming-en.zip contains a newer Moonshine v2 layout:
- frontend.ort
- encoder.ort
- decoder_kv.ort
- adapter.ort
- cross_kv.ort
- tokenizer.bin
- streaming_config.json
self-hosting those files alone does not make the existing moonshine-js loader compatible
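The layout mismatch can be made concrete with a small check. This is a minimal sketch, assuming the file names listed above are enough to tell the two layouts apart; `classifyBundle` and `BundleLayout` are hypothetical names and are not part of moonshine-js or sherpa-onnx.

```typescript
// File names copied from the two layouts described above.
const LEGACY_FILES: readonly string[] = [
  "quantized/encoder_model.onnx",
  "quantized/decoder_model_merged.onnx",
];
const V2_FILES: readonly string[] = [
  "frontend.ort",
  "encoder.ort",
  "decoder_kv.ort",
  "adapter.ort",
  "cross_kv.ort",
  "tokenizer.bin",
  "streaming_config.json",
];

type BundleLayout = "legacy-onnx" | "moonshine-v2" | "unknown";

// Hypothetical helper: classify an unpacked bundle by the files it contains.
function classifyBundle(files: readonly string[]): BundleLayout {
  const present = new Set(files);
  const hasAll = (names: readonly string[]): boolean =>
    names.every((name) => present.has(name));
  if (hasAll(V2_FILES)) return "moonshine-v2";
  if (hasAll(LEGACY_FILES)) return "legacy-onnx";
  return "unknown";
}
```

A check like this could run in an install script so that a wrong or partially unpacked bundle fails early instead of surfacing as a loader error in the browser.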

Constraints

keep transcription fully local in the browser
keep the current terminal WebSocket/stdin path unchanged if possible
preserve the small voice button UX already added to webterm/static/js/terminal.ts
prefer self-hosted assets served by webterm/static/
do not depend on third-party model/CDN availability at runtime

Repo Touchpoints

Expected files/modules to change:

webterm/static/js/terminal.ts
webterm/static/js/terminal.js
package.json
bun.lock
package-lock.json
webterm/assets_embed.go
webterm/static/... for new WASM/model assets
README.md

Possible new files:

webterm/static/js/sherpa-voice.ts or similar
webterm/static/js/sherpa-onnx.d.ts
webterm/static/models/moonshine-v2-medium/...
scripts/install-sherpa-model.sh or similar helper

Proposed Architecture

1. Frontend runtime

Use sherpa-onnx JavaScript/WebAssembly in the browser instead of moonshine-js.

Primary responsibilities:

initialize sherpa WASM runtime
load self-hosted Moonshine v2 model files
initialize VAD + ASR pipeline
expose simple start/stop API to terminal UI
emit committed transcript strings

2. Audio flow

Use the VAD + offline (non-streaming) ASR flow described by sherpa-onnx:

microphone input
VAD detects speech boundaries
finalized speech segment is sent to recognizer
recognizer returns transcript
transcript is injected into terminal stdin
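The flow above can be sketched as a single function with its dependencies injected. The `SpeechSegmenter` and `SegmentRecognizer` interfaces are hypothetical stand-ins for whatever VAD and recognizer objects the chosen sherpa-onnx build exposes, and the sketch assumes recognition is a synchronous call; if the real API returns a Promise, the loop would need to await it.

```typescript
// Hypothetical stand-in for the VAD: accepts a chunk of mono samples and
// returns any speech segments that this chunk finalized.
interface SpeechSegmenter {
  accept(samples: Float32Array): Float32Array[];
}

// Hypothetical stand-in for the recognizer (assumed synchronous here).
interface SegmentRecognizer {
  transcribe(segment: Float32Array): string;
}

// Runs one microphone chunk through VAD, recognition, and stdin injection.
// Returns the number of transcripts that were sent.
function processChunk(
  chunk: Float32Array,
  vad: SpeechSegmenter,
  asr: SegmentRecognizer,
  sendStdin: (text: string) => void,
): number {
  let sent = 0;
  for (const segment of vad.accept(chunk)) {
    const text = asr.transcribe(segment).trim();
    if (text.length === 0) continue; // nothing recognized in this segment
    sendStdin(text);
    sent += 1;
  }
  return sent;
}
```

Keeping the sink as a plain callback means the terminal only has to pass its existing `sendStdin(...)` method and never sees the VAD or recognizer objects.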

3. Terminal integration

Keep the existing terminal send path:

reuse sendStdin(...) in WebTerminal
keep current voice button/status UI shell
replace only the recognition backend

Implementation Phases

Phase 1. Remove the current MoonshineJS dependency path

remove @moonshine-ai/moonshine-js import and startup logic
remove moonshine-js.d.ts
remove MoonshineJS-specific error capture code
keep the voice UI container, button, and status area

Deliverable:

terminal still builds
voice button exists but uses a new backend abstraction

Phase 2. Stage sherpa-onnx assets locally

choose exact sherpa-onnx JS/WASM package/version
download/copy the required runtime assets into webterm/static/
unpack ~/medium-streaming-en.zip into a repo-managed static model directory
verify final model path layout expected by sherpa-onnx
ensure static assets are available both in dev mode and embedded mode

Deliverable:

all WASM/model files are served locally from /static/...
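One way to confirm this deliverable is to probe each expected URL once at startup or in a smoke test. This is a sketch only: `assetUrls`, `findMissingAssets`, and the `HeadProbe` type are hypothetical names, and the base path and file list are supplied by the caller rather than assumed here.

```typescript
// Joins a base URL and relative file names without doubling slashes.
function assetUrls(baseUrl: string, files: readonly string[]): string[] {
  const base = baseUrl.replace(/\/+$/, "");
  return files.map((file) => `${base}/${file.replace(/^\/+/, "")}`);
}

// Minimal shape of a fetch-like function; the browser's fetch fits this type.
type HeadProbe = (
  url: string,
  init: { method: string },
) => Promise<{ ok: boolean }>;

// Issues a HEAD request per URL and returns the URLs that did not answer 2xx.
// A network-level failure rejects the returned promise instead.
async function findMissingAssets(
  urls: readonly string[],
  probe: HeadProbe,
): Promise<string[]> {
  const results = await Promise.all(
    urls.map(async (url) => ({
      url,
      ok: (await probe(url, { method: "HEAD" })).ok,
    })),
  );
  return results.filter((r) => !r.ok).map((r) => r.url);
}
```

In the browser the probe argument could simply be `fetch`; passing it in keeps the helper testable in both dev mode and embedded mode.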

Phase 3. Build a thin browser voice adapter

create a dedicated module that wraps sherpa-onnx initialization
define a minimal interface:
- start()
- stop()
- isActive()
- callbacks for:
  - status updates
  - final transcript
  - detailed errors
keep terminal code from depending directly on low-level sherpa objects
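The minimal interface could look roughly like the following. Every name here (`VoiceAdapter`, `VoiceAdapterCallbacks`, `VoiceBackend`, `createVoiceAdapter`) is hypothetical, and `start()` is synchronous only for brevity; a real implementation would likely return a Promise while the runtime and model load.

```typescript
interface VoiceAdapterCallbacks {
  onStatus(status: string): void;
  onFinalTranscript(text: string): void;
  onError(error: Error): void;
}

interface VoiceAdapter {
  start(): void;
  stop(): void;
  isActive(): boolean;
}

// Hypothetical backend contract; the real one would wrap sherpa-onnx objects.
interface VoiceBackend {
  open(onTranscript: (text: string) => void): void;
  close(): void;
}

function createVoiceAdapter(
  backend: VoiceBackend,
  callbacks: VoiceAdapterCallbacks,
): VoiceAdapter {
  let active = false;
  return {
    start() {
      if (active) return; // idempotent: never open the backend twice
      try {
        backend.open((text) => callbacks.onFinalTranscript(text));
        active = true;
        callbacks.onStatus("listening");
      } catch (err) {
        // Surface the failure as a detailed error and stay inactive.
        callbacks.onError(err instanceof Error ? err : new Error(String(err)));
      }
    },
    stop() {
      if (!active) return;
      backend.close();
      active = false;
      callbacks.onStatus("ready");
    },
    isActive: () => active,
  };
}
```

With this shape the terminal code depends only on `VoiceAdapter` and the callbacks, so the backend can be swapped or faked in tests without touching the UI.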

Deliverable:

one self-contained browser voice adapter module

Phase 4. Wire VAD + recognition into the terminal UI

connect voice button click to adapter start/stop
show clear states:
- ready
- loading runtime
- loading model
- listening
- processing speech
- final transcript sent
- detailed error
push final transcript into sendStdin(...)
decide whether to append newline automatically
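The states listed above could be modeled as a small state machine so the UI never shows an impossible sequence. The state names mirror the list; the allowed transitions in `NEXT` are an assumption made for illustration, not something the plan has fixed.

```typescript
type VoiceState =
  | "ready"
  | "loading-runtime"
  | "loading-model"
  | "listening"
  | "processing"
  | "sent"
  | "error";

// Assumed transition table: which states may follow each state.
const NEXT: Record<VoiceState, readonly VoiceState[]> = {
  ready: ["loading-runtime", "listening", "error"],
  "loading-runtime": ["loading-model", "error"],
  "loading-model": ["listening", "error"],
  listening: ["processing", "ready", "error"],
  processing: ["sent", "listening", "error"],
  sent: ["listening", "ready", "error"],
  error: ["ready"],
};

// Returns the new state, or throws if the move is not in the table.
function transition(from: VoiceState, to: VoiceState): VoiceState {
  if (NEXT[from].indexOf(to) === -1) {
    throw new Error(`invalid voice state transition: ${from} -> ${to}`);
  }
  return to;
}
```

Routing every status update through one function like this also gives a single place to log transitions when debugging.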

Open question:

should the transcript be inserted as raw text only, or as raw text plus \r (Enter)?
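Either answer can be supported by making the choice an explicit mode. A minimal sketch, where `CommitMode` and `formatForStdin` are hypothetical names and collapsing whitespace is an added assumption so that a transcript cannot smuggle in extra line breaks:

```typescript
type CommitMode = "text" | "text+enter";

// Normalizes a transcript and optionally appends a carriage return.
function formatForStdin(transcript: string, mode: CommitMode): string {
  // Collapse all whitespace runs (including newlines) into single spaces.
  const text = transcript.replace(/\s+/g, " ").trim();
  if (text.length === 0) return ""; // nothing to send
  return mode === "text+enter" ? text + "\r" : text;
}
```

Defaulting to `"text"` is the more conservative option for a terminal, since a misrecognized command is then only typed and not executed.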

Deliverable:

end-to-end browser speech to terminal input using sherpa-onnx

Phase 5. Performance and caching

confirm browser HTTP caching behavior for WASM and model files
add long-lived cache headers if needed for self-hosted static assets
optionally add a service worker pre-cache later
measure first-load vs repeat-load experience
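The caching policy itself is small enough to write down. The sketch below is illustrative only: in this repo the headers would be set by the Go server, the file-extension list is an assumption, and `cacheControlFor` and its `versioned` flag (true when the URL changes whenever the file content changes) are hypothetical.

```typescript
// Picks a Cache-Control value for a static asset path.
function cacheControlFor(path: string, versioned: boolean): string {
  // Assumed set of large binary assets: WASM runtime and model files.
  const isModelOrWasm = /\.(wasm|ort|onnx|bin)$/i.test(path);
  if (isModelOrWasm && versioned) {
    // Safe only when a new file version always gets a new URL.
    return "public, max-age=31536000, immutable";
  }
  // Everything else is revalidated on each use; a 304 avoids the re-download.
  return "no-cache";
}
```

The long-lived branch is deliberately gated on versioned URLs, because an immutable response under a stable URL would keep serving an old model after an update.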

Deliverable:

repeat visits avoid re-downloading large model assets where possible

Phase 6. Documentation and deploy

document required model files and where they live
document browser requirements
document secure-origin requirement for microphone access
document how to update the model bundle in future

Deliverable:

README instructions for install, deploy, and troubleshooting

Technical Questions To Resolve Early

Which sherpa-onnx JS/WASM distribution should we use?

npm package
vendored release bundle
custom copied example assets

Which exact browser API shape should we target?

direct sherpa recognizer API
sherpa VAD + non-streaming ASR helper
example-derived wrapper from sherpa demos

What is the expected asset layout for the Moonshine v2 medium zip?

whether files can be served exactly as unzipped
whether any extra config or renamed paths are required

What transcript commit behavior do we want?

send text only
send text plus Enter
configurable mode

Risks

Runtime/API mismatch

Risk:

sherpa-onnx JS APIs may differ from the examples we choose

Mitigation:

lock to one verified release
copy a known working browser example shape before adapting

Large asset size

Risk:

medium model load time may be high on first use

Mitigation:

self-host locally
keep caching aggressive
consider a smaller fallback model later

Mobile/browser compatibility

Risk:

the WASM + large model + microphone flow may perform poorly on mobile devices and less capable browsers

Mitigation:

treat desktop Chromium/Firefox as first target
gate unsupported browsers with explicit errors
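Gating could be done with a pure check that returns human-readable reasons. The `VoiceEnv` shape and `unsupportedReasons` are hypothetical; the three capabilities checked are the ones this plan depends on (secure origin, microphone capture, WebAssembly).

```typescript
// Snapshot of the browser capabilities the voice feature needs.
interface VoiceEnv {
  isSecureContext: boolean;
  hasGetUserMedia: boolean;
  hasWebAssembly: boolean;
}

// Returns one message per missing capability; an empty array means supported.
function unsupportedReasons(env: VoiceEnv): string[] {
  const reasons: string[] = [];
  if (!env.isSecureContext) {
    reasons.push("Microphone access requires HTTPS or localhost.");
  }
  if (!env.hasGetUserMedia) {
    reasons.push("This browser does not expose microphone capture.");
  }
  if (!env.hasWebAssembly) {
    reasons.push("This browser does not support WebAssembly.");
  }
  return reasons;
}
```

In the browser the fields could be filled from `window.isSecureContext`, `typeof navigator.mediaDevices?.getUserMedia === "function"`, and `typeof WebAssembly === "object"`, and the resulting messages shown in the existing status area.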

Current worktree noise

Risk:

repo already has ongoing voice-related frontend edits

Mitigation:

isolate the new adapter into its own file/module
keep migration incremental

Validation Checklist

bun run typecheck passes
frontend bundle builds
./update.sh deploys successfully
remote HTTPS origin still permits microphone access
first transcription works end-to-end
repeated transcriptions do not leak memory or duplicate recognizers
page reload reuses cached assets when possible
detailed runtime errors are visible in UI and console
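For the "duplicate recognizers" item, one common guard is to create the recognizer through a memoized factory so repeated button clicks share a single instance. `once` is a hypothetical helper, shown here as a sketch:

```typescript
// Wraps a factory so it runs at most once; later calls return the same value.
// If the factory throws, nothing is cached and the next call retries.
function once<T>(factory: () => T): () => T {
  let created = false;
  let value: T | undefined;
  return () => {
    if (!created) {
      value = factory();
      created = true;
    }
    return value as T;
  };
}
```

If the factory returns a Promise, concurrent callers share the same pending load; note that a rejected Promise would also be cached, so an async variant may want to reset on failure.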

Recommended Execution Order

Pick and verify one sherpa-onnx browser example for Moonshine v2
Vendor the required WASM/runtime assets
Unpack and serve the local model bundle
Build a dedicated browser adapter module
Rewire the existing voice UI to that adapter
Validate microphone -> transcript -> terminal stdin flow
Improve caching and docs

Definition of Done

no moonshine-js dependency remains in the browser path
voice input uses sherpa-onnx + locally hosted Moonshine v2 assets
transcripts reach terminal stdin reliably
remote HTTPS access works
runtime/model errors are understandable
setup is documented and reproducible
