webterm/plan.md

# sherpa-onnx Moonshine v2 Migration Plan

## Goal

Replace the current `@moonshine-ai/moonshine-js` browser integration with a `sherpa-onnx` WebAssembly integration that can run Moonshine v2 locally in the browser and feed recognized text into the existing terminal stdin path.

The target behavior is:

- user clicks a voice button in the terminal UI
- browser captures microphone audio
- VAD segments speech locally
- Moonshine v2 recognition runs locally in WASM
- committed transcript is sent into the existing terminal input path

## Why We Are Changing Direction

The current MoonshineJS integration is the wrong runtime for the `medium-streaming-en` bundle we want to use.

Current blockers:

- `moonshine-js` expects old-style ONNX assets such as:
  - `quantized/encoder_model.onnx`
  - `quantized/decoder_model_merged.onnx`
- the downloaded `medium-streaming-en.zip` contains a newer Moonshine v2 layout:
  - `frontend.ort`
  - `encoder.ort`
  - `decoder_kv.ort`
  - `adapter.ort`
  - `cross_kv.ort`
  - `tokenizer.bin`
  - `streaming_config.json`
- self-hosting those files is not enough on its own: the existing `moonshine-js` loader still looks for the old-style asset names
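
The mismatch can be made explicit with a small check. The sketch below uses only the file names listed above; `detectLayout` and `missingFiles` are hypothetical helper names, not part of `moonshine-js` or sherpa-onnx.

```typescript
// File names copied from the two layouts listed in this plan.
const LEGACY_FILES: readonly string[] = [
  "quantized/encoder_model.onnx",
  "quantized/decoder_model_merged.onnx",
];
const V2_FILES: readonly string[] = [
  "frontend.ort",
  "encoder.ort",
  "decoder_kv.ort",
  "adapter.ort",
  "cross_kv.ort",
  "tokenizer.bin",
  "streaming_config.json",
];

type BundleLayout = "legacy-onnx" | "moonshine-v2" | "unknown";

// Hypothetical helper: list which required files a bundle lacks.
function missingFiles(files: readonly string[], required: readonly string[]): string[] {
  const present = new Set(files);
  return required.filter((name) => !present.has(name));
}

// Hypothetical helper: classify a bundle by the files it contains.
function detectLayout(files: readonly string[]): BundleLayout {
  if (missingFiles(files, V2_FILES).length === 0) return "moonshine-v2";
  if (missingFiles(files, LEGACY_FILES).length === 0) return "legacy-onnx";
  return "unknown";
}
```

Running `missingFiles` on the unzipped bundle against `LEGACY_FILES` returns both old-style names, which is exactly what the current loader would fail to find.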

## Constraints

- keep transcription fully local in the browser
- keep the current terminal WebSocket/stdin path unchanged if possible
- preserve the small voice button UX already added to `webterm/static/js/terminal.ts`
- prefer self-hosted assets served by `webterm/static/`
- do not depend on third-party model/CDN availability at runtime

## Repo Touchpoints

Expected files/modules to change:

- `webterm/static/js/terminal.ts`
- `webterm/static/js/terminal.js`
- `package.json`
- `bun.lock`
- `package-lock.json`
- `webterm/assets_embed.go`
- `webterm/static/...` for new WASM/model assets
- `README.md`

Possible new files:

- `webterm/static/js/sherpa-voice.ts` or similar
- `webterm/static/js/sherpa-onnx.d.ts`
- `webterm/static/models/moonshine-v2-medium/...`
- `scripts/install-sherpa-model.sh` or similar helper

## Proposed Architecture

### 1. Frontend runtime

Use `sherpa-onnx` JavaScript/WebAssembly in the browser instead of `moonshine-js`.

Primary responsibilities:

- initialize sherpa WASM runtime
- load self-hosted Moonshine v2 model files
- initialize VAD + ASR pipeline
- expose simple start/stop API to terminal UI
- emit committed transcript strings

### 2. Audio flow

Use the VAD + offline (non-streaming) ASR flow described by sherpa-onnx:

- microphone input
- VAD detects speech boundaries
- finalized speech segment is sent to recognizer
- recognizer returns transcript
- transcript is injected into terminal stdin
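
A minimal sketch of this flow, assuming the VAD and recognizer can each be hidden behind a one-method interface. `SpeechSegmenter`, `SegmentRecognizer`, and `createVoicePipeline` are hypothetical names for illustration; they are not the sherpa-onnx API, whose actual shape still has to be verified.

```typescript
// Hypothetical role interfaces. These are NOT sherpa-onnx types; they only
// name the two stages described above so the flow can be shown end to end.
interface SpeechSegmenter {
  // Accepts a chunk of mono samples and returns any speech segments that ended.
  push(samples: Float32Array): Float32Array[];
}

interface SegmentRecognizer {
  // Transcribes one finalized speech segment.
  transcribe(segment: Float32Array): string;
}

// Builds a function that accepts microphone chunks and forwards any
// recognized text to the terminal input callback.
function createVoicePipeline(
  vad: SpeechSegmenter,
  asr: SegmentRecognizer,
  sendToStdin: (text: string) => void,
): (chunk: Float32Array) => void {
  return (chunk) => {
    for (const segment of vad.push(chunk)) {
      const text = asr.transcribe(segment).trim();
      // Skip empty results so silence or noise never reaches the terminal.
      if (text.length > 0) sendToStdin(text);
    }
  };
}
```

Injecting both stages keeps the terminal side testable with fakes before the real runtime is wired in.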

### 3. Terminal integration

Keep the existing terminal send path:

- reuse `sendStdin(...)` in `WebTerminal`
- keep current voice button/status UI shell
- replace only the recognition backend

## Implementation Phases

### Phase 1. Remove the current MoonshineJS dependency path

- remove `@moonshine-ai/moonshine-js` import and startup logic
- remove `moonshine-js.d.ts`
- remove MoonshineJS-specific error capture code
- keep the voice UI container, button, and status area

Deliverable:

- terminal still builds
- voice button exists but uses a new backend abstraction

### Phase 2. Stage sherpa-onnx assets locally

- choose exact sherpa-onnx JS/WASM package/version
- download/copy the required runtime assets into `webterm/static/`
- unpack `~/medium-streaming-en.zip` into repo-managed static model directory
- verify final model path layout expected by sherpa-onnx
- ensure static assets are available both in dev mode and embedded mode

Deliverable:

- all WASM/model files are served locally from `/static/...`

### Phase 3. Build a thin browser voice adapter

- create a dedicated module that wraps sherpa-onnx initialization
- define a minimal interface:
  - `start()`
  - `stop()`
  - `isActive()`
  - callbacks for:
    - status updates
    - final transcript
    - detailed errors
- keep terminal code from depending directly on low-level sherpa objects
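
The interface above could look roughly like the following. This is a sketch under stated assumptions: `VoiceBackend` and `loadBackend` are hypothetical stand-ins for whatever ends up wrapping the sherpa-onnx objects, and the status strings are placeholders.

```typescript
interface VoiceAdapterCallbacks {
  onStatus(message: string): void;
  onFinalTranscript(text: string): void;
  onError(error: Error): void;
}

interface VoiceAdapter {
  start(): Promise<void>;
  stop(): void;
  isActive(): boolean;
}

// Hypothetical handle around the low-level runtime (not a sherpa-onnx type).
interface VoiceBackend {
  begin(onText: (text: string) => void): void;
  end(): void;
}

function createVoiceAdapter(
  loadBackend: () => Promise<VoiceBackend>,
  callbacks: VoiceAdapterCallbacks,
): VoiceAdapter {
  let loadPromise: Promise<VoiceBackend> | null = null;
  let backend: VoiceBackend | null = null;
  let active = false;
  let listening = false;
  let session = 0; // lets a stale start() detect that it was superseded

  return {
    async start() {
      if (active) return;
      active = true;
      const mine = ++session;
      callbacks.onStatus("loading");
      try {
        if (!loadPromise) loadPromise = loadBackend();
        const loaded = await loadPromise;
        if (!active || mine !== session) return; // stopped or restarted while loading
        backend = loaded;
        backend.begin((text) => callbacks.onFinalTranscript(text));
        listening = true;
        callbacks.onStatus("listening");
      } catch (err) {
        loadPromise = null; // allow a retry on the next click
        if (mine !== session) return;
        active = false;
        callbacks.onError(err instanceof Error ? err : new Error(String(err)));
      }
    },
    stop() {
      if (!active) return;
      active = false;
      if (listening && backend) {
        backend.end();
        listening = false;
      }
      callbacks.onStatus("ready");
    },
    isActive: () => active,
  };
}
```

The terminal code only ever sees `VoiceAdapter` and the callbacks, so the backend can change without touching `terminal.ts` again.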

Deliverable:

- one self-contained browser voice adapter module

### Phase 4. Wire VAD + recognition into the terminal UI

- connect voice button click to adapter start/stop
- show clear states:
  - ready
  - loading runtime
  - loading model
  - listening
  - processing speech
  - final transcript sent
  - detailed error
- push final transcript into `sendStdin(...)`
- decide whether to append Enter (`\r`) automatically (see open question below)

Open question:

- should transcript be inserted as raw text only, or raw text plus `\r`?
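
One way to keep this question cheap to revisit is to make the commit behavior a parameter. The sketch below is illustrative: `VoiceUiState`, `STATUS_LABELS`, `CommitMode`, and `formatForStdin` are hypothetical names, and collapsing whitespace is an assumption about what the terminal should receive.

```typescript
// UI states mirroring the list above.
type VoiceUiState =
  | "ready"
  | "loading-runtime"
  | "loading-model"
  | "listening"
  | "processing"
  | "sent"
  | "error";

const STATUS_LABELS: Record<VoiceUiState, string> = {
  "ready": "Ready",
  "loading-runtime": "Loading runtime",
  "loading-model": "Loading model",
  "listening": "Listening",
  "processing": "Processing speech",
  "sent": "Transcript sent",
  "error": "Error",
};

type CommitMode = "text-only" | "text-plus-enter";

// Prepares a final transcript for the terminal input path.
function formatForStdin(transcript: string, mode: CommitMode): string {
  // Collapse whitespace (including newlines) so the transcript itself can
  // never submit a command; only the explicit mode adds "\r".
  const text = transcript.replace(/\s+/g, " ").trim();
  if (text.length === 0) return "";
  return mode === "text-plus-enter" ? text + "\r" : text;
}
```

With this shape, the call site would pass `formatForStdin(text, mode)` to `sendStdin(...)` and skip the call when the result is empty.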

Deliverable:

- end-to-end browser speech to terminal input using sherpa-onnx

### Phase 5. Performance and caching

- confirm browser HTTP caching behavior for WASM and model files
- add long-lived cache headers if needed for self-hosted static assets
- optionally add a service worker pre-cache later
- measure first-load vs repeat-load experience

Deliverable:

- repeat visits avoid re-downloading large model assets where possible
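
As a starting point for the header discussion, the policy can be written down as a pure function. This is an assumed policy, not something taken from the repo, and it is shown in TypeScript only for consistency with the frontend; the real headers would have to be set by whatever serves `/static/...`. `cacheControlFor` and the `urlIsVersioned` flag are hypothetical.

```typescript
// Extensions of the large binary assets named in this plan.
const LARGE_BINARY_EXTENSIONS = [".wasm", ".ort", ".bin"];

// Returns a Cache-Control value for a static asset path.
// Assumption: "immutable" is only safe when the URL changes whenever the
// file content changes (for example, a version segment in the path).
function cacheControlFor(path: string, urlIsVersioned: boolean): string {
  const isLargeBinary = LARGE_BINARY_EXTENSIONS.some((ext) => path.endsWith(ext));
  if (isLargeBinary && urlIsVersioned) {
    // One year, and the browser may skip revalidation entirely.
    return "public, max-age=31536000, immutable";
  }
  // Everything else may be stored but is revalidated before reuse,
  // so an unchanged file costs a 304 instead of a full download.
  return "no-cache";
}
```

Even the conservative `no-cache` branch avoids re-downloading model files on repeat visits, provided the server sends a validator such as `ETag` or `Last-Modified`.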

### Phase 6. Documentation and deploy

- document required model files and where they live
- document browser requirements
- document secure-origin requirement for microphone access
- document how to update the model bundle in future

Deliverable:

- README instructions for install, deploy, and troubleshooting

## Technical Questions To Resolve Early

1. Which sherpa-onnx JS/WASM distribution should we use?
   - npm package
   - vendored release bundle
   - custom copied example assets

2. Which exact browser API shape should we target?
   - direct sherpa recognizer API
   - sherpa VAD + non-streaming ASR helper
   - example-derived wrapper from sherpa demos

3. What is the expected asset layout for the Moonshine v2 medium zip?
   - whether files can be served exactly as unzipped
   - whether any extra config or renamed paths are required

4. What transcript commit behavior do we want?
   - send text only
   - send text plus Enter
   - configurable mode

## Risks

### Runtime/API mismatch

Risk:

- sherpa-onnx JS APIs may differ from the examples we choose

Mitigation:

- lock to one verified release
- copy a known working browser example shape before adapting

### Large asset size

Risk:

- medium model load time may be high on first use

Mitigation:

- self-host locally
- keep caching aggressive
- consider a fallback to a smaller model later

### Mobile/browser compatibility

Risk:

- WASM + large model + microphone flow may perform poorly on low-powered devices and less capable browsers

Mitigation:

- treat desktop Chromium/Firefox as first target
- gate unsupported browsers with explicit errors
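
A sketch of such a gate, assuming the voice path needs a secure context, WebAssembly, `getUserMedia`, and `AudioContext`. `BrowserEnv` and `unsupportedReasons` are hypothetical names; the environment is passed in rather than read from globals so the check can be tested outside a browser.

```typescript
// Minimal shape of the browser globals we probe.
interface BrowserEnv {
  isSecureContext?: boolean;
  WebAssembly?: unknown;
  AudioContext?: unknown;
  navigator?: { mediaDevices?: { getUserMedia?: unknown } };
}

// Returns one human-readable reason per missing capability.
// An empty array means the voice feature can be offered.
function unsupportedReasons(env: BrowserEnv): string[] {
  const reasons: string[] = [];
  if (env.isSecureContext !== true) {
    reasons.push("Page is not a secure context (HTTPS or localhost), so microphone access is blocked.");
  }
  if (typeof env.WebAssembly !== "object" || env.WebAssembly === null) {
    reasons.push("WebAssembly is not available in this browser.");
  }
  if (typeof env.navigator?.mediaDevices?.getUserMedia !== "function") {
    reasons.push("Microphone capture (getUserMedia) is not available.");
  }
  if (typeof env.AudioContext !== "function") {
    reasons.push("AudioContext is not available.");
  }
  return reasons;
}
```

In the page, the function would be called with `window` cast to `BrowserEnv`, and any returned reasons would be shown in the existing status area instead of starting the adapter.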

### Current worktree noise

Risk:

- repo already has ongoing voice-related frontend edits

Mitigation:

- isolate the new adapter into its own file/module
- keep migration incremental

## Validation Checklist

- `bun run typecheck` passes
- frontend bundle builds
- `./update.sh` deploys successfully
- remote HTTPS origin still permits microphone access
- first transcription works end-to-end
- repeated transcriptions do not leak memory or duplicate recognizers
- page reload reuses cached assets when possible
- detailed runtime errors are visible in UI and console
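
For the duplicate-recognizer item, one common guard is to memoize the asynchronous initialization so that repeated button clicks share a single instance. A minimal sketch; `memoizeAsync` is a hypothetical helper, and the factory passed to it stands in for the real sherpa-onnx setup.

```typescript
// Wraps an async factory so it runs at most once while it succeeds.
// Concurrent callers receive the same promise; a failure clears the
// cache so a later call can retry.
function memoizeAsync<T>(create: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = create().catch((err) => {
        pending = null; // allow retry after a failed load
        throw err;
      });
    }
    return pending;
  };
}
```

A checklist run could then count how many times the factory is invoked across several start/stop cycles and expect exactly one.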

## Recommended Execution Order

1. Pick and verify one sherpa-onnx browser example for Moonshine v2
2. Vendor the required WASM/runtime assets
3. Unpack and serve the local model bundle
4. Build a dedicated browser adapter module
5. Rewire the existing voice UI to that adapter
6. Validate microphone -> transcript -> terminal stdin flow
7. Improve caching and docs

## Definition of Done

- no `moonshine-js` dependency remains in the browser path
- voice input uses sherpa-onnx + locally hosted Moonshine v2 assets
- transcripts reach terminal stdin reliably
- remote HTTPS access works
- runtime/model errors are reported in the UI and console with enough detail to diagnose
- setup is documented and reproducible