Standalone Android ASR · May 2026

Transcriber

On-device speech intelligence for Android. Record, transcribe, translate, and clean up without leaving the phone.

Gemma 4 · Whisper.cpp · Apache-2.0 · Sideload APK

View on GitHub ↗

100% Local inference — Whisper.cpp and Gemma 4 run entirely on the device's GPU and CPU.

0 Network calls in a run. Audio, transcripts, and sidecars never leave the phone. Sharing is opt-in.

2.6 GB Default model: Gemma 4 E2B weights, fetched once. Bigger Whisper and embedding models optional.

2 ASR engines side by side — Gemma 4 E2B and Whisper.cpp. Pick one per recording.

4 Languages anchored against drift — Arabic, Ukrainian, English, Dutch. Multi-select, constrained auto-detect.

84 ms Library search across every segment of ~1000 recordings. Sub-100 ms, case-insensitive.

From mic to clean transcript, in one pass.

Capture

16 kHz mono PCM: mic or imported WAV / MP3 / M4A / AAC / OGG / FLAC

Segment

VAD silence scan: cut points snap to pauses ≥ 250 ms within ±2 s of the target

Recognize

Gemma 4 E2B or Whisper.cpp: streamed token by token, single long-lived engine

Diarize

sherpa-onnx clusters: reconciled with Gemma's inline "Speaker N:" labels

Post-process

Summary, rewrite, clean, translate: same Gemma engine, four prompt presets

Persist

Room DB + sidecars: .txt · .srt · .json · .speakers.json beside the audio file
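As an illustration of the .srt sidecar, here is a minimal sketch in Python (chosen for brevity; the app itself is an Android project). The Segment type and the function names are hypothetical, and the only thing taken as given is the standard SubRip layout: a running index, a "start --> end" line with millisecond timestamps, then the text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape of one transcript segment.
    start_s: float
    end_s: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SubRip timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments as SubRip blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start_s)} --> {srt_timestamp(seg.end_s)}\n"
            f"{seg.text}\n"
        )
    return "\n".join(blocks)
```

Writing the result next to the audio file under the same base name would give the "beside the audio file" layout described above.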

Two engines, one app. Switch per recording.

Gemma 4 E2B · LiteRT-LM 0.11 (Default)

Transcribe, translate, and diarize from one weight set.

Size: 2.59 GB on disk · 676 MB GPU footprint
Best for: Dialectal Arabic, code-switching, Ukrainian
Strengths: Holds Gulf Arabic without collapsing to MSA. Streaming decode keeps peak heap ~6 MB regardless of file length.

Whisper.cpp · ggml

The classic open-source baseline.

Models: tiny (75 MB) · small (466 MB) · large-v3-turbo q5_0 (574 MB)
Best for: Predictable latency · live transcription on weak devices
Strengths: Faster cold start. Pairs with sherpa-onnx for stable speaker IDs across hour-long meetings.

Four languages, including the ones that usually drift to English.

Arabic

العربية

Gulf / Qatari dialect anchored in the prompt — won't collapse to MSA at temperature 0.1.

Ukrainian

Українська

Held against Russian by the same dialect-name-twice template Google's audio cookbook recommends.

English

US, UK, and accented variants. The baseline that everything is checked against.

Dutch

Nederlands

NL and BE. Domain vocabulary packs propagate as Gemma spelling hints.

Multi-select picker. Empty = full auto-detect. One = forced. Multiple = constrained auto — faster and more accurate than letting Gemma consider 100+ languages.
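The picker rule above can be sketched as a small mapping from the selection to a detection mode. This is a Python illustration only; the mode names and the function are hypothetical, not the app's API.

```python
def language_mode(selected):
    """Map the picker selection to a detection mode.

    Returns (mode, languages) where mode is one of the hypothetical
    labels "auto", "forced", or "constrained".
    """
    # Drop duplicates while keeping the order the user picked.
    langs = list(dict.fromkeys(selected))
    if not langs:
        return ("auto", [])          # empty selection: full auto-detect
    if len(langs) == 1:
        return ("forced", langs)     # one language: no detection at all
    return ("constrained", langs)    # several: detect only among these
```

For example, selecting Arabic and Dutch yields ("constrained", ["ar", "nl"]), so detection only has to choose between two candidates.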

Three ways to figure out who said what.

Gemma-only

Prompt-based Speaker N: labels emitted inline by Gemma 4. No extra download.

Zero install

Hybrid

sherpa-onnx pre-pass clusters speakers globally. Cluster IDs become per-chunk hints for Gemma, and reconcile its inline labels at write time.

Recommended
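One plausible way to reconcile inline labels with global clusters is a majority vote per label. The Python sketch below assumes each segment carries both Gemma's inline label and a sherpa-onnx cluster ID; the function name and data shape are hypothetical, and the app's actual reconciliation rule may differ.

```python
from collections import Counter, defaultdict


def reconcile(segments):
    """Map each inline label to the cluster it overlaps most often.

    segments: iterable of (inline_label, cluster_id) pairs for one chunk.
    """
    votes = defaultdict(Counter)
    for inline_label, cluster_id in segments:
        votes[inline_label][cluster_id] += 1
    # Pick the most frequent cluster for every inline label.
    return {
        label: counts.most_common(1)[0][0]
        for label, counts in votes.items()
    }
```

Because cluster IDs are global, mapping each chunk's local "Speaker N:" labels onto them is what would keep one person under one ID across chunks.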

Whisper + sherpa

Run Whisper.cpp end-to-end, then assign speakers from sherpa's cluster output. Stable IDs across hour-long files.

Globally consistent

Bonus: when someone says "Hi, I'm Ahmed" in a segment, the detected name propagates to every segment with the same speaker key.
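The name propagation can be illustrated with a deliberately naive Python sketch. The regular expression, the segment shape, and the function name are assumptions for illustration; this pattern only catches English self-introductions with a capitalised ASCII name, so real detection would need to be broader.

```python
import re

# Naive English-only pattern: "I'm X", "I am X", "my name is X".
INTRO = re.compile(r"\b(?i:I['’]m|I am|my name is)\s+([A-Z][a-z]+)")


def propagate_names(segments):
    """Attach a display name to every segment.

    segments: list of dicts with "speaker" (the speaker key) and "text".
    The first self-introduction found for a speaker key wins, and the
    name is applied to all segments sharing that key, earlier ones too.
    """
    names = {}
    for seg in segments:
        match = INTRO.search(seg["text"])
        if match and seg["speaker"] not in names:
            names[seg["speaker"]] = match.group(1)
    return [
        {**seg, "name": names.get(seg["speaker"], seg["speaker"])}
        for seg in segments
    ]
```

Speakers who never introduce themselves keep their speaker key as the display name.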

Four presets that read everything first.

Summary

TL;DR plus bullet key points and a decisions / action items list. Opens as a new tab next to Transcript.

Context-aware rewrite

Reads the entire transcript first, then rewrites each line using the conversation as context. Back-propagates name spellings; fixes homophones; doesn't paraphrase.

Clean

Line by line — fix STT errors, add punctuation, preserve language. Honors verbatim, filler, and tone settings.

Translate & polish

Idiomatic English prose, not segment-by-segment. Same Gemma engine, prompted to read for meaning before rendering.

Output is rendered Markdown · Share + Delete per output · all four prompts editable in Settings.

Teach it your jargon. Then forget you did.

Medical

amoxicillin
atrial fibrillation
ICD-10
MeSH · RxNorm

IT / Software

Kubernetes
OpenTelemetry
CNCF
ACM CCS

Construction

CSI MasterFormat
ASTM · NEN
QCS · ДБН
materials, codes

Legal

res ipsa loquitur
certiorari
Cornell LII Wex
rechtspraak.nl

Finance

CDS · IRS · MiFID
SEC · BIS · AFM
NBU · QFMA
instruments, regs

Plus: Custom vocabulary · Tone & verbatim · Filler removal · {snippet:signoff}

Designed for the moment the phone tries to kill the job.

Foreground service

Mic FGS during capture, mediaProcessing FGS for batch jobs. Partial wake lock keeps long jobs alive.

VAD-aligned chunks

28-second targets land on the nearest pause within ±2 s. Words no longer cut mid-syllable at chunk boundaries.
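The snapping rule can be sketched as follows, in Python for brevity. The sketch assumes the VAD pass has already produced a list of pauses as (start, end) pairs in seconds; the function name, cutting at the middle of the pause, and the hard-cut fallback when no pause qualifies are all assumptions, not confirmed behaviour of the app.

```python
def cut_points(duration_s, pauses, target_s=28.0, window_s=2.0, min_pause_s=0.25):
    """Return cut times so each chunk ends near target_s, on a pause.

    pauses: list of (start_s, end_s) silences reported by the VAD.
    """
    cuts = []
    chunk_start = 0.0
    # Stop once the remainder fits in a single chunk.
    while duration_s - chunk_start > target_s + window_s:
        ideal = chunk_start + target_s
        # Midpoints of long-enough pauses inside the +/- window.
        candidates = [
            (start + end) / 2
            for start, end in pauses
            if end - start >= min_pause_s
            and abs((start + end) / 2 - ideal) <= window_s
        ]
        if candidates:
            cut = min(candidates, key=lambda c: abs(c - ideal))
        else:
            cut = ideal  # assumed fallback: hard cut at the target
        cuts.append(cut)
        chunk_start = cut
    return cuts
```

With a pause at 27.0 to 27.5 s, the first cut lands at 27.25 s instead of 28 s, and the next 28-second target is measured from there.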

Silent-wedge watchdog

60-second no-delta cancel. Engine is released, reloaded, and the chunk retried once with trimmed input.
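A watchdog of this kind could look roughly like the Python sketch below. The engine interface (decode yielding text deltas, close), the make_engine factory, and the trim callback are hypothetical stand-ins; only the 60-second timeout and the release, reload, retry-once sequence come from the description above.

```python
import queue
import threading


class Wedged(Exception):
    """No new delta arrived within the timeout."""


_DONE = object()


def stream_with_watchdog(generate, timeout_s=60.0):
    """Collect deltas from generate(); raise Wedged on a silent stall."""
    deltas = queue.Queue()

    def worker():
        try:
            for delta in generate():
                deltas.put(delta)
        finally:
            # Always signal completion, even if decoding raised.
            deltas.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    text = []
    while True:
        try:
            item = deltas.get(timeout=timeout_s)
        except queue.Empty:
            raise Wedged(f"no delta for {timeout_s} s") from None
        if item is _DONE:
            return "".join(text)
        text.append(item)


def transcribe_with_retry(make_engine, audio, trim, timeout_s=60.0):
    """Decode once; on a wedge, reload the engine and retry once."""
    engine = make_engine()
    try:
        return stream_with_watchdog(lambda e=engine: e.decode(audio), timeout_s)
    except Wedged:
        engine.close()             # release the stuck engine
        engine = make_engine()     # reload a fresh one
        return stream_with_watchdog(
            lambda e=engine: e.decode(trim(audio)), timeout_s
        )
```

A second wedge propagates to the caller, which matches the "retried once" wording. The timer resets on every delta, so a long but steadily streaming chunk is never cancelled.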

Silence pre-skip

RMS below −54 dBFS bypasses Gemma entirely. Eliminates the prefill wedge before it can start.
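The silence check reduces to an RMS level in dBFS compared against the threshold. Here is a minimal Python sketch for 16-bit little-endian mono PCM; the function names are hypothetical, and taking 32768 as full scale is an assumption about the reference level.

```python
import math
import struct


def rms_dbfs(pcm16: bytes) -> float:
    """RMS level of 16-bit little-endian PCM, in dB relative to full scale."""
    count = len(pcm16) // 2
    if count == 0:
        return float("-inf")
    samples = struct.unpack(f"<{count}h", pcm16[: count * 2])
    mean_square = sum(s * s for s in samples) / count
    if mean_square == 0:
        return float("-inf")  # digital silence
    # 10*log10 of the power ratio equals 20*log10 of the amplitude ratio.
    return 10 * math.log10(mean_square / (32768.0 ** 2))


def should_skip(pcm16: bytes, threshold_dbfs: float = -54.0) -> bool:
    """True when the chunk is quiet enough to bypass the recognizer."""
    return rms_dbfs(pcm16) < threshold_dbfs
```

At −54 dBFS the RMS amplitude is roughly 65 out of 32768, so only near-silent chunks are skipped.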

FIFO job queue

Bulk-enqueue and walk away. Tasks survive process death via Room; charger-parked jobs replay on launch.
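The persistence idea can be sketched with SQLite, the database underneath Room. This is a Python illustration using the standard sqlite3 module, not the app's Room schema; the table, column, and state names are hypothetical, and requeueing jobs found in the "running" state at startup is an assumed recovery rule.

```python
import sqlite3


class JobQueue:
    """FIFO job queue whose state survives a process restart."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "audio_path TEXT NOT NULL, "
            "state TEXT NOT NULL DEFAULT 'queued')"
        )
        # A job still marked 'running' means the process died mid-run.
        self.db.execute("UPDATE jobs SET state='queued' WHERE state='running'")
        self.db.commit()

    def enqueue(self, paths):
        self.db.executemany(
            "INSERT INTO jobs (audio_path) VALUES (?)", [(p,) for p in paths]
        )
        self.db.commit()

    def next_job(self):
        """Claim the oldest queued job, or return None when idle."""
        row = self.db.execute(
            "SELECT id, audio_path FROM jobs "
            "WHERE state='queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("UPDATE jobs SET state='running' WHERE id=?", (row[0],))
        self.db.commit()
        return row

    def finish(self, job_id):
        self.db.execute("UPDATE jobs SET state='done' WHERE id=?", (job_id,))
        self.db.commit()
```

Ordering by the autoincrement ID gives FIFO behaviour, and an interrupted job comes back at the front of the queue because it keeps its original ID.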

Run uninterrupted

First-class flow for the Samsung battery-exemption prompt — the one that keeps long jobs from getting reaped.

The compute knobs that actually matter.

Backend

Auto · GPU only · CPU only

Auto falls back GPU → CPU on init failure. Pin GPU for max throughput, CPU for predictable latency.
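The Auto behaviour amounts to trying backends in order and keeping the first that initialises. The Python sketch below is illustrative only: the factories dict and the function name are hypothetical, and real initialisation would go through the inference runtime rather than plain callables.

```python
def init_engine(backend, factories):
    """Create an engine for backend "auto", "gpu", or "cpu".

    factories: dict mapping "gpu" and "cpu" to zero-argument callables
    that build an engine or raise on failure (hypothetical interface).
    Returns (backend_used, engine).
    """
    # Auto tries GPU first, then CPU; a pinned backend has no fallback.
    order = ["gpu", "cpu"] if backend == "auto" else [backend]
    last_error = None
    for name in order:
        try:
            return name, factories[name]()
        except Exception as err:
            last_error = err
    raise RuntimeError(f"no backend could be initialised: {order}") from last_error
```

Returning the backend that was actually used lets the UI show when Auto has quietly landed on CPU.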

Context window

4K · 8K · 16K · 32K

8K is the sweet spot. Bump higher when Context-aware rewrite truncates on hour-long meetings.

CPU threads

Auto · 2 · 4 · 6 · 8

Auto picks ~half the cores. Going above physical-core count consistently hurts; don't.
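A rough Python sketch of the thread setting follows. The function name is hypothetical, and clamping an explicit choice to the core count is an assumption drawn from the advice above rather than confirmed app behaviour. Note that os.cpu_count() reports logical cores, which can exceed the physical count.

```python
import os


def resolve_threads(setting, cores=None):
    """Turn the picker value ("auto" or a number) into a thread count."""
    cores = cores or os.cpu_count() or 1
    if setting == "auto":
        return max(1, cores // 2)        # about half the cores
    # Assumed guard: never exceed the available cores.
    return max(1, min(int(setting), cores))
```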

Find recordings by what was said in them.

⌕ retention cohort · 3 results · 84 ms

Q3 Planning with Engineering Team

…shifted the retention cohort definition to 28-day…

14 May · 42 min

1:1 — Sara

…what does the retention cohort look like for…

12 May · 28 min

Voice memo · 11:42

…revisit the retention cohort spreadsheet…

08 May · 02 min

Case-insensitive. Auto-titled recordings get real names — Q3 Planning with Engineering Team instead of Recording_2026-05-17_14-23-30.
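Case-insensitive matching across segments can be sketched in a few lines of Python. The recording shape and function name are hypothetical, and the app may well use a database index instead of a linear scan. str.casefold is used here because it also handles non-Latin scripts such as Cyrillic.

```python
def search(recordings, query):
    """Return (title, segment) for the first matching segment per recording.

    recordings: list of dicts with "title" and "segments" (list of str).
    """
    needle = query.casefold()
    hits = []
    for rec in recordings:
        for segment in rec["segments"]:
            if needle in segment.casefold():
                hits.append((rec["title"], segment))
                break  # one hit per recording is enough for the list view
    return hits
```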

A complete speech stack — yours, on the phone, under Apache-2.0.


Stack

Gemma 4 · LiteRT-LM 0.11 · Whisper.cpp · sherpa-onnx · Compose Material 3 · Room · ExoPlayer

Distribution

APK sideload. Curated model downloads inside the app. Custom *.bin / *.litertlm import.

Promise

Audio never leaves the phone. The code is readable. Nothing phones home.

Created by Yuri Ihnatov · ihnatov.nl · nl.ihnatov.transcriber · v1 · 2026