Speech ⇄ Fongbe · ASR v14 (WER_NT 18.79%) + MMS-TTS-fon
🎤 Record voice
Tip: speak clearly; recordings of 5–60 s work best. Stop when done and transcription starts immediately.
📂 Upload audio
Drag & drop an audio file here
Accepted formats: wav · mp3 · m4a · ogg · flac · webm · mp4 (up to 50 MB / 5 min)
📝 Transcription
Inference details
💬 Conversation with Fon assistant
Type in Fon or French — the assistant auto-detects. In cascade mode, Fon is translated to French via NLLB, a French LLM reasons over it, and the reply is translated back. Press Enter to send, Shift+Enter for newline.
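The cascade described above can be sketched roughly as follows. This is only an illustration: the function name `cascade_reply` and its four callables (`detect_lang`, `fon_to_fr`, `fr_to_fon`, `french_llm`) are hypothetical stand-ins, not the app's actual code, and the sketch assumes that French input skips both translation steps.

```python
from typing import Callable


def cascade_reply(
    user_text: str,
    detect_lang: Callable[[str], str],   # hypothetical: returns "fon" or "fr"
    fon_to_fr: Callable[[str], str],     # hypothetical wrapper around an NLLB Fon->French call
    fr_to_fon: Callable[[str], str],     # hypothetical wrapper around an NLLB French->Fon call
    french_llm: Callable[[str], str],    # hypothetical wrapper around the French LLM
) -> str:
    """Route one chat turn through the translate -> reason -> translate-back cascade."""
    lang = detect_lang(user_text)

    # Fon input is first translated to French; French input is passed through
    # unchanged (an assumption, the page only describes the Fon path).
    prompt_fr = fon_to_fr(user_text) if lang == "fon" else user_text

    # The LLM only ever sees and produces French.
    reply_fr = french_llm(prompt_fr)

    # Translate the reply back only when the user wrote in Fon.
    return fr_to_fon(reply_fr) if lang == "fon" else reply_fr
```

Passing the translation and LLM steps in as callables keeps the routing logic testable without loading any model.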
🗣 Fon text-to-speech
Type or paste Fon text and synthesize it with facebook/mms-tts-fon. Tone diacritics (á à â ǎ, including on ɛ́ and ɔ́) and Fon-specific letters such as ɖ are respected.
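For readers who want to reproduce the synthesis step outside this page, a minimal sketch using the Hugging Face `transformers` VITS classes might look like the following. The helper names `count_tone_marks` and `synthesize` are illustrative, and the choice of which combining marks count as tone marks (grave, acute, circumflex, caron) is an assumption based on the examples listed above. The model is downloaded on first use.

```python
import unicodedata

MODEL_ID = "facebook/mms-tts-fon"

# Combining grave, acute, circumflex, caron (assumed set of tone marks).
TONE_MARKS = {"\u0300", "\u0301", "\u0302", "\u030C"}


def count_tone_marks(text: str) -> int:
    """Count tone diacritics, whether precomposed (à) or combining (ɔ + U+0301)."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if ch in TONE_MARKS)


def synthesize(text: str):
    """Return (waveform as a 1-D NumPy array, sampling rate) for Fon text."""
    # Imported lazily so the helper above works without torch installed.
    import torch
    from transformers import AutoTokenizer, VitsModel

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = VitsModel.from_pretrained(MODEL_ID)

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = model(**inputs).waveform[0]
    return waveform.numpy(), model.config.sampling_rate
```

A caller could use `count_tone_marks` to warn when pasted text carries no tone marks at all before calling `synthesize`.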
🕘 History (this session)
About this model
ASR (speech → text): Fon v14 (tone_aware_v14_real_on_v12s) — SpeechBrain wav2vec2 backbone (whettenr/asr-fon-with-diacritics-bpe-256) + F0 side encoder + auxiliary tone-CTC head.
ASR training data: 198 h of Fon, combining FFSTC-2, Zenodo, Bible (~111 h), non-religious lessons, 4 real-conversational sources (storyrunners, yoannoza, accentfine, jonathansuru), and synthetic speech generated with MMS-TTS-fon from amaniopia/Shads parallel text.
ASR performance (cross-corpus, WER_NT_canon; lower is better):
FFSTC-2 in-distribution: 15.25%
beetho_female: 1.80% · beetho_male: 8.39%
jemima_fon: 8.92% · fongbe_zenodo: 12.73%
Disjoint cross-corpus avg: 18.79%
TTS (text → speech): stock facebook/mms-tts-fon (Meta MMS, VITS architecture). Our own fine-tuning attempts (v1 and v2 of the project TTS) regressed against the stock baseline: the vendor VITS fine-tuning recipe destroys the text encoder regardless of learning-rate or GAN settings. Stock MMS-TTS-fon achieves 18.10% ASR-loop WER_NT_canon on our held-out Bible eval set (n=200, seed=1337), which is the current production quality bar.
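One plausible reading of the ASR-loop metric is: synthesize each reference sentence, transcribe the audio with the ASR, and score the transcripts against the original text. The sketch below follows that reading. The names `sample_eval_set` and `asr_loop_score`, the three callables, and the idea that the seed drives the choice of the 200 sentences are all assumptions, not the project's evaluation code.

```python
import random
from typing import Any, Callable, List, Sequence


def sample_eval_set(texts: Sequence[str], n: int = 200, seed: int = 1337) -> List[str]:
    """Draw a reproducible evaluation subset (assumes the seed controls sampling)."""
    rng = random.Random(seed)
    return rng.sample(list(texts), min(n, len(texts)))


def asr_loop_score(
    texts: Sequence[str],
    synthesize: Callable[[str], Any],                      # hypothetical TTS wrapper
    transcribe: Callable[[Any], str],                      # hypothetical ASR wrapper
    score: Callable[[List[str], List[str]], float],        # e.g. a WER_NT scorer
    n: int = 200,
    seed: int = 1337,
) -> float:
    """Text -> TTS -> ASR -> error rate against the original text."""
    refs = sample_eval_set(texts, n, seed)
    hyps = [transcribe(synthesize(ref)) for ref in refs]
    return score(refs, hyps)
```

Because the TTS, ASR, and scoring steps are injected, the same loop can compare a stock and a fine-tuned TTS against one fixed ASR and one fixed sample.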
Best for: read or semi-spontaneous, single-speaker Fon speech (ASR) and short-to-medium Fon utterances (TTS). WER_NT means tone diacritics are normalized away before scoring, so only the underlying letter sequence is scored.
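To make the WER_NT definition concrete, here is a small sketch of a tone-insensitive word error rate. It assumes that "normalized" means stripping the combining grave, acute, circumflex, and caron marks and lowercasing. The project's actual normalization (including whatever the `_canon` suffix adds) may differ, and the function names are illustrative.

```python
import unicodedata
from typing import List, Sequence

# Combining grave, acute, circumflex, caron (assumed set of tone marks).
TONE_MARKS = {"\u0300", "\u0301", "\u0302", "\u030C"}


def strip_tones(text: str) -> str:
    """Remove tone diacritics while keeping base letters such as ɛ, ɔ, ɖ."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer_nt(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level WER after tone stripping: total word errors / total reference words."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        r = strip_tones(ref).lower().split()
        h = strip_tones(hyp).lower().split()
        errors += edit_distance(r, h)
        words += len(r)
    return errors / words if words else 0.0
```

Under this definition a hypothesis that differs from the reference only in tone marks scores 0% error, which is why WER_NT figures say nothing about tone accuracy.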
Known limitations:
Heavy code-switching with French is partially handled but not the focus.
Spontaneous overlapping conversation degrades ASR.
TTS uses a single speaker voice (MMS default); voice cloning is not supported.
Tone marking in output follows Bible orthography conventions and may differ from other sources.
Not for safety-critical use (medical dosage, legal numbers).