Add neutts_nano_tts#21

Draft
jathinsn27 wants to merge 1 commit into zetic-ai:main from jathinsn27:neutts_nano

Conversation

@jathinsn27
Contributor

Summary

This PR adds iOS integration of the NeuTTS Nano text-to-speech model, enabling on-device speech synthesis with voice-cloning capabilities. The implementation includes a three-stage pipeline (Backbone → Encoder → Decoder), custom tokenization matching Hugging Face's ByteLevel BPE tokenizer, and integration of espeak-ng for phonemization. It also covers cross-compiling the native libraries for iOS and aligning model inputs/outputs with the three models from the ZeticMLange SDK.


1. NeuTTS Nano Architecture Overview

Text Input → Phonemization → Tokenization → Backbone Model → Speech Codes → Decoder → Audio Output
                ↓
         Reference Audio → Encoder → Reference Codes

Components:

  1. Backbone Model (neutts_nano): A language model that generates discrete speech tokens from phonemized text
  2. Encoder Model (neucodec-encoder): Converts reference audio to discrete codes for voice cloning
  3. Decoder Model (neucodec-decoder): Converts discrete codes back to raw audio waveforms
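The data flow between these three components can be sketched as below, with each model stubbed as a closure so the stages are explicit. All names here are illustrative, not the PR's actual API; the real models run through the ZeticMLange SDK, and how the reference codes condition the backbone (e.g. by being included in its prompt) is elided in this sketch.

```swift
import Foundation

// Illustrative pipeline wiring: each stage is a plain closure.
struct TTSPipeline {
    let backbone: (_ inputIDs: [Int32], _ mask: [Int32]) -> [Int32]  // text tokens -> speech codes
    let encoder: (_ audio16k: [Float]) -> [Int32]                    // 16 kHz reference -> codes
    let decoder: (_ codes: [Int64]) -> [Float]                       // codes -> 24 kHz waveform

    func synthesize(inputIDs: [Int32], mask: [Int32], referenceAudio: [Float]) -> [Float] {
        let referenceCodes = encoder(referenceAudio)  // voice-cloning reference
        _ = referenceCodes                            // conditioning step elided in this sketch
        let speechCodes = backbone(inputIDs, mask)
        return decoder(speechCodes.map(Int64.init))
    }
}
```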

2. Model Input/Output Specifications

2.1 Backbone Model (neutts_nano)

Input:

  • input_ids: [1, 128] shape, int32 dtype
    • Tokenized and phonemized text prompt
    • Includes special tokens: <|start_header_id|>, <|end_header_id|>, <|speech_###|>
    • Format: "<|start_header_id|>user<|end_header_id|>\n\n{phonemes}<|speech_###|>"
  • attention_mask: [1, 128] shape, int32 dtype
    • Binary mask indicating valid tokens (1) vs padding (0)

Output:

  • Logits: [1, 128, vocab_size] shape, float32 dtype
    • Language model logits over vocabulary
    • Contains special speech tokens <|speech_###|> where ### represents discrete audio codes
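Given the fixed [1, 128] input shape, the tokenized prompt has to be padded (or truncated) with a matching attention mask. A minimal sketch, assuming a pad token id of 0 (the real value depends on the tokenizer config):

```swift
import Foundation

// Pad/truncate token ids to the backbone's fixed length and build the
// attention mask (1 = real token, 0 = padding).
func makeBackboneInputs(tokenIDs: [Int32],
                        maxLength: Int = 128,
                        padTokenID: Int32 = 0) -> (inputIDs: [Int32], attentionMask: [Int32]) {
    let kept = Array(tokenIDs.prefix(maxLength))
    let padCount = maxLength - kept.count
    let inputIDs = kept + Array(repeating: padTokenID, count: padCount)
    let mask = Array(repeating: Int32(1), count: kept.count)
             + Array(repeating: Int32(0), count: padCount)
    return (inputIDs, mask)
}
```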

2.2 Encoder Model (neucodec-encoder)

Input:

  • audio: [1, 1, 16000] shape, float32 dtype
    • Mono audio waveform at 16 kHz sample rate
    • Must be exactly 16000 samples (1 second of audio)
    • Values normalized to [-1.0, 1.0] range

Output:

  • codes: [1, 1, 50] shape, int32 dtype
    • Discrete audio codes representing the reference audio
    • Used for voice cloning (prosody and timbre transfer)
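Preparing a reference clip for this input shape can be sketched as follows. This assumes the caller has already resampled to 16 kHz mono; the function only clamps values to [-1.0, 1.0] and pads/truncates to exactly 16000 samples:

```swift
import Foundation

// Shape a mono 16 kHz clip into the encoder's fixed [1, 1, 16000] input.
func prepareEncoderAudio(_ samples: [Float], sampleCount: Int = 16_000) -> [Float] {
    var clip = samples.map { min(max($0, -1.0), 1.0) }  // clamp to valid range
    if clip.count > sampleCount {
        clip = Array(clip.prefix(sampleCount))           // truncate long clips
    } else if clip.count < sampleCount {
        clip += Array(repeating: 0, count: sampleCount - clip.count)  // zero-pad
    }
    return clip
}
```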

2.3 Decoder Model (neucodec-decoder)

Input:

  • codes: [1, 1, 50] shape, int64 dtype
    • Discrete audio codes from backbone output
    • Must be padded/truncated to exactly 50 codes

Output:

  • audio: [1, 1, 24000] shape, float32 dtype
    • Raw PCM audio waveform at 24 kHz sample rate
    • Values in range [-1.0, 1.0]
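Note the dtype change across the pipeline: the encoder emits int32 codes while the decoder takes int64. A sketch of the pad/truncate-to-50 step (the pad code of 0 is an assumption; the real padding value depends on the codec vocabulary):

```swift
import Foundation

// Align codes with the decoder's fixed [1, 1, 50] int64 input.
func prepareDecoderCodes(_ codes: [Int32], codeCount: Int = 50, padCode: Int64 = 0) -> [Int64] {
    var out = codes.prefix(codeCount).map { Int64($0) }  // truncate and widen int32 -> int64
    if out.count < codeCount {
        out += Array(repeating: padCode, count: codeCount - out.count)  // pad short sequences
    }
    return out
}
```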

3. Phonemization with espeak-ng

Initial Approach:

  • Attempted to use Hugging Face's transformers library tokenizer directly
  • Considered using ZeticMLange to deploy tokenizer as a model (not suitable for text processing)

Phonemization Process:

let phonemes = EspeakPhonemizer.shared.phonemize("Hello world")
// Output: "həlˈoʊ wˈɜːld"

This mirrors the implementation in neutts: https://github.com/neuphonic/neutts/blob/main/neutts/neutts.py#L11

The phonemes are then tokenized and fed into the backbone model.
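Going the other way, the backbone's generated tokens carry the discrete audio codes inside `<|speech_###|>` markers. A sketch of extracting those codes from detokenized output (the token strings are assumptions based on the format described above; non-speech tokens are ignored):

```swift
import Foundation

// Pull discrete audio codes out of generated <|speech_###|> tokens.
func extractSpeechCodes(from tokens: [String]) -> [Int32] {
    tokens.compactMap { token in
        guard token.hasPrefix("<|speech_"), token.hasSuffix("|>") else { return nil }
        let digits = token.dropFirst("<|speech_".count).dropLast("|>".count)
        return Int32(digits)  // nil (dropped) if the middle isn't a number
    }
}
```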

Build Settings:

  • Add libespeak-ng.a to "Link Binary With Libraries"
  • Add -DESPEAK_AVAILABLE to "Other Swift Flags"
  • Set header search path to espeak-ng source directory

@jathinsn27 jathinsn27 marked this pull request as draft January 24, 2026 14:29
@jathinsn27 jathinsn27 changed the title from "Add ne" to "Add neutts_nano_tts" Jan 24, 2026
