Add neutts_nano_tts#21

Draft
jathinsn27 wants to merge 1 commit into zetic-ai:main from jathinsn27:neutts_nano

Conversation

@jathinsn27
Contributor

Summary

This PR adds iOS integration of the NeuTTS Nano text-to-speech model, enabling on-device speech synthesis with voice-cloning capabilities. The implementation includes a three-stage pipeline (Backbone → Encoder → Decoder), custom tokenization matching Hugging Face's ByteLevel BPE tokenizer, and integration of espeak-ng for phonemization. It also covers cross-compiling the native libraries for iOS and aligning model inputs/outputs with the three models from the ZeticMLange SDK.


1. NeuTTS Nano Architecture Overview

Text Input → Phonemization → Tokenization → Backbone Model → Speech Codes → Decoder → Audio Output
                ↓
         Reference Audio → Encoder → Reference Codes

Components:

  1. Backbone Model (neutts_nano): A language model that generates discrete speech tokens from phonemized text
  2. Encoder Model (neucodec-encoder): Converts reference audio to discrete codes for voice cloning
  3. Decoder Model (neucodec-decoder): Converts discrete codes back to raw audio waveforms
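The data flow between these three components can be sketched as below, with each model stubbed as a closure so the stages are explicit. All names here are illustrative, not the PR's actual API; the real models run through the ZeticMLange SDK, and how the reference codes condition the backbone (e.g. by being included in its prompt) is elided in this sketch.

```swift
import Foundation

// Illustrative pipeline wiring: each stage is a plain closure.
struct TTSPipeline {
    let backbone: (_ inputIDs: [Int32], _ mask: [Int32]) -> [Int32]  // text tokens -> speech codes
    let encoder: (_ audio16k: [Float]) -> [Int32]                    // 16 kHz reference -> codes
    let decoder: (_ codes: [Int64]) -> [Float]                       // codes -> 24 kHz waveform

    func synthesize(inputIDs: [Int32], mask: [Int32], referenceAudio: [Float]) -> [Float] {
        let referenceCodes = encoder(referenceAudio)  // voice-cloning reference
        _ = referenceCodes                            // conditioning step elided in this sketch
        let speechCodes = backbone(inputIDs, mask)
        return decoder(speechCodes.map(Int64.init))
    }
}
```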

2. Model Input/Output Specifications

2.1 Backbone Model (neutts_nano)

Input:

  • input_ids: [1, 128] shape, int32 dtype
    • Tokenized and phonemized text prompt
    • Includes special tokens: <|start_header_id|>, <|end_header_id|>, <|speech_###|>
    • Format: "<|start_header_id|>user<|end_header_id|>\n\n{phonemes}<|speech_###|>"
  • attention_mask: [1, 128] shape, int32 dtype
    • Binary mask indicating valid tokens (1) vs padding (0)

Output:

  • Logits: [1, 128, vocab_size] shape, float32 dtype
    • Language model logits over vocabulary
    • Contains special speech tokens <|speech_###|> where ### represents discrete audio codes
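Given the fixed [1, 128] input shape, the tokenized prompt has to be padded (or truncated) with a matching attention mask. A minimal sketch, assuming a pad token id of 0 (the real value depends on the tokenizer config):

```swift
import Foundation

// Pad/truncate token ids to the backbone's fixed length and build the
// attention mask (1 = real token, 0 = padding).
func makeBackboneInputs(tokenIDs: [Int32],
                        maxLength: Int = 128,
                        padTokenID: Int32 = 0) -> (inputIDs: [Int32], attentionMask: [Int32]) {
    let kept = Array(tokenIDs.prefix(maxLength))
    let padCount = maxLength - kept.count
    let inputIDs = kept + Array(repeating: padTokenID, count: padCount)
    let mask = Array(repeating: Int32(1), count: kept.count)
             + Array(repeating: Int32(0), count: padCount)
    return (inputIDs, mask)
}
```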

2.2 Encoder Model (neucodec-encoder)

Input:

  • audio: [1, 1, 16000] shape, float32 dtype
    • Mono audio waveform at 16 kHz sample rate
    • Must be exactly 16000 samples (1 second of audio)
    • Values normalized to [-1.0, 1.0] range

Output:

  • codes: [1, 1, 50] shape, int32 dtype
    • Discrete audio codes representing the reference audio
    • Used for voice cloning (prosody and timbre transfer)
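Preparing a reference clip for this input shape can be sketched as follows. This assumes the caller has already resampled to 16 kHz mono; the function only clamps values to [-1.0, 1.0] and pads/truncates to exactly 16000 samples:

```swift
import Foundation

// Shape a mono 16 kHz clip into the encoder's fixed [1, 1, 16000] input.
func prepareEncoderAudio(_ samples: [Float], sampleCount: Int = 16_000) -> [Float] {
    var clip = samples.map { min(max($0, -1.0), 1.0) }  // clamp to valid range
    if clip.count > sampleCount {
        clip = Array(clip.prefix(sampleCount))           // truncate long clips
    } else if clip.count < sampleCount {
        clip += Array(repeating: 0, count: sampleCount - clip.count)  // zero-pad
    }
    return clip
}
```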

2.3 Decoder Model (neucodec-decoder)

Input:

  • codes: [1, 1, 50] shape, int64 dtype
    • Discrete audio codes from backbone output
    • Must be padded/truncated to exactly 50 codes

Output:

  • audio: [1, 1, 24000] shape, float32 dtype
    • Raw PCM audio waveform at 24 kHz sample rate
    • Values in range [-1.0, 1.0]
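Note the dtype change across the pipeline: the encoder emits int32 codes while the decoder takes int64. A sketch of the pad/truncate-to-50 step (the pad code of 0 is an assumption; the real padding value depends on the codec vocabulary):

```swift
import Foundation

// Align codes with the decoder's fixed [1, 1, 50] int64 input.
func prepareDecoderCodes(_ codes: [Int32], codeCount: Int = 50, padCode: Int64 = 0) -> [Int64] {
    var out = codes.prefix(codeCount).map { Int64($0) }  // truncate and widen int32 -> int64
    if out.count < codeCount {
        out += Array(repeating: padCode, count: codeCount - out.count)  // pad short sequences
    }
    return out
}
```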

3. Phonemization with espeak-ng

Initial Approach:

  • Attempted to use Hugging Face's transformers library tokenizer directly
  • Considered using ZeticMLange to deploy tokenizer as a model (not suitable for text processing)

Phonemization Process:

let phonemes = EspeakPhonemizer.shared.phonemize("Hello world")
// Output: "həlˈoʊ wˈɜːld"

This mirrors the implementation in neutts: https://github.com/neuphonic/neutts/blob/main/neutts/neutts.py#L11

The phonemes are then tokenized and fed into the backbone model.
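Going the other way, the backbone's generated tokens carry the discrete audio codes inside `<|speech_###|>` markers. A sketch of extracting those codes from detokenized output (the token strings are assumptions based on the format described above; non-speech tokens are ignored):

```swift
import Foundation

// Pull discrete audio codes out of generated <|speech_###|> tokens.
func extractSpeechCodes(from tokens: [String]) -> [Int32] {
    tokens.compactMap { token in
        guard token.hasPrefix("<|speech_"), token.hasSuffix("|>") else { return nil }
        let digits = token.dropFirst("<|speech_".count).dropLast("|>".count)
        return Int32(digits)  // nil (dropped) if the middle isn't a number
    }
}
```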

Build Settings:

  • Add libespeak-ng.a to "Link Binary With Libraries"
  • Add -DESPEAK_AVAILABLE to "Other Swift Flags"
  • Set header search path to espeak-ng source directory

@jathinsn27 jathinsn27 marked this pull request as draft January 24, 2026 14:29
@jathinsn27 jathinsn27 changed the title from "Add ne" to "Add neutts_nano_tts" Jan 24, 2026
