Self Checks
Cloud or Self Hosted
Self Hosted (Source)
Environment Details
python 3.10
Steps to Reproduce
Description
Thank you for open-sourcing Fish Audio S2.
I am trying to reproduce the objective Seed-TTS-Eval results reported in the README / technical report for the open-source fishaudio/s2-pro model, but I am seeing a clear gap from the reported numbers.
This question is about objective evaluation only (Seed-TTS-Eval WER/CER), not subjective evaluation or online service quality.
Reported results
From the README / technical report, I understand the reported Seed-TTS-Eval results are:
- Chinese (CER): 0.54
- English (WER): 0.99
My reproduced results
Using local inference with the open-source model, I currently get:
- Seed-TTS-Eval English: WER = 1.8227
- Seed-TTS-Eval Chinese: CER = 1.0320
So both EN and ZH are significantly worse than the reported numbers.
How my inference path is derived
My local inference code is not an independent reimplementation from scratch. It is a thin wrapper built on top of the official open-source inference path in the fish-speech repository.
Specifically, my code is adapted from the public inference functions in fish_speech.models.text2semantic.inference and uses the official functions directly, including:
- init_model(...)
- load_codec_model(...)
- encode_audio(...)
- generate_long(...)
- decode_to_audio(...)
So the core generation logic still comes from the official open-source repo; my main change is wrapping these functions into a batch evaluation pipeline for Seed-TTS-Eval.
My local inference setup
I am not using the cloud API or online service. I am using the open-source model locally.
The local inference pipeline is roughly:
- Load fishaudio/s2-pro
- Load the codec checkpoint
- For voice cloning, pass dataset-provided prompt_audio and prompt_text
- Encode the reference audio using encode_audio(...)
- Generate semantic tokens with generate_long(...)
- Decode audio with decode_to_audio(...)
In other words, my inference path is derived from the official local open-source inference functions, but wrapped into an offline benchmark pipeline.
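For concreteness, a minimal sketch of how my wrapper chains these calls is below. Only the function names come from the official module listed above; the argument names, return values, checkpoint path, dataset field names, and sample rate are my own illustrative assumptions, not the authoritative signatures.

```python
# Minimal sketch of my offline benchmark wrapper. Only the imported function
# names match the official module; argument names and return values here are
# illustrative assumptions about the API, not the authoritative signatures.
from fish_speech.models.text2semantic.inference import (
    init_model,
    load_codec_model,
    encode_audio,
    generate_long,
    decode_to_audio,
)

llm = init_model("fishaudio/s2-pro", device="cuda")                # text2semantic model
codec = load_codec_model("checkpoints/codec.pth", device="cuda")   # codec checkpoint (placeholder path)


def synthesize(text, prompt_audio, prompt_text, **decoding_kwargs):
    """Clone the dataset-provided prompt voice and synthesize `text`."""
    prompt_tokens = encode_audio(codec, prompt_audio)        # encode reference audio
    semantic_tokens = generate_long(                         # autoregressive semantic tokens
        model=llm,
        text=text,
        prompt_tokens=prompt_tokens,
        prompt_text=prompt_text,
        **decoding_kwargs,
    )
    return decode_to_audio(codec, semantic_tokens)           # decode back to a waveform


# Offline batch loop over Seed-TTS-Eval entries (field names are assumptions):
# for item in seed_tts_eval_entries:
#     wav = synthesize(item["text"], item["prompt_audio"], item["prompt_text"], **decoding_kwargs)
#     soundfile.write(f"out/{item['id']}.wav", wav, samplerate=24000)  # sample rate is a placeholder
```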
Decoding parameters
My current parameters are:
max_new_tokens = 0
top_p = 0.9
top_k = 30
temperature = 1.0
chunk_length = 300
iterative_prompt = True
These are not intended as arbitrary settings; they come from my local benchmark wrapper built around the official inference path.
I also tried a more deterministic setup with temperature = 0.01, but this still did not close the gap meaningfully.
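In dictionary form, the two configurations my wrapper forwards to generate_long(...) are the following (the dict names are my own):

```python
# Default sampling configuration used for the numbers reported above.
decoding_kwargs = dict(
    max_new_tokens=0,
    top_p=0.9,
    top_k=30,
    temperature=1.0,
    chunk_length=300,
    iterative_prompt=True,
)

# Near-deterministic variant I also tried; it did not close the gap meaningfully.
deterministic_kwargs = {**decoding_kwargs, "temperature": 0.01}
```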
Evaluation setup
For objective evaluation, I used a local evaluation pipeline matching the standard Seed-TTS-Eval setup as closely as possible:
- English ASR: whisper-large-v3
- Chinese ASR: Paraformer
- Dataset: Seed-TTS-Eval EN / ZH
- Reference audio and prompt text are taken from the dataset entries
The evaluation completes successfully with no inference/eval failures.
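For completeness, the scoring step is roughly as sketched below. The library choices (openai-whisper for English WER, FunASR's Paraformer for Chinese CER, jiwer for the metrics) and the minimal text normalization are my local assumptions; the official seed-tts-eval scripts may normalize differently.

```python
# Rough sketch of my local objective scoring (library choices and normalization
# are mine; the official seed-tts-eval scripts may differ in details).
import re
import string

import whisper                   # openai-whisper, for English ASR (large-v3)
from funasr import AutoModel     # FunASR, for Chinese ASR (Paraformer)
from jiwer import wer, cer

whisper_model = whisper.load_model("large-v3")
paraformer = AutoModel(model="paraformer-zh")


def normalize(text, lang):
    # Minimal normalization: lowercase, strip ASCII punctuation, drop spaces for Chinese.
    text = text.lower().strip()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    if lang == "zh":
        text = text.replace(" ", "")
    return text


def score(wav_path, ref_text, lang):
    """Return WER for English or CER for Chinese for one synthesized utterance."""
    if lang == "en":
        hyp = whisper_model.transcribe(wav_path, language="en")["text"]
        return wer(normalize(ref_text, "en"), normalize(hyp, "en"))
    hyp = paraformer.generate(input=wav_path)[0]["text"]
    return cer(normalize(ref_text, "zh"), normalize(hyp, "zh"))
```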
My question
Could you please clarify what exact setup was used to obtain the reported objective Seed-TTS-Eval numbers for the open-source model?
In particular:
- Were the reported Seed-TTS-Eval numbers obtained with the open-source local model only, or did they depend on any internal / online-service-only frontend?
- For objective evaluation, is generate_long(...) the correct open-source inference path to reproduce the reported benchmark?
- Were any additional steps applied before inference, such as:
  - text normalization / text frontend
  - prompt filtering / prompt selection
  - special decoding settings
- Are the reported numbers expected to be reproducible directly from the public weights and public inference code, or do they rely on additional internal evaluation setup details not yet documented?
I would really appreciate any clarification. I can run the benchmark successfully with the public model and public inference path, but I cannot reproduce the paper-level objective numbers.
✔️ Expected Behavior
No response
❌ Actual Behavior
No response