
Cannot reproduce reported Seed-TTS-Eval results for s2-pro #1268

@RongNanZi

Description

Self Checks

  • This template is only for bug reports. For questions, please visit Discussions.
  • I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please submit issues in English, or they will be closed. Thank you! :)
  • Please do not modify this template and fill in all required fields.

Cloud or Self Hosted

Self Hosted (Source)

Environment Details

python 3.10

Steps to Reproduce

Description

Thank you for open-sourcing Fish Audio S2.

I am trying to reproduce the objective Seed-TTS-Eval results reported in the README / technical report for the open-source fishaudio/s2-pro model, but I am seeing a clear gap from the reported numbers.

This question is about objective evaluation only (Seed-TTS-Eval WER/CER), not subjective evaluation or online service quality.

Reported results

From the README / technical report, I understand the reported Seed-TTS-Eval results are:

  • Chinese (CER): 0.54
  • English (WER): 0.99

My reproduced results

Using local inference with the open-source model, I currently get:

  • Seed-TTS-Eval English: WER = 1.8227
  • Seed-TTS-Eval Chinese: CER = 1.0320

So both EN and ZH are significantly worse than the reported numbers.

How my inference path is derived

My local inference code is not an independent reimplementation from scratch. It is a thin wrapper built on top of the official open-source inference path in the fish-speech repository.

Specifically, my code is adapted from the public inference functions in:

  • fish_speech.models.text2semantic.inference

and uses the official functions directly, including:

  • init_model(...)
  • load_codec_model(...)
  • encode_audio(...)
  • generate_long(...)
  • decode_to_audio(...)

So the core generation logic comes directly from the official open-source repo; my main change is wrapping these functions into a batch evaluation pipeline for Seed-TTS-Eval.

My local inference setup

I am not using the cloud API or online service. I am using the open-source model locally.

The local inference pipeline is roughly:

  1. Load fishaudio/s2-pro
  2. Load the codec checkpoint
  3. For voice cloning, pass dataset-provided prompt_audio and prompt_text
  4. Encode the reference audio using encode_audio(...)
  5. Generate semantic tokens with generate_long(...)
  6. Decode audio with decode_to_audio(...)

In other words, my inference path is derived from the official local open-source inference functions, but wrapped into an offline benchmark pipeline.
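Concretely, the per-item synthesis step looks roughly like the sketch below. The import path and call signatures are simplified approximations of my wrapper (I am not claiming these are the exact upstream signatures); the real code just forwards the dataset fields into the official functions.

```python
# Rough sketch of my per-item synthesis wrapper around the official functions.
# Call signatures are approximate; the real code only forwards the dataset
# fields (target text, prompt_audio, prompt_text) into the official path.
# `llama` and `codec` are loaded once via init_model(...) / load_codec_model(...).
from pathlib import Path

import torchaudio
from fish_speech.models.text2semantic.inference import (
    encode_audio,
    generate_long,
    decode_to_audio,
)

def synthesize_one(llama, codec, item: dict, out_dir: Path, sampling_kwargs: dict):
    # 1) Encode the dataset-provided reference audio into prompt tokens.
    prompt_tokens = encode_audio(codec, item["prompt_audio"])

    # 2) Generate semantic tokens for the target text, conditioned on the prompt.
    codes = generate_long(
        model=llama,
        text=item["text"],
        prompt_text=item["prompt_text"],
        prompt_tokens=prompt_tokens,
        **sampling_kwargs,  # see the decoding parameters listed below
    )

    # 3) Decode the semantic tokens back to a waveform and save it for ASR scoring.
    wav, sample_rate = decode_to_audio(codec, codes)
    torchaudio.save(str(out_dir / f'{item["id"]}.wav'), wav, sample_rate)
```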

Decoding parameters

My current parameters are:

  • max_new_tokens = 0
  • top_p = 0.9
  • top_k = 30
  • temperature = 1.0
  • chunk_length = 300
  • iterative_prompt = True

These are not arbitrary choices; they are the values my benchmark wrapper passes straight through to the official inference path.

I also tried a more deterministic setup with:

  • temperature = 0.01

but this still did not close the gap meaningfully.
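As code, the two configurations are simply the following (keyword names mirror the lists above and are passed into `generate_long(...)` as shown in the sketch earlier):

```python
# Sampling configuration used for the reproduced numbers above.
SAMPLING_KWARGS = dict(
    max_new_tokens=0,     # no explicit cap on generated tokens
    top_p=0.9,
    top_k=30,
    temperature=1.0,
    chunk_length=300,
    iterative_prompt=True,
)

# Near-deterministic variant I also tried; it did not meaningfully close the gap.
DETERMINISTIC_KWARGS = {**SAMPLING_KWARGS, "temperature": 0.01}
```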

Evaluation setup

For objective evaluation, I used a local evaluation pipeline matching the standard Seed-TTS-Eval setup as closely as possible:

  • English ASR: whisper-large-v3
  • Chinese ASR: Paraformer
  • Dataset: Seed-TTS-Eval EN / ZH
  • Reference audio and prompt text are taken from the dataset entries

The evaluation completes successfully with no inference/eval failures.
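To make the scoring step concrete, this is roughly how the English WER is computed from the synthesized wavs (a minimal sketch using openai-whisper and jiwer; the Chinese side is analogous with Paraformer instead of Whisper). The metadata file name and layout here are conventions of my own pipeline, not part of Seed-TTS-Eval itself.

```python
# Minimal sketch of the English scoring step: transcribe each synthesized wav
# with whisper-large-v3 and compute corpus-level WER against the target text.
# The metadata format ({"id": ..., "text": ...} per line) is my own convention.
import json
from pathlib import Path

import jiwer
import whisper

asr = whisper.load_model("large-v3")

refs, hyps = [], []
for line in Path("seedtts_en_meta.jsonl").read_text().splitlines():
    item = json.loads(line)
    wav_path = Path("outputs/en") / f'{item["id"]}.wav'
    result = asr.transcribe(str(wav_path), language="en")
    refs.append(item["text"])
    hyps.append(result["text"])

# Note: the amount of text normalization applied before scoring (case,
# punctuation, numbers) can shift WER noticeably; my pipeline follows the
# public Seed-TTS-Eval scripts as closely as I can.
print("WER (%):", 100 * jiwer.wer(refs, hyps))
```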

My question

Could you please clarify what exact setup was used to obtain the reported objective Seed-TTS-Eval numbers for the open-source model?

In particular:

  1. Were the reported Seed-TTS-Eval numbers obtained with the open-source local model only, or did they depend on any internal / online-service-only frontend?
  2. For objective evaluation, is generate_long(...) the correct open-source inference path to reproduce the reported benchmark?
  3. Were any additional steps applied before inference, such as:
    • text normalization / text frontend
    • prompt filtering / prompt selection
    • special decoding settings
  4. Are the reported numbers expected to be reproducible directly from the public weights and public inference code, or do they rely on additional internal evaluation setup details not yet documented?

I would really appreciate any clarification. I can run the benchmark successfully with the public model and public inference path, but I cannot reproduce the paper-level objective numbers.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response
