
Cannot reproduce reported Seed-TTS-Eval results for s2-pro #1268

@RongNanZi

Description

Self Checks

  • This template is only for bug reports. For questions, please visit Discussions.
  • I have thoroughly reviewed the project documentation (installation, training, inference) but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please submit issues in English, or they will be closed. Thank you! :)
  • Please do not modify this template and fill in all required fields.

Cloud or Self Hosted

Self Hosted (Source)

Environment Details

python 3.10

Steps to Reproduce

Description

Thank you for open-sourcing Fish Audio S2.

I am trying to reproduce the objective Seed-TTS-Eval results reported in the README / technical report for the open-source fishaudio/s2-pro model, but I am seeing a clear gap from the reported numbers.

This question is about objective evaluation only (Seed-TTS-Eval WER/CER), not subjective evaluation or online service quality.

Reported results

From the README / technical report, I understand the reported Seed-TTS-Eval results are:

  • Chinese (CER): 0.54
  • English (WER): 0.99

My reproduced results

Using local inference with the open-source model, I currently get:

  • Seed-TTS-Eval English: WER = 1.8227
  • Seed-TTS-Eval Chinese: CER = 1.0320

So both EN and ZH are significantly worse than the reported numbers.

How my inference path is derived

My local inference code is not an independent reimplementation from scratch. It is a thin wrapper built on top of the official open-source inference path in the fish-speech repository.

Specifically, my code is adapted from the public inference functions in:

  • fish_speech.models.text2semantic.inference

and uses the official functions directly, including:

  • init_model(...)
  • load_codec_model(...)
  • encode_audio(...)
  • generate_long(...)
  • decode_to_audio(...)

So the core generation logic comes directly from the official open-source repo; my main change is wrapping these functions into a batch evaluation pipeline for Seed-TTS-Eval.

My local inference setup

I am not using the cloud API or online service. I am using the open-source model locally.

The local inference pipeline is roughly:

  1. Load fishaudio/s2-pro
  2. Load the codec checkpoint
  3. For voice cloning, pass dataset-provided prompt_audio and prompt_text
  4. Encode the reference audio using encode_audio(...)
  5. Generate semantic tokens with generate_long(...)
  6. Decode audio with decode_to_audio(...)

In other words, my inference path is derived from the official local open-source inference functions, but wrapped into an offline benchmark pipeline.
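Concretely, the per-item synthesis step looks roughly like the sketch below. The import path and call signatures are simplified approximations of my wrapper (I am not claiming these are the exact upstream signatures); the real code just forwards the dataset fields into the official functions.

```python
# Rough sketch of my per-item synthesis wrapper around the official functions.
# Call signatures are approximate; the real code only forwards the dataset
# fields (target text, prompt_audio, prompt_text) into the official path.
# `llama` and `codec` are loaded once via init_model(...) / load_codec_model(...).
from pathlib import Path

import torchaudio
from fish_speech.models.text2semantic.inference import (
    encode_audio,
    generate_long,
    decode_to_audio,
)

def synthesize_one(llama, codec, item: dict, out_dir: Path, sampling_kwargs: dict):
    # 1) Encode the dataset-provided reference audio into prompt tokens.
    prompt_tokens = encode_audio(codec, item["prompt_audio"])

    # 2) Generate semantic tokens for the target text, conditioned on the prompt.
    codes = generate_long(
        model=llama,
        text=item["text"],
        prompt_text=item["prompt_text"],
        prompt_tokens=prompt_tokens,
        **sampling_kwargs,  # see the decoding parameters listed below
    )

    # 3) Decode the semantic tokens back to a waveform and save it for ASR scoring.
    wav, sample_rate = decode_to_audio(codec, codes)
    torchaudio.save(str(out_dir / f'{item["id"]}.wav'), wav, sample_rate)
```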

Decoding parameters

My current parameters are:

  • max_new_tokens = 0
  • top_p = 0.9
  • top_k = 30
  • temperature = 1.0
  • chunk_length = 300
  • iterative_prompt = True

These are not arbitrary choices; they are the values my benchmark wrapper passes straight through to the official inference path.

I also tried a more deterministic setup with:

  • temperature = 0.01

but this still did not close the gap meaningfully.
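As code, the two configurations are simply the following (keyword names mirror the lists above and are passed into `generate_long(...)` as shown in the sketch earlier):

```python
# Sampling configuration used for the reproduced numbers above.
SAMPLING_KWARGS = dict(
    max_new_tokens=0,     # no explicit cap on generated tokens
    top_p=0.9,
    top_k=30,
    temperature=1.0,
    chunk_length=300,
    iterative_prompt=True,
)

# Near-deterministic variant I also tried; it did not meaningfully close the gap.
DETERMINISTIC_KWARGS = {**SAMPLING_KWARGS, "temperature": 0.01}
```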

Evaluation setup

For objective evaluation, I used a local evaluation pipeline matching the standard Seed-TTS-Eval setup as closely as possible:

  • English ASR: whisper-large-v3
  • Chinese ASR: Paraformer
  • Dataset: Seed-TTS-Eval EN / ZH
  • Reference audio and prompt text are taken from the dataset entries

The evaluation completes successfully with no inference/eval failures.
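To make the scoring step concrete, this is roughly how the English WER is computed from the synthesized wavs (a minimal sketch using openai-whisper and jiwer; the Chinese side is analogous with Paraformer instead of Whisper). The metadata file name and layout here are conventions of my own pipeline, not part of Seed-TTS-Eval itself.

```python
# Minimal sketch of the English scoring step: transcribe each synthesized wav
# with whisper-large-v3 and compute corpus-level WER against the target text.
# The metadata format ({"id": ..., "text": ...} per line) is my own convention.
import json
from pathlib import Path

import jiwer
import whisper

asr = whisper.load_model("large-v3")

refs, hyps = [], []
for line in Path("seedtts_en_meta.jsonl").read_text().splitlines():
    item = json.loads(line)
    wav_path = Path("outputs/en") / f'{item["id"]}.wav'
    result = asr.transcribe(str(wav_path), language="en")
    refs.append(item["text"])
    hyps.append(result["text"])

# Note: the amount of text normalization applied before scoring (case,
# punctuation, numbers) can shift WER noticeably; my pipeline follows the
# public Seed-TTS-Eval scripts as closely as I can.
print("WER (%):", 100 * jiwer.wer(refs, hyps))
```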

My question

Could you please clarify what exact setup was used to obtain the reported objective Seed-TTS-Eval numbers for the open-source model?

In particular:

  1. Were the reported Seed-TTS-Eval numbers obtained with the open-source local model only, or did they depend on any internal / online-service-only frontend?
  2. For objective evaluation, is generate_long(...) the correct open-source inference path to reproduce the reported benchmark?
  3. Were any additional steps applied before inference, such as:
    • text normalization / text frontend
    • prompt filtering / prompt selection
    • special decoding settings
  4. Are the reported numbers expected to be reproducible directly from the public weights and public inference code, or do they rely on additional internal evaluation setup details not yet documented?

I would really appreciate any clarification. I can run the benchmark successfully with the public model and public inference path, but I cannot reproduce the paper-level objective numbers.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response
