
Plans to release the remaining training data? #20

@IcyFeather233


Hi, thanks for the great work and for open-sourcing Uni-CoT!

I noticed that the two released datasets (UniCoT-Self-Reflection-6K and UniCoT-Breakdown-3K) cover the core CoT reasoning data described in the paper. However, the paper also mentions several additional data sources used during training, including:

~114K text-to-image generation samples
~68K samples from Echo-4o
~46K samples from ShareGPT-4o-image (generation)
~46K image editing samples from ShareGPT-4o-image
~100K samples from LLaVA-OV OneVision Stage training data
~3K geography reasoning CoT data (based on GeoPose3K)

I'm wondering if there are any plans to release these datasets as well? They would be very helpful for reproducing the full training pipeline and for further research.

Also, the released datasets (6K + 3K) appear to be smaller than the ~31K CoT samples mentioned in the paper (~11K macro-level + ~20K micro-level). Could you clarify whether the released versions are subsets, and if so, what criteria were used for filtering?

Thanks again for your contributions!
