
Plans to release the remaining training data? #20

@IcyFeather233


Hi, thanks for the great work and for open-sourcing Uni-CoT!

I noticed that the two released datasets (UniCoT-Self-Reflection-6K and UniCoT-Breakdown-3K) cover the core CoT reasoning data described in the paper. However, the paper also mentions several additional data sources used during training, including:

~114K text-to-image generation samples
~68K samples from Echo-4o
~46K samples from ShareGPT-4o-image (generation)
~46K image editing samples from ShareGPT-4o-image
~100K samples from LLaVA-OV OneVision Stage training data
~3K geography reasoning CoT data (based on GeoPose3K)

I'm wondering if there are any plans to release these datasets as well? They would be very helpful for reproducing the full training pipeline and for further research.

Also, the released datasets (6K + 3K) appear to be smaller than the ~31K CoT samples mentioned in the paper (~11K macro-level + ~20K micro-level). Could you clarify whether the released versions are subsets, and if so, what criteria were used for filtering?

Thanks again for your contributions!
