---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- text-generation
pretty_name: UltraChat 200k
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train_sft
    num_bytes: 1397058554
    num_examples: 207865
  - name: test_sft
    num_bytes: 154695659
    num_examples: 23110
  - name: train_gen
    num_bytes: 1347396812
    num_examples: 256032
  - name: test_gen
    num_bytes: 148276089
    num_examples: 28304
  download_size: 1624049723
  dataset_size: 3047427114
---

# Dataset Card for UltraChat 200k

## Dataset Description

This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model.

The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

- Selection of a subset of the data for faster supervised fine-tuning.
- Truecasing of the dataset, as we observed around 5% of the data contained grammatical errors like "Hello. how are you?" instead of "Hello. How are you?"
- Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either (a sketch of this kind of filter appears below).
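
For illustration, here is a minimal Python sketch of the last filtering step. The phrase list and substring matching are assumptions for illustration only, not the exact rules used to build UltraChat 200k:

```python
# Hypothetical filter: drop dialogues whose assistant turns contain
# canned "no emotions/opinions" disclaimers. Phrase list is illustrative.
DISCLAIMER_PHRASES = (
    "i do not have emotions",
    "i don't have emotions",
    "i do not have opinions",
    "i don't have opinions",
)

def keep_dialogue(example: dict) -> bool:
    """Return False if any assistant turn contains a disclaimer phrase."""
    for message in example["messages"]:
        if message["role"] != "assistant":
            continue
        text = message["content"].lower()
        if any(phrase in text for phrase in DISCLAIMER_PHRASES):
            return False
    return True

# Usage with a 🤗 datasets object: dataset = dataset.filter(keep_dialogue)
```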

## Dataset Structure

The dataset has four splits, suitable for:

- Supervised fine-tuning (`sft`).
- Generation ranking (`gen`) via techniques like rejection sampling or PPO.

The number of examples per split is shown below:

| train_sft | test_sft | train_gen | test_gen |
|:---------:|:--------:|:---------:|:--------:|
| 207865    | 23110    | 256032    | 28304    |
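
For reference, a typical way to load one of these splits with the 🤗 `datasets` library; the repo id `HuggingFaceH4/ultrachat_200k` is assumed to be where this dataset is hosted on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load only the SFT training split (assumed repo id shown below).
train_sft = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

print(train_sft.num_rows)          # 207865
print(train_sft[0]["prompt_id"])   # hex id of the first example
```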

The dataset is stored in Parquet format, with each entry using the following schema:


```json
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages": [
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details. ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```
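
When preparing the `messages` field for supervised fine-tuning, the list of turns is usually rendered into a single string with a chat template. A minimal sketch, assuming the `transformers` library; Zephyr-7B-β's tokenizer is used purely as an illustration, and any tokenizer that ships a chat template works:

```python
from transformers import AutoTokenizer

# Illustrative choice: the tokenizer of the model trained on this dataset.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "user", "content": "Create a fully-developed protagonist ..."},
    {"role": "assistant", "content": "Name: Ava\n\nAva was just 16 ..."},
]

# Render the list of turns into a single training-ready string.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```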

## Citation

If you find this dataset useful in your work, please cite the original UltraChat dataset:

```
@misc{ding2023enhancing,
      title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
      author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
      year={2023},
      eprint={2305.14233},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```