ultrachat_200k/README.md
2023-10-27 08:53:22 +00:00

language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- conversational
- text-generation
pretty_name: UltraChat 200k
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train_sft
    num_bytes: 1397058554
    num_examples: 207865
  - name: test_sft
    num_bytes: 154695659
    num_examples: 23110
  - name: train_gen
    num_bytes: 1347396812
    num_examples: 256032
  - name: test_gen
    num_bytes: 148276089
    num_examples: 28304
  download_size: 1624049723
  dataset_size: 3047427114

Dataset Card for UltraChat 200k

Dataset Description

This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model.

The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create UltraChat 200k, we applied the following steps:

  • Selection of a subset of data for faster supervised fine-tuning.
  • Truecasing of the dataset, as we observed that around 5% of the data contained capitalization errors such as "Hello. how are you?" instead of "Hello. How are you?".
  • Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either.
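The truecasing and removal steps above can be sketched roughly as follows. This is a minimal illustration and not the pipeline actually used to build the dataset. The names `REFUSAL_PHRASES`, `truecase`, `has_refusal`, and `clean` are hypothetical, the phrase list contains only the two examples quoted above, and the regex-based truecasing is a simplification that only capitalizes sentence-initial letters.

```python
import re

# Hypothetical phrase list: only the two examples quoted in the card.
REFUSAL_PHRASES = ("i do not have emotions", "i don't have opinions")


def truecase(text: str) -> str:
    """Capitalize a lowercase letter at the start of the text or after '.', '!' or '?'.

    Simplified assumption: this will also capitalize after abbreviations
    such as "e.g. ", which a real truecaser would need to handle.
    """
    def upper(match: re.Match) -> str:
        return match.group(1) + match.group(2).upper()

    return re.sub(r"(^|[.!?]\s+)([a-z])", upper, text)


def has_refusal(messages: list[dict]) -> bool:
    """Return True if any assistant turn contains one of the listed phrases."""
    return any(
        m["role"] == "assistant"
        and any(phrase in m["content"].lower() for phrase in REFUSAL_PHRASES)
        for m in messages
    )


def clean(dialogues: list[list[dict]]) -> list[list[dict]]:
    """Drop dialogues containing a refusal phrase, then truecase the remaining turns."""
    kept = []
    for messages in dialogues:
        if has_refusal(messages):
            continue
        kept.append([{**m, "content": truecase(m["content"])} for m in messages])
    return kept
```

Each dialogue here is a list of `{"content": ..., "role": ...}` dictionaries, matching the `messages` field of the dataset.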

Dataset Structure

The dataset has four splits, suitable for:

  • Supervised fine-tuning (sft).
  • Generation ranking (gen) via techniques like rejection sampling or PPO.

The number of examples per split is as follows:

train_sft    test_sft    train_gen    test_gen
207865       23110       256032       28304
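As a quick sanity check on these counts, the short sketch below computes the share of examples held out for testing in each pair of splits. The `SPLITS` dictionary and the `held_out_fraction` helper are illustrative names, and the numbers are copied from the table above.

```python
# Example counts copied from the split table in this card.
SPLITS = {
    "train_sft": 207865,
    "test_sft": 23110,
    "train_gen": 256032,
    "test_gen": 28304,
}


def held_out_fraction(kind: str) -> float:
    """Fraction of the `kind` ("sft" or "gen") examples that sit in the test split."""
    train, test = SPLITS[f"train_{kind}"], SPLITS[f"test_{kind}"]
    return test / (train + test)


for kind in ("sft", "gen"):
    total = SPLITS[f"train_{kind}"] + SPLITS[f"test_{kind}"]
    print(f"{kind}: {total} examples, {held_out_fraction(kind):.1%} in the test split")
```

For both `sft` and `gen`, the test split works out to roughly 10% of the examples.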

The dataset is stored in parquet format with each entry using the following schema:


{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details.  ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
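One way to check that a record follows this schema is sketched below. The `validate_record` helper is hypothetical. It also encodes two assumptions drawn only from the example above, not from any documented guarantee: that turns alternate between `user` and `assistant` starting with `user`, and that the first message repeats the `prompt` field.

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record; an empty list means none were found."""
    problems = [
        f"missing field: {key}"
        for key in ("prompt", "prompt_id", "messages")
        if key not in record
    ]
    if problems:
        return problems

    messages = record["messages"]
    if not messages:
        return ["messages is empty"]

    for i, message in enumerate(messages):
        # Assumption: even-numbered turns are the user, odd-numbered the assistant.
        expected = "user" if i % 2 == 0 else "assistant"
        if message.get("role") != expected:
            problems.append(
                f"turn {i}: expected role {expected!r}, got {message.get('role')!r}"
            )
        if not isinstance(message.get("content"), str):
            problems.append(f"turn {i}: content is not a string")

    # Assumption: the first user message is identical to the prompt field.
    if messages[0].get("content") != record["prompt"]:
        problems.append("first message does not match prompt")
    return problems
```

Calling `validate_record` on each row of a split and collecting the non-empty results gives a quick overview of any rows that deviate from the expected shape.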

Citation

If you find this dataset useful in your work, please cite the original UltraChat dataset:

@misc{ding2023enhancing,
      title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations}, 
      author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
      year={2023},
      eprint={2305.14233},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

You may also wish to cite the Zephyr 7B technical report:

@misc{tunstall2023zephyr,
      title={Zephyr: Direct Distillation of LM Alignment}, 
      author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
      year={2023},
      eprint={2310.16944},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}