---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- text-generation
pretty_name: UltraChat 200k
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train_sft
    num_bytes: 1397058554
    num_examples: 207865
  - name: test_sft
    num_bytes: 154695659
    num_examples: 23110
  - name: train_gen
    num_bytes: 1347396812
    num_examples: 256032
  - name: test_gen
    num_bytes: 148276089
    num_examples: 28304
  download_size: 1624049723
  dataset_size: 3047427114
---
|
# Dataset Card for UltraChat 200k

## Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model.

The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
- Selection of a subset of data for faster supervised fine-tuning.
- Truecasing of the dataset, as we observed around 5% of the data contained casing errors like "Hello. how are you?" instead of "Hello. How are you?".
- Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either (see the sketch after this list).
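
The filtering code itself is not published with this card, so the following is only a minimal sketch of what the disclaimer-removal step could look like; the phrase list and the `keep_dialogue` helper are hypothetical:

```python
# Illustrative only: the actual filter and phrase list used for UltraChat 200k
# are not published with this card.
DISCLAIMER_PHRASES = ("i do not have emotions", "i don't have opinions")

def keep_dialogue(messages):
    """Return False for dialogues whose assistant turns contain canned disclaimers."""
    for message in messages:
        if message["role"] == "assistant":
            lowered = message["content"].lower()
            if any(phrase in lowered for phrase in DISCLAIMER_PHRASES):
                return False
    return True
```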
## Dataset Structure
The dataset has four splits, suitable for:

- Supervised fine-tuning (`sft`).
- Generation ranking (`gen`) via techniques like rejection sampling or PPO.
The number of examples per split is shown as follows:

| train_sft | test_sft | train_gen | test_gen |
|:---------:|:--------:|:---------:|:--------:|
| 207865    | 23110    | 256032    | 28304    |
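
For reference, each split can be loaded individually with the 🤗 `datasets` library. A minimal sketch, assuming the dataset is hosted on the Hub under `HuggingFaceH4/ultrachat_200k`:

```python
from datasets import load_dataset

# Assumed Hub repo id; adjust if the dataset is hosted elsewhere.
train_sft = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

print(train_sft.num_rows)         # 207865, matching the table above
print(train_sft[0]["prompt_id"])  # 64-character hex id for the prompt
```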
The dataset is stored in Parquet format, with each entry using the following schema:
```json
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages": [
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details. ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```
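
Since each `messages` list uses the common `role`/`content` turn format, it can be flattened into a single training string with a tokenizer chat template. A minimal sketch, assuming the `transformers` library and using the Zephyr tokenizer purely as an example:

```python
from transformers import AutoTokenizer

# Any tokenizer that ships a chat template works; Zephyr-7B-β's is one example.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

def to_text(example):
    # Render the role/content turns into one formatted string for SFT.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

# Example usage on the split loaded earlier:
# train_sft = train_sft.map(to_text)
```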
## Citation

If you find this dataset useful in your work, please cite the original UltraChat dataset:
```
@misc{ding2023enhancing,
    title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
    author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
    year={2023},
    eprint={2305.14233},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```