---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- text-generation
pretty_name: UltraChat 200k
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train_sft
    num_bytes: 1397058554
    num_examples: 207865
  - name: test_sft
    num_bytes: 154695659
    num_examples: 23110
  - name: train_gen
    num_bytes: 1347396812
    num_examples: 256032
  - name: test_gen
    num_bytes: 148276089
    num_examples: 28304
  download_size: 1624049723
  dataset_size: 3047427114
---
|
# Dataset Card for UltraChat 200k

## Dataset Description
This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model.

The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:
- Selection of a subset of data for faster supervised fine-tuning.
- Truecasing of the dataset, as we observed around 5% of the data contained casing errors like "Hello. how are you?" instead of "Hello. How are you?".
- Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either (see the sketch after this list).
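
The filtering code itself is not published with this card, so the following is only a minimal sketch of what the disclaimer-removal step could look like; the phrase list and the `keep_dialogue` helper are hypothetical:

```python
# Illustrative only: the actual filter and phrase list used for UltraChat 200k
# are not published with this card.
DISCLAIMER_PHRASES = ("i do not have emotions", "i don't have opinions")

def keep_dialogue(messages):
    """Return False for dialogues whose assistant turns contain canned disclaimers."""
    for message in messages:
        if message["role"] == "assistant":
            lowered = message["content"].lower()
            if any(phrase in lowered for phrase in DISCLAIMER_PHRASES):
                return False
    return True
```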
## Dataset Structure
The dataset has four splits, suitable for:

- Supervised fine-tuning (`sft`).
- Generation ranking (`gen`) via techniques like rejection sampling or PPO.
The number of examples per split is shown as follows:

| train_sft | test_sft | train_gen | test_gen |
|:---------:|:--------:|:---------:|:--------:|
| 207865    | 23110    | 256032    | 28304    |
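
For reference, each split can be loaded individually with the 🤗 `datasets` library. A minimal sketch, assuming the dataset is hosted on the Hub under `HuggingFaceH4/ultrachat_200k`:

```python
from datasets import load_dataset

# Assumed Hub repo id; adjust if the dataset is hosted elsewhere.
train_sft = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

print(train_sft.num_rows)         # 207865, matching the table above
print(train_sft[0]["prompt_id"])  # 64-character hex id for the prompt
```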
The dataset is stored in Parquet format, with each entry using the following schema:
```json
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages": [
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details. ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```
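
Since each `messages` list uses the common `role`/`content` turn format, it can be flattened into a single training string with a tokenizer chat template. A minimal sketch, assuming the `transformers` library and using the Zephyr tokenizer purely as an example:

```python
from transformers import AutoTokenizer

# Any tokenizer that ships a chat template works; Zephyr-7B-β's is one example.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

def to_text(example):
    # Render the role/content turns into one formatted string for SFT.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

# Example usage on the split loaded earlier:
# train_sft = train_sft.map(to_text)
```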
## Citation

If you find this dataset useful in your work, please cite the original UltraChat dataset:
```
@misc{ding2023enhancing,
    title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
    author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
    year={2023},
    eprint={2305.14233},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```