---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- text-generation
pretty_name: UltraChat 200k
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train_sft
    num_bytes: 1397058554
    num_examples: 207865
  - name: test_sft
    num_bytes: 154695659
    num_examples: 23110
  - name: train_gen
    num_bytes: 1347396812
    num_examples: 256032
  - name: test_gen
    num_bytes: 148276089
    num_examples: 28304
  download_size: 1624049723
  dataset_size: 3047427114
---
# Dataset Card for UltraChat 200k

## Dataset Description

This is a heavily filtered version of the [UltraChat](https://github.com/thunlp/UltraChat) dataset and was used to train [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a state-of-the-art 7B chat model.

The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create `UltraChat 200k`, we applied the following steps:

- Selection of a subset of the data for faster supervised fine-tuning.
- Truecasing of the dataset, as we observed that around 5% of the data contained capitalization errors like "Hello. how are you?" instead of "Hello. How are you?"
- Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either.

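
The truecasing and removal steps can be illustrated with a minimal Python sketch. It is not the pipeline that produced this dataset: the regex-based `truecase` heuristic, the `UNWANTED_PHRASES` list (seeded only with the two phrases quoted above), and all function names are assumptions made for illustration.

```python
import re

# Hypothetical phrase list, seeded with the two examples quoted in this card.
UNWANTED_PHRASES = ("i do not have emotions", "i don't have opinions")

# A lowercase letter at the start of the text, or after ., ! or ? plus whitespace.
_SENTENCE_START = re.compile(r"(^|[.!?]\s+)([a-z])")


def truecase(text: str) -> str:
    """Capitalize the first letter of each sentence (naive heuristic)."""
    return _SENTENCE_START.sub(lambda m: m.group(1) + m.group(2).upper(), text)


def has_unwanted_reply(messages: list[dict]) -> bool:
    """Return True if any assistant turn contains an unwanted phrase."""
    return any(
        phrase in message["content"].lower()
        for message in messages
        if message["role"] == "assistant"
        for phrase in UNWANTED_PHRASES
    )


def clean(dialogues: list[list[dict]]) -> list[list[dict]]:
    """Drop dialogues with unwanted replies, then truecase every turn."""
    return [
        [{"role": m["role"], "content": truecase(m["content"])} for m in dialogue]
        for dialogue in dialogues
        if not has_unwanted_reply(dialogue)
    ]
```

A heuristic this simple will also capitalize after abbreviations such as "e.g.", so a real truecasing step would likely need a more careful approach.
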
## Dataset Structure

The dataset has four splits, suitable for:

* Supervised fine-tuning (`sft`).
* Generation ranking (`gen`) via techniques like rejection sampling or PPO.

The number of examples per split is as follows:

| train_sft | test_sft | train_gen | test_gen |
|:---------:|:--------:|:---------:|:--------:|
|  207865   |  23110   |  256032   |  28304   |

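
A split can be loaded with the Hugging Face `datasets` library. The sketch below assumes the dataset is published on the Hub under the repository ID `HuggingFaceH4/ultrachat_200k` (this ID is not stated in the card itself) and that `datasets` is installed. `load_split` is a hypothetical helper, and its size check simply compares against the table above.

```python
# Example counts copied from the table above.
SPLIT_SIZES = {
    "train_sft": 207865,
    "test_sft": 23110,
    "train_gen": 256032,
    "test_gen": 28304,
}

# Assumed Hub repository ID; adjust it if the dataset lives elsewhere.
REPO_ID = "HuggingFaceH4/ultrachat_200k"


def load_split(split: str):
    """Load one split and check its size against the table."""
    if split not in SPLIT_SIZES:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(SPLIT_SIZES)}"
        )
    # Imported lazily so this module can be imported without `datasets`.
    from datasets import load_dataset

    dataset = load_dataset(REPO_ID, split=split)
    if len(dataset) != SPLIT_SIZES[split]:
        raise RuntimeError(
            f"{split}: expected {SPLIT_SIZES[split]} rows, got {len(dataset)}"
        )
    return dataset
```

Calling `load_split("train_sft")` downloads the data on first use and returns a dataset whose rows expose the `prompt`, `prompt_id`, and `messages` fields.
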
The dataset is stored in Parquet format, with each entry using the following schema:

```
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages": [
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details. ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```
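
To make the schema concrete, the following sketch checks a single record against the structure shown above. The conventions it enforces (roles alternate starting with `user`, and `prompt` repeats the first message) are inferred from the example entry and are assumptions, not documented guarantees. `validate_record` is a hypothetical helper.

```python
def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for key in ("prompt", "prompt_id"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key!r} must be a string")

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("'messages' must be a non-empty list")
        return problems

    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: must be an object")
            continue
        # Assumed convention: user and assistant turns alternate, user first.
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            problems.append(f"message {index}: expected role {expected_role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index}: 'content' must be a string")

    # Assumed convention: the prompt repeats the first user message.
    first = messages[0]
    if isinstance(first, dict) and first.get("content") != record.get("prompt"):
        problems.append("'prompt' differs from the first message")
    return problems
```

An empty return value means the record matches the assumed conventions; otherwise each string describes one mismatch.
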

## Citation

If you find this dataset useful in your work, please cite the original UltraChat dataset:

```
@misc{ding2023enhancing,
      title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
      author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
      year={2023},
      eprint={2305.14233},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```