---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- text-generation
pretty_name: UltraChat 200k
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train_sft
    num_bytes: 1397058554
    num_examples: 207865
  - name: test_sft
    num_bytes: 154695659
    num_examples: 23110
  - name: train_gen
    num_bytes: 1347396812
    num_examples: 256032
  - name: test_gen
    num_bytes: 148276089
    num_examples: 28304
  download_size: 1624049723
  dataset_size: 3047427114
---

# Dataset Card for UltraChat 200k

## Dataset Description

This is a heavily filtered version of the [UltraChat](https://github.com/thunlp/UltraChat) dataset and was used to train [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a state-of-the-art 7B chat model.

The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create `UltraChat 200k`, we applied the following steps:

- Selection of a subset of the data for faster supervised fine-tuning.
- Truecasing of the dataset, as we observed that around 5% of the data contained capitalization errors like "Hello. how are you?" instead of "Hello. How are you?"
- Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either.
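
The truecasing and removal steps above might look roughly like the following. This is only a minimal sketch under stated assumptions, not the pipeline actually used to build the dataset: the regex heuristic, the `REFUSAL_PHRASES` list (limited to the two phrases quoted above), and the helper names `truecase`, `has_refusal`, and `clean` are all hypothetical.

```python
import re

# Assumption: only the two phrases quoted in this card; the real list is not published here.
REFUSAL_PHRASES = ("i do not have emotions", "i don't have opinions")


def truecase(text):
    """Naively capitalize the first letter of the text and of each new sentence."""
    # Capitalize a lowercase letter at the very start (after optional whitespace).
    text = re.sub(r"^(\s*)([a-z])", lambda m: m.group(1) + m.group(2).upper(), text)
    # Capitalize a lowercase letter that follows '.', '!' or '?' plus whitespace.
    return re.sub(r"([.!?]\s+)([a-z])", lambda m: m.group(1) + m.group(2).upper(), text)


def has_refusal(messages):
    """Return True if any assistant message contains one of the refusal phrases."""
    return any(
        m["role"] == "assistant"
        and any(phrase in m["content"].lower() for phrase in REFUSAL_PHRASES)
        for m in messages
    )


def clean(dialogues):
    """Drop dialogues with refusal phrases and truecase the remaining messages."""
    kept = []
    for messages in dialogues:
        if has_refusal(messages):
            continue
        kept.append(
            [{"role": m["role"], "content": truecase(m["content"])} for m in messages]
        )
    return kept
```

A heuristic this simple also capitalizes after abbreviations such as "e.g.", so a real pipeline would likely need something more careful.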

## Dataset Structure

The dataset has four splits, suitable for:

* Supervised fine-tuning (`sft`).
* Generation ranking (`gen`) via techniques like rejection sampling or PPO.

The number of examples per split is as follows:


|  train_sft | test_sft  | train_gen | test_gen |
|:-------:|:-----------:|:-----:| :-----:|
|  207865 |       23110 | 256032 | 28304 |
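
As a sketch of how a single split could be loaded with the `datasets` library: the repository ID `HuggingFaceH4/ultrachat_200k` is an assumption (this card does not state it), and `load_split` and `EXPECTED_SIZES` are hypothetical names introduced for illustration. The sizes are copied from the table above.

```python
import warnings

# Split sizes copied from the table in this card.
EXPECTED_SIZES = {
    "train_sft": 207865,
    "test_sft": 23110,
    "train_gen": 256032,
    "test_gen": 28304,
}


def load_split(split, repo_id="HuggingFaceH4/ultrachat_200k"):
    """Load one split from the Hub and compare its size with the table.

    `repo_id` is an assumed default; pass the actual repository ID if it differs.
    """
    if split not in EXPECTED_SIZES:
        raise ValueError(
            f"unknown split {split!r}; expected one of {sorted(EXPECTED_SIZES)}"
        )
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split)
    if len(dataset) != EXPECTED_SIZES[split]:
        # The hosted data may have changed since this card was written.
        warnings.warn(
            f"{split} has {len(dataset)} rows, card lists {EXPECTED_SIZES[split]}"
        )
    return dataset
```

For supervised fine-tuning one would call, for example, `load_split("train_sft")` and `load_split("test_sft")`; this downloads data from the Hub, so it needs network access.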

The dataset is stored in Parquet format, with each entry using the following schema:
```

{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details.  ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```
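
To illustrate how the `messages` list can be consumed, here is a minimal sketch that pairs each user message with the assistant reply that directly follows it. The helper name `turn_pairs` is hypothetical, and the code assumes only the `role` and `content` keys shown in the schema above; a trailing user message without a reply is simply skipped.

```python
from typing import Dict, Iterator, List, Tuple


def turn_pairs(record: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (user_message, assistant_reply) pairs in dialogue order."""
    messages: List[Dict[str, str]] = record["messages"]
    # Walk over adjacent message pairs and keep user -> assistant transitions.
    for current, following in zip(messages, messages[1:]):
        if current["role"] == "user" and following["role"] == "assistant":
            yield current["content"], following["content"]


# Toy record with the same shape as the schema above (contents are made up).
example = {
    "prompt": "q1",
    "prompt_id": "0",
    "messages": [
        {"content": "q1", "role": "user"},
        {"content": "a1", "role": "assistant"},
        {"content": "q2", "role": "user"},
        {"content": "a2", "role": "assistant"},
        {"content": "q3", "role": "user"},
    ],
}
pairs = list(turn_pairs(example))
```

Here `pairs` holds two tuples, one per completed exchange; the final unanswered user message `"q3"` is not included.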

## Citation

If you find this dataset useful in your work, please cite the original UltraChat dataset:

```
@misc{ding2023enhancing,
      title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations}, 
      author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
      year={2023},
      eprint={2305.14233},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```