This is a heavily filtered version of the [UltraChat](https://github.com/thunlp/UltraChat) dataset and was used to train [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a state of the art 7b chat model.
The original datasets consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create `UltraChat 200k`, we applied the following logic:
- Selection of a subset of data for faster supervised fine tuning.
- Truecasing of the dataset, as we observed around 5% of the data contained grammatical errors like "Hello. how are you?" instead of "Hello. How are you?"
- Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either.
The dataset is stored in parquet format with each entry using the following schema:
```
{
"prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
"messages":[
{
"content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
"role": "user"
},
{
"content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
"role": "assistant"
},
{
"content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details. ...",
"role": "user"
},
{
"content": "Certainly! ....",
"role": "assistant"
},
{
"content": "That's really interesting! I would love to hear more...",