---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- conversational
- text-generation
pretty_name: UltraChat200k
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train_sft
    num_bytes: 1397058554
    num_examples: 207865
  - name: test_sft
    num_bytes: 154695659
    num_examples: 23110
  - name: train_gen
    num_bytes: 1347396812
    num_examples: 256032
  - name: test_gen
    num_bytes: 148276089
    num_examples: 28304
  download_size: 1624049723
  dataset_size: 3047427114
---

# Dataset Card for UltraChat200k

## Dataset Description

This is a pre-processed supervised fine-tuning (SFT) dataset used to train Zephyr-7b-beta, a state-of-the-art 7B chat model.

The Zephyr-beta model is the best-in-class 7B model on three well-known benchmarks:
- [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench) - A multi-turn question set that uses GPT-4 as a judge.
- [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) - A fast, cheap, and reliable LLM-based automatic evaluation that tests the ability of models to follow general user instructions.
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) - A leaderboard that aims to track, rank, and evaluate open LLMs and chatbots.

You can learn more about the techniques used to train Zephyr in the [Hugging Face Alignment Handbook](https://github.com/huggingface/alignment-handbook).


The base dataset is [UltraChat](https://github.com/thunlp/UltraChat): an open-source, large-scale, and multi-round dialogue dataset.

The dataset contains:
- 🌏 **Questions about the World**: The dialogue data in this sector is derived from a wide range of inquiries related to concepts, entities, and objects from the real world. The topics covered are extensive, spanning areas such as technology, art, and entrepreneurship.
- ✍🏻 **Writing and Creation**: The dialogue data in this sector is driven by requests for writing or creation from scratch, and encompasses any task an AI assistant may help with in the creative process, from email composition to crafting narratives and plays, and beyond.
- 📋 **Assistance on Existent Materials**: The dialogue data in this sector is generated based on existing materials, including but not limited to rewriting, continuation, summarization, and inference, covering a diverse range of topics.

The following preprocessing was applied:
- Selection of a subset of the data for faster supervised fine-tuning.
- Truecasing of the dataset, as we observed that around 5% of the data contained grammatical errors.
- Removal of dialogues in which the assistant replies with phrases such as "I do not have emotions" or "I don't have opinions".
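
The phrase-based removal step can be pictured with a short sketch. This is not the original preprocessing code: the case-insensitive substring match, the function names, and the two-phrase list (taken only from the examples quoted above) are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Only the two phrases quoted in this card; the full list used upstream is not given here.
UNWANTED_PHRASES = ("I do not have emotions", "I don't have opinions")


def has_unwanted_reply(messages: List[Dict[str, str]],
                       phrases: Iterable[str] = UNWANTED_PHRASES) -> bool:
    """Return True if any assistant turn contains one of the phrases.

    Assumption: matching is a case-insensitive substring test.
    """
    lowered = [phrase.lower() for phrase in phrases]
    for message in messages:
        # User turns are ignored; only assistant replies are checked.
        if message["role"] != "assistant":
            continue
        content = message["content"].lower()
        if any(phrase in content for phrase in lowered):
            return True
    return False


def filter_dialogues(examples: List[dict]) -> List[dict]:
    """Keep only dialogues without an unwanted assistant reply."""
    return [ex for ex in examples if not has_unwanted_reply(ex["messages"])]
```

Because the whole dialogue is dropped when a single assistant turn matches, the remaining conversations stay intact rather than having turns cut out of the middle.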

## Dataset Structure

The dataset contains four splits:
- `train_sft` - 207,865 examples
- `test_sft` - 23,110 examples
- `train_gen` - 256,032 examples
- `test_gen` - 28,304 examples
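
As a rough sketch, a single split could be loaded with the `datasets` library as shown below. The split names and sizes are copied from the card metadata. The repository id `HuggingFaceH4/ultrachat_200k` and the helper `load_split` are assumptions for illustration and are not stated in this card.

```python
# Split names and example counts, copied from the card metadata.
SPLIT_SIZES = {
    "train_sft": 207_865,
    "test_sft": 23_110,
    "train_gen": 256_032,
    "test_gen": 28_304,
}

# Assumption: the Hub repository id is not stated in this card.
REPO_ID = "HuggingFaceH4/ultrachat_200k"


def load_split(split: str, repo_id: str = REPO_ID):
    """Load one split by name (hypothetical helper, requires network access)."""
    if split not in SPLIT_SIZES:
        raise ValueError(f"unknown split {split!r}; expected one of {sorted(SPLIT_SIZES)}")
    # Imported lazily so the constants above are usable without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(repo_id, split=split)
```

For example, `load_split("train_sft")` should return a dataset whose length equals `SPLIT_SIZES["train_sft"]`, provided the hosted files match the metadata.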

The dataset is stored in Parquet format, and each entry uses the following schema:
```
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details.  ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```
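
A small check like the following can confirm that a row has the shape shown above. It is a sketch under assumptions inferred from the sample only: that turns alternate between `user` and `assistant` starting with `user`, and that `prompt` repeats the first user message. Neither property is guaranteed by this card for every row, and `validate_example` is a hypothetical helper.

```python
from typing import List


def validate_example(example: dict) -> List[str]:
    """Return a list of problems; an empty list means the row matches the schema."""
    problems = [f"missing key: {key}"
                for key in ("prompt", "prompt_id", "messages")
                if key not in example]
    if problems:
        return problems

    messages = example["messages"]
    if not messages:
        return ["messages is empty"]

    for index, message in enumerate(messages):
        # Assumption from the sample: even turns are user, odd turns are assistant.
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            problems.append(f"turn {index}: expected role {expected_role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"turn {index}: content is not a string")

    # Assumption from the sample: the prompt repeats the first user message.
    if messages[0].get("content") != example["prompt"]:
        problems.append("prompt differs from the first user message")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to summarize schema issues across many rows.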