ultrachat_200k/README.md
2023-10-26 20:14:18 +00:00

---
language:
- en
license: mit
size_categories:
- 100K<n<1M
task_categories:
- conversational
- text-generation
pretty_name: UltraChat200k
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train_sft
    num_bytes: 1397058554
    num_examples: 207865
  - name: test_sft
    num_bytes: 154695659
    num_examples: 23110
  - name: train_gen
    num_bytes: 1347396812
    num_examples: 256032
  - name: test_gen
    num_bytes: 148276089
    num_examples: 28304
  download_size: 1624049723
  dataset_size: 3047427114
---

Dataset Card for UltraChat200k

Dataset Description

This is a pre-processed supervised fine-tuning (SFT) dataset used for training Zephyr-7b-beta, a state-of-the-art 7B chat model.

The Zephyr-beta model is the best-in-class 7B model on three well-known benchmarks:

  • MT Bench - A multi-turn question set that uses GPT-4 as a judge.
  • AlpacaEval - An LLM-based automatic evaluation that is fast, cheap, and reliable, and that tests the ability of models to follow general user instructions.
  • Open LLM Leaderboard - A leaderboard that aims to track, rank, and evaluate open LLMs and chatbots.

You can learn more about the techniques used to train Zephyr in the Hugging Face Alignment Handbook.

The base dataset is UltraChat: an open-source, large-scale, and multi-round dialogue dataset.

The dataset contains:

  • 🌏 Questions about the World: The dialogue data in this sector is derived from a wide range of inquiries related to concepts, entities, and objects from the real world. The topics covered are extensive, spanning areas such as technology, art, and entrepreneurship.
  • ✍🏻 Writing and Creation: The dialogue data in this sector is driven by the demands for writing/creation from scratch, and encompasses any tasks that an AI assistant may aid within the creative process, spanning from email composition to crafting narratives and plays, and beyond.
  • 📋 Assistance on Existent Materials: The dialogue data in this sector is generated based on existing materials, including but not limited to rewriting, continuation, summarization, and inference, covering a diverse range of topics.

The following preprocessing was applied:

  • Selection of a subset of data for faster supervised fine-tuning.
  • Truecasing of the dataset, as we observed that around 5% of the data contained grammatical errors.
  • Removal of dialogues where the assistant replies with phrases such as "I do not have emotions" or "I don't have opinions".
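
The removal step can be illustrated with a minimal sketch. This is not the code used to build the dataset: the marker list contains only the two phrases quoted above, the case-insensitive substring match is an assumption, and the names `REFUSAL_MARKERS`, `has_refusal`, and `filter_dialogues` are hypothetical.

```python
# Hypothetical marker phrases: only the two examples quoted in this card.
REFUSAL_MARKERS = ("I do not have emotions", "I don't have opinions")


def has_refusal(messages, markers=REFUSAL_MARKERS):
    """Return True if any assistant turn contains one of the marker phrases."""
    for message in messages:
        if message["role"] != "assistant":
            continue
        text = message["content"].lower()
        # Assumption: a case-insensitive substring match is good enough here.
        if any(marker.lower() in text for marker in markers):
            return True
    return False


def filter_dialogues(dialogues):
    """Keep only dialogues whose assistant turns contain no marker phrase."""
    return [d for d in dialogues if not has_refusal(d["messages"])]


sample = [
    {"messages": [
        {"role": "user", "content": "How do you feel today?"},
        {"role": "assistant", "content": "As an AI, I do not have emotions."},
    ]},
    {"messages": [
        {"role": "user", "content": "Name a primary color."},
        {"role": "assistant", "content": "Red is a primary color."},
    ]},
]

kept = filter_dialogues(sample)
print(f"kept {len(kept)} of {len(sample)} dialogues")
```

Only assistant turns are inspected, so a user message that happens to quote one of the phrases does not cause the dialogue to be dropped.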

Dataset Structure

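The dataset metadata lists four splits. Two are used for supervised fine-tuning:

  • train_sft - 207,865 examples
  • test_sft - 23,110 examples

The other two are train_gen (256,032 examples) and test_gen (28,304 examples).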
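
The split sizes can be cross-checked against the metadata at the top of this card, where the SFT splits are named train_sft and test_sft. The sketch below copies those figures by hand (nothing is downloaded) and checks that the per-split byte counts add up to the reported dataset_size; the variable names are illustrative only.

```python
# Figures copied from the metadata block at the top of this card.
splits = {
    "train_sft": {"num_bytes": 1397058554, "num_examples": 207865},
    "test_sft": {"num_bytes": 154695659, "num_examples": 23110},
    "train_gen": {"num_bytes": 1347396812, "num_examples": 256032},
    "test_gen": {"num_bytes": 148276089, "num_examples": 28304},
}
dataset_size = 3047427114  # reported dataset_size in bytes

# The per-split byte counts should sum to the reported dataset size.
total_bytes = sum(s["num_bytes"] for s in splits.values())

# Share of the SFT examples held out for testing.
sft_total = splits["train_sft"]["num_examples"] + splits["test_sft"]["num_examples"]
test_share = splits["test_sft"]["num_examples"] / sft_total

print("bytes match:", total_bytes == dataset_size)
print("SFT examples:", sft_total)
print(f"SFT test share: {test_share:.1%}")
```

With these figures the SFT test split is roughly one tenth of the SFT examples.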

The dataset is stored in parquet format with each entry using the following schema:


{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details.  ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
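
As a rough illustration of the schema above, the following sketch checks a single record in plain Python. The function name `validate_record` is hypothetical, and two of the checks are assumptions read off the example rather than documented guarantees: that turns alternate between user and assistant starting with the user, and that the first user message repeats the prompt.

```python
def validate_record(record):
    """Return a list of problems found in one record; an empty list means it passed."""
    problems = []

    # "prompt" and "prompt_id" are plain strings in the schema.
    for key in ("prompt", "prompt_id"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key} must be a string")

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages must be a non-empty list")
        return problems

    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"messages[{i}] must be an object")
            continue
        if not isinstance(message.get("content"), str):
            problems.append(f"messages[{i}].content must be a string")
        # Assumption: turns alternate, starting with the user.
        expected = "user" if i % 2 == 0 else "assistant"
        if message.get("role") != expected:
            problems.append(f"messages[{i}].role should be {expected!r}")

    # Assumption: the first user turn repeats the prompt, as in the example.
    first = messages[0]
    if isinstance(first, dict) and first.get("content") != record.get("prompt"):
        problems.append("first message does not match prompt")

    return problems


example = {
    "prompt": "Name a primary color.",
    "prompt_id": "0" * 64,  # placeholder id, not a real one from the dataset
    "messages": [
        {"content": "Name a primary color.", "role": "user"},
        {"content": "Red is a primary color.", "role": "assistant"},
    ],
}

print(validate_record(example))
```

Returning a list of problems instead of raising on the first one makes it easier to see everything wrong with a malformed record in a single pass.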