HuggingFaceH4/ultrachat_200k

Lewis Tunstall b7fe606ecd Upload README.md with huggingface_hub

2023-10-26 20:14:18 +00:00


language:
license: mit
size_categories: 100K<n<1M
task_categories:
- conversational
- text-generation
pretty_name: UltraChat200k
configs:
- config_name: default
  data_files:
  - split: train_sft
    path: data/train_sft-*
  - split: test_sft
    path: data/test_sft-*
  - split: train_gen
    path: data/train_gen-*
  - split: test_gen
    path: data/test_gen-*
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: train_sft
    num_bytes: 1397058554
    num_examples: 207865
  - name: test_sft
    num_bytes: 154695659
    num_examples: 23110
  - name: train_gen
    num_bytes: 1347396812
    num_examples: 256032
  - name: test_gen
    num_bytes: 148276089
    num_examples: 28304
  download_size: 1624049723
  dataset_size: 3047427114

Dataset Card for UltraChat200k

Dataset Description

This is a pre-processed supervised fine-tuning (SFT) dataset used to train Zephyr-7b-beta, a state-of-the-art 7B chat model.

The Zephyr-beta model is the best-in-class 7B model on three well-known benchmarks:

MT Bench - A multi-turn question set that uses GPT-4 as a judge.
AlpacaEval - An LLM-based automatic evaluation that is fast, cheap, and reliable, and that tests the ability of models to follow general user instructions.
Open LLM Leaderboard - A leaderboard that aims to track, rank, and evaluate open LLMs and chatbots.

You can learn more about the techniques used to train Zephyr in the Hugging Face Alignment Handbook.

The base dataset is UltraChat: an open-source, large-scale, and multi-round dialogue dataset.

The dataset contains:

🌏 Questions about the World: The dialogue data in this sector is derived from a wide range of inquiries related to concepts, entities, and objects from the real world. The topics covered are extensive, spanning areas such as technology, art, and entrepreneurship.
✍🏻 Writing and Creation: The dialogue data in this sector is driven by the demands for writing/creation from scratch, and encompasses any tasks that an AI assistant may aid within the creative process, spanning from email composition to crafting narratives and plays, and beyond.
📋 Assistance on Existent Materials: The dialogue data in this sector is generated based on existing materials, including but not limited to rewriting, continuation, summarization, and inference, covering a diverse range of topics.

The following preprocessing was applied:

Selection of a subset of the data for faster supervised fine-tuning.
Truecasing of the dataset, as we observed that around 5% of the data contained grammatical errors.
Removal of dialogues where the assistant replies with phrases such as "I do not have emotions" or "I don't have opinions".
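The removal step can be illustrated with a minimal sketch. It is not the authors' actual preprocessing code: the phrase list contains only the two phrases quoted above, and the case-insensitive substring match and the helper names `has_unwanted_reply` and `filter_dialogues` are assumptions made for illustration.

```python
# Hypothetical phrase list: only the two phrases quoted in the card.
UNWANTED_PHRASES = ("I do not have emotions", "I don't have opinions")


def has_unwanted_reply(messages, phrases=UNWANTED_PHRASES):
    """Return True if any assistant turn contains one of the phrases.

    Matching is a case-insensitive substring test (an assumption; the
    card does not say how the phrases were matched).
    """
    lowered = [p.lower() for p in phrases]
    return any(
        m["role"] == "assistant" and any(p in m["content"].lower() for p in lowered)
        for m in messages
    )


def filter_dialogues(examples):
    """Keep only dialogues without an unwanted assistant reply."""
    return [ex for ex in examples if not has_unwanted_reply(ex["messages"])]


# Two toy dialogues in the card's schema (made-up content).
examples = [
    {"prompt_id": "a", "messages": [
        {"role": "user", "content": "How do you feel today?"},
        {"role": "assistant", "content": "As an AI, I do not have emotions."},
    ]},
    {"prompt_id": "b", "messages": [
        {"role": "user", "content": "Summarize this paragraph."},
        {"role": "assistant", "content": "Here is a short summary."},
    ]},
]

kept = filter_dialogues(examples)
print([ex["prompt_id"] for ex in kept])  # ['b']
```

Only assistant turns are inspected, so a user message that happens to contain one of the phrases does not cause the dialogue to be dropped.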

Dataset Structure

The metadata lists four splits:

train_sft - 207,865 examples
test_sft - 23,110 examples
train_gen - 256,032 examples
test_gen - 28,304 examples
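As a rough sketch, the SFT splits could be loaded with the Hugging Face `datasets` library as shown here. The split names `train_sft` and `test_sft` and the row counts come from the card's metadata. The helper names `load_sft_splits` and `held_out_fraction` are hypothetical, and the loader needs network access and the `datasets` package, so it is only defined here, not called.

```python
REPO_ID = "HuggingFaceH4/ultrachat_200k"

# Row counts as listed in the card's metadata.
EXPECTED_SFT_ROWS = {"train_sft": 207_865, "test_sft": 23_110}


def load_sft_splits(repo_id=REPO_ID):
    """Download the two SFT splits and return them keyed by split name.

    The import is done lazily so this module can be imported without
    the `datasets` package installed.
    """
    from datasets import load_dataset

    return {name: load_dataset(repo_id, split=name) for name in EXPECTED_SFT_ROWS}


def held_out_fraction(rows=EXPECTED_SFT_ROWS):
    """Share of SFT examples that fall in the test split."""
    return rows["test_sft"] / (rows["train_sft"] + rows["test_sft"])


print(f"held-out fraction: {held_out_fraction():.3f}")
```

Calling `load_sft_splits()` and comparing `len(...)` of each returned split against `EXPECTED_SFT_ROWS` is a quick sanity check that the download matches the card.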

The dataset is stored in Parquet format, with each entry using the following schema:


{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details.  ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
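A minimal sketch of a structural check for one entry follows. The field names come from the schema above. The rules that turns alternate between `user` and `assistant` and that the first message repeats `prompt` are assumptions read off the single example, not documented guarantees, and `validate_example` is a hypothetical helper name.

```python
def validate_example(example):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = [
        f"missing field: {key}"
        for key in ("prompt", "prompt_id", "messages")
        if key not in example
    ]
    if problems:
        return problems

    messages = example["messages"]
    if not messages:
        return ["messages is empty"]

    for i, msg in enumerate(messages):
        if set(msg) != {"content", "role"}:
            problems.append(f"message {i}: expected exactly 'content' and 'role'")
            continue
        # Assumption: the user speaks first and turns strictly alternate.
        expected_role = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected_role:
            problems.append(f"message {i}: expected role {expected_role!r}")

    # Assumption: the first message repeats the prompt, as in the example.
    if messages[0].get("content") != example["prompt"]:
        problems.append("first message does not repeat the prompt")
    return problems


# Toy entry with made-up content in the card's schema.
sample = {
    "prompt": "Write a haiku about rain.",
    "prompt_id": "example-id",
    "messages": [
        {"content": "Write a haiku about rain.", "role": "user"},
        {"content": "Soft rain on the roof...", "role": "assistant"},
    ],
}
print(validate_example(sample))  # []
```

Returning a list of problems, not raising on the first one, makes it easier to summarize how many entries violate each assumption when scanning a whole split.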
