ultrachat_200k/README.md
2023-10-24 08:27:00 +00:00

license: mit
task_categories:
- conversational
- text-generation
language:
- en
size_categories:
- 100K<n<1M
pretty_name: ZephyrIFT
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: test
    num_bytes: 154695659
    num_examples: 23110
  - name: train
    num_bytes: 1397058554
    num_examples: 207865
  download_size: 813207030
  dataset_size: 1551754213

Dataset Card for UltraChat 200k

Dataset Description

This is a pre-processed Instruction Fine-Tuning dataset used for training the Zephyr-7b-beta model.

The base dataset is UltraChat: an open-source, large-scale, and multi-round dialogue dataset.

The dataset contains:

  • 🌏 Questions about the World: The dialogue data in this sector is derived from a wide range of inquiries related to concepts, entities, and objects from the real world. The topics covered are extensive, spanning areas such as technology, art, and entrepreneurship.
  • ✍🏻 Writing and Creation: The dialogue data in this sector is driven by the demands for writing/creation from scratch, and encompasses any tasks that an AI assistant may aid within the creative process, spanning from email composition to crafting narratives and plays, and beyond.
  • 📋 Assistance on Existent Materials: The dialogue data in this sector is generated based on existing materials, including but not limited to rewriting, continuation, summarization, and inference, covering a diverse range of topics.

The following preprocessing was applied:

  • Selection of a subset of the data for faster supervised fine-tuning.
  • Truecasing of the dataset, as we observed that around 5% of the data contained grammatical errors.
  • Removal of dialogues where the assistant replies with phrases such as "I do not have emotions" or "I don't have opinions" (to be confirmed after experiments).
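
The third step can be illustrated with a minimal sketch. It is not the pipeline actually used for this dataset: the phrase list contains only the two examples quoted above, and the case-insensitive substring match, along with the names `REFUSAL_PHRASES`, `has_refusal`, and `filter_dialogues`, are assumptions made for illustration.

```python
# Hypothetical phrase list: the card quotes only these two examples,
# so the real list used for filtering is unknown.
REFUSAL_PHRASES = (
    "i do not have emotions",
    "i don't have opinions",
)


def has_refusal(messages):
    """Return True if any assistant turn contains one of the listed phrases."""
    for message in messages:
        if message["role"] != "assistant":
            continue  # only assistant replies are inspected
        text = message["content"].lower()
        if any(phrase in text for phrase in REFUSAL_PHRASES):
            return True
    return False


def filter_dialogues(records):
    """Keep only the records whose assistant turns contain no listed phrase."""
    return [r for r in records if not has_refusal(r["messages"])]


# Two made-up records that follow the dataset schema.
records = [
    {"prompt_id": "a", "messages": [
        {"role": "user", "content": "How do you feel today?"},
        {"role": "assistant", "content": "As an AI, I do not have emotions."},
    ]},
    {"prompt_id": "b", "messages": [
        {"role": "user", "content": "Summarize this paragraph."},
        {"role": "assistant", "content": "Here is a short summary."},
    ]},
]

kept = filter_dialogues(records)
print([r["prompt_id"] for r in kept])  # ['b']
```

The whole dialogue is dropped rather than only the offending turn, which matches the wording "removal of dialogues" above; how the original preprocessing handled partial matches is not stated.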

Dataset Structure

The dataset is stored in Parquet format, with each entry using the following schema:


{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages":[
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details.  ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
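
A record with this shape can be sanity-checked with a few lines of Python. The sketch below assumes two properties that are visible in the example above but are not stated as guarantees: that `prompt` equals the content of the first user message, and that roles strictly alternate between `user` and `assistant`. The function name `check_record` and the sample record are hypothetical.

```python
def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = [
        f"missing field: {key}"
        for key in ("prompt", "prompt_id", "messages")
        if key not in record
    ]
    if problems:
        return problems

    messages = record["messages"]
    if not messages:
        return ["messages is empty"]

    # Assumption: turns alternate, starting with the user.
    expected_roles = ("user", "assistant")
    for index, message in enumerate(messages):
        expected = expected_roles[index % 2]
        if message.get("role") != expected:
            problems.append(f"turn {index}: expected role {expected!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"turn {index}: content is not a string")

    # Assumption: the prompt repeats the first user message.
    if messages[0].get("content") != record["prompt"]:
        problems.append("prompt differs from the first message")

    return problems


# Made-up record for illustration only.
sample = {
    "prompt": "Write a haiku about rain.",
    "prompt_id": "example-id",
    "messages": [
        {"content": "Write a haiku about rain.", "role": "user"},
        {"content": "Soft rain on the roof ...", "role": "assistant"},
    ],
}

print(check_record(sample))  # []
```

Returning a list of problems rather than raising on the first one makes it easier to survey a whole split and count how often each assumption fails.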


### Citation Information

```bibtex
@misc{ZephyrIFT,
  author = {Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Alexander M. Rush and Thomas Wolf},
  title = {ZephyrIFT},
  year = {2023},
  publisher = {HuggingFace Hub},
  journal = {HuggingFace Hub repository},
  howpublished = {\url{https://huggingface.co/datasets/HuggingFaceH4/zephyr_ift_public}},
}
```