---
license: mit
task_categories:
- conversational
- text-generation
language:
- en
size_categories:
- 100K<n<1M
pretty_name: ZephyrIFT
dataset_info:
  features:
  - name: prompt
    dtype: string
  - name: prompt_id
    dtype: string
  - name: messages
    list:
    - name: content
      dtype: string
    - name: role
      dtype: string
  splits:
  - name: test
    num_bytes: 154695659
    num_examples: 23110
  - name: train
    num_bytes: 1397058554
    num_examples: 207865
  download_size: 813207030
  dataset_size: 1551754213
---

# Dataset Card for ZephyrIFT

## Dataset Description

This is a pre-processed supervised fine-tuning (SFT) dataset used to train Zephyr-7b-beta, a state-of-the-art 7B chat model.
The Zephyr-beta model is the best-in-class 7B model on three well-known benchmarks:
- [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench) - A multi-turn question set that uses GPT-4 as a judge.
- [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/) - An LLM-based automatic evaluation that is fast, cheap, and reliable, and that tests the ability of models to follow general user instructions.
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) - A leaderboard that aims to track, rank, and evaluate open LLMs and chatbots.

The base dataset is [UltraChat](https://github.com/thunlp/UltraChat), an open-source, large-scale, multi-round dialogue dataset.

The dataset contains:
- 🌏 **Questions about the World**: The dialogue data in this sector is derived from a wide range of inquiries related to concepts, entities, and objects from the real world. The topics covered are extensive, spanning areas such as technology, art, and entrepreneurship.
- ✍🏻 **Writing and Creation**: The dialogue data in this sector is driven by the demands for writing/creation from scratch, and encompasses any tasks that an AI assistant may aid within the creative process, spanning from email composition to crafting narratives and plays, and beyond.
- 📋 **Assistance on Existent Materials**: The dialogue data in this sector is generated based on existing materials, including but not limited to rewriting, continuation, summarization, and inference, covering a diverse range of topics.

The following preprocessing was applied:
- Selection of a subset of the data for faster supervised fine-tuning.
- Truecasing of the dataset, as we observed that around 5% of the data contained grammatical errors.
- Removal of dialogues where the assistant replies with phrases such as "I do not have emotions" or "I don't have opinions".

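The last preprocessing step can be pictured as a simple phrase filter. The sketch below is a minimal illustration, not the actual pipeline used to build the dataset: the function name `keep_dialogue`, the case-insensitive substring match, and the normalization of curly apostrophes are all assumptions.

```python
# Phrases named in this card; the matching strategy below is an assumption.
BLOCKED_PHRASES = (
    "i do not have emotions",
    "i don't have opinions",
)


def keep_dialogue(messages):
    """Return False if any assistant turn contains a blocked phrase."""
    for message in messages:
        if message["role"] != "assistant":
            continue
        # Lowercase and normalize curly apostrophes before matching.
        text = message["content"].lower().replace("\u2019", "'")
        if any(phrase in text for phrase in BLOCKED_PHRASES):
            return False
    return True


# Toy dialogues made up for illustration; they are not taken from the dataset.
dialogues = [
    [
        {"role": "user", "content": "How do you feel today?"},
        {"role": "assistant", "content": "As an AI, I do not have emotions."},
    ],
    [
        {"role": "user", "content": "Suggest a title for my poem."},
        {"role": "assistant", "content": "How about 'Quiet Harbor'?"},
    ],
]

filtered = [d for d in dialogues if keep_dialogue(d)]
print(len(filtered))  # 1
```

Only assistant turns are inspected, so a user who quotes one of these phrases does not cause the dialogue to be dropped.
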
## Dataset Structure

The dataset contains two splits:
- train - 207,865 examples
- test - 23,110 examples

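As a quick sanity check on these numbers, the sketch below recomputes the totals from the split sizes in the card metadata and wraps loading in a helper. The repository id is taken from the citation URL in this card and is an assumption about where the data is hosted; `load_zephyr_ift` is a hypothetical helper name, and calling it requires the `datasets` library and network access.

```python
# Split sizes copied from the card metadata.
SPLITS = {
    "train": {"num_examples": 207_865, "num_bytes": 1_397_058_554},
    "test": {"num_examples": 23_110, "num_bytes": 154_695_659},
}

total_examples = sum(s["num_examples"] for s in SPLITS.values())
total_bytes = sum(s["num_bytes"] for s in SPLITS.values())
test_fraction = SPLITS["test"]["num_examples"] / total_examples

print(total_examples)           # 230975
print(total_bytes)              # 1551754213, matches dataset_size
print(round(test_fraction, 3))  # 0.1


def load_zephyr_ift(repo_id="HuggingFaceH4/zephyr_ift_public"):
    """Hypothetical helper: load both splits from the Hugging Face Hub.

    The default repo id is assumed from the citation URL in this card.
    """
    from datasets import load_dataset  # imported lazily; requires `datasets`

    return load_dataset(repo_id)
```

The test split is therefore roughly 10% of all examples.
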
The dataset is stored in Parquet format, and each entry uses the following schema:
```json
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages": [
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details. ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```

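To make the schema concrete, the following sketch checks that a record has the three fields shown above and that its turns alternate between `user` and `assistant`, starting with the user. The alternation rule and the rule that the first user turn repeats the prompt are inferred from the example record rather than stated by this card, and `validate_record` is a hypothetical helper.

```python
def validate_record(record):
    """Check a record against the schema shown in this card.

    Assumption: turns alternate user/assistant and the first user turn
    repeats the prompt, as in the example record.
    """
    for key in ("prompt", "prompt_id", "messages"):
        if key not in record:
            return False
    messages = record["messages"]
    if not messages or messages[0].get("content") != record["prompt"]:
        return False
    expected_roles = ("user", "assistant")
    return all(
        isinstance(m.get("content"), str)
        and m.get("role") == expected_roles[i % 2]
        for i, m in enumerate(messages)
    )


# Toy record made up for illustration; it is not taken from the dataset.
example = {
    "prompt": "Name three primary colors.",
    "prompt_id": "example-id",  # placeholder, not a real prompt_id
    "messages": [
        {"content": "Name three primary colors.", "role": "user"},
        {"content": "Red, yellow, and blue.", "role": "assistant"},
    ],
}

print(validate_record(example))  # True
```
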
### Citation Information

```bibtex
@misc{ZephyrIFT,
  author = {Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Alexander M. Rush and Thomas Wolf},
  title = {ZephyrIFT},
  year = {2023},
  publisher = {HuggingFace Hub},
  journal = {HuggingFace Hub repository},
  howpublished = {\url{https://huggingface.co/datasets/HuggingFaceH4/zephyr_ift_public}},
}
```