Update README.md
parent b7fe606ecd
commit b264c9865d
40
README.md
@@ -7,7 +7,7 @@ size_categories:
|
|||||||
task_categories:
|
task_categories:
|
||||||
- conversational
|
- conversational
|
||||||
- text-generation
|
- text-generation
|
||||||
pretty_name: UltraChat200k
|
pretty_name: UltraChat 200k
|
||||||
configs:
|
configs:
|
||||||
- config_name: default
|
- config_name: default
|
||||||
data_files:
|
data_files:
|
||||||
@@ -48,37 +48,31 @@ dataset_info:
|
|||||||
dataset_size: 3047427114
|
dataset_size: 3047427114
|
||||||
---
|
---
|
||||||
|
|
||||||
# Dataset Card for UltraChat200k
|
# Dataset Card for UltraChat 200k
|
||||||
|
|
||||||
## Dataset Description
|
## Dataset Description
|
||||||
|
|
||||||
This is a pre-processed Supervised Fine-Tuning dataset used for training Zephyr-7b-beta, a state of the art 7b chat model.
|
This is a heavily filtered version of the [UltraChat](https://github.com/thunlp/UltraChat) dataset and was used to train [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a state of the art 7b chat model.
|
||||||
|
|
||||||
The Zephyr-beta model is the best in class 7b model on three well known benchmarks:
|
The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create `UltraChat 200k`, we applied the following logic:
|
||||||
- [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench) - A multi-turn question set that uses GPT4 as a judge.
|
|
||||||
- [Alpaca eval](https://tatsu-lab.github.io/alpaca_eval/) - An LLM-based automatic evaluation that is fast, cheap, and reliable. That tests the ability of models to follow general user instructions.
|
|
||||||
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) which aims to track, rank and evaluate open LLMs and chatbots.
|
|
||||||
|
|
||||||
You can learn more about the techniques used to train Zephyr in the [Hugging Face Alignment Handbook](https://github.com/huggingface/alignment-handbook).
|
|
||||||
|
|
||||||
|
|
||||||
The base dataset is [UltraChat](https://github.com/thunlp/UltraChat): an open-source, large-scale, and multi-round dialogue dataset.
|
|
||||||
|
|
||||||
The dataset contains:
|
|
||||||
- 🌏 **Questions about the World**: The dialogue data in this sector is derived from a wide range of inquiries related to concepts, entities, and objects from the real world. The topics covered are extensive, spanning areas such as technology, art, and entrepreneurship.
|
|
||||||
- ✍🏻 **Writing and Creation**: The dialogue data in this sector is driven by the demands for writing/creation from scratch, and encompasses any tasks that an AI assistant may aid within the creative process, spanning from email composition to crafting narratives and plays, and beyond.
|
|
||||||
- 📋 **Assistance on Existent Materials**: The dialogue data in this sector is generated based on existing materials, including but not limited to rewriting, continuation, summarization, and inference, covering a diverse range of topics.
|
|
||||||
|
|
||||||
The following preprocessing was applied:
|
|
||||||
- Selection of a subset of data for faster supervised fine tuning.
|
- Selection of a subset of data for faster supervised fine tuning.
|
||||||
- Truecasing of the dataset, as we observed around 5% of the data contained grammatical errors.
|
- Truecasing of the dataset, as we observed around 5% of the data contained grammatical errors like "Hello. how are you?" instead of "Hello. How are you?"
|
||||||
- Removal of dialogues where the assistant replies "I do not have emotions", "I don't have opinions"
|
- Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either.
|
||||||
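The truecasing and refusal-removal steps listed above could be sketched roughly as follows. This is an illustration only, not the pipeline actually used to build the dataset: the `role`/`content` message layout, the helper names (`truecase`, `has_refusal`, `preprocess`), and the two-entry phrase list are all assumptions made for the example, and the sentence-boundary regex is deliberately crude (it would also capitalize the word after an abbreviation such as "e.g.").

```python
import re

# Hypothetical phrase list: the card names only these two examples.
REFUSAL_PHRASES = ("I do not have emotions", "I don't have opinions")

# A lowercase letter at the start of the text or right after ., ! or ? plus whitespace.
_SENTENCE_START = re.compile(r"(^|[.!?]\s+)([a-z])")


def truecase(text: str) -> str:
    """Capitalize the first letter of the text and of each following sentence."""
    return _SENTENCE_START.sub(lambda m: m.group(1) + m.group(2).upper(), text)


def has_refusal(dialogue: list) -> bool:
    """Return True if any assistant turn contains one of the refusal phrases."""
    return any(
        msg["role"] == "assistant" and phrase.lower() in msg["content"].lower()
        for msg in dialogue
        for phrase in REFUSAL_PHRASES
    )


def preprocess(dialogues: list) -> list:
    """Drop dialogues with refusal phrases, then truecase every remaining turn."""
    kept = []
    for dialogue in dialogues:
        if has_refusal(dialogue):
            continue
        kept.append([{**msg, "content": truecase(msg["content"])} for msg in dialogue])
    return kept
```

For instance, `truecase("Hello. how are you?")` returns `"Hello. How are you?"`, matching the example in the card.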
|
|
||||||
## Dataset Structure
|
## Dataset Structure
|
||||||
|
|
||||||
The dataset contains two splits:
|
The dataset has four splits, suitable for:
|
||||||
- train - containing 207,865 examples
|
|
||||||
- test - 23,110 examples
|
* Supervised fine-tuning (`sft`).
|
||||||
|
* Generation ranking (`gen`) via techniques like rejection sampling or PPO.
|
||||||
|
|
||||||
|
The number of examples per split is shown as follows:
|
||||||
|
|
||||||
|
|
||||||
|
| train_sft | test_sft | train_gen | test_gen |
|
||||||
|
|:-------:|:-----------:|:-----:| :-----:|
|
||||||
|
| 207865 | 23110 | 256032 | 28304 |
|
||||||
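As a rough illustration of how the four splits in the table above might be used, the sketch below records the split sizes and wraps a loading call. The repository id `HuggingFaceH4/ultrachat_200k` and the helper names `held_out_fraction` and `load_split` are assumptions for this example and do not appear in the commit; `load_dataset` comes from the Hugging Face `datasets` library and needs network access when called.

```python
# Split sizes copied from the table in the dataset card.
SPLIT_SIZES = {
    "train_sft": 207865,
    "test_sft": 23110,
    "train_gen": 256032,
    "test_gen": 28304,
}


def held_out_fraction(task: str) -> float:
    """Share of examples held out for testing for a task ('sft' or 'gen')."""
    train = SPLIT_SIZES[f"train_{task}"]
    test = SPLIT_SIZES[f"test_{task}"]
    return test / (train + test)


def load_split(split: str, repo_id: str = "HuggingFaceH4/ultrachat_200k"):
    """Load one split from the Hub; repo_id is an assumed default."""
    if split not in SPLIT_SIZES:
        raise ValueError(f"unknown split {split!r}; expected one of {sorted(SPLIT_SIZES)}")
    # Imported lazily so the size helpers work without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split)
```

Both tasks hold out roughly a tenth of their examples for testing, which `held_out_fraction("sft")` and `held_out_fraction("gen")` confirm.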
|
|
||||||
The dataset is stored in parquet format with each entry using the following schema:
|
The dataset is stored in parquet format with each entry using the following schema:
|
||||||
```
|
```
|
||||||
|