Update README.md

parent 666b81f100
commit a17e18d249

README.md: 13 lines changed
@@ -35,7 +35,12 @@ dataset_info:

 ## Dataset Description

-This is a pre-processed Supervised Fine-Tuning dataset used for training the Zephyr-7b-beta model.
+This is a pre-processed Supervised Fine-Tuning dataset used for training the Zephyr-7b-beta model, a state-of-the-art 7B chat model.
+
+The Zephyr-beta model is the best-in-class 7B model on three well-known benchmarks:
+- [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench) - A multi-turn question set that uses GPT-4 as a judge.
+- [Alpaca eval](https://tatsu-lab.github.io/alpaca_eval/) - An LLM-based automatic evaluation that is fast, cheap, and reliable, and tests the ability of models to follow general user instructions.
+- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) - Tracks, ranks, and evaluates open LLMs and chatbots.

 The base dataset is [UltraChat](https://github.com/thunlp/UltraChat): an open-source, large-scale, and multi-round dialogue dataset.
@@ -46,11 +51,15 @@ The dataset contains:

 The following preprocessing was applied:
 - Selection of a subset of data for faster supervised fine-tuning.
-- Truecasing of the dataset, as we observed around %5 of the data contained grammatical errors.
+- Truecasing of the dataset, as we observed around 5% of the data contained grammatical errors.
 - Removal of dialogues where the assistant replies "I do not have emotions" or "I don't have opinions".
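The two cleanup steps above (truecasing, and dropping dialogues with canned refusals) can be sketched in plain Python. This is a minimal illustration only: the refusal phrase list and the sentence-boundary regex are assumptions, since the commit does not show the actual preprocessing code.

```python
import re

# Hypothetical refusal phrases mirroring the filtering step described above.
REFUSAL_PHRASES = ("i do not have emotions", "i don't have opinions")

def truecase(text: str) -> str:
    """Naive truecasing: capitalize the first letter of each sentence."""
    def cap(m: re.Match) -> str:
        return m.group(1) + m.group(2).upper()
    # Uppercase at the start of the string and after sentence-ending punctuation.
    return re.sub(r"(^|[.!?]\s+)([a-z])", cap, text)

def keep_dialogue(assistant_replies: list[str]) -> bool:
    """Drop a dialogue if any assistant reply contains a refusal phrase."""
    lowered = " ".join(assistant_replies).lower()
    return not any(phrase in lowered for phrase in REFUSAL_PHRASES)

print(truecase("hello there. how are you?"))                 # → Hello there. How are you?
print(keep_dialogue(["As an AI, I do not have emotions."]))  # → False
```

A production truecaser would also need to handle proper nouns and acronyms, which a regex alone cannot recover.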

 ## Dataset Structure

 The dataset contains two splits:
 - train - containing 207,865 examples
 - test - containing 23,110 examples

 The dataset is stored in parquet format, with each entry using the following schema: