From a17e18d249218178484a295901a390a91d53fa0c Mon Sep 17 00:00:00 2001
From: Edward Beeching
Date: Tue, 24 Oct 2023 08:51:33 +0000
Subject: [PATCH] Update README.md

---
 README.md | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index fe026dd..8d56847 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,12 @@ dataset_info:
 
 ## Dataset Description
 
-This is a pre-processed Supervised Fine-Tuning dataset used for training the Zephyr-7b-beta model.
+This is a pre-processed Supervised Fine-Tuning dataset used for training the Zephyr-7b-beta model, a state-of-the-art 7B chat model.
+The Zephyr-beta model is the best-in-class 7B model on three well-known benchmarks:
+- [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench) - A multi-turn question set that uses GPT-4 as a judge.
+- [Alpaca eval](https://tatsu-lab.github.io/alpaca_eval/) - An LLM-based automatic evaluation that is fast, cheap, and reliable, and that tests the ability of models to follow general user instructions.
+- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) - Aims to track, rank and evaluate open LLMs and chatbots.
+
 The base dataset is [UltraChat](https://github.com/thunlp/UltraChat): an open-source, large-scale, and multi-round dialogue dataset.
 
@@ -46,11 +51,15 @@ The dataset contains:
 
 The following preprocessing was applied:
 - Selection of a subset of data for faster supervised fine tuning.
-- Truecasing of the dataset, as we observed around %5 of the data contained grammatical errors.
+- Truecasing of the dataset, as we observed around 5% of the data contained grammatical errors.
 - Removal of dialogues where the assistant replies "I do not have emotions", "I don't have opinions"
 
 ## Dataset Structure
 
+The dataset contains two splits:
+- train: 207,865 examples
+- test: 23,110 examples
+
 The dataset is stored in parquet format with each entry using the following schema:
 
 ```
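
The two split sizes added under "Dataset Structure" imply roughly a 90/10 train/test partition. A minimal sanity check of that arithmetic (the counts are taken from the patch itself; everything else is illustrative):

```python
# Split sizes quoted in the README diff above.
train_examples = 207_865
test_examples = 23_110

total = train_examples + test_examples
test_fraction = test_examples / total

print(total)                    # 230975
print(round(test_fraction, 3))  # 0.1, i.e. about 10% held out for the test split
```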