Update README.md

2023-10-24 08:27:00 +00:00 · 2023-10-24 08:27:00 +00:00 · 9ad6607fa9
commit 9ad6607fa9
parent ca063fbbc4
1 changed files with 76 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -1,4 +1,13 @@
 ---
+license: mit
+task_categories:
+- conversational
+- text-generation
+language:
+- en
+size_categories:
+- 100K<n<1M
+pretty_name: ZephyrIFT
 dataset_info:
  features:
  - name: prompt
@@ -21,6 +30,71 @@ dataset_info:
  download_size: 813207030
  dataset_size: 1551754213
 ---
-# Dataset Card for "ultrachat_200k"

-[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+# Dataset Card for Dataset Name
+
+## Dataset Description
+
+This is a pre-processed Instruction Fine-Tuning dataset used for training the Zephyr-7b-beta model.
+
+The base dataset is [UltraChat](https://github.com/thunlp/UltraChat): an open-source, large-scale, and multi-round dialogue dataset.
+
+The dataset contains:
+- 🌏 **Questions about the World**: The dialogue data in this sector is derived from a wide range of inquiries related to concepts, entities, and objects from the real world. The topics covered are extensive, spanning areas such as technology, art, and entrepreneurship.
+- ✍🏻 **Writing and Creation**: The dialogue data in this sector is driven by the demands for writing/creation from scratch, and encompasses any tasks that an AI assistant may aid within the creative process, spanning from email composition to crafting narratives and plays, and beyond.
+- 📋 **Assistance on Existent Materials**: The dialogue data in this sector is generated based on existing materials, including but not limited to rewriting, continuation, summarization, and inference, covering a diverse range of topics.
+
+The following preprocessing was applied:
+- Selection of a subset of data for faster supervised fine-tuning.
+- Truecasing of the dataset, as we observed that around 5% of the data contained grammatical errors.
+- Removal of dialogues where the assistant replies "I do not have emotions", "I don't have opinions", etc. (to be confirmed after experiments)
+
+## Dataset Structure
+
+The dataset is stored in parquet format with each entry using the following schema:
+```
+
+{
+    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
+    "messages":[
+        {
+            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
+            "role": "user"
+        },
+        {
+            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
+            "role": "assistant"
+        },
+        {
+            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details.  ...",
+            "role": "user"
+        },
+        {
+            "content": "Certainly! ....",
+            "role": "assistant"
+        },
+        {
+            "content": "That's really interesting! I would love to hear more...",
+            "role": "user"
+        },
+        {
+            "content": "Certainly! ....",
+            "role": "assistant"
+        }
+    ],
+    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
+}
+```
+
+### Citation Information
+
+```bibtex
+@misc{ZephyrIFT,
+  author = {Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Alexander M. Rush, and Thomas Wolf},
+  title = {ZephyrIFT},
+  year = {2023},
+  publisher = {HuggingFace Hub},
+  journal = {HuggingFace Hub repository},
+  howpublished = {\url{https://huggingface.co/datasets/HuggingFaceH4/zephyr_ift_public}},
+}
+```
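
Reviewer note: the schema documented in the README can be sanity-checked with a short script. The sketch below is illustrative only and is not part of this commit. It uses a hypothetical, shortened record modeled on the README example, and it assumes (based on that example alone) that the first message repeats `prompt` and that roles alternate between `user` and `assistant`. The helper names `check_record` and `to_turn_pairs` are made up for this note.

```python
import json

# Hypothetical record following the schema shown in the README (contents shortened).
record_json = """
{
    "prompt": "Create a fully-developed protagonist ...",
    "messages": [
        {"content": "Create a fully-developed protagonist ...", "role": "user"},
        {"content": "Name: Ava ...", "role": "assistant"},
        {"content": "Wow, Ava's story is so intense ...", "role": "user"},
        {"content": "Certainly! ....", "role": "assistant"}
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
"""


def check_record(record):
    """Return True if the record matches the layout assumed in this note."""
    messages = record["messages"]
    # Assumption: the first message is the user prompt itself.
    if not messages or messages[0]["content"] != record["prompt"]:
        return False
    # Assumption: roles strictly alternate, starting with the user.
    expected_roles = ["user", "assistant"]
    return all(m["role"] == expected_roles[i % 2] for i, m in enumerate(messages))


def to_turn_pairs(record):
    """Group messages into (user, assistant) pairs; a trailing unpaired message is dropped."""
    msgs = record["messages"]
    return [
        (msgs[i]["content"], msgs[i + 1]["content"])
        for i in range(0, len(msgs) - 1, 2)
    ]


record = json.loads(record_json)
if check_record(record):
    for user_text, assistant_text in to_turn_pairs(record):
        print("USER:", user_text)
        print("ASSISTANT:", assistant_text)
```

Parsing the example with `json.loads` is also a quick way to catch syntax slips such as a missing or trailing comma in the README's sample record.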