From b47acbaf274b2df59abc9ab24ee817fb6c70523d Mon Sep 17 00:00:00 2001
From: Bleys
Date: Sat, 1 Jul 2023 07:38:14 +0000
Subject: [PATCH] Update README.md
---
 README.md | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/README.md b/README.md
index 32fddf7..48f154f 100644
--- a/README.md
+++ b/README.md
@@ -48,7 +48,7 @@ It has been instrumental in generating high-performing model checkpoints and ser
 Dataset Summary
 The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
-Currently ~1M GPT-4 completions, and ~3.0M GPT-3.5 completions.
+Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
 It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
 The data is primarily used for training and evaluation in the field of natural language processing.
@@ -95,7 +95,7 @@ Further information on leaderboards will be updated as they become available.
 Languages
-The language of the data primarily is English.
+The language of the data is primarily English.
@@ -105,20 +105,24 @@ Dataset Structure
 Data Instances
-A data instance in this dataset represents an augmented and unaugmented set of text data, containing fields for the original and modified text content.
+A data instance in this dataset represents entries from the FLAN collection which have been augmented by submitting the listed question to either GPT-4 or GPT-3.5.
+The response is then entered into the response field.
 Data Fields
-The primary fields of interest are 'Original Text' and 'Augmented Text'.
-Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
+The fields are:
+1) 'id', a unique numbered identifier which includes one of 'niv', 't0', 'cot', or 'flan' to represent which source FLAN Collection submix the 'question' is sourced from.
+2) 'system_prompt', representing the System Prompt presented to the GPT-3.5 or GPT-4 API for the datapoint
+3) 'question', representing a question entry as provided by the FLAN Collection
+4) 'response', a response to that question received from a query to either GPT-3.5 or GPT-4.
 Data Splits
-Details regarding data splits (train/test/validate) will be updated as the data generation progresses.
+The split is 17.6% test.
@@ -129,14 +133,22 @@ Dataset Creation
 Curation Rationale
 The dataset was created to provide a source of augmented text data for researchers and developers.
-It is particularly valuable in advancing the capabilities of language models, and fostering the generation of high-performing model checkpoints.
+The datapoints are intended primarily to provide an enhancement of the core FLAN Collection data which relies upon the detailed step by step reasoning capabilities of GPT-3.5 and GPT-4.
+This "reasoning trace" augmentation has demonstrated exceptional results, allowing a LLaMA-13B model trained with this data to rival or beat GPT-3.5 on broad sets of hard reasoning tasks which all models below 100B parameters had previously performed dramatically worse on.
 Source Data
-The data is generated using techniques in alignment with the distributions outlined in the ORCA paper.
-The original unaugmented data comes from the FLAN dataset.
+The data is generated using techniques in alignment with the distributions outlined in the ORCA paper, except as noted below:
+
+1) There is not enough CoT data in the FLAN Collection to generate 150K zero-shot entries, as the paper purports to use.
+ We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
+2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
+ These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
+ However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
+
+Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is an ongoing work.
@@ -152,12 +164,13 @@ The dataset can be used for tasks related to language understanding, natural lan
 Usage Caveats
-Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements.
-Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.
+Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements.
+Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.
 Getting Started
-For information on getting started, please refer to the Hugging Face dataset loading utilities.
-Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
\ No newline at end of file
+This dataset is organized such that it can be naively loaded via Hugging Face datasets library.
+We recommend using streaming due to the large size of the files.
+Regular updates and data generation progress can be monitored through the OpenOrca repository on Hugging Face.
\ No newline at end of file
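
Note on the new "Getting Started" and "Data Fields" text: the streaming recommendation and the 'id' convention could be illustrated with a minimal sketch like the one below. It is not part of the patch. The repository id `Open-Orca/OpenOrca`, the split name `train`, and the sample id values are assumptions, since the patch names none of them; `submix_of`, `head`, and `stream_records` are hypothetical helper names. The sketch only relies on the four field names and the four submix tags listed in the README.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Assumption: the patch does not state the repository id; adjust as needed.
REPO_ID = "Open-Orca/OpenOrca"

# Submix tags that the README says appear inside each 'id' value.
SUBMIX_TAGS = ("niv", "t0", "cot", "flan")


def submix_of(record_id: str) -> str:
    """Return the FLAN submix tag contained in an 'id' value."""
    for tag in SUBMIX_TAGS:
        if tag in record_id:
            return tag
    raise ValueError(f"no known submix tag in id {record_id!r}")


def head(records: Iterable[Any], n: int) -> List[Any]:
    """Take the first n items from a (possibly very large) stream."""
    return list(islice(records, n))


def stream_records(repo_id: str = REPO_ID, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily iterate over records without downloading every file first.

    The split name is an assumption; the README only gives a test fraction.
    """
    # Imported here so the helpers above work without the third-party package.
    from datasets import load_dataset

    return iter(load_dataset(repo_id, split=split, streaming=True))


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    for record in head(stream_records(), 3):
        print(submix_of(record["id"]), record["question"][:80])
```

Streaming keeps memory use bounded because records are fetched as they are iterated, which matches the README's concern about file size.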