Update README.md

2023-07-01 07:38:14 +00:00 · 2023-07-01 07:38:14 +00:00 · b47acbaf27
commit b47acbaf27
parent c2edde545c
1 changed files with 26 additions and 13 deletions
--- a/README.md
+++ b/README.md
@@ -48,7 +48,7 @@ It has been instrumental in generating high-performing model checkpoints and ser
 Dataset Summary

 The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
-Currently ~1M GPT-4 completions, and ~3.0M GPT-3.5 completions.
+Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
 It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
 The data is primarily used for training and evaluation in the field of natural language processing.

@@ -95,7 +95,7 @@ Further information on leaderboards will be updated as they become available.

 Languages

-The language of the data primarily is English.
+The language of the data is primarily English.

 <a name="dataset-structure"></a>

@@ -105,20 +105,24 @@ Dataset Structure

 Data Instances

-A data instance in this dataset represents an augmented and unaugmented set of text data, containing fields for the original and modified text content.
+A data instance in this dataset is an entry from the FLAN Collection that has been augmented by submitting the listed question to either GPT-4 or GPT-3.5.
+The model's response is then stored in the 'response' field.

 <a name="data-fields"></a>

 Data Fields

-The primary fields of interest are 'Original Text' and 'Augmented Text'.
-Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
+The fields are:
+1) 'id', a unique numbered identifier that includes one of 'niv', 't0', 'cot', or 'flan' to indicate which FLAN Collection submix the 'question' comes from.
+2) 'system_prompt', the System Prompt presented to the GPT-3.5 or GPT-4 API for the datapoint.
+3) 'question', a question entry as provided by the FLAN Collection.
+4) 'response', the response to that question received from a query to either GPT-3.5 or GPT-4.

 <a name="data-splits"></a>

 Data Splits

-Details regarding data splits (train/test/validate) will be updated as the data generation progresses.
+17.6% of the data is in the test split.

 <a name="dataset-creation"></a>

@@ -129,14 +133,22 @@ Dataset Creation
 Curation Rationale

 The dataset was created to provide a source of augmented text data for researchers and developers.
-It is particularly valuable in advancing the capabilities of language models, and fostering the generation of high-performing model checkpoints.
+The datapoints are intended primarily to enhance the core FLAN Collection data with the detailed step-by-step reasoning capabilities of GPT-3.5 and GPT-4.
+This "reasoning trace" augmentation has demonstrated exceptional results, allowing a LLaMA-13B model trained with this data to rival or beat GPT-3.5 on broad sets of hard reasoning tasks on which all models below 100B parameters had previously performed dramatically worse.

 <a name="source-data"></a>

 Source Data

-The data is generated using techniques in alignment with the distributions outlined in the ORCA paper.
-The original unaugmented data comes from the FLAN dataset.
+The data is generated using techniques in alignment with the distributions outlined in the Orca paper, except as noted below:
+
+1) There is not enough CoT data in the FLAN Collection to generate the 150K zero-shot entries that the paper purports to use.
+ We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
+2) We used the pre-generated FLAN Collection datasets hosted on Hugging Face under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
+ These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
+ However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have fewer entries than required for the flan2021 and t0 submixes, by ~1.25M and ~200K respectively.
+
+Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is ongoing work.

 <a name="dataset-use"></a>

@@ -152,12 +164,13 @@ The dataset can be used for tasks related to language understanding, natural lan

 Usage Caveats

-Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements.
-Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.
+Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements.
+Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.

 <a name="getting-started"></a>

 Getting Started

-For information on getting started, please refer to the Hugging Face dataset loading utilities.
-Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
+This dataset is organized such that it can be loaded directly via the Hugging Face datasets library.
+We recommend using streaming due to the large size of the files.
+Regular updates and data generation progress can be monitored through the OpenOrca repository on Hugging Face.
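
Note on the new Getting Started and Data Fields text: the streaming recommendation and the 'id' convention could be exercised roughly as in the sketch below. This is a minimal illustration, not something the commit itself adds. The repo id `Open-Orca/OpenOrca`, the `train` split name, and the helper names `submix_of`, `stream_examples`, and `preview` are assumptions made for this example; the README only states that the 'id' includes one of 'niv', 't0', 'cot', or 'flan', so the exact id layout (for instance a tag followed by a number) is also assumed.

```python
import re
from itertools import islice
from typing import Iterator, Optional

# Submix tags listed in the README's Data Fields section.
SUBMIXES = ("niv", "t0", "cot", "flan")


def submix_of(example_id: str) -> Optional[str]:
    """Return the FLAN submix tag found in an id, or None if no tag is present.

    Assumes the tag appears as its own token, separated from the numeric part
    by a non-alphanumeric character (hypothetical layout, e.g. "cot.12345").
    """
    tokens = re.split(r"[^A-Za-z0-9]+", example_id.lower())
    for tag in SUBMIXES:
        if tag in tokens:
            return tag
    return None


def stream_examples(repo_id: str = "Open-Orca/OpenOrca",
                    split: str = "train") -> Iterator[dict]:
    """Iterate over rows lazily instead of downloading every file first.

    The default repo id and split name are assumptions, not taken from the README.
    """
    # Third-party dependency, imported lazily: pip install datasets
    from datasets import load_dataset

    return iter(load_dataset(repo_id, split=split, streaming=True))


def preview(n: int = 3) -> None:
    """Print the id, inferred submix, and response length of the first n rows."""
    for row in islice(stream_examples(), n):
        print(row["id"], submix_of(row["id"]), len(row["response"]))
```

Calling `preview()` needs network access and the `datasets` package; `submix_of` uses only the standard library and can be tested on its own.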