From b47acbaf274b2df59abc9ab24ee817fb6c70523d Mon Sep 17 00:00:00 2001
From: Bleys <bleysg@users.noreply.huggingface.co>
Date: Sat, 1 Jul 2023 07:38:14 +0000
Subject: [PATCH] Update README.md

---
 README.md | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/README.md b/README.md
index 32fddf7..48f154f 100644
--- a/README.md
+++ b/README.md
@@ -48,7 +48,7 @@ It has been instrumental in generating high-performing model checkpoints and ser
 Dataset Summary
 
 The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
-Currently ~1M GPT-4 completions, and ~3.0M GPT-3.5 completions.
+It currently contains ~1M GPT-4 completions and ~3.2M GPT-3.5 completions.
 It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
 The data is primarily used for training and evaluation in the field of natural language processing.
 
@@ -95,7 +95,7 @@ Further information on leaderboards will be updated as they become available.
 
 Languages
 
-The language of the data primarily is English.
+The language of the data is primarily English.
 
 <a name="dataset-structure"></a>
 
@@ -105,20 +105,24 @@ Dataset Structure
 
 Data Instances
 
-A data instance in this dataset represents an augmented and unaugmented set of text data, containing fields for the original and modified text content.
+A data instance in this dataset represents an entry from the FLAN Collection which has been augmented by submitting the listed question to either GPT-4 or GPT-3.5.
+The response is then entered into the response field.
 
 <a name="data-fields"></a>
 
 Data Fields
 
-The primary fields of interest are 'Original Text' and 'Augmented Text'.
-Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
+The fields are:
+1) 'id', a unique numbered identifier which includes one of 'niv', 't0', 'cot', or 'flan' to indicate which FLAN Collection submix the 'question' is sourced from.
+2) 'system_prompt', representing the System Prompt presented to the GPT-3.5 or GPT-4 API for the datapoint.
+3) 'question', representing a question entry as provided by the FLAN Collection.
+4) 'response', a response to that question received from a query to either GPT-3.5 or GPT-4.
 
 <a name="data-splits"></a>
 
 Data Splits
 
-Details regarding data splits (train/test/validate) will be updated as the data generation progresses.
+The test split comprises 17.6% of the data.
 
 <a name="dataset-creation"></a>
 
@@ -129,14 +133,22 @@ Dataset Creation
 Curation Rationale
 
 The dataset was created to provide a source of augmented text data for researchers and developers.
-It is particularly valuable in advancing the capabilities of language models, and fostering the generation of high-performing model checkpoints.
+The datapoints are intended primarily to provide an enhancement of the core FLAN Collection data which relies upon the detailed step-by-step reasoning capabilities of GPT-3.5 and GPT-4.
+This "reasoning trace" augmentation has demonstrated exceptional results, allowing a LLaMA-13B model trained with this data to rival or beat GPT-3.5 on broad sets of hard reasoning tasks which all models below 100B parameters had previously performed dramatically worse on.
 
 <a name="source-data"></a>
 
 Source Data
 
-The data is generated using techniques in alignment with the distributions outlined in the ORCA paper.
-The original unaugmented data comes from the FLAN dataset.
+The data is generated using techniques in alignment with the distributions outlined in the ORCA paper, except as noted below:
+
+1) There is not enough CoT data in the FLAN Collection to generate 150K zero-shot entries, as the paper purports to use.
+ We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
+2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
+ These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
+ However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have fewer entries than required for the flan2021 and t0 submixes, by ~1.25M and ~200K respectively.
+
+Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is ongoing work.
 
 <a name="dataset-use"></a>
 
@@ -152,12 +164,13 @@ The dataset can be used for tasks related to language understanding, natural lan
 
 Usage Caveats
 
-Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements.
-Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.
+Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements.
+Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.
 
 <a name="getting-started"></a>
 
 Getting Started
 
-For information on getting started, please refer to the Hugging Face dataset loading utilities.
-Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
\ No newline at end of file
+This dataset is organized such that it can be natively loaded via the Hugging Face datasets library.
+We recommend using streaming due to the large size of the files.
+Regular updates and data generation progress can be monitored through the OpenOrca repository on Hugging Face.
\ No newline at end of file