Update README.md

Bleys 2023-07-01 07:38:14 +00:00 committed by huggingface-web
parent c2edde545c
commit b47acbaf27

@ -48,7 +48,7 @@ It has been instrumental in generating high-performing model checkpoints and ser
Dataset Summary
The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
Currently ~1M GPT-4 completions, and ~3.0M GPT-3.5 completions.
Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
The data is primarily used for training and evaluation in the field of natural language processing.
@ -95,7 +95,7 @@ Further information on leaderboards will be updated as they become available.
Languages
The language of the data primarily is English.
The language of the data is primarily English.
<a name="dataset-structure"></a>
@ -105,20 +105,24 @@ Dataset Structure
Data Instances
A data instance in this dataset represents an augmented and unaugmented set of text data, containing fields for the original and modified text content.
A data instance in this dataset represents entries from the FLAN collection which have been augmented by submitting the listed question to either GPT-4 or GPT-3.5.
The response is then entered into the response field.
<a name="data-fields"></a>
Data Fields
The primary fields of interest are 'Original Text' and 'Augmented Text'.
Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
The fields are:
1) 'id', a unique numbered identifier which includes one of 'niv', 't0', 'cot', or 'flan' to represent which source FLAN Collection submix the 'question' is sourced from.
2) 'system_prompt', representing the System Prompt presented to the GPT-3.5 or GPT-4 API for the datapoint.
3) 'question', representing a question entry as provided by the FLAN Collection
4) 'response', a response to that question received from a query to either GPT-3.5 or GPT-4.
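As a rough illustration of the field layout above, the following sketch recovers the source submix from an 'id' value. The exact id format is not specified here, so the example id `cot.12345` and the helper name `submix_of` are hypothetical; the code only assumes that one of the four submix tags appears as a separate token inside the id.

```python
import re
from typing import Optional

# The four FLAN Collection submix tags listed in the field description.
SUBMIXES = ("niv", "t0", "cot", "flan")


def submix_of(record_id: str) -> Optional[str]:
    """Return the submix tag embedded in an id, or None if no tag is found.

    Assumption: the tag is delimited from the rest of the id by
    non-alphanumeric characters (e.g. a dot), which is not confirmed
    by the dataset card.
    """
    for token in re.split(r"[^A-Za-z0-9]+", record_id):
        if token in SUBMIXES:
            return token
    return None


# Hypothetical record with the four documented fields.
example = {
    "id": "cot.12345",
    "system_prompt": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
    "response": "2 + 2 equals 4.",
}
example_submix = submix_of(example["id"])
```

If the real ids use a different delimiter or layout, only the tokenizing regular expression should need adjusting.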
<a name="data-splits"></a>
Data Splits
Details regarding data splits (train/test/validate) will be updated as the data generation progresses.
The split is 17.6% test.
<a name="dataset-creation"></a>
@ -129,14 +133,22 @@ Dataset Creation
Curation Rationale
The dataset was created to provide a source of augmented text data for researchers and developers.
It is particularly valuable in advancing the capabilities of language models, and fostering the generation of high-performing model checkpoints.
The datapoints are intended primarily to provide an enhancement of the core FLAN Collection data which relies upon the detailed step-by-step reasoning capabilities of GPT-3.5 and GPT-4.
This "reasoning trace" augmentation has demonstrated exceptional results, allowing a LLaMA-13B model trained with this data to rival or beat GPT-3.5 on broad sets of hard reasoning tasks on which all models below 100B parameters had previously performed dramatically worse.
<a name="source-data"></a>
Source Data
The data is generated using techniques in alignment with the distributions outlined in the ORCA paper.
The original unaugmented data comes from the FLAN dataset.
The data is generated using techniques in alignment with the distributions outlined in the ORCA paper, except as noted below:
1) There is not enough CoT data in the FLAN Collection to generate 150K zero-shot entries, as the paper purports to use.
We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have fewer entries than required for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is ongoing work.
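The shortfall figures above can be tallied with a few lines of arithmetic. This is only a sketch using the approximate numbers quoted in this section; whether the CoT gap is meant to be counted in the "~1.5M" total is an assumption here.

```python
# Approximate shortfalls relative to the Orca paper, taken from the notes above.
shortfalls = {
    "cot": 150_000 - 75_000,   # 150K zero-shot entries in the paper, ~75K available
    "flan2021": 1_250_000,     # ~1.25M entries missing
    "t0": 200_000,             # ~200k entries missing
}

# Sum over all submixes; assumption: the CoT gap is part of the combined figure.
total_shortfall = sum(shortfalls.values())
print(f"Total shortfall: ~{total_shortfall / 1e6:.2f}M datapoints")
```

Either way, the total rounds to the ~1.5M quoted above.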
<a name="dataset-use"></a>
@ -152,12 +164,13 @@ The dataset can be used for tasks related to language understanding, natural lan
Usage Caveats
Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements.
Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.
Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements.
Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.
<a name="getting-started"></a>
Getting Started
For information on getting started, please refer to the Hugging Face dataset loading utilities.
Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
This dataset is organized such that it can be loaded directly via the Hugging Face datasets library.
We recommend using streaming due to the large size of the files.
Regular updates and data generation progress can be monitored through the OpenOrca repository on Hugging Face.
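A minimal sketch of streaming access with the Hugging Face `datasets` library might look like the following. The repository id `Open-Orca/OpenOrca` and the split name `train` are assumptions, not taken from this section, and the helper names `stream_openorca` and `take` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def stream_openorca(repo_id: str = "Open-Orca/OpenOrca", split: str = "train"):
    """Open the dataset in streaming mode so the files are not downloaded up front.

    Assumptions: the repo id and split name are illustrative defaults.
    """
    # Imported lazily so the helper below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def take(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Materialize only the first n records from a (possibly very large) stream."""
    return list(islice(records, n))
```

Calling `take(stream_openorca(), 3)` would then fetch just the first three records over the network rather than the full set of files.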