Update README.md
parent c2edde545c
commit b47acbaf27
README.md: 39 lines changed
@@ -48,7 +48,7 @@ It has been instrumental in generating high-performing model checkpoints and ser
 Dataset Summary

 The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
-Currently ~1M GPT-4 completions, and ~3.0M GPT-3.5 completions.
+Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
 It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
 The data is primarily used for training and evaluation in the field of natural language processing.

@@ -95,7 +95,7 @@ Further information on leaderboards will be updated as they become available.

 Languages

-The language of the data primarily is English.
+The language of the data is primarily English.

 <a name="dataset-structure"></a>

@@ -105,20 +105,24 @@ Dataset Structure

 Data Instances

-A data instance in this dataset represents an augmented and unaugmented set of text data, containing fields for the original and modified text content.
+A data instance in this dataset represents entries from the FLAN collection which have been augmented by submitting the listed question to either GPT-4 or GPT-3.5.
+The response is then entered into the response field.

 <a name="data-fields"></a>

 Data Fields

-The primary fields of interest are 'Original Text' and 'Augmented Text'.
-Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
+The fields are:
+1) 'id', a unique numbered identifier which includes one of 'niv', 't0', 'cot', or 'flan' to represent which source FLAN Collection submix the 'question' is sourced from.
+2) 'system_prompt', representing the System Prompt presented to the GPT-3.5 or GPT-4 API for the datapoint
+3) 'question', representing a question entry as provided by the FLAN Collection
+4) 'response', a response to that question received from a query to either GPT-3.5 or GPT-4.

 <a name="data-splits"></a>

 Data Splits

-Details regarding data splits (train/test/validate) will be updated as the data generation progresses.
+The split is 17.6% test.

 <a name="dataset-creation"></a>

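Review note: the new 'id' description says the identifier embeds one of four submix tags. The following is a minimal sketch of how a reader might recover that tag from a record. The exact id layout is not given in the README, so the sample ids below (for example `cot.12345`) and the helper name `submix_of` are hypothetical; only the four field names and the four tags come from the text above.

```python
import re
from typing import Optional

# Submix tags listed in the 'id' field description.
SUBMIX_TAGS = ("niv", "t0", "cot", "flan")


def submix_of(record_id: str) -> Optional[str]:
    """Return the first submix tag that appears as a token in the id, else None.

    Assumption: tags are separated from the numeric part by a
    non-alphanumeric character such as '.', '-' or '_'.
    """
    tokens = re.split(r"[^A-Za-z0-9]+", record_id.lower())
    for token in tokens:
        if token in SUBMIX_TAGS:
            return token
    return None


# A hypothetical record with the four documented fields.
example = {
    "id": "cot.12345",  # hypothetical id format
    "system_prompt": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
    "response": "2 + 2 equals 4.",
}

print(submix_of(example["id"]))
```

Matching on whole tokens rather than substrings avoids accidental hits if a tag's letters happen to occur inside a longer string.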
@@ -129,14 +133,22 @@ Dataset Creation
 Curation Rationale

-The dataset was created to provide a source of augmented text data for researchers and developers.
-It is particularly valuable in advancing the capabilities of language models, and fostering the generation of high-performing model checkpoints.
+The datapoints are intended primarily to provide an enhancement of the core FLAN Collection data which relies upon the detailed step by step reasoning capabilities of GPT-3.5 and GPT-4.
+This "reasoning trace" augmentation has demonstrated exceptional results, allowing a LLaMA-13B model trained with this data to rival or beat GPT-3.5 on broad sets of hard reasoning tasks which all models below 100B parameters had previously performed dramatically worse on.

 <a name="source-data"></a>

 Source Data

-The data is generated using techniques in alignment with the distributions outlined in the ORCA paper.
 The original unaugmented data comes from the FLAN dataset.
+The data is generated using techniques in alignment with the distributions outlined in the ORCA paper, except as noted below:
+
+1) There is not enough CoT data in the FLAN Collection to generate 150K zero-shot entries, as the paper purports to use.
+We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
+2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
+These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
+However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
+
+Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is an ongoing work.

 <a name="dataset-use"></a>

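Review note: the "~1.5M fewer datapoints" figure can be checked against the numbers quoted in the two caveats. This is a rough sanity check only. It assumes the combined figure includes the CoT gap (150K intended minus ~75K available) as well as the flan2021 and t0 shortfalls; the README does not state this explicitly, and all inputs are approximate.

```python
# Approximate figures quoted in the Source Data notes.
cot_target = 150_000          # zero-shot CoT entries the paper reports using
cot_available = 75_000        # CoT points actually available
flan2021_shortfall = 1_250_000
t0_shortfall = 200_000

# Assumption: the "combined" total counts all three gaps.
cot_shortfall = cot_target - cot_available
total_shortfall = cot_shortfall + flan2021_shortfall + t0_shortfall

print(f"{total_shortfall:,}")  # 1,525,000, i.e. roughly 1.5M
```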
@@ -152,12 +164,13 @@ The dataset can be used for tasks related to language understanding, natural lan

 Usage Caveats

-Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements.
-Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.
+Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements.
+Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.

 <a name="getting-started"></a>

 Getting Started

-For information on getting started, please refer to the Hugging Face dataset loading utilities.
-Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
+This dataset is organized such that it can be naively loaded via Hugging Face datasets library.
+We recommend using streaming due to the large size of the files.
+Regular updates and data generation progress can be monitored through the OpenOrca repository on Hugging Face.
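Review note: the new Getting Started text recommends streaming through the Hugging Face datasets library. The following sketch shows one way that could look. The repository id `Open-Orca/OpenOrca` and the split name `train` are assumptions, since the README excerpt gives neither; `stream_open_orca` and `preview` are hypothetical helper names. `load_dataset(..., streaming=True)` is the library's standard way to iterate over records without downloading the full files first.

```python
from itertools import islice


def stream_open_orca(repo_id: str = "Open-Orca/OpenOrca", split: str = "train"):
    """Return an iterable over records, fetched lazily.

    Assumptions: repo_id and split are guesses, not taken from the README.
    The import is done lazily so this sketch can be loaded without the
    datasets library installed.
    """
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def preview(records, n: int = 3):
    """Materialize only the first n records of any iterable."""
    return list(islice(records, n))


# Usage (requires network access and `pip install datasets`):
# for row in preview(stream_open_orca()):
#     print(row["id"], row["question"][:80])
```

Taking a bounded slice with `islice` keeps memory use small, which matters given the file sizes the README mentions.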