Update README.md
This commit is contained in: parent e1416b3b2f · commit e2d7c0224a
37 README.md
@@ -45,9 +45,16 @@ We are thrilled to announce the release of the Open Orca dataset!
This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707).
It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

## Preview Model Release

We have now released our first model preview!

[OpenOrca-Preview1-13B](https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B)

This model was trained in less than a day, for <$200, with <10% of our data.
It beats current state-of-the-art models on BigBench-Hard and AGIEval, and achieves ~60% of the improvements reported in the Orca paper.

<a name="dataset-summary"></a>
|
<a name="dataset-summary"></a>
|
||||||
|
|
||||||
Dataset Summary
|
# Dataset Summary
|
||||||
|
|
||||||
The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688).
|
The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688).
|
||||||
Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
|
Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
|
||||||
@@ -56,7 +63,7 @@ The data is primarily used for training and evaluation in the field of natural language processing.
<a name="dataset-attribution"></a>
|
<a name="dataset-attribution"></a>
|
||||||
|
|
||||||
Dataset Attribution
|
# Dataset Attribution
|
||||||
|
|
||||||
We would like to give special recognition to the following contributors for their significant efforts and dedication:
|
We would like to give special recognition to the following contributors for their significant efforts and dedication:
|
||||||
|
|
||||||
@@ -90,7 +97,7 @@ Want to visualize our full dataset? Check out our [Nomic Atlas Map](https://atla…
<a name="supported-tasks-and-leaderboards"></a>
|
<a name="supported-tasks-and-leaderboards"></a>
|
||||||
|
|
||||||
Supported Tasks and Leaderboards
|
# Supported Tasks and Leaderboards
|
||||||
|
|
||||||
This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
|
This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
|
||||||
It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing.
|
It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing.
|
||||||
@@ -98,24 +105,24 @@ Further information on leaderboards will be updated as it becomes available.
<a name="languages"></a>
|
<a name="languages"></a>
|
||||||
|
|
||||||
Languages
|
# Languages
|
||||||
|
|
||||||
The language of the data is primarily English.
|
The language of the data is primarily English.
|
||||||
|
|
||||||
<a name="dataset-structure"></a>
|
<a name="dataset-structure"></a>
|
||||||
|
|
||||||
Dataset Structure
|
# Dataset Structure
|
||||||
|
|
||||||
<a name="data-instances"></a>
|
<a name="data-instances"></a>
|
||||||
|
|
||||||
Data Instances
|
## Data Instances
|
||||||
|
|
||||||
A data instance in this dataset represents an entry from the FLAN collection which has been augmented by submitting the listed question to either GPT-4 or GPT-3.5.
The response is then entered into the response field.
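For illustration, an instance might look like the sketch below. The 'id', 'question', and 'response' fields are the ones named in this card; the values are invented placeholders rather than real rows, and the id format shown is an assumption:

```python
# A hypothetical instance, shown as a plain Python dict.
# Field names follow this card; values are illustrative placeholders.
example = {
    # Assumed id format: submix tag plus a numeric identifier (see Data Fields).
    "id": "flan.123456",
    "question": "Premise: ... Hypothesis: ... Does the premise entail the hypothesis?",
    "response": "Let's think step by step. ...",  # GPT-4 or GPT-3.5 completion
}
```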
<a name="data-fields"></a>
|
<a name="data-fields"></a>
|
||||||
|
|
||||||
Data Fields
|
## Data Fields
|
||||||
|
|
||||||
The fields are:
|
The fields are:
|
||||||
1) 'id', a unique numbered identifier which includes one of 'niv', 't0', 'cot', or 'flan' to represent which source FLAN Collection submix the 'question' is sourced from.
|
1) 'id', a unique numbered identifier which includes one of 'niv', 't0', 'cot', or 'flan' to represent which source FLAN Collection submix the 'question' is sourced from.
|
||||||
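Since the submix tag is embedded in the 'id', it can be recovered programmatically. Below is a minimal sketch assuming the tag appears as a prefix of the identifier (e.g. 'flan.123456'); that prefix format is an assumption, not documented above, so adjust the parsing if the actual ids differ:

```python
# Recover the source FLAN Collection submix from an instance id.
# Assumes the submix tag prefixes the id, e.g. "flan.123456" -- this
# prefix format is an assumption, not documented in this card.
SUBMIXES = ("niv", "t0", "cot", "flan")

def submix_of(example_id: str) -> str:
    for tag in SUBMIXES:
        if example_id.startswith(tag):
            return tag
    raise ValueError(f"unrecognized submix in id: {example_id!r}")

print(submix_of("flan.123456"))  # -> flan
```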
@@ -125,17 +132,17 @@ The fields are:
<a name="data-splits"></a>
|
<a name="data-splits"></a>
|
||||||
|
|
||||||
Data Splits
|
## Data Splits
|
||||||
|
|
||||||
The data is unsplit.
|
The data is unsplit.
|
||||||
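Because the data ships unsplit, a held-out set has to be carved out by the user. The sketch below uses the Hugging Face datasets library with a non-streaming load (note this downloads the full files, which are large); the repository id 'Open-Orca/OpenOrca' is an assumption:

```python
from datasets import load_dataset

# Non-streaming load so train_test_split() is available; this downloads
# the full files. The repo id is an assumption -- substitute the real one.
ds = load_dataset("Open-Orca/OpenOrca", split="train")

# Carve out a 1% held-out set, with a fixed seed for reproducibility.
splits = ds.train_test_split(test_size=0.01, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```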
<a name="dataset-creation"></a>

# Dataset Creation

<a name="curation-rationale"></a>
|
<a name="curation-rationale"></a>
|
||||||
|
|
||||||
Curation Rationale
|
## Curation Rationale
|
||||||
|
|
||||||
The dataset was created to provide a source of augmented text data for researchers and developers.
|
The dataset was created to provide a source of augmented text data for researchers and developers.
|
||||||
The datapoints are intended primarily to provide an enhancement of the core FLAN Collection data which relies upon the detailed step by step reasoning capabilities of GPT-3.5 and GPT-4.
|
The datapoints are intended primarily to provide an enhancement of the core FLAN Collection data which relies upon the detailed step by step reasoning capabilities of GPT-3.5 and GPT-4.
|
||||||
@@ -143,7 +150,7 @@ This "reasoning trace" augmentation has demonstrated exceptional results, allowing…
<a name="source-data"></a>
|
<a name="source-data"></a>
|
||||||
|
|
||||||
Source Data
|
## Source Data
|
||||||
|
|
||||||
The data is generated using techniques in alignment with the distributions outlined in the Orca paper, except as noted below:
|
The data is generated using techniques in alignment with the distributions outlined in the Orca paper, except as noted below:
|
||||||
|
|
||||||
@@ -157,24 +164,24 @@ Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. C…
<a name="dataset-use"></a>
|
<a name="dataset-use"></a>
|
||||||
|
|
||||||
Dataset Use
|
# Dataset Use
|
||||||
|
|
||||||
<a name="use-cases"></a>
|
<a name="use-cases"></a>
|
||||||
|
|
||||||
Use Cases
|
## Use Cases
|
||||||
|
|
||||||
The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.
|
The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.
|
||||||
|
|
||||||
<a name="usage-caveats"></a>
|
<a name="usage-caveats"></a>
|
||||||
|
|
||||||
Usage Caveats
|
## Usage Caveats
|
||||||
|
|
||||||
Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements.
|
Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements.
|
||||||
Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.
|
Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.
|
||||||
<a name="getting-started"></a>

## Getting Started

This dataset is organized so that it can be naively loaded via the Hugging Face datasets library.
We recommend using streaming due to the large size of the files.
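A minimal sketch of such a streaming load follows; the repository id 'Open-Orca/OpenOrca' is an assumption here, so substitute the actual dataset path if it differs:

```python
from datasets import load_dataset

# Stream rather than download in full, since the files are large.
# The repo id is an assumption -- substitute the actual dataset path.
ds = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

# Peek at the first instance without materializing the dataset.
first = next(iter(ds))
print(sorted(first.keys()))
```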