We are thrilled to announce the release of the Open Orca dataset!
This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707).
It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
## Preview Model Release
We have now released our first model preview!
[OpenOrca-Preview1-13B](https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B)
This model was trained in less than a day, for under $200, with less than 10% of our data.
It beats current state-of-the-art models on BigBench-Hard and AGIEval, and achieves ~60% of the improvements reported in the Orca paper.
<a name="dataset-summary"></a>
# Dataset Summary
The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688).
It currently contains ~1M GPT-4 completions and ~3.2M GPT-3.5 completions.
The data is primarily used for training and evaluation in the field of natural language processing.
<a name="dataset-attribution"></a>
# Dataset Attribution
We would like to give special recognition to the following contributors for their significant efforts and dedication:
Want to visualize our full dataset? Check out our [Nomic Atlas Map](https://atla…).
<a name="supported-tasks-and-leaderboards"></a>
# Supported Tasks and Leaderboards
This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing.
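As one concrete use, a record's fields can be joined into a single training prompt. The sketch below assumes the `system_prompt`, `question`, and `response` field names described later in this card; the template itself is illustrative only, not the one used to train the released checkpoints.

```python
def format_example(record: dict) -> str:
    # Illustrative template only: join the system prompt, the FLAN question,
    # and the GPT completion into one training string.
    return (
        f"{record['system_prompt']}\n\n"
        f"### Question:\n{record['question']}\n\n"
        f"### Response:\n{record['response']}"
    )

# Invented example record for demonstration purposes.
example = {
    "system_prompt": "You are a helpful assistant that reasons step by step.",
    "question": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}
prompt = format_example(example)
```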
Further information on leaderboards will be updated as they become available.
<a name="languages"></a>
# Languages
The language of the data is primarily English.
<a name="dataset-structure"></a>
# Dataset Structure
<a name="data-instances"></a>
## Data Instances
A data instance in this dataset represents entries from the FLAN collection which have been augmented by submitting the listed question to either GPT-4 or GPT-3.5.
The response is then entered into the response field.
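Concretely, an instance might look like the following sketch (the values are invented for illustration; the field names follow the Data Fields section of this card):

```python
# Hypothetical OpenOrca-style datapoint; values invented for illustration.
instance = {
    "id": "flan.123456",  # prefixed with the source submix: 'niv', 't0', 'cot', or 'flan'
    "system_prompt": "You are an AI assistant. Give a detailed answer.",
    "question": "Which planet is known as the Red Planet?",
    "response": "Mars is called the Red Planet because iron oxide gives its surface a reddish color.",
}

# 'question' comes from the FLAN Collection; 'response' is the completion
# returned by GPT-4 or GPT-3.5 for that question.
```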
<a name="data-fields"></a>
## Data Fields
The fields are:
1) 'id', a unique numbered identifier which includes one of 'niv', 't0', 'cot', or 'flan' to represent which source FLAN Collection submix the 'question' is sourced from.
2) 'system_prompt', representing the System Prompt presented to the GPT-3.5 or GPT-4 API for the datapoint.
3) 'question', representing a question entry as provided by the FLAN Collection.
4) 'response', a response to that question received from a query to either GPT-3.5 or GPT-4.
<a name="data-splits"></a>
## Data Splits
The data is unsplit.
<a name="dataset-creation"></a>
# Dataset Creation
<a name="curation-rationale"></a>
## Curation Rationale
The dataset was created to provide a source of augmented text data for researchers and developers.
The datapoints are intended primarily to provide an enhancement of the core FLAN Collection data, which relies upon the detailed step-by-step reasoning capabilities of GPT-3.5 and GPT-4.
This "reasoning trace" augmentation has demonstrated exceptional results, allowing …
<a name="source-data"></a>
## Source Data
The data is generated using techniques in alignment with the distributions outlined in the Orca paper, except as noted below:
Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. C…
<a name="dataset-use"></a>
# Dataset Use
<a name="use-cases"></a>
## Use Cases
The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.
<a name="usage-caveats"></a>
## Usage Caveats
Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements.
Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.
<a name="getting-started"></a>
## Getting Started
This dataset is organized so that it can be loaded naively via the Hugging Face datasets library.
We recommend using streaming due to the large size of the files.
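A streamed load might look like the sketch below. It assumes the dataset is published on the Hugging Face Hub under a repo id such as `Open-Orca/OpenOrca` (adjust if it differs) and that the `datasets` library is installed; the import is deferred so the function can be defined without network access.

```python
from itertools import islice

def stream_first_records(n: int = 3, repo: str = "Open-Orca/OpenOrca"):
    """Yield the first n records without downloading the whole dataset.

    `repo` is an assumed Hub id for this dataset; calling this function
    requires `pip install datasets` and network access.
    """
    from datasets import load_dataset  # deferred: only needed at call time
    ds = load_dataset(repo, split="train", streaming=True)
    yield from islice(ds, n)
```

Each yielded record is a plain dict, so e.g. `next(stream_first_records(1))["question"]` inspects the first question without materializing the full files.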