Update README.md

Bleys 2023-07-13 03:14:57 +00:00 committed by huggingface-web
parent e1416b3b2f
commit e2d7c0224a

@ -45,9 +45,16 @@ We are thrilled to announce the release of the Open Orca dataset!
This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707).
It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
+## Preview Model Release
+We have now released our first model preview!
+[OpenOrca-Preview1-13B](https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B)
+This model was trained in less than a day, for <$200, with <10% of our data.
+It beats current state-of-the-art models on BigBench-Hard and AGIEval, and achieves ~60% of the improvements reported in the Orca paper.
<a name="dataset-summary"></a>
-Dataset Summary
+# Dataset Summary
The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688).
It currently contains ~1M GPT-4 completions and ~3.2M GPT-3.5 completions.
@ -56,7 +63,7 @@ The data is primarily used for training and evaluation in the field of natural l
<a name="dataset-attribution"></a>
-Dataset Attribution
+# Dataset Attribution
We would like to give special recognition to the following contributors for their significant efforts and dedication:
@ -90,7 +97,7 @@ Want to visualize our full dataset? Check out our [Nomic Atlas Map](https://atla
<a name="supported-tasks-and-leaderboards"></a>
-Supported Tasks and Leaderboards
+# Supported Tasks and Leaderboards
This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing.
@ -98,24 +105,24 @@ Further information on leaderboards will be updated as they become available.
<a name="languages"></a>
-Languages
+# Languages
The language of the data is primarily English.
<a name="dataset-structure"></a>
-Dataset Structure
+# Dataset Structure
<a name="data-instances"></a>
-Data Instances
+## Data Instances
A data instance in this dataset represents an entry from the FLAN Collection that has been augmented by submitting the listed question to either GPT-4 or GPT-3.5.
The model's response is then stored in the response field.
<a name="data-fields"></a>
-Data Fields
+## Data Fields
The fields are:
1) 'id', a unique numbered identifier that includes one of 'niv', 't0', 'cot', or 'flan' to indicate which FLAN Collection submix the 'question' comes from.
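
As an illustration of how the submix tag in 'id' might be used, the sketch below extracts it from an identifier. It assumes the tag is a dot-separated prefix (for example `flan.12345`); that layout, the sample value, and the helper name `submix_of` are assumptions made for illustration and are not specified in this README.

```python
# Hypothetical helper. The "<submix>.<number>" layout is an assumption,
# not something this README specifies.
KNOWN_SUBMIXES = ("niv", "t0", "cot", "flan")


def submix_of(example_id: str) -> str:
    """Return the FLAN Collection submix tag embedded in an 'id' value."""
    # Take everything before the first dot as the candidate tag.
    prefix = example_id.split(".", 1)[0]
    if prefix not in KNOWN_SUBMIXES:
        raise ValueError(f"unrecognized submix in id: {example_id!r}")
    return prefix
```

If the real identifiers use a different separator, only the `split` call would need to change.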
@ -125,17 +132,17 @@ The fields are:
<a name="data-splits"></a>
-Data Splits
+## Data Splits
The data is unsplit.
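
Because no splits are provided, users who need a held-out set have to create one themselves. One possible approach, sketched here under assumptions, is to assign each record to a split by hashing its 'id', which keeps the assignment stable across runs and works on streamed data. The function name `assign_split` and the 5% default are hypothetical choices, not part of the dataset.

```python
import hashlib


def assign_split(example_id: str, validation_fraction: float = 0.05) -> str:
    """Deterministically map an 'id' to 'train' or 'validation'.

    Hypothetical helper: the split names and default fraction are
    illustrative assumptions.
    """
    # Hash the id so the assignment does not depend on record order.
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    threshold = int(validation_fraction * 2**64)
    return "validation" if bucket < threshold else "train"
```

Hashing rather than random sampling means the same record always lands in the same split, even if the dataset is later extended.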
<a name="dataset-creation"></a>
-Dataset Creation
+# Dataset Creation
<a name="curation-rationale"></a>
-Curation Rationale
+## Curation Rationale
The dataset was created to provide a source of augmented text data for researchers and developers.
The datapoints are intended primarily to enhance the core FLAN Collection data by drawing on the detailed step-by-step reasoning capabilities of GPT-3.5 and GPT-4.
@ -143,7 +150,7 @@ This "reasoning trace" augmentation has demonstrated exceptional results, allowi
<a name="source-data"></a>
-Source Data
+## Source Data
The data is generated using techniques in alignment with the distributions outlined in the Orca paper, except as noted below:
@ -157,24 +164,24 @@ Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. C
<a name="dataset-use"></a>
-Dataset Use
+# Dataset Use
<a name="use-cases"></a>
-Use Cases
+## Use Cases
The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.
<a name="usage-caveats"></a>
-Usage Caveats
+## Usage Caveats
Given that this is a work-in-progress dataset, it is recommended to regularly check for updates and improvements.
Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.
<a name="getting-started"></a>
-Getting Started
+## Getting Started
This dataset is organized so that it can be loaded directly via the Hugging Face `datasets` library.
We recommend using streaming due to the large size of the files.
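
A minimal sketch of streaming access is shown below. It assumes the dataset lives at the Hub repository id `Open-Orca/OpenOrca` and that the unsplit data is exposed under the library's default `train` split; both are assumptions to verify against the dataset page. The helper names `open_stream` and `take` are hypothetical.

```python
from itertools import islice

# Assumption: Hub repository id of the dataset; adjust if it differs.
REPO_ID = "Open-Orca/OpenOrca"


def open_stream(repo_id: str = REPO_ID):
    """Open the dataset lazily instead of downloading every file up front."""
    from datasets import load_dataset  # requires: pip install datasets

    # Assumption: unsplit data appears under the default "train" split.
    return load_dataset(repo_id, split="train", streaming=True)


def take(stream, n: int) -> list:
    """Materialize only the first n examples from an iterable stream."""
    return list(islice(stream, n))
```

Calling `take(open_stream(), 3)` would then pull only the first three records over the network, which is usually enough to inspect the fields before committing to a full pass.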