Update README.md

This commit is contained in:
Bleys 2023-06-29 19:28:57 +00:00 committed by huggingface-web
parent 378ac6e247
commit 094c8b4f29

@ -18,45 +18,44 @@ size_categories:
- 10M<n<100M
---
---
language:
- en
---
Dataset Card for Open Orca
Table of Contents
Dataset Description
Dataset Summary
Supported Tasks and Leaderboards
Languages
Dataset Structure
Data Instances
Data Fields
Data Splits
Dataset Creation
Curation Rationale
Source Data
Dataset Use
Use Cases
Usage Caveats
Getting Started
<a name="credit where credit is due"></a>
Dataset Card for Open Orca
The Orca paper has been replicated to as fine a degree of precision as several obsessive nerds sweating for weeks could pull off (a very high degree).
We will be releasing Orcas as the models continue to be trained, and the dataset after we wipe off all the sweat and tears.
Right now, we're testing our fifth iteration of Orca on a subset of the final data, and are just about to jump into the final stages!
Table of Contents
Dataset Attribution
Dataset Summary
Supported Tasks and Leaderboards
Languages
Dataset Structure
Data Instances
Data Fields
Data Splits
Dataset Creation
Curation Rationale
Source Data
Dataset Use
Use Cases
Usage Caveats
Getting Started
Thanks to the
Team:
<a name="dataset-attribution"></a>
winglian
erhartford
Nanobit
Pankajmathur
We are thrilled to announce the release of the Open Orca dataset!
This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the ORCA paper.
It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
http://AlignmentLab.ai:
Autometa
Entropi
AtlasUnified
NeverendingToast
We would like to give special recognition to the following contributors for their significant efforts and dedication:
winglian
erhartford
Nanobit
Pankajmathur
http://AlignmentLab.ai:
Autometa
Entropi
AtlasUnified
NeverendingToast
Also of course, as always, TheBloke, for being the backbone of the whole community.
@ -67,18 +66,25 @@ OrcaMini
Samantha
WizardVicuna, and more!
and maybe even one of our projects at: http://Alignmentlab.ai https://discord.gg/n9hXaBPWxx
Maybe even one of our projects at: http://Alignmentlab.ai https://discord.gg/n9hXaBPWxx
We are looking for sponsors or collaborators to help us build these models to the scale they deserve; stacks of 3090s won't quite cut it this time, we think.
Not for Falcon 40b, it won't!
<a name="dataset-summary"></a>
Dataset Summary
The Open Orca dataset is a collection of unaugmented and augmented FLAN data. It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope. The data is primarily used for training and evaluation in the field of natural language processing.
The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
It currently contains ~1M GPT-4 completions and ~3.5M GPT-3.5 completions.
It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
The data is primarily used for training and evaluation in the field of natural language processing.
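As a rough illustration of the mix described above, the sketch below works out what share of the current completions comes from each model. The counts are the approximate figures quoted in this summary, not exact row counts, and the dictionary keys are labels chosen here for illustration.

```python
# Approximate completion counts quoted in the summary (not exact row counts).
# The key names are illustrative labels, not field values from the dataset.
approx_counts = {
    "gpt-4": 1_000_000,
    "gpt-3.5": 3_500_000,
}

total = sum(approx_counts.values())

# Fraction of all completions contributed by each model.
shares = {model: count / total for model, count in approx_counts.items()}

for model, share in shares.items():
    print(f"{model}: {share:.1%} of ~{total:,} completions")
```

Since generation is ongoing, these proportions describe only the current partial release.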
<a name="supported-tasks-and-leaderboards"></a>
Supported Tasks and Leaderboards
This dataset supports a range of tasks including language modeling, text generation, and text augmentation. It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing. Leaderboard information will be added as it becomes available.
This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing.
Leaderboard information will be added as it becomes available.
<a name="languages"></a>
@ -88,6 +94,7 @@ The language of the data primarily is English.
<a name="dataset-structure"></a>
Dataset Structure
<a name="data-instances"></a>
Data Instances
@ -96,7 +103,8 @@ A data instance in this dataset represents an augmented and unaugmented set of t
<a name="data-fields"></a>
Data Fields
The primary fields of interest are 'Original Text' and 'Augmented Text'. Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
The primary fields of interest are 'Original Text' and 'Augmented Text'.
Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
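A minimal sketch of how a record with these two fields might be turned into a training pair. It assumes each record behaves like a dictionary keyed by 'Original Text' and 'Augmented Text', and that the original text serves as the prompt while the augmented text serves as the target; both are assumptions, as are the `to_training_pair` helper, the `source` metadata key, and the example record, which is made up for illustration.

```python
from typing import Any, Dict


def to_training_pair(record: Dict[str, Any]) -> Dict[str, str]:
    """Map one record to a prompt/completion pair (hypothetical helper).

    Assumes the two primary fields named in this card are present as
    string values; any other metadata fields are ignored.
    """
    original = record["Original Text"]
    augmented = record["Augmented Text"]
    return {"prompt": original.strip(), "completion": augmented.strip()}


# Made-up record for illustration only; it is not taken from the dataset.
example_record = {
    "Original Text": "What is the capital of France? ",
    "Augmented Text": "The capital of France is Paris.",
    "source": "hypothetical-flan-task",  # assumed metadata field
}

pair = to_training_pair(example_record)
print(pair)
```

If the released schema uses different column names, only the two dictionary keys in the helper need to change.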
<a name="data-splits"></a>
@ -106,19 +114,23 @@ Details regarding data splits (train/test/validate) will be updated as the data
<a name="dataset-creation"></a>
Dataset Creation
<a name="curation-rationale"></a>
Curation Rationale
The dataset was created to provide a source of augmented text data for researchers and developers. It is particularly valuable in advancing the capabilities of language models, and fostering the generation of high-performing model checkpoints.
The dataset was created to provide a source of augmented text data for researchers and developers.
It is particularly valuable in advancing the capabilities of language models, and fostering the generation of high-performing model checkpoints.
<a name="source-data"></a>
Source Data
The data is generated using techniques in alignment with the distributions outlined in the ORCA paper. The original unaugmented data comes from the FLAN dataset.
The data is generated using techniques in alignment with the distributions outlined in the ORCA paper.
The original unaugmented data comes from the FLAN dataset.
<a name="dataset-use"></a>
Dataset Use
<a name="use-cases"></a>
Use Cases
@ -127,9 +139,11 @@ The dataset can be used for tasks related to language understanding, natural lan
<a name="usage-caveats"></a>
Usage Caveats
Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements. Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.
Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements.
Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.
<a name="getting-started"></a>
Getting Started
For information on getting started, please refer to the Hugging Face dataset loading utilities. Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
For information on getting started, please refer to the Hugging Face dataset loading utilities.
Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
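For a concrete starting point, here is a hedged sketch that uses `load_dataset` from the Hugging Face `datasets` library in streaming mode, so nothing is downloaded in full. The repository id and the `train` split name are assumptions (this card has not yet published split details), and `stream_examples` is a hypothetical helper, not part of any library.

```python
from typing import Any, Dict, Iterator

# Assumption: the dataset is published under this Hub repository id.
DEFAULT_REPO_ID = "Open-Orca/OpenOrca"


def stream_examples(
    repo_id: str = DEFAULT_REPO_ID,
    split: str = "train",  # assumed split name; splits are not yet documented
    limit: int = 3,
) -> Iterator[Dict[str, Any]]:
    """Yield up to `limit` rows without downloading the whole dataset."""
    # Imported lazily so the helper can be defined without the package
    # installed; requires `pip install datasets` and network access to run.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split, streaming=True)
    for index, row in enumerate(dataset):
        if index >= limit:
            break
        yield row
```

Calling `for row in stream_examples(limit=2): print(row.keys())` would show which columns the current release exposes, which is a quick way to check the schema as the dataset evolves.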