---
language:
- en
license: mit
task_categories:
- conversational
- text-classification
- token-classification
- table-question-answering
- question-answering
- zero-shot-classification
- summarization
- feature-extraction
- text-generation
- text2text-generation
pretty_name: Open Orca
size_categories:
- 10M<n<100M
---

# Dataset Card for Open Orca

## Table of Contents

- [Dataset Attribution](#dataset-attribution)
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
- [Dataset Use](#dataset-use)
  - [Use Cases](#use-cases)
  - [Usage Caveats](#usage-caveats)
  - [Getting Started](#getting-started)

<a name="dataset-attribution"></a>

## Dataset Attribution

We are thrilled to announce the release of the Open Orca dataset!
This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the ORCA paper.
It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

We would like to give special recognition to the following contributors for their significant efforts and dedication:

- winglian
- erhartford
- Nanobit
- Pankajmathur

http://AlignmentLab.ai:

- Autometa
- Entropi
- AtlasUnified
- NeverendingToast

Also, of course, as always, TheBloke, for being the backbone of the whole community.

Be sure to check out Axolotl on GitHub, developed by Nanobit and winglian: the platform used to develop and train Manticore, Minotaur, and many others!

Other team projects on Hugging Face:

- OrcaMini
- Samantha
- WizardVicuna, and more!

Maybe even one of our projects at http://Alignmentlab.ai, or join us on Discord at https://discord.gg/n9hXaBPWxx

We are looking for sponsors or collaborators to help us build these models to the scale they deserve; stacks of 3090s won't quite cut it this time, we think.
Not for Falcon 40b, it won't!
<a name="dataset-summary"></a>
|
|
|
|
Dataset Summary
|
|
|
|
The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
|
|
Currently ~1M GPT-4 completions, and ~3.5M GPT-3.5 completions.
|
|
It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
|
|
The data is primarily used for training and evaluation in the field of natural language processing.
|
|
|
|
<a name="supported-tasks-and-leaderboards"></a>
|
|
|
|
Supported Tasks and Leaderboards
|
|
|
|
This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
|
|
It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing.
|
|
Further information on leaderboards will be updated as they become available.
|
|
|
|
<a name="languages"></a>
|
|
|
|
Languages
|
|
|
|
The language of the data primarily is English.
|
|
|
|
<a name="dataset-structure"></a>
|
|
|
|
Dataset Structure
|
|
|
|
<a name="data-instances"></a>
|
|
|
|
Data Instances
|
|
|
|
A data instance in this dataset represents an augmented and unaugmented set of text data, containing fields for the original and modified text content.
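
For illustration only, a single instance might look like the sketch below. The field names follow the card's description ('Original Text' / 'Augmented Text'), and the metadata keys shown are assumptions, not the confirmed published schema.

```python
# Hypothetical example instance; field names follow the card's description
# ('Original Text' / 'Augmented Text') and the metadata keys are assumed,
# not confirmed against the published schema.
example = {
    "Original Text": "Summarize the following article: ...",          # unaugmented FLAN prompt
    "Augmented Text": "You are a helpful assistant. Summarize ...",    # augmented text
    "metadata": {"source": "flan", "augmentation_model": "gpt-4"},     # assumed metadata fields
}

print(example["Augmented Text"])
```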

<a name="data-fields"></a>

### Data Fields

The primary fields of interest are 'Original Text' and 'Augmented Text'.
Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
<a name="data-splits"></a>
|
|
|
|
Data Splits
|
|
|
|
Details regarding data splits (train/test/validate) will be updated as the data generation progresses.
|
|
|
|
<a name="dataset-creation"></a>
|
|
|
|
Dataset Creation
|
|
|
|
<a name="curation-rationale"></a>
|
|
|
|
Curation Rationale
|
|
|
|
The dataset was created to provide a source of augmented text data for researchers and developers.
|
|
It is particularly valuable in advancing the capabilities of language models, and fostering the generation of high-performing model checkpoints.
|
|
|
|
<a name="source-data"></a>
|
|
|
|
Source Data
|
|
|
|
The data is generated using techniques in alignment with the distributions outlined in the ORCA paper.
|
|
The original unaugmented data comes from the FLAN dataset.
|
|
|
|
<a name="dataset-use"></a>
|
|
|
|
Dataset Use
|
|
|
|
<a name="use-cases"></a>
|
|
|
|
Use Cases
|
|
|
|
The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.
|
|
|
|
<a name="usage-caveats"></a>
|
|
|
|
Usage Caveats
|
|
|
|
Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements.
|
|
Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.
|
|
|
|
<a name="getting-started"></a>
|
|
|
|
Getting Started
|
|
|
|
For information on getting started, please refer to the Hugging Face dataset loading utilities.
|
|
Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face. |
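
As a minimal sketch, the data can typically be loaded with the Hugging Face `datasets` library. The repository id `Open-Orca/OpenOrca` used below is an assumption; verify it against the actual dataset page before use.

```python
# Minimal loading sketch using the Hugging Face `datasets` library.
# The repository id below is assumed; check the dataset page for the exact id.
from datasets import load_dataset

# Streaming avoids downloading the full (multi-million-row) dataset up front.
ds = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

# Inspect the first example and the fields it exposes.
first = next(iter(ds))
print(list(first.keys()))
print(first)
```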