---
language:
- en
license: mit
task_categories:
- conversational
- text-classification
- token-classification
- table-question-answering
- question-answering
- zero-shot-classification
- summarization
- feature-extraction
- text-generation
- text2text-generation
pretty_name: Open Orca
size_categories:
- 10M<n<100M
---

## Table of Contents

- [Dataset Summary](#dataset-summary)
- [Dataset Attribution](#dataset-attribution)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
- [Dataset Use](#dataset-use)
- [Use Cases](#use-cases)
- [Usage Caveats](#usage-caveats)
- [Getting Started](#getting-started)

<h1>🐋 The Open Orca Dataset! 🐋</h1>

![OpenOrca Logo](https://huggingface.co/datasets/Open-Orca/OpenOrca/resolve/main/OpenOrcaLogo.png "OpenOrca Logo")

<a name="dataset-announcement"></a>

We are thrilled to announce the release of the Open Orca dataset!
This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707).
It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

## Preview Model Release

We have now released our first model preview, [OpenOrca-Preview1-13B](https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B).
This model was trained in less than a day, for under $200, with less than 10% of our data.
It beats current state-of-the-art models on BigBench-Hard and AGIEval, achieving ~60% of the improvement reported in the Orca paper.

<a name="dataset-summary"></a>

# Dataset Summary

The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688), currently comprising ~1M GPT-4 completions and ~3.2M GPT-3.5 completions.
It is tabularized in alignment with the distributions presented in the Orca paper and currently represents a partial completion of the full intended dataset, with generation ongoing to expand its scope.
The data is primarily used for training and evaluation in the field of natural language processing.

<a name="dataset-attribution"></a>

# Dataset Attribution

We would like to give special recognition to the following contributors for their significant efforts and dedication:

- Teknium
- WingLian/Caseus
- Eric Hartford
- NanoBit
- Pankaj
- Winddude
- Rohan

From [AlignmentLab.ai](http://AlignmentLab.ai):

- Autometa
- Entropi
- AtlasUnified
- NeverendingToast
- NanoBit
- WingLian/Caseus

Also, of course, as always: TheBloke, for being the backbone of the whole community.

Many thanks to NanoBit and Caseus, makers of [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), for lending us their expertise on the platform that developed and trained Manticore, Minotaur, and many others!

We welcome sponsors and collaborators to help us build these models to the scale they deserve. Please reach out via our socials:

- [AlignmentLab.ai](http://Alignmentlab.ai)
- [Discord](https://discord.gg/n9hXaBPWxx)

Want to visualize our full dataset? Check out our [Nomic Atlas Map](https://atlas.nomic.ai/map/c1b88b47-2d9b-47e0-9002-b80766792582/2560fd25-52fe-42f1-a58f-ff5eccc890d2).

[<img src="https://huggingface.co/Open-Orca/OpenOrca-Preview1-13B/resolve/main/OpenOrca%20Nomic%20Atlas.png" alt="Atlas Nomic Dataset Map" width="400" height="400" />](https://atlas.nomic.ai/map/c1b88b47-2d9b-47e0-9002-b80766792582/2560fd25-52fe-42f1-a58f-ff5eccc890d2)

<a name="supported-tasks-and-leaderboards"></a>

# Supported Tasks and Leaderboards

This dataset supports a range of tasks, including language modeling, text generation, and text augmentation.
It has been instrumental in generating multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing.
Further information on leaderboards will be added as it becomes available.

<a name="languages"></a>

# Languages

The language of the data is primarily English.

<a name="dataset-structure"></a>

# Dataset Structure

<a name="data-instances"></a>

## Data Instances

A data instance in this dataset represents an entry from the FLAN Collection which has been augmented by submitting the listed question to either GPT-4 or GPT-3.5.
The model's answer is then entered into the response field.

<a name="data-fields"></a>

## Data Fields

The fields are:

1) `id`, a unique numbered identifier which includes one of `niv`, `t0`, `cot`, or `flan`, denoting the source FLAN Collection submix from which the `question` was drawn.
2) `system_prompt`, the system prompt presented to the GPT-3.5 or GPT-4 API for the datapoint.
3) `question`, a question entry as provided by the FLAN Collection.
4) `response`, the response to that question received from a query to either GPT-3.5 or GPT-4.

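For example, the submix can be recovered from the identifier alone. This is a minimal sketch, assuming ids take the form `<submix>.<number>` (e.g. `cot.86217`); check a few real ids before relying on this exact format:

```python
def submix_of(example_id: str) -> str:
    """Return the FLAN submix ('niv', 't0', 'cot', or 'flan') named in an id.

    Assumes the id is '<submix>.<number>', e.g. 'flan.2020759'.
    """
    prefix = example_id.split(".", 1)[0]
    if prefix not in {"niv", "t0", "cot", "flan"}:
        raise ValueError(f"unrecognized submix prefix: {prefix!r}")
    return prefix
```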
<a name="data-splits"></a>

## Data Splits

The data is unsplit.

<a name="dataset-creation"></a>

# Dataset Creation

<a name="curation-rationale"></a>

## Curation Rationale

The dataset was created to provide a source of augmented text data for researchers and developers.
The datapoints are intended primarily to enhance the core FLAN Collection data, drawing on the detailed step-by-step reasoning capabilities of GPT-3.5 and GPT-4.
This "reasoning trace" augmentation has demonstrated exceptional results, allowing a LLaMA-13B model trained on this data to rival or beat GPT-3.5 on broad sets of hard reasoning tasks on which all models below 100B parameters had previously performed dramatically worse.

<a name="source-data"></a>

## Source Data

The data is generated using techniques in alignment with the distributions outlined in the Orca paper, except as noted below:

1) There is not enough CoT data in the FLAN Collection to generate the 150K zero-shot entries the paper purports to use.
   We suspect this portion was either undocumented or misrepresented, and have used the ~75K datapoints available.
2) We used the pre-generated FLAN Collection datasets hosted on Hugging Face under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
   These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
   However, they are a subset of the full FLAN Collection data and fall short of the required entries for the flan2021 and t0 submixes by ~1.25M and ~200K datapoints respectively.

Combined, this leaves us with ~1.5M fewer datapoints than in the original Orca paper. Completing the set is ongoing work.

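The shortfall figures above can be sanity-checked with a little arithmetic (a sketch using the approximate counts quoted above; the grouping of the CoT gap into the ~1.5M total is our reading):

```python
# Approximate per-submix shortfalls relative to the Orca paper's targets
shortfalls = {
    "cot": 150_000 - 75_000,  # paper target minus CoT datapoints available
    "flan2021": 1_250_000,    # missing from the hosted flan2021 submix
    "t0": 200_000,            # missing from the hosted t0 submix
}
total_missing = sum(shortfalls.values())
print(total_missing)  # 1525000, i.e. roughly the ~1.5M figure above
```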
<a name="dataset-use"></a>

# Dataset Use

<a name="use-cases"></a>

## Use Cases

The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.

<a name="usage-caveats"></a>

## Usage Caveats

Given that this is a work-in-progress dataset, we recommend regularly checking for updates and improvements.
Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.

<a name="getting-started"></a>

## Getting Started

This dataset is organized so that it can be naively loaded via the Hugging Face datasets library.
We recommend using streaming due to the large size of the files.
Regular updates and data-generation progress can be monitored through the OpenOrca repository on Hugging Face.