Many thanks to NanoBit and Caseus, makers of [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), for lending us their expertise on the platform that developed and trained manticore, minotaur, and many others!
We have been made aware that Eric Hartford, a team member who chose to depart our team the day prior to the official release of this repo after some internal discussion of our grievances, has claimed to be the sole originator of the Open Orca project and has presented the work as his own.
We wish to clarify that this has been a team effort from the outset; Eric was one of over a dozen data scientists, machine learning engineers, and other specialists involved in the project.
Eric joined the team with the mutual understanding that all members would be treated as equals, receive due credit for their involvement, and have a say in group decisions.
He made snap decisions on behalf of the team that ran contrary to our long-term plans, including announcing the project publicly on his blog and implying that he was the sole originator and project lead.
We attempted to reconcile this internally, but he chose to depart from the team.
As such, we elected to release the data publicly ahead of our original plans.
We have appropriately attributed him and all other contributors, as was originally planned.
We thank Eric for his contributions to the project and wish him well on his individual endeavors.
This repo is the original repo out of which the entire team had agreed to work and publish from the outset.
Eric's repo represents his duplication and augmentation of the team's collective effort, initiated after he had chosen to depart the team.
The Open Orca dataset is a collection of unaugmented and augmented FLAN data, currently comprising ~1M GPT-4 completions and ~3.5M GPT-3.5 completions.
It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
The data is primarily used for training and evaluation in the field of natural language processing.
This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing.
Further information on leaderboards will be updated as they become available.
A data instance in this dataset represents a paired set of unaugmented and augmented text data, with fields for the original and modified text content.
The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.
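As a starting point, the data can be inspected with the Hugging Face `datasets` library. The sketch below is a minimal example, not the canonical usage: the hub identifier `Open-Orca/OpenOrca` and the assumption that the original and modified text live in separate columns are placeholders that should be checked against the published schema.

```python
# Minimal sketch: stream one instance of the dataset and print its fields so the
# original and augmented text columns can be identified before training.
# The dataset identifier below is an assumption, not confirmed by this README.
from datasets import load_dataset

dataset = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

# Take a single example without downloading the full dataset.
example = next(iter(dataset))

for field, value in example.items():
    preview = str(value)[:80]  # truncate long completions for readability
    print(f"{field}: {preview}")
```

Streaming avoids pulling the full multi-million-row dataset to disk just to check field names and formatting.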