---
language:
- en
license: mit
task_categories:
- conversational
- text-classification
- token-classification
- table-question-answering
- question-answering
- zero-shot-classification
- summarization
- feature-extraction
- text-generation
- text2text-generation
pretty_name: Open Orca
size_categories:
- 10M<n<100M
---
## Table of Contents
- [Dataset Summary](#dataset-summary)
- [Dataset Attribution](#dataset-attribution)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
- [Dataset Use](#dataset-use)
  - [Use Cases](#use-cases)
  - [Usage Caveats](#usage-caveats)
  - [Getting Started](#getting-started)

<p><h1>🐋 The Open Orca Dataset! 🐋</h1></p>

<a name="dataset-announcement"></a>

We are thrilled to announce the release of the Open Orca dataset!
This rich collection of augmented FLAN data aligns, as closely as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707).
It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!

<a name="dataset-summary"></a>

## Dataset Summary

The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
It currently contains ~1M GPT-4 completions and ~3.5M GPT-3.5 completions.
It is tabularized in alignment with the distributions presented in the Orca paper and currently represents a partial completion of the full intended dataset, with generation ongoing to expand its scope.
The data is primarily intended for training and evaluating natural language processing models.

<a name="dataset-attribution"></a>

## Dataset Attribution

We would like to give special recognition to the following contributors for their significant efforts and dedication:

- Teknium
- Caseus
- Eric Hartford
- NanoBit
- Pankaj
- Winddude
- Rohan

http://AlignmentLab.ai:

- Autometa
- Entropi
- AtlasUnified
- NeverendingToast
- lightningRalf
- NanoBit
- Caseus

Also, of course, as always, thanks to TheBloke for being the backbone of the whole community.

Many thanks to NanoBit and Caseus, makers of [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), for lending us their expertise on the platform that developed and trained manticore, minotaur, and many others!

We welcome sponsors and collaborators to help us build these models to the scale they deserve. Please reach out via our socials:
http://Alignmentlab.ai https://discord.gg/n9hXaBPWxx

<a name="supported-tasks-and-leaderboards"></a>

## Supported Tasks and Leaderboards

This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
It has been instrumental in producing multiple high-performing model checkpoints, which have performed exceptionally well in our unit testing.
Leaderboard information will be added as it becomes available.

<a name="languages"></a>

## Languages

The data is primarily in English.

<a name="dataset-structure"></a>

## Dataset Structure

<a name="data-instances"></a>

### Data Instances

A data instance in this dataset pairs unaugmented and augmented text, with fields for the original and modified content.

<a name="data-fields"></a>

### Data Fields

The primary fields of interest are 'Original Text' and 'Augmented Text'.
Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
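
As a rough illustration of the two primary fields, the sketch below models one instance as a small Python record. The attribute names `original_text` and `augmented_text`, the `OrcaRecord` class, and the `to_training_pair` helper are all hypothetical stand-ins for 'Original Text' and 'Augmented Text'; the actual column names in the released files may differ, and the sample strings are placeholders rather than rows from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrcaRecord:
    """One instance: the original FLAN text and its augmented counterpart."""

    original_text: str   # hypothetical key for the 'Original Text' field
    augmented_text: str  # hypothetical key for the 'Augmented Text' field


def to_training_pair(record: OrcaRecord) -> tuple[str, str]:
    """Return an (input, target) pair with surrounding whitespace stripped."""
    return record.original_text.strip(), record.augmented_text.strip()


# Placeholder strings for illustration only; not taken from the dataset.
example = OrcaRecord(
    original_text=" What is the capital of France? ",
    augmented_text="The capital of France is Paris. ",
)
pair = to_training_pair(example)
```

Treating the original text as the input and the augmented text as the target is only one plausible way to consume an instance; the metadata fields mentioned above are left out of this sketch.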

<a name="data-splits"></a>

### Data Splits

Details regarding data splits (train/validation/test) will be added as data generation progresses.

<a name="dataset-creation"></a>

## Dataset Creation

<a name="curation-rationale"></a>

### Curation Rationale

The dataset was created to provide a source of augmented text data for researchers and developers.
It is particularly valuable for advancing the capabilities of language models and for producing high-performing model checkpoints.

<a name="source-data"></a>

### Source Data

The data is generated to align with the distributions outlined in the Orca paper.
The original unaugmented data comes from the FLAN dataset.

<a name="dataset-use"></a>

## Dataset Use

<a name="use-cases"></a>

### Use Cases

The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.

<a name="usage-caveats"></a>

### Usage Caveats

Because this dataset is a work in progress, we recommend checking regularly for updates and improvements.
Further, the data should be used in accordance with the guidelines and recommendations outlined in the Orca paper.

<a name="getting-started"></a>

### Getting Started

For information on getting started, please refer to the Hugging Face dataset loading utilities.
Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
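
A minimal sketch of loading the data with the Hugging Face `datasets` library follows. The repository id `Open-Orca/OpenOrca` and the `train` split name are assumptions, since this card states neither; the `stream_examples` helper is hypothetical. Streaming is used so that a few records can be inspected without downloading the full dataset.

```python
from typing import Any, Iterator

# Assumed repository id; this card does not state it explicitly.
REPO_ID = "Open-Orca/OpenOrca"


def stream_examples(
    repo_id: str = REPO_ID,
    split: str = "train",  # assumed split name; the card lists splits as pending
    limit: int = 3,
) -> Iterator[dict[str, Any]]:
    """Lazily yield up to `limit` examples from the Hub without a full download."""
    # Imported here so the module loads even if `datasets` is not installed yet
    # (install it with `pip install datasets`).
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split=split, streaming=True)
    for index, example in enumerate(dataset):
        if index >= limit:
            break
        yield example
```

With network access, `for example in stream_examples(limit=3): print(example)` would print the first few records; printing `example.keys()` is a quick way to confirm the real column names before building on them.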