Many thanks to NanoBit and Caseus, makers of [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl), for lending us their expertise on the platform that developed and trained manticore, minotaur, and many others!
We have been made aware that Eric Hartford, a team member who chose to depart our team the day prior to the official release of this repo after some internal discussion of our grievances, has claimed to be the sole originator of the Open Orca project and has presented the work as his own.
We wish to clarify that this has been a team effort from the outset; Eric was one of over a dozen data scientists, machine learning engineers, and other specialists involved in the project.
Eric joined the team with the mutual understanding that all members would be treated as equals, receive due credit for their involvement, and have a say in group decisions.
He made snap decisions on behalf of the team that ran contrary to our long-term plans, including announcing the project publicly on his blog and implying that he was the sole originator and project lead.
We attempted to reconcile this internally, but he chose to depart from the team.
As such, we elected to release the data publicly ahead of our original plans.
We have appropriately attributed him and all other contributors, as was originally planned.
We thank Eric for his contributions to the project and wish him well on his individual endeavors.
This repo is the original repo out of which the entire team had agreed to work and publish from the outset.
Eric's repo represents his duplication and augmentation of the team's collective effort, initiated after he had chosen to depart the team.
The Open Orca dataset is a collection of unaugmented and augmented FLAN data, currently comprising ~1M GPT-4 completions and ~3.5M GPT-3.5 completions.
It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
The data is primarily used for training and evaluation in the field of natural language processing.
This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
It has been instrumental in the generation of multiple high-performing model checkpoints which have exhibited exceptional performance in our unit testing.
Further information on leaderboards will be updated as they become available.
A data instance in this dataset represents a paired set of unaugmented and augmented text data, with fields for the original and modified text content.
The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.
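As a starting point, the data can be inspected with the Hugging Face `datasets` library. The sketch below is a minimal example, not the canonical usage: the hub identifier `Open-Orca/OpenOrca` and the assumption that the original and modified text live in separate columns are placeholders that should be checked against the published schema.

```python
# Minimal sketch: stream one instance of the dataset and print its fields so the
# original and augmented text columns can be identified before training.
# The dataset identifier below is an assumption, not confirmed by this README.
from datasets import load_dataset

dataset = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

# Take a single example without downloading the full dataset.
example = next(iter(dataset))

for field, value in example.items():
    preview = str(value)[:80]  # truncate long completions for readability
    print(f"{field}: {preview}")
```

Streaming avoids pulling the full multi-million-row dataset to disk just to check field names and formatting.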