diff --git a/README.md b/README.md
index 47c6af4..9a5e487 100644
--- a/README.md
+++ b/README.md
@@ -18,8 +18,8 @@ size_categories: - 10M

🐋 The Open Orca Dataset! 🐋

-
+
 We are thrilled to announce the release of the Open Orca dataset! This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707). It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
+
+
+Dataset Summary
+
+The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
+Currently ~1M GPT-4 completions, and ~3.5M GPT-3.5 completions.
+It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
+The data is primarily used for training and evaluation in the field of natural language processing.
+
+
+
+Dataset Attribution
+
 We would like to give special recognition to the following contributors for their significant efforts and dedication:
@@ -70,15 +83,6 @@ Many thanks to NanoBit and Caseus, makers of [Axolotl](https://github.com/OpenAc
 We are welcoming sponsors or collaborators to help us build these models to the scale they deserve. Please reach out via our socials: http://Alignmentlab.ai https://discord.gg/n9hXaBPWxx
-
-
-Dataset Summary
-
-The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
-Currently ~1M GPT-4 completions, and ~3.5M GPT-3.5 completions.
-It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
-The data is primarily used for training and evaluation in the field of natural language processing.
-
 Supported Tasks and Leaderboards