Update README.md

Bleys 2023-07-01 02:43:16 +00:00 committed by huggingface-web
parent 1217df5932
commit bf7d7cc428

@@ -18,8 +18,8 @@ size_categories:
 - 10M<n<100M
 ---
 ## Table of Contents
-- [Dataset Attribution](#dataset-attribution)
 - [Dataset Summary](#dataset-summary)
+- [Dataset Attribution](#dataset-attribution)
 - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
 - [Languages](#languages)
 - [Dataset Structure](#dataset-structure)
@@ -37,12 +37,25 @@ size_categories:
 <p><h1>🐋 The Open Orca Dataset! 🐋</h1></p>
-<a name="dataset-attribution"></a>
+<a name="dataset-announcement"></a>
 We are thrilled to announce the release of the Open Orca dataset!
 This rich collection of augmented FLAN data aligns, as best as possible, with the distributions outlined in the [Orca paper](https://arxiv.org/abs/2306.02707).
 It has been instrumental in generating high-performing model checkpoints and serves as a valuable resource for all NLP researchers and developers!
+<a name="dataset-summary"></a>
+Dataset Summary
+The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
+Currently ~1M GPT-4 completions, and ~3.5M GPT-3.5 completions.
+It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
+The data is primarily used for training and evaluation in the field of natural language processing.
+<a name="dataset-attribution"></a>
+Dataset Attribution
 We would like to give special recognition to the following contributors for their significant efforts and dedication:
@@ -70,15 +83,6 @@ Many thanks to NanoBit and Caseus, makers of [Axolotl](https://github.com/OpenAc
 We are welcoming sponsors or collaborators to help us build these models to the scale they deserve. Please reach out via our socials:
 http://Alignmentlab.ai https://discord.gg/n9hXaBPWxx
-<a name="dataset-summary"></a>
-Dataset Summary
-The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
-Currently ~1M GPT-4 completions, and ~3.5M GPT-3.5 completions.
-It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
-The data is primarily used for training and evaluation in the field of natural language processing.
 <a name="supported-tasks-and-leaderboards"></a>
 Supported Tasks and Leaderboards