Update README.md

Bleys 2023-06-29 20:02:22 +00:00 committed by huggingface-web
parent 094c8b4f29
commit 19c9ee4153

@@ -74,6 +74,7 @@ Not for Falcon 40b, it won't!
<a name="dataset-summary"></a>
Dataset Summary
The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
It currently contains ~1M GPT-4 completions and ~3.5M GPT-3.5 completions.
It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
@@ -82,6 +83,7 @@ The data is primarily used for training and evaluation in the field of natural l
<a name="supported-tasks-and-leaderboards"></a>
Supported Tasks and Leaderboards
This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
It has been instrumental in the generation of multiple high-performing model checkpoints, which have exhibited exceptional performance in our unit testing.
Further information on leaderboards will be updated as they become available.
@@ -89,6 +91,7 @@ Further information on leaderboards will be updated as they become available.
<a name="languages"></a>
Languages
The data is primarily in English.
<a name="dataset-structure"></a>
@@ -98,17 +101,20 @@ Dataset Structure
<a name="data-instances"></a>
Data Instances
A data instance pairs unaugmented and augmented text data, with fields for the original and the modified text content.
<a name="data-fields"></a>
Data Fields
The primary fields of interest are 'Original Text' and 'Augmented Text'.
Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
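As a rough illustration of the field layout described above, the sketch below wraps one row in a small record type. It assumes the column keys are literally `Original Text` and `Augmented Text`, which the released files may not use verbatim; `OrcaRecord` and `from_row` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class OrcaRecord:
    """Hypothetical container for one data instance."""
    original_text: str
    augmented_text: str
    # Everything that is not one of the two primary fields.
    metadata: Dict[str, Any] = field(default_factory=dict)


def from_row(row: Dict[str, Any]) -> OrcaRecord:
    """Split a raw row into the two primary fields plus remaining metadata.

    Assumes the keys 'Original Text' and 'Augmented Text' exist in the row.
    """
    rest = dict(row)  # copy so the caller's dict is not mutated
    original = rest.pop("Original Text")
    augmented = rest.pop("Augmented Text")
    return OrcaRecord(original, augmented, metadata=rest)
```

If the actual column names differ, only the two keys inside `from_row` should need to change.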
<a name="data-splits"></a>
Data Splits
Details regarding data splits (train/validation/test) will be updated as the data generation progresses.
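Until official splits are published, one option is to derive a reproducible split locally. The sketch below is not part of the dataset: it hashes a per-record key into one of 100 buckets, so the same record always lands in the same split. The function name `assign_split`, the default percentages, and the choice of key are all assumptions made for this example.

```python
import hashlib


def assign_split(key: str, validation_pct: int = 5, test_pct: int = 5) -> str:
    """Map a stable record key to 'train', 'validation', or 'test'.

    The key is hashed with SHA-256 and reduced to a bucket in [0, 100),
    so the assignment does not depend on row order or random seeds.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + validation_pct:
        return "validation"
    return "train"
```

A record identifier, or the original text itself, could serve as the key. Because assignment depends only on the key, records added by later generation runs do not move existing records between splits.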
<a name="dataset-creation"></a>
@@ -118,12 +124,14 @@ Dataset Creation
<a name="curation-rationale"></a>
Curation Rationale
The dataset was created to provide a source of augmented text data for researchers and developers.
It is particularly valuable in advancing the capabilities of language models and fostering the generation of high-performing model checkpoints.
<a name="source-data"></a>
Source Data
The data is generated using techniques in alignment with the distributions outlined in the ORCA paper.
The original unaugmented data comes from the FLAN dataset.
@@ -134,16 +142,19 @@ Dataset Use
<a name="use-cases"></a>
Use Cases
The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.
<a name="usage-caveats"></a>
Usage Caveats
Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements.
Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.
<a name="getting-started"></a>
Getting Started
For information on getting started, please refer to the Hugging Face dataset loading utilities.
Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
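A minimal loading sketch using the `datasets` library is shown below. The repository id `Open-Orca/OpenOrca` and the `train` split name are assumptions, not stated in this README, so check the repository page for the actual values. Streaming is used so the full dataset is not downloaded just to inspect a few rows. The helper names `peek` and `stream_open_orca` are hypothetical.

```python
from itertools import islice

# Assumed repository id; verify against the Hugging Face repository page.
REPO_ID = "Open-Orca/OpenOrca"


def peek(records, n=3):
    """Return the first n items from any iterable of records."""
    return list(islice(records, n))


def stream_open_orca(repo_id=REPO_ID, split="train"):
    """Open the dataset in streaming mode (requires `pip install datasets`)."""
    # Imported lazily so `peek` stays usable without the library installed.
    from datasets import load_dataset
    return load_dataset(repo_id, split=split, streaming=True)


if __name__ == "__main__":
    # Needs network access; prints the first few rows as plain dicts.
    for row in peek(stream_open_orca()):
        print(row)
```

Since the dataset is still being generated, row counts and available columns may change between runs.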