Update README.md
parent 094c8b4f29
commit 19c9ee4153
README.md (11 lines changed)
@@ -74,6 +74,7 @@ Not for Falcon 40b, it won't!
<a name="dataset-summary"></a>

Dataset Summary

The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
It currently contains ~1M GPT-4 completions and ~3.5M GPT-3.5 completions.
It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
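
As a rough illustration of the scale described in the summary, the sketch below adds the two approximate completion counts and computes the GPT-4 share. The counts are the rounded figures quoted in the text, and the variable names are illustrative only.

```python
# Approximate completion counts quoted in the dataset summary (rounded figures).
gpt4_completions = 1_000_000
gpt35_completions = 3_500_000

# Combined size of the dataset at the time the summary was written.
total = gpt4_completions + gpt35_completions

# Fraction of all completions that came from GPT-4.
gpt4_share = gpt4_completions / total

print(f"total completions: ~{total / 1e6:.1f}M")
print(f"GPT-4 share: ~{gpt4_share:.0%}")
```

Because generation is ongoing, these figures are a snapshot and will drift as the dataset grows.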
@@ -82,6 +83,7 @@ The data is primarily used for training and evaluation in the field of natural l
<a name="supported-tasks-and-leaderboards"></a>

Supported Tasks and Leaderboards

This dataset supports a range of tasks including language modeling, text generation, and text augmentation.
It has been instrumental in generating multiple high-performing model checkpoints, which have performed exceptionally well in our unit testing.
Further information on leaderboards will be added as it becomes available.
@@ -89,6 +91,7 @@ Further information on leaderboards will be updated as they become available.
<a name="languages"></a>

Languages

The data is primarily in English.

<a name="dataset-structure"></a>
@@ -98,17 +101,20 @@ Dataset Structure
<a name="data-instances"></a>

Data Instances

A data instance in this dataset pairs an unaugmented text with its augmented counterpart, with fields for the original and the modified text content.

<a name="data-fields"></a>

Data Fields

The primary fields of interest are 'Original Text' and 'Augmented Text'.
Other metadata fields, as well as specifics of the augmentation process used for each instance, are also included.
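
To make the field layout concrete, here is a minimal sketch of how one record might be represented in Python. Only the two field names 'Original Text' and 'Augmented Text' come from the description above; the `OrcaRecord` class, the `record_from_row` helper, and the `augmentation_model` metadata key are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Mapping

# Field names taken from the README text.
ORIGINAL_KEY = "Original Text"
AUGMENTED_KEY = "Augmented Text"


@dataclass
class OrcaRecord:
    """Hypothetical in-memory view of one data instance."""
    original_text: str
    augmented_text: str
    # Any other columns (metadata, augmentation details) are kept as-is.
    metadata: Dict[str, Any] = field(default_factory=dict)


def record_from_row(row: Mapping[str, Any]) -> OrcaRecord:
    """Split a raw row into the two primary fields plus remaining metadata."""
    metadata = {
        key: value
        for key, value in row.items()
        if key not in (ORIGINAL_KEY, AUGMENTED_KEY)
    }
    return OrcaRecord(row[ORIGINAL_KEY], row[AUGMENTED_KEY], metadata)


# Made-up row; "augmentation_model" is an assumed metadata key, not a documented one.
example = record_from_row({
    "Original Text": "What is 2 + 2?",
    "Augmented Text": "2 + 2 equals 4 because ...",
    "augmentation_model": "gpt-4",
})
print(example.original_text)
print(example.metadata)
```

Keeping the unknown columns in a plain dictionary avoids guessing at a schema that the README does not spell out.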

<a name="data-splits"></a>

Data Splits

Details regarding data splits (train/test/validate) will be updated as the data generation progresses.

<a name="dataset-creation"></a>
@@ -118,12 +124,14 @@ Dataset Creation
<a name="curation-rationale"></a>

Curation Rationale

The dataset was created to provide a source of augmented text data for researchers and developers.
It is particularly valuable in advancing the capabilities of language models and fostering the generation of high-performing model checkpoints.

<a name="source-data"></a>

Source Data

The data is generated using techniques in alignment with the distributions outlined in the ORCA paper.
The original unaugmented data comes from the FLAN dataset.
@@ -134,16 +142,19 @@ Dataset Use
<a name="use-cases"></a>

Use Cases

The dataset can be used for tasks related to language understanding, natural language processing, machine learning model training, and model performance evaluation.

<a name="usage-caveats"></a>

Usage Caveats

Given that this is a work-in-progress dataset, it's recommended to regularly check for updates and improvements.
Further, the data should be used in accordance with the guidelines and recommendations outlined in the ORCA paper.

<a name="getting-started"></a>

Getting Started

For information on getting started, please refer to the Hugging Face dataset loading utilities.
Regular updates and data generation progress can be monitored through the Open Orca repository on Hugging Face.
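
A minimal sketch of loading the data with the Hugging Face `datasets` library is shown below. The repository id `Open-Orca/OpenOrca` and the `train` split name are assumptions (the README says split details are still to be confirmed), and `load_open_orca` and `preview` are hypothetical helper names. Streaming is used so that the first rows can be inspected without downloading every file.

```python
from itertools import islice
from typing import Any, Iterable, List


def load_open_orca(repo_id: str = "Open-Orca/OpenOrca",  # assumed repository id
                   split: str = "train",                  # assumed split name
                   streaming: bool = True):
    """Open the dataset through the Hugging Face `datasets` library."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset  # pip install datasets
    return load_dataset(repo_id, split=split, streaming=streaming)


def preview(rows: Iterable[Any], n: int = 3) -> List[Any]:
    """Return the first `n` rows of any iterable of records."""
    return list(islice(rows, n))


if __name__ == "__main__":
    # Requires network access and the `datasets` package.
    for row in preview(load_open_orca(), n=2):
        print(row)
```

If the repository id or split differs from these assumptions, pass the correct values as arguments rather than editing the helper.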