Update README.md
parent b47acbaf27
commit 87e5279ad7
@@ -47,7 +47,7 @@ It has been instrumental in generating high-performing model checkpoints and ser
Dataset Summary
-The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
+The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688).
It currently contains ~1M GPT-4 completions and ~3.2M GPT-3.5 completions.
It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
The data is primarily used for training and evaluation in the field of natural language processing.
@@ -146,7 +146,7 @@ The data is generated using techniques in alignment with the distributions outli
We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
-However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
+However, these are a subset of the full FLAN Collection data, and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is ongoing work.