Update README.md

Bleys 2023-07-01 07:47:21 +00:00 committed by huggingface-web
parent b47acbaf27
commit 87e5279ad7

@@ -47,7 +47,7 @@ It has been instrumental in generating high-performing model checkpoints and ser
 Dataset Summary
-The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
+The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688).
 Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
 It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
 The data is primarily used for training and evaluation in the field of natural language processing.
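For readers of the summary above, here is a minimal sketch of pulling the dataset down with the Hugging Face `datasets` library; the repository id `Open-Orca/OpenOrca`, the `train` split name, and the printed column layout are assumptions about the hosting page rather than something this diff specifies.

```python
# Minimal sketch (assumptions: repo id "Open-Orca/OpenOrca" and a "train" split).
from datasets import load_dataset

ds = load_dataset("Open-Orca/OpenOrca", split="train")

print(ds)     # column names and row count as hosted
print(ds[0])  # a single tabularized completion record
```

Loading the full set materializes several million rows, so passing `streaming=True` to `load_dataset` is an option for iterating without keeping a local copy.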
@@ -146,7 +146,7 @@ The data is generated using techniques in alignment with the distributions outli
 We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
 2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
 These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
-However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
+However, these are a subset of the full FLAN Collection data, and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
 Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is an ongoing work.
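The shortfall figures above (roughly 1.25M entries missing from flan2021 and 200k from t0, about 1.5M combined) can be sanity-checked against the hosted submixes. A rough sketch, assuming the conceptofmind datasets expose a `train` split and that slow streaming access is acceptable:

```python
# Rough sketch: count rows in a hosted FLAN submix without a full download.
# The "train" split name is an assumption; iterating a streamed dataset of
# this size is slow, but it avoids storing millions of rows locally.
from datasets import load_dataset

submix = load_dataset(
    "conceptofmind/flan2021_submix_original", split="train", streaming=True
)
row_count = sum(1 for _ in submix)
print(f"flan2021 submix rows: {row_count}")
```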