diff --git a/README.md b/README.md
index 48f154f..8e9f69e 100644
--- a/README.md
+++ b/README.md
@@ -47,7 +47,7 @@ It has been instrumental in generating high-performing model checkpoints and ser
 
 Dataset Summary
 
-The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
+The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688).
 Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
 It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
 The data is primarily used for training and evaluation in the field of natural language processing.
@@ -146,7 +146,7 @@ The data is generated using techniques in alignment with the distributions outli
    We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
 2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
    These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
-   However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
+   However, these are a subset of the full FLAN Collection data, and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
 
 Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is an ongoing work.
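
For reference (outside the diff itself), a minimal sketch of how one of the conceptofmind submixes referenced above can be pulled with the Hugging Face `datasets` library; the `train` split name and the printed fields are assumptions, not something the README specifies.

```python
# Minimal sketch: loading the FLAN submix referenced in the README.
# Assumes the repo exposes a "train" split; adjust to match the dataset card.
from datasets import load_dataset

flan2021 = load_dataset("conceptofmind/flan2021_submix_original", split="train")

print(flan2021)     # size and column names of the submix
print(flan2021[0])  # one example record
```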