Update README.md

2023-07-01 07:47:21 +00:00 · 87e5279ad7
commit 87e5279ad7
parent b47acbaf27
1 changed file with 2 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -47,7 +47,7 @@ It has been instrumental in generating high-performing model checkpoints and ser

 Dataset Summary

-The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
+The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688).
 Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
 It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
 The data is primarily used for training and evaluation in the field of natural language processing.
@@ -146,7 +146,7 @@ The data is generated using techniques in alignment with the distributions outli
 We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
 2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
 These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
- However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
+ However, these are a subset of the full FLAN Collection data, and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.

 Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is an ongoing work.