diff --git a/README.md b/README.md
index 48f154f..8e9f69e 100644
--- a/README.md
+++ b/README.md
@@ -47,7 +47,7 @@ It has been instrumental in generating high-performing model checkpoints and ser
 
 Dataset Summary
 
-The Open Orca dataset is a collection of unaugmented and augmented FLAN data.
+The Open Orca dataset is a collection of augmented [FLAN Collection data](https://arxiv.org/abs/2301.13688).
 Currently ~1M GPT-4 completions, and ~3.2M GPT-3.5 completions.
 It is tabularized in alignment with the distributions presented in the ORCA paper and currently represents a partial completion of the full intended dataset, with ongoing generation to expand its scope.
 The data is primarily used for training and evaluation in the field of natural language processing.
@@ -146,7 +146,7 @@ The data is generated using techniques in alignment with the distributions outli
    We suspect this portion was either undocumented or misrepresented. We have used the ~75K points available.
 2) We used the pre-generated FLAN Collection datasets hosted on HuggingFace under conceptofmind, e.g. [conceptofmind/flan2021](https://huggingface.co/datasets/conceptofmind/flan2021_submix_original).
    These are referenced by the [official FLAN Collection repo](https://github.com/google-research/FLAN/tree/main/flan/v2) as the preferred data source.
-   However, these are a subset of the full [FLAN Collection data](https://arxiv.org/abs/2301.13688), and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
+   However, these are a subset of the full FLAN Collection data, and have less than the required entries for the flan2021 and t0 submixes, by ~1.25M and 200k respectively.
 
 Combined, this gave us ~1.5M fewer datapoints than in the original Orca paper. Completing the set is an ongoing work.
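
For reference (outside the diff itself), a minimal sketch of how one of the conceptofmind submixes referenced above can be pulled with the Hugging Face `datasets` library; the `train` split name and the printed fields are assumptions, not something the README specifies.

```python
# Minimal sketch: loading the FLAN submix referenced in the README.
# Assumes the repo exposes a "train" split; adjust to match the dataset card.
from datasets import load_dataset

flan2021 = load_dataset("conceptofmind/flan2021_submix_original", split="train")

print(flan2021)     # size and column names of the submix
print(flan2021[0])  # one example record
```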