Update README.md

2023-04-15 15:47:38 +00:00 · 2023-04-15 15:47:38 +00:00 · c5ca5f1f7c
commit c5ca5f1f7c
parent 06c4be1951
1 changed file with 23 additions and 7 deletions
--- a/README.md
+++ b/README.md
@@ -110,7 +110,7 @@ language:
 tags:
 - human-feedback
 size_categories:
- 10K<n<100K
+- 100K<n<1M
 pretty_name: OpenAssistant Conversations
 ---
@@ -249,8 +249,8 @@ all messages of a message tree or to select trees by their state.
 ### Ready For Export Trees
 ```
-2023-04-12_oasst_ready.trees.jsonl.gz       10364 trees with 88838 total messages
+2023-04-12_oasst_ready.trees.jsonl.gz       10,364 trees with 88,838 total messages
-2023-04-12_oasst_ready.messages.jsonl.gz    88838 messages
+2023-04-12_oasst_ready.messages.jsonl.gz    88,838 messages
 ```
 Trees in `ready_for_export` state, without spam and deleted messages, including message labels.
 The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
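As a minimal sketch of how a trees file could be consumed, the snippet below streams a gzipped JSON Lines file and counts trees and messages. It assumes each line holds one JSON tree object whose root message sits under a `prompt` key and whose children are nested under `replies`. These key names and the helper names `iter_trees`, `count_messages`, and `summarize` are illustrative assumptions, not something this diff confirms.

```python
import gzip
import json


def iter_trees(path):
    """Yield one parsed tree per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_messages(message):
    """Count a message plus all nested replies (assumed key: 'replies')."""
    return 1 + sum(count_messages(r) for r in message.get("replies", []))


def summarize(path):
    """Return (number of trees, total number of messages) for a trees file."""
    n_trees = 0
    n_messages = 0
    for tree in iter_trees(path):
        n_trees += 1
        n_messages += count_messages(tree["prompt"])  # assumed root key
    return n_trees, n_messages
```

If the assumed layout holds, calling `summarize("2023-04-12_oasst_ready.trees.jsonl.gz")` on a local copy of the file should reproduce the tree and message counts listed above.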
@@ -259,8 +259,8 @@ The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SF
 ### All Trees
 ```
-2023-04-12_oasst_all.trees.jsonl.gz         66497 trees with 161443 total messages
+2023-04-12_oasst_all.trees.jsonl.gz         66,497 trees with 161,443 total messages
-2023-04-12_oasst_all.messages.jsonl.gz     161443 messages
+2023-04-12_oasst_all.messages.jsonl.gz     161,443 messages
 ```
 All trees including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
 `aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.
@@ -274,12 +274,28 @@ All trees including those in states `prompt_lottery_waiting` (trees that consist
 Messages which were deleted or have a negative review result (`"review_result": false`).
 Besides low quality, a frequent reason for message deletion is a wrong language tag.
 ```
 2023-04-12_oasst_prompts.messages.jsonl.gz
 ```
 All non-deleted initial prompt messages with a positive spam review result from trees in `ready_for_export` or `prompt_lottery_waiting` state.
 ### Using the Huggingface Datasets
 While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
 Nevertheless, we make all messages which can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available as a parquet train/validation
 split which is directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).
 To load the oasst1 train & validation splits use:
 ```python
 from datasets import load_dataset
 ds = load_dataset("OpenAssistant/oasst1")
 train = ds['train']      # len(train)=84437 (95%)
 val = ds['validation']   # len(val)=4401 (5%)
 ```
 The messages appear in depth-first order of the message trees.
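Because the parquet splits are flat, conversation threads have to be rebuilt from the individual rows. The following is a minimal sketch of one way to do that. It assumes each row is a dict with a `message_id` and a `parent_id` field, where `parent_id` is `None` for an initial prompt. Both field names and the helpers `build_children_index` and `iter_threads` are assumptions made for illustration and are not taken from this diff.

```python
from collections import defaultdict


def build_children_index(messages):
    """Map each parent_id to its child messages, preserving input order."""
    children = defaultdict(list)
    for msg in messages:
        # Assumed convention: root prompts carry parent_id None.
        children[msg.get("parent_id")].append(msg)
    return children


def iter_threads(messages):
    """Yield every root-to-leaf conversation as a list of messages."""
    children = build_children_index(messages)

    def walk(msg, path):
        path = path + [msg]
        replies = children.get(msg["message_id"], [])
        if not replies:
            yield path  # reached a leaf: emit the full thread
        for reply in replies:
            yield from walk(reply, path)

    for root in children.get(None, []):
        yield from walk(root, [])
```

With the `train` split loaded as shown above, `threads = list(iter_threads(train))` would collect one list per root-to-leaf path, provided the rows expose the assumed fields. Since sibling replies are kept in input order, the depth-first ordering of the rows carries over to the resulting threads.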
 ### Languages
@@ -332,4 +348,4 @@ OpenAssistant Conversations incorporates 35 different languages with a distribut
 - Discord [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
 - GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
- E-Mail: [open-assistent@laion.ai](mailto:open-assistent@laion.ai) (yes, with e)
+- E-Mail: [open-assistent@laion.ai](mailto:open-assistent@laion.ai) (yes, with e)