merge changes

2023-04-15 20:31:33 +02:00
commit bcd04ea561
parent ce7ebe148f 5498d94a6e
1 changed file with 18 additions and 19 deletions
--- a/README.md
+++ b/README.md
@@ -134,18 +134,18 @@ Please refer to our [paper](https://www.ykilcher.com/OA_Paper_2023_04_15.pdf) fo

 ### Dataset Structure

-This dataset contains message trees which each have an inital prompt message as root which can have
-multiple child messages as replies which itself again can have multiple replies. 
+This dataset contains message trees. Each message tree has an initial prompt message as the root node, 
+which can have multiple child messages as replies, and these child messages can have multiple replies. 

-All messages have a role property which can either be "assistant" or "prompter". The roles in 
-conversation threads from prompt to leaf node are stricly alternating between "prompter" and "assistant".
+All messages have a role property: this can either be "assistant" or "prompter". The roles in 
+conversation threads from prompt to leaf node strictly alternate between "prompter" and "assistant".

-This version of the dataset contains data collected on the [open-assistant.io](https://www.open-assistant.io/) website until April, 12 2023.
+This version of the dataset contains data collected on the [open-assistant.io](https://www.open-assistant.io/) website until April 12, 2023.

 ### JSON Example: Message

-For readability the following JSON examples are shown formatted with indentation on multiple lines.
-Objects are stored without indentation on a single lines in the actual jsonl files.
+For readability, the following JSON examples are shown formatted with indentation on multiple lines.
+Objects are stored without indentation (on single lines) in the actual jsonl files.

 ```json
 {
@@ -179,7 +179,7 @@ Objects are stored without indentation on a single lines in the actual jsonl fil

 ### JSON Example: Conversation Tree

-For readability only a subset of the message properties is shown here.
+For readability, only a subset of the message properties is shown here.

 ```json
 {
@@ -236,7 +236,7 @@ details about the data structure and Python code to read and write jsonl files c
 ## Main Dataset Files

 Conversation data is provided either as nested messages in trees (extension `.trees.jsonl.gz`) 
-or as flat list (table) of messages (extension `.messages.jsonl.gz`).
+or as a flat list (table) of messages (extension `.messages.jsonl.gz`).
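
As a minimal sketch (not the project's own loader), either file type can be read line by line with Python's standard `gzip` and `json` modules, assuming each line holds exactly one JSON object as stated in the JSON example section. The helper name `read_jsonl_gz` is hypothetical and only used for illustration.

```python
import gzip
import json
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped jsonl file."""
    # "rt" opens the gzip stream in text mode so we can iterate over lines.
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)
```

For a `.messages.jsonl.gz` file each yielded object would be a single message; for a `.trees.jsonl.gz` file it would be a whole nested tree.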

 ### Ready For Export Trees

@@ -245,7 +245,7 @@ or as flat list (table) of messages (extension `.messages.jsonl.gz`).
 2023-04-12_oasst_ready.messages.jsonl.gz    88,838 messages
 ```
 Trees in `ready_for_export` state without spam and deleted messages including message labels.
-The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
+The oasst_ready-trees file is usually sufficient for supervised fine-tuning (SFT) & reward model (RM) training.


 ### All Trees
@@ -254,7 +254,7 @@ The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SF
 2023-04-12_oasst_all.trees.jsonl.gz         66,497 trees with 161,443 total messages
 2023-04-12_oasst_all.messages.jsonl.gz     161,443 messages
 ```
-All trees including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the inital prompt),
+All trees, including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
 `aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.


@@ -263,19 +263,19 @@ All trees including those in states `prompt_lottery_waiting` (trees that consist
 ```
 2023-04-12_oasst_spam.messages.jsonl.gz
 ```
-Messages which were deleted or have a negative review result (`"review_result": false`).
-Beside low quality a frequent reason for message deletion is a wrong language tag.
+These are messages which were deleted or have a negative review result (`"review_result": false`).
+Besides low quality, a frequent reason for message deletion is a wrong language tag.

 ```
 2023-04-12_oasst_prompts.messages.jsonl.gz
 ```
-All non-deleted initial prompt messages with positive review result (no spam) of trees in `ready_for_export` or `prompt_lottery_waiting` state.
+These are all non-deleted initial prompt messages with a positive review result (no spam) from trees in the `ready_for_export` or `prompt_lottery_waiting` state.

 ### Using the Huggingface Datasets

-While HF datasets is ideal for tabular datasets it is not a natuaral fit for nested data structures like the OpenAssistant conversation trees.
-Nevertheless we make all messages which can alse be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available as parquet train/validation 
-split which is directly loadable by the [Huggingface Datasets](https://pypi.org/project/datasets/).
+While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
+Nevertheless, we make all messages which can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available in parquet as train/validation splits. 
+These are directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).

 To load the oasst1 train & validation splits use:

@@ -290,8 +290,7 @@ The messages appear in depth-first order of the message trees.

 Full conversation trees can be reconstructed from the flat messages table by using the `parent_id` 
 and `message_id` properties to identify the parent-child relationship of messages. The `message_tree_id` 
-and `tree_state` properties (only present in flat messages files) can be used to find all
-all messages of a message tree or to select trees by their state.
+and `tree_state` properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.
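
The reconstruction described above could look roughly like the following sketch. It assumes each flat message is a dict with `message_id`, `parent_id` (taken to be `None` for the initial prompt) and `message_tree_id`. The function name `build_trees` and the `replies` key used to hold child messages are assumptions made for this illustration.

```python
def build_trees(messages):
    """Rebuild nested trees from flat message dicts.

    Returns a dict mapping message_tree_id to its root (prompt) node.
    Each node is a copy of the message with an added "replies" list
    (an assumed key name, chosen for this sketch).
    """
    # Index every message by its id and give it an empty list of children.
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    roots = {}
    for node in nodes.values():
        parent_id = node.get("parent_id")
        if parent_id is None:
            # A message without a parent is the initial prompt of its tree.
            roots[node["message_tree_id"]] = node
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        # Messages whose parent is missing from the input are skipped.
    return roots
```

Filtering by `tree_state` before calling such a helper would restrict the result to trees in a given state.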

 ### Languages