merge changes

Andreas Koepf 2023-04-15 20:31:33 +02:00
commit bcd04ea561

@@ -134,18 +134,18 @@ Please refer to our [paper](https://www.ykilcher.com/OA_Paper_2023_04_15.pdf) fo
### Dataset Structure
-This dataset contains message trees which each have an inital prompt message as root which can have
-multiple child messages as replies which itself again can have multiple replies.
+This dataset contains message trees. Each message tree has an initial prompt message as the root node,
+which can have multiple child messages as replies, and these child messages can have multiple replies.
-All messages have a role property which can either be "assistant" or "prompter". The roles in
-conversation threads from prompt to leaf node are stricly alternating between "prompter" and "assistant".
+All messages have a role property: this can either be "assistant" or "prompter". The roles in
+conversation threads from prompt to leaf node strictly alternate between "prompter" and "assistant".
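
Review note: the alternation rule described above can be checked mechanically. The following is a minimal sketch, not code from the repository. It assumes that each nested message keeps its children under a `replies` key and that the root prompt carries the role "prompter"; the helper names `iter_threads` and `roles_alternate` are hypothetical.

```python
def iter_threads(message, path=()):
    """Yield every root-to-leaf thread as a list of messages.

    Assumption: child messages are stored under a "replies" key.
    """
    path = path + (message,)
    replies = message.get("replies") or []
    if not replies:
        # A message without replies is a leaf, so the path is a full thread.
        yield list(path)
    for reply in replies:
        yield from iter_threads(reply, path)


def roles_alternate(thread):
    """Return True if roles strictly alternate, starting with "prompter"."""
    expected = "prompter"
    for message in thread:
        if message["role"] != expected:
            return False
        expected = "assistant" if expected == "prompter" else "prompter"
    return True
```

Calling `all(roles_alternate(t) for t in iter_threads(root))` on the root message of a tree would then confirm the property for every thread in that tree.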
-This version of the dataset contains data collected on the [open-assistant.io](https://www.open-assistant.io/) website until April, 12 2023.
+This version of the dataset contains data collected on the [open-assistant.io](https://www.open-assistant.io/) website until April 12 2023.
### JSON Example: Message
-For readability the following JSON examples are shown formatted with indentation on multiple lines.
-Objects are stored without indentation on a single lines in the actual jsonl files.
+For readability, the following JSON examples are shown formatted with indentation on multiple lines.
+Objects are stored without indentation (on single lines) in the actual jsonl files.
```json
{
@@ -179,7 +179,7 @@ Objects are stored without indentation on a single lines in the actual jsonl fil
### JSON Example: Conversation Tree
-For readability only a subset of the message properties is shown here.
+For readability, only a subset of the message properties is shown here.
```json
{
@@ -236,7 +236,7 @@ details about the data structure and Python code to read and write jsonl files c
## Main Dataset Files
Conversation data is provided either as nested messages in trees (extension `.trees.jsonl.gz`)
-or as flat list (table) of messages (extension `.messages.jsonl.gz`).
+or as a flat list (table) of messages (extension `.messages.jsonl.gz`).
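
Review note: both file types are gzip-compressed jsonl, so they can be read with the Python standard library alone. The sketch below is illustrative and is not the reader code the README refers to; the function name `read_jsonl_gz` is hypothetical.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    # "rt" opens the gzip stream in text mode so lines arrive as str.
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For a `.trees.jsonl.gz` file each yielded object would be one message tree, and for a `.messages.jsonl.gz` file one flat message.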
### Ready For Export Trees
@@ -245,7 +245,7 @@ or as flat list (table) of messages (extension `.messages.jsonl.gz`).
2023-04-12_oasst_ready.messages.jsonl.gz 88,838 messages
```
Trees in `ready_for_export` state without spam and deleted messages including message labels.
-The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
+The oasst_ready-trees file usually is sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
### All Trees
@@ -254,7 +254,7 @@ The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SF
2023-04-12_oasst_all.trees.jsonl.gz 66,497 trees with 161,443 total messages
2023-04-12_oasst_all.messages.jsonl.gz 161,443 messages
```
-All trees including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the inital prompt),
+All trees, including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
`aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.
@@ -263,19 +263,19 @@ All trees including those in states `prompt_lottery_waiting` (trees that consist
```
2023-04-12_oasst_spam.messages.jsonl.gz
```
-Messages which were deleted or have a negative review result (`"review_result": false`).
-Beside low quality a frequent reason for message deletion is a wrong language tag.
+These are messages which were deleted or have a negative review result (`"review_result": false`).
+Besides low quality, a frequent reason for message deletion is a wrong language tag.
```
2023-04-12_oasst_prompts.messages.jsonl.gz
```
-All non-deleted initial prompt messages with positive review result (no spam) of trees in `ready_for_export` or `prompt_lottery_waiting` state.
+These are all the kept initial prompt messages with positive review result (no spam) of trees in `ready_for_export` or `prompt_lottery_waiting` state.
### Using the Huggingface Datasets
-While HF datasets is ideal for tabular datasets it is not a natuaral fit for nested data structures like the OpenAssistant conversation trees.
-Nevertheless we make all messages which can alse be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available as parquet train/validation
-split which is directly loadable by the [Huggingface Datasets](https://pypi.org/project/datasets/).
+While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
+Nevertheless, we make all messages which can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available in parquet as train/validation splits.
+These are directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).
To load the oasst1 train & validation splits use:
@@ -290,8 +290,7 @@ The messages appear in depth-first order of the message trees.
Full conversation trees can be reconstructed from the flat messages table by using the `parent_id`
and `message_id` properties to identify the parent-child relationship of messages. The `message_tree_id`
-and `tree_state` properties (only present in flat messages files) can be used to find all
-all messages of a message tree or to select trees by their state.
+and `tree_state` properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.
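
Review note: the reconstruction described in this paragraph can be sketched in a few lines. This is one possible approach under stated assumptions, not the project's own implementation: root messages are assumed to have a null or missing `parent_id`, and the `replies` key and the function name `build_trees` are hypothetical choices made for illustration.

```python
def build_trees(messages):
    """Rebuild nested trees from flat messages.

    Returns a dict mapping message_tree_id to the root message of that tree.
    Assumption: a root message has parent_id set to None or absent.
    """
    # Copy each message and give it an empty child list (hypothetical key name).
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    roots = {}
    for node in nodes.values():
        parent_id = node.get("parent_id")
        parent = nodes.get(parent_id)
        if parent is not None:
            parent["replies"].append(node)
        elif parent_id is None:
            roots[node["message_tree_id"]] = node
        # Messages whose parent is not in the input are left unattached.
    return roots
```

Selecting trees by state then becomes a filter over the roots, for example `[t for t in build_trees(rows).values() if t.get("tree_state") == "ready_for_export"]`. Because children are appended in input order, the file's depth-first message order is preserved within each `replies` list.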
### Languages