merge changes
commit
bcd04ea561
37
README.md
@@ -134,18 +134,18 @@ Please refer to our [paper](https://www.ykilcher.com/OA_Paper_2023_04_15.pdf) fo

 ### Dataset Structure

-This dataset contains message trees which each have an inital prompt message as root which can have
-multiple child messages as replies which itself again can have multiple replies.
+This dataset contains message trees. Each message tree has an initial prompt message as the root node,
+which can have multiple child messages as replies, and these child messages can have multiple replies.

-All messages have a role property which can either be "assistant" or "prompter". The roles in
-conversation threads from prompt to leaf node are stricly alternating between "prompter" and "assistant".
+All messages have a role property: this can either be "assistant" or "prompter". The roles in
+conversation threads from prompt to leaf node strictly alternate between "prompter" and "assistant".

-This version of the dataset contains data collected on the [open-assistant.io](https://www.open-assistant.io/) website until April, 12 2023.
+This version of the dataset contains data collected on the [open-assistant.io](https://www.open-assistant.io/) website until April 12 2023.

 ### JSON Example: Message

-For readability the following JSON examples are shown formatted with indentation on multiple lines.
-Objects are stored without indentation on a single lines in the actual jsonl files.
+For readability, the following JSON examples are shown formatted with indentation on multiple lines.
+Objects are stored without indentation (on single lines) in the actual jsonl files.

 ```json
 {
@@ -179,7 +179,7 @@ Objects are stored without indentation on a single lines in the actual jsonl fil

 ### JSON Example: Conversation Tree

-For readability only a subset of the message properties is shown here.
+For readability, only a subset of the message properties is shown here.

 ```json
 {
@@ -236,7 +236,7 @@ details about the data structure and Python code to read and write jsonl files c
 ## Main Dataset Files

 Conversation data is provided either as nested messages in trees (extension `.trees.jsonl.gz`)
-or as flat list (table) of messages (extension `.messages.jsonl.gz`).
+or as a flat list (table) of messages (extension `.messages.jsonl.gz`).

 ### Ready For Export Trees

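Both file layouts named in the hunk above store one JSON object per line inside a gzip archive, so they can be read the same way. The following is a minimal sketch using only the Python standard library; the helper name `read_jsonl_gz` and the file path in the usage note are assumptions made for illustration and are not taken from the repository.

```python
import gzip
import json
from typing import Any, Iterator


def read_jsonl_gz(path: str) -> Iterator[dict[str, Any]]:
    """Yield one JSON object per non-empty line of a gzip-compressed jsonl file.

    Hypothetical helper for illustration; not part of the repository.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Assuming a local copy of one of the files, `for obj in read_jsonl_gz("2023-04-12_oasst_ready.messages.jsonl.gz"): ...` would iterate over the stored objects lazily, without loading the whole file into memory.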
@@ -245,7 +245,7 @@ or as flat list (table) of messages (extension `.messages.jsonl.gz`).
 2023-04-12_oasst_ready.messages.jsonl.gz 88,838 messages
 ```
 Trees in `ready_for_export` state without spam and deleted messages including message labels.
-The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
+The oasst_ready-trees file usually is sufficient for supervised fine-tuning (SFT) & reward model (RM) training.


 ### All Trees
@@ -254,7 +254,7 @@ The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SF
 2023-04-12_oasst_all.trees.jsonl.gz 66,497 trees with 161,443 total messages
 2023-04-12_oasst_all.messages.jsonl.gz 161,443 messages
 ```
-All trees including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the inital prompt),
+All trees, including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
 `aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.


@@ -263,19 +263,19 @@ All trees including those in states `prompt_lottery_waiting` (trees that consist
 ```
 2023-04-12_oasst_spam.messages.jsonl.gz
 ```
-Messages which were deleted or have a negative review result (`"review_result": false`).
-Beside low quality a frequent reason for message deletion is a wrong language tag.
+These are messages which were deleted or have a negative review result (`"review_result": false`).
+Besides low quality, a frequent reason for message deletion is a wrong language tag.

 ```
 2023-04-12_oasst_prompts.messages.jsonl.gz
 ```
-All non-deleted initial prompt messages with positive review result (no spam) of trees in `ready_for_export` or `prompt_lottery_waiting` state.
+These are all the kept initial prompt messages with positive review result (no spam) of trees in `ready_for_export` or `prompt_lottery_waiting` state.

 ### Using the Huggingface Datasets

-While HF datasets is ideal for tabular datasets it is not a natuaral fit for nested data structures like the OpenAssistant conversation trees.
-Nevertheless we make all messages which can alse be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available as parquet train/validation
-split which is directly loadable by the [Huggingface Datasets](https://pypi.org/project/datasets/).
+While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
+Nevertheless, we make all messages which can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available in parquet as train/validation splits.
+These are directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).

 To load the oasst1 train & validation splits use:

@@ -290,8 +290,7 @@ The messages appear in depth-first order of the message trees.

 Full conversation trees can be reconstructed from the flat messages table by using the `parent_id`
 and `message_id` properties to identify the parent-child relationship of messages. The `message_tree_id`
-and `tree_state` properties (only present in flat messages files) can be used to find all
-all messages of a message tree or to select trees by their state.
+and `tree_state` properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.

 ### Languages

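The reconstruction described in the hunk above can be sketched in Python roughly as follows. Only the property names `message_id`, `parent_id`, `message_tree_id` and `tree_state` come from the README text. The function names, the `replies` key used for nesting, and the assumption that a root prompt has a null (or missing) `parent_id` are illustrative choices, not confirmed details of the dataset.

```python
from typing import Any, Iterable

Message = dict[str, Any]


def build_trees(messages: Iterable[Message]) -> dict[str, Message]:
    """Return {message_tree_id: root message} with children nested under "replies".

    Assumptions for this sketch: root prompts have parent_id None or absent,
    and "replies" is a hypothetical key chosen here for the nested children.
    """
    # Index a shallow copy of every message by its id, so inputs stay untouched.
    nodes: dict[str, Message] = {}
    for m in messages:
        nodes[m["message_id"]] = {**m, "replies": []}

    roots: dict[str, Message] = {}
    for node in nodes.values():
        parent_id = node.get("parent_id")
        if parent_id is None:
            roots[node["message_tree_id"]] = node
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        # A message whose parent is not in the input is left unattached.
    return roots


def select_by_state(trees: dict[str, Message], state: str) -> list[Message]:
    """Keep only the trees whose root message reports the given tree_state."""
    return [root for root in trees.values() if root.get("tree_state") == state]
```

Because children are attached in input order, the order of replies under each parent follows the order of the flat messages file. For example, `select_by_state(build_trees(rows), "ready_for_export")` would return the root messages of the exportable trees, where `rows` is any iterable of flat message objects.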