Update README.md

Andreas Köpf 2023-04-15 14:27:59 +00:00 committed by huggingface-web
parent 670f5bce5d
commit 56e16363c1

@ -111,6 +111,7 @@ tags:
- human-feedback
size_categories:
- 10K<n<100K
pretty_name: OpenAssistant Conversations
---
# OpenAssistant Conversations Dataset (OASST1)
@ -129,6 +130,8 @@ corpus consisting of 161,443 messages distributed across 66,497 conversation tre
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
Please refer to our paper for further details.
### Dataset Structure
This dataset contains demonstrations of human-assistant conversations which were collected
@ -144,46 +147,51 @@ conversation threads from prompt to leaf node in a conversation tree are stricly
between "assistant" and "prompter".
Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
details about the data structure and python code to load and write conversation tree and message
jsonl files.
details about the data structure and python code to read and write jsonl files containing oasst objects.
## Main Dataset Files
Data is provided either as nested messages in conversation trees or as flat list of messages.
Data is provided either as nested messages in conversation trees (extension `.trees.jsonl.gz`)
or as a flat list of messages (extension `.messages.jsonl.gz`).
The type of file can be inferred from the file name extension:
- `.trees.jsonl.gz`: Conversation trees with nested messages
- `.messages.jsonl.gz`: Flat list of messages
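Both file types are gzip-compressed JSON Lines files, so they can be read with the Python standard library alone. The following is a minimal sketch, not the loader shipped in oasst-data; the helper name `read_jsonl_gz` is hypothetical, and it assumes one JSON object per line.

```python
import gzip
import json
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one JSON object per non-empty line of a gzip-compressed JSONL file."""
    # Hypothetical helper for illustration; assumes UTF-8 text, one object per line.
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `list(read_jsonl_gz("2023-04-12_oasst_ready.messages.jsonl.gz"))` would load all messages of that file into memory, assuming the file has been downloaded to the working directory.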
Full conversation trees can be reconstructed from a flat messages using the `parent_id`
and `message_id` properties to identify the parent-child relationship of messages. To
select all messages of a single conversation tree `message_tree_id` can be used (only present
in flat messages files).
Full conversation trees can be reconstructed from flat messages using the `parent_id`
and `message_id` properties to identify their parent-child relationship. The `message_tree_id`
and `tree_state` properties (only present in flat messages files) can be used to find
all messages of a message tree or to select trees by their state.
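The reconstruction described above can be sketched as follows. This is an illustrative outline rather than the oasst-data implementation: the function name `build_trees` is hypothetical, the child list key `replies` and the shortened sample records are assumptions, and only the `message_id`, `parent_id`, `message_tree_id` and `tree_state` properties are taken from the description.

```python
def build_trees(messages):
    """Nest flat messages into trees; returns {message_tree_id: root message}."""
    # Copy every message so the input stays untouched; "replies" (an assumed
    # key name) collects the children of each node.
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    roots = {}
    for node in nodes.values():
        parent_id = node.get("parent_id")
        if parent_id is None:
            # A message without a parent is the initial prompt of its tree.
            roots[node["message_tree_id"]] = node
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        # Messages whose parent is absent from the input are left out.
    return roots


# Made-up sample records with shortened ids, for illustration only.
example = [
    {"message_id": "a", "parent_id": None, "message_tree_id": "t1",
     "tree_state": "ready_for_export"},
    {"message_id": "b", "parent_id": "a", "message_tree_id": "t1",
     "tree_state": "ready_for_export"},
    {"message_id": "c", "parent_id": "a", "message_tree_id": "t1",
     "tree_state": "ready_for_export"},
]

# Select messages by tree state first, then nest them.
ready = [m for m in example if m["tree_state"] == "ready_for_export"]
trees = build_trees(ready)
```

Building the lookup table first means the input does not need to be sorted so that parents precede their children.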
### Ready For Export Trees
This set of exported trees contains all trees in `ready_for_export` state without spam
and deleted messages. For supervised fine-tuning & reward model training this file is ideal.
```
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
```
Trees in `ready_for_export` state without spam and deleted messages including message labels.
The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
### All Trees
```
2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
```
All trees including those in states `prompt_lottery_waiting`, `aborted_low_grade`, `halted_by_moderator`.
### Supplemental Exports: Spam & Prompts
```
2023-04-12_oasst_spam.messages.jsonl.gz
```
Messages which were deleted or have a negative review result (`"review_result": false`).
Besides low quality, a frequent reason for message deletion is a wrong language tag.
```
2023-04-12_oasst_prompts.messages.jsonl.gz
```
All non-deleted initial prompt messages with a positive spam review result from trees in `ready_for_export` or `prompt_lottery_waiting` state.
### Languages
@ -231,4 +239,4 @@ OpenAssistant Conversations incorporates 35 different languages with a distribut
<li>Esperanto: 59</li>
<li>Slovak: 19</li>
</ul>
</details>
</details>