Update README.md
This commit is contained in:
parent
670f5bce5d
commit
56e16363c1
38
README.md
38
README.md
@ -111,6 +111,7 @@ tags:
|
|||||||
- human-feedback
|
- human-feedback
|
||||||
size_categories:
|
size_categories:
|
||||||
- 10K<n<100K
|
- 10K<n<100K
|
||||||
|
pretty_name: OpenAssistant Conversations
|
||||||
---
|
---
|
||||||
|
|
||||||
# OpenAssistant Conversations Dataset (OASST1)
|
# OpenAssistant Conversations Dataset (OASST1)
|
||||||
@ -129,6 +130,8 @@ corpus consisting of 161,443 messages distributed across 66,497 conversation tre
|
|||||||
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
|
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
|
||||||
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
|
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
|
||||||
|
|
||||||
|
Please our paper for further details.
|
||||||
|
|
||||||
### Dataset Structure
|
### Dataset Structure
|
||||||
|
|
||||||
This dataset contains demonstrations of human-assistant conversations which were collected
|
This dataset contains demonstrations of human-assistant conversations which were collected
|
||||||
@ -144,46 +147,51 @@ conversation threads from prompt to leaf node in a conversation tree are stricly
|
|||||||
between "assistant" and "prompter".
|
between "assistant" and "prompter".
|
||||||
|
|
||||||
Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
|
Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
|
||||||
details about the data structure and python code to load and write conversation tree and message
|
details about the data structure and python code to read and write jsonl files containing oasst objects.
|
||||||
jsonl files.
|
|
||||||
|
|
||||||
|
|
||||||
## Main Dataset Files
|
## Main Dataset Files
|
||||||
|
|
||||||
Data is provided either as nested messages in conversation trees or as flat list of messages.
|
Data is provided either as nested messages in conversation trees (extension `.trees.jsonl.gz`)
|
||||||
|
or as flat list of messages (extension `.messages.jsonl.gz`).
|
||||||
|
|
||||||
The type of file can be inferred from the file name extension:
|
Full conversation trees can be reconstructed from flat messages using the `parent_id`
|
||||||
- `.trees.jsonl.gz`: Conversation trees with nested messages
|
and `message_id` properties to identify their parent-child relationship. The `message_tree_id`
|
||||||
- `.messages.jsonl.gz`: Flat list of messages
|
and `tree_state` properties (only present in flat messages files) can be used to find all
|
||||||
|
all messages of a message tree or to select trees by their state.
|
||||||
Full conversation trees can be reconstructed from a flat messages using the `parent_id`
|
|
||||||
and `message_id` properties to identify the parent-child relationship of messages. To
|
|
||||||
select all messages of a single conversation tree `message_tree_id` can be used (only present
|
|
||||||
in flat messages files).
|
|
||||||
|
|
||||||
### Ready For Export Trees
|
### Ready For Export Trees
|
||||||
|
|
||||||
This set of exported trees contains all trees in `ready_for_export` state without spam
|
|
||||||
and deleted messages. For supervised fine-tuning & reward model training this file is ideal.
|
|
||||||
|
|
||||||
```
|
```
|
||||||
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
|
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
|
||||||
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
|
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
|
||||||
```
|
```
|
||||||
|
Trees in `ready_for_export` state without spam and deleted messages including message labels.
|
||||||
|
The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
|
||||||
|
|
||||||
|
|
||||||
### All Trees
|
### All Trees
|
||||||
|
|
||||||
|
|
||||||
```
|
```
|
||||||
2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
|
2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
|
||||||
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
|
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
|
||||||
```
|
```
|
||||||
|
All trees including those in states `prompt_lottery_waiting`, `aborted_low_grade`, `halted_by_moderator`.
|
||||||
|
|
||||||
|
|
||||||
### Supplemental Exports: Spam & Prompts
|
### Supplemental Exports: Spam & Prompts
|
||||||
|
|
||||||
```
|
```
|
||||||
|
2023-04-12_oasst_spam.messages.jsonl.gz
|
||||||
|
```
|
||||||
|
Messages which were deleted or have a negative review result (`"review_result": false`).
|
||||||
|
Beside low quality a frequent reason for message deletion is a wrong language tag.
|
||||||
|
|
||||||
|
|
||||||
```
|
```
|
||||||
|
2023-04-12_oasst_prompts.messages.jsonl.gz
|
||||||
|
```
|
||||||
|
All non-deleted initial prompt messages with positile spam review result of trees in `ready_for_export` or `prompt_lottery_waiting` state.
|
||||||
|
|
||||||
|
|
||||||
### Languages
|
### Languages
|
||||||
|
Loading…
x
Reference in New Issue
Block a user