Update README.md

This commit is contained in:
Andreas Köpf 2023-04-15 14:03:23 +00:00 committed by huggingface-web
parent 12c7304a36
commit 670f5bce5d

@ -134,7 +134,7 @@ of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
This dataset contains demonstrations of human-assistant conversations which were collected
on the [open-assistant.io](https://www.open-assistant.io/) website until April, 12 2023.
Conversations are exported as conversation trees which contain conversation messages as nodes.
Conversations are exported as conversation trees with messages as nodes.
The root node of a conversation tree is called the initial prompt. Each message can have
multiple replies. Nodes with more than one reply can have a `rank` field indicating the
user preference (the most preferred message has rank 0).
@ -143,6 +143,11 @@ All messages have a role which can either be "assistant" or "prompter". The role
conversation threads from prompt to leaf node in a conversation tree are stricly alternating
between "assistant" and "prompter".
Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
details about the data structure and python code to load and write conversation tree and message
jsonl files.
## Main Dataset Files
Data is provided either as nested messages in conversation trees or as flat list of messages.
@ -151,15 +156,22 @@ The type of file can be inferred from the file name extension:
- `.trees.jsonl.gz`: Conversation trees with nested messages
- `.messages.jsonl.gz`: Flat list of messages
Full conversation trees can be reconstructed from a flat messages using the `parent_id`
and `message_id` properties to identify the parent-child relationship of messages. To
select all messages of a single conversation tree `message_tree_id` can be used (only present
in flat messages files).
### Ready for export trees
### Ready For Export Trees
This set of exported trees contains all trees in `ready_for_export` state without spam
and deleted messages. For supervised fine-tuning & reward model training this file is ideal.
```
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
```
### All trees
### All Trees
```
@ -167,6 +179,12 @@ The type of file can be inferred from the file name extension:
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
```
### Supplemental Exports: Spam & Prompts
```
```
### Languages