Update README.md

This commit is contained in:
Andreas Köpf 2023-04-15 13:15:54 +00:00 committed by huggingface-web
parent dee0b5f871
commit a2c461b87b

@ -129,19 +129,37 @@ corpus consisting of 161,443 messages distributed across 66,497 conversation tre
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
The dataset was exported from the open-assistant.io production database on April, 12 2023.
### Dataset Structure
Thes dataset contains demonstrations of of human-assistant conversations that were collected
on the open-assistant.io website.
This dataset contains demonstrations of human-assistant conversations which were collected
on the open-assistant.io website until April, 12 2023.
All conversations are exported as message trees which contain conversation messages nodes. Each message has a
role which can either be "assistant" or "prompter". The root node of a message tree is called the initial prompt.
Nodes with at least two replies of completed trees have a `rank` field which indicates the users' preference consensus.
The lower the rank the better the message.
Conversations are exported as message trees which contain conversation messages as nodes.
The root node of a message tree is called the initial prompt. Each message node can have
multiple replies. Nodes with more than one reply can have a `rank` field indicating the
order among the siblings sorted by user preference (the most preferred message has rank 0).
All messages have a role which can either be "assistant" or "prompter". The roles in
conversation threads from prompt to leaf node in a message tree are stricly alternating
between "assistant" and "prompter".
## Main Dataset Files
Data is provided either as nested as a message tree or as flat list (table) of messages.
Names of files containing message trees end in `.trees.jsonl.gz` while files containing
a list of messages with a file name ending in `.messages.jsonl.gz`.
Mesages
```
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
```
```
2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
```