Update README.md

This commit is contained in:
Andreas Köpf 2023-04-15 13:26:01 +00:00 committed by huggingface-web
parent a2c461b87b
commit 12c7304a36

@ -132,29 +132,35 @@ of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
### Dataset Structure ### Dataset Structure
This dataset contains demonstrations of human-assistant conversations which were collected This dataset contains demonstrations of human-assistant conversations which were collected
on the open-assistant.io website until April, 12 2023. on the [open-assistant.io](https://www.open-assistant.io/) website until April, 12 2023.
Conversations are exported as message trees which contain conversation messages as nodes. Conversations are exported as conversation trees which contain conversation messages as nodes.
The root node of a message tree is called the initial prompt. Each message node can have The root node of a conversation tree is called the initial prompt. Each message can have
multiple replies. Nodes with more than one reply can have a `rank` field indicating the multiple replies. Nodes with more than one reply can have a `rank` field indicating the
order among the siblings sorted by user preference (the most preferred message has rank 0). user preference (the most preferred message has rank 0).
All messages have a role which can either be "assistant" or "prompter". The roles in All messages have a role which can either be "assistant" or "prompter". The roles in
conversation threads from prompt to leaf node in a message tree are stricly alternating conversation threads from prompt to leaf node in a conversation tree are stricly alternating
between "assistant" and "prompter". between "assistant" and "prompter".
## Main Dataset Files ## Main Dataset Files
Data is provided either as nested as a message tree or as flat list (table) of messages. Data is provided either as nested messages in conversation trees or as flat list of messages.
Names of files containing message trees end in `.trees.jsonl.gz` while files containing
a list of messages with a file name ending in `.messages.jsonl.gz`.
Mesages The type of file can be inferred from the file name extension:
- `.trees.jsonl.gz`: Conversation trees with nested messages
- `.messages.jsonl.gz`: Flat list of messages
### Ready for export trees
``` ```
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages 2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages 2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
``` ```
### All trees
``` ```
2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages 2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
@ -162,7 +168,6 @@ Mesages
``` ```
### Languages ### Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows: OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows: