diff --git a/README.md b/README.md index 0d905a8..bd36fc5 100644 --- a/README.md +++ b/README.md @@ -129,19 +129,37 @@ corpus consisting of 161,443 messages distributed across 66,497 conversation tre 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers. -The dataset was exported from the open-assistant.io production database on April, 12 2023. - ### Dataset Structure -Thes dataset contains demonstrations of of human-assistant conversations that were collected -on the open-assistant.io website. +This dataset contains demonstrations of human-assistant conversations which were collected +on the open-assistant.io website until April, 12 2023. -All conversations are exported as message trees which contain conversation messages nodes. Each message has a -role which can either be "assistant" or "prompter". The root node of a message tree is called the initial prompt. -Nodes with at least two replies of completed trees have a `rank` field which indicates the users' preference consensus. -The lower the rank the better the message. +Conversations are exported as message trees which contain conversation messages as nodes. +The root node of a message tree is called the initial prompt. Each message node can have +multiple replies. Nodes with more than one reply can have a `rank` field indicating the +order among the siblings sorted by user preference (the most preferred message has rank 0). +All messages have a role which can either be "assistant" or "prompter". The roles in +conversation threads from prompt to leaf node in a message tree are stricly alternating +between "assistant" and "prompter". + +## Main Dataset Files + +Data is provided either as nested as a message tree or as flat list (table) of messages. +Names of files containing message trees end in `.trees.jsonl.gz` while files containing +a list of messages with a file name ending in `.messages.jsonl.gz`. + +Mesages + +``` +2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages +2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages +``` +``` +2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages +2023-04-12_oasst_all.messages.jsonl.gz 161443 messages +```