From 12c7304a3644c0420f4ec06bc0aa955196527da6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andreas=20K=C3=B6pf?=
Date: Sat, 15 Apr 2023 13:26:01 +0000
Subject: [PATCH] Update README.md

---
 README.md | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index bd36fc5..27426d8 100644
--- a/README.md
+++ b/README.md
@@ -132,29 +132,35 @@ of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
 ### Dataset Structure
 
 This dataset contains demonstrations of human-assistant conversations which were collected
-on the open-assistant.io website until April, 12 2023.
+on the [open-assistant.io](https://www.open-assistant.io/) website until April, 12 2023.
 
-Conversations are exported as message trees which contain conversation messages as nodes.
-The root node of a message tree is called the initial prompt. Each message node can have
+Conversations are exported as conversation trees which contain conversation messages as nodes.
+The root node of a conversation tree is called the initial prompt. Each message can have
 multiple replies. Nodes with more than one reply can have a `rank` field indicating the
-order among the siblings sorted by user preference (the most preferred message has rank 0).
+user preference (the most preferred message has rank 0).
+
 All messages have a role which can either be "assistant" or "prompter". The roles in
-conversation threads from prompt to leaf node in a message tree are stricly alternating
+conversation threads from prompt to leaf node in a conversation tree are stricly alternating
 between "assistant" and "prompter".
 
 ## Main Dataset Files
 
-Data is provided either as nested as a message tree or as flat list (table) of messages.
-Names of files containing message trees end in `.trees.jsonl.gz` while files containing
-a list of messages with a file name ending in `.messages.jsonl.gz`.
+Data is provided either as nested messages in conversation trees or as flat list of messages.
 
-Mesages
+The type of file can be inferred from the file name extension:
+- `.trees.jsonl.gz`: Conversation trees with nested messages
+- `.messages.jsonl.gz`: Flat list of messages
+
+
+### Ready for export trees
 
 ```
 2023-04-12_oasst_ready.trees.jsonl.gz       10364 trees with 88838 total messages
 2023-04-12_oasst_ready.messages.jsonl.gz    88838 messages
 ```
 
+### All trees
+
 ```
 2023-04-12_oasst_all.trees.jsonl.gz         66497 trees with 161443 total messages
 
@@ -162,7 +168,6 @@ Mesages
 ```
 
 
-
 ### Languages
 
 OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows: