diff --git a/README.md b/README.md index 27426d8..cf48cc0 100644 --- a/README.md +++ b/README.md @@ -134,7 +134,7 @@ of a worldwide crowd-sourcing effort involving over 13,500 volunteers. This dataset contains demonstrations of human-assistant conversations which were collected on the [open-assistant.io](https://www.open-assistant.io/) website until April, 12 2023. -Conversations are exported as conversation trees which contain conversation messages as nodes. +Conversations are exported as conversation trees with messages as nodes. The root node of a conversation tree is called the initial prompt. Each message can have multiple replies. Nodes with more than one reply can have a `rank` field indicating the user preference (the most preferred message has rank 0). @@ -143,6 +143,11 @@ All messages have a role which can either be "assistant" or "prompter". The role conversation threads from prompt to leaf node in a conversation tree are stricly alternating between "assistant" and "prompter". +Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for +details about the data structure and python code to load and write conversation tree and message +jsonl files. + + ## Main Dataset Files Data is provided either as nested messages in conversation trees or as flat list of messages. @@ -151,15 +156,22 @@ The type of file can be inferred from the file name extension: - `.trees.jsonl.gz`: Conversation trees with nested messages - `.messages.jsonl.gz`: Flat list of messages +Full conversation trees can be reconstructed from a flat messages using the `parent_id` +and `message_id` properties to identify the parent-child relationship of messages. To +select all messages of a single conversation tree `message_tree_id` can be used (only present +in flat messages files). -### Ready for export trees +### Ready For Export Trees + +This set of exported trees contains all trees in `ready_for_export` state without spam +and deleted messages. For supervised fine-tuning & reward model training this file is ideal. ``` 2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages 2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages ``` -### All trees +### All Trees ``` @@ -167,6 +179,12 @@ The type of file can be inferred from the file name extension: 2023-04-12_oasst_all.messages.jsonl.gz 161443 messages ``` +### Supplemental Exports: Spam & Prompts + +``` + +``` + ### Languages