Update README.md
parent 12c7304a36
commit 670f5bce5d
24
README.md
@ -134,7 +134,7 @@ of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
|
|||||||
This dataset contains demonstrations of human-assistant conversations which were collected
|
This dataset contains demonstrations of human-assistant conversations which were collected
|
||||||
on the [open-assistant.io](https://www.open-assistant.io/) website until April 12, 2023.
|
on the [open-assistant.io](https://www.open-assistant.io/) website until April 12, 2023.
|
||||||
|
|
||||||
Conversations are exported as conversation trees which contain conversation messages as nodes.
|
Conversations are exported as conversation trees with messages as nodes.
|
||||||
The root node of a conversation tree is called the initial prompt. Each message can have
|
The root node of a conversation tree is called the initial prompt. Each message can have
|
||||||
multiple replies. Nodes with more than one reply can have a `rank` field indicating the
|
multiple replies. Nodes with more than one reply can have a `rank` field indicating the
|
||||||
user preference (the most preferred message has rank 0).
|
user preference (the most preferred message has rank 0).
|
||||||
@ -143,6 +143,11 @@ All messages have a role which can either be "assistant" or "prompter". The role
|
|||||||
conversation threads from prompt to leaf node in a conversation tree are strictly alternating
|
conversation threads from prompt to leaf node in a conversation tree are strictly alternating
|
||||||
between "assistant" and "prompter".
|
between "assistant" and "prompter".
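The tree structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own loader: `rank` and `role` come from the text, while the `replies` and `text` keys, and the idea that `rank` is stored on each reply, are assumptions made for the example.

```python
from typing import Any

Message = dict[str, Any]


def preferred_thread(root: Message) -> list[Message]:
    """Follow the most preferred reply from the initial prompt down to a leaf."""
    thread = [root]
    node = root
    # "replies" is an assumed key holding the nested child messages.
    while node.get("replies"):
        # Rank 0 is the most preferred reply; replies without a rank sort last.
        node = min(
            node["replies"],
            key=lambda m: m["rank"] if m.get("rank") is not None else float("inf"),
        )
        thread.append(node)
    return thread


def roles_alternate(thread: list[Message]) -> bool:
    """Check that adjacent messages in a thread never share the same role."""
    return all(a["role"] != b["role"] for a, b in zip(thread, thread[1:]))
```

Calling `preferred_thread` on a tree's initial prompt yields one prompt-to-leaf thread, and `roles_alternate` should return `True` for it if the data follows the alternation rule stated above.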
|
||||||
|
|
||||||
|
Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
|
||||||
|
details about the data structure and python code to load and write conversation tree and message
|
||||||
|
jsonl files.
|
||||||
|
|
||||||
|
|
||||||
## Main Dataset Files
|
## Main Dataset Files
|
||||||
|
|
||||||
Data is provided either as nested messages in conversation trees or as a flat list of messages.
|
Data is provided either as nested messages in conversation trees or as a flat list of messages.
|
||||||
@ -151,15 +156,22 @@ The type of file can be inferred from the file name extension:
|
|||||||
- `.trees.jsonl.gz`: Conversation trees with nested messages
|
- `.trees.jsonl.gz`: Conversation trees with nested messages
|
||||||
- `.messages.jsonl.gz`: Flat list of messages
|
- `.messages.jsonl.gz`: Flat list of messages
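Both file types are gzip-compressed JSON Lines, so they can be read with the Python standard library alone. The sketch below is one possible way to do it; the helper names `read_jsonl_gz` and `file_kind` are hypothetical and are not part of the oasst-data package.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Iterator


def file_kind(path: str) -> str:
    """Infer the file type ("trees" or "messages") from the file name extension."""
    name = Path(path).name
    if name.endswith(".trees.jsonl.gz"):
        return "trees"
    if name.endswith(".messages.jsonl.gz"):
        return "messages"
    raise ValueError(f"unrecognized export file name: {name}")


def read_jsonl_gz(path: str) -> Iterator[dict[str, Any]]:
    """Yield one decoded JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because `read_jsonl_gz` is a generator, large exports can be processed one record at a time instead of being loaded into memory at once.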
|
||||||
|
|
||||||
|
Full conversation trees can be reconstructed from flat messages using the `parent_id`
|
||||||
|
and `message_id` properties to identify the parent-child relationship of messages. To
|
||||||
|
select all messages of a single conversation tree, `message_tree_id` can be used (only present
|
||||||
|
in flat messages files).
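The reconstruction described above might look roughly like the following sketch. It uses the `parent_id`, `message_id`, and `message_tree_id` properties named in the text; the `replies` key and the assumption that an initial prompt has a missing or null `parent_id` are illustrative choices, not confirmed details of the export format.

```python
from collections import defaultdict
from typing import Any, Iterable

Message = dict[str, Any]


def group_by_tree(messages: Iterable[Message]) -> dict[str, list[Message]]:
    """Collect flat messages per conversation tree using `message_tree_id`."""
    groups: dict[str, list[Message]] = defaultdict(list)
    for msg in messages:
        groups[msg["message_tree_id"]].append(msg)
    return dict(groups)


def build_trees(messages: Iterable[Message]) -> list[Message]:
    """Nest flat messages under their parents and return the root nodes."""
    # Copy each message and give it an (assumed) "replies" list for its children.
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    roots = []
    for node in nodes.values():
        parent_id = node.get("parent_id")
        if parent_id is None:
            # Assumption: initial prompts have no parent.
            roots.append(node)
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        # Messages whose parent is absent from the input are skipped.
    return roots
```

Applying `build_trees` to each group returned by `group_by_tree` rebuilds one tree per `message_tree_id`.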
|
||||||
|
|
||||||
### Ready for export trees
|
### Ready For Export Trees
|
||||||
|
|
||||||
|
This set of exported trees contains all trees in `ready_for_export` state without spam
|
||||||
|
and deleted messages. It is well suited for supervised fine-tuning and reward model training.
|
||||||
|
|
||||||
```
|
```
|
||||||
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
|
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
|
||||||
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
|
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
|
||||||
```
|
```
|
||||||
|
|
||||||
### All trees
|
### All Trees
|
||||||
|
|
||||||
|
|
||||||
```
|
```
|
||||||
@ -167,6 +179,12 @@ The type of file can be inferred from the file name extension:
|
|||||||
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
|
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Supplemental Exports: Spam & Prompts
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
### Languages
|
### Languages
|
||||||
|
|
||||||
|