---
license: apache-2.0
dataset_info:
  features:
  - name: message_id
    dtype: string
  - name: parent_id
    dtype: string
  - name: user_id
    dtype: string
  - name: created_date
    dtype: string
  - name: text
    dtype: string
  - name: role
    dtype: string
  - name: lang
    dtype: string
  - name: review_count
    dtype: int32
  - name: review_result
    dtype: bool
  - name: deleted
    dtype: bool
  - name: rank
    dtype: int32
  - name: synthetic
    dtype: bool
  - name: model_name
    dtype: string
  - name: detoxify
    struct:
    - name: toxicity
      dtype: float64
    - name: severe_toxicity
      dtype: float64
    - name: obscene
      dtype: float64
    - name: identity_attack
      dtype: float64
    - name: insult
      dtype: float64
    - name: threat
      dtype: float64
    - name: sexual_explicit
      dtype: float64
  - name: message_tree_id
    dtype: string
  - name: tree_state
    dtype: string
  - name: emojis
    sequence:
    - name: name
      dtype: string
    - name: count
      dtype: int32
  - name: labels
    sequence:
    - name: name
      dtype: string
    - name: value
      dtype: float64
    - name: count
      dtype: int32
  splits:
  - name: train
    num_bytes: 100367999
    num_examples: 84437
  - name: validation
    num_bytes: 5243405
    num_examples: 4401
  download_size: 41596430
  dataset_size: 105611404
language:
- en
- es
- ru
- de
- pl
- th
- vi
- sv
- bn
- da
- he
- it
- fa
- sk
- id
- nb
- el
- nl
- hu
- eu
- zh
- eo
- ja
- ca
- cs
- bg
- fi
- pt
- tr
- ro
- ar
- uk
- gl
- fr
- ko
tags:
- human-feedback
size_categories:
- 100K<n<1M
pretty_name: OpenAssistant Conversations
---
# OpenAssistant Conversations Dataset (OASST1)
## Dataset Description
- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** https://www.ykilcher.com/OA_Paper_2023_04_15.pdf (temporary)
### Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation
corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

Please refer to our [paper](https://www.ykilcher.com/OA_Paper_2023_04_15.pdf) for further details.
### Dataset Structure
This dataset contains message trees. Each tree has an initial prompt message as its root node, which can have
multiple child messages as replies, and each of these replies can in turn have multiple replies.

All messages have a role property, which is either "assistant" or "prompter". The roles in
conversation threads from prompt to leaf node strictly alternate between "prompter" and "assistant".

This version of the dataset contains data collected on the [open-assistant.io](https://www.open-assistant.io/) website until April 12, 2023.
### JSON Example: Message
For readability, the following JSON examples are shown formatted with indentation on multiple lines.
In the actual jsonl files, each object is stored without indentation on a single line.
```json
{
  "message_id": "218440fd-5317-4355-91dc-d001416df62b",
  "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
  "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
  "text": "It was the winter of 2035, and artificial intelligence (..)",
  "role": "assistant",
  "lang": "en",
  "review_count": 3,
  "review_result": true,
  "deleted": false,
  "rank": 0,
  "synthetic": true,
  "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
  "labels": {
    "spam": { "value": 0.0, "count": 3 },
    "lang_mismatch": { "value": 0.0, "count": 3 },
    "pii": { "value": 0.0, "count": 3 },
    "not_appropriate": { "value": 0.0, "count": 3 },
    "hate_speech": { "value": 0.0, "count": 3 },
    "sexual_content": { "value": 0.0, "count": 3 },
    "quality": { "value": 0.416, "count": 3 },
    "toxicity": { "value": 0.16, "count": 3 },
    "humor": { "value": 0.0, "count": 3 },
    "creativity": { "value": 0.33, "count": 3 },
    "violence": { "value": 0.16, "count": 3 }
  }
}
```
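
Since each line of a jsonl file holds one complete JSON object, the files can be read with the Python standard library alone. The following is a minimal sketch rather than the project's own reader; the helper name `iter_jsonl_gz` is hypothetical and introduced only for illustration.

```python
import gzip
import json
from typing import Iterator


def iter_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped jsonl file.

    Hypothetical helper for illustration; not part of the oasst-data package.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

For example, `next(iter_jsonl_gz("2023-04-12_oasst_ready.messages.jsonl.gz"))` would return the first message as a `dict`, assuming the file has been downloaded to the working directory.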
### JSON Example: Conversation Tree
For readability, only a subset of the message properties is shown here.
```json
{
  "message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
  "tree_state": "ready_for_export",
  "prompt": {
    "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
    "text": "Why can't we divide by 0? (..)",
    "role": "prompter",
    "lang": "en",
    "replies": [
      {
        "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
        "text": "The reason we cannot divide by zero is because (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          // ...
        ]
      },
      {
        "message_id": "84d0913b-0fd9-4508-8ef5-205626a7039d",
        "text": "The reason that the result of a division by zero is (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          {
            "message_id": "3352725e-f424-4e3b-a627-b6db831bdbaa",
            "text": "Math is confusing. Like those weird Irrational (..)",
            "role": "prompter",
            "lang": "en",
            "replies": [
              {
                "message_id": "f46207ca-3149-46e9-a466-9163d4ce499c",
                "text": "Irrational numbers are simply numbers (..)",
                "role": "assistant",
                "lang": "en",
                "replies": []
              },
              // ...
            ]
          }
        ]
      }
    ]
  }
}
```
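
A nested tree like the one above can be flattened into individual conversation threads (root prompt to leaf) with a short recursive walk. The sketch below is illustrative only: the function names `iter_threads` and `roles_alternate` are hypothetical, and the sample tree uses made-up ids and texts in the same shape as the example.

```python
from typing import Iterator


def iter_threads(node: dict, prefix: tuple = ()) -> Iterator[list]:
    """Yield every thread from `node` down to a leaf as a list of message dicts."""
    path = prefix + (node,)
    replies = node.get("replies") or []
    if not replies:
        yield list(path)
        return
    for reply in replies:
        yield from iter_threads(reply, path)


def roles_alternate(thread: list) -> bool:
    """Return True if the thread starts with "prompter" and roles strictly alternate."""
    expected = "prompter"
    for msg in thread:
        if msg["role"] != expected:
            return False
        expected = "assistant" if expected == "prompter" else "prompter"
    return True


# Hand-made sample tree (ids and texts are invented for this sketch).
tree = {
    "message_tree_id": "t1",
    "tree_state": "ready_for_export",
    "prompt": {
        "message_id": "t1", "role": "prompter", "text": "Why can't we divide by 0?",
        "replies": [
            {"message_id": "a1", "role": "assistant", "text": "Because ...", "replies": []},
            {"message_id": "a2", "role": "assistant", "text": "The result ...", "replies": [
                {"message_id": "p2", "role": "prompter", "text": "Math is confusing.", "replies": []},
            ]},
        ],
    },
}

threads = list(iter_threads(tree["prompt"]))
thread_ids = [[m["message_id"] for m in t] for t in threads]
print(thread_ids)  # [['t1', 'a1'], ['t1', 'a2', 'p2']]
print(all(roles_alternate(t) for t in threads))  # True
```

The walk starts at the `prompt` object rather than at the tree object itself, because only messages carry a `replies` list.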
Please refer to [oasst-data](https://github.com/LAION-AI/Open-Assistant/tree/main/oasst-data) for
details about the data structure and Python code to read and write jsonl files containing oasst data objects.
## Main Dataset Files
Conversation data is provided either as nested messages in trees (extension `.trees.jsonl.gz`)
or as a flat list (table) of messages (extension `.messages.jsonl.gz`).
### Ready For Export Trees
```
2023-04-12_oasst_ready.trees.jsonl.gz       10,364 trees with 88,838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz    88,838 messages
```
Trees in the `ready_for_export` state, without spam and deleted messages, including message labels.
The oasst_ready-trees file is usually sufficient for supervised fine-tuning (SFT) and reward model (RM) training.
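
As one possible way to prepare reward-model data from a tree, sibling replies can be turned into preference pairs using the `rank` property. This is only a sketch under stated assumptions, not the project's training pipeline: it assumes that replies in the trees file carry a `rank` value, that a lower `rank` marks the preferred reply, and that unranked replies have `rank` set to `None`. The function name `ranked_pairs` is hypothetical.

```python
from itertools import combinations


def ranked_pairs(parent: dict) -> list:
    """Return (preferred_text, other_text) pairs among the ranked replies of `parent`.

    ASSUMPTION: a lower `rank` value marks the preferred reply; replies whose
    `rank` is missing or None are skipped.
    """
    ranked = [r for r in parent.get("replies", []) if r.get("rank") is not None]
    ranked.sort(key=lambda r: r["rank"])
    # combinations() keeps the sorted order, so the first item of each pair
    # always has the lower (preferred) rank.
    return [(a["text"], b["text"]) for a, b in combinations(ranked, 2)]


# Made-up parent message with three replies, one of them unranked.
parent = {
    "text": "Q",
    "replies": [
        {"text": "B", "rank": 1},
        {"text": "A", "rank": 0},
        {"text": "C", "rank": None},
    ],
}
pairs = ranked_pairs(parent)
print(pairs)  # [('A', 'B')]
```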
### All Trees
```
2023-04-12_oasst_all.trees.jsonl.gz         66,497 trees with 161,443 total messages
2023-04-12_oasst_all.messages.jsonl.gz      161,443 messages
```
All trees, including those in the states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
`aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.
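
When working with the full export, trees can be grouped or filtered by their `tree_state`. The following sketch operates on already-parsed tree objects; the function names are hypothetical and the sample data is invented.

```python
from collections import Counter


def count_tree_states(trees) -> Counter:
    """Count how many trees are in each `tree_state`."""
    return Counter(t["tree_state"] for t in trees)


def select_trees(trees, state: str = "ready_for_export") -> list:
    """Keep only the trees in the given state."""
    return [t for t in trees if t["tree_state"] == state]


# Invented sample; real tree objects also contain a nested "prompt" message.
sample_trees = [
    {"message_tree_id": "t1", "tree_state": "ready_for_export"},
    {"message_tree_id": "t2", "tree_state": "prompt_lottery_waiting"},
    {"message_tree_id": "t3", "tree_state": "ready_for_export"},
    {"message_tree_id": "t4", "tree_state": "aborted_low_grade"},
]

state_counts = count_tree_states(sample_trees)
ready_ids = [t["message_tree_id"] for t in select_trees(sample_trees)]
print(state_counts["ready_for_export"], ready_ids)  # 2 ['t1', 't3']
```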
### Supplemental Exports: Spam & Prompts
```
2023-04-12_oasst_spam.messages.jsonl.gz
```
Messages which were deleted or have a negative review result (`"review_result": false`).
Besides low quality, a frequent reason for message deletion is a wrong language tag.
```
2023-04-12_oasst_prompts.messages.jsonl.gz
```
All non-deleted initial prompt messages with a positive spam review result from trees in the `ready_for_export` or `prompt_lottery_waiting` state.
### Using the Huggingface Datasets
While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
Nevertheless, we make all messages which can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available in parquet format as
train/validation splits, which are directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).

To load the oasst1 train & validation splits use:
```python
from datasets import load_dataset
ds = load_dataset("OpenAssistant/oasst1")
train = ds['train'] # len(train)=84437 (95%)
val = ds['validation'] # len(val)=4401 (5%)
```
The messages appear in depth-first order of the message trees.

Full conversation trees can be reconstructed from the flat messages table by using the `parent_id`
and `message_id` properties to identify the parent-child relationship of messages. The `message_tree_id`
and `tree_state` properties (only present in flat messages files) can be used to find
all messages of a message tree or to select trees by their state.
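
The reconstruction described above can be sketched as follows. This is an illustrative implementation, not the one from oasst-data; it assumes that root prompts have an empty (`None`) `parent_id`, and the function name `build_trees` is hypothetical. Messages whose parent is not present in the input are returned separately instead of being silently dropped.

```python
def build_trees(messages):
    """Rebuild nested trees from flat message dicts.

    Returns (roots, orphans): `roots` are messages without a parent, each with
    a nested "replies" list; `orphans` are messages whose parent is missing.
    ASSUMPTION: root prompts have parent_id set to None.
    """
    # Copy each message and give it an empty "replies" list.
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    roots, orphans = [], []
    for node in nodes.values():
        parent_id = node.get("parent_id")
        if parent_id is None:
            roots.append(node)
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        else:
            orphans.append(node)
    return roots, orphans


# Invented flat messages: one tree with three messages plus one orphan.
flat = [
    {"message_id": "t1", "parent_id": None, "message_tree_id": "t1", "role": "prompter"},
    {"message_id": "a1", "parent_id": "t1", "message_tree_id": "t1", "role": "assistant"},
    {"message_id": "p2", "parent_id": "a1", "message_tree_id": "t1", "role": "prompter"},
    {"message_id": "x9", "parent_id": "gone", "message_tree_id": "t7", "role": "assistant"},
]

roots, orphans = build_trees(flat)
print(roots[0]["replies"][0]["replies"][0]["message_id"])  # p2
print([m["message_id"] for m in orphans])  # ['x9']
```

Because the input dicts are copied, the original flat records are left unmodified.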
### Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:

**Languages with over 1000 messages**
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
<details>
<summary><b>Languages with under 1000 messages</b></summary>
<ul>
<li>Vietnamese: 952</li>
<li>Basque: 947</li>
<li>Polish: 886</li>
<li>Hungarian: 811</li>
<li>Arabic: 666</li>
<li>Dutch: 628</li>
<li>Swedish: 512</li>
<li>Turkish: 454</li>
<li>Finnish: 386</li>
<li>Czech: 372</li>
<li>Danish: 358</li>
<li>Galician: 339</li>
<li>Hebrew: 255</li>
<li>Romanian: 200</li>
<li>Norwegian Bokmål: 133</li>
<li>Indonesian: 115</li>
<li>Bulgarian: 95</li>
<li>Bengali: 82</li>
<li>Persian: 72</li>
<li>Greek: 66</li>
<li>Esperanto: 59</li>
<li>Slovak: 19</li>
</ul>
</details>
## Contact
- Discord: [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
- GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
- E-Mail: [open-assistent@laion.ai](mailto:open-assistent@laion.ai) (yes, with e)