oasst1/README.md

217 lines
4.6 KiB
Markdown
Raw Normal View History

2023-04-13 15:48:16 +00:00
---
license: apache-2.0
2023-04-13 22:14:10 +00:00
dataset_info:
features:
- name: message_id
dtype: string
- name: parent_id
dtype: string
- name: user_id
dtype: string
- name: created_date
dtype: string
- name: text
dtype: string
- name: role
dtype: string
- name: lang
dtype: string
- name: review_count
dtype: int32
- name: review_result
dtype: bool
- name: deleted
dtype: bool
- name: rank
dtype: int32
- name: synthetic
dtype: bool
- name: model_name
dtype: string
- name: detoxify
struct:
- name: toxicity
dtype: float64
- name: severe_toxicity
dtype: float64
- name: obscene
dtype: float64
- name: identity_attack
dtype: float64
- name: insult
dtype: float64
- name: threat
dtype: float64
- name: sexual_explicit
dtype: float64
- name: message_tree_id
dtype: string
- name: tree_state
dtype: string
- name: emojis
sequence:
- name: name
dtype: string
- name: count
dtype: int32
- name: labels
sequence:
- name: name
dtype: string
- name: value
dtype: float64
- name: count
dtype: int32
splits:
- name: train
2023-04-14 23:53:21 +00:00
num_bytes: 100367999
num_examples: 84437
- name: validation
num_bytes: 5243405
num_examples: 4401
download_size: 41596430
dataset_size: 105611404
2023-04-15 00:17:23 +00:00
language:
- en
- es
- ru
- de
- pl
- th
- vi
- sv
- bn
- da
- he
- it
- fa
- sk
- id
- nb
- el
- nl
- hu
- eu
- zh
- eo
- ja
- ca
- cs
- bg
- fi
- pt
- tr
- ro
- ar
- uk
- gl
- fr
- ko
2023-04-15 01:00:54 +00:00
tags:
- human-feedback
size_categories:
- 10K<n<100K
2023-04-13 15:48:16 +00:00
---
2023-04-13 22:15:32 +00:00
2023-04-15 12:36:25 +00:00
# OpenAssistant Conversations Dataset (OASST1)
2023-04-13 22:15:32 +00:00
## Dataset Description
2023-04-15 00:16:00 +00:00
- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** TBA on April 17, 2023
2023-04-13 22:15:32 +00:00
### Dataset Summary
2023-04-15 00:16:00 +00:00
In an effort to democratize research on large-scale alignment, we release OpenAssistant
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation
corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
2023-04-13 22:15:32 +00:00
2023-04-15 12:36:25 +00:00
### Dataset Structure
2023-04-15 13:15:54 +00:00
This dataset contains demonstrations of human-assistant conversations which were collected
2023-04-15 13:26:01 +00:00
on the [open-assistant.io](https://www.open-assistant.io/) website until April, 12 2023.
2023-04-15 13:15:54 +00:00
2023-04-15 13:26:01 +00:00
Conversations are exported as conversation trees which contain conversation messages as nodes.
The root node of a conversation tree is called the initial prompt. Each message can have
2023-04-15 13:15:54 +00:00
multiple replies. Nodes with more than one reply can have a `rank` field indicating the
2023-04-15 13:26:01 +00:00
user preference (the most preferred message has rank 0).
2023-04-15 13:15:54 +00:00
All messages have a role which can either be "assistant" or "prompter". The roles in
2023-04-15 13:26:01 +00:00
conversation threads from prompt to leaf node in a conversation tree are stricly alternating
2023-04-15 13:15:54 +00:00
between "assistant" and "prompter".
## Main Dataset Files
2023-04-15 13:26:01 +00:00
Data is provided either as nested messages in conversation trees or as flat list of messages.
The type of file can be inferred from the file name extension:
- `.trees.jsonl.gz`: Conversation trees with nested messages
- `.messages.jsonl.gz`: Flat list of messages
2023-04-15 13:15:54 +00:00
2023-04-15 13:26:01 +00:00
### Ready for export trees
2023-04-15 12:36:25 +00:00
2023-04-15 13:15:54 +00:00
```
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
```
2023-04-15 12:36:25 +00:00
2023-04-15 13:26:01 +00:00
### All trees
2023-04-15 12:36:25 +00:00
2023-04-15 13:15:54 +00:00
```
2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
```
2023-04-15 12:36:25 +00:00
2023-04-13 22:15:32 +00:00
### Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
**Languages with over 1000 messages**
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
<details>
<summary><b>Languages with under 1000 messages</b></summary>
<ul>
<li>Vietnamese: 952</li>
<li>Basque: 947</li>
<li>Polish: 886</li>
<li>Hungarian: 811</li>
<li>Arabic: 666</li>
<li>Dutch: 628</li>
<li>Swedish: 512</li>
<li>Turkish: 454</li>
<li>Finnish: 386</li>
<li>Czech: 372</li>
<li>Danish: 358</li>
<li>Galician: 339</li>
<li>Hebrew: 255</li>
<li>Romanian: 200</li>
<li>Norwegian Bokmål: 133</li>
<li>Indonesian: 115</li>
<li>Bulgarian: 95</li>
<li>Bengali: 82</li>
<li>Persian: 72</li>
<li>Greek: 66</li>
<li>Esperanto: 59</li>
<li>Slovak: 19</li>
</ul>
</details>