194 lines
3.9 KiB
Markdown
194 lines
3.9 KiB
Markdown
---
|
|
license: apache-2.0
|
|
dataset_info:
|
|
features:
|
|
- name: message_id
|
|
dtype: string
|
|
- name: parent_id
|
|
dtype: string
|
|
- name: user_id
|
|
dtype: string
|
|
- name: created_date
|
|
dtype: string
|
|
- name: text
|
|
dtype: string
|
|
- name: role
|
|
dtype: string
|
|
- name: lang
|
|
dtype: string
|
|
- name: review_count
|
|
dtype: int32
|
|
- name: review_result
|
|
dtype: bool
|
|
- name: deleted
|
|
dtype: bool
|
|
- name: rank
|
|
dtype: int32
|
|
- name: synthetic
|
|
dtype: bool
|
|
- name: model_name
|
|
dtype: string
|
|
- name: detoxify
|
|
struct:
|
|
- name: toxicity
|
|
dtype: float64
|
|
- name: severe_toxicity
|
|
dtype: float64
|
|
- name: obscene
|
|
dtype: float64
|
|
- name: identity_attack
|
|
dtype: float64
|
|
- name: insult
|
|
dtype: float64
|
|
- name: threat
|
|
dtype: float64
|
|
- name: sexual_explicit
|
|
dtype: float64
|
|
- name: message_tree_id
|
|
dtype: string
|
|
- name: tree_state
|
|
dtype: string
|
|
- name: emojis
|
|
sequence:
|
|
- name: name
|
|
dtype: string
|
|
- name: count
|
|
dtype: int32
|
|
- name: labels
|
|
sequence:
|
|
- name: name
|
|
dtype: string
|
|
- name: value
|
|
dtype: float64
|
|
- name: count
|
|
dtype: int32
|
|
splits:
|
|
- name: train
|
|
num_bytes: 100367999
|
|
num_examples: 84437
|
|
- name: validation
|
|
num_bytes: 5243405
|
|
num_examples: 4401
|
|
download_size: 41596430
|
|
dataset_size: 105611404
|
|
language:
|
|
- en
|
|
- es
|
|
- ru
|
|
- de
|
|
- pl
|
|
- th
|
|
- vi
|
|
- sv
|
|
- bn
|
|
- da
|
|
- he
|
|
- it
|
|
- fa
|
|
- sk
|
|
- id
|
|
- nb
|
|
- el
|
|
- nl
|
|
- hu
|
|
- eu
|
|
- zh
|
|
- eo
|
|
- ja
|
|
- ca
|
|
- cs
|
|
- bg
|
|
- fi
|
|
- pt
|
|
- tr
|
|
- ro
|
|
- ar
|
|
- uk
|
|
- gl
|
|
- fr
|
|
- ko
|
|
tags:
|
|
- human-feedback
|
|
size_categories:
|
|
- 10K<n<100K
|
|
---
|
|
|
|
# OpenAssistant Conversations Dataset (OASST1)
|
|
|
|
## Dataset Description
|
|
|
|
- **Homepage:** https://www.open-assistant.io/
|
|
- **Repository:** https://github.com/LAION-AI/Open-Assistant
|
|
- **Paper:** TBA on April 17, 2023
|
|
|
|
### Dataset Summary
|
|
|
|
In an effort to democratize research on large-scale alignment, we release OpenAssistant
|
|
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation
|
|
corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in
|
|
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
|
|
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
|
|
|
|
The dataset was exported from the open-assistant.io production database on April, 12 2023.
|
|
|
|
### Dataset Structure
|
|
|
|
Thes dataset contains demonstrations of of human-assistant conversations that were collected
|
|
on the open-assistant.io website.
|
|
|
|
All conversations are exported as message trees which contain conversation messages nodes. Each message has a
|
|
role which can either be "assistant" or "prompter". The root node of a message tree is called the initial prompt.
|
|
Nodes with at least two replies of completed trees have a `rank` field which indicates the users' preference consensus.
|
|
The lower the rank the better the message.
|
|
|
|
|
|
|
|
|
|
|
|
### Languages
|
|
|
|
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
|
|
|
|
**Languages with over 1000 messages**
|
|
- English: 71956
|
|
- Spanish: 43061
|
|
- Russian: 9089
|
|
- German: 5279
|
|
- Chinese: 4962
|
|
- French: 4251
|
|
- Thai: 3042
|
|
- Portuguese (Brazil): 2969
|
|
- Catalan: 2260
|
|
- Korean: 1553
|
|
- Ukrainian: 1352
|
|
- Italian: 1320
|
|
- Japanese: 1018
|
|
|
|
<details>
|
|
<summary><b>Languages with under 1000 messages</b></summary>
|
|
<ul>
|
|
<li>Vietnamese: 952</li>
|
|
<li>Basque: 947</li>
|
|
<li>Polish: 886</li>
|
|
<li>Hungarian: 811</li>
|
|
<li>Arabic: 666</li>
|
|
<li>Dutch: 628</li>
|
|
<li>Swedish: 512</li>
|
|
<li>Turkish: 454</li>
|
|
<li>Finnish: 386</li>
|
|
<li>Czech: 372</li>
|
|
<li>Danish: 358</li>
|
|
<li>Galician: 339</li>
|
|
<li>Hebrew: 255</li>
|
|
<li>Romanian: 200</li>
|
|
<li>Norwegian Bokmål: 133</li>
|
|
<li>Indonesian: 115</li>
|
|
<li>Bulgarian: 95</li>
|
|
<li>Bengali: 82</li>
|
|
<li>Persian: 72</li>
|
|
<li>Greek: 66</li>
|
|
<li>Esperanto: 59</li>
|
|
<li>Slovak: 19</li>
|
|
</ul>
|
|
</details>
|