2023-04-13 15:48:16 +00:00
|
|
|
---
|
|
|
|
license: apache-2.0
|
2023-04-13 22:14:10 +00:00
|
|
|
dataset_info:
|
|
|
|
features:
|
|
|
|
- name: message_id
|
|
|
|
dtype: string
|
|
|
|
- name: parent_id
|
|
|
|
dtype: string
|
|
|
|
- name: user_id
|
|
|
|
dtype: string
|
|
|
|
- name: created_date
|
|
|
|
dtype: string
|
|
|
|
- name: text
|
|
|
|
dtype: string
|
|
|
|
- name: role
|
|
|
|
dtype: string
|
|
|
|
- name: lang
|
|
|
|
dtype: string
|
|
|
|
- name: review_count
|
|
|
|
dtype: int32
|
|
|
|
- name: review_result
|
|
|
|
dtype: bool
|
|
|
|
- name: deleted
|
|
|
|
dtype: bool
|
|
|
|
- name: rank
|
|
|
|
dtype: int32
|
|
|
|
- name: synthetic
|
|
|
|
dtype: bool
|
|
|
|
- name: model_name
|
|
|
|
dtype: string
|
|
|
|
- name: detoxify
|
|
|
|
struct:
|
|
|
|
- name: toxicity
|
|
|
|
dtype: float64
|
|
|
|
- name: severe_toxicity
|
|
|
|
dtype: float64
|
|
|
|
- name: obscene
|
|
|
|
dtype: float64
|
|
|
|
- name: identity_attack
|
|
|
|
dtype: float64
|
|
|
|
- name: insult
|
|
|
|
dtype: float64
|
|
|
|
- name: threat
|
|
|
|
dtype: float64
|
|
|
|
- name: sexual_explicit
|
|
|
|
dtype: float64
|
|
|
|
- name: message_tree_id
|
|
|
|
dtype: string
|
|
|
|
- name: tree_state
|
|
|
|
dtype: string
|
|
|
|
- name: emojis
|
|
|
|
sequence:
|
|
|
|
- name: name
|
|
|
|
dtype: string
|
|
|
|
- name: count
|
|
|
|
dtype: int32
|
|
|
|
- name: labels
|
|
|
|
sequence:
|
|
|
|
- name: name
|
|
|
|
dtype: string
|
|
|
|
- name: value
|
|
|
|
dtype: float64
|
|
|
|
- name: count
|
|
|
|
dtype: int32
|
|
|
|
splits:
|
|
|
|
- name: train
|
2023-04-14 23:53:21 +00:00
|
|
|
num_bytes: 100367999
|
|
|
|
num_examples: 84437
|
|
|
|
- name: validation
|
|
|
|
num_bytes: 5243405
|
|
|
|
num_examples: 4401
|
|
|
|
download_size: 41596430
|
|
|
|
dataset_size: 105611404
|
2023-04-15 00:17:23 +00:00
|
|
|
language:
|
|
|
|
- en
|
|
|
|
- es
|
|
|
|
- ru
|
|
|
|
- de
|
|
|
|
- pl
|
|
|
|
- th
|
|
|
|
- vi
|
|
|
|
- sv
|
|
|
|
- bn
|
|
|
|
- da
|
|
|
|
- he
|
|
|
|
- it
|
|
|
|
- fa
|
|
|
|
- sk
|
|
|
|
- id
|
|
|
|
- nb
|
|
|
|
- el
|
|
|
|
- nl
|
|
|
|
- hu
|
|
|
|
- eu
|
|
|
|
- zh
|
|
|
|
- eo
|
|
|
|
- ja
|
|
|
|
- ca
|
|
|
|
- cs
|
|
|
|
- bg
|
|
|
|
- fi
|
|
|
|
- pt
|
|
|
|
- tr
|
|
|
|
- ro
|
|
|
|
- ar
|
|
|
|
- uk
|
|
|
|
- gl
|
|
|
|
- fr
|
|
|
|
- ko
|
2023-04-15 01:00:54 +00:00
|
|
|
tags:
|
|
|
|
- human-feedback
|
|
|
|
size_categories:
|
|
|
|
- 10K<n<100K
|
2023-04-13 15:48:16 +00:00
|
|
|
---
|
2023-04-13 22:15:32 +00:00
|
|
|
|
2023-04-15 00:16:00 +00:00
|
|
|
# Dataset Card for OASST1
|
2023-04-13 22:15:32 +00:00
|
|
|
|
|
|
|
## Dataset Description
|
|
|
|
|
2023-04-15 00:16:00 +00:00
|
|
|
- **Homepage:** https://www.open-assistant.io/
|
|
|
|
- **Repository:** https://github.com/LAION-AI/Open-Assistant
|
2023-04-15 09:06:58 +00:00
|
|
|
- **Paper:** TBA on April 17, 2023
|
2023-04-13 22:15:32 +00:00
|
|
|
|
|
|
|
### Dataset Summary
|
|
|
|
|
2023-04-15 00:16:00 +00:00
|
|
|
In an effort to democratize research on large-scale alignment, we release OpenAssistant
|
|
|
|
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation
|
|
|
|
corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in
|
|
|
|
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
|
|
|
|
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
|
2023-04-13 22:15:32 +00:00
|
|
|
|
|
|
|
|
|
|
|
### Languages
|
|
|
|
|
2023-04-15 09:06:58 +00:00
|
|
|
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
|
|
|
|
|
2023-04-15 11:33:38 +00:00
|
|
|
**Languages with over 1000 messages**
|
2023-04-15 09:06:58 +00:00
|
|
|
- English: 71956
|
|
|
|
- Spanish: 43061
|
|
|
|
- Russian: 9089
|
|
|
|
- German: 5279
|
|
|
|
- Chinese: 4962
|
|
|
|
- French: 4251
|
|
|
|
- Thai: 3042
|
|
|
|
- Portuguese (Brazil): 2969
|
|
|
|
- Catalan: 2260
|
|
|
|
- Korean: 1553
|
|
|
|
- Ukrainian: 1352
|
|
|
|
- Italian: 1320
|
|
|
|
- Japanese: 1018
|
|
|
|
|
|
|
|
<details>
|
2023-04-15 11:33:38 +00:00
|
|
|
<summary><b>Languages with under 1000 messages</b></summary>
|
2023-04-15 09:06:58 +00:00
|
|
|
<ul>
|
|
|
|
<li>Vietnamese: 952</li>
|
|
|
|
<li>Basque: 947</li>
|
|
|
|
<li>Polish: 886</li>
|
|
|
|
<li>Hungarian: 811</li>
|
|
|
|
<li>Arabic: 666</li>
|
|
|
|
<li>Dutch: 628</li>
|
|
|
|
<li>Swedish: 512</li>
|
|
|
|
<li>Turkish: 454</li>
|
|
|
|
<li>Finnish: 386</li>
|
|
|
|
<li>Czech: 372</li>
|
|
|
|
<li>Danish: 358</li>
|
|
|
|
<li>Galician: 339</li>
|
|
|
|
<li>Hebrew: 255</li>
|
|
|
|
<li>Romanian: 200</li>
|
|
|
|
<li>Norwegian Bokmål: 133</li>
|
|
|
|
<li>Indonesian: 115</li>
|
|
|
|
<li>Bulgarian: 95</li>
|
|
|
|
<li>Bengali: 82</li>
|
|
|
|
<li>Persian: 72</li>
|
|
|
|
<li>Greek: 66</li>
|
|
|
|
<li>Esperanto: 59</li>
|
|
|
|
<li>Slovak: 19</li>
|
|
|
|
</ul>
|
|
|
|
</details>
|
2023-04-13 22:15:32 +00:00
|
|
|
|
|
|
|
## Dataset Structure
|
|
|
|
|
|
|
|
### Data Instances
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Data Fields
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Data Splits
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
## Dataset Creation
|
|
|
|
|
|
|
|
### Curation Rationale
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Source Data
|
|
|
|
|
|
|
|
#### Initial Data Collection and Normalization
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
#### Who are the source language producers?
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Annotations
|
|
|
|
|
|
|
|
#### Annotation process
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
#### Who are the annotators?
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Personal and Sensitive Information
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
## Considerations for Using the Data
|
|
|
|
|
|
|
|
### Social Impact of Dataset
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Discussion of Biases
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Other Known Limitations
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
## Additional Information
|
|
|
|
|
|
|
|
### Dataset Curators
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Licensing Information
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Citation Information
|
|
|
|
|
|
|
|
[More Information Needed]
|
|
|
|
|
|
|
|
### Contributions
|
|
|
|
|
|
|
|
[More Information Needed]
|