license |
---|
apache-2.0 |
OpenAssistant Conversations Dataset (OASST1)
Dataset Description
- Homepage: https://www.open-assistant.io/
- Repository: https://github.com/LAION-AI/Open-Assistant
- Paper: TBA on April 17, 2023
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
The dataset was exported from the open-assistant.io production database on April 12, 2023.
Dataset Structure
This dataset contains demonstrations of human-assistant conversations that were collected on the open-assistant.io website.
All conversations are exported as message trees, which contain conversation message nodes. Each message has a
role, which can be either "assistant" or "prompter". The root node of a message tree is called the initial prompt.
In completed trees, nodes with at least two replies have a rank
field, which indicates the users' preference consensus.
The lower the rank, the better the message.
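The tree structure described above can be sketched in Python as follows. This is a minimal illustration and not the export schema: only role and rank are named in this card, while the `text` and `replies` fields, the `Message` class, and both helper functions are hypothetical. It also assumes that ranks are integers starting at 0 and that the rank is stored on each of the competing replies.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    """One node of a message tree (text and replies are illustrative names)."""
    text: str
    role: str  # "prompter" or "assistant"
    rank: Optional[int] = None  # assumed: present only where a ranking exists
    replies: List["Message"] = field(default_factory=list)


def best_reply(node: Message) -> Optional[Message]:
    """Return the ranked reply with the lowest rank, or None if none is ranked."""
    ranked = [r for r in node.replies if r.rank is not None]
    return min(ranked, key=lambda r: r.rank) if ranked else None


def preferred_thread(root: Message) -> List[Message]:
    """Follow the lowest-ranked reply at each level, starting at the initial prompt."""
    thread = [root]
    node = best_reply(root)
    while node is not None:
        thread.append(node)
        node = best_reply(node)
    return thread


# A tiny made-up tree: one initial prompt with two competing assistant replies.
root = Message(
    text="What is a message tree?",
    role="prompter",
    replies=[
        Message(text="Not sure.", role="assistant", rank=1),
        Message(text="A tree of conversation turns.", role="assistant", rank=0),
    ],
)

for message in preferred_thread(root):
    print(f"{message.role}: {message.text}")
```

Walking the tree this way yields a single linear conversation per tree. Keeping all ranked replies instead preserves the preference information.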
Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
Languages with over 1000 messages
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
Languages with under 1000 messages
- Vietnamese: 952
- Basque: 947
- Polish: 886
- Hungarian: 811
- Arabic: 666
- Dutch: 628
- Swedish: 512
- Turkish: 454
- Finnish: 386
- Czech: 372
- Danish: 358
- Galician: 339
- Hebrew: 255
- Romanian: 200
- Norwegian Bokmål: 133
- Indonesian: 115
- Bulgarian: 95
- Bengali: 82
- Persian: 72
- Greek: 66
- Esperanto: 59
- Slovak: 19