4.6 KiB
license | dataset_info | language | tags | size_categories | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
apache-2.0 |
|
|
|
|
OpenAssistant Conversations Dataset (OASST1)
Dataset Description
- Homepage: https://www.open-assistant.io/
- Repository: https://github.com/LAION-AI/Open-Assistant
- Paper: TBA on April 17, 2023
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
Dataset Structure
This dataset contains demonstrations of human-assistant conversations which were collected on the open-assistant.io website until April, 12 2023.
Conversations are exported as conversation trees which contain conversation messages as nodes.
The root node of a conversation tree is called the initial prompt. Each message can have
multiple replies. Nodes with more than one reply can have a rank
field indicating the
user preference (the most preferred message has rank 0).
All messages have a role which can either be "assistant" or "prompter". The roles in conversation threads from prompt to leaf node in a conversation tree are stricly alternating between "assistant" and "prompter".
Main Dataset Files
Data is provided either as nested messages in conversation trees or as flat list of messages.
The type of file can be inferred from the file name extension:
.trees.jsonl.gz
: Conversation trees with nested messages.messages.jsonl.gz
: Flat list of messages
Ready for export trees
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
All trees
2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
Languages with over 1000 messages
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
Languages with under 1000 messages
- Vietnamese: 952
- Basque: 947
- Polish: 886
- Hungarian: 811
- Arabic: 666
- Dutch: 628
- Swedish: 512
- Turkish: 454
- Finnish: 386
- Czech: 372
- Danish: 358
- Galician: 339
- Hebrew: 255
- Romanian: 200
- Norwegian Bokmål: 133
- Indonesian: 115
- Bulgarian: 95
- Bengali: 82
- Persian: 72
- Greek: 66
- Esperanto: 59
- Slovak: 19