4.6 KiB
license | dataset_info | language | tags | size_categories | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
apache-2.0 |
|
|
|
|
OpenAssistant Conversations Dataset (OASST1)
Dataset Description
- Homepage: https://www.open-assistant.io/
- Repository: https://github.com/LAION-AI/Open-Assistant
- Paper: TBA on April 17, 2023
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
Dataset Structure
This dataset contains demonstrations of human-assistant conversations which were collected on the open-assistant.io website until April, 12 2023.
Conversations are exported as message trees which contain conversation messages as nodes.
The root node of a message tree is called the initial prompt. Each message node can have
multiple replies. Nodes with more than one reply can have a rank
field indicating the
order among the siblings sorted by user preference (the most preferred message has rank 0).
All messages have a role which can either be "assistant" or "prompter". The roles in
conversation threads from prompt to leaf node in a message tree are stricly alternating
between "assistant" and "prompter".
Main Dataset Files
Data is provided either as nested as a message tree or as flat list (table) of messages.
Names of files containing message trees end in .trees.jsonl.gz
while files containing
a list of messages with a file name ending in .messages.jsonl.gz
.
Mesages
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
Languages with over 1000 messages
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
Languages with under 1000 messages
- Vietnamese: 952
- Basque: 947
- Polish: 886
- Hungarian: 811
- Arabic: 666
- Dutch: 628
- Swedish: 512
- Turkish: 454
- Finnish: 386
- Czech: 372
- Danish: 358
- Galician: 339
- Hebrew: 255
- Romanian: 200
- Norwegian Bokmål: 133
- Indonesian: 115
- Bulgarian: 95
- Bengali: 82
- Persian: 72
- Greek: 66
- Esperanto: 59
- Slovak: 19