license | pretty_name
---|---
apache-2.0 | OpenAssistant Conversations
OpenAssistant Conversations Dataset (OASST1)
Dataset Description
- Homepage: https://www.open-assistant.io/
- Repository: https://github.com/LAION-AI/Open-Assistant
- Paper: https://www.ykilcher.com/OA_Paper_2023_04_15.pdf
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
Please refer to our paper for further details.
Dataset Structure
This dataset contains message trees. Each message tree has an initial prompt message as its root node, which can have multiple child messages as replies, and each of these child messages can in turn have multiple replies.
All messages have a role property, which can be either "assistant" or "prompter". The roles in conversation threads from the prompt to a leaf node strictly alternate between "prompter" and "assistant".
This version of the dataset contains data collected on the open-assistant.io website until April 12, 2023.
JSON Example: Message
For readability, the following JSON examples are shown formatted with indentation on multiple lines. In the actual jsonl files, each object is stored without indentation on a single line.
{
  "message_id": "218440fd-5317-4355-91dc-d001416df62b",
  "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
  "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
  "text": "It was the winter of 2035, and artificial intelligence (..)",
  "role": "assistant",
  "lang": "en",
  "review_count": 3,
  "review_result": true,
  "deleted": false,
  "rank": 0,
  "synthetic": true,
  "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
  "labels": {
    "spam": { "value": 0.0, "count": 3 },
    "lang_mismatch": { "value": 0.0, "count": 3 },
    "pii": { "value": 0.0, "count": 3 },
    "not_appropriate": { "value": 0.0, "count": 3 },
    "hate_speech": { "value": 0.0, "count": 3 },
    "sexual_content": { "value": 0.0, "count": 3 },
    "quality": { "value": 0.416, "count": 3 },
    "toxicity": { "value": 0.16, "count": 3 },
    "humor": { "value": 0.0, "count": 3 },
    "creativity": { "value": 0.33, "count": 3 },
    "violence": { "value": 0.16, "count": 3 }
  }
}
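The nested labels object above can be flattened into a simple name-to-value mapping for filtering or analysis. The following is a minimal sketch, not part of the official tooling: the helper name label_values is hypothetical, the input is a trimmed copy of the example message, and it assumes that a message without labels should simply yield an empty mapping.

```python
import json

# A trimmed copy of the example message above, stored on a single line
# as it would be in a jsonl file (only three of the labels are kept).
line = (
    '{"message_id": "218440fd-5317-4355-91dc-d001416df62b", '
    '"role": "assistant", "lang": "en", '
    '"labels": {"spam": {"value": 0.0, "count": 3}, '
    '"quality": {"value": 0.416, "count": 3}, '
    '"toxicity": {"value": 0.16, "count": 3}}}'
)


def label_values(message: dict) -> dict:
    """Map each label name to its value (hypothetical helper).

    Assumption: a missing or empty "labels" property yields an empty dict.
    """
    labels = message.get("labels") or {}
    return {name: entry["value"] for name, entry in labels.items()}


message = json.loads(line)
scores = label_values(message)
print(scores["quality"])
```

The count field next to each value is ignored here; a stricter filter could additionally require a minimum count before trusting a label.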
JSON Example: Conversation Tree
For readability only a subset of the message properties is shown here.
{
  "message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
  "tree_state": "ready_for_export",
  "prompt": {
    "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
    "text": "Why can't we divide by 0? (..)",
    "role": "prompter",
    "lang": "en",
    "replies": [
      {
        "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
        "text": "The reason we cannot divide by zero is because (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          // ...
        ]
      },
      {
        "message_id": "84d0913b-0fd9-4508-8ef5-205626a7039d",
        "text": "The reason that the result of a division by zero is (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          {
            "message_id": "3352725e-f424-4e3b-a627-b6db831bdbaa",
            "text": "Math is confusing. Like those weird Irrational (..)",
            "role": "prompter",
            "lang": "en",
            "replies": [
              {
                "message_id": "f46207ca-3149-46e9-a466-9163d4ce499c",
                "text": "Irrational numbers are simply numbers (..)",
                "role": "assistant",
                "lang": "en",
                "replies": []
              },
              // ...
            ]
          }
        ]
      }
    ]
  }
}
Please refer to oasst-data for details about the data structure and Python code to read and write jsonl files containing oasst data objects.
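As an illustration of the tree structure, the sketch below walks a conversation tree depth-first and yields every thread from the prompt to a leaf, then checks that the roles alternate. It is independent of the oasst-data package; the helper names iter_threads and roles_alternate are hypothetical, and the miniature tree uses made-up IDs and shortened texts that only mirror the shape of the example above.

```python
from typing import Iterator


def iter_threads(node: dict, prefix: tuple = ()) -> Iterator[list]:
    """Yield every root-to-leaf thread as a list of message dicts (depth-first)."""
    path = prefix + (node,)
    replies = node.get("replies") or []
    if not replies:
        # A message without replies is a leaf: the path so far is one thread.
        yield list(path)
        return
    for child in replies:
        yield from iter_threads(child, path)


def roles_alternate(thread: list) -> bool:
    """True if the thread starts with "prompter" and the roles strictly alternate."""
    expected = ("prompter", "assistant")
    return all(m["role"] == expected[i % 2] for i, m in enumerate(thread))


# Hypothetical miniature tree (made-up IDs, shortened texts).
tree = {
    "message_tree_id": "p1",
    "tree_state": "ready_for_export",
    "prompt": {
        "message_id": "p1", "role": "prompter", "text": "Why can't we divide by 0?",
        "replies": [
            {"message_id": "a1", "role": "assistant", "text": "(..)", "replies": []},
            {"message_id": "a2", "role": "assistant", "text": "(..)", "replies": [
                {"message_id": "p2", "role": "prompter", "text": "(..)", "replies": [
                    {"message_id": "a3", "role": "assistant", "text": "(..)", "replies": []},
                ]},
            ]},
        ],
    },
}

threads = list(iter_threads(tree["prompt"]))
for thread in threads:
    print(" -> ".join(m["message_id"] for m in thread), roles_alternate(thread))
```

Each yielded thread is one linear conversation, which is the shape that a typical fine-tuning pipeline would consume.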
Main Dataset Files
Conversation data is provided either as nested messages in trees (extension .trees.jsonl.gz) or as a flat list (table) of messages (extension .messages.jsonl.gz).
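Both file types are gzip-compressed JSON Lines files, so they can be read with the Python standard library alone. The sketch below is one possible reader, not the oasst-data implementation: the function name read_jsonl_gz is hypothetical, and the demo writes a small temporary file with made-up records instead of touching the real export files. That the root message has a null parent_id is an assumption of the sample data.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path: str):
    """Yield one parsed JSON object per non-empty line of a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Round-trip demo with made-up records (assumption: the root has parent_id null).
sample = [
    {"message_id": "m1", "parent_id": None, "role": "prompter", "lang": "en"},
    {"message_id": "m2", "parent_id": "m1", "role": "assistant", "lang": "en"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.messages.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for obj in sample:
            f.write(json.dumps(obj) + "\n")
    loaded = list(read_jsonl_gz(path))

print(len(loaded))
```

Because the reader is a generator, a large export can be processed line by line without loading the whole file into memory.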
Ready For Export Trees
2023-04-12_oasst_ready.trees.jsonl.gz 10,364 trees with 88,838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz 88,838 messages
Trees in ready_for_export state without spam and deleted messages including message labels.
The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
All Trees
2023-04-12_oasst_all.trees.jsonl.gz 66,497 trees with 161,443 total messages
2023-04-12_oasst_all.messages.jsonl.gz 161,443 messages
All trees, including those in the states prompt_lottery_waiting (trees that consist of only one message, namely the initial prompt), aborted_low_grade (trees that stopped growing because the messages had low quality), and halted_by_moderator.
Supplemental Exports: Spam & Prompts
2023-04-12_oasst_spam.messages.jsonl.gz
Messages which were deleted or have a negative review result ("review_result": false).
Besides low quality, a frequent reason for message deletion is a wrong language tag.
2023-04-12_oasst_prompts.messages.jsonl.gz
All non-deleted initial prompt messages with positive review result (no spam) of trees in ready_for_export or prompt_lottery_waiting state.
Using the Huggingface Datasets
While HF Datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
Nevertheless, we make all messages that can also be found in the file 2023-04-12_oasst_ready.trees.jsonl.gz available in parquet format as a train/validation split, which is directly loadable by Huggingface Datasets.
To load the oasst1 train & validation splits use:
from datasets import load_dataset
ds = load_dataset("OpenAssistant/oasst1")
train = ds['train'] # len(train)=84437 (95%)
val = ds['validation'] # len(val)=4401 (5%)
The messages appear in depth-first order of the message trees.
Full conversation trees can be reconstructed from the flat messages table by using the parent_id and message_id properties to identify the parent-child relationship of messages. The message_tree_id and tree_state properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.
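The reconstruction described above could be sketched as follows. This is an illustrative helper, not the project's own code: the name build_trees is hypothetical, the sample records are made up, and it assumes that root prompts carry a null parent_id. Messages whose parent is missing from the input are skipped rather than attached anywhere.

```python
def build_trees(messages):
    """Rebuild nested trees from flat messages (hypothetical helper).

    Returns a dict mapping message_tree_id to the root message, where every
    message gains a "replies" list holding its children in input order.
    """
    messages = list(messages)
    # Copy each record so the input is not mutated, and add an empty replies list.
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    roots = {}
    for m in messages:
        node = nodes[m["message_id"]]
        parent_id = m.get("parent_id")
        if parent_id is None:
            # Assumption: a null parent_id marks the initial prompt of a tree.
            roots[m["message_tree_id"]] = node
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        # Otherwise the parent is not in the input; the message is skipped here.
    return roots


# Made-up flat records in depth-first order.
flat = [
    {"message_id": "p1", "parent_id": None, "message_tree_id": "p1", "role": "prompter"},
    {"message_id": "a1", "parent_id": "p1", "message_tree_id": "p1", "role": "assistant"},
    {"message_id": "a2", "parent_id": "p1", "message_tree_id": "p1", "role": "assistant"},
    {"message_id": "p2", "parent_id": "a2", "message_tree_id": "p1", "role": "prompter"},
    {"message_id": "a3", "parent_id": "p2", "message_tree_id": "p1", "role": "assistant"},
]

trees = build_trees(flat)
root = trees["p1"]
print([r["message_id"] for r in root["replies"]])
```

Building the lookup table first means the function does not depend on parents appearing before their children, although the order of each replies list still follows the input order.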
Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
Languages with over 1000 messages
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
Languages with under 1000 messages
- Vietnamese: 952
- Basque: 947
- Polish: 886
- Hungarian: 811
- Arabic: 666
- Dutch: 628
- Swedish: 512
- Turkish: 454
- Finnish: 386
- Czech: 372
- Danish: 358
- Galician: 339
- Hebrew: 255
- Romanian: 200
- Norwegian Bokmål: 133
- Indonesian: 115
- Bulgarian: 95
- Bengali: 82
- Persian: 72
- Greek: 66
- Esperanto: 59
- Slovak: 19
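A per-language distribution like the one above can be recomputed from flat message records by counting the lang property. The sketch below uses made-up records and a hypothetical helper name, language_counts. Note that lang holds a short code such as "en" in the message example, so mapping codes to the language names listed above would need a separate lookup that is not shown here.

```python
from collections import Counter


def language_counts(messages) -> Counter:
    """Count messages per value of the "lang" property (hypothetical helper)."""
    return Counter(m["lang"] for m in messages)


# Made-up records; real input would be the flat message records of the dataset.
records = [{"lang": "en"}, {"lang": "es"}, {"lang": "en"}, {"lang": "de"}]
counts = language_counts(records)
for lang, n in counts.most_common():
    print(f"{lang}: {n}")
```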
Contact
- Discord: Open Assistant Discord Server
- GitHub: LAION-AI/Open-Assistant
- E-Mail: open-assistent@laion.ai (yes, with e)