oasst1/README.md
2023-05-02 13:21:21 +00:00

---
license: apache-2.0
dataset_info:
  features:
  - name: message_id
    dtype: string
  - name: parent_id
    dtype: string
  - name: user_id
    dtype: string
  - name: created_date
    dtype: string
  - name: text
    dtype: string
  - name: role
    dtype: string
  - name: lang
    dtype: string
  - name: review_count
    dtype: int32
  - name: review_result
    dtype: bool
  - name: deleted
    dtype: bool
  - name: rank
    dtype: int32
  - name: synthetic
    dtype: bool
  - name: model_name
    dtype: string
  - name: detoxify
    struct:
    - name: toxicity
      dtype: float64
    - name: severe_toxicity
      dtype: float64
    - name: obscene
      dtype: float64
    - name: identity_attack
      dtype: float64
    - name: insult
      dtype: float64
    - name: threat
      dtype: float64
    - name: sexual_explicit
      dtype: float64
  - name: message_tree_id
    dtype: string
  - name: tree_state
    dtype: string
  - name: emojis
    sequence:
    - name: name
      dtype: string
    - name: count
      dtype: int32
  - name: labels
    sequence:
    - name: name
      dtype: string
    - name: value
      dtype: float64
    - name: count
      dtype: int32
  splits:
  - name: train
    num_bytes: 100367999
    num_examples: 84437
  - name: validation
    num_bytes: 5243405
    num_examples: 4401
  download_size: 41596430
  dataset_size: 105611404
language:
- en
- es
- ru
- de
- pl
- th
- vi
- sv
- bn
- da
- he
- it
- fa
- sk
- id
- nb
- el
- nl
- hu
- eu
- zh
- eo
- ja
- ca
- cs
- bg
- fi
- pt
- tr
- ro
- ar
- uk
- gl
- fr
- ko
tags:
- human-feedback
size_categories:
- 100K<n<1M
pretty_name: OpenAssistant Conversations
---

OpenAssistant Conversations Dataset (OASST1)

Dataset Description

Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

Please refer to our paper for further details.

Dataset Structure

This dataset contains message trees. Each message tree has an initial prompt message as the root node, which can have multiple child messages as replies, and these child messages can have multiple replies.

Every message has a role property, which is either "assistant" or "prompter". Along any conversation thread from the initial prompt to a leaf node, the roles strictly alternate between "prompter" and "assistant".

This version of the dataset contains data collected on the open-assistant.io website until April 12, 2023.

JSON Example: Message

For readability, the following JSON examples are shown formatted with indentation on multiple lines. Objects are stored without indentation (on single lines) in the actual jsonl files.

{
    "message_id": "218440fd-5317-4355-91dc-d001416df62b",
    "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
    "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
    "text": "It was the winter of 2035, and artificial intelligence (..)",
    "role": "assistant",
    "lang": "en",
    "review_count": 3,
    "review_result": true,
    "deleted": false,
    "rank": 0,
    "synthetic": true,
    "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
    "labels": {
        "spam": { "value": 0.0, "count": 3 },
        "lang_mismatch": { "value": 0.0, "count": 3 },
        "pii": { "value": 0.0, "count": 3 },
        "not_appropriate": { "value": 0.0, "count": 3 },
        "hate_speech": { "value": 0.0, "count": 3 },
        "sexual_content": { "value": 0.0, "count": 3 },
        "quality": { "value": 0.416, "count": 3 },
        "toxicity": { "value": 0.16, "count": 3 },
        "humor": { "value": 0.0, "count": 3 },
        "creativity": { "value": 0.33, "count": 3 },
        "violence": { "value": 0.16, "count": 3 }
    }
}
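As a rough illustration of how the labels object in the example above might be consumed, the following sketch pulls a single label value out of a message dict. The helper name `label_value` and its `min_count` threshold are assumptions made for this example and are not part of the dataset tooling. It targets the JSONL layout shown above (label name mapped to a value/count pair); the parquet splits declare labels as a sequence of name/value/count fields, so they may need different handling.

```python
from typing import Optional


def label_value(message: dict, name: str, min_count: int = 1) -> Optional[float]:
    """Return the value of one label, or None if it is missing or has too few ratings.

    `label_value` and `min_count` are hypothetical names used for illustration only.
    """
    label = (message.get("labels") or {}).get(name)
    if label is None or label.get("count", 0) < min_count:
        return None
    return label["value"]


# Abbreviated copy of the example message above (only two labels kept).
message = {
    "role": "assistant",
    "labels": {
        "quality": {"value": 0.416, "count": 3},
        "toxicity": {"value": 0.16, "count": 3},
    },
}

print(label_value(message, "quality"))               # 0.416
print(label_value(message, "quality", min_count=5))  # None (only 3 ratings)
print(label_value(message, "humor"))                 # None (absent from this abbreviated copy)
```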

JSON Example: Conversation Tree

For readability, only a subset of the message properties is shown here.

{
  "message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
  "tree_state": "ready_for_export",
  "prompt": {
    "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
    "text": "Why can't we divide by 0? (..)",
    "role": "prompter",
    "lang": "en",
    "replies": [
      {
        "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
        "text": "The reason we cannot divide by zero is because (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          // ...
        ]
      },
      {
        "message_id": "84d0913b-0fd9-4508-8ef5-205626a7039d",
        "text": "The reason that the result of a division by zero is (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          {
            "message_id": "3352725e-f424-4e3b-a627-b6db831bdbaa",
            "text": "Math is confusing. Like those weird Irrational (..)",
            "role": "prompter",
            "lang": "en",
            "replies": [
              {
                "message_id": "f46207ca-3149-46e9-a466-9163d4ce499c",
                "text": "Irrational numbers are simply numbers (..)",
                "role": "assistant",
                "lang": "en",
                "replies": []
              },
              // ...
            ]
          }
        ]
      }
    ]
  }
}
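The nested layout above can be walked recursively. The sketch below enumerates every thread from the initial prompt to a leaf and checks that the roles alternate as described earlier. The function names `iter_threads` and `roles_alternate` are assumptions for this example, and the sample tree is a trimmed stand-in for the one above with shortened placeholder ids.

```python
from typing import Iterator, List


def iter_threads(node: dict, prefix: tuple = ()) -> Iterator[List[dict]]:
    """Yield every path from `node` down to a leaf as a list of message dicts."""
    path = prefix + (node,)
    replies = node.get("replies") or []
    if not replies:
        yield list(path)
        return
    for child in replies:
        yield from iter_threads(child, path)


def roles_alternate(thread: List[dict]) -> bool:
    """True if the thread starts with a prompter and roles strictly alternate."""
    expected = "prompter"
    for msg in thread:
        if msg["role"] != expected:
            return False
        expected = "assistant" if expected == "prompter" else "prompter"
    return True


# Trimmed stand-in for the tree above; ids are placeholders, not real message ids.
tree = {
    "message_tree_id": "p",
    "prompt": {
        "message_id": "p", "role": "prompter", "replies": [
            {"message_id": "a1", "role": "assistant", "replies": []},
            {"message_id": "a2", "role": "assistant", "replies": [
                {"message_id": "p2", "role": "prompter", "replies": [
                    {"message_id": "a3", "role": "assistant", "replies": []},
                ]},
            ]},
        ],
    },
}

threads = list(iter_threads(tree["prompt"]))
for thread in threads:
    print([m["message_id"] for m in thread], roles_alternate(thread))
```

Each yielded thread is one linear conversation, which is typically the unit needed when turning a tree into training dialogues.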

Please refer to oasst-data for details about the data structure and Python code to read and write jsonl files containing oasst data objects.

If you would like to explore the dataset yourself, you can find a getting-started notebook in the notebooks/openassistant-oasst1 folder of the LAION-AI/Open-Assistant GitHub repository.

Main Dataset Files

Conversation data is provided either as nested messages in trees (extension .trees.jsonl.gz) or as a flat list (table) of messages (extension .messages.jsonl.gz).
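Both file types are gzip-compressed JSON Lines, so they can be read with the Python standard library alone. The sketch below is a minimal stand-in for the readers provided by oasst-data, not a copy of them; the function name `read_jsonl_gz` is an assumption for this example.

```python
import gzip
import json
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

Assuming one of the files listed below has been downloaded to the working directory, `for tree in read_jsonl_gz("2023-04-12_oasst_ready.trees.jsonl.gz")` would iterate over tree objects, and the same call on a .messages.jsonl.gz file would iterate over flat messages.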

Ready For Export Trees

2023-04-12_oasst_ready.trees.jsonl.gz       10,364 trees with 88,838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz    88,838 messages

These files contain the trees in the ready_for_export state, with spam and deleted messages removed and message labels included. The oasst_ready trees file is usually sufficient for supervised fine-tuning (SFT) and reward model (RM) training.

All Trees

2023-04-12_oasst_all.trees.jsonl.gz         66,497 trees with 161,443 total messages
2023-04-12_oasst_all.messages.jsonl.gz     161,443 messages

All trees, including those in states prompt_lottery_waiting (trees that consist of only one message, namely the initial prompt), aborted_low_grade (trees that stopped growing because the messages had low quality), and halted_by_moderator.
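Since the all-trees export mixes several states, a first step is often to count or filter trees by tree_state. The sketch below assumes tree objects have already been loaded as dicts with a `tree_state` key, as in the conversation tree example; the helper names and the four sample trees are invented for illustration.

```python
from collections import Counter
from typing import Iterable, List


def count_tree_states(trees: Iterable[dict]) -> Counter:
    """Count how many trees are in each tree_state."""
    return Counter(tree.get("tree_state", "unknown") for tree in trees)


def select_trees(trees: Iterable[dict], state: str = "ready_for_export") -> List[dict]:
    """Keep only the trees whose tree_state equals `state`."""
    return [tree for tree in trees if tree.get("tree_state") == state]


# Invented sample data; real trees carry a "prompt" object as well.
trees = [
    {"message_tree_id": "t1", "tree_state": "ready_for_export"},
    {"message_tree_id": "t2", "tree_state": "prompt_lottery_waiting"},
    {"message_tree_id": "t3", "tree_state": "aborted_low_grade"},
    {"message_tree_id": "t4", "tree_state": "ready_for_export"},
]

print(count_tree_states(trees).most_common(1))          # [('ready_for_export', 2)]
print([t["message_tree_id"] for t in select_trees(trees)])  # ['t1', 't4']
```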

Supplemental Exports: Spam & Prompts

2023-04-12_oasst_spam.messages.jsonl.gz

These are messages which were deleted or have a negative review result ("review_result": false). Besides low quality, a frequent reason for message deletion is a wrong language tag.

2023-04-12_oasst_prompts.messages.jsonl.gz

These are all retained initial prompt messages with a positive review result (no spam) from trees in the ready_for_export or prompt_lottery_waiting state.

Using Hugging Face Datasets

While Hugging Face Datasets is ideal for tabular data, it is not a natural fit for nested data structures such as the OpenAssistant conversation trees. Nevertheless, all messages that can also be found in the file 2023-04-12_oasst_ready.trees.jsonl.gz are available in parquet format as train/validation splits, which can be loaded directly with Hugging Face Datasets.

To load the oasst1 train and validation splits, use:

from datasets import load_dataset

# Loads both parquet splits and returns them keyed by split name.
ds = load_dataset("OpenAssistant/oasst1")
train = ds["train"]       # len(train)=84437 (95%)
val = ds["validation"]    # len(val)=4401 (5%)

The messages appear in depth-first order of the message trees.

Full conversation trees can be reconstructed from the flat messages table by using the parent_id and message_id properties to identify the parent-child relationship of messages. The message_tree_id and tree_state properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.
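The reconstruction described above can be sketched as follows. The function name `build_trees` is an assumption for this example, as is the convention that root messages carry a null (None) parent_id; messages whose parent is not present in the input are skipped rather than attached.

```python
from typing import Dict, Iterable


def build_trees(messages: Iterable[dict]) -> Dict[str, dict]:
    """Rebuild nested trees from flat messages.

    Returns a dict mapping message_tree_id to the root message. Every message
    is copied and given a `replies` list holding its direct children.
    """
    nodes = {}
    for message in messages:
        node = dict(message)      # shallow copy so the input rows are not mutated
        node["replies"] = []
        nodes[node["message_id"]] = node

    roots = {}
    for node in nodes.values():
        parent_id = node.get("parent_id")
        if parent_id is None:
            # Assumption: a root has no parent; fall back to its own id as tree id.
            roots[node.get("message_tree_id", node["message_id"])] = node
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        # Messages whose parent is missing from the input are skipped.
    return roots


# Invented sample rows with placeholder ids.
messages = [
    {"message_id": "p", "parent_id": None, "message_tree_id": "p", "role": "prompter"},
    {"message_id": "a1", "parent_id": "p", "message_tree_id": "p", "role": "assistant"},
    {"message_id": "a2", "parent_id": "p", "message_tree_id": "p", "role": "assistant"},
    {"message_id": "p2", "parent_id": "a2", "message_tree_id": "p", "role": "prompter"},
]

trees = build_trees(messages)
root = trees["p"]
print([reply["message_id"] for reply in root["replies"]])                 # ['a1', 'a2']
print([reply["message_id"] for reply in root["replies"][1]["replies"]])   # ['p2']
```

Children are appended in input order, so the sibling order of the flat table is preserved. The function should accept any iterable of message dicts, for example the rows of the train split, though that usage is untested here.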

Languages

OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:

Languages with over 1000 messages

  • English: 71956
  • Spanish: 43061
  • Russian: 9089
  • German: 5279
  • Chinese: 4962
  • French: 4251
  • Thai: 3042
  • Portuguese (Brazil): 2969
  • Catalan: 2260
  • Korean: 1553
  • Ukrainian: 1352
  • Italian: 1320
  • Japanese: 1018

Languages with under 1000 messages

  • Vietnamese: 952
  • Basque: 947
  • Polish: 886
  • Hungarian: 811
  • Arabic: 666
  • Dutch: 628
  • Swedish: 512
  • Turkish: 454
  • Finnish: 386
  • Czech: 372
  • Danish: 358
  • Galician: 339
  • Hebrew: 255
  • Romanian: 200
  • Norwegian Bokmål: 133
  • Indonesian: 115
  • Bulgarian: 95
  • Bengali: 82
  • Persian: 72
  • Greek: 66
  • Esperanto: 59
  • Slovak: 19

Contact