OpenAssistant/oasst1

Andreas Köpf 8ba5a5a34d Update README.md

2023-04-15 14:46:30 +00:00

OpenAssistant Conversations Dataset (OASST1)

Dataset Description

Homepage: https://www.open-assistant.io/
Repository: https://github.com/LAION-AI/Open-Assistant
Paper: TBA on April 17, 2023

Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

Please refer to our paper for further details.

Dataset Structure

This dataset contains demonstrations of human-assistant conversations collected on the open-assistant.io website until April 12, 2023.

Conversations are exported as conversation trees with messages as nodes. The root node of a conversation tree is called the initial prompt. Each message can have multiple replies. When a message has more than one reply, the replies can carry a rank field indicating user preference (the most preferred reply has rank 0).

Every message has a role, which is either "assistant" or "prompter". Along any conversation thread, from the initial prompt to a leaf node of the tree, the roles strictly alternate between "prompter" and "assistant".
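The ranking and role rules above can be illustrated with a small sketch. It assumes each message is a plain dict with a "role" key, an optional "rank" key and a "replies" list, as in the JSON examples in this document. The helper names (ranked_replies, preferred_thread, roles_alternate) are hypothetical and not part of any Open Assistant library.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def ranked_replies(message: Message) -> List[Message]:
    """Order replies by rank (0 = most preferred); replies without a rank go last."""
    replies = message.get("replies") or []
    return sorted(replies, key=lambda r: (r.get("rank") is None, r.get("rank") or 0))


def preferred_thread(prompt: Message) -> List[Message]:
    """Follow the best-ranked reply at every level, from the prompt down to a leaf."""
    thread = [prompt]
    node = prompt
    while node.get("replies"):
        node = ranked_replies(node)[0]
        thread.append(node)
    return thread


def roles_alternate(thread: List[Message]) -> bool:
    """Check that a thread starts with a prompter and alternates with the assistant."""
    expected = "prompter"
    for message in thread:
        if message["role"] != expected:
            return False
        expected = "assistant" if expected == "prompter" else "prompter"
    return True
```

Treating unranked replies as least preferred is a design choice of this sketch, not something the dataset prescribes.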

JSON Example: Message

For readability, the following JSON examples are shown formatted with indentation on multiple lines. In the actual jsonl files, each object is stored without indentation on a single line.

{
    "message_id": "218440fd-5317-4355-91dc-d001416df62b",
    "parent_id": "13592dfb-a6f9-4748-a92c-32b34e239bb4",
    "user_id": "8e95461f-5e94-4d8b-a2fb-d4717ce973e4",
    "text": "It was the winter of 2035, and artificial intelligence (..)",
    "role": "assistant",
    "lang": "en",
    "review_count": 3,
    "review_result": true,
    "deleted": false,
    "rank": 0,
    "synthetic": true,
    "model_name": "oasst-sft-0_3000,max_new_tokens=400 (..)",
    "labels": {
        "spam": { "value": 0.0, "count": 3 },
        "lang_mismatch": { "value": 0.0, "count": 3 },
        "pii": { "value": 0.0, "count": 3 },
        "not_appropriate": { "value": 0.0, "count": 3 },
        "hate_speech": { "value": 0.0, "count": 3 },
        "sexual_content": { "value": 0.0, "count": 3 },
        "quality": { "value": 0.416, "count": 3 },
        "toxicity": { "value": 0.16, "count": 3 },
        "humor": { "value": 0.0, "count": 3 },
        "creativity": { "value": 0.33, "count": 3 },
        "violence": { "value": 0.16, "count": 3 }
    }
}
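Since each object occupies one line of a gzip-compressed jsonl file, a messages file can be read with the Python standard library alone. The following is a minimal sketch; read_jsonl_gz and label_value are hypothetical helper names, and the path passed in is whatever local copy of a .jsonl.gz export you have.

```python
import gzip
import json
from typing import Any, Dict, Iterator, Optional


def read_jsonl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a gzip-compressed jsonl file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def label_value(message: Dict[str, Any], name: str,
                default: Optional[float] = None) -> Optional[float]:
    """Return labels[name]["value"] for a message, or default if the label is absent."""
    label = (message.get("labels") or {}).get(name)
    return default if label is None else label.get("value", default)
```

For example, label_value(message, "quality") would return 0.416 for the message shown above, and None for a message that has no labels.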

JSON Example: Conversation Tree

For readability, only a subset of properties is shown here.

{
  "message_tree_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
  "tree_state": "ready_for_export",
  "prompt": {
    "message_id": "14fbb664-a620-45ce-bee4-7c519b16a793",
    "text": "Why can't we divide by 0? (..)",
    "role": "prompter",
    "lang": "en",
    "replies": [
      {
        "message_id": "894d30b6-56b4-4605-a504-89dd15d4d1c8",
        "text": "The reason we cannot divide by zero is because (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          // ...
        ]
      },
      {
        "message_id": "84d0913b-0fd9-4508-8ef5-205626a7039d",
        "text": "The reason that the result of a division by zero is (..)",
        "role": "assistant",
        "lang": "en",
        "replies": [
          {
            "message_id": "3352725e-f424-4e3b-a627-b6db831bdbaa",
            "text": "Math is confusing. Like those weird Irrational (..)",
            "role": "prompter",
            "lang": "en",
            "replies": [
              {
                "message_id": "f46207ca-3149-46e9-a466-9163d4ce499c",
                "text": "Irrational numbers are simply numbers (..)",
                "role": "assistant",
                "lang": "en",
                "replies": []
              },
              // ...
            ]
          }
        ]
      }
    ]
  }
}

Please refer to oasst-data for details about the data structure and for Python code to read and write jsonl files containing oasst objects.
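Independently of oasst-data, the nested structure can be walked with a few lines of plain Python. The sketch below assumes a tree object shaped like the example above, with the root message under "prompt" and children under "replies". iter_threads and count_messages are hypothetical names used only for illustration.

```python
from typing import Any, Dict, Iterator, List, Optional

Message = Dict[str, Any]


def iter_threads(node: Message, prefix: Optional[List[Message]] = None) -> Iterator[List[Message]]:
    """Yield every root-to-leaf path below node as a list of messages."""
    path = (prefix or []) + [node]
    replies = node.get("replies") or []
    if not replies:
        yield path
        return
    for reply in replies:
        yield from iter_threads(reply, path)


def count_messages(node: Message) -> int:
    """Count the node itself plus all of its descendants."""
    return 1 + sum(count_messages(reply) for reply in node.get("replies") or [])
```

Calling iter_threads(tree["prompt"]) on a parsed tree yields one list per leaf, each starting with the initial prompt.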

Main Dataset Files

Data is provided either as nested messages in conversation trees (extension .trees.jsonl.gz) or as a flat list of messages (extension .messages.jsonl.gz).

Full conversation trees can be reconstructed from flat messages using the parent_id and message_id properties to identify their parent-child relationship. The message_tree_id and tree_state properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.
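The reconstruction described above can be sketched as follows. It assumes that root messages have a missing or null parent_id, which this document does not state explicitly, and it sets aside messages whose parent is not in the input rather than guessing where they belong. build_trees and trees_in_state are hypothetical helper names.

```python
from typing import Any, Dict, Iterable, List, Tuple

Message = Dict[str, Any]


def build_trees(messages: Iterable[Message]) -> Tuple[List[Message], List[Message]]:
    """Link flat messages into trees via parent_id; return (roots, orphans)."""
    # Copy each message and give it an empty "replies" list to fill in.
    nodes = {m["message_id"]: {**m, "replies": []} for m in messages}
    roots: List[Message] = []
    orphans: List[Message] = []
    for node in nodes.values():
        parent_id = node.get("parent_id")
        if parent_id is None:
            roots.append(node)  # assumption: no parent_id means initial prompt
        elif parent_id in nodes:
            nodes[parent_id]["replies"].append(node)
        else:
            orphans.append(node)  # parent not present in the input
    return roots, orphans


def trees_in_state(roots: List[Message], state: str) -> List[Message]:
    """Keep only root messages whose tree_state equals the given state."""
    return [root for root in roots if root.get("tree_state") == state]
```

Reporting orphans separately makes it visible when a filtered export (for example one without deleted messages) no longer contains a reply's parent.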

Ready For Export Trees

2023-04-12_oasst_ready.trees.jsonl.gz       10364 trees with 88838 total messages
2023-04-12_oasst_ready.messages.jsonl.gz    88838 messages

Trees in ready_for_export state, excluding spam and deleted messages and including message labels. The oasst_ready trees file is usually sufficient for supervised fine-tuning (SFT) and reward model (RM) training.

All Trees

2023-04-12_oasst_all.trees.jsonl.gz         66497 trees with 161443 total messages
2023-04-12_oasst_all.messages.jsonl.gz     161443 messages

All trees, including those in the states prompt_lottery_waiting, aborted_low_grade, and halted_by_moderator.

Supplemental Exports: Spam & Prompts

2023-04-12_oasst_spam.messages.jsonl.gz

Messages that were deleted or have a negative review result ("review_result": false). Besides low quality, a frequent reason for message deletion is a wrong language tag.

2023-04-12_oasst_prompts.messages.jsonl.gz

All non-deleted initial prompt messages with a positive spam review result from trees in ready_for_export or prompt_lottery_waiting state.

Languages

OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:

Languages with over 1000 messages

English: 71956
Spanish: 43061
Russian: 9089
German: 5279
Chinese: 4962
French: 4251
Thai: 3042
Portuguese (Brazil): 2969
Catalan: 2260
Korean: 1553
Ukrainian: 1352
Italian: 1320
Japanese: 1018

Languages with under 1000 messages

Vietnamese: 952
Basque: 947
Polish: 886
Hungarian: 811
Arabic: 666
Dutch: 628
Swedish: 512
Turkish: 454
Finnish: 386
Czech: 372
Danish: 358
Galician: 339
Hebrew: 255
Romanian: 200
Norwegian Bokmål: 133
Indonesian: 115
Bulgarian: 95
Bengali: 82
Persian: 72
Greek: 66
Esperanto: 59
Slovak: 19
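A per-language tally like the one above can be recomputed from any messages file by counting the "lang" field. The sketch below works on an iterable of message dicts; language_counts is a hypothetical helper, and the "unknown" bucket for messages without a lang value is an assumption of this example.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def language_counts(messages: Iterable[Dict[str, Any]]) -> Counter:
    """Count messages per language code, using "unknown" when lang is missing."""
    return Counter(m.get("lang") or "unknown" for m in messages)
```

The result is keyed by language code (such as "en") rather than by language name, and counts.most_common() lists the codes in descending order of message count.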

Contact

Discord: Open Assistant Discord Server
GitHub: LAION-AI/Open-Assistant
E-Mail: open-assistent@laion.ai (yes, with e)
