OpenAssistant/oasst1

Andreas Köpf dee0b5f871 Update README.md

2023-04-15 12:36:25 +00:00

3.9 KiB

Raw Blame History

license

dataset_info

language

tags

size_categories

apache-2.0

features

splits

download_size

dataset_size

name	dtype
message_id	string

name	dtype
parent_id	string

name	dtype
user_id	string

name	dtype
created_date	string

name	dtype
text	string

name	dtype
role	string

name	dtype
lang	string

name	dtype
review_count	int32

name	dtype
review_result	bool

name	dtype
deleted	bool

name	dtype
rank	int32

name	dtype
synthetic	bool

name	dtype
model_name	string

name

struct

detoxify

name	dtype
toxicity	float64

name	dtype
severe_toxicity	float64

name	dtype
obscene	float64

name	dtype
identity_attack	float64

name	dtype
insult	float64

name	dtype
threat	float64

name	dtype
sexual_explicit	float64

name	dtype
message_tree_id	string

name	dtype
tree_state	string

name

sequence

emojis

name	dtype
name	string

name	dtype
count	int32

name

sequence

labels

name	dtype
name	string

name	dtype
value	float64

name	dtype
count	int32

name	num_bytes	num_examples
train	100367999	84437

name	num_bytes	num_examples
validation	5243405	4401

41596430

105611404

human-feedback

10K<n<100K

OpenAssistant Conversations Dataset (OASST1)

Dataset Description

Homepage: https://www.open-assistant.io/
Repository: https://github.com/LAION-AI/Open-Assistant
Paper: TBA on April 17, 2023

Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

The dataset was exported from the open-assistant.io production database on April, 12 2023.

Dataset Structure

Thes dataset contains demonstrations of of human-assistant conversations that were collected on the open-assistant.io website.

All conversations are exported as message trees which contain conversation messages nodes. Each message has a role which can either be "assistant" or "prompter". The root node of a message tree is called the initial prompt. Nodes with at least two replies of completed trees have a rank field which indicates the users' preference consensus. The lower the rank the better the message.

Languages

OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:

Languages with over 1000 messages

English: 71956
Spanish: 43061
Russian: 9089
German: 5279
Chinese: 4962
French: 4251
Thai: 3042
Portuguese (Brazil): 2969
Catalan: 2260
Korean: 1553
Ukrainian: 1352
Italian: 1320
Japanese: 1018

Languages with under 1000 messages

Vietnamese: 952
Basque: 947
Polish: 886
Hungarian: 811
Arabic: 666
Dutch: 628
Swedish: 512
Turkish: 454
Finnish: 386
Czech: 372
Danish: 358
Galician: 339
Hebrew: 255
Romanian: 200
Norwegian Bokmål: 133
Indonesian: 115
Bulgarian: 95
Bengali: 82
Persian: 72
Greek: 66
Esperanto: 59
Slovak: 19

3.9 KiB Raw Blame History

OpenAssistant Conversations Dataset (OASST1)

Dataset Description

Dataset Summary

Dataset Structure

Languages

3.9 KiB

Raw Blame History