oasst1/README.md
Andreas Köpf f8b2b76b6c Fix formatting (#3)
- Fix formatting (2f39b5f0ae44bec26b22b1a299d804e6572d270f)


Co-authored-by: Dimitri <dvruette@users.noreply.huggingface.co>
2023-04-15 11:33:38 +00:00

4.3 KiB

license dataset_info language tags size_categories
apache-2.0
features splits download_size dataset_size
name dtype
message_id string
name dtype
parent_id string
name dtype
user_id string
name dtype
created_date string
name dtype
text string
name dtype
role string
name dtype
lang string
name dtype
review_count int32
name dtype
review_result bool
name dtype
deleted bool
name dtype
rank int32
name dtype
synthetic bool
name dtype
model_name string
name struct
detoxify
name dtype
toxicity float64
name dtype
severe_toxicity float64
name dtype
obscene float64
name dtype
identity_attack float64
name dtype
insult float64
name dtype
threat float64
name dtype
sexual_explicit float64
name dtype
message_tree_id string
name dtype
tree_state string
name sequence
emojis
name dtype
name string
name dtype
count int32
name sequence
labels
name dtype
name string
name dtype
value float64
name dtype
count int32
name num_bytes num_examples
train 100367999 84437
name num_bytes num_examples
validation 5243405 4401
41596430 105611404
en
es
ru
de
pl
th
vi
sv
bn
da
he
it
fa
sk
id
nb
el
nl
hu
eu
zh
eo
ja
ca
cs
bg
fi
pt
tr
ro
ar
uk
gl
fr
ko
human-feedback
10K<n<100K

Dataset Card for OASST1

Dataset Description

Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

Languages

OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:

Languages with over 1000 messages

  • English: 71956
  • Spanish: 43061
  • Russian: 9089
  • German: 5279
  • Chinese: 4962
  • French: 4251
  • Thai: 3042
  • Portuguese (Brazil): 2969
  • Catalan: 2260
  • Korean: 1553
  • Ukrainian: 1352
  • Italian: 1320
  • Japanese: 1018
Languages with under 1000 messages
  • Vietnamese: 952
  • Basque: 947
  • Polish: 886
  • Hungarian: 811
  • Arabic: 666
  • Dutch: 628
  • Swedish: 512
  • Turkish: 454
  • Finnish: 386
  • Czech: 372
  • Danish: 358
  • Galician: 339
  • Hebrew: 255
  • Romanian: 200
  • Norwegian Bokmål: 133
  • Indonesian: 115
  • Bulgarian: 95
  • Bengali: 82
  • Persian: 72
  • Greek: 66
  • Esperanto: 59
  • Slovak: 19

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

[More Information Needed]