oasst1/README.md
2023-04-15 00:16:00 +00:00

license: apache-2.0
dataset_info:
  features:
    - name: message_id
      dtype: string
    - name: parent_id
      dtype: string
    - name: user_id
      dtype: string
    - name: created_date
      dtype: string
    - name: text
      dtype: string
    - name: role
      dtype: string
    - name: lang
      dtype: string
    - name: review_count
      dtype: int32
    - name: review_result
      dtype: bool
    - name: deleted
      dtype: bool
    - name: rank
      dtype: int32
    - name: synthetic
      dtype: bool
    - name: model_name
      dtype: string
    - name: detoxify
      struct:
        - name: toxicity
          dtype: float64
        - name: severe_toxicity
          dtype: float64
        - name: obscene
          dtype: float64
        - name: identity_attack
          dtype: float64
        - name: insult
          dtype: float64
        - name: threat
          dtype: float64
        - name: sexual_explicit
          dtype: float64
    - name: message_tree_id
      dtype: string
    - name: tree_state
      dtype: string
    - name: emojis
      sequence:
        - name: name
          dtype: string
        - name: count
          dtype: int32
    - name: labels
      sequence:
        - name: name
          dtype: string
        - name: value
          dtype: float64
        - name: count
          dtype: int32
  splits:
    - name: train
      num_bytes: 100367999
      num_examples: 84437
    - name: validation
      num_bytes: 5243405
      num_examples: 4401
  download_size: 41596430
  dataset_size: 105611404

Dataset Card for OASST1

Dataset Description

Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
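The feature list in the card metadata includes message_id, parent_id, and message_tree_id, which suggests that the conversation trees described above can be rebuilt from flat message rows. The following is a minimal sketch of that idea, not the project's own tooling. It assumes that a root message has a parent_id of None; the helper names and the toy rows are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def build_children_index(messages):
    """Map each parent_id to the list of its direct replies."""
    children = defaultdict(list)
    for msg in messages:
        children[msg["parent_id"]].append(msg)
    return children


def iter_threads(messages):
    """Yield every root-to-leaf path as a list of message rows."""
    children = build_children_index(messages)

    def walk(node, path):
        path = path + [node]
        replies = children.get(node["message_id"], [])
        if not replies:
            # A message without replies ends one conversation thread.
            yield path
        for reply in replies:
            yield from walk(reply, path)

    # Assumption: root messages are the rows whose parent_id is None.
    for root in children.get(None, []):
        yield from walk(root, [])


# Hypothetical toy rows for illustration only, not real dataset content.
rows = [
    {"message_id": "a", "parent_id": None, "message_tree_id": "a", "text": "Hi"},
    {"message_id": "b", "parent_id": "a", "message_tree_id": "a", "text": "Hello!"},
    {"message_id": "c", "parent_id": "a", "message_tree_id": "a", "text": "Hey."},
    {"message_id": "d", "parent_id": "b", "message_tree_id": "a", "text": "Thanks"},
]

threads = [[m["message_id"] for m in thread] for thread in iter_threads(rows)]
print(threads)  # [['a', 'b', 'd'], ['a', 'c']]
```

Each yielded thread is one linear conversation, which is often the unit wanted for fine-tuning; a tree with several replies to the same message produces several threads that share a prefix.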

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

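Each example has the following fields, as declared in the card metadata:

String fields: message_id, parent_id, user_id, created_date, text, role, lang, model_name, message_tree_id, tree_state.
int32 fields: review_count, rank.
bool fields: review_result, deleted, synthetic.
detoxify: a struct of seven float64 scores (toxicity, severe_toxicity, obscene, identity_attack, insult, threat, sexual_explicit).
emojis: a sequence of (name: string, count: int32) entries.
labels: a sequence of (name: string, value: float64, count: int32) entries.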
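The feature list in the card metadata (lang, deleted, and the detoxify struct with its toxicity score) is enough to sketch a simple row filter. This is an illustration under stated assumptions rather than a recommended pipeline: the Hub repository id "OpenAssistant/oasst1", the 0.1 toxicity cutoff, the choice to keep rows that have no detoxify scores, and the function names are all assumptions made for this example.

```python
def load_oasst1(split="train"):
    """Load one split with the Hugging Face `datasets` library (needs network access)."""
    from datasets import load_dataset  # third-party, imported lazily

    # Assumption: the dataset lives under this repository id on the Hub.
    return load_dataset("OpenAssistant/oasst1", split=split)


def keep_message(row, lang="en", max_toxicity=0.1):
    """Return True for non-deleted rows in `lang` below the toxicity cutoff."""
    if row["deleted"] or row["lang"] != lang:
        return False
    scores = row.get("detoxify")
    # Assumption: rows without detoxify scores are kept rather than dropped.
    return scores is None or scores["toxicity"] < max_toxicity


# Hypothetical toy rows for illustration only, not real dataset content.
rows = [
    {"message_id": "m1", "lang": "en", "deleted": False, "detoxify": {"toxicity": 0.01}},
    {"message_id": "m2", "lang": "en", "deleted": False, "detoxify": {"toxicity": 0.9}},
    {"message_id": "m3", "lang": "de", "deleted": False, "detoxify": {"toxicity": 0.01}},
    {"message_id": "m4", "lang": "en", "deleted": True, "detoxify": None},
    {"message_id": "m5", "lang": "en", "deleted": False, "detoxify": None},
]

kept = [row["message_id"] for row in rows if keep_message(row)]
print(kept)  # ['m1', 'm5']
```

With the real data, the same predicate could be passed to the filter method of the object returned by load_oasst1, for example load_oasst1("train").filter(keep_message).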

Data Splits

According to the card metadata, the dataset provides two splits: train (84,437 examples) and validation (4,401 examples).
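The split figures in the card metadata can be sanity-checked with simple arithmetic: the per-split byte counts should add up to the reported dataset_size. A minimal sketch, with the numbers copied from the metadata and illustrative variable names:

```python
# Figures copied from the card metadata.
splits = {
    "train": {"num_bytes": 100367999, "num_examples": 84437},
    "validation": {"num_bytes": 5243405, "num_examples": 4401},
}
dataset_size = 105611404

total_bytes = sum(s["num_bytes"] for s in splits.values())
total_examples = sum(s["num_examples"] for s in splits.values())
validation_share = splits["validation"]["num_examples"] / total_examples

print(total_bytes == dataset_size)  # True
print(total_examples)               # 88838
print(f"validation share: {validation_share:.1%}")
```

The validation split therefore holds roughly one in twenty of the exported examples.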

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The card metadata declares the license as apache-2.0 (Apache License 2.0).

Citation Information

[More Information Needed]

Contributions

[More Information Needed]