174 lines
3.1 KiB
Markdown
174 lines
3.1 KiB
Markdown
---
|
|
license: apache-2.0
|
|
dataset_info:
|
|
features:
|
|
- name: message_id
|
|
dtype: string
|
|
- name: parent_id
|
|
dtype: string
|
|
- name: user_id
|
|
dtype: string
|
|
- name: created_date
|
|
dtype: string
|
|
- name: text
|
|
dtype: string
|
|
- name: role
|
|
dtype: string
|
|
- name: lang
|
|
dtype: string
|
|
- name: review_count
|
|
dtype: int32
|
|
- name: review_result
|
|
dtype: bool
|
|
- name: deleted
|
|
dtype: bool
|
|
- name: rank
|
|
dtype: int32
|
|
- name: synthetic
|
|
dtype: bool
|
|
- name: model_name
|
|
dtype: string
|
|
- name: detoxify
|
|
struct:
|
|
- name: toxicity
|
|
dtype: float64
|
|
- name: severe_toxicity
|
|
dtype: float64
|
|
- name: obscene
|
|
dtype: float64
|
|
- name: identity_attack
|
|
dtype: float64
|
|
- name: insult
|
|
dtype: float64
|
|
- name: threat
|
|
dtype: float64
|
|
- name: sexual_explicit
|
|
dtype: float64
|
|
- name: message_tree_id
|
|
dtype: string
|
|
- name: tree_state
|
|
dtype: string
|
|
- name: emojis
|
|
sequence:
|
|
- name: name
|
|
dtype: string
|
|
- name: count
|
|
dtype: int32
|
|
- name: labels
|
|
sequence:
|
|
- name: name
|
|
dtype: string
|
|
- name: value
|
|
dtype: float64
|
|
- name: count
|
|
dtype: int32
|
|
splits:
|
|
- name: train
|
|
num_bytes: 100367999
|
|
num_examples: 84437
|
|
- name: validation
|
|
num_bytes: 5243405
|
|
num_examples: 4401
|
|
download_size: 41596430
|
|
dataset_size: 105611404
|
|
---
|
|
|
|
# Dataset Card for OASST1
|
|
|
|
## Dataset Description
|
|
|
|
- **Homepage:** https://www.open-assistant.io/
|
|
- **Repository:** https://github.com/LAION-AI/Open-Assistant
|
|
- **Paper:** TBA
|
|
|
|
### Dataset Summary
|
|
|
|
In an effort to democratize research on large-scale alignment, we release OpenAssistant
|
|
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation
|
|
corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in
|
|
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
|
|
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
|
|
|
|
### Supported Tasks and Leaderboards
|
|
|
|
[More Information Needed]
|
|
|
|
### Languages
|
|
|
|
[More Information Needed]
|
|
|
|
## Dataset Structure
|
|
|
|
### Data Instances
|
|
|
|
[More Information Needed]
|
|
|
|
### Data Fields
|
|
|
|
[More Information Needed]
|
|
|
|
### Data Splits
|
|
|
|
[More Information Needed]
|
|
|
|
## Dataset Creation
|
|
|
|
### Curation Rationale
|
|
|
|
[More Information Needed]
|
|
|
|
### Source Data
|
|
|
|
#### Initial Data Collection and Normalization
|
|
|
|
[More Information Needed]
|
|
|
|
#### Who are the source language producers?
|
|
|
|
[More Information Needed]
|
|
|
|
### Annotations
|
|
|
|
#### Annotation process
|
|
|
|
[More Information Needed]
|
|
|
|
#### Who are the annotators?
|
|
|
|
[More Information Needed]
|
|
|
|
### Personal and Sensitive Information
|
|
|
|
[More Information Needed]
|
|
|
|
## Considerations for Using the Data
|
|
|
|
### Social Impact of Dataset
|
|
|
|
[More Information Needed]
|
|
|
|
### Discussion of Biases
|
|
|
|
[More Information Needed]
|
|
|
|
### Other Known Limitations
|
|
|
|
[More Information Needed]
|
|
|
|
## Additional Information
|
|
|
|
### Dataset Curators
|
|
|
|
[More Information Needed]
|
|
|
|
### Licensing Information
|
|
|
|
[More Information Needed]
|
|
|
|
### Citation Information
|
|
|
|
[More Information Needed]
|
|
|
|
### Contributions
|
|
|
|
[More Information Needed] |