oasst1/README.md

174 lines
3.1 KiB
Markdown
Raw Normal View History

2023-04-13 15:48:16 +00:00
---
license: apache-2.0
2023-04-13 22:14:10 +00:00
dataset_info:
features:
- name: message_id
dtype: string
- name: parent_id
dtype: string
- name: user_id
dtype: string
- name: created_date
dtype: string
- name: text
dtype: string
- name: role
dtype: string
- name: lang
dtype: string
- name: review_count
dtype: int32
- name: review_result
dtype: bool
- name: deleted
dtype: bool
- name: rank
dtype: int32
- name: synthetic
dtype: bool
- name: model_name
dtype: string
- name: detoxify
struct:
- name: toxicity
dtype: float64
- name: severe_toxicity
dtype: float64
- name: obscene
dtype: float64
- name: identity_attack
dtype: float64
- name: insult
dtype: float64
- name: threat
dtype: float64
- name: sexual_explicit
dtype: float64
- name: message_tree_id
dtype: string
- name: tree_state
dtype: string
- name: emojis
sequence:
- name: name
dtype: string
- name: count
dtype: int32
- name: labels
sequence:
- name: name
dtype: string
- name: value
dtype: float64
- name: count
dtype: int32
splits:
- name: train
2023-04-14 23:53:21 +00:00
num_bytes: 100367999
num_examples: 84437
- name: validation
num_bytes: 5243405
num_examples: 4401
download_size: 41596430
dataset_size: 105611404
2023-04-13 15:48:16 +00:00
---
2023-04-13 22:15:32 +00:00
2023-04-15 00:16:00 +00:00
# Dataset Card for OASST1
2023-04-13 22:15:32 +00:00
## Dataset Description
2023-04-15 00:16:00 +00:00
- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** TBA
2023-04-13 22:15:32 +00:00
### Dataset Summary
2023-04-15 00:16:00 +00:00
In an effort to democratize research on large-scale alignment, we release OpenAssistant
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation
corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
2023-04-13 22:15:32 +00:00
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
[More Information Needed]
## Dataset Structure
### Data Instances
[More Information Needed]
### Data Fields
[More Information Needed]
### Data Splits
[More Information Needed]
## Dataset Creation
### Curation Rationale
[More Information Needed]
### Source Data
#### Initial Data Collection and Normalization
[More Information Needed]
#### Who are the source language producers?
[More Information Needed]
### Annotations
#### Annotation process
[More Information Needed]
#### Who are the annotators?
[More Information Needed]
### Personal and Sensitive Information
[More Information Needed]
## Considerations for Using the Data
### Social Impact of Dataset
[More Information Needed]
### Discussion of Biases
[More Information Needed]
### Other Known Limitations
[More Information Needed]
## Additional Information
### Dataset Curators
[More Information Needed]
### Licensing Information
[More Information Needed]
### Citation Information
[More Information Needed]
### Contributions
[More Information Needed]