oasst1/README.md

---
license: apache-2.0
dataset_info:
  features:
  - name: message_id
    dtype: string
  - name: parent_id
    dtype: string
  - name: user_id
    dtype: string
  - name: created_date
    dtype: string
  - name: text
    dtype: string
  - name: role
    dtype: string
  - name: lang
    dtype: string
  - name: review_count
    dtype: int32
  - name: review_result
    dtype: bool
  - name: deleted
    dtype: bool
  - name: rank
    dtype: int32
  - name: synthetic
    dtype: bool
  - name: model_name
    dtype: string
  - name: detoxify
    struct:
    - name: toxicity
      dtype: float64
    - name: severe_toxicity
      dtype: float64
    - name: obscene
      dtype: float64
    - name: identity_attack
      dtype: float64
    - name: insult
      dtype: float64
    - name: threat
      dtype: float64
    - name: sexual_explicit
      dtype: float64
  - name: message_tree_id
    dtype: string
  - name: tree_state
    dtype: string
  - name: emojis
    sequence:
    - name: name
      dtype: string
    - name: count
      dtype: int32
  - name: labels
    sequence:
    - name: name
      dtype: string
    - name: value
      dtype: float64
    - name: count
      dtype: int32
  splits:
  - name: train
    num_bytes: 100367999
    num_examples: 84437
  - name: validation
    num_bytes: 5243405
    num_examples: 4401
  download_size: 41596430
  dataset_size: 105611404
---

# Dataset Card for OASST1

## Dataset Description

- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** TBA

### Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation
corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]