oasst1/README.md

---
license: apache-2.0
dataset_info:
  features:
  - name: message_id
    dtype: string
  - name: parent_id
    dtype: string
  - name: user_id
    dtype: string
  - name: created_date
    dtype: string
  - name: text
    dtype: string
  - name: role
    dtype: string
  - name: lang
    dtype: string
  - name: review_count
    dtype: int32
  - name: review_result
    dtype: bool
  - name: deleted
    dtype: bool
  - name: rank
    dtype: int32
  - name: synthetic
    dtype: bool
  - name: model_name
    dtype: string
  - name: detoxify
    struct:
    - name: toxicity
      dtype: float64
    - name: severe_toxicity
      dtype: float64
    - name: obscene
      dtype: float64
    - name: identity_attack
      dtype: float64
    - name: insult
      dtype: float64
    - name: threat
      dtype: float64
    - name: sexual_explicit
      dtype: float64
  - name: message_tree_id
    dtype: string
  - name: tree_state
    dtype: string
  - name: emojis
    sequence:
    - name: name
      dtype: string
    - name: count
      dtype: int32
  - name: labels
    sequence:
    - name: name
      dtype: string
    - name: value
      dtype: float64
    - name: count
      dtype: int32
  splits:
  - name: train
    num_bytes: 100367999
    num_examples: 84437
  - name: validation
    num_bytes: 5243405
    num_examples: 4401
  download_size: 41596430
  dataset_size: 105611404
---

# Dataset Card for OASST1

## Dataset Description

- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** TBA

### Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant 
Conversations (OASST1), a human-generated, human-annotated assistant-style conversation 
corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

[More Information Needed]

### Data Fields

[More Information Needed]

### Data Splits

[More Information Needed]

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]

### Contributions

[More Information Needed]
initial commit 2023-04-13 15:48:16 +00:00			`---`
			`license: apache-2.0`
Upload README.md with huggingface_hub 2023-04-13 22:14:10 +00:00			`dataset_info:`
			`features:`
			`- name: message_id`
			`dtype: string`
			`- name: parent_id`
			`dtype: string`
			`- name: user_id`
			`dtype: string`
			`- name: created_date`
			`dtype: string`
			`- name: text`
			`dtype: string`
			`- name: role`
			`dtype: string`
			`- name: lang`
			`dtype: string`
			`- name: review_count`
			`dtype: int32`
			`- name: review_result`
			`dtype: bool`
			`- name: deleted`
			`dtype: bool`
			`- name: rank`
			`dtype: int32`
			`- name: synthetic`
			`dtype: bool`
			`- name: model_name`
			`dtype: string`
			`- name: detoxify`
			`struct:`
			`- name: toxicity`
			`dtype: float64`
			`- name: severe_toxicity`
			`dtype: float64`
			`- name: obscene`
			`dtype: float64`
			`- name: identity_attack`
			`dtype: float64`
			`- name: insult`
			`dtype: float64`
			`- name: threat`
			`dtype: float64`
			`- name: sexual_explicit`
			`dtype: float64`
			`- name: message_tree_id`
			`dtype: string`
			`- name: tree_state`
			`dtype: string`
			`- name: emojis`
			`sequence:`
			`- name: name`
			`dtype: string`
			`- name: count`
			`dtype: int32`
			`- name: labels`
			`sequence:`
			`- name: name`
			`dtype: string`
			`- name: value`
			`dtype: float64`
			`- name: count`
			`dtype: int32`
			`splits:`
			`- name: train`
Upload README.md with huggingface_hub 2023-04-14 23:53:21 +00:00			`num_bytes: 100367999`
			`num_examples: 84437`
			`- name: validation`
			`num_bytes: 5243405`
			`num_examples: 4401`
			`download_size: 41596430`
			`dataset_size: 105611404`
initial commit 2023-04-13 15:48:16 +00:00			`---`
Update README.md 2023-04-13 22:15:32 +00:00
Update README.md 2023-04-15 00:16:00 +00:00			`# Dataset Card for OASST1`
Update README.md 2023-04-13 22:15:32 +00:00
			`## Dataset Description`

Update README.md 2023-04-15 00:16:00 +00:00			`- Homepage: https://www.open-assistant.io/`
			`- Repository: https://github.com/LAION-AI/Open-Assistant`
			`- Paper: TBA`
Update README.md 2023-04-13 22:15:32 +00:00
			`### Dataset Summary`

Update README.md 2023-04-15 00:16:00 +00:00			`In an effort to democratize research on large-scale alignment, we release OpenAssistant`
			`Conversations (OASST1), a human-generated, human-annotated assistant-style conversation`
			`corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in`
			`35 different languages, annotated with 461,292 quality ratings. The corpus is a product`
			`of a worldwide crowd-sourcing effort involving over 13,500 volunteers.`
Update README.md 2023-04-13 22:15:32 +00:00
			`### Supported Tasks and Leaderboards`

			`[More Information Needed]`

			`### Languages`

			`[More Information Needed]`

			`## Dataset Structure`

			`### Data Instances`

			`[More Information Needed]`

			`### Data Fields`

			`[More Information Needed]`

			`### Data Splits`

			`[More Information Needed]`

			`## Dataset Creation`

			`### Curation Rationale`

			`[More Information Needed]`

			`### Source Data`

			`#### Initial Data Collection and Normalization`

			`[More Information Needed]`

			`#### Who are the source language producers?`

			`[More Information Needed]`

			`### Annotations`

			`#### Annotation process`

			`[More Information Needed]`

			`#### Who are the annotators?`

			`[More Information Needed]`

			`### Personal and Sensitive Information`

			`[More Information Needed]`

			`## Considerations for Using the Data`

			`### Social Impact of Dataset`

			`[More Information Needed]`

			`### Discussion of Biases`

			`[More Information Needed]`

			`### Other Known Limitations`

			`[More Information Needed]`

			`## Additional Information`

			`### Dataset Curators`

			`[More Information Needed]`

			`### Licensing Information`

			`[More Information Needed]`

			`### Citation Information`

			`[More Information Needed]`

			`### Contributions`

			`[More Information Needed]`