Update README.md

Andreas Köpf 2023-04-15 15:47:38 +00:00 committed by huggingface-web
parent 06c4be1951
commit c5ca5f1f7c

@@ -110,7 +110,7 @@ language:
tags:
- human-feedback
size_categories:
-- 10K<n<100K
+- 100K<n<1M
pretty_name: OpenAssistant Conversations
---
@@ -249,8 +249,8 @@ all messages of a message tree or to select trees by their state.
### Ready For Export Trees
```
-2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
-2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
+2023-04-12_oasst_ready.trees.jsonl.gz 10,364 trees with 88,838 total messages
+2023-04-12_oasst_ready.messages.jsonl.gz 88,838 messages
```
Trees in the `ready_for_export` state, excluding spam and deleted messages and including message labels.
The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
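
As a minimal sketch, a `.jsonl.gz` export like the ones listed above can be read line by line with the Python standard library. The helper name `iter_jsonl_gz` is hypothetical, and the code assumes only that each line of the file holds one JSON object; it makes no assumption about the fields inside a tree.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_jsonl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    # "rt" opens the gzip stream in text mode so lines decode as UTF-8 strings.
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Calling `list(iter_jsonl_gz("2023-04-12_oasst_ready.trees.jsonl.gz"))` on a downloaded copy of the file should then give one dictionary per tree.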
@@ -259,8 +259,8 @@ The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SF
### All Trees
```
-2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
-2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
+2023-04-12_oasst_all.trees.jsonl.gz 66,497 trees with 161,443 total messages
+2023-04-12_oasst_all.messages.jsonl.gz 161,443 messages
```
All trees, including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
`aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.
@@ -274,12 +274,28 @@ All trees including those in states `prompt_lottery_waiting` (trees that consist
Messages which were deleted or have a negative review result (`"review_result": false`).
Besides low quality, a frequent reason for message deletion is a wrong language tag.
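
A small sketch of how such messages might be filtered out of a list of message dictionaries. The `review_result` key is taken from the description above; the `deleted` key and the helper name `is_clean` are assumptions made for illustration only.

```python
from typing import Any, Dict, List


def is_clean(message: Dict[str, Any]) -> bool:
    """Return True unless the message is deleted or has a negative review result.

    `review_result` appears in the text above; `deleted` is an assumed key.
    A missing `review_result` is treated as "not negative" in this sketch.
    """
    if message.get("deleted", False):
        return False
    return message.get("review_result") is not False


# Toy data, not taken from the dataset.
messages: List[Dict[str, Any]] = [
    {"text": "ok", "review_result": True},
    {"text": "spam", "review_result": False},
    {"text": "gone", "review_result": True, "deleted": True},
]
clean = [m for m in messages if is_clean(m)]
print([m["text"] for m in clean])  # ['ok']
```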
```
2023-04-12_oasst_prompts.messages.jsonl.gz
```
All non-deleted initial prompt messages with a positive spam review result, from trees in the `ready_for_export` or `prompt_lottery_waiting` state.
### Using the Huggingface Datasets
While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
Nevertheless, we make all messages that can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available as a parquet train/validation
split, which is directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).
To load the oasst1 train & validation splits use:
```python
from datasets import load_dataset
ds = load_dataset("OpenAssistant/oasst1")
train = ds['train'] # len(train)=84437 (95%)
val = ds['validation'] # len(val)=4401 (5%)
```
The messages appear in depth-first order of the message trees.
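
To illustrate what depth-first order means for a flat list of messages, here is a minimal sketch. It assumes each message carries `message_id` and `parent_id` keys, with `parent_id` set to `None` for an initial prompt; these field names are assumptions not stated in this section, and `build_children_index` and `depth_first` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List, Optional

Message = Dict[str, Any]


def build_children_index(messages: List[Message]) -> Dict[Optional[str], List[Message]]:
    """Group messages by their (assumed) `parent_id`; roots are stored under None."""
    children: Dict[Optional[str], List[Message]] = defaultdict(list)
    for m in messages:
        children[m.get("parent_id")].append(m)
    return children


def depth_first(messages: List[Message]) -> Iterator[Message]:
    """Yield messages in depth-first (pre-order) order, one tree after another."""
    children = build_children_index(messages)
    # Reverse so that the first root is popped first from the stack.
    stack = list(reversed(children[None]))
    while stack:
        m = stack.pop()
        yield m
        # Push replies so that the first reply is visited next.
        stack.extend(reversed(children[m["message_id"]]))
```

With such an index, a conversation thread can be followed from a prompt down to any reply without loading the nested trees file.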
### Languages
@@ -332,4 +348,4 @@ OpenAssistant Conversations incorporates 35 different languages with a distribut
- Discord: [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
- GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
- E-Mail: [open-assistent@laion.ai](mailto:open-assistent@laion.ai) (yes, with e)
- E-Mail: [open-assistent@laion.ai](mailto:open-assistent@laion.ai) (yes, with e)