Update README.md
parent 06c4be1951
commit c5ca5f1f7c
30 README.md
@@ -110,7 +110,7 @@ language:
 tags:
 - human-feedback
 size_categories:
-- 10K<n<100K
+- 100K<n<1M
 pretty_name: OpenAssistant Conversations
 ---
 
@@ -249,8 +249,8 @@ all messages of a message tree or to select trees by their state.
 ### Ready For Export Trees
 
 ```
-2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
-2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
+2023-04-12_oasst_ready.trees.jsonl.gz 10,364 trees with 88,838 total messages
+2023-04-12_oasst_ready.messages.jsonl.gz 88,838 messages
 ```
 Trees in `ready_for_export` state without spam and deleted messages including message labels.
 The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
@@ -259,8 +259,8 @@ The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SF
 ### All Trees
 
 ```
-2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
-2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
+2023-04-12_oasst_all.trees.jsonl.gz 66,497 trees with 161,443 total messages
+2023-04-12_oasst_all.messages.jsonl.gz 161,443 messages
 ```
 All trees including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
 `aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.
@@ -274,12 +274,28 @@ All trees including those in states `prompt_lottery_waiting` (trees that consist
 Messages which were deleted or have a negative review result (`"review_result": false`).
 Beside low quality a frequent reason for message deletion is a wrong language tag.
 
-
 ```
 2023-04-12_oasst_prompts.messages.jsonl.gz
 ```
 All non-deleted initial prompt messages with positive spam review result of trees in `ready_for_export` or `prompt_lottery_waiting` state.
 
+### Using the Huggingface Datasets
+
+While HF datasets is ideal for tabular datasets it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
+Nevertheless we make all messages which can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available as parquet train/validation
+split which is directly loadable by the [Huggingface Datasets](https://pypi.org/project/datasets/).
+
+To load the oasst1 train & validation splits use:
+
+```python
+from datasets import load_dataset
+ds = load_dataset("OpenAssistant/oasst1")
+train = ds['train'] # len(train)=84437 (95%)
+val = ds['validation'] # len(val)=4401 (5%)
+```
+
+The messages appear in depth-first order of the message trees.
+
 
 ### Languages
 
@@ -332,4 +348,4 @@ OpenAssistant Conversations incorporates 35 different languages with a distribut
 
 - Discord [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
 - GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
-- E-Mail: [open-assistent@laion.ai](mailto:open-assistent@laion.ai) (yes, with e)
+- E-Mail: [open-assistent@laion.ai](mailto:open-assistent@laion.ai) (yes, with e)
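
Note on the added README text: it says the flat message list follows the depth-first order of the message trees, but it does not show how to get back to conversation threads. The sketch below is one possible way to do that, not the project's own tooling. It assumes each message is a dict with `message_id` and `parent_id` keys, where `parent_id` is `None` for an initial prompt; those field names are assumptions and do not appear in this diff, and the sample rows are hand-made.

```python
from collections import defaultdict


def build_threads(messages):
    """Return every root-to-leaf thread found in a flat list of message dicts.

    Assumes (hypothetically) that each dict has "message_id" and "parent_id",
    and that "parent_id" is None for the initial prompt of a tree.
    """
    children = defaultdict(list)  # parent id -> replies, in input order
    roots = []
    for msg in messages:
        parent = msg.get("parent_id")
        if parent is None:
            roots.append(msg)
        else:
            children[parent].append(msg)

    threads = []

    def walk(msg, path):
        path = path + [msg]  # copy, so sibling branches do not share state
        replies = children.get(msg["message_id"], [])
        if not replies:
            threads.append(path)  # reached a leaf: one complete thread
        for reply in replies:
            walk(reply, path)

    for root in roots:
        walk(root, [])
    return threads


# Hand-made sample rows, not real dataset content.
sample = [
    {"message_id": "a", "parent_id": None, "text": "Hi"},
    {"message_id": "b", "parent_id": "a", "text": "Hello!"},
    {"message_id": "c", "parent_id": "b", "text": "Thanks"},
    {"message_id": "d", "parent_id": "a", "text": "Hey"},
]
threads = build_threads(sample)
thread_ids = [[m["message_id"] for m in t] for t in threads]
```

Grouping by parent id keeps the sketch independent of the row order, so it would also work if the depth-first ordering were not preserved after shuffling or filtering.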
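
Note on the `.jsonl.gz` exports listed in the README: the file extension suggests gzip-compressed JSON Lines, one JSON object per line. Under that assumption, a minimal reader using only the Python standard library could look like the sketch below. The helper name `read_jsonl_gz` is hypothetical, and the structure of each decoded object is not described in this diff.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a gzip-compressed file."""
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)
```

With a locally downloaded copy of an export, the call would be `read_jsonl_gz("2023-04-12_oasst_ready.trees.jsonl.gz")`. Because the function is a generator, the file is decoded line by line rather than loaded into memory at once.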