Update README.md
This commit is contained in:
parent
06c4be1951
commit
c5ca5f1f7c
30
README.md
30
README.md
@ -110,7 +110,7 @@ language:
|
|||||||
tags:
|
tags:
|
||||||
- human-feedback
|
- human-feedback
|
||||||
size_categories:
|
size_categories:
|
||||||
- 10K<n<100K
|
- 100K<n<1M
|
||||||
pretty_name: OpenAssistant Conversations
|
pretty_name: OpenAssistant Conversations
|
||||||
---
|
---
|
||||||
|
|
||||||
@ -249,8 +249,8 @@ all messages of a message tree or to select trees by their state.
|
|||||||
### Ready For Export Trees
|
### Ready For Export Trees
|
||||||
|
|
||||||
```
|
```
|
||||||
2023-04-12_oasst_ready.trees.jsonl.gz 10364 trees with 88838 total messages
|
2023-04-12_oasst_ready.trees.jsonl.gz 10,364 trees with 88,838 total messages
|
||||||
2023-04-12_oasst_ready.messages.jsonl.gz 88838 messages
|
2023-04-12_oasst_ready.messages.jsonl.gz 88,838 messages
|
||||||
```
|
```
|
||||||
Trees in `ready_for_export` state without spam and deleted messages including message labels.
|
Trees in `ready_for_export` state without spam and deleted messages including message labels.
|
||||||
The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
|
The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
|
||||||
@ -259,8 +259,8 @@ The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SF
|
|||||||
### All Trees
|
### All Trees
|
||||||
|
|
||||||
```
|
```
|
||||||
2023-04-12_oasst_all.trees.jsonl.gz 66497 trees with 161443 total messages
|
2023-04-12_oasst_all.trees.jsonl.gz 66,497 trees with 161,443 total messages
|
||||||
2023-04-12_oasst_all.messages.jsonl.gz 161443 messages
|
2023-04-12_oasst_all.messages.jsonl.gz 161,443 messages
|
||||||
```
|
```
|
||||||
All trees including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the inital prompt),
|
All trees including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the inital prompt),
|
||||||
`aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.
|
`aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.
|
||||||
@ -274,12 +274,28 @@ All trees including those in states `prompt_lottery_waiting` (trees that consist
|
|||||||
Messages which were deleted or have a negative review result (`"review_result": false`).
|
Messages which were deleted or have a negative review result (`"review_result": false`).
|
||||||
Beside low quality a frequent reason for message deletion is a wrong language tag.
|
Beside low quality a frequent reason for message deletion is a wrong language tag.
|
||||||
|
|
||||||
|
|
||||||
```
|
```
|
||||||
2023-04-12_oasst_prompts.messages.jsonl.gz
|
2023-04-12_oasst_prompts.messages.jsonl.gz
|
||||||
```
|
```
|
||||||
All non-deleted initial prompt messages with positile spam review result of trees in `ready_for_export` or `prompt_lottery_waiting` state.
|
All non-deleted initial prompt messages with positile spam review result of trees in `ready_for_export` or `prompt_lottery_waiting` state.
|
||||||
|
|
||||||
|
### Using the Huggingface Datasets
|
||||||
|
|
||||||
|
While HF datasets is ideal for tabular datasets it is not a natuaral fit for nested data structures like the OpenAssistant conversation trees.
|
||||||
|
Nevertheless we make all messages which can alse be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available as parquet train/validation
|
||||||
|
split which is directly loadable by the [Huggingface Datasets](https://pypi.org/project/datasets/).
|
||||||
|
|
||||||
|
To load the oasst1 train & validation splits use:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from datasets import load_dataset
|
||||||
|
ds = load_dataset("OpenAssistant/oasst1")
|
||||||
|
train = ds['train'] # len(train)=84437 (95%)
|
||||||
|
val = ds['validation'] # len(val)=4401 (5%)
|
||||||
|
```
|
||||||
|
|
||||||
|
The messages appear in depth-first order of the message trees.
|
||||||
|
|
||||||
|
|
||||||
### Languages
|
### Languages
|
||||||
|
|
||||||
@ -332,4 +348,4 @@ OpenAssistant Conversations incorporates 35 different languages with a distribut
|
|||||||
|
|
||||||
- Discord [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
|
- Discord [Open Assistant Discord Server](https://ykilcher.com/open-assistant-discord)
|
||||||
- GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
|
- GitHub: [LAION-AI/Open-Assistant](https://github.com/LAION-AI/Open-Assistant)
|
||||||
- E-Mail: [open-assistent@laion.ai](mailto:open-assistent@laion.ai) (yes, with e)
|
- E-Mail: [open-assistent@laion.ai](mailto:open-assistent@laion.ai) (yes, with e)
|
Loading…
x
Reference in New Issue
Block a user