diff --git a/README.md b/README.md
index 370882c..fa1cc54 100644
--- a/README.md
+++ b/README.md
@@ -134,18 +134,18 @@ Please refer to our [paper](https://www.ykilcher.com/OA_Paper_2023_04_15.pdf) fo

 ### Dataset Structure

-This dataset contains message trees which each have an inital prompt message as root which can have
-multiple child messages as replies which itself again can have multiple replies.
+This dataset contains message trees. Each message tree has an initial prompt message as the root node,
+which can have multiple child messages as replies, and these child messages can have multiple replies.

-All messages have a role property which can either be "assistant" or "prompter". The roles in
-conversation threads from prompt to leaf node are stricly alternating between "prompter" and "assistant".
+All messages have a role property: this can either be "assistant" or "prompter". The roles in
+conversation threads from prompt to leaf node strictly alternate between "prompter" and "assistant".

-This version of the dataset contains data collected on the [open-assistant.io](https://www.open-assistant.io/) website until April, 12 2023.
+This version of the dataset contains data collected on the [open-assistant.io](https://www.open-assistant.io/) website until April 12 2023.

 ### JSON Example: Message

-For readability the following JSON examples are shown formatted with indentation on multiple lines.
-Objects are stored without indentation on a single lines in the actual jsonl files.
+For readability, the following JSON examples are shown formatted with indentation on multiple lines.
+Objects are stored without indentation (on single lines) in the actual jsonl files.

 ```json
 {
@@ -179,7 +179,7 @@ Objects are stored without indentation on a single lines in the actual jsonl fil

 ### JSON Example: Conversation Tree

-For readability only a subset of the message properties is shown here.
+For readability, only a subset of the message properties is shown here.

 ```json
 {
@@ -236,7 +236,7 @@ details about the data structure and Python code to read and write jsonl files c
 ## Main Dataset Files

 Conversation data is provided either as nested messages in trees (extension `.trees.jsonl.gz`)
-or as flat list (table) of messages (extension `.messages.jsonl.gz`).
+or as a flat list (table) of messages (extension `.messages.jsonl.gz`).

 ### Ready For Export Trees

@@ -245,7 +245,7 @@ or as flat list (table) of messages (extension `.messages.jsonl.gz`).
 2023-04-12_oasst_ready.messages.jsonl.gz 88,838 messages
 ```
 Trees in `ready_for_export` state without spam and deleted messages including message labels.
-The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SFT) & reward model (RM) training.
+The oasst_ready-trees file usually is sufficient for supervised fine-tuning (SFT) & reward model (RM) training.

 ### All Trees

@@ -254,7 +254,7 @@ The oasst_ready-trees file is normally sufficient for supervised fine-tuning (SF
 2023-04-12_oasst_all.trees.jsonl.gz 66,497 trees with 161,443 total messages
 2023-04-12_oasst_all.messages.jsonl.gz 161,443 messages
 ```
-All trees including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the inital prompt),
+All trees, including those in states `prompt_lottery_waiting` (trees that consist of only one message, namely the initial prompt),
 `aborted_low_grade` (trees that stopped growing because the messages had low quality), and `halted_by_moderator`.


@@ -263,19 +263,19 @@ All trees including those in states `prompt_lottery_waiting` (trees that consist
 ```
 2023-04-12_oasst_spam.messages.jsonl.gz
 ```
-Messages which were deleted or have a negative review result (`"review_result": false`).
-Beside low quality a frequent reason for message deletion is a wrong language tag.
+These are messages which were deleted or have a negative review result (`"review_result": false`).
+Besides low quality, a frequent reason for message deletion is a wrong language tag.

 ```
 2023-04-12_oasst_prompts.messages.jsonl.gz
 ```
-All non-deleted initial prompt messages with positile spam review result of trees in `ready_for_export` or `prompt_lottery_waiting` state.
+These are all the kept initial prompt messages with positive spam review result of trees in `ready_for_export` or `prompt_lottery_waiting` state.

 ### Using the Huggingface Datasets

-While HF datasets is ideal for tabular datasets it is not a natuaral fit for nested data structures like the OpenAssistant conversation trees.
-Nevertheless we make all messages which can alse be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available as parquet train/validation
-split which is directly loadable by the [Huggingface Datasets](https://pypi.org/project/datasets/).
+While HF datasets is ideal for tabular datasets, it is not a natural fit for nested data structures like the OpenAssistant conversation trees.
+Nevertheless, we make all messages which can also be found in the file `2023-04-12_oasst_ready.trees.jsonl.gz` available in parquet as train/validation splits.
+These are directly loadable by [Huggingface Datasets](https://pypi.org/project/datasets/).

 To load the oasst1 train & validation splits use:

@@ -290,8 +290,7 @@ The messages appear in depth-first order of the message trees.

 Full conversation trees can be reconstructed from the flat messages table by using the `parent_id`
 and `message_id` properties to identify the parent-child relationship of messages. The `message_tree_id`
-and `tree_state` properties (only present in flat messages files) can be used to find all
-all messages of a message tree or to select trees by their state.
+and `tree_state` properties (only present in flat messages files) can be used to find all messages of a message tree or to select trees by their state.

 ### Languages
