Andreas Köpf 0a11b2a98c Added language distribution to README (#2)

- Added language distribution to README (0ea9cb484b8a6c1895a8ff241424f5b9cbb9c293)


Co-authored-by: Dimitri <dvruette@users.noreply.huggingface.co>

2023-04-15 09:06:58 +00:00



Dataset Card for OASST1

Dataset Description

Homepage: https://www.open-assistant.io/
Repository: https://github.com/LAION-AI/Open-Assistant
Paper: TBA on April 17, 2023

Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
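As a rough sense of scale, the figures in the summary can be turned into simple averages. The sketch below only divides the quoted totals; the constant and function names are illustrative, and the results are plain means that say nothing about how messages or ratings are actually distributed across trees.

```python
# Totals quoted in the dataset summary.
MESSAGES = 161_443
CONVERSATION_TREES = 66_497
QUALITY_RATINGS = 461_292


def mean_per_unit(total: int, units: int) -> float:
    """Return the plain average of `total` spread over `units`."""
    return total / units


# Simple averages; real trees and messages will vary around these values.
messages_per_tree = mean_per_unit(MESSAGES, CONVERSATION_TREES)
ratings_per_message = mean_per_unit(QUALITY_RATINGS, MESSAGES)

print(f"{messages_per_tree:.2f} messages per conversation tree on average")
print(f"{ratings_per_message:.2f} quality ratings per message on average")
```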

Languages

OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:

**Languages with over 1000 messages**

English: 71956
Spanish: 43061
Russian: 9089
German: 5279
Chinese: 4962
French: 4251
Thai: 3042
Portuguese (Brazil): 2969
Catalan: 2260
Korean: 1553
Ukrainian: 1352
Italian: 1320
Japanese: 1018

**Languages with < 1000 messages**

Vietnamese: 952
Basque: 947
Polish: 886
Hungarian: 811
Arabic: 666
Dutch: 628
Swedish: 512
Turkish: 454
Finnish: 386
Czech: 372
Danish: 358
Galician: 339
Hebrew: 255
Romanian: 200
Norwegian Bokmål: 133
Indonesian: 115
Bulgarian: 95
Bengali: 82
Persian: 72
Greek: 66
Esperanto: 59
Slovak: 19
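The counts above can be converted into per-language shares. The following is a minimal sketch that copies the listed numbers into a dictionary (the names `MESSAGE_COUNTS` and `language_shares` are illustrative, not part of the dataset). Shares are computed relative to the sum of the listed counts, which need not equal the message total quoted in the summary.

```python
# Message counts per language, copied from the lists above.
MESSAGE_COUNTS = {
    "English": 71956, "Spanish": 43061, "Russian": 9089, "German": 5279,
    "Chinese": 4962, "French": 4251, "Thai": 3042,
    "Portuguese (Brazil)": 2969, "Catalan": 2260, "Korean": 1553,
    "Ukrainian": 1352, "Italian": 1320, "Japanese": 1018,
    "Vietnamese": 952, "Basque": 947, "Polish": 886, "Hungarian": 811,
    "Arabic": 666, "Dutch": 628, "Swedish": 512, "Turkish": 454,
    "Finnish": 386, "Czech": 372, "Danish": 358, "Galician": 339,
    "Hebrew": 255, "Romanian": 200, "Norwegian Bokmål": 133,
    "Indonesian": 115, "Bulgarian": 95, "Bengali": 82, "Persian": 72,
    "Greek": 66, "Esperanto": 59, "Slovak": 19,
}


def language_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each language's fraction of the summed message counts."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


shares = language_shares(MESSAGE_COUNTS)

# Print the five largest languages by share of listed messages.
top_five = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:5]
for lang, share in top_five:
    print(f"{lang}: {share:.1%}")
```

Keeping the counts in one dictionary makes it easy to recompute the split at a different threshold than the 1000-message cutoff used above.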

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

[More Information Needed]

Contributions

[More Information Needed]


