- Added language distribution to README (0ea9cb484b8a6c1895a8ff241424f5b9cbb9c293) Co-authored-by: Dimitri <dvruette@users.noreply.huggingface.co>
4.2 KiB
license | dataset_info | language | tags | size_categories | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
apache-2.0 |
|
|
|
|
Dataset Card for OASST1
Dataset Description
- Homepage: https://www.open-assistant.io/
- Repository: https://github.com/LAION-AI/Open-Assistant
- Paper: TBA on April 17, 2023
Dataset Summary
In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
Languages
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
**Languages with over 1000 messages
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
**Languages with < 1000 messages**
- Vietnamese: 952
- Basque: 947
- Polish: 886
- Hungarian: 811
- Arabic: 666
- Dutch: 628
- Swedish: 512
- Turkish: 454
- Finnish: 386
- Czech: 372
- Danish: 358
- Galician: 339
- Hebrew: 255
- Romanian: 200
- Norwegian Bokmål: 133
- Indonesian: 115
- Bulgarian: 95
- Bengali: 82
- Persian: 72
- Greek: 66
- Esperanto: 59
- Slovak: 19
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
[More Information Needed]
Citation Information
[More Information Needed]
Contributions
[More Information Needed]