Added language distribution to README (#2)

- Added language distribution to README (0ea9cb484b8a6c1895a8ff241424f5b9cbb9c293)


Co-authored-by: Dimitri <dvruette@users.noreply.huggingface.co>
Andreas Köpf 2023-04-15 09:06:58 +00:00 committed by system
parent 835d217b5b
commit 0a11b2a98c

@@ -119,7 +119,7 @@ size_categories:
- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** TBA
- **Paper:** TBA on April 17, 2023
### Dataset Summary
@@ -129,13 +129,53 @@ corpus consisting of 161,443 messages distributed across 66,497 conversation tre
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
### Supported Tasks and Leaderboards
[More Information Needed]
### Languages
[More Information Needed]
OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
**Languages with over 1000 messages**
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018
<details>
<summary><b>Languages with fewer than 1000 messages</b></summary>
<ul>
<li>Vietnamese: 952</li>
<li>Basque: 947</li>
<li>Polish: 886</li>
<li>Hungarian: 811</li>
<li>Arabic: 666</li>
<li>Dutch: 628</li>
<li>Swedish: 512</li>
<li>Turkish: 454</li>
<li>Finnish: 386</li>
<li>Czech: 372</li>
<li>Danish: 358</li>
<li>Galician: 339</li>
<li>Hebrew: 255</li>
<li>Romanian: 200</li>
<li>Norwegian Bokmål: 133</li>
<li>Indonesian: 115</li>
<li>Bulgarian: 95</li>
<li>Bengali: 82</li>
<li>Persian: 72</li>
<li>Greek: 66</li>
<li>Esperanto: 59</li>
<li>Slovak: 19</li>
</ul>
</details>
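
The counts above can be turned into relative shares with a few lines of Python. This is only a minimal sketch: the `counts` dictionary is copied by hand from the list of languages with over 1000 messages, the helper name `language_shares` is hypothetical, and the resulting shares are relative to these 13 languages only, not to the full corpus.

```python
# Message counts copied from the list of languages with over 1000 messages.
# Languages with fewer messages are left out, so shares are relative to
# this subset only (an assumption of this sketch).
counts = {
    "English": 71956,
    "Spanish": 43061,
    "Russian": 9089,
    "German": 5279,
    "Chinese": 4962,
    "French": 4251,
    "Thai": 3042,
    "Portuguese (Brazil)": 2969,
    "Catalan": 2260,
    "Korean": 1553,
    "Ukrainian": 1352,
    "Italian": 1320,
    "Japanese": 1018,
}


def language_shares(counts):
    """Return each language's fraction of the total of the given counts."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


shares = language_shares(counts)

# Print languages from largest to smallest share.
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang}: {share:.1%}")
```

Adding the remaining languages to `counts` would give shares over every language listed in this section.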
## Dataset Structure