Added language distribution to README (#2)
- Added language distribution to README (0ea9cb484b8a6c1895a8ff241424f5b9cbb9c293)

Co-authored-by: Dimitri <dvruette@users.noreply.huggingface.co>
parent 835d217b5b
commit 0a11b2a98c
50
README.md
@@ -119,7 +119,7 @@ size_categories:
 
 - **Homepage:** https://www.open-assistant.io/
 - **Repository:** https://github.com/LAION-AI/Open-Assistant
-- **Paper:** TBA
+- **Paper:** TBA on April 17, 2023
 
 ### Dataset Summary
 
@@ -129,13 +129,53 @@ corpus consisting of 161,443 messages distributed across 66,497 conversation tre
 35 different languages, annotated with 461,292 quality ratings. The corpus is a product
 of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
 
-### Supported Tasks and Leaderboards
-
-[More Information Needed]
 
 ### Languages
 
-[More Information Needed]
+OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
+
+**Languages with over 1000 messages
+- English: 71956
+- Spanish: 43061
+- Russian: 9089
+- German: 5279
+- Chinese: 4962
+- French: 4251
+- Thai: 3042
+- Portuguese (Brazil): 2969
+- Catalan: 2260
+- Korean: 1553
+- Ukrainian: 1352
+- Italian: 1320
+- Japanese: 1018
+
+<details>
+  <summary>**Languages with < 1000 messages**</summary>
+  <ul>
+    <li>Vietnamese: 952</li>
+    <li>Basque: 947</li>
+    <li>Polish: 886</li>
+    <li>Hungarian: 811</li>
+    <li>Arabic: 666</li>
+    <li>Dutch: 628</li>
+    <li>Swedish: 512</li>
+    <li>Turkish: 454</li>
+    <li>Finnish: 386</li>
+    <li>Czech: 372</li>
+    <li>Danish: 358</li>
+    <li>Galician: 339</li>
+    <li>Hebrew: 255</li>
+    <li>Romanian: 200</li>
+    <li>Norwegian Bokmål: 133</li>
+    <li>Indonesian: 115</li>
+    <li>Bulgarian: 95</li>
+    <li>Bengali: 82</li>
+    <li>Persian: 72</li>
+    <li>Greek: 66</li>
+    <li>Esperanto: 59</li>
+    <li>Slovak: 19</li>
+  </ul>
+</details>
 
 ## Dataset Structure
 
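For readers who want to sanity-check the distribution added in this commit, here is a minimal Python sketch that tallies the per-language counts exactly as listed in the diff and turns them into shares. The names `counts` and `language_shares` are illustrative and not part of the dataset or its tooling. The 35 listed counts sum to 160,519, slightly below the 161,443 messages quoted in the Dataset Summary; the README does not explain the difference, so the shares below are relative to the listed total only.

```python
# Per-language message counts, copied verbatim from the list added to README.md.
counts = {
    "English": 71956, "Spanish": 43061, "Russian": 9089, "German": 5279,
    "Chinese": 4962, "French": 4251, "Thai": 3042, "Portuguese (Brazil)": 2969,
    "Catalan": 2260, "Korean": 1553, "Ukrainian": 1352, "Italian": 1320,
    "Japanese": 1018, "Vietnamese": 952, "Basque": 947, "Polish": 886,
    "Hungarian": 811, "Arabic": 666, "Dutch": 628, "Swedish": 512,
    "Turkish": 454, "Finnish": 386, "Czech": 372, "Danish": 358,
    "Galician": 339, "Hebrew": 255, "Romanian": 200, "Norwegian Bokmål": 133,
    "Indonesian": 115, "Bulgarian": 95, "Bengali": 82, "Persian": 72,
    "Greek": 66, "Esperanto": 59, "Slovak": 19,
}


def language_shares(counts):
    """Return each language's fraction of the listed total (hypothetical helper)."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


total = sum(counts.values())
shares = language_shares(counts)

print(f"{len(counts)} languages, {total} messages listed")
# Show the five largest languages with their share of the listed total.
for lang, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{lang:<20} {n:>6} {shares[lang]:6.1%}")
```

Splitting on the README's 1000-message threshold is then a one-line filter, for example `[lang for lang, n in counts.items() if n > 1000]`.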