Added language distribution to README (#2)
- Added language distribution to README (0ea9cb484b8a6c1895a8ff241424f5b9cbb9c293)

Co-authored-by: Dimitri <dvruette@users.noreply.huggingface.co>
parent 835d217b5b
commit 0a11b2a98c
README.md: 50 changed lines
@@ -119,7 +119,7 @@ size_categories:

- **Homepage:** https://www.open-assistant.io/
- **Repository:** https://github.com/LAION-AI/Open-Assistant
- **Paper:** TBA
- **Paper:** TBA on April 17, 2023

### Dataset Summary

@@ -129,13 +129,53 @@ corpus consisting of 161,443 messages distributed across 66,497 conversation tre
35 different languages, annotated with 461,292 quality ratings. The corpus is a product
of a worldwide crowd-sourcing effort involving over 13,500 volunteers.

### Supported Tasks and Leaderboards

[More Information Needed]

### Languages

[More Information Needed]
OpenAssistant Conversations covers 35 languages, with messages distributed as follows:

**Languages with over 1000 messages**
- English: 71956
- Spanish: 43061
- Russian: 9089
- German: 5279
- Chinese: 4962
- French: 4251
- Thai: 3042
- Portuguese (Brazil): 2969
- Catalan: 2260
- Korean: 1553
- Ukrainian: 1352
- Italian: 1320
- Japanese: 1018

<details>
<summary><b>Languages with fewer than 1000 messages</b></summary>
<ul>
<li>Vietnamese: 952</li>
<li>Basque: 947</li>
<li>Polish: 886</li>
<li>Hungarian: 811</li>
<li>Arabic: 666</li>
<li>Dutch: 628</li>
<li>Swedish: 512</li>
<li>Turkish: 454</li>
<li>Finnish: 386</li>
<li>Czech: 372</li>
<li>Danish: 358</li>
<li>Galician: 339</li>
<li>Hebrew: 255</li>
<li>Romanian: 200</li>
<li>Norwegian Bokmål: 133</li>
<li>Indonesian: 115</li>
<li>Bulgarian: 95</li>
<li>Bengali: 82</li>
<li>Persian: 72</li>
<li>Greek: 66</li>
<li>Esperanto: 59</li>
<li>Slovak: 19</li>
</ul>
</details>
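
The per-language counts listed above can be converted into percentage shares of the corpus. The following is a minimal sketch for illustration only, not part of the dataset's tooling: the names `MESSAGE_COUNTS` and `language_shares` are hypothetical, and the numbers are copied by hand from the two lists.

```python
# Message counts per language, copied by hand from the lists above.
# MESSAGE_COUNTS and language_shares are illustrative names, not dataset APIs.
MESSAGE_COUNTS = {
    "English": 71956, "Spanish": 43061, "Russian": 9089, "German": 5279,
    "Chinese": 4962, "French": 4251, "Thai": 3042,
    "Portuguese (Brazil)": 2969, "Catalan": 2260, "Korean": 1553,
    "Ukrainian": 1352, "Italian": 1320, "Japanese": 1018,
    "Vietnamese": 952, "Basque": 947, "Polish": 886, "Hungarian": 811,
    "Arabic": 666, "Dutch": 628, "Swedish": 512, "Turkish": 454,
    "Finnish": 386, "Czech": 372, "Danish": 358, "Galician": 339,
    "Hebrew": 255, "Romanian": 200, "Norwegian Bokmål": 133,
    "Indonesian": 115, "Bulgarian": 95, "Bengali": 82, "Persian": 72,
    "Greek": 66, "Esperanto": 59, "Slovak": 19,
}


def language_shares(counts):
    """Return each language's share of all listed messages, in percent."""
    total = sum(counts.values())
    return {lang: 100 * n / total for lang, n in counts.items()}


shares = language_shares(MESSAGE_COUNTS)
total_messages = sum(MESSAGE_COUNTS.values())

# Print the five largest languages with their share of the listed messages.
top_five = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:5]
for lang, pct in top_five:
    print(f"{lang}: {pct:.1f}%")
```

With the figures as listed, the counts sum to 160,519 messages, of which English and Spanish together account for roughly 72%. That total is slightly below the 161,443 messages quoted in the Dataset Summary; the source does not explain the difference.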

## Dataset Structure
