From 0a11b2a98ca704b0580971dc39a180f0e3b1fcde Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andreas=20K=C3=B6pf?=
Date: Sat, 15 Apr 2023 09:06:58 +0000
Subject: [PATCH] Added language distribution to README (#2)

- Added language distribution to README (0ea9cb484b8a6c1895a8ff241424f5b9cbb9c293)

Co-authored-by: Dimitri
---
 README.md | 50 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 45 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 1216d89..de131f3 100644
--- a/README.md
+++ b/README.md
@@ -119,7 +119,7 @@ size_categories:
 
 - **Homepage:** https://www.open-assistant.io/
 - **Repository:** https://github.com/LAION-AI/Open-Assistant
-- **Paper:** TBA
+- **Paper:** TBA on April 17, 2023
 
 ### Dataset Summary
 
@@ -129,13 +129,53 @@ corpus consisting of 161,443 messages distributed across 66,497 conversation tre
 35 different languages, annotated with 461,292 quality ratings. The corpus is a product of a worldwide crowd-sourcing effort involving over 13,500 volunteers.
 
-### Supported Tasks and Leaderboards
-
-[More Information Needed]
 
 ### Languages
 
-[More Information Needed]
+OpenAssistant Conversations incorporates 35 different languages with a distribution of messages as follows:
+
+**Languages with over 1000 messages
+- English: 71956
+- Spanish: 43061
+- Russian: 9089
+- German: 5279
+- Chinese: 4962
+- French: 4251
+- Thai: 3042
+- Portuguese (Brazil): 2969
+- Catalan: 2260
+- Korean: 1553
+- Ukrainian: 1352
+- Italian: 1320
+- Japanese: 1018
+
+**Languages with < 1000 messages**
+
+- Vietnamese: 952
+- Basque: 947
+- Polish: 886
+- Hungarian: 811
+- Arabic: 666
+- Dutch: 628
+- Swedish: 512
+- Turkish: 454
+- Finnish: 386
+- Czech: 372
+- Danish: 358
+- Galician: 339
+- Hebrew: 255
+- Romanian: 200
+- Norwegian Bokmål: 133
+- Indonesian: 115
+- Bulgarian: 95
+- Bengali: 82
+- Persian: 72
+- Greek: 66
+- Esperanto: 59
+- Slovak: 19
 
## Dataset Structure