Update README.md (#2)
Update README.md (75b6ad5f3cc2c88d86e676c39d143d090169c948)
Co-authored-by: Aymeric Roucher <A-Roucher@users.noreply.huggingface.co>
parent be45a9e2ae · commit 25fb0412db
README.md (11 changes)
@@ -89,10 +89,15 @@ dataset_info:
### Dataset Summary
GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. The dataset was created to support the task of question answering on basic mathematical problems that require multi-step reasoning.
- These problems take between 2 and 8 steps to solve.
- Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+, −, ×, ÷) to reach the final answer.
- A bright middle school student should be able to solve every problem: from the paper, "Problems require no concepts beyond the level of early Algebra, and the vast majority of problems can be solved without explicitly defining a variable."
- Solutions are provided in natural language, as opposed to pure math expressions. From the paper: "We believe this is the most generally useful data format, and we expect it to shed light on the properties of large language models’ internal monologues."
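The solution format described above can be sketched in code. This is an illustrative parser, assuming the released GSM8K data format in which each natural-language solution ends with a `#### <answer>` line; the sample solution string below is hypothetical, not taken from the dataset:

```python
def parse_gsm8k_solution(solution: str) -> tuple[list[str], str]:
    """Split a GSM8K-style solution into its reasoning steps and final answer.

    Assumes the final answer follows a '#### ' marker on the last line,
    as in the released GSM8K data.
    """
    body, _, final = solution.rpartition("#### ")
    steps = [line.strip() for line in body.strip().splitlines() if line.strip()]
    return steps, final.strip()

# Hypothetical two-step solution in the dataset's format:
example = (
    "She sold 48/2 = 24 clips in May.\n"
    "She sold 48 + 24 = 72 clips altogether.\n"
    "#### 72"
)
steps, answer = parse_gsm8k_solution(example)
print(len(steps), answer)  # 2 72
```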
### Supported Tasks and Leaderboards
[Needs More Information]
This dataset is generally used to test the logical and mathematical reasoning of language models.
It has been used for many benchmarks, including the [LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
### Languages
@@ -146,9 +151,9 @@ The data fields are the same among `main` and `socratic` configurations and thei
#### Initial Data Collection and Normalization
From the paper, appendix A:
> We initially collected a starting set of a thousand problems and natural language solutions by hiring freelance contractors on Upwork (upwork.com). We then worked with Surge AI (surgehq.ai), an NLP data labeling platform, to scale up our data collection. After collecting the full dataset, we asked workers to re-solve all problems, with no workers re-solving problems they originally wrote. We checked whether their final answers agreed with the original solutions, and any problems that produced disagreements were either repaired or discarded. We then performed another round of agreement checks on a smaller subset of problems, finding that 1.7% of problems still produce disagreements among contractors. We estimate this to be the fraction of problems that contain breaking errors or ambiguities. It is possible that a larger percentage of problems contain subtle errors.
#### Who are the source language producers?