3 Arithmetic Reasoning

We begin by considering math word problems of the form in Figure 1, which measure the arithmetic reasoning ability of language models. Though simple for humans, arithmetic reasoning is a task where language models often struggle (Hendrycks et al., 2021; Patel et al., 2021, inter alia). Strikingly, chain-of-thought prompting used with a 540B-parameter language model performs comparably with task-specific finetuned models on several tasks, even achieving new state of the art on the challenging GSM8K benchmark (Cobbe et al., 2021).
3.1 Experimental Setup
We explore chain-of-thought prompting for various language models on multiple benchmarks.
Benchmarks. We consider the following five math word problem benchmarks: (1) the GSM8K benchmark of math word problems (Cobbe et al., 2021), (2) the SVAMP dataset of math word problems with varying structures (Patel et al., 2021), (3) the ASDiv dataset of diverse math word problems (Miao et al., 2020), (4) the AQuA dataset of algebraic word problems, and (5) the MAWPS benchmark (Koncel-Kedziorski et al., 2016). Example problems are given in Appendix Table 12.
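The prompting technique the excerpt describes can be sketched concretely. The snippet below is a minimal illustration, not the paper's actual harness: it prepends one worked exemplar (the tennis-ball problem shown in the paper's Figure 1) so the model is prompted to emit its reasoning chain before the final answer, and it parses the answer from the "The answer is" marker. The function names `build_prompt` and `extract_answer` are hypothetical.

```python
import re

# One few-shot exemplar with an explicit reasoning chain, as in Figure 1
# of the chain-of-thought paper. A real prompt would use several exemplars.
EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_prompt(question: str) -> str:
    """Prepend the worked exemplar so the model imitates its reasoning chain."""
    return f"{EXEMPLAR}\nQ: {question}\nA:"

def extract_answer(completion: str) -> str:
    """Pull the final numeric answer that follows the 'The answer is' marker."""
    match = re.search(r"The answer is\s+(-?[\d.,]+)", completion)
    return match.group(1).rstrip(".") if match else ""
```

Evaluation on benchmarks such as GSM8K then reduces to comparing `extract_answer` on the model completion against the reference answer.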
Operating System Type
Linux
Operating System Version
Ubuntu 22.04
Python version
3.12
Software version (magic-pdf --version)
1.3.x
Device mode
cuda
🔎 Search before asking
Description of the bug
How to reproduce the bug