Notes:
[1] https://www.jasonwei.net/blog/emergence and https://www.yitay.net/blog/emergence-and-scaling
[2] Wei et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
[3] https://lingo.csail.mit.edu/blog/arithmetic_gpt3/
[4] Wei et al. 2022. Emergent Abilities of Large Language Models
[5] As of November 2022, there was still no rigorous evidence that these abilities exist in small models.
[6] In November 2022, evaluating the GSM8K test set on text-davinci-002 cost about $50.
[7] Google does not provide public access to PaLM; OpenAI does not allow researchers from some countries to access GPT-3 and Codex (as of November 2022).
[8] The first version of GPT-3 (May 2020) could not outperform fine-tuned T5 on many tasks.
[9] Wei et al. 2022. Emergent Abilities of Large Language Models
[10] Cobbe et al. 2021. Training Verifiers to Solve Math Word Problems
[11] GPT-3 has been continuously updated. The latest version, text-davinci-002, is now quite different from the original 2020 version.
[12] Wei et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
[13] Wang et al. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models
[14] Fu et al. 2022. Complexity-Based Prompting for Multi-step Reasoning
[15] There is not yet any work that fairly compares prompting with fine-tuning. Still, when chain-of-thought was proposed, it outperformed fine-tuning, even though the comparison between the two may not have been fair.
[16] Chung et al. 2022. Scaling Instruction-Finetuned Language Models
[17] Lewkowycz et al. 2022. Minerva: Solving Quantitative Reasoning Problems with Language Models
[18] Jiang et al. 2022. Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
[19] Xu et al. 2021. Fusing Context Into Knowledge Graph for Commonsense Question Answering
[20] Khashabi et al. 2020. UnifiedQA: Crossing Format Boundaries With a Single QA System
[21] Yu et al. 2022. Generate rather than Retrieve: Large Language Models are Strong Context Generators
[22] Jung et al. 2022. Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations
[23] Although this knowledge may be outdated or unreliable, the question of which trusted knowledge source to choose is beyond the scope of this article.
[24] Si et al. 2022. Prompting GPT-3 To Be Reliable
[25] Fu et al. 2022. Complexity-Based Prompting for Multi-step Reasoning
[26] Kaplan et al. 2020. Scaling Laws for Neural Language Models
[27] Brown et al. 2020. Language Models are Few-Shot Learners
[28] Cobbe et al. 2021. Training Verifiers to Solve Math Word Problems
[29] Li and Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation
[30] He et al. 2021. Towards a Unified View of Parameter-Efficient Transfer Learning
[31] Chung et al. 2022. Scaling Instruction-Finetuned Language Models
[32] Suzgun et al. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
[33] In the two months after this article was published, more models were released, and many of the new models can also do chain-of-thought, e.g. UL2 and Flan-T5.
[34] Suzgun et al. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them; Fu et al. 2022. Complexity-Based Prompting for Multi-step Reasoning; Madaan et al. 2022. Language Models of Code are Few-Shot Commonsense Learners
[35] Ouyang et al. 2022. Training language models to follow instructions with human feedback
[36] Chowdhery et al. 2022. PaLM: Scaling Language Modeling with Pathways
[37] Chung et al. 2022. Scaling Instruction-Finetuned Language Models
[38] Chung et al. 2022. Scaling Instruction-Finetuned Language Models; Huang et al. 2022. Large Language Models Can Self-Improve