The researchers define a prediction task, called LLM-Sim, for evaluating how reliably an LLM can act as a simulator. The goal of the LLM-Sim task is to implement a function F : C × S × A → S × R × {0, 1} that maps the given context, state, and action (c, s_t, a_t) to the subsequent state, reward, and game-completion flag (s_{t+1}, r_{t+1}, d_{t+1}).
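To make the mapping concrete, here is a minimal hand-coded sketch of such a transition function in Python. The toy state representation, action names, and reward values are illustrative assumptions, not part of the benchmark; an LLM-as-simulator would be prompted to produce the same (next state, reward, done) triple.

```python
from typing import Dict, Tuple

State = Dict[str, bool]   # toy state: object properties as booleans
Action = str
Context = str

def simulate(context: Context, state: State, action: Action) -> Tuple[State, int, bool]:
    """Toy transition function F: C x S x A -> S x R x {0, 1}.

    Returns the next state s_{t+1}, the reward r_{t+1},
    and the game-completion flag d_{t+1}.
    """
    next_state = dict(state)  # states are immutable inputs; copy before mutating
    reward, done = 0, False
    if action == "take key" and not state.get("has_key", False):
        next_state["has_key"] = True
        reward = 1
    elif action == "open door" and state.get("has_key", False):
        next_state["door_open"] = True
        reward = 10
        done = True  # task complete: the game ends
    return next_state, reward, done
```

For example, `simulate("escape room", {"has_key": False}, "take key")` yields a state with the key held, reward 1, and `done=False`; a benchmark like BYTESIZED32-State-Prediction asks whether an LLM can predict such triples instead of the hand-written code.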
Achiam, J., et al. (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
Ammanabrolu, P., & Hausknecht, M. (2020). A text-based adventure game for interactive learning and evaluation of natural language understanding. arXiv preprint arXiv:2005.02294.
Adhikari, A., et al. (2020). Towards a text-based game for evaluating grounded language understanding. arXiv preprint arXiv:2005.03442.
Côté, M.-A., et al. (2018). The unreasonable effectiveness of deep learning for text-based games. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 1830-1839).
Fan, A., et al. (2020). Learning to play text-based games with a language model. arXiv preprint arXiv:2003.07617.
Fakhoury, S., et al. (2023). Evaluating the factual consistency of language models. arXiv preprint arXiv:2301.07187.
Hao, B., et al. (2023). Reasoning via planning: A language model-based approach to symbolic reasoning. arXiv preprint arXiv:2303.16960.
Hausknecht, M., et al. (2020). A text-based adventure game for interactive learning and evaluation of natural language understanding. arXiv preprint arXiv:2005.02294.
Jansen, P. (2022). Text-based games for grounded language understanding: A survey. arXiv preprint arXiv:2206.02437.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237-285.
Liu, H., et al. (2023). Code as a language: Towards a unified framework for code and natural language. arXiv preprint arXiv:2303.17581.
Nottingham, W., et al. (2023). Towards general-purpose language models for code generation. arXiv preprint arXiv:2303.16627.
Shridhar, K., et al. (2020). Text-based adventure games as a testbed for grounded language understanding. arXiv preprint arXiv:2003.04604.
Tang, Y., et al. (2024). Towards a unified framework for code and natural language via large language models. arXiv preprint arXiv:2401.04156.
Urbanek, T., et al. (2019). Learning to play text-based games with a language model. arXiv preprint arXiv:1907.03718.
Valmeekam, V., et al. (2023). Language models are not planners. arXiv preprint arXiv:2303.16487.
Walton, C. (2020). AI Dungeon: A text adventure powered by GPT-3. [Online; accessed 2023-03-14].
Wang, R., et al. (2022). A text-based game for evaluating commonsense reasoning in language models. arXiv preprint arXiv:2205.14871.
Wang, R., et al. (2023). BYTESIZED32: A benchmark for evaluating scientific reasoning in language models. arXiv preprint arXiv:2303.11612.
Wong, A., et al. (2023). Code as a language: Towards a unified framework for code and natural language. arXiv preprint arXiv:2303.17581.
Q: What problem does this paper try to solve?
A: The central question the paper investigates is whether current language models (LLMs) can serve as text-based world simulators, accurately predicting how actions change different world states and thereby bypassing the need for hand-coded simulators. To study this, the authors build and use a new benchmark, BYTESIZED32-State-Prediction, a dataset of text-game state transitions and accompanying game tasks, to quantitatively evaluate LLMs on this ability. They find that although models such as GPT-4 perform well, they still cannot serve as reliable world simulators without further innovation. Beyond offering new insight into the capabilities and weaknesses of current LLMs, this work provides a benchmark for tracking progress in future models.
1D weight-stationary layout: This is the simplest partitioning strategy. Each E × F weight matrix is partitioned (sharded) along either the E or the F axis; each weight shard is multiplied with the corresponding activation shard on its chip, and the results are aggregated with all-gather and/or reduce-scatter operations. At small chip counts, memory latency and compute latency fall linearly with the number of chips. Communication latency, however, stays roughly constant, because the full activation matrix must be aggregated for every matrix multiplication. As the chip count grows, communication becomes the bottleneck.
2D weight-stationary layout: With more chips, each E × F weight matrix can instead be partitioned along both the E and F axes so that each shard is approximately square; this is called the 2D weight-stationary layout. Its compute cost matches the 1D layout, but communication is more efficient: by alternating activation aggregation between the E and F axes, each chip always holds the activation shard it needs without the activation tensor ever being fully replicated. Communication time shrinks as the chip count grows, so even when communication is the bottleneck, adding chips still reduces latency.
Weight-gathered layout: In the weight-stationary layouts, each chip stores a shard of the weight matrix and multiplies it with the corresponding activation shard; the per-chip matmul results must then be aggregated before they can feed subsequent operations. As the batch size (and sequence length) grows, however, the output activations can become much larger than the weights. In that regime it is cheaper to keep the activations stationary on each chip and transfer the weights between chips instead. For very large batch sizes it is best to keep the activations fully stationary across consecutive matmuls, which requires transferring the weights entirely across all chips; we call this approach XYZ-weight-gathered. For moderate batch sizes, a "hybrid" approach is beneficial, in which both weights and activations are partially transferred along different axes; we call these X-weight-gathered and XY-weight-gathered.
This work adopts a partitioning notation and communication scheme based on the 3D-torus topology of the TPU v4 architecture. For example, the notation BLE_xyz denotes a tensor of logical shape BLE whose last dimension E is split into X × Y × Z partitions, where x, y, and z refer to the three physical axes of the TPU v4; the per-chip tensor shape is then [B, L, E/(X × Y × Z)].
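As a quick sanity check of this notation, a tensor's per-chip shape can be computed directly from the mesh axes each dimension is partitioned over. The helper below is an illustrative sketch (names and example sizes are assumptions, not code from the paper):

```python
def per_chip_shape(logical_shape, mesh, partitioned_axes):
    """Shape of one chip's shard of a logically-shaped tensor.

    logical_shape: dim name -> size, e.g. {"B": 8, "L": 128, "E": 8192}
    mesh: physical axis -> size, e.g. {"x": 4, "y": 4, "z": 2}
    partitioned_axes: dim name -> tuple of mesh axes it is split over,
        e.g. {"E": ("x", "y", "z")} for the BLE_xyz layout.
    """
    shape = {}
    for dim, size in logical_shape.items():
        parts = 1
        for axis in partitioned_axes.get(dim, ()):
            parts *= mesh[axis]
        assert size % parts == 0, f"{dim}={size} not divisible by {parts}"
        shape[dim] = size // parts  # unpartitioned dims keep their full size
    return shape

# BLE_xyz on a 4x4x2 mesh: E is split into X * Y * Z = 32 partitions.
print(per_chip_shape({"B": 8, "L": 128, "E": 8192},
                     {"x": 4, "y": 4, "z": 2},
                     {"E": ("x", "y", "z")}))  # {'B': 8, 'L': 128, 'E': 256}
```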
3.2 Partitioning strategies for the feedforward layer
3.2.1 1D weight-stationary layout
The simplest partitioning strategy splits each E × F weight matrix along the E or F axis into n_chips partitions; each partition is multiplied with the activation tensor on its own chip, and the results are combined across chips with all-gather and reduce-scatter operations. This strategy is efficient at small chip counts, but as the number of chips grows, the communication cost becomes the bottleneck.
3.2.2 2D weight-stationary layout
To improve communication efficiency, each E × F weight matrix can be partitioned in two dimensions, along both the E and F axes, so that each partition is approximately square. This strategy is called the 2D weight-stationary layout. It effectively reduces the communication cost, because activation aggregation can alternate between the two axes, avoiding a full copy of the activations for every matrix multiplication.
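The contrast between the two layouts can be seen in a toy cost model (an illustrative sketch; the real analysis also accounts for link bandwidth and topology constants): per-chip compute falls as 1/n_chips in both layouts, while the activation-communication volume stays flat in the 1D layout but shrinks roughly as 1/sqrt(n_chips) in the 2D layout.

```python
import math

def ffn_costs(n_chips, tokens, d_model, d_ff, layout):
    """Toy per-chip cost model for one E x F feedforward matmul.

    Returns (flops_per_chip, comm_elements_per_chip); bandwidth and
    latency constants are deliberately omitted.
    """
    flops = 2 * tokens * d_model * d_ff / n_chips  # evenly sharded matmul
    if layout == "1d":
        # the full activation matrix is gathered for every matmul
        comm = tokens * d_model
    elif layout == "2d":
        # alternating gathers over a near-square sqrt(n) x sqrt(n) grid
        comm = tokens * d_model / math.sqrt(n_chips)
    else:
        raise ValueError(layout)
    return flops, comm

# Going from 16 to 64 chips: compute per chip drops 4x in both layouts,
# but only the 2D layout also cuts the communication volume (here by 2x).
for layout in ("1d", "2d"):
    f, c = ffn_costs(64, tokens=1024, d_model=8192, d_ff=32768, layout=layout)
    print(layout, f, c)
```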
@misc{he2022brrrrfromfirstprinciples,
  author = {He, Horace},
  title  = {Making Deep Learning Go Brrrr From First Principles},
  year   = {2022},
  url    = {https://horace.io/brrr_intro.html},
}
4. Overhead-bound regime: The last regime is the overhead-bound regime, which stems from software-induced limits. In this regime, most of the processing time is spent scheduling work and submitting it to the hardware; in essence, we spend more time figuring out what to do than performing actual operations on the hardware (Figure 5). Overhead-bound execution is more likely with very flexible languages (e.g., Python) or frameworks (e.g., PyTorch) that do not require all the necessary information (tensor data types, target device, kernel to invoke, etc.) to be specified explicitly ahead of run time. That missing information must be inferred at run time, and the CPU cycles this consumes are called overhead. Accelerator hardware is now so fast relative to the CPU that situations where overhead limits hardware utilization, and hence cost efficiency, are quite likely: in effect, the hardware sometimes sits idle, waiting for the software to submit the next work item.
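A back-of-the-envelope model makes the regime concrete (the numbers below are illustrative assumptions, not measurements): if each kernel launch costs a fixed amount of CPU-side dispatch work, then as per-kernel compute time shrinks, the overhead fraction of the wall clock grows until the accelerator is mostly idle.

```python
def overhead_fraction(n_kernels, overhead_us, compute_us):
    """Fraction of wall-clock time spent on CPU-side dispatch overhead,
    assuming (worst case) that overhead and compute do not overlap."""
    total = n_kernels * (overhead_us + compute_us)
    return n_kernels * overhead_us / total

# Illustrative: 10 us of dispatch overhead per kernel launch.
# Large kernels (1 ms of compute each): overhead is ~1% of wall clock.
big = overhead_fraction(1000, overhead_us=10, compute_us=1000)
# Tiny kernels (20 us of compute each): overhead is ~33%, i.e. overhead-bound.
small = overhead_fraction(1000, overhead_us=10, compute_us=20)
print(f"{big:.2%} vs {small:.2%}")
```

Techniques such as CUDA Graphs attack exactly this regime by recording many kernel launches once and replaying them with a single submission, amortizing the dispatch cost.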
[1]: Making Deep Learning Go Brrrr From First Principles (He, 2022)
[2]: NVIDIA H100 product specifications
[3]: LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers et al., 2022) + GitHub repository
[4]: SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (Xiao et al., 2022) + GitHub repository
[5]: GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022) + GitHub repository
[6]: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2023) + GitHub repository
[7]: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022); FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (Dao, 2023) + GitHub repository
[8]: Deploying Deep Neural Networks with NVIDIA TensorRT (Gray et al., 2017)
[9]: Efficiently Scaling Transformer Inference (Pope et al., 2022)
[10]: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (Shoeybi et al., 2019)
[11]: Blog post — Accelerating PyTorch with CUDA Graphs (Nguyen et al., 2021)
[12]: Roofline: An Insightful Visual Performance Model for Multicore Architectures (Williams et al., 2009)
[13]: Blog post — Flash-Decoding for long-context inference (Dao et al., 2023)