Just as with Russian nesting dolls (Matryoshka), user preferences and item features can be expressed and organized at different levels. Building on this idea, Riwei Lai et al. propose a method called Matryoshka Representation Learning for Recommendation (MRL4Rec). By restructuring user and item vectors into overlapping vector spaces of incrementally increasing dimensionality, the method explicitly represents user preferences and item features at different levels.
Riwei Lai, Li Chen, Weixin Chen, Rui Chen. Matryoshka Representation Learning for Recommendation. arXiv:2406.07432, 2024.
Matryoshka Representation Learning for Recommendation: Summary and Answers
This paper introduces Matryoshka Representation Learning for Recommendation (MRL4Rec), a novel approach to representation learning for recommendation systems. Here’s a breakdown of the key points:
Problem: Existing representation learning methods struggle to capture the inherent hierarchical structure of user preferences and item features. They either treat all features uniformly or rely on discrete clustering, which can be limiting.
Solution: MRL4Rec proposes a new representation structure called matryoshka representations. These representations organize user and item vectors into nested, incrementally dimensional vector spaces, resembling Russian nesting dolls (Matryoshka dolls). This structure allows for the explicit representation of user preferences and item features at different levels of granularity.
Key Innovations:
Matryoshka Representations: The hierarchical structure allows capturing broad preferences at higher levels and more specific interests at lower levels.
Level-Specific Training Triplets: The paper argues that constructing training triplets specific to each level of the hierarchy is crucial for accurate representation learning.
Matryoshka Negative Sampling: A novel negative sampling mechanism is proposed to generate effective training triplets, ensuring the model learns meaningful hierarchical relationships.
Results: Experiments on real-life datasets demonstrate that MRL4Rec consistently outperforms state-of-the-art recommendation models.
Code Availability: The authors have made their code publicly available (link provided in the paper).
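To make the nested structure concrete, here is a minimal PyTorch sketch of matryoshka-style representations: each level scores users and items using only a prefix of the full embedding, and a BPR-style triplet loss is summed over levels. The dimensions, the objective, and all names are illustrative assumptions, not the paper’s actual implementation (which additionally builds level-specific triplets via matryoshka negative sampling).

```python
import torch
import torch.nn.functional as F

# Nested "matryoshka" levels: level k only sees the first dims[k] coordinates,
# so coarser preferences live in prefixes of the full vector (assumed sizes).
dims = [16, 32, 64, 128]

user_emb = torch.nn.Embedding(1000, dims[-1])   # 1,000 users (toy scale)
item_emb = torch.nn.Embedding(5000, dims[-1])   # 5,000 items

def level_scores(user_ids, item_ids):
    """Dot-product user-item scores at every matryoshka level."""
    u, v = user_emb(user_ids), item_emb(item_ids)
    return [(u[:, :d] * v[:, :d]).sum(-1) for d in dims]

def matryoshka_bpr_loss(user_ids, pos_ids, neg_ids):
    """Triplet loss summed over levels; MRL4Rec would supply level-specific
    triplets here (matryoshka negative sampling), which this sketch omits."""
    pos, neg = level_scores(user_ids, pos_ids), level_scores(user_ids, neg_ids)
    return sum(-F.logsigmoid(p - n).mean() for p, n in zip(pos, neg))
```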
Questions and answers based on this paper:
What is the main contribution of this paper?
Introducing MRL4Rec, a novel representation learning method that uses matryoshka representations to capture the hierarchical nature of user preferences and item features for improved recommendation accuracy.
How does MRL4Rec differ from previous approaches?
Unlike methods treating features uniformly or using discrete clusters, MRL4Rec employs nested vector spaces to represent hierarchical relationships between preferences and features explicitly.
What is the significance of level-specific training triplets?
They ensure that the model learns accurate representations at each level of the hierarchy by providing targeted training data.
What is the role of matryoshka negative sampling?
It generates effective negative samples, crucial for training the model to distinguish between positive and negative relationships within each level of the hierarchy.
This paper introduces GPT4Rec, a novel approach to streaming recommendation based on graph prompt tuning. Here’s a breakdown of the paper’s key aspects:
Problem: Traditional recommender systems struggle to adapt to the dynamic nature of user-item interactions. New users, items, and preferences constantly emerge, leading to a distribution shift between training and real-world data. This often results in catastrophic forgetting, where models lose valuable knowledge from past data while trying to learn from new information.
Existing Solutions and their limitations:
Replay Buffer: These methods store a subset of past data and replay it during training. However, they are limited by buffer size and raise privacy concerns.
Model Regularization: These techniques constrain model parameters to prevent drastic changes. However, they struggle when new data significantly deviates from past patterns.
Model Isolation and Expansion: These approaches isolate old knowledge and create new learning spaces for new data. However, they lead to increased model complexity and slow updates.
GPT4Rec’s Approach:
Disentangling Complex Graphs: GPT4Rec divides the user-item interaction graph into multiple views using linear transformations. This allows the model to focus on specific aspects of the graph, such as user-to-item interactions or item-to-item similarities, without losing sight of the overall interconnectedness.
Adaptive Learning with Prompts: GPT4Rec employs three types of prompts to guide the model’s learning process:
Node-Level Prompts: These prompts capture changes in individual node attributes or properties. They are weighted based on their relevance to specific nodes, allowing the model to focus on the most pertinent information.
Structure-Level Prompts: These prompts capture broader patterns of connectivity and relationships within the graph. They decompose sub-graph structures into smaller, more manageable components, enabling the model to adapt to changes in the overall graph topology.
View-Level Prompts: These prompts aggregate information from the different disentangled views, ensuring a comprehensive understanding of the evolving graph.
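As a rough illustration of the prompt mechanism, here is a minimal PyTorch sketch of a node-level prompt: a small bank of learnable prompt vectors modulates frozen node embeddings through relevance weights. This is one plausible reading of the description above, with all sizes and names assumed; it is not the paper’s actual code.

```python
import torch
import torch.nn.functional as F

class NodeLevelPrompt(torch.nn.Module):
    """Learnable prompt bank; each node attends over the bank so that only
    the prompts relevant to that node modulate its representation."""
    def __init__(self, dim: int, num_prompts: int = 8):
        super().__init__()
        self.prompts = torch.nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, node_emb: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(node_emb @ self.prompts.T, dim=-1)  # (N, P) relevance
        return node_emb + weights @ self.prompts                # prompted nodes

# Adapting to a new data segment: the backbone's node states stay frozen and
# only the prompt parameters are trained, which keeps updates fast.
node_emb = torch.randn(100, 64)          # frozen node states (toy example)
adapted = NodeLevelPrompt(dim=64)(node_emb)
```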
Advantages of GPT4Rec:
Efficient Adaptation: GPT4Rec’s prompt-based approach allows for rapid adaptation to new data streams without requiring extensive model modifications.
Knowledge Preservation: By strategically utilizing prompts, GPT4Rec retains valuable knowledge from past data while effectively incorporating new information.
Theoretical Guarantees: The paper provides theoretical analysis demonstrating that GPT4Rec possesses at least the expressive power of global fine-tuning.
Evaluation:
The paper claims that GPT4Rec outperforms state-of-the-art baselines on four real-world datasets, demonstrating its effectiveness for streaming recommendation.
Overall, GPT4Rec presents a promising solution for addressing the challenges of streaming recommendation by leveraging graph prompt tuning. Its ability to disentangle complex graphs, adapt to evolving data patterns, and preserve valuable knowledge makes it a significant contribution to the field.
Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X. Liu, Schahram Dustdar. “Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting.” ICLR 2022.
Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.
Zhou, Haoyi, et al. “Informer: Beyond efficient transformer for long sequence time-series forecasting.” AAAI. 2021.
Beltagy, Iz, Matthew E. Peters, and Arman Cohan. “Longformer: The long-document transformer.” arXiv preprint arXiv:2004.05150 (2020).
Kitaev, Nikita, Łukasz Kaiser, and Anselm Levskaya. “Reformer: The efficient transformer.” International Conference on Learning Representations. 2020.
Summary of “Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting”
This paper proposes Pyraformer, a novel Transformer-based model designed to address the challenges of long-range time series forecasting. The key innovation lies in its pyramidal attention module (PAM), which efficiently captures both short-term and long-term dependencies in time series data.
Here’s a breakdown of the paper’s key aspects:
Problem: Existing time series forecasting methods struggle to balance computational efficiency with the ability to capture long-range dependencies. RNNs and CNNs are efficient but struggle with long sequences; Transformers capture long-range dependencies well but suffer from quadratic complexity.
Proposed Solution: Pyraformer
Pyramidal Attention Module (PAM):
Employs a multi-resolution pyramidal graph to represent the time series.
Inter-scale connections: Form a C-ary tree where each level summarizes information at a different resolution (e.g., hourly, daily, weekly).
Intra-scale connections: Capture temporal dependencies within each resolution by connecting neighboring nodes.
This structure allows for efficient modeling of long-range dependencies by capturing them at coarser resolutions (a sketch of the resulting attention pattern follows this list).
Coarser-Scale Construction Module (CSCM): Initializes the nodes at coarser scales using convolutions applied to finer-scale representations.
Prediction Module:
Single-step forecasting: Gathers features from all scales and uses a fully connected layer for prediction.
Multi-step forecasting: Offers two options:
Similar to single-step but maps to multiple future time steps.
Utilizes a decoder with two full attention layers for incorporating historical information.
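To make the pyramidal graph concrete, here is a minimal NumPy sketch that builds the boolean attention mask for a small pyramid: intra-scale links connect each node to its A nearest neighbours at the same resolution, and inter-scale links form the C-ary tree. The parameter names loosely follow the paper’s notation, but the construction itself is a simplified assumption; the actual PAM is implemented with specialized kernels.

```python
import numpy as np

def pyramid_mask(L: int, C: int = 4, A: int = 1, num_scales: int = 3):
    """Boolean mask over all pyramid nodes; attention allowed where True."""
    sizes = [L // (C ** s) for s in range(num_scales)]      # nodes per scale
    offs = np.cumsum([0] + sizes)                           # flat index offsets
    N = offs[-1]
    mask = np.zeros((N, N), dtype=bool)
    for s, size in enumerate(sizes):
        for i in range(size):
            gi = offs[s] + i
            lo, hi = max(0, i - A), min(size, i + A + 1)
            mask[gi, offs[s] + lo : offs[s] + hi] = True    # intra-scale links
            if s + 1 < num_scales and i // C < sizes[s + 1]:
                mask[gi, offs[s + 1] + i // C] = True       # link to parent
                mask[offs[s + 1] + i // C, gi] = True       # link to child
    return mask

mask = pyramid_mask(L=64)   # 64 + 16 + 4 = 84 nodes in the pyramid
```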
Advantages:
Low Complexity: Achieves linear time and space complexity (O(L)) thanks to the sparse connections in the pyramidal graph.
Long-Range Dependency Capture: Maintains a constant maximum signal traversing path length (O(1)), enabling efficient modeling of long-range dependencies.
Improved Accuracy: Outperforms existing methods in both single-step and long-range multi-step forecasting tasks.
Key Results:
Pyraformer consistently achieves higher prediction accuracy compared to other Transformer variants and traditional methods on various real-world datasets.
It achieves this while maintaining significantly lower time and memory consumption, especially for long sequences.
Overall, Pyraformer presents a promising solution for long-range time series forecasting by effectively balancing model complexity and the ability to capture long-term dependencies.
Our goal is to extract clusters of phrases that represent latent, generalizable treatments affecting a particular outcome. To this end, imagine N texts (Ti) randomly assigned to a process through which they are mapped to an outcome (Yi). Let i also index the individual who evaluates text i. We seek to identify and estimate the effects of an m-dimensional latent representation (Zi) of these texts, one that summarizes the clusters of phrases or concepts that would affect the outcome in repeated experiments. We call Zi the “text treatment” of text i. For example, each element of Zi could indicate the presence or absence of a certain phrase or grammatical structure, with Zi ∈ {0, 1}m. Zi could also contain real-valued elements representing continuous text features, such as similarity to a particular vocabulary or concept.
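As a toy illustration of this setup (with made-up phrase clusters, not ones discovered from data), the sketch below maps a text Ti to a binary treatment vector Zi ∈ {0, 1}m by flagging the presence of each cluster’s phrases:

```python
import numpy as np

# Hypothetical phrase clusters; element j of Z_i flags cluster j's presence.
phrase_clusters = {
    0: ["tax cut", "lower taxes"],          # cluster 0
    1: ["health care", "insurance"],        # cluster 1
    2: ["strong economy", "job growth"],    # cluster 2
}

def text_treatment(text: str, clusters: dict) -> np.ndarray:
    """Map a text T_i to its binary treatment vector Z_i in {0,1}^m."""
    z = np.zeros(len(clusters), dtype=int)
    for j, phrases in clusters.items():
        z[j] = int(any(p in text.lower() for p in phrases))
    return z

Z_i = text_treatment("The plan pairs a tax cut with job growth.", phrase_clusters)
# Z_i == array([1, 0, 1])
```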
Arceneaux, K., & Nickerson, D. W. (2010). Comparing negative and positive campaign messages: Evidence from two field experiments. American Politics Research, 38(1), 54-83.
King, G., Pan, J., & Roberts, M. E. (2014). Reverse-engineering censorship in China: Randomized experimentation and participant observation. Science, 345(6199), 1-10.
Sheikhalishahi, S., Miotto, R., & Weng, C. (2019). Natural language processing of clinical notes on chronic diseases: Systematic review. JMIR Medical Informatics, 7(2), e12239.
Hainmueller, J., & Hangartner, D. (2013). Who gets a Swiss passport? A natural experiment in immigrant discrimination. American Political Science Review, 107(1), 159-187.
Fong, C., & Grimmer, J. (2016). Discovery of treatments from text corpora. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1608-1617.
Pryzant, R., Diaz, M., & Liu, Y. (2018). Deconfounded lexicon induction for interpretable social science. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1686-1695.
Jacovi, A., & Goldberg, Y. (2020). Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4198-4208.
Alvarez Melis, D., & Jaakkola, T. S. (2018). Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, 7775-7784.
Rajagopal, D., & Mooney, R. (2021). Global explanations of neural networks: Mapping the landscape. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 124-131.
Fantinuoli, C., & Prandi, D. (2021). A user-centric framework for the evaluation of interpreting quality. Interpreting, 23(3), 267-286.
Han, Z. (2022). The nature of interpreting quality: A critical review. Routledge.
Kocmi, T., & Federmann, C. (2023). Large language models are state-of-the-art evaluators of translation quality. arXiv preprint arXiv:2302.14520.
Wang, X., & Fantinuoli, C. (2024). Exploring the correlation between human and machine evaluation of simultaneous speech translation. arXiv preprint arXiv:2406.10091.
By analyzing the correlation between automatic metrics and human evaluation, the paper aims to assess how reliable automatic metrics are for evaluating simultaneous interpretation, focusing on one specific quality dimension: translation accuracy, or faithfulness. Using language experts’ assessments as the benchmark, the paper evaluates how well sentence embeddings and Large Language Models correlate with human judgments, quantifying the semantic similarity between source and translated text without relying on reference translations.
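A minimal sketch of this kind of reference-free scoring, assuming the sentence-transformers package and a multilingual embedding model (the model choice and the sentence pair are placeholders, not the paper’s exact setup):

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual model so source and interpretation share an embedding space.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

source = "The committee will vote on the proposal tomorrow."
interpretation = "Das Komitee stimmt morgen über den Vorschlag ab."

# No reference translation needed: score the pair directly.
emb = model.encode([source, interpretation], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()   # semantic similarity in [-1, 1]
print(f"semantic similarity: {score:.3f}")
```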
A: The paper “Exploring the Correlation between Human and Machine Evaluation of Simultaneous Speech Translation” investigates the reliability of automatic evaluation methods for simultaneous interpretation quality, in particular their correlation with human evaluation. The paper’s main points are summarized below:
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Touvron, H., Martin, L., Stone, K., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
OpenAI. (2023). GPT-4 technical report. Retrieved from https://openai.com/research/gpt-4.
Stelmakh, I., Luan, Y., Dhingra, B., & Chang, M.-W. (2022). ASQA: Factoid questions meet long-form answers. arXiv preprint arXiv:2204.06092.
Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
Zheng, Z., Yuan, L., Zhang, Y., Li, Y., Zhang, Y., Zhang, B., and Zhou, M. (2024a). Vicuna: An open-source chatbot trained on a massive dataset of human-chat conversations. arXiv preprint arXiv:2306.01575.
Zheng, Y., Zhang, R., Zhang, J., Ye, Y., & Luo, Z. (2024b). LlamaFactory: Unified efficient fine-tuning of 100+ language models. arXiv preprint arXiv:2403.13372.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Zhou, B., Li, Y., and Yang, Y. (2023). Towards trustworthy large language models: A comprehensive survey. arXiv preprint arXiv:2303.12145.
LLMs and neural architectures continue to evolve, with especially notable progress in handling longer contexts (OpenAI, 2023; Reid et al., 2024; Anthropic, 2024). The ability of these models to generate text from rich contextual information matters: a longer context supplies more information from which to produce accurate, contextually relevant, and up-to-date responses. Long-context capability can also strengthen in-context learning by providing more in-context examples, instructions, or example trajectories for reinforcement learning (Chevalier et al., 2023; Agarwal et al., 2024; Lee et al., 2024).
Despite these advances in model capability, the benchmarks used to evaluate them have not kept pace. Current benchmarks such as LongBench (Bai et al., 2023) and L-Eval (An et al., 2023) extend only to 40,000 tokens, while models can process hundreds of thousands or even millions of tokens.
Overview of the BABILong Benchmark
To test LLMs’ ability to reason over extremely long documents, we introduce the BABILong benchmark. BABILong comprises a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets, which are prerequisites for any system intended to converse with humans (Weston et al., 2016). We use books from the PG19 corpus as the source of long natural documents (Rae et al., 2020). In this way, BABILong can construct tasks of almost arbitrary length, accommodating the evaluation of new, more capable models in a scalable and controlled manner. We provide sets of predefined lengths of up to 1 million tokens, and we evaluate models on samples of up to 11 million tokens.
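A minimal sketch of how such a sample can be assembled: the task facts are scattered through background text until a target length is reached. The word-based length accounting and all names here are simplified assumptions, not BABILong’s exact pipeline.

```python
import random

def build_sample(task_sentences, question, background_sentences, target_len):
    """Embed bAbI-style facts into a long haystack of background text."""
    haystack = []
    while sum(len(s.split()) for s in haystack) < target_len:
        haystack.append(random.choice(background_sentences))
    # Insert the task facts at random positions, keeping their relative order.
    positions = sorted(random.sample(range(len(haystack) + 1), len(task_sentences)))
    for offset, (pos, fact) in enumerate(zip(positions, task_sentences)):
        haystack.insert(pos + offset, fact)
    return " ".join(haystack) + "\n" + question

sample = build_sample(
    task_sentences=["Mary moved to the bathroom.", "John went to the hallway."],
    question="Where is Mary?",
    background_sentences=["It was a bright cold day in April."],  # stand-in for PG19 text
    target_len=200,
)
```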
A: The paper “BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack” addresses the problem that, while large language models (LLMs) keep getting better at handling long textual contexts, current evaluation methods have not kept pace and cannot fully assess how effectively these models use long contexts. Specifically, the paper points out the following problems:
Simply put, the loss on the i-th token is ignored if its mask value is Gi = 0 and included if Gi = 1. Crucially, the output xi is still conditioned on all preceding tokens x<i, allowing the model to learn the full distribution of natural language during training. However, for a given passage the model does not learn to predict the i-th token, and so is never conditioned on the exact sequence x<i at test time. Note that the goldfish mask is chosen independently for each training sample based on local context.
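A minimal sketch of this masked training loss, assuming PyTorch; the context width k, the 1-in-drop_every rule, and the simple tuple hash are illustrative stand-ins for the paper’s localized hashed mask, not its exact procedure.

```python
import torch
import torch.nn.functional as F

def goldfish_loss(logits, targets, k: int = 4, drop_every: int = 4):
    """Next-token loss where positions with G_i = 0 are excluded.
    logits: (B, T, V), targets: (B, T)."""
    B, T, V = logits.shape
    mask = torch.ones(B, T)
    for b in range(B):
        for t in range(k, T):
            ctx = tuple(targets[b, t - k : t].tolist())   # local context window
            if hash(ctx) % drop_every == 0:               # pseudorandom G_i = 0
                mask[b, t] = 0.0                          # drop this token's loss
    # Model still conditions on all previous tokens; only the loss is masked.
    loss = F.cross_entropy(logits.reshape(-1, V), targets.reshape(-1),
                           reduction="none").reshape(B, T)
    return (loss * mask).sum() / mask.sum()
```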
Q: What problem does this paper try to solve?
A: This paper addresses the problem of “memorization” that can arise when training large language models (LLMs): the model internally stores verbatim copies of training data and later regenerates them. In commercial use, this phenomenon can create copyright risks, privacy risks, and other legal issues. Specifically:
Paper Review and Commentary: FP8-LM: Training FP8 Large Language Models
“FP8-LM: Training FP8 Large Language Models” studies training large language models (LLMs) with the low-bit FP8 data format, exploring how effective and efficient FP8 low-precision formats are for LLM training. The authors propose a new FP8 automatic mixed-precision training framework that progressively introduces 8-bit gradients, optimizer states, and distributed learning, enabling mixed-precision, distributed-parallel training of LLMs. Experiments show that the FP8 mixed-precision framework significantly reduces real memory usage and trains 75% faster than the widely adopted BF16 framework. The framework is also general and can be applied to other tasks, such as LLM instruction tuning and reinforcement learning. The authors have open-sourced the FP8 low-precision training framework.
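To illustrate the core idea at a small scale, here is a conceptual sketch of FP8 quantization with per-tensor scaling, assuming PyTorch 2.1 or newer (which exposes float8 dtypes). The function names and the scaling scheme are illustrative assumptions; FP8-LM itself integrates such casting with gradients, optimizer states, and distributed communication, which this sketch does not cover.

```python
import torch

def to_fp8_e4m3(x: torch.Tensor):
    """Quantize to FP8 e4m3 with a per-tensor scale that keeps values
    inside the format's representable range."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = 448.0 / amax                    # 448 is e4m3's largest normal value
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def from_fp8(x_fp8: torch.Tensor, scale: torch.Tensor):
    """Dequantize back to float32 for the update step."""
    return x_fp8.to(torch.float32) / scale

grad = torch.randn(1024) * 0.01
q, s = to_fp8_e4m3(grad)                    # 1 byte per element instead of 2-4
roundtrip = from_fp8(q, s)                  # approximate reconstruction
```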