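As parameter counts grow rapidly, deploying large generative models is becoming increasingly challenging, because they typically require large amounts of GPU memory and computation. Unstructured model pruning is a common way to reduce both the GPU memory footprint and the total computation while retaining good model accuracy. However, existing solutions do not provide efficient support for unstructured sparsity on modern GPUs, especially on the highly structured tensor core hardware.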
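We therefore propose Flash-LLM, which enables low-cost, highly efficient inference for large generative models by providing sophisticated support for unstructured sparsity on high-performance but highly restrictive tensor cores.
Our key observation is that the main bottleneck of generative model inference lies in several skinny matrix multiplications, for which tensor cores are significantly under-utilized because of the low computational intensity. To address this, we propose a general Load-as-Sparse and Compute-as-Dense method for unstructured sparse matrix multiplication (SpMM). The basic idea is to relieve the significant memory bandwidth bottleneck while tolerating some redundant computation on the tensor cores.
Building on this, we design an effective software framework for tensor-core-based unstructured SpMM that uses on-chip resources for efficient sparse data extraction and for overlapping computation with memory access.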
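Advantages of Flash-LLM
- Efficient SpMM kernels: At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA, by an average of 2.9x and 1.5x, respectively.
- End-to-end framework gains: On the OPT-30B/66B/175B models, Flash-LLM achieves up to 3.8x and 3.6x higher throughput (measured in tokens per GPU-second) than DeepSpeed and FasterTransformer, respectively, with significantly lower inference cost.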
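How Flash-LLM works
The core idea of Flash-LLM is Load-as-Sparse and Compute-as-Dense. The key matrix multiplications in generative model inference are usually very skinny: one dimension is much smaller than the others. Their computational intensity is therefore low, and their performance is bounded by global memory access (that is, memory bandwidth) rather than by the arithmetic throughput of the tensor cores.
We therefore load the weight matrices from global memory in a sparse format, which reduces the traffic over the limited memory bandwidth, and tolerate the redundant computation that the tensor cores then perform on the zero entries.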
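The memory-bound argument can be checked with a back-of-the-envelope roofline estimate. The sketch below is illustrative only: the matrix shapes, the 20% weight density, and the rounded A100 peak figures (about 312 TFLOPS of dense FP16 tensor core throughput and about 2 TB/s of memory bandwidth) are assumptions chosen for the example, and the cost of storing sparse indices is ignored.

```python
# Rough roofline estimate for C[m, n] = A[m, k] @ B[k, n], where A is the weight matrix.
# All shapes and hardware figures below are illustrative assumptions.

PEAK_FLOPS = 312e12      # approx. dense FP16 tensor core peak of an A100, in FLOP/s
PEAK_BANDWIDTH = 2.0e12  # approx. HBM bandwidth of an A100-80GB, in bytes/s
RIDGE = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity above which a kernel is compute-bound


def arithmetic_intensity(m, k, n, bytes_per_value=2, weight_density=1.0):
    """FLOPs per byte of global-memory traffic.

    weight_density < 1.0 models loading only the nonzero weights
    (index overhead ignored). The FLOP count stays dense, because the
    tensor cores still multiply the zeros after the weights are expanded.
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_value * (weight_density * m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_flops(intensity):
    """Roofline bound: limited by either compute or memory bandwidth."""
    return min(PEAK_FLOPS, intensity * PEAK_BANDWIDTH)


if __name__ == "__main__":
    cases = {
        "skinny, dense load": arithmetic_intensity(8192, 8192, 16),
        "skinny, sparse load": arithmetic_intensity(8192, 8192, 16, weight_density=0.2),
        "square, dense load": arithmetic_intensity(8192, 8192, 8192),
    }
    for name, intensity in cases.items():
        bound = "compute" if intensity >= RIDGE else "memory"
        print(f"{name}: {intensity:.1f} FLOP/byte, {bound}-bound, "
              f"<= {attainable_flops(intensity) / 1e12:.0f} TFLOPS")
```

Under these assumptions the skinny multiplication sits far below the ridge point, so cutting the bytes loaded for the weights raises the attainable throughput almost proportionally, even though no arithmetic is skipped.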
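Design of Flash-LLM
Flash-LLM uses both SIMT cores and tensor cores to execute unstructured SpMM efficiently. The SIMT cores perform the sparse-to-dense transformation (Load-as-Sparse), while the tensor cores perform the compute-intensive dense tensor computation (Compute-as-Dense).
- Sparse encoding and runtime parsing: Flash-LLM uses a new sparse format called Tiled-CSL to support tile-by-tile SpMM execution. The format organizes the sparse data so that on-chip resources can be used to extract it efficiently.
- Computation pipeline design: Flash-LLM adopts a two-level strategy for overlapping memory access with computation. At the first level, a software pipeline with double buffering overlaps off-chip memory loading and the sparse-to-dense transformation with tensor core computation. At the second level, the off-chip memory loading stages inside the sparse-to-dense transformation are overlapped with each other to keep memory activity efficient.
- Ahead-of-time sparse data reordering: To further reduce shared-memory bank conflicts, Flash-LLM reorders the sparse data ahead of time so that elements handled together map to different shared-memory banks, which keeps the shared-memory accesses that feed the tensor cores free of conflicts.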
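Evaluation of Flash-LLM
We evaluate Flash-LLM at two levels: kernel-level benchmarks and model-level evaluation. The evaluation was carried out on an NVIDIA A100-SMX8-80GB platform.
- Kernel performance: At the kernel level, Flash-LLM performs well across matrix multiplications of various shapes and significantly outperforms existing SpMM libraries such as cuSPARSE, Sputnik, and SparTA.
- Model performance: On the OPT-30B/66B/175B models, Flash-LLM achieves up to 3.8x and 3.6x higher throughput in tokens per GPU-second than DeepSpeed and FasterTransformer, respectively, with significantly lower inference cost.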
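Because the end-to-end comparison is normalized per GPU (tokens per GPU-second), a configuration can score better either by generating tokens faster or by needing fewer GPUs. The short sketch below shows the arithmetic; every number in it is hypothetical and chosen only to illustrate the metric, not taken from the evaluation.

```python
def tokens_per_gpu_second(tokens_generated, wall_seconds, num_gpus):
    """Throughput normalized by the number of GPUs occupied."""
    return tokens_generated / (wall_seconds * num_gpus)


if __name__ == "__main__":
    # Hypothetical runs: the same 1000 tokens, generated on 4 GPUs in 10 s
    # versus on 2 GPUs in 8 s (for example, because a pruned model fits on fewer GPUs).
    baseline = tokens_per_gpu_second(1000, wall_seconds=10.0, num_gpus=4)
    pruned = tokens_per_gpu_second(1000, wall_seconds=8.0, num_gpus=2)
    print(baseline, pruned, pruned / baseline)
```

Since GPU time is what is paid for, a higher value of this metric translates directly into a lower inference cost per token.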
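Applications of Flash-LLM
Flash-LLM can be integrated into other deep learning frameworks by calling the Flash-LLM API as a library. Because it accelerates inference for generative models, it can serve a range of tasks, for example:
- Text generation: producing high-quality text such as articles, code, and dialogue.
- Machine translation: translating text from one language into another.
- Question answering: answering questions posed by users.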
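Future directions for Flash-LLM
- Further improving kernel performance: Kernel performance could be improved further by optimizing shared-memory access and by making use of other GPU hardware resources.
- Supporting other models: Flash-LLM could be extended to other large generative models, such as GPT-3 and BLOOM.
- Exploring complementary compression techniques: Techniques such as quantization and low-rank decomposition could be combined with sparsity to further reduce model size and inference cost.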
https://arxiv.org/abs/2309.10285
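With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models, as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common approach to reduce both GPU memory footprint and overall computation while retaining good model accuracy. However, existing solutions do not provide highly efficient support for handling unstructured sparsity on modern GPUs, especially on the highly structured Tensor Core hardware. Therefore, we propose Flash-LLM for enabling low-cost and highly efficient large generative model inference with sophisticated support of unstructured sparsity on high-performance but highly restrictive Tensor Cores. Based on our key observation that the main bottleneck of generative model inference is the several skinny matrix multiplications, for which Tensor Cores would be significantly under-utilized due to low computational intensity, we propose a general Load-as-Sparse and Compute-as-Dense methodology for unstructured sparse matrix multiplication. The basic insight is to address the significant memory bandwidth bottleneck while tolerating redundant computations that are not critical for end-to-end performance on Tensor Cores. Based on this, we design an effective software framework for Tensor Core based unstructured SpMM, leveraging on-chip resources for efficient sparse data extraction and computation/memory-access overlapping. At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA, by an average of 2.9x and 1.5x, respectively. At the end-to-end framework level on OPT-30B/66B/175B models, for tokens per GPU-second, Flash-LLM achieves up to 3.8x and 3.6x improvement over DeepSpeed and FasterTransformer, respectively, with significantly lower inference cost.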