Vol. 3 No. 9 (2024)

Structured Compression of Large Language Models with Sensitivity-aware Pruning Mechanisms

Published 2024-12-30

How to Cite

Wang, Y. (2024). Structured Compression of Large Language Models with Sensitivity-aware Pruning Mechanisms. Journal of Computer Technology and Software, 3(9). https://doi.org/10.5281/zenodo.15851638

Abstract

This paper addresses the high computational complexity and structural redundancy of large language models at the inference stage. It proposes a structured pruning method that combines a Pruning Importance Evaluation Mechanism (PIEM) with a Layer-aware Sensitivity Pruning Strategy (LSPS). The method first constructs a multi-dimensional structural scoring function that evaluates the importance of each structural unit in the model by integrating weight distribution, gradient information, and contextual influence. It then adaptively adjusts pruning intensity according to the sensitivity differences across layers, preventing uniform pruning from degrading performance in highly sensitive layers. Experiments on ChatGLM-6B show that the proposed method outperforms existing public pruning strategies across multiple evaluation metrics: it significantly reduces inference latency while maintaining high model accuracy, and it removes a larger proportion of redundant structures. In both comparative and ablation experiments, PIEM and LSPS each prove effective on their own, and their combination achieves the best results in both inference efficiency and structural compression rate. Inference tests on edge devices and comparisons under different scoring metrics further show that the strategy remains stable and adaptable, confirming its generalization ability and practical value in real-world engineering scenarios. Additional experiments show that multi-round pruning achieves deeper compression and better performance retention than one-shot pruning. Together, these findings support the method's effectiveness in building lightweight, efficient language models for practical applications.
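To make the scoring idea concrete, here is a minimal sketch of a PIEM-style importance score, assuming PyTorch tensors and treating attention heads or FFN channels as the structural units. The normalization, the linear combination, and the coefficients `alpha`, `beta`, and `gamma` are illustrative assumptions; the abstract does not give the paper's exact formulation.

```python
import torch

def piem_scores(weights: torch.Tensor,
                grads: torch.Tensor,
                activations: torch.Tensor,
                alpha: float = 0.5,
                beta: float = 0.3,
                gamma: float = 0.2) -> torch.Tensor:
    """Hypothetical multi-dimensional importance score per structural unit.

    weights, grads : (num_units, unit_dim) unit parameters and their gradients
                     on a calibration batch (weight distribution + gradient info)
    activations    : (num_units,) mean |output| of each unit on calibration
                     data, standing in for "contextual influence"
    Returns        : (num_units,) scores; higher means more important.
    """
    def normalize(x: torch.Tensor) -> torch.Tensor:
        # Collapse each unit to one magnitude, then rescale to [0, 1]
        # so the three signals are comparable before mixing.
        m = x.abs().mean(dim=-1) if x.dim() > 1 else x.abs()
        return m / (m.max() + 1e-8)

    return (alpha * normalize(weights)
            + beta * normalize(grads)
            + gamma * normalize(activations))

# Toy usage: score 32 attention heads and keep the 24 highest-scoring ones.
heads, head_grads = torch.randn(32, 128), torch.randn(32, 128)
ctx = torch.rand(32)
keep = piem_scores(heads, head_grads, ctx).topk(24).indices
```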
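The layer-aware allocation and the multi-round schedule can be sketched in the same spirit. Everything below is again an assumption rather than the paper's method: sensitivity is proxied by a per-layer scalar (e.g., the loss increase after pruning a small probe fraction), a softmax over negative sensitivities spreads a global pruning budget so that sensitive layers lose less, and a geometric schedule removes units gradually over several rounds with re-scoring in between. All function names and parameters here (`lsps_ratios`, `multi_round_keep_masks`, `temperature`, `score_fn`) are hypothetical.

```python
import torch

def lsps_ratios(sensitivities, global_ratio: float = 0.3,
                temperature: float = 1.0) -> torch.Tensor:
    """Allocate per-layer pruning ratios from layer sensitivities.

    sensitivities : one float per layer; larger means more sensitive
    global_ratio  : average fraction of units to remove across layers
    """
    s = torch.tensor(list(sensitivities), dtype=torch.float32)
    share = torch.softmax(-s / temperature, dim=0)  # sensitive layers -> small share
    ratios = global_ratio * share * len(s)          # rescale: mean(ratios) == global_ratio
    return ratios.clamp(max=0.9)                    # never strip a layer entirely

def multi_round_keep_masks(score_fn, units_per_layer, sensitivities,
                           global_ratio: float = 0.3, rounds: int = 4):
    """Boolean keep-mask per layer, pruning gradually instead of in one shot."""
    ratios = lsps_ratios(sensitivities, global_ratio)
    masks = []
    for i, n in enumerate(units_per_layer):
        mask = torch.ones(n, dtype=torch.bool)
        for r in range(1, rounds + 1):
            # Geometric schedule: after round r, keep (1 - ratio)^(r / rounds) of units.
            keep = max(1, round(n * (1.0 - ratios[i].item()) ** (r / rounds)))
            # Re-score survivors each round (e.g., with piem_scores above);
            # recovery fine-tuning would run between rounds in practice.
            scores = score_fn(i).masked_fill(~mask, float('-inf'))
            mask = torch.zeros(n, dtype=torch.bool)
            mask[scores.topk(keep).indices] = True
        masks.append(mask)
    return masks

# Toy usage: 4 layers of 16 units; random scores stand in for PIEM.
masks = multi_round_keep_masks(lambda i: torch.rand(16), [16] * 4,
                               sensitivities=[0.9, 0.2, 0.1, 0.5])
print([int(m.sum()) for m in masks])  # more sensitive layers keep more units
```

The geometric schedule mirrors the abstract's finding that multi-round pruning retains performance better than one-shot removal: each round deletes only a small slice of the remaining units, so the model can be re-scored (and, in practice, briefly fine-tuned) before the next cut.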