STRUCTURED PRUNING OF LLAMA 3.1 8B FOR EDGE DEVICES

Abstract
The rapid scaling of Large Language Models (LLMs) has revolutionized natural language processing (NLP) but introduced significant computational and memory challenges, particularly for deployment on resource-constrained edge devices. This study builds upon the Sheared LLaMA framework, extending structured pruning to the LLaMA 3.1 8B model and addressing the architectural complexities posed by Grouped Query Attention (GQA) and expanded intermediate dimensions. Our methodology introduces tailored pruning strategies that preserve critical architectural features while allocating sparsity via constrained optimization. Combined with continued pre-training, our approach reduces the model size by 35%, lowers memory usage by 38%, and decreases inference latency by 40%, while retaining up to 94% of the original model’s performance on certain NLP benchmarks. The pruned model achieves an accuracy of 52.5% on BoolQ and 59.3% on ARC Easy, demonstrating strong retention of the original model’s capabilities. By aligning structured sparsity patterns with hardware acceleration capabilities, our approach highlights the potential of structured pruning to balance efficiency and performance, offering a scalable pathway for LLM optimization in edge environments.
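
To make the GQA constraint concrete, the minimal sketch below (not the paper's implementation) prunes attention heads at the granularity of whole key/value groups, so every retained KV head keeps its full set of attached query heads. The layer shapes mirror the public LLaMA 3.1 8B configuration (32 query heads, 8 KV heads, head dimension 128, hidden size 4096); the L2-norm importance score and the `prune_gqa_heads` helper are illustrative stand-ins for the constrained-optimization pruning masks described above.

```python
# Illustrative sketch only: GQA-aware structured pruning that drops whole
# key/value groups (one KV head plus its attached query heads), so the
# grouped attention structure stays valid after pruning.
# Assumes the common layout where consecutive query heads share a KV head
# (repeat_interleave-style GQA). The L2-norm score is a simple proxy for
# the learned sparsity masks used in the paper.

import torch


def prune_gqa_heads(q_proj: torch.Tensor,   # (n_q_heads * head_dim, hidden)
                    k_proj: torch.Tensor,   # (n_kv_heads * head_dim, hidden)
                    v_proj: torch.Tensor,   # (n_kv_heads * head_dim, hidden)
                    n_q_heads: int, n_kv_heads: int, head_dim: int,
                    kv_groups_to_keep: int):
    """Keep the highest-scoring KV groups; return pruned Q/K/V weight matrices."""
    q_per_kv = n_q_heads // n_kv_heads

    # Reshape so dimension 0 indexes KV groups.
    q = q_proj.view(n_kv_heads, q_per_kv * head_dim, -1)
    k = k_proj.view(n_kv_heads, head_dim, -1)
    v = v_proj.view(n_kv_heads, head_dim, -1)

    # Score each group by the joint L2 norm of its K, V, and attached Q rows.
    scores = (torch.linalg.vector_norm(q, dim=(1, 2))
              + torch.linalg.vector_norm(k, dim=(1, 2))
              + torch.linalg.vector_norm(v, dim=(1, 2)))

    keep = torch.topk(scores, kv_groups_to_keep).indices.sort().values
    return (q[keep].reshape(-1, q_proj.shape[-1]),
            k[keep].reshape(-1, k_proj.shape[-1]),
            v[keep].reshape(-1, v_proj.shape[-1]))


# Example with LLaMA-3.1-8B-like shapes (random weights for illustration).
hidden = 4096
q_w = torch.randn(32 * 128, hidden)
k_w = torch.randn(8 * 128, hidden)
v_w = torch.randn(8 * 128, hidden)
q_p, k_p, v_p = prune_gqa_heads(q_w, k_w, v_w, 32, 8, 128, kv_groups_to_keep=6)
print(q_p.shape, k_p.shape, v_p.shape)  # (3072, 4096) (768, 4096) (768, 4096)
```

Pruning at the KV-group level is one way to keep the query-to-KV-head mapping intact; finer-grained schemes that remove individual query heads within a group are also possible but require reindexing the grouping at inference time.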