Adaptive Neural Feedback Methods for Bias and Weight Adjustment in Feed Forward Layers of LLMs
DOI: https://doi.org/10.32628/IJSRST52310380

Keywords: Feed Forward Layers, Large Language Models, Adaptive Feedback Bias, Weight Corrected Feed-Forward Network, Deep Transformer Stacks, High Learning Rate, LLM Training

Abstract
Feed-forward layers constitute the dominant computational and parametric component of transformer-based Large Language Models (LLMs), yet they are a major source of training instability due to static bias terms, uncontrolled weight scaling, and activation distribution drift. Conventional optimization methods rely solely on global backpropagation signals, which are often insufficient to correct the local statistical imbalances that emerge during large-scale, long-horizon training. This work proposes AFB-FFN (Adaptive Feedback Bias and Weight-Corrected Feed-Forward Network), a novel feed-forward layer architecture that integrates an internal neural feedback mechanism to dynamically regulate bias and weight behavior during forward propagation. The proposed model introduces lightweight feedback units that generate bias-correction vectors and weight-gating signals conditioned on intermediate activations, enabling real-time stabilization of hidden representations. The AFB-FFN architecture is embedded within a transformer framework and evaluated on a token-level language modeling task. Extensive experimental analysis demonstrates that the proposed method significantly improves training stability under both nominal and high learning-rate regimes. The model achieves a controlled token-level accuracy of 97.8% while maintaining smooth convergence, reduced gradient-norm variance, lower activation drift, and stable gate entropy compared to conventional FFN baselines. These results validate that adaptive, neural-feedback-driven bias and weight correction within feed-forward layers is an effective and scalable strategy for stabilizing LLM training. The proposed AFB-FFN offers a practical architectural advance toward robust, efficient, and statistically stable large language model optimization.
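To make the mechanism concrete, the following is a minimal PyTorch sketch of the feedback idea described in the abstract. The module and parameter names (AFBFeedForward, bias_fb, gate_fb) and the choice of a tanh-bounded bias correction with sigmoid gating are illustrative assumptions, not the authors' implementation; the abstract specifies only that lightweight feedback units produce bias-correction vectors and weight-gating signals conditioned on intermediate activations.

import torch
import torch.nn as nn

class AFBFeedForward(nn.Module):
    """Feed-forward block with illustrative adaptive feedback units (assumed design)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)
        self.act = nn.GELU()
        # Lightweight feedback units conditioned on the hidden activation:
        # one emits an additive bias correction, the other a multiplicative
        # gate standing in for the paper's weight-gating signal (assumption).
        self.bias_fb = nn.Linear(d_ff, d_ff)
        self.gate_fb = nn.Linear(d_ff, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.w_in(x))               # intermediate activation
        bias_corr = torch.tanh(self.bias_fb(h))  # bounded bias-correction vector
        gate = torch.sigmoid(self.gate_fb(h))    # per-unit gating signal in (0, 1)
        h = gate * (h + bias_corr)               # feedback-stabilized hidden state
        return self.w_out(h)

# Toy forward pass: a (batch, sequence, d_model) tensor through the block.
ffn = AFBFeedForward(d_model=512, d_ff=2048)
y = ffn(torch.randn(4, 16, 512))
assert y.shape == (4, 16, 512)

Because both feedback paths read the current activation, the correction adapts token by token at forward time rather than waiting for a global backpropagation signal, which is the stabilizing behavior the abstract attributes to AFB-FFN.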
License
Copyright (c) 2024 International Journal of Scientific Research in Science and Technology

This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0).