Enhancing Agricultural Product Evaluation with a Multi-Head Vision Transformer Approach

Authors

  • Deshinta Arrova Dewi, Nusa Putra University
  • Misinem, Universitas Bina Darma
  • Hafiz Muhammad Kurniawan, INTI International University
  • Elmar Noche, Pangasinan State University

DOI:

https://doi.org/10.31598/sintechjournal.v8i3.2069

Keywords:

Multi-Task Learning, Vision Transformer, Fruit Classification, Freshness Detection, Agricultural Quality Control

Abstract

The advancement of agricultural automation requires efficient and accurate models capable of evaluating multiple aspects of fruit quality simultaneously. Conventional computer vision systems typically employ separate models for fruit type classification and freshness detection, increasing computational complexity and reducing operational efficiency. This study proposes a Multi-Task Learning (MTL) framework based on a Vision Transformer (ViT) backbone that performs both tasks within a single unified model. The architecture uses a shared self-attention mechanism for global feature extraction and incorporates two dedicated classification heads that independently predict fruit type and freshness status from a shared feature space. Experiments were conducted on the Fresh and Stale Classification dataset, with evaluation based on accuracy, confusion matrices, precision, recall, and F1-score. The model achieved 98% accuracy for fruit classification and 99% for freshness detection. While the proposed ViT-based MTL model requires more computational resources than an individual lightweight CNN, it is more efficient than deploying two separate models, reducing total inference time by 32.8% and parameter count by 31.0% while achieving significantly higher accuracy. Results show consistently high performance across categories, with minor confusion among visually similar fruits. The proposed approach enhances predictive performance while maintaining computational efficiency, offering a practical solution for real-world agricultural quality control applications.
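
As an illustration of the architecture the abstract describes, the sketch below shows a shared ViT backbone feeding two independent classification heads, trained with a summed per-task cross-entropy loss. This is a minimal sketch, not the authors' implementation: the `timm` library, the `vit_base_patch16_224` backbone, and the class counts (9 fruit types, 2 freshness states) are illustrative assumptions.

```python
# Minimal sketch of a multi-task ViT: one shared backbone, two task heads.
# Assumes `timm` is installed; backbone choice and class counts are
# illustrative, not taken from the paper.
import torch
import torch.nn as nn
import timm

class MultiTaskViT(nn.Module):
    def __init__(self, num_fruit_types=9, num_freshness=2):
        super().__init__()
        # num_classes=0 strips timm's default head, so the backbone
        # returns pooled CLS features instead of class logits.
        self.backbone = timm.create_model(
            "vit_base_patch16_224", pretrained=False, num_classes=0
        )
        dim = self.backbone.num_features
        self.fruit_head = nn.Linear(dim, num_fruit_types)   # fruit type
        self.fresh_head = nn.Linear(dim, num_freshness)     # fresh vs. stale

    def forward(self, x):
        feats = self.backbone(x)  # shared self-attention features
        return self.fruit_head(feats), self.fresh_head(feats)

model = MultiTaskViT()
images = torch.randn(4, 3, 224, 224)                # dummy batch
fruit_logits, fresh_logits = model(images)

# Joint objective: sum of the two per-task cross-entropy losses.
criterion = nn.CrossEntropyLoss()
fruit_labels = torch.randint(0, 9, (4,))
fresh_labels = torch.randint(0, 2, (4,))
loss = criterion(fruit_logits, fruit_labels) + criterion(fresh_logits, fresh_labels)
loss.backward()
```

Because both heads read the same feature vector, a single forward pass serves both tasks, which is the source of the inference-time and parameter savings reported over running two separate models.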


Published

2025-12-31

How to Cite

Dewi, D. A., Misinem, Kurniawan, H. M., & Noche, E. (2025). Enhancing Agricultural Product Evaluation with a Multi-Head Vision Transformer Approach. SINTECH (Science and Information Technology) Journal, 8(3), 253–265. https://doi.org/10.31598/sintechjournal.v8i3.2069