Analysis Of Neural Network Architectures For Syllable-Based Voice Recognition In Indonesian

Authors

  • Deni Sutendi Kartawijaya, President University
  • Tjong Wan Sen, President University

DOI:

https://doi.org/10.31598/sintechjournal.v8i2.1770

Keywords:

voice recognition, syllable recognition, ANN, LSTM, CNN, deep learning, Indonesian

Abstract

Speech recognition technology is now widely used across many technology platforms, yet research on Indonesian syllable recognition remains scarce. The main goal of this research is to apply a combination of deep learning techniques to obtain the best model-based recognition system for Indonesian syllable recognition. Due to time constraints, the current research focuses on establishing how to process Indonesian syllable voice recognition on 1-D array data using three deep learning techniques: Artificial Neural Networks (ANN), Long Short-Term Memory networks (LSTM), and Convolutional Neural Networks (CNN). Accordingly, this study addresses syllable-based voice recognition in Indonesian using 1-D array data, evaluating and comparing the performance of ANN, LSTM, and CNN to determine their effectiveness in recognizing syllables within voice data. The dataset of voice recordings was collected manually. Labeling was performed by manually segmenting the 1-D array form of the voice data to obtain the most accurate labels; each syllable was divided into three parts of equal size (1024 time-based array samples). Initially, 400 voice recordings were collected, but because of the limited time available before submission, 10 recordings were processed, resulting in 309 unique syllable parts across 60 classes. Each architecture was evaluated on its accuracy. The results indicate significant differences in model performance, with CNN demonstrating superior capability in capturing the sequential dependencies inherent in syllabic speech data. Based on the experiments, the CNN model is the best model for Indonesian syllable classification with 99.86% accuracy, followed by LSTM and ANN with 99.03% and 91.91% accuracy, respectively. This study may contribute to future work on Indonesian voice recognition and serve as a basis for further research that combines these models to obtain better results.
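To illustrate the kind of pipeline described in the abstract, the sketch below builds a 1-D CNN classifier over fixed-length syllable segments (1024 time-domain samples, 60 classes). This is a minimal sketch, assuming TensorFlow/Keras and hypothetical placeholder data; the layer sizes and kernel widths are illustrative assumptions, not the authors' exact architecture, which is detailed in the full paper.

```python
# Minimal sketch (not the authors' exact architecture): a 1-D CNN classifier
# over fixed-length syllable segments, assuming TensorFlow/Keras.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

SEGMENT_LEN = 1024   # each syllable part is a 1024-sample time-domain array
NUM_CLASSES = 60     # unique syllable-part classes reported in the study


def build_cnn(segment_len: int = SEGMENT_LEN,
              num_classes: int = NUM_CLASSES) -> tf.keras.Model:
    """A small 1-D CNN over raw waveform segments."""
    model = models.Sequential([
        layers.Input(shape=(segment_len, 1)),
        layers.Conv1D(32, kernel_size=9, activation="relu"),
        layers.MaxPooling1D(4),
        layers.Conv1D(64, kernel_size=9, activation="relu"),
        layers.MaxPooling1D(4),
        layers.GlobalAveragePooling1D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


if __name__ == "__main__":
    # Hypothetical stand-in data: 309 labeled segments, as in the study,
    # replaced here by random noise purely to show the expected shapes.
    x = np.random.randn(309, SEGMENT_LEN, 1).astype("float32")
    y = np.random.randint(0, NUM_CLASSES, size=309)
    model = build_cnn()
    model.fit(x, y, epochs=2, batch_size=16, validation_split=0.2)
```

The same input shape can be fed to an LSTM or a fully connected ANN baseline for comparison, which is the kind of side-by-side evaluation the study reports.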


Published

2025-08-31

How to Cite

Kartawijaya, D. S., & Sen, T. W. (2025). Analysis Of Neural Network Architectures For Syllable-Based Voice Recognition In Indonesian. SINTECH (Science and Information Technology) Journal, 8(2), 95–103. https://doi.org/10.31598/sintechjournal.v8i2.1770
