Transformer-Based Sequence Modeling Short Answer Assessment Framework

P. Sharmila, Kalaiarasi Sonai Muthu Anbananthen, Deisy Chelliah, S. Parthasarathy, Baarathi Balasubramaniam, Saravanan Nathan Lurudusamy

Abstract


Automated subjective assessment presents a significant challenge due to the complex nature of human language and reasoning, characterized by semantic variability, subjectivity, ambiguity, and varying levels of judgment. Unlike objective exams, subjective assessments admit diverse valid answers, which makes automated scoring difficult. This paper proposes a novel approach that integrates advanced natural language processing (NLP) techniques with principled grading methods to address this challenge. By combining Transformer-based sequence language modeling with sophisticated grading mechanisms, it aims to develop more accurate and efficient automatic grading systems for subjective assessments in education. The proposed approach consists of three main phases:

1. Content summarization: relevant sentences are extracted using self-attention mechanisms, enabling the system to summarize the content of each response effectively.
2. Key term identification and comparison: key terms are identified within the responses and treated as overt tags. These tags are then compared to reference keys using cross-attention mechanisms, allowing a nuanced evaluation of response content.
3. Grading: responses are graded using a weighted multi-criteria decision method, which assesses several quality aspects and assigns partial scores accordingly.

Experimental results on the SQuAD dataset demonstrate the approach's effectiveness, achieving an F-score of 86%. Significant improvements in ROUGE, BLEU, and METEOR scores were also observed, validating the efficacy of the proposed approach for automating subjective assessment tasks.
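The weighted multi-criteria grading step can be illustrated with a minimal sketch. The criteria names and weights below are illustrative assumptions for exposition, not the configuration published in the paper: each criterion yields a normalized score in [0, 1], and the final partial grade is the weight-normalized sum.

```python
def grade_response(scores: dict, weights: dict) -> float:
    """Combine per-criterion scores (each in 0..1) into a single
    weighted partial grade, normalized by the total weight."""
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight

# Hypothetical criteria and weights, chosen only for illustration.
weights = {"key_term_coverage": 0.5, "content_similarity": 0.3, "coherence": 0.2}
scores = {"key_term_coverage": 0.8, "content_similarity": 0.9, "coherence": 0.5}

print(round(grade_response(scores, weights), 2))  # 0.77
```

Because the weights are normalized, partial credit degrades gracefully: a response strong on key-term coverage but weak on coherence still earns a proportionate score rather than failing outright.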

 

DOI: 10.28991/HIJ-2024-05-03-06

Full Text: PDF


Keywords


Attention Model; Sequence Language Modeling; Subjective Assessment; Transformer.





Copyright (c) 2024 Sharmila P, Kalaiarasi Sonai Muthu Anbananthen, Deisy Chelliah, Parthasarathy S, Baarathi Balasubramaniam, Saravanan Nathan Lurudusamy