Antonini, L., Manzoni, V., Giardini, C., & Quarto, M. (2025). Applications of large language models to customer satisfaction survey for summarization and topic extraction in manufacturing [Journal article]. Results in Engineering. https://hdl.handle.net/10446/309950
Applications of large language models to customer satisfaction survey for summarization and topic extraction in manufacturing
Antonini, Laura; Manzoni, Vincenzo; Giardini, Claudio; Quarto, Mariangela
2025-01-01
Abstract
This study explores the application of Large Language Models (LLMs) to the analysis of unstructured data from customer satisfaction surveys in the manufacturing sector. The analysis is divided into several stages. First, the models are used to summarize textual comments, identifying strengths and areas for improvement. Next, topic extraction and sentiment analysis are performed with GPT-4o, Gemini 1.5 Pro, Claude 3.5, Llama 3.1, and Llama 3.3: the task is to identify the topics covered in each comment and determine their sentiment (positive, negative, neutral, or mixed). To assess response quality, the results are compared with a ground truth produced by a human analyst; the similarity between the human and model classifications is measured with the Jaccard index and accuracy. Prompt engineering is also tested by creating a more structured few-shot prompt, to check whether more detailed topic explanations improve model performance. Finally, the study analyses the cost-benefit trade-off, comparing the performance, response times, and computational costs of the different models. The results show that more advanced models offer higher accuracy at a higher cost, while open-source models are a cheaper alternative with lower performance.
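
The record does not reproduce the paper's evaluation code, but the two metrics it names are standard. The following is a minimal Python sketch of how they could be computed for one annotated comment, assuming topics are compared as sets and each topic carries one sentiment label; the example comment annotations, topic names, and the choice to score sentiment accuracy only on topics both annotators found are illustrative assumptions, and the paper's exact scoring rules may differ.

```python
def jaccard(human_topics: set[str], model_topics: set[str]) -> float:
    """Jaccard index: |A intersect B| / |A union B| of the two topic sets."""
    if not human_topics and not model_topics:
        return 1.0  # both annotators found no topics: treat as full agreement
    return len(human_topics & model_topics) / len(human_topics | model_topics)

# Hypothetical annotations for a single survey comment: topic -> sentiment.
ground_truth = {"delivery time": "negative", "product quality": "positive"}
model_output = {"delivery time": "negative", "customer service": "neutral"}

# Topic-level agreement between the analyst and the model.
topic_jaccard = jaccard(set(ground_truth), set(model_output))

# Sentiment accuracy, scored only on the topics both annotators identified
# (an assumption for this sketch, not necessarily the paper's rule).
shared = set(ground_truth) & set(model_output)
accuracy = (
    sum(ground_truth[t] == model_output[t] for t in shared) / len(shared)
    if shared else 0.0
)

print(f"Jaccard index: {topic_jaccard:.2f}")   # 0.33 (1 shared topic of 3)
print(f"Sentiment accuracy: {accuracy:.2f}")   # 1.00 on the shared topic
```

Averaging these per-comment scores over the whole survey would give corpus-level figures comparable across models.
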
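
The abstract contrasts a baseline prompt with a more structured few-shot prompt that adds detailed topic explanations. The paper's actual prompts are not shown in this record, so the sketch below only illustrates the general shape of such a pair; the topic taxonomy, definitions, and worked examples are hypothetical placeholders, not the study's own.

```python
# Baseline prompt: task instruction and a bare topic list only.
ZERO_SHOT_PROMPT = """\
Identify the topics discussed in the customer comment below and label each
with a sentiment: positive, negative, neutral, or mixed.
Topics: delivery time, product quality, customer service, pricing.
Comment: {comment}
Answer as JSON: {{"topic": "sentiment"}}"""

# Few-shot prompt: same task, plus topic definitions and worked examples.
FEW_SHOT_PROMPT = """\
Identify the topics discussed in the customer comment below and label each
with a sentiment: positive, negative, neutral, or mixed.

Topic definitions:
- delivery time: punctuality and speed of order fulfilment and shipping.
- product quality: conformity, durability, and finish of the supplied parts.
- customer service: responsiveness and helpfulness of support staff.
- pricing: perceived fairness of prices and quotations.

Examples:
Comment: "Parts arrived late again, but the machining was flawless."
Answer: {{"delivery time": "negative", "product quality": "positive"}}

Comment: "Support answered quickly, though prices keep climbing."
Answer: {{"customer service": "positive", "pricing": "negative"}}

Comment: {comment}
Answer as JSON: {{"topic": "sentiment"}}"""

print(FEW_SHOT_PROMPT.format(comment="Delivery was on time and the quote was fair."))
```

Sending both variants of the same comment to each model and re-scoring with the metrics above is one way to measure whether the added structure pays for its extra prompt tokens.
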
| File | File size | Format |
|---|---|---|
| 1-s2.0-S2590123025032335-main.pdf (open access; publisher's version; Creative Commons license) | 2.94 MB | Adobe PDF |