
(2026). Bayesian generation of synthetic datasets for machine-learning tasks: a performance study [journal article]. In NEUROCOMPUTING. Retrieved from https://hdl.handle.net/10446/317345

Bayesian generation of synthetic datasets for machine-learning tasks: a performance study

Fosci, Paolo; Psaila, Giuseppe
2026-01-01

Abstract

Performing Machine Learning (ML) tasks on large-scale datasets, as well as simply storing them for subsequent analysis or long-term archival, requires substantial computational power. The described approach builds on the technique known as "Bayesian Generation" to produce synthetic datasets in such a way that the probability distribution of the source dataset is preserved as much as possible in the new synthetic ones, even when they are much smaller than the original (large) dataset. This study investigates the impact of generating smaller synthetic datasets for training ML models in place of the original dataset, adopting a twofold perspective. Firstly, the impact on the effectiveness of ML models trained on these smaller synthetic datasets is assessed. Secondly, the amount of computational resources required to generate the synthetic datasets, train ML models on them, and perform the testing phase is measured; specifically, both execution time and main memory usage are taken into account. Finally, this research shows that the loss in effectiveness remains consistently limited and stable, and it identifies the scenarios and ML techniques for which incorporating the generation of small synthetic datasets into the ML pipeline can be beneficial for practical deployment in environments with constrained computational resources, such as mobile or industrial devices.
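The paper's tool (YABaGen, per the record's keywords) is not shown here, but the core idea behind Bayesian Generation can be illustrated with a minimal sketch: estimate the conditional probability tables of a simple Bayesian network (class node with dependent discrete features) from a large source dataset, then draw a much smaller synthetic dataset by ancestral sampling, so the synthetic data tracks the source distribution. The toy dataset, network shape, and all names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "large" discrete dataset (illustrative, not from the paper):
# a class label y and two categorical features whose distributions
# depend on y, i.e. a simple class -> features Bayesian network.
N = 100_000
y = rng.choice([0, 1], size=N, p=[0.7, 0.3])
x1 = np.where(y == 0,
              rng.choice(3, N, p=[0.6, 0.3, 0.1]),
              rng.choice(3, N, p=[0.1, 0.3, 0.6]))
x2 = np.where(y == 0,
              rng.choice(2, N, p=[0.8, 0.2]),
              rng.choice(2, N, p=[0.3, 0.7]))

def fit_network(y, x1, x2):
    """Estimate P(y) and P(x_i | y) from the source data."""
    p_y = np.bincount(y, minlength=2) / len(y)
    p_x1 = np.stack([np.bincount(x1[y == c], minlength=3) / (y == c).sum()
                     for c in (0, 1)])
    p_x2 = np.stack([np.bincount(x2[y == c], minlength=2) / (y == c).sum()
                     for c in (0, 1)])
    return p_y, p_x1, p_x2

def sample_synthetic(p_y, p_x1, p_x2, n, rng):
    """Ancestral sampling: draw y first, then each feature given y."""
    ys = rng.choice(2, size=n, p=p_y)
    x1s = np.array([rng.choice(3, p=p_x1[c]) for c in ys])
    x2s = np.array([rng.choice(2, p=p_x2[c]) for c in ys])
    return ys, x1s, x2s

p_y, p_x1, p_x2 = fit_network(y, x1, x2)
# Synthetic dataset at 5% of the source size.
ys, x1s, x2s = sample_synthetic(p_y, p_x1, p_x2, 5_000, rng)

# The synthetic marginals should closely track the source marginals.
print("source P(y):   ", np.round(p_y, 3))
print("synthetic P(y):", np.round(np.bincount(ys, minlength=2) / len(ys), 3))
```

A model trained on the 5,000 synthetic rows would then be compared against one trained on the full 100,000 rows, in both effectiveness and time/memory cost, which is the twofold evaluation the abstract describes.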
Type: journal article
Year: 2026
Language: English
Publication status: online
Volume: 670
Article number: Art. n. 132508
Pages: 1-14
Peer review: anonymous reviewers
Academic field: Sector IINF-05/A - Information processing systems
Keywords: Generation of synthetic data; Bayesian generation; Bayesian networks; The YABaGen tool; Effectiveness and efficiency
Part of: Special issue SOCO 2024: Recent advancements in soft computing and its application in industrial and environmental problems
All authors: Fosci, Paolo; Nieves, Javier; Psaila, Giuseppe; Boffelli, Jacopo; Garcia, Bringas Pablo
Document type: info:eu-repo/semantics/article
Access: open
Classification: 1.1 Journal contributions::1.1.01 Journal Articles/Essays
File attached to the record:
File: 1-s2.0-S0925231225031807-main.pdf
Access: open access
Version: publisher's version
License: Creative Commons
Size: 3.04 MB
Format: Adobe PDF

Aisberg ©2008 Library Services, University of Bergamo | Terms of use

Use this identifier to cite or link to this document: https://hdl.handle.net/10446/317345
Citations
  • Scopus: 0
  • ISI: 0