
(2026). Bayesian generation of synthetic datasets for machine-learning tasks: a performance study [journal article]. In NEUROCOMPUTING. Retrieved from https://hdl.handle.net/10446/317345

Bayesian generation of synthetic datasets for machine-learning tasks: a performance study

Fosci, Paolo; Psaila, Giuseppe
2026-01-01

Abstract

Performing Machine Learning (ML) tasks on large-scale datasets, as well as simply storing them for subsequent analysis or long-term archival, requires substantial computational resources. The described approach builds on the technique known as "Bayesian Generation" to produce synthetic datasets in such a way that the probability distribution of the source dataset is preserved as much as possible in the new synthetic ones, even when they are much smaller than the original (large) dataset. This study investigates the impact of generating smaller synthetic datasets for training ML models in place of the original dataset, adopting a twofold perspective. First, the impact on the effectiveness of ML models trained on these smaller synthetic datasets is assessed. Second, the amount of computational resources required to generate the synthetic datasets, train ML models on them, and perform the testing phase is measured; specifically, both execution time and main-memory usage are taken into account. Finally, this research work shows that the loss in effectiveness remains consistently limited and stable, and it identifies the scenarios and ML techniques for which incorporating the generation of small synthetic datasets into the ML pipeline can be beneficial for practical deployment in environments with constrained computational resources, such as mobile or industrial devices.
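The abstract does not detail the paper's actual "Bayesian Generation" algorithm, but the core idea it describes, fitting a generative model to a large dataset and sampling a much smaller synthetic dataset that preserves its probability distribution, can be illustrated with a minimal sketch. The sketch below is an assumption-laden stand-in (a naive-Bayes-style model with class-conditional, per-feature Gaussians on a toy dataset), not the authors' method:

```python
# Hedged sketch only: the paper's actual "Bayesian Generation" technique is not
# specified in this abstract. This illustrates the general idea with a
# naive-Bayes-style generative model: class priors plus class-conditional,
# per-feature Gaussian parameters, sampled to build a smaller synthetic set.
import numpy as np

rng = np.random.default_rng(42)

# Toy "large" source dataset: 10,000 rows, 2 features, binary labels.
n = 10_000
y = rng.integers(0, 2, size=n)
X = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 2))

def fit_generator(X, y):
    """Estimate class priors and per-feature Gaussian parameters per class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(y), Xc.mean(axis=0), Xc.std(axis=0))
    return params

def sample_synthetic(params, m, rng):
    """Draw a smaller synthetic dataset from the fitted distribution."""
    classes = list(params)
    priors = np.array([params[c][0] for c in classes])
    ys = rng.choice(classes, size=m, p=priors)
    Xs = np.stack([rng.normal(params[c][1], params[c][2]) for c in ys])
    return Xs, ys

params = fit_generator(X, y)
# A synthetic dataset 20x smaller than the source, distribution preserved.
X_syn, y_syn = sample_synthetic(params, m=500, rng=rng)
```

An ML model would then be trained on `(X_syn, y_syn)` instead of `(X, y)`, trading a (hopefully small) loss in effectiveness for much lower training time and memory, which is the trade-off the paper measures.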
article
2026
Fosci, Paolo; Nieves, Javier; Psaila, Giuseppe; Boffelli, Jacopo; Garcia Bringas, Pablo
File attached to the record:
1-s2.0-S0925231225031807-main.pdf
Access: open access
Version: publisher's version
License: Creative Commons
File size: 3.04 MB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/10446/317345
Citations
  • Scopus: 0
  • Web of Science (ISI): 0