Bergomi, L., Parimbelli, E., Pala, D., & Buonocore, T. M. (2025). BAT: A Toolkit for Biomedical Text Augmentation. Retrieved from https://hdl.handle.net/10446/316346

BAT: A Toolkit for Biomedical Text Augmentation

Pala, Daniele
2025-01-01

Abstract

We introduce BAT (Biomedical Augmentation for Text), a Python package that augments biomedical text using a neuro-symbolic pipeline. The approach combines knowledge-driven and data-driven methods to generate perturbed versions of a text while preserving its original meaning. The package provides two categories of functions: knowledge-based (KB) perturbation and transformer-based (TB) perturbation. KB perturbation offers a utility interface to semantic resources and handles medical terminology alongside general-purpose terms by providing both medical and general synonym replacement. TB perturbation uses language models to generate new augmented sentences through contextual word prediction, back-translation, and rephrasing. BAT is designed to address typical challenges of biomedical text, such as complex medical jargon, and to enrich text while maintaining its readability. It is also modular, so it can be integrated into existing NLP workflows and applied to inputs ranging from single words and sentences to large corpora. By integrating formalized domain knowledge with machine learning models, BAT serves as a toolkit for text augmentation across multiple languages, including English as well as low-resource languages such as Italian, Spanish, and French. It supports the generation of diverse textual data for a range of biomedical applications, including creating new training samples, addressing imbalanced distributions, and evaluating model robustness.
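To make the knowledge-based side of the abstract concrete, the following is a minimal sketch of dictionary-driven synonym replacement applied first to medical terms and then to general-purpose terms. It is not BAT's API: the function names `replace_synonyms` and `kb_perturb` and the two toy lexicons are hypothetical, introduced only for illustration, whereas the package itself is described as drawing on real semantic resources.

```python
import random
import re

# Hypothetical toy lexicons (assumption); a real pipeline would query semantic resources.
MEDICAL_SYNONYMS = {
    "myocardial infarction": ["heart attack"],
    "hypertension": ["high blood pressure"],
}
GENERAL_SYNONYMS = {
    "reported": ["described"],
    "severe": ["serious"],
}


def replace_synonyms(text, lexicon, rng):
    """Replace each lexicon term found in `text` with a randomly chosen synonym."""
    # Try longer terms first so multi-word terms win over shorter overlapping ones.
    terms = sorted(lexicon, key=len, reverse=True)
    if not terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: rng.choice(lexicon[m.group(0).lower()]), text)


def kb_perturb(text, seed=0):
    """Apply medical synonym replacement, then general synonym replacement."""
    rng = random.Random(seed)  # seeded so perturbations are reproducible
    text = replace_synonyms(text, MEDICAL_SYNONYMS, rng)
    return replace_synonyms(text, GENERAL_SYNONYMS, rng)


print(kb_perturb("The patient reported severe hypertension after a myocardial infarction."))
```

Running the medical pass before the general pass is a design choice of this sketch: it keeps multi-word clinical terms from being partially rewritten by general-purpose substitutions. The transformer-based perturbations mentioned in the abstract would require a language model and are not sketched here.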
2025
Bergomi, Laura; Parimbelli, Enea; Pala, Daniele; Buonocore, Tommaso M.

Use this identifier to cite or link to this document: https://hdl.handle.net/10446/316346