(2025). BAT: A Toolkit for Biomedical Text Augmentation. Retrieved from https://hdl.handle.net/10446/316346
BAT: A Toolkit for Biomedical Text Augmentation
Pala, Daniele;
2025-01-01
Abstract
We introduce BAT (Biomedical Augmentation for Text), a Python package for augmenting textual data in the biomedical domain using a neuro-symbolic pipeline. The approach combines knowledge-driven and data-driven methods to generate perturbed versions of a text while preserving its original meaning. The package provides two categories of functions: Knowledge-based (KB) perturbation and Transformer-based (TB) perturbation. KB perturbation offers a utility interface to semantic resources for handling medical terminology alongside general-purpose terms, providing both medical and general synonym replacement. TB perturbation uses language models to generate new augmented sentences through contextual word prediction, back-translation, and rephrasing. BAT is designed to tackle the typical challenges of biomedical text, navigating complex medical jargon and enriching text while maintaining its readability. It is also modular, allowing integration into existing NLP workflows and processing of inputs ranging from single words and sentences to large corpora. By integrating formalized domain knowledge with machine learning models, BAT serves as a versatile toolkit for text augmentation across multiple languages, including English as well as lower-resource languages such as Italian, Spanish, and French. It facilitates the generation of diverse, high-quality textual data to support a range of biomedical applications, including creating new training samples, addressing imbalanced distributions, and evaluating model robustness.
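The knowledge-based synonym replacement described in the abstract can be illustrated with a minimal sketch. This is not BAT's actual API: the function names `kb_perturb` and `augment` and the two toy lexicons are assumptions made for illustration only, standing in for the semantic resources the package queries.

```python
import random
import re

# Hypothetical toy lexicons (assumption): stand-ins for real medical and
# general-purpose semantic resources.
MEDICAL_SYNONYMS = {
    "myocardial infarction": ["heart attack"],
    "hypertension": ["high blood pressure"],
}
GENERAL_SYNONYMS = {
    "patient": ["subject"],
    "shows": ["presents"],
}


def kb_perturb(text, lexicon, rng):
    """Replace every lexicon term found in `text` with a randomly chosen synonym."""
    # Try longer terms first so multi-word terms take priority over substrings.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: rng.choice(lexicon[m.group(0).lower()]), text)


def augment(text, seed=0):
    """Apply medical and general synonym replacement in a single pass."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    lexicon = {**GENERAL_SYNONYMS, **MEDICAL_SYNONYMS}
    return kb_perturb(text, lexicon, rng)


if __name__ == "__main__":
    print(augment("The patient shows hypertension after a myocardial infarction."))
```

In this sketch the medical and general lexicons are merged and applied together; a real pipeline would likely let the caller choose which category of replacement to apply and how many terms to perturb.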

