(2022). Dictionary-based Classification of Tweets About Environment [journal article]. In JOURNAL OF MATHEMATICS AND STATISTICAL SCIENCE. Retrieved from http://hdl.handle.net/10446/203283
Dictionary-based Classification of Tweets About Environment
Cameletti, Michela; Fabris, Silvia; Schlosser, Stephan; Toninelli, Daniele
2022-01-01
Abstract
In the era of social media, the wide availability of digital big data (e.g. posts sent through social networks or unstructured data scraped from websites) makes it possible to develop new types of research in a wide range of fields. These data are available at low cost and in almost real time. Nevertheless, their collection and analysis are challenging. This paper proposes an unsupervised dictionary-based method to filter tweets related to a specific topic, namely the environment. We start from the tweets sent by a selection of Official Social Accounts clearly linked with the subject of interest. Then, we identify a list of expressions (bigrams, trigrams and hashtags) used to build the topic-oriented dictionary. Our approach has some relevant advantages: it reduces as much as possible the interventions and decisions of the researcher, as well as the processing time; it is based mostly on combinations of words (instead of single words), which eases the identification of tweets concerning the topic of interest; and it is not based on a pre-defined dictionary, so it can be personalized and generalized to other topics. We test the performance of our method by applying the resulting dictionary to a sample of more than 3.5 million geolocated tweets posted in Great Britain between January and May 2019. All the evaluation criteria indicate very good performance. In particular, accuracy, sensitivity and the F1 score are equal to or higher than 98.4%; specificity and precision also reach excellent levels (around 97.5%), higher than those of the most common current selection methods.

File | File size | Format | |
---|---|---|---|
dictionarybased.pdf (open access; version: publisher's version; license: Creative Commons) | 364.4 kB | Adobe PDF | |
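
As a rough illustration of the workflow summarized in the abstract, the following Python sketch builds a small dictionary of bigrams, trigrams and hashtags from tweets by topic-related accounts, uses it to filter other tweets, and computes the five evaluation criteria mentioned there. It is not the authors' implementation: the function names, the `min_count` frequency threshold, the tokenization rule and the toy tweets are all assumptions made for this example, since the abstract does not specify how expressions are selected.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a tweet and split it into word tokens and hashtags."""
    return re.findall(r"#?\w+", text.lower())


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def expressions(text):
    """Extract hashtags, bigrams and trigrams from one tweet.

    Simplification (assumption): hashtags are dropped before n-grams are
    formed, so words on either side of a hashtag are treated as adjacent.
    """
    tokens = tokenize(text)
    words = [t for t in tokens if not t.startswith("#")]
    hashtags = [t for t in tokens if t.startswith("#")]
    return hashtags + ngrams(words, 2) + ngrams(words, 3)


def build_dictionary(official_tweets, min_count=2):
    """Keep expressions seen at least min_count times (hypothetical rule)."""
    counts = Counter(e for tweet in official_tweets for e in expressions(tweet))
    return {e for e, c in counts.items() if c >= min_count}


def is_on_topic(text, dictionary):
    """Flag a tweet if it contains at least one dictionary expression."""
    return any(e in dictionary for e in expressions(text))


def evaluate(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity, precision and F1."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy data, invented for illustration only.
official = [
    "Climate change is real #climateaction",
    "Act now on climate change #climateaction",
    "Air pollution harms health",
]
dictionary = build_dictionary(official)

labelled = [
    ("Worried about climate change today", True),
    ("Great football match tonight", False),
    ("#ClimateAction march in London", True),
    ("Recycling matters to me", True),
]
predictions = [is_on_topic(text, dictionary) for text, _ in labelled]
scores = evaluate([label for _, label in labelled], predictions)
print(sorted(dictionary))
print(scores)
```

Matching on multi-word expressions rather than single words is what keeps a tweet such as "Change of climate in the office" out of the selection in this sketch, which mirrors the motivation given in the abstract for preferring combinations of words.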