I Simposio de Postgrado 2023. Ingeniería, ciencias e innovación
RIVERTEXT: A FRAMEWORK FOR TRAINING AND EVALUATING INCREMENTAL WORD EMBEDDINGS FROM TEXT DATA STREAMS

Gabriel Iturra-Bocaz 1,2,3*, Felipe Bravo-Marquez 1,2,3

1 Departamento de Ciencias de la Computación, Universidad de Chile.
2 National Center for Artificial Intelligence (CENIA).
3 Millennium Institute for Foundational Research on Data (IMFD).
*Email: gabrieliturrab@ug.uchile.cl

ABSTRACT

Word embeddings have become indispensable tools in various natural language processing tasks, including document classification, ranking, and question answering. However, traditional word embedding models have a major limitation: their static nature hinders their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web. To address this challenge, incremental word embedding algorithms have been introduced, enabling the dynamic updating of word representations in response to new language patterns and continuous data streams.

This thesis presents RiverText, a comprehensive framework for training and evaluating incremental word embeddings from text data streams. Our tool provides a valuable resource for the natural language processing community working with word embeddings in streaming scenarios, such as social media analysis. The library implements various incremental word embedding techniques in a standardized framework, including Skip-gram, Continuous Bag of Words, and Word Context Matrix. Additionally, we propose an evaluation scheme that adapts intrinsic static word embedding evaluation tasks, such as word similarity and categorization, to a streaming setting.

In our experiments, we compare the performance of our framework under different hyperparameter settings: the embedding size, the window size, the number of negative samples, and the number of contexts for a given target word. Furthermore, to simulate text data streams, we use tweets downloaded from the Twitter API and evaluate the performance of our models, providing the first benchmark of incremental word embedding algorithms. As the main conclusion, we observed that hyperparameter tuning plays a critical role in performance and that some incremental algorithms outperform others on a given intrinsic NLP task, depending on the chosen test dataset and task.

Our open-source library is available at https://github.com/dccuchile/rivertext.
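To make the streaming setting concrete, the following is a minimal, self-contained sketch of the kind of pipeline the abstract describes: a word-context matrix updated incrementally from a stream of tokenized documents, with a periodic intrinsic check on word similarity. The class and function names (IncrementalWordContextMatrix, learn_one, cosine) and the toy stream are illustrative assumptions, not RiverText's actual API; consult the repository linked above for the real interface.

```python
# Minimal sketch only: this is NOT RiverText's actual API. All names below
# (IncrementalWordContextMatrix, learn_one, cosine, the toy stream) are
# illustrative assumptions; see https://github.com/dccuchile/rivertext for
# the real interface.
from collections import defaultdict

import numpy as np


class IncrementalWordContextMatrix:
    """Maintains word-context co-occurrence counts over a stream and exposes
    positive-PMI-weighted context vectors as the word embeddings."""

    def __init__(self, window_size=3):
        self.window_size = window_size
        self.counts = defaultdict(lambda: defaultdict(int))  # word -> context -> count
        self.word_totals = defaultdict(int)  # word -> number of pairs with it as centre
        self.total = 0  # total number of (word, context) pairs observed

    def learn_one(self, tokens):
        """Update the counts with a single tokenized document (e.g., a tweet)."""
        for i, word in enumerate(tokens):
            lo = max(0, i - self.window_size)
            hi = min(len(tokens), i + self.window_size + 1)
            for j in range(lo, hi):
                if i == j:
                    continue
                self.counts[word][tokens[j]] += 1
                self.word_totals[word] += 1
                self.total += 1

    def vector(self, word, context_vocab):
        """Positive-PMI vector of `word` restricted to a fixed list of contexts."""
        vec = np.zeros(len(context_vocab))
        for k, context in enumerate(context_vocab):
            joint = self.counts[word].get(context, 0)
            if joint == 0:
                continue
            pmi = np.log(
                (joint * self.total)
                / (self.word_totals[word] * self.word_totals[context])
            )
            vec[k] = max(pmi, 0.0)
        return vec


def cosine(u, v):
    """Cosine similarity between two vectors, 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm > 0 else 0.0


if __name__ == "__main__":
    # Toy stream standing in for tweets pulled from the Twitter API.
    stream = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
        "stocks fell sharply on wall street".split(),
        "the cat chased the dog".split(),
    ] * 50

    model = IncrementalWordContextMatrix(window_size=2)
    contexts = ["the", "sat", "on", "mat", "rug", "street", "chased"]

    # Periodic evaluation: every `period` documents, probe one word pair,
    # mimicking intrinsic evaluation adapted to a streaming setting.
    period = 50
    for t, doc in enumerate(stream, start=1):
        model.learn_one(doc)
        if t % period == 0:
            sim = cosine(model.vector("cat", contexts), model.vector("dog", contexts))
            print(f"after {t} docs: sim(cat, dog) = {sim:.3f}")
```

This sketch stands in for the Word Context Matrix technique named in the abstract; Skip-gram and Continuous Bag of Words would instead apply gradient updates to neural embedding weights per incoming document, and the periodic probe mirrors the idea of running intrinsic tasks repeatedly as the stream advances.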
RkJQdWJsaXNoZXIy Mzc3MTg=