This repository documents the implementation and evaluation of domain adaptation of Large Language Models as introduced in the paper Adaptation of Large Language Models for the public sector: A clustering use case.
Since 2023, the SEMIC action has been scaling up its use of AI tools and methods to examine how AI-driven applications can complement and automate existing SEMIC assets (data models, vocabularies, tools).
Through this empirical study, the SEMIC team aimed to further contribute to this effort by diving into the world of Large Language Models and domain adaptation. The objective of the study was to evaluate the potential benefits of retraining Large Language Models on language- and domain-specific data, including policy documents and legislation. To achieve this, the SEMIC team needed to:
- Determine relevant domain-specific data sources and create a domain-specific corpus.
- Fine-tune or retrain two chosen language models on the domain corpus.
- Compare the performance of the adapted models against the generic language models on the chosen use case.
The following sections document the different steps undertaken to achieve these objectives by deep-diving into the three main folders of this repository.
The first step of the study consisted in defining an appropriate corpus for retraining Large Language Models on domain-specific data. The approach followed by the SEMIC team to define a corpus of relevant text and make it suitable for domain adaptation of Large Language Models is described below.
To retrain Large Language Models, a corpus of text specific to the domain of interest needs to be defined. For this activity, the retrained models were used to cluster pledges on the Transition Pathway for Tourism of the European Commission.
To build the domain corpus, the SEMIC team therefore searched for publications revolving around this topic. The document Transition Pathway for Tourism, which describes the different measures and outputs related to the implementation of the Transition Pathway for Tourism, was used as a point of reference.
Using a PDF scraping package for Python (pikepdf), 200 distinct URL links to websites or online PDFs were extracted, of which 183 were links to EU websites and documents. After the removal of invalid links (typing errors, outdated references, …) and the addition of extra relevant resources (updated locations, additional resources to download on websites, …), a final set of 225 URLs to visit was defined.
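The extraction of hyperlinks could look like the following minimal sketch with pikepdf; the input file name is a placeholder and the exact extraction logic used in the study may differ.

```python
# Hypothetical sketch: extracting URI annotations (hyperlinks) from a PDF with pikepdf.
import pikepdf

urls = set()
with pikepdf.open("transition_pathway_for_tourism.pdf") as pdf:  # placeholder file name
    for page in pdf.pages:
        # Link annotations live in the optional /Annots array of each page.
        for annot in page.get("/Annots", []):
            action = annot.get("/A")
            if action is not None and "/URI" in action:
                urls.add(str(action["/URI"]))

print(f"{len(urls)} distinct URLs found")
```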
For each of the visited links, different scraping libraries from Python (Selenium for URLs, and PdfReader for online PDFs) were used to extract the relevant textual content.
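A simplified sketch of the two scraping paths is shown below; the URLs, driver setup, and error handling are illustrative assumptions, and PdfReader is taken from the pypdf package.

```python
# Illustrative sketch of the two scraping paths (HTML pages vs. online PDFs).
import io
import requests
from pypdf import PdfReader
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_html(url: str, driver: webdriver.Chrome) -> str:
    """Render the page with Selenium and return its visible text."""
    driver.get(url)
    return driver.find_element(By.TAG_NAME, "body").text

def scrape_pdf(url: str) -> str:
    """Download an online PDF and extract its text with PdfReader."""
    response = requests.get(url, timeout=30)
    reader = PdfReader(io.BytesIO(response.content))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

driver = webdriver.Chrome()
corpus = []
for url in ["https://example.eu/page", "https://example.eu/report.pdf"]:  # placeholder URLs
    text = scrape_pdf(url) if url.lower().endswith(".pdf") else scrape_html(url, driver)
    corpus.append(text)
driver.quit()
```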
To make the scraped content more amenable to training language models, a series of transformations had to be applied to clean the corpus:
Figure 1: Overview of the cleaning process
Firstly, built-in Python string methods were used to replace contractions (e.g., we’ll, we’re, …) and switch words to lowercase. Then, unnecessary characters, i.e., elements that do not bring any semantic value to the text, were removed by relying on regular expressions. This included URLs, non-alphanumeric characters, trailing and leading whitespaces, and double/triple whitespaces.
Finally, an extra cleaning step was necessary to handle syntactic errors generated by the scraping of online content. For instance, in most documents, the pronoun “this” had been scraped in two pieces, “th” and “is”. A combination of regular expressions and manual cleaning was used to correct these errors.
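A minimal sketch of the cleaning steps is given below; the contraction map and fix-up patterns are illustrative, not the exhaustive set used in the study.

```python
# Minimal sketch of the corpus cleaning pipeline described above.
import re

CONTRACTIONS = {"we'll": "we will", "we're": "we are", "it's": "it is"}  # illustrative subset

def clean(text: str) -> str:
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"https?://\S+", " ", text)       # URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # non-alphanumeric characters
    text = re.sub(r"\bth\s+is\b", "this", text)     # scraping artefact: "th is" -> "this"
    text = re.sub(r"\s+", " ", text).strip()        # double/triple, trailing and leading whitespace
    return text
```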
After the application of the different cleaning steps, the SEMIC team performed a short exploration of the corpus (stored in the Files folder). This step provided a first intuition of the impact that the retraining could have.
The final corpus contained 225 documents ready to train language models. Some key figures about this final training corpus were gathered (see Table 1).
| | |
|---|---|
| Number of distinct documents | 225 |
| Languages | English |
| Number of tokens (with stop words) | 1.949M |
| Number of tokens (without stop words) | 1.349M |

Table 1: Basic information on the training corpus
From Table 1, it can be seen that the size of the training corpus was approximately 2 million tokens. Though small compared to the 3.3 billion tokens used for pre-training BERT and RoBERTa, this corpus had the advantage of only containing data very specific to the target task, which can lead to better target-domain adaptation. In addition, relying on this approach helped limit the training costs.

An alternative approach envisaged by the SEMIC team was the use of a larger corpus of tourism data built from Wikipedia. To gain insight into the impact of using a larger corpus, the SEMIC team also considered using TourBERT during this study. This model was trained from scratch on a large set of texts on tourism.
After the creation of a training corpus, the next step consisted in training the large language models to adapt them to the Transition Pathway for Tourism domain. This section focuses on the methodology used to perform this domain specialisation, i.e., how the SEMIC team further pre-trained LLMs for domain specialisation.
Adapting a language model to specific domains and tasks is not a new topic in the literature. Models such as FinBERT, SciBERT, BioBERT, or TourBERT have already demonstrated the benefits of training LLMs on domain data (better performance, lower costs, ...). With the emergence of these models, a multitude of approaches for domain adaptation have also been developed.
Figure 2: Example of taxonomy of domain specialisation
It was chosen to rely on task-adaptive pre-training to fine-tune the models to the domain of interest. To put it simply, the SEMIC team fine-tuned a pre-trained LLM on a masked language modelling task with the domain corpus as training data. With this approach (which was also used for training FinBERT), it was possible to benefit from the general language knowledge of the model and tweak it for a better semantic understanding of the domain language. It also kept training costs low compared to training from scratch, as was done for TourBERT or SciBERT.

Regarding the choice of models to fine-tune, it was decided to evaluate the impact of domain adaptation for BERT (bert-base-uncased) and RoBERTa (roberta-base). The decision was motivated by different factors:
- Clustering capabilities: The models needed to be suited for the underlying task of this study, i.e., text clustering. Compared to other models (e.g., BLOOM), BERT and RoBERTa, as embedding models, have shown good capabilities for this specific natural language processing task.
- Open-source: Choosing open-source models avoided a black-box effect. In addition, it helped limit the costs and facilitated the dissemination of the results. Both models can be freely accessed through Huggingface.
- Size of the model: To limit the training time and computational requirements, the models needed to be reasonable in terms of size.
From a technical perspective, the models were fine-tuned on an AWS infrastructure (with the scripts stored in Fine-tuning AWS):
Figure 3: High-level architecture of the training on AWS
- An Amazon Sagemaker training instance (on a ml.p3.2xlarge instance as recommended by Huggingface) was used for training both models.
- The training job was initiated by fitting a HuggingfaceEstimator object (see the sketch after this list).
- Sagemaker studio was used as an IDE for writing and running the training script.
- An S3 bucket was created to store the training data and the model after training.
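The launch of the training job could look like the following hedged sketch; the IAM role, S3 paths, framework versions, and hyperparameter names are placeholders rather than the exact values used in the study.

```python
# Hedged sketch of launching the SageMaker training job with the Huggingface estimator.
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

huggingface_estimator = HuggingFace(
    entry_point="train.py",              # the fine-tuning script (cf. Fine-tuning AWS)
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",       # instance type recommended by Huggingface
    instance_count=1,
    role=role,
    transformers_version="4.26",         # illustrative framework versions
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={"model_name": "bert-base-uncased", "epochs": 40, "train_batch_size": 32},
)

# Fitting the estimator starts the training job; the training data and the
# resulting model artefacts are stored in the S3 bucket.
huggingface_estimator.fit({"train": "s3://my-bucket/tourism-corpus/"})
```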
Following the original BERT training, the models were fine-tuned for 40 epochs with a batch size of 32. The maximum sentence length was set to 128 tokens and the models were trained until the training loss started to converge. Training was then continued allowing sentence lengths up to 512 tokens.
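The core of the training script could resemble the following simplified masked-language-modelling sketch, assuming the cleaned corpus is available as a single text file; the file and output names are placeholders and the actual scripts are stored in the Fine-tuning AWS folder.

```python
# Simplified sketch of the masked-language-modelling fine-tuning step.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"          # or "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load and tokenise the domain corpus (one document or paragraph per line).
dataset = load_dataset("text", data_files={"train": "tourism_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# The collator randomly masks tokens so the model learns to predict them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="./tourism-bert",
                         num_train_epochs=40,
                         per_device_train_batch_size=32)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```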
Both models were fine-tuned in approximately 1 hour and 50 minutes.
The final step of this work consisted in the assessment of the impact of domain-specific training on the clustering of pledges. The assessment was performed in two steps:
- Clustering with LLMs: For each model, i.e., BERT and RoBERTa, both the pre-trained and fine-tuned versions were used to generate clusters of pledges on the Transition Pathway for Tourism.
- Cluster validation: The results of the different clustering processes were evaluated using a combination of intrinsic (silhouette score and inertia) and extrinsic (accuracy and F1 score) validation metrics. These metrics were chosen based on a literature review on clustering evaluation. In addition, for extrinsic validation an innovative approach relying on Azure OpenAI GPT-4 as a human emulator was used to create ground-truth labels.
The following sections provide a more in-depth overview of the different steps. The main results and conclusions of the study are also presented.
As previously mentioned, the first step consisted in the clustering of pledges on the Transition Pathway for Tourism. Therefore, for each model, i.e., BERT and RoBERTa, both the pre-trained and fine-tuned versions were used to generate clusters of pledges. Clusters obtained with Word2Vec during a previous phase of the SEMIC action were also used as a baseline (see here).
As for the work done during this previous project, a three-step process was followed to obtain the clusters of pledges (see here):
- Pre-processing:
The first step consisted in a series of transformations to pre-process the pledges’ text and make it ready for text clustering. Firstly, the SEMIC team used built-in Python string methods to replace contractions (e.g., we’ll, we’re, …) and switch words to lowercase. Then, regular expressions were used to remove unnecessary characters, i.e., elements that did not bring any semantic value to the clustering task, from the pledges. This included URLs, punctuation, digits, non-alphanumeric characters, single characters, and double/triple whitespaces.
- Document Indexing
After the pre-processing, the next step consisted in indexing the data, i.e., creating a vector representation of the pledges. The process consisted of four separate steps.
First, the pledges were tokenised using the NLTK word_tokenize function. To comply with the maximum input length of BERT and RoBERTa, each pledge then needed to be chunked into sentences of maximum 512 tokens (for pledges with fewer than 512 tokens, padding tokens were added). Finally, special tokens were added at the beginning and end of each sentence to inform the model about the start and end of the sequence to embed.
The next step, before indexing each sentence, consisted in creating three vectors of size 512 from the tokenised sentences. The first vector was obtained by replacing each word of the sentence with its corresponding ID from the model’s vocabulary. A vector of zeros, the segment-ID vector, was then created to inform the model that all the words of the vector belonged to the same sentence. Finally, to allow the model to ignore padding tokens, an attention-mask vector was generated. The indexed sentences were then obtained by feeding the three vectors into the chosen Large Language Model. The average of the second-to-last hidden layer was used as the sentence embedding.
Figure 4: Overview of the pledge embedding process.
Finally, the last step of document indexing was to combine the different sentence embeddings into a single pledge embedding. In other words, to obtain a single vector for each pledge, the different sentence embeddings of the pledge needed to be pooled. To do this, an average operator was used, i.e., the pledge embedding is equal to the average of its sentence embeddings.
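Put together, the indexing and pooling steps could look like the sketch below. It is a simplified assumption of the actual pipeline: the chunking uses a rough word count instead of the NLTK tokenisation described above, and the model name is a placeholder for either the generic or the fine-tuned model.

```python
# Hedged sketch: embed each chunk of a pledge with the second-to-last hidden
# layer of the model, then average the chunk embeddings into one pledge vector.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # or the fine-tuned model
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def embed_pledge(pledge: str) -> torch.Tensor:
    # Rough chunking so each chunk stays under the 512-token limit.
    words = pledge.split()
    chunks = [" ".join(words[i:i + 400]) for i in range(0, len(words), 400)]
    chunk_embeddings = []
    for chunk in chunks:
        # The tokenizer adds the special start/end tokens, pads to 512 tokens,
        # and builds the token-ID, segment-ID, and attention-mask vectors.
        inputs = tokenizer(chunk, truncation=True, max_length=512,
                           padding="max_length", return_tensors="pt")
        with torch.no_grad():
            hidden_states = model(**inputs).hidden_states
        second_to_last = hidden_states[-2].squeeze(0)                  # (512, 768)
        mask = inputs["attention_mask"].squeeze(0).unsqueeze(-1)       # ignore padding
        chunk_embeddings.append((second_to_last * mask).sum(0) / mask.sum())
    # Pledge embedding = average of its sentence (chunk) embeddings.
    return torch.stack(chunk_embeddings).mean(dim=0)                   # (768,)
```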
- Clustering
After the document indexing, every pledge in the dataset was represented by a 768-dimensional vector capturing its semantic content. Hence, the distance between pledge vectors gives an indication of their similarity, and different clusters can be identified.
Based on the analysis made during the previous phase, it appeared that six pledge clusters could be identified in the dataset. Knowing this optimal number, the SEMIC team applied a K-means clustering algorithm to group the pledges into 6 clusters based on their embedding vectors. Given the dependence of K-means results on the initial centres, the algorithm was repeated 500 times to find the initial centres that minimised the performance criterion (cluster inertia). The results of the optimal K-means run were then used to evaluate the embedding models.
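A minimal sketch of this clustering step is shown below, assuming `embeddings` is the (n_pledges, 768) array produced by the indexing step above.

```python
# Minimal sketch of the K-means clustering of the pledge embeddings.
from sklearn.cluster import KMeans

# n_init=500 repeats the algorithm with 500 different initial centres and keeps
# the run with the lowest inertia, as described above.
kmeans = KMeans(n_clusters=6, n_init=500, random_state=42)
labels = kmeans.fit_predict(embeddings)
print("Inertia of the best run:", kmeans.inertia_)
```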
To assess the quality of the clusters, an intrinsic method was first applied, i.e., the quality was assessed by examining how well separated and how compact the clusters were. Three different approaches were chosen to evaluate this internal quality (IntrinsicVal.py).
2D t-SNE plots were first used to get an intuition of the impact of the domain-specific training on the intrinsic cluster validity. Next to the visual exploration, two quantitative measures were computed to evaluate the goodness of the clustering structure: the WSS/TSS ratio (visualised by elbow graphs) and the silhouette score. Both metrics quantify the internal cluster coherence.
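The intrinsic validation could be sketched as follows (cf. IntrinsicVal.py), reusing the `embeddings`, `kmeans`, and `labels` objects from the clustering sketch above; the exact implementation in the repository may differ.

```python
# Hedged sketch of the intrinsic validation: t-SNE projection, WSS/TSS ratio,
# and silhouette score for the 6-cluster solution.
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

# 2D t-SNE projection of the pledge embeddings, used for the visual exploration.
projection = TSNE(n_components=2, random_state=42).fit_transform(embeddings)

# WSS (within-cluster sum of squares) is the K-means inertia; TSS is the total
# sum of squares around the global centroid.
wss = kmeans.inertia_
tss = ((embeddings - embeddings.mean(axis=0)) ** 2).sum()
print("WSS/TSS:", wss / tss)
print("Silhouette score:", silhouette_score(embeddings, labels))
```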
Next to the quantitative evaluation of the cluster coherence, an additional assessment was performed to evaluate the coherence of clusters from a content perspective, i.e., whether the clusters would make sense from a human perspective (ExtrinsicVal.py).
To achieve this, labels first needed to be defined to summarise the content of the clusters. To accelerate the labelling process, it was decided to rely on a GPT-based approach: GPT was provided with all the pledges of a cluster and asked to give their common topic.
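The labelling call could look like the following illustrative sketch; the endpoint, deployment name, and prompt wording are assumptions, not the exact ones used in the study.

```python
# Illustrative sketch of the GPT-based cluster labelling via Azure OpenAI.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def label_cluster(pledges: list[str]) -> str:
    """Ask GPT-4 for the common topic shared by all pledges of a cluster."""
    prompt = ("Here are pledges from one cluster:\n\n" + "\n---\n".join(pledges)
              + "\n\nIn a few words, what is their common topic?")
    response = client.chat.completions.create(
        model="gpt-4",  # Azure deployment name (placeholder)
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```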
To ensure the quality of the approach, a first set of tests was performed on the Word2Vec clusters, which had been manually analysed during the previous phase. Appendix I shows an example of a summary generated by GPT when provided with the pledges from the “Digital” cluster. After this validation, it was decided to repeat the process for the clusters of the remaining models.
Having defined a set of labels, the next step consisted in finding which cluster was the most appropriate for each pledge from a content perspective. Once again, GPT was used as a human emulator to accelerate the process. The results were then compared to the cluster assignments made by the different models to obtain an accuracy and F1 score.
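The final comparison could be sketched as below (cf. ExtrinsicVal.py), assuming `gpt_labels` holds the cluster label chosen by GPT for each pledge and `model_labels` the cluster assigned by a given model's K-means run, both expressed with the same label names; the macro averaging of the F1 score is an assumption.

```python
# Minimal sketch of the extrinsic validation: accuracy and F1 score of the
# model's cluster assignments against the GPT-generated ground-truth labels.
from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(gpt_labels, model_labels)
macro_f1 = f1_score(gpt_labels, model_labels, average="macro")
print(f"Accuracy: {accuracy:.3f}  F1 (macro): {macro_f1:.3f}")
```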