Use the text classifier here, deployed as a live web application using Streamlit 🎈!
This project builds a text classifier to classify firms into categories of business activities (target) based on the Singapore Standard Industrial Classification (SSIC) framework, using free-text descriptions of their business activities (feature). This is done by fine-tuning and ensembling multiple pre-trained BERT models via a model soup.
This classifier is intended to be deployed in front-facing services of Singapore's Accounting and Corporate Regulatory Authority (ACRA) to help new firm registrants select the most appropriate SSIC code. Further context on this use case is as follows:
In Singapore, all firms are required to register with ACRA at the point of firm creation. During this registration, firms self-declare their Singapore Standard Industrial Classification (SSIC) code, the national standard for classifying economic activities, based on the business activities they plan to undertake.
However, two scenarios are common:
(i) Firms may not select the most appropriate SSIC code at the point of registration.
(ii) Firms may subsequently change their business activities and may not inform ACRA about this change.
As a result, many firms' SSIC codes do not accurately reflect their business activities.
This is a problem because the government relies on accurate firm SSIC codes for various monitoring and development purposes.
Previously, officers manually read through each firm's business activity descriptions, which are periodically collected through surveys, to determine whether the firm's SSIC code still reflects its current business activities. If not, officers manually re-assign a new SSIC code to the firm.
However, this requires:
(i) A significant amount of man-hours to read thousands of text descriptions.
(ii) Officers to have a good understanding of all SSIC codes in order to re-assign the correct codes to firms. This is difficult, as there are thousands of different SSIC codes in existence.
To resolve the above problems, this project builds a text classifier to automatically classify firms into correct SSIC codes (target) based on free-text descriptions of their business activities (feature).
This project uses publicly available firm business activity descriptions for each SSIC code from the Department of Statistics.
We clean these raw text descriptions before training:
- Remove emails, websites, and digits that are irrelevant to the activity descriptions
- Standardize common abbreviations
- Remove boilerplate introductions, such as "The principal activity of the company is to…" (a minimal cleaning sketch follows this list)
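For illustration, the cleaning steps above could be implemented roughly as follows. The abbreviation map, boilerplate pattern, and function name are illustrative assumptions, not the project's exact rules:

```python
import re

# Illustrative abbreviation map; the project's actual list may differ
ABBREVIATIONS = {"mfg": "manufacturing", "svcs": "services", "pte ltd": "private limited"}

# Illustrative boilerplate opening to strip
BOILERPLATE = re.compile(
    r"^the principal activit(y|ies) of the company (is|are) to\s*", re.IGNORECASE
)

def clean_description(text: str) -> str:
    text = text.lower().strip()
    text = BOILERPLATE.sub("", text)                       # remove boilerplate introductions
    text = re.sub(r"\S+@\S+", " ", text)                   # remove emails
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)   # remove websites
    text = re.sub(r"\d+", " ", text)                       # remove digits
    for abbr, full in ABBREVIATIONS.items():               # standardize abbreviations
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_description("The principal activity of the company is to provide IT svcs. Email info@example.com"))
```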
Due to limited training data, we also artificially generate additional training texts that are variations of the texts in the original training data, a method termed back-translation. We use deep-translator to translate a text from the original language (English) to a target language and then translate it back to English. The resulting text differs somewhat from the original due to translation "noise", but it conveys the same general meaning.
This introduces variation in sentence structure, word choice, and phrasing, thereby diversifying the training data. The model can then learn from a broader range of sentence constructions, improving its ability to generalize to different inputs for better performance.
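A minimal back-translation sketch with deep-translator might look like the following; the helper name and the choice of French as the pivot language are illustrative assumptions:

```python
from deep_translator import GoogleTranslator

def back_translate(text: str, pivot_lang: str = "fr") -> str:
    """Translate English text to a pivot language and back to English.

    The round trip introduces small changes in wording and structure,
    yielding a paraphrased variant for data augmentation.
    """
    # English -> pivot language
    translated = GoogleTranslator(source="en", target=pivot_lang).translate(text)
    # Pivot language -> English
    return GoogleTranslator(source=pivot_lang, target="en").translate(translated)

# Example: generate an augmented variant of a training description
original = "Wholesale of computer hardware and peripheral equipment"
augmented = back_translate(original)
```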
We use the bert-base-uncased model on HuggingFace for transfer learning. BERT was chosen as it demonstrates state-of-the-art results when fine-tuned on a specific task, even with a relatively small amount of training data. This is particularly beneficial in the context of this project, where only a limited amount of labeled data is available for training.

We use optuna to fine-tune multiple BERT models on our training data with different hyperparameter configurations.

Model soup is a recent research breakthrough from 2022 for combining multiple deep learning models into a single model by averaging the weights of each model, as outlined in the research paper. We ensemble the multiple fine-tuned BERT models via this methodology to obtain an ensembled model that outperforms any single model in its "soup". This is our final model.
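As an illustration, a hyperparameter search over BERT fine-tuning runs with optuna and the HuggingFace Trainer could be sketched as below. The dataset objects, label count, search ranges, and output paths are assumptions, not the project's exact configuration:

```python
import optuna
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-uncased"
NUM_LABELS = 200  # assumption: number of SSIC categories in the training data

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# `train_dataset` and `eval_dataset` are assumed to be tokenized HuggingFace
# datasets with "input_ids", "attention_mask" and "labels" columns.

def objective(trial: optuna.Trial) -> float:
    # A fresh pre-trained BERT is fine-tuned in each trial
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=NUM_LABELS
    )
    args = TrainingArguments(
        output_dir=f"./runs/trial_{trial.number}",
        learning_rate=trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        num_train_epochs=trial.suggest_int("num_train_epochs", 2, 5),
        per_device_train_batch_size=trial.suggest_categorical(
            "per_device_train_batch_size", [16, 32]
        ),
        evaluation_strategy="epoch",
        save_strategy="no",
    )
    trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                      train_dataset=train_dataset, eval_dataset=eval_dataset)
    trainer.train()
    trainer.save_model(f"./runs/trial_{trial.number}")  # keep weights for the soup
    return trainer.evaluate()["eval_loss"]

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10)
```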
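The model soup itself is then a weight average of the fine-tuned checkpoints. A uniform-soup sketch (the checkpoint paths and count are illustrative, and the greedy-soup variant from the paper is omitted) might look like this:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Paths to the fine-tuned BERT checkpoints produced by the search above (illustrative)
checkpoint_dirs = [f"./runs/trial_{i}" for i in range(10)]

# Accumulate a running sum of each checkpoint's parameters
soup_state = None
for path in checkpoint_dirs:
    state = AutoModelForSequenceClassification.from_pretrained(path).state_dict()
    if soup_state is None:
        soup_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in soup_state:
            soup_state[k] += state[k].float()

# Average the weights across all checkpoints (a "uniform soup")
for k in soup_state:
    soup_state[k] /= len(checkpoint_dirs)

# Load the averaged weights into a single model: this is the ensembled model
soup_model = AutoModelForSequenceClassification.from_pretrained(checkpoint_dirs[0])
soup_model.load_state_dict(
    {k: v.to(soup_model.state_dict()[k].dtype) for k, v in soup_state.items()}
)
```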