Companies rely on huge, round-the-clock support teams to keep customers engaged. This can be both pricey and inconvenient. Chatbots can significantly increase efficiency and reduce corporate costs. It handles an infinite number of questions with minimal human intervention. Creating comprehensive chatbot training data is time consuming and labor intensive. It also lengthens the period from product design to deployment. Companies with little training data may be unable to create a solid enough model for chatbot interactions. In this project, we will use internal customer support data to build a strong chatbot utilizing the Sequential model. It also shows how to run the chatbot with the developed model.
1. To Process Unstructured Data
2. To Label the data using unsupervised and supervised techniques
3. To build an AI Chatbot for customer assistants using a sequential model
➔ Language: Python
➔ Libraries: pandas, numpy, seaborn, spacy, tensorflow, sklearn, nltk, matplotlib, hyperopt, keras, chatintents
The dataset is an unstructured assortment of project customer service inquiry chat logs. The chat logs consist of timestamped dialogue between a human customer agent and a visitor to the project website. The dialogue consists predominately of queries about project's services, prices, location, and signup information.
1. Preprocess semi-structured data
2. Perform exploratory data analysis
3. Unsupervised labeling
4. Supervised labeling
5. Training data preparation
6. Hyperparameter tuning
7. Train deep learning sequential model
8. Evaluating the model
9. Use model for prediction
10.Run the chatbot
This project is compiled with python for the purpose of pre-trained USE model. The code is intended to run locally in a terminal.
Virtual environment
Create a python 3.9 virtual environment using (please ensure you have miniconda installed):
conda create -n myenv python=3.9
Then activate virtual environemt:
conda activate myenv
To ensure python versions are compatible between myenv and Jupyter it is necessary to create myenv IPython kernel.
Begin by installing ipykernel
:
pip install --user ipykernel
Then link the myvenv kernel to Jupyter:
python -m ipykernel install --user --name=myenv
After running Jupyter select Kernel
from the Jupyter menu bar and select Change kernel...
from the Kernel menu. From the pop up box select the myenv
kernel.
To install dependencies use the below command:
pip install -r requirements.txt
The dataset is an unstructured assortment of ProjectPro customer service enquiry chat logs. The chat logs consist of timestamped dialogue between a human customer agent and a visitor to the ProjectPro website. The dialogue consists predominately of queries about ProjectPro's services, prices, location, and signup information.
Initialise the Cluster class to begin these steps:
python engine.py --cluster
Data label clustering is performed in an unsupervised way. An initial step before any clustering is to preprocess the chat logs one by one:
- Extract only the text transcripts with the relevant chatter.
- Remove urls.
- Normalise contractions and other shorthand.
- Strip everything except letter characters.
- lemmatize words.
- load text and user into dataframe
Following preprocessing, the data is then explored to identify and vidualise features. Beginning with initialising the EDA object:
python engine.py --eda
Various data exploration methods can be called on the data to explore the features:
# check token frequency distribution
python engine.py --eda_token_dist
# plot the frequency distribution
python engine.py --eda_plot_token_dist
# get top N tokens
python engine.py --eda_top_n_tokens N # integer
# get token length histogram
python engine.py --eda_tokens_hist
# get senth length histogram
python engine.py --eda_sent_hist
The dataframe derived from preprocessing is clustered using the chatintents module. The follwing is the order in which the data is clustered:
- The utterances are embedded using Google's Universal Sentence Encoder.
- Hyperparameter tuning is performed using bayesian optimisation, which performs better than random.
- The dimensionality is reduced using UMAP due to the effects of high dimensionality on distance metrics.
- The data is then clustered using HDBSCAN.
- Finally using using spaCy dependency parsing, the data is automatically labelled.
Along the way the outcomes of different stages in the clustering process can be eplored using the following methods:
# check the best hyperparameters derived through bayesian optimization
python engine.py --cluster_best_params
# get cluster visualisation
python engine.py --cluster_plot
# get summary dataframe slice (20) of cluster labels e.g. label count
python engine.py --cluster_labels_summary
# get dataframe slice (20) of labelled text
python engine.py --cluster_labeled_utts
The final dataframe is exported to a CSV file for further human review and ammendment. The data can be then exported to a json file or kept in the csv and processed using parse_data_csv
function.
To train the model, run:
python engine.py --train
The data is prepared by one hot encoding the labels and splitting into train, test, eval sets.
Training entails a pre-step of hyperparameter tuning using keras_tuner
. Hyperparameters such as:
- Number of layers
- Number of perceptrons
- Dropout layer value
- Activation function
- Learning rate
- Number of epochs
Once these are optimized the best hyperparameter configuration is used to train the model.
The mddel has an early stopping mechanism, which uses the validation loss as a stopping condition. Once the validation loss drops, the training continues for a set number of epochs and stops if there is no improvement over the historic best value.
The pre-training process can be explored using a number of methods:
# Check the summary of the hyperparameter tuning
python engine.py --train_search_summary
# Check the results of the hyperparameter tuning
python engine.py --train_results_summary
# check the summary of the model
python engine.py --train_model_summary
# get diagram of model
python engine.py --train_model_diagram
Once the model has finished training, the model with the best weights is saved.
The saved model can then be evaluated using the test data. To evaluate the model, begin by initializing the Eval class:
python engine.py --eval
A number of evaluation methods can be called to explore the model performance:
# check test data loss and accuracy
python engine.py --eval_test_loss_acc
# Plot of validation vs train accuracy over epochs
python engine.py --eval_acc_plot
# Plot of validation vs train loss over epochs
python engine.py --eval_loss_plot
# Compare predicted vs actual intents of test data
python engine.py --eval_comp_preds
# Get f-score of predicted intents
python engine.py --eval_fscore
# Get confusion matrix of predicted vs actual intents
python engine.py --eval_conf_matrix
To run the model in a chat environment use:
python engine.py --chat
The chat will run in a terminal and simulate a deployed chatbot with predicted responses given some user input.