This repository contains a text classification model implemented in Python using the scikit-learn
library. The model classifies text into "good" or "bad" categories based on training data provided in CSV files.
Tested on Python 3.12.1.
The model supports several options, specified in JSON format.
Example:
{
"modelname": "textclassification",
"hashname": "csvhashes",
"checkCSVfiles": "true",
"inputMode": "true",
"enablePrints": "true",
"stringJSONreply": "false"
}
modelname
: Name of the trained model file. If you keep the default, the model file will be textclassification.joblib.
hashname
: Name of the file storing the hash of the concatenated CSV data.
checkCSVfiles
: Set to "true" to check that the CSV files have the required number of rows (min. 2).
inputMode
: Set to "true" for interactive user input; otherwise predefined text is used.
enablePrints
: Set to "true" to enable print statements.
stringJSONreply
: Set to "true" to get the prediction output as a JSON-formatted string.
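Because every flag in the options is the string "true" or "false" rather than a JSON boolean, it can help to normalize them once after parsing. The following is a minimal sketch of that idea, not the repository's actual code: the `parse_options` helper and the `DEFAULT_OPTIONS` and `BOOLEAN_KEYS` names are hypothetical, and the defaults are assumed to match the example above.

```python
import json

# Assumed defaults, copied from the example options above.
DEFAULT_OPTIONS = {
    "modelname": "textclassification",
    "hashname": "csvhashes",
    "checkCSVfiles": "true",
    "inputMode": "true",
    "enablePrints": "true",
    "stringJSONreply": "false",
}

# Options that carry "true"/"false" strings.
BOOLEAN_KEYS = ("checkCSVfiles", "inputMode", "enablePrints", "stringJSONreply")


def parse_options(raw_json):
    """Merge user options over the defaults and convert flag strings to bool."""
    options = {**DEFAULT_OPTIONS, **json.loads(raw_json)}
    for key in BOOLEAN_KEYS:
        # Accept "true", "True", or a real JSON true; anything else is False.
        options[key] = str(options[key]).strip().lower() == "true"
    return options
```

Calling `parse_options('{"inputMode": "false"}')` would keep every other option at its assumed default while turning interactive input off.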
- Training the Model:
  - The code will train the model using data from good_texts.csv and bad_texts.csv.
  - Data hash is calculated to determine if retraining is needed.
- User Input Mode:
  - If inputMode is set to true, the user can input text for classification interactively.
- Output:
  - The model outputs predictions and probabilities for "good" and "bad" classes.
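The hash-based retraining check described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the repository's implementation: SHA-256 is an assumed choice of algorithm, the files are hashed as raw bytes instead of going through pandas, and the `compute_data_hash` and `needs_retraining` helpers are hypothetical names.

```python
import hashlib
from pathlib import Path


def compute_data_hash(csv_paths):
    """Return a SHA-256 hex digest over the concatenated bytes of the files."""
    digest = hashlib.sha256()
    for path in csv_paths:
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def needs_retraining(csv_paths, hash_file):
    """Return True if the data changed since the last stored hash.

    The stored hash is updated whenever a change (or a missing hash
    file) is detected, so a second call with unchanged data returns False.
    """
    current = compute_data_hash(csv_paths)
    hash_path = Path(hash_file)
    previous = hash_path.read_text().strip() if hash_path.exists() else None
    if previous == current:
        return False
    hash_path.write_text(current)
    return True
```

With this approach, editing either good_texts.csv or bad_texts.csv changes the digest, so the next run retrains; an untouched dataset lets the script load the saved model instead.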
- pandas
- scikit-learn
- joblib
- hashlib (part of the Python standard library, no installation needed)
To install:
pip install pandas scikit-learn joblib
- Clone this repository.
- Ensure Python and the required libraries are installed (currently tested on Python 3.12.1).
- Customize options and CSV files in the code or use default settings.
- Run the script.
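For orientation, a good/bad text classifier built with scikit-learn could look like the sketch below. The choice of `TfidfVectorizer` with `MultinomialNB` is an assumption made for illustration, since this README does not state which vectorizer or estimator the script uses, and the `train_classifier` and `classify` helpers are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_classifier(good_texts, bad_texts):
    """Fit a TF-IDF + naive Bayes pipeline on labeled example texts."""
    texts = list(good_texts) + list(bad_texts)
    labels = ["good"] * len(good_texts) + ["bad"] * len(bad_texts)
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    return model


def classify(model, text):
    """Return the predicted label and a {class: probability} mapping."""
    probabilities = model.predict_proba([text])[0]
    by_class = {
        str(name): float(p) for name, p in zip(model.classes_, probabilities)
    }
    label = max(by_class, key=by_class.get)
    return label, by_class
```

A fitted pipeline like this can be saved with `joblib.dump(model, "textclassification.joblib")` and restored with `joblib.load`, which matches the default model file name mentioned above.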
Feel free to explore and modify the code for your specific use case. If you encounter any issues or have suggestions, please open an issue.