Text Classification with Python on the HPCC

From time to time the Wharton Computing Research and Analytics group is asked for assistance in using machine learning to classify text data. We thought it might be helpful to provide a working example as scaffolding for text classification projects.


To get started, log into your HPCC account. If you don’t have an HPCC account or have never logged in, you can find instructions and details at https://research-it.wharton.upenn.edu/documentation/access/ .

Once logged in, we will want to make sure that any potentially resource-intensive commands we execute are performed on a compute node. Compute nodes are the workhorses of the HPCC. We can move to a compute node using this command:


Now that we are on a compute node, we can clone the public git repository containing the example project into your HPCC home directory:

git clone git@bitbucket.org:wharton/textclass.git

This will create a subdirectory named textclass in your home directory. You can move into the textclass subdirectory with the command

cd textclass

Then use the following command to enable python3

source /opt/rh/rh-python36/enable

Then create a virtual environment called “env”.

python -m venv env

You can enter your new virtual environment with the command

source env/bin/activate

Once inside the environment, any libraries you install will become part of the virtual environment. You can install the dependencies for the text classification example with the command

python -m pip install -r requirements.txt

Inside the directory you cloned is a compressed csv file with a little over 30,000 Amazon reviews labeled as good or bad based on the number of stars the reviewer granted. Bad reviews are assigned the class __label__1 and good reviews are given the class __label__2. To take a look at the first few lines, use the command

zcat amz_reviews_labeled.csv.gz | head

These data are a subset of the dataset published here: https://www.kaggle.com/bittlingmayer/amazonreviews .
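
If you would rather inspect the file from Python than from the shell, the following is a minimal sketch of the same "first few lines" check using only the standard library. The function name peek_csv_gz is our own, and we make no assumption about the column layout of the file; the function simply returns whatever rows it finds.

```python
import csv
import gzip
from itertools import islice


def peek_csv_gz(path, n=5):
    """Return the first n rows of a gzip-compressed CSV file as lists of strings."""
    # "rt" reads the compressed stream as text; newline="" is what the csv
    # module expects so that quoted fields containing line breaks survive.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return list(islice(csv.reader(handle), n))
```

Calling peek_csv_gz("amz_reviews_labeled.csv.gz") from the textclass directory should return roughly what zcat and head display, parsed into fields.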

Data Preparation

The first step in any text classification problem is cleaning and tokenizing the data. There is a python script in the folder named prep.py that will do this. You can use the text editor of your choice (vim, nano, etc.) to view and edit prep.py. All the python scripts are heavily annotated with comments that are meant to be explanatory. In general terms, the prep.py script will

  1. standardize the case of all the reviews to lowercase,
  2. remove all punctuation,
  3. delete common words using the Natural Language ToolKit (nltk: https://www.nltk.org/) stopword list,
  4. remove common syntax-related word portions using the Snowball stemmer (https://www.kite.com/python/docs/nltk.SnowballStemmer), and
  5. save and compress the result in a file for use in the next step.
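
To make steps 1 through 4 concrete, here is a small standard-library sketch of the same pipeline. It is not the code in prep.py: STOPWORDS is a tiny hypothetical stand-in for the much longer nltk stopword list, and naive_stem is a deliberately crude suffix stripper standing in for the Snowball stemmer, so the tokens it produces will differ from what prep.py writes.

```python
import string

# Hypothetical stand-in for the nltk English stopword list (which is much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "of", "to", "was", "i"}

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def naive_stem(word):
    """Crude suffix stripper; a placeholder for a real stemmer such as Snowball."""
    for suffix in ("ing", "ed", "ly", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(review):
    """Lowercase, strip punctuation, drop stopwords, and stem one review."""
    cleaned = review.lower().translate(_STRIP_PUNCT)
    return [naive_stem(w) for w in cleaned.split() if w not in STOPWORDS]
```

For example, tokenize("This product was AMAZING, and it works!") returns ['product', 'amaz', 'work'].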

Not all of these steps may be necessary for every text classification approach. In your project you should feel free to add or remove steps as necessary.

Running this code on the HPCC is a little different from running it on your own machine. We don’t want to run potentially long-running tasks in an interactive session, so the code includes a small shell script named run_python.sh to make allocating a task to a new compute node session a little easier. The script will make sure that python3 is enabled and the virtual environment is activated. Then it will run any python script that you pass to it as its first parameter. We can use the qsub command to launch our task as follows:

qsub -N prep -m e -M <your email> run_python.sh prep.py

The -N parameter gives a name to the job, and the -m and -M parameters tell qsub to email you when the job is done. You will of course want to replace <your email> with your actual email address. More information on submitting jobs to the cluster may be found here: https://research-it.wharton.upenn.edu/documentation/job-management/ .

At any time you can use the qstat command to check on the status of your job. Any output that would normally be written to the terminal (stderr or stdout) will be written to a file named <job name>.o<job id>, where <job name> is the name you assigned to the job (in this case “prep”), and <job id> is a large integer that qsub assigned to your job. A command to view a job’s output, for example, might look like

cat prep.o1696712
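
After a few runs these output files accumulate, and it can be handy to list them by job. The sketch below relies only on the <job name>.o<job id> naming pattern described above; the helper name job_output_files is our own invention.

```python
import re
from pathlib import Path


def job_output_files(job_name, directory="."):
    """Return (job_id, path) pairs for files named <job_name>.o<job id>, sorted by job id."""
    pattern = re.compile(re.escape(job_name) + r"\.o(\d+)")
    found = []
    for path in Path(directory).iterdir():
        # fullmatch rejects near misses such as "prep.out" or "prep.o12.bak".
        match = pattern.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)
```

The last element of job_output_files("prep") is then the output file with the highest job id for the prep job.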

When the job is done there should be a new csv file named “amz_reviews_tokenized.csv.gz” with a column for the tokenized input. You can see what the first few lines look like with the command

zcat amz_reviews_tokenized.csv.gz | head

Feature Extraction

The next step is to transform the input into a format that can be used for machine learning. We are going to change the variable length lists of tokens into fixed length vectors of floating point values. One way to do this is called term frequency – inverse document frequency or TF-IDF (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). The script transform.py uses TF-IDF to vectorize the reviews. On a general level, the steps it performs are:

  1. reading the data in from amz_reviews_tokenized.csv.gz
  2. constructing an instance of scikit-learn’s TfidfVectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
  3. using the vectorizer instance to generate feature vectors for each document in the input
  4. saving the vectorizer to a file named “vectorizer” for use in future inference
  5. creating 80%/20% randomized subsets to represent training data and holdout test data in keeping with the principles of cross validation (https://machinelearningmastery.com/k-fold-cross-validation/)
  6. writing the training feature vectors (X_train), the training class labels (y_train), the holdout test feature vectors (X_test), and the holdout test class labels (y_test) to files for use in the model training step.
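
To show what steps 3 and 5 do numerically, here is a pure-Python sketch of TF-IDF weighting and an 80%/20% random split. It is an illustration rather than the code in transform.py, which uses scikit-learn. The weighting below uses a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is our understanding of TfidfVectorizer's default behavior; the function names and the seed parameter are our own.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Turn token lists into fixed-length, L2-normalized TF-IDF vectors."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocab]
        # Guard against an all-zero row (an empty document).
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        vectors.append([v / norm for v in row])
    return vocab, vectors


def split_80_20(rows, labels, seed=0):
    """Shuffle indices reproducibly and split into 80% train / 20% test."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    cut = int(0.8 * len(order))
    train, test = order[:cut], order[cut:]
    return ([rows[i] for i in train], [rows[i] for i in test],
            [labels[i] for i in train], [labels[i] for i in test])
```

Every document comes out as a vector with one position per vocabulary term, no matter how many tokens it started with, which is what makes the result usable as model input.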

You can run transform.py with the command

qsub -N transform -m e -M <your email> run_python.sh transform.py

TF-IDF is not the only approach to extracting features from tokenized input. The gensim library (https://radimrehurek.com/gensim/) has a method that extends the word2vec concept for calculating word embeddings (https://en.wikipedia.org/wiki/Word2vec) to create a corpus-level model which is capable of combining word embedding vectors into a vector that represents a summary of a document as a whole (https://radimrehurek.com/gensim/models/doc2vec.html).

To use doc2vec instead of TF-IDF, run transform_doc2vec.py instead of transform.py. The structure of the script is very similar, except that it has an extra step between 2 and 3 in which the corpus-level model is trained. Also, the gensim library allows for the workload to be distributed among multiple processor cores running in parallel. The script currently trains the doc2vec model using as many cores as are available, so when you run transform_doc2vec.py, add the parameters “-pe openmp <number of cores>” to qsub to specify the number of cores you would like to use (in this case 4) as follows:

qsub -N transform -m e -M <your email> -pe openmp 4 run_python.sh transform_doc2vec.py
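
If you adapt the script and want it to use exactly the number of cores you requested, one option is to read the allocation from the job's environment. The sketch below assumes that the scheduler exports the slot count in an environment variable named NSLOTS, as Grid Engine typically does for parallel environment jobs; we have not confirmed that transform_doc2vec.py works this way, and allocated_cores is a hypothetical helper.

```python
import os


def allocated_cores(default=1):
    """Return the core count granted by the scheduler, or `default` if unknown."""
    # Assumption: the scheduler sets NSLOTS for jobs submitted with -pe.
    raw = os.environ.get("NSLOTS", "")
    # Fall back when the variable is missing or malformed (e.g. outside a job).
    return int(raw) if raw.isdigit() and int(raw) > 0 else default
```

The returned value could then be passed to a parameter that controls parallelism, such as the workers argument of gensim's Doc2Vec.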

Model Training and Evaluation

Once feature extraction is complete, we can begin training the actual classification model. There are quite a lot of classification algorithms available. Scikit-learn alone has something like 30 (https://scikit-learn.org/stable/modules/classes.html). Each one has its own set of parameters, which allows for a practically unlimited variety of training strategies. For the purposes of this example we will look at two techniques: Logistic Regression and Random Forest. Scikit-learn makes it easy to swap in different models because it has standardized the interfaces of its model types. So using the code we have provided as boilerplate, you should be able to implement almost any kind of classification strategy you choose.

The LogisticRegression classifier is implemented in logreg.py. At a high level, logreg.py performs the following steps:

  1. It reads X_train, y_train, X_test, and y_test in from files.
  2. Using random sampling, it splits X_train and y_train down further into training and validation sets, again in keeping with the principles of cross validation.
  3. If a model already exists, it reads the model from the file logreg_model and creates a backup copy of the model in the file logreg_model_previous. It also backs up any metrics calculated in a previous run to logreg_results_previous. This is so you can revert to the previous version of the model in the event that training actually makes the model worse. If no model exists, it creates an instance of LogisticRegression (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression).
  4. It trains the model on the random sample from the training set.
  5. It calculates the following metrics using both the validation set and the holdout test set:
    1. accuracy (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
    2. f1 (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)
    3. roc_auc (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)
  6. It writes the results of the metric calculations to a file called logreg_results
  7. It saves the model in a file called logreg_model
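
The backup-then-load pattern in step 3 and the save in step 7 can be sketched as follows. This is an illustration under assumptions, not the code in logreg.py: we assume the model is serialized with pickle, and load_or_create, save_model, and the factory argument are hypothetical names.

```python
import pickle
import shutil
from pathlib import Path


def load_or_create(model_path, factory):
    """Load a pickled model, backing it up first; build a new one if none exists."""
    path = Path(model_path)
    if path.exists():
        # Keep the previous version so a bad training run can be reverted.
        shutil.copy2(path, path.with_name(path.name + "_previous"))
        with path.open("rb") as handle:
            return pickle.load(handle)
    # No saved model yet: ask the caller-supplied factory for a fresh one.
    return factory()


def save_model(model, model_path):
    """Pickle the model to model_path, overwriting any earlier version."""
    with Path(model_path).open("wb") as handle:
        pickle.dump(model, handle)
```

With scikit-learn, the factory would be something like the LogisticRegression class itself, so that load_or_create("logreg_model", LogisticRegression) returns either the saved model or a new untrained one.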

The metrics calculated by this script are just three of the many metrics you can use to evaluate model performance. Each has its tradeoffs: some handle class imbalance in the data better than others, some take prediction confidence as well as accuracy into account, etc. The code calculates the predicted classes and the predicted class probabilities for the validation and test holdout sets. With those you can calculate whatever metrics suit your purposes.
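
For readers who want to see what the three metrics measure, here is a pure-Python sketch for the binary case. It takes predicted classes (for accuracy and F1) and predicted scores or probabilities (for ROC AUC). The scripts themselves use scikit-learn's implementations, which offer averaging options not reproduced here, so numbers from this sketch may not match logreg_results exactly. The pairwise ROC AUC below is written for clarity and is only practical for small inputs.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # With no true positives, precision and recall are both zero.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Note that accuracy and F1 depend only on the predicted classes, while ROC AUC rewards a model for ranking positive examples above negative ones even when the classification threshold is poorly chosen.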

You can run logreg.py with the command

qsub -N logreg -m e -M <your email> run_python.sh logreg.py

When it is finished, you can review the metrics you have calculated by accessing the logreg_results file:

cat logreg_results

The result should look something like this:

validation accuracy: 0.8604
validation f1_score: 0.8604004913923146
validation roc_auc: 0.9369905433974424
holdout accuracy: 0.85888
holdout f1_score: 0.85888
holdout roc_auc: 0.9357527722979924

If you would like to continue training your model, simply run logreg.py again. It is designed to continue training with the model saved from the last iteration. If at any point you would like to start from scratch, simply delete logreg_model:

rm logreg_model

The Random Forest classifier (ranfor.py) works just like the Logistic Regression classifier. It simply swaps in scikit-learn’s RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) class in place of the LogisticRegression class. Also, like transform_doc2vec.py, it is designed to save time by using multiple cores. So when you run it, be sure to use the “-pe openmp <number of cores>” parameters for qsub as follows:

qsub -N ranfor -m e -M <your email> -pe openmp 4 run_python.sh ranfor.py

By following the pattern established by logreg.py and ranfor.py, it should be possible to train and evaluate additional classifier models using the training and test feature sets produced by the feature extraction step.


Our objective was to use an example to outline the basic steps and methods involved in text classification on the HPCC. Using this code, it should be possible to implement your own custom text classification strategy on your own datasets. As always, please let us know if this material has been useful to you in any of your research, if you need help making use of it, or if there are any ways we can improve upon it. Happy classifying!