Using AI-powered email classification to accelerate help desk responses
Tuesday, May 6, 2025, 11:00, by InfoWorld
Accurate classification and prioritization of emails are critical for improving response time and customer satisfaction. By leveraging machine learning—specifically text classification and sentiment analysis—organizations can automate email triage, helping to ensure that urgent issues receive immediate attention while routine inquiries are processed efficiently. This article explores how enterprises can integrate these technologies to optimize help desk and other customer service operations.

The challenge: Manual email triage is inefficient

Traditional email triage relies on human agents to read, categorize, and prioritize emails. This approach is:

- Slow: A high volume of emails overwhelms human teams.
- Inconsistent: Different agents may classify the same email differently.
- Error-prone: Critical issues may be overlooked due to human oversight.

By automating email categorization and prioritization with AI, organizations can eliminate these inefficiencies while maintaining accuracy.

The solution: AI-powered email classification

Customer emails to help desks generally fall into one of six categories:

- Requirement: Requests for new features or functionalities that do not yet exist.
- Enhancement: Suggestions to improve existing features or functionalities.
- Defect: Reports of system bugs, failures, or unexpected behavior.
- Security issue: Concerns related to security vulnerabilities, security breaches, or data loss or exposure.
- Feedback: General suggestions, both positive and negative, about the product.
- Configuration issue: Difficulties in setting up the system.

Using a text classification model trained on historical data, enterprises can automatically categorize incoming emails, reducing manual effort and improving efficiency.

Sentiment analysis: The priority filter

Beyond categorization, sentiment analysis detects the emotional tone of emails. Classifying the sentiment of emails as positive, neutral, or negative helps prioritize the responses.

Examples of sentiment analysis:

- Positive sentiment: “I love this feature, but can we add X?” Route to the Enhancement Team; tag as Low Priority.
- Neutral sentiment: “I found a bug in the login system.” Route to the Bug Fixing Team; tag as Medium Priority.
- Negative sentiment: “Your app is terrible, login doesn’t work!” Route to the Critical Defect Resolution Team; tag as High Priority.

About the training data set

The data set used to train the model is a dummy data set that I created specifically for this project. It simulates real-world help desk email content and includes labeled examples across the six categories introduced above (Requirement, Enhancement, Defect, Security issue, Feedback, and Configuration issue). Each email is paired with a sentiment label (positive, neutral, or negative) to support both categorization and prioritization based on tone. The data set has been uploaded to a public GitHub repository, and you can access it here.

Step 1: Import required libraries

Our implementation relies on Pandas for data manipulation, NLTK for natural language processing (including sentiment analysis via SentimentIntensityAnalyzer), and Scikit-learn for text classification using the Multinomial Naive Bayes classifier.

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.naive_bayes import MultinomialNB
```

Step 2: Preprocess the training data

We preprocess the training data by removing special characters, eliminating stopwords like “and” and “the,” and applying lemmatization to reduce words to their base forms. These steps enhance data quality and improve the model’s performance.

```python
import re

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Object of WordNetLemmatizer
lm = WordNetLemmatizer()

def text_transformation(df_col):
    corpus = []
    for item in df_col:
        # Keep letters only, lowercase, and split into tokens
        new_item = re.sub('[^a-zA-Z]', ' ', str(item))
        new_item = new_item.lower()
        new_item = new_item.split()
        # Lemmatize each token and drop English stopwords
        new_item = [lm.lemmatize(word) for word in new_item
                    if word not in set(stopwords.words('english'))]
        corpus.append(' '.join(str(x) for x in new_item))
    return corpus

# Read the training data (file name assumed here; it mirrors the
# test data format used in Step 4)
df_train = pd.read_csv('train_Data.txt', delimiter=';', names=['text', 'label'])

corpus = text_transformation(df_train['text'])
```

To convert the textual data to numerical data for machine learning, we use CountVectorizer from Scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 2))
traindata = cv.fit_transform(corpus)
X = traindata
y = df_train.label
```

In the code snippet above, note that CountVectorizer(ngram_range=(1, 2)) converts the preprocessed email text (from corpus) into a matrix of token counts, including both unigrams (single words) and bigrams (pairs of words). X is the feature matrix used to train the model. y is the target variable containing the email categories (Requirement, Enhancement, Defect, etc.).
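To make the n-gram expansion concrete, here is a minimal sketch (the sample email text is invented for illustration, not drawn from the data set) of the features CountVectorizer extracts with ngram_range=(1, 2):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy example (not from the article's data set) showing unigram + bigram features
sample = ["login page crashes on submit"]
demo_cv = CountVectorizer(ngram_range=(1, 2))
demo_cv.fit(sample)

print(demo_cv.get_feature_names_out())
# ['crashes' 'crashes on' 'login' 'login page' 'on' 'on submit'
#  'page' 'page crashes' 'submit']
```

Each of these n-grams becomes one column of the sparse matrix that the classifier trains on. In the article’s pipeline the text would first pass through text_transformation, which would remove stopwords such as “on” before vectorization.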
Step 3: Train the classification model

We use the Multinomial Naïve Bayes model, a probabilistic algorithm well suited to text classification, to fit our training vectors to the values of the target variable.

```python
classifier = MultinomialNB()
classifier.fit(X, y)
```

Why Multinomial Naïve Bayes?

The Multinomial Naïve Bayes model is particularly well-suited for text classification tasks where features are based on word counts or frequencies—exactly the case with our data set. Multinomial Naïve Bayes is a good match for our data for several reasons:

- Categorical data: Our data set consists of labeled email text, where the features (words and phrases) are naturally represented as discrete counts or frequencies.
- High-dimensional sparse features: The output of CountVectorizer or TfidfVectorizer is a large, sparse matrix of word occurrences. Multinomial Naïve Bayes handles this kind of input efficiently and effectively without overfitting.
- Multi-class classification: We are categorizing emails into six distinct classes. Multinomial Naïve Bayes supports multi-class classification out of the box, making it a clean fit for this problem.
- Speed and efficiency: Multinomial Naïve Bayes is computationally lightweight and trains quickly, which is especially helpful when iterating on feature engineering or working with dummy data sets.
- Strong baseline performance: Even with minimal tuning, Multinomial Naïve Bayes tends to perform well on text classification tasks, giving us a strong, reliable baseline to compare other models against.

Several other machine learning models, including logistic regression, support vector machines, random forests and decision trees, and deep learning models such as LSTM and BERT, also perform well on text classification tasks. Multinomial Naïve Bayes is a good starting point due to its simplicity and effectiveness, but it is generally good practice to try multiple algorithms and compare their performance. A minimal comparison sketch follows the metric definitions below.

To compare the performance of different models, we use evaluation metrics such as:

- Accuracy: The percentage of total predictions that were correct. Accuracy is most informative when classes are balanced.
- Precision: Of all the emails the model labeled as a certain category, the percentage that were correct.
- Recall: Of all the emails that truly belong to a category, the percentage the model correctly identified.
- F1-score: The harmonic mean of precision and recall. F1 provides a balanced measure of performance when you care about both false positives and false negatives.
- Support: The number of actual samples for each class. Support is helpful for understanding class distribution.
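As a rough illustration of such a comparison (not part of the original article’s code), here is a minimal sketch that cross-validates Multinomial Naïve Bayes against logistic regression on the same feature matrix, assuming X and y from Step 2:

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

# Candidate models to compare on the same features (X, y from Step 2)
models = {
    'MultinomialNB': MultinomialNB(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    # 5-fold cross-validated macro F1, which weights all six categories equally
    scores = cross_val_score(model, X, y, cv=5, scoring='f1_macro')
    print(f'{name}: mean macro F1 = {scores.mean():.3f} (+/- {scores.std():.3f})')
```

Macro-averaged F1 is a sensible comparison metric here because it treats all six email categories equally, regardless of how many examples each has.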
Step 4: Test the classification model and evaluate performance

The code listing below combines a number of steps—preprocessing the test data, predicting the target values from the test data, and evaluating the model’s performance by plotting the confusion matrix and computing accuracy, precision, and recall. The confusion matrix compares the model’s predictions with the actual labels. The classification report summarizes the evaluation metrics for each class.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             classification_report, ConfusionMatrixDisplay)

# Read the test data
test_df = pd.read_csv('test_Data.txt', delimiter=';', names=['text', 'label'])

# Apply the same transformations as on the training data
X_test, y_test = test_df.text, test_df.label

# Preprocess the text
test_corpus = text_transformation(X_test)

# Convert the text data into vectors using the fitted vectorizer
testdata = cv.transform(test_corpus)

# Predict the target categories
predictions = classifier.predict(testdata)

# Evaluate model performance
plt.rcParams['figure.figsize'] = 10, 5
# Plot the confusion matrix (scikit-learn's ConfusionMatrixDisplay is used
# here in place of the original article's plotting helper)
ConfusionMatrixDisplay.from_predictions(y_test, predictions)
print('Accuracy_score: ', accuracy_score(y_test, predictions))
print('Precision_score: ', precision_score(y_test, predictions, average='micro'))
print('Recall_score: ', recall_score(y_test, predictions, average='micro'))
print(classification_report(y_test, predictions))
plt.show()
```

Output: the confusion matrix plot and per-class classification report (shown as images in the article).

While acceptable thresholds vary depending on the use case, a macro-average F1-score above 0.80 is generally considered good for multi-class text classification. The model’s F1-score of 0.8409 indicates that it is performing reliably across all six email categories.

Rules of thumb:

- If accuracy and F1-score are both above 0.80, the model is typically considered production-ready in many business scenarios.
- If recall is low, the model may be missing important cases—critical for high-priority email triage.
- If precision is low, the model may be flagging incorrect emails—problematic for sensitive categories like security issues.

Step 5: Integrate sentiment analysis

We integrate NLTK’s SentimentIntensityAnalyzer to score emails by sentiment intensity. We set priority to high for negative sentiment, medium for neutral sentiment, and low for positive sentiment.

```python
sia = SentimentIntensityAnalyzer()

def get_sentiment(text):
    # Predict the category (ensure the input is a string); extract the first element
    category = classifier.predict(cv.transform([text]))[0]

    # Sentiment analysis using VADER's compound score
    sentiment_score = sia.polarity_scores(text)['compound']
    if sentiment_score >= 0.05:
        sentiment = 'Positive'
        priority = 'Low'
    elif sentiment_score <= -0.05:
        sentiment = 'Negative'
        priority = 'High'
    else:
        sentiment = 'Neutral'
        priority = 'Medium'

    return {'Category': category, 'Sentiment': sentiment, 'Priority': priority}
```

Step 6: Test the complete model

Example 1:

```python
email_sentiments = get_sentiment('Your app is terrible and not secure, login doesn’t work!')
print(email_sentiments)
```

Output: {'Category': 'SecurityIssues', 'Sentiment': 'Negative', 'Priority': 'High'}

Example 2:

```python
email_sentiments = get_sentiment('Add advanced filtering and export options for reports')
print(email_sentiments)
```

Output: {'Category': 'RequirementEnhancement', 'Sentiment': 'Positive', 'Priority': 'Low'}

Here is the GitHub repository link for the whole code.
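To show how the pieces might fit together in operation, here is a minimal batch-triage sketch. It is my illustration rather than part of the article’s repository: the sample emails and the priority ordering table are invented, and it assumes get_sentiment from Step 5 is defined.

```python
# Hypothetical priority ordering for working the queue (illustrative only)
PRIORITY_ORDER = {'High': 0, 'Medium': 1, 'Low': 2}

incoming = [
    "The export button crashes the app every time",
    "Could you add dark mode? Love the product!",
    "Password reset emails expose other users' addresses",
]

# Classify each email, then handle the queue highest-priority first
triaged = [{'text': email, **get_sentiment(email)} for email in incoming]
triaged.sort(key=lambda ticket: PRIORITY_ORDER[ticket['Priority']])

for ticket in triaged:
    print(ticket['Priority'], '-', ticket['Category'], '-', ticket['text'])
```

In a production system, the loop at the end would hand each ticket to the appropriate team’s queue instead of printing it.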
Combining classification and sentiment analysis

Combining machine learning-based classification and sentiment analysis creates a robust AI-powered email triage system. This approach helps enterprises scale their customer support operations while maintaining efficiency, reducing response times, and ensuring that high-impact issues receive immediate attention. As organizations handle ever-increasing volumes of digital communication, such solutions become essential to delivering superior customer service while optimizing operational costs.
https://www.infoworld.com/article/3824287/using-ai-powered-email-classification-to-accelerate-help-d...