Research Article | Open Access

Learning Category Distribution for Text Classification

Published: 12 April 2023 in ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4


Abstract

Label smoothing has a wide range of applications in the machine learning field. Nonetheless, label smoothing only softens the targets by adding a uniform distribution into a one-hot vector, which cannot truthfully reflect the underlying relations among categories. However, learning category relations is of vital importance in many fields such as emotion taxonomy and open set recognition. In this work, we propose a method to obtain the label distribution for each category (category distribution) to reveal category relations. Furthermore, based on the learned category distribution, we calculate new soft targets to improve the performance of model classification. Compared with existing methods, our algorithm can improve neural network models without any side information or additional neural network module by considering category relations. Extensive experiments have been conducted on four original datasets and 10 constructed noisy datasets with three basic neural network models to validate our algorithm. The results demonstrate the effectiveness of our algorithm on the classification task. In addition, three experiments (arrangement, clustering, and similarity) are also conducted to validate the intrinsic quality of the learned category distribution. The results indicate that the learned category distribution can well express underlying relations among categories.


1 INTRODUCTION

The text classification task has been widely studied in natural language processing (NLP). Text classification has a wide range of applications in daily life, such as sentiment analysis [39], spam identification [1], and opinion mining [20]. A variety of supervised machine learning algorithms have been introduced for text classification, such as support vector machines [15], k-nearest neighbors [40], and maximum entropy [24]. With the development of deep learning, many datasets as well as models have been proposed to achieve better performance on the text classification task. Graves and Schmidhuber [11] presented bidirectional long short-term memory (BiLSTM) for sequence processing tasks. Kim [17] introduced convolutional neural networks for text classification. More recently, pre-trained language models such as ELMo [26], BERT [6], and XLNet [41] have shown their contribution to the NLP community.

The algorithms mentioned above focus on the specific structure of the model, and each category is regarded as an independent dimension. In neural network models, each category is represented with a one-hot vector, which is further employed as the target output of the model to minimize the cross-entropy loss. The use of one-hot vectors leads to two main problems. On the one hand, one-hot representation does not accord with the fact that different categories are not orthogonal to each other. On the other hand, a model trained with one-hot vectors tends to be overconfident [13]. Each instance may be related to multiple labels, especially when the categories are similar to each other; however, each instance is annotated with a single category and represented with a one-hot vector during training. Therefore, a model trained on one-hot category representations tends toward overconfidence.

Szegedy et al. [32] proposed a technique named label smoothing, shown in Figure 1(b), in which the one-hot labels are replaced with a weighted mixture of a one-hot vector and a uniform vector to calculate the cross-entropy loss. By focusing on the cross-entropy loss function rather than a specific model architecture, label smoothing provides another way to improve the accuracy of modern neural network models in many downstream tasks such as speech recognition [5], machine translation [22], and confidence calibration [23]. Nevertheless, under label smoothing the cosine similarity between any two different categories is a constant, so it is still unable to express realistic category relations.

Fig. 1.

Fig. 1. Illustration of different category targets. The ground-truth emotion category is joy. (a) In hard label, the confidence of joy is set to be 1. (b) In label smoothing, the confidence of joy is set to be slightly less than 1 (such as 0.9), and the other categories share the rest of the confidence evenly. (c) In our algorithm, the confidence of joy is set to be slightly less than 1, and the other categories share the rest of the confidence depending on their similarities with joy.
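To make the three target schemes in Figure 1 concrete, the following minimal sketch constructs them for a six-emotion example. The emotion inventory, the smoothing weight, and the similarity values are illustrative assumptions, not values from the paper (the paper's actual soft targets are derived in Section 3.2):

    import numpy as np

    emotions = ["joy", "love", "surprise", "anger", "sadness", "fear"]
    C, K, alpha = len(emotions), 0, 0.1     # K indexes the ground truth "joy"

    hard = np.eye(C)[K]                     # (a) hard label: all mass on "joy"
    ls = (1 - alpha) * hard + alpha / C     # (b) label smoothing: uniform residual

    # (c) schematic of our targets: the residual confidence is shared among
    # the other categories in proportion to their (made-up) similarity to "joy"
    sim = np.array([1.00, 0.60, 0.30, 0.05, 0.03, 0.02])
    off = np.where(np.arange(C) != K, sim, 0.0)
    cd = (1 - alpha) * hard + alpha * off / off.sum()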

In real-world data, the relations between different categories are not easy to annotate, and existing classification datasets ignore them. However, revealing relations among categories has wide applications in many fields. In psychological research, the quantitative analysis of emotion category relations is very helpful for the study of emotion taxonomy [19, 28]. In open set recognition, projecting existing categories into a vector space benefits the detection of unknown categories [9, 29].

In this work, we derive the label distribution for each category (category distribution) from the soft labels output by trained neural network models. Based on the learned category distribution, soft targets of each category are calculated to improve the performance of the model on the classification task. Experimental results demonstrate that our method is especially useful for the datasets with ambiguous labels or heavy noise. In addition, we detect the intrinsic quality of the learned category distribution in expressing underlying relations among categories from two perspectives (arrangement and clustering).

The main contributions of this work can be summarized as follows:

  • We propose a novel algorithm to improve the label smoothing technique. Our algorithm does not introduce any additional neural network module. Experiments demonstrate that our algorithm outperforms baseline methods on the classification task.

  • We construct 10 noisy datasets to validate the quality of our algorithm on noisy data. Experimental results indicate that our algorithm is especially useful for noisy data.

  • We derive category distribution from the soft labels output by neural network models. As a by-product of our algorithm, category distribution is shown to reveal underlying category relations effectively.


2 RELATED WORK

2.1 Text Classification

Text classification is a fundamental task that has been widely studied in a number of diverse domains, such as data mining, sentiment analysis, information retrieval, and medical diagnosis. Traditional text classification algorithms follow a two-step procedure. First, some artificial features are designed and extracted from the initial document [8, 38, 47]. Then, the features are fed into the algorithm to make the final classification decision [16, 25, 48]. With the breakthroughs in deep learning in recent years, many deep learning models have shown their success in text classification. Zhang et al. [46] introduced an empirical exploration on the use of character-level convolutional neural networks for text classification. Yao et al. [42] proposed a graph neural network model to enhance the text classification task by modeling a whole corpus as a heterogeneous graph. Wang et al. [36] presented a framework to combine explicit and implicit representations of short text for classification. Due to the ability to extract latent features directly from the initial documents, deep learning models have become much more popular in recent years.

2.2 Label Embedding

Label embedding is proposed in the domain of zero-shot image classification [3, 4]. Each category is represented with predefined attributes and the side information is also required to score the value for each category. In the NLP community, label embedding is used to convert the categories into semantic vector space [21, 35, 45]. In other words, each category is regarded as a special word and the embedding of the labels is also inputted into the model to enhance the text classification task. Different from previous studies that embedded categories into semantic space, Wang and Zong [37] proposed a framework to represent the emotion categories in emotion space, and the emotion relations are further detected with the distributed representations of emotion categories. In this work, we derive the category distribution and soft targets directly from the trained neural network model. Furthermore, the derived soft targets are employed to enhance the accuracy performance in the text classification task.

2.3 Soft Label

Soft labels have a higher entropy than one-hot hard labels and have been applied in a variety of applications. Hinton et al. [14] introduced knowledge distillation, in which the soft targets output by a trained large model serve as the ground-truth labels for training a relatively smaller model. Phuong and Lampert [27] discussed the mechanisms of knowledge distillation by studying the special case of deep linear classifiers. To prevent neural network models from being overconfident, Szegedy et al. [32] presented the label smoothing mechanism, which smooths the initial one-hot label with a uniform distribution. Vyas et al. [33] proposed a meta-learning framework in which the instance labels are treated as learnable parameters and updated with the model during training. Zhang et al. [44] introduced an online label smoothing algorithm for image classification, in which the soft label of each instance is added to the one-hot vector at every training step. Based on label smoothing, Guo et al. [12] proposed the label confusion model (LCM) to enhance text classification models. On the one hand, LCM requires an additional neural network module to calculate the soft label for the input instance; on the other hand, LCM is an instance-level model that generates soft labels only for instances, not categories. In this work, we derive a label distribution for each category rather than each instance. The derived category distribution can well express category relations, and, importantly, our method does not require any additional neural network module.

2.4 Label Distribution Learning

Geng [10] proposed label distribution learning (LDL), a machine learning paradigm that parallels the classification task. In LDL, the true label distribution of each instance in the dataset must be pre-annotated. However, annotating the true label distribution for each instance is very hard and expensive, so the majority of existing datasets, which are annotated with discrete categories, cannot be used for LDL. Therefore, our work is fundamentally different from LDL: the label distribution of each category is learned from the trained model, and our algorithm does not require any human annotation of soft labels.


3 OUR METHOD

In this section, we first discuss the potential loss bias caused by hard labels or label smoothing. To address this problem, we derive the category distribution that can express category relations. Based on the learned category distribution, soft targets are calculated to improve model classification performance. The detailed steps of our algorithm are presented last.

3.1 Loss of Neural Network Models

In neural network models, cross-entropy is chosen as the loss function for training. Given a dataset \(\mathcal {D} = \lbrace (x^{(i)},\boldsymbol {y}^{(i)}) \rbrace _{i=1}^N\) annotated with C categories, for an instance annotated as category K, the loss function is (1) \(\begin{equation} \mathcal {L}(x^{(i)},\boldsymbol {y}^{(i)}|\boldsymbol {\theta }) = -\sum _{j=1}^{C}\boldsymbol {y}^{(i)}_j\log \boldsymbol {y}^{\prime (i)}_j, \end{equation}\) where \(\boldsymbol {y}^{\prime }\) is the soft label predicted by the model, and \(\boldsymbol {\theta }\) denotes the model parameters to be trained.

In one-hot representation, \(\boldsymbol {y}^{(i)}\) is expressed as (2) \(\begin{equation} \boldsymbol {y}^{(i)}_j = \left\lbrace \begin{aligned}&1,\quad &&if \ j=K, \\ &0,\quad &&else. \end{aligned} \right. \end{equation}\) Applying Equation (2) to Equation (1), we have the cross-entropy loss of one-hot labels: (3) \(\begin{equation} \mathcal {L}^{hard}(x^{(i)},\boldsymbol {y}^{(i)}|\boldsymbol {\theta }) = -\log \boldsymbol {y}^{\prime (i)}_K. \end{equation}\)

In label smoothing, \(\boldsymbol {y}^{(i)}\) can be expressed as (4) \(\begin{equation} \boldsymbol {y}^{(i)}_j = \left\lbrace \begin{aligned}&(1-\alpha)+\alpha /C,\quad &&if \ j=K, \\ &\alpha /C,\quad &&else. \end{aligned}\right. \end{equation}\) Applying Equation (4) to Equation (1), we obtain the cross-entropy loss of label smoothing: (5) \(\begin{equation} \mathcal {L}^{LS}(x^{(i)},\boldsymbol {y}^{(i)}|\boldsymbol {\theta }) = -(1-\alpha)\log \boldsymbol {y}^{\prime (i)}_K-\frac{\alpha }{C}\sum ^C_{j=1}\log \boldsymbol {y}^{\prime (i)}_j. \end{equation}\) Note that the uniform component contributes \(\alpha /C\) to every class, which is why the sum over all C classes carries the factor \(\alpha /C\).
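To make the difference between Equations (3) and (5) concrete, here is a minimal numeric check; the prediction vector and the value of \(\alpha\) are our own illustrative choices:

    import numpy as np

    def xent(target, pred):
        # cross-entropy between a target distribution and predicted
        # probabilities, as in Equation (1)
        return -np.sum(target * np.log(pred))

    pred = np.array([0.97, 0.01, 0.01, 0.01])   # an overconfident prediction
    hard = np.array([1.0, 0.0, 0.0, 0.0])
    alpha, C = 0.1, 4
    ls = (1 - alpha) * hard + alpha / C

    print(xent(hard, pred))  # Eq. (3): ~0.03, keeps pushing p_K toward 1
    print(xent(ls, pred))    # Eq. (5): ~0.37, penalizes vanishing off-target mass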

Although label smoothing improves on the one-hot label by introducing a uniform distribution, it still cannot express realistic category relations: the similarities between different pairs of categories are not the same. Therefore, both the hard label and label smoothing are unable to describe realistic category relations, which causes the calculated cross-entropy loss to deviate from the actual loss during training. Accurate category relations are essential to obtain a more realistic cross-entropy loss.

3.2 Derivation of Category Distribution

Inspired by knowledge distillation [14] where the soft labels output by the trained model tend to have more useful information than the hard label, we derive category distribution from the soft labels.

In an annotated dataset, each instance is actually a sample of the corresponding annotated category. In this article, we regard the soft label output by the trained model as the estimation of label distribution for the corresponding instance. Therefore, our goal is to minimize the loss between the category distribution and the instance distribution. Considering all instances are annotated as category K, we have (6) \(\begin{equation} \min \sum _{i=1}^{N_K}Dist(\boldsymbol {y},\boldsymbol {y^{(i)}}), \end{equation}\) where Dist is the distance function, \(\boldsymbol {y}\) is the distribution of category K to be solved, \(\boldsymbol {y^{(i)}}\) is the label distribution of the ith instance annotated as category K, and \(N_K\) is the number of instances annotated as category K in the dataset.

There are many functions to measure the distance between two distributions. Since \(\boldsymbol {y}\) is the distribution to be derived and \(\boldsymbol {y}^{(i)}\) is the fitted distribution predicted for the ith instance, we choose the KL divergence from each instance distribution to the category distribution: (7) \(\begin{equation} \min _{\boldsymbol {y}}\sum _{i=1}^{N_K}\sum _{j=1}^{C}\boldsymbol {y}_j^{(i)}\log \frac{\boldsymbol {y}_j^{(i)}}{\boldsymbol {y}_j}\ , \end{equation}\) where \(\boldsymbol {y}\) is the distribution of category K to be solved, and \(\boldsymbol {y}_j\) is the jth component of the vector \(\boldsymbol {y}\). We take the divergence in this direction so that the minimizer has the closed form given in Equation (8).
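Equation (8) below follows from Equation (7) by a short Lagrange multiplier argument under the simplex constraint \(\sum _{j=1}^{C}\boldsymbol {y}_j=1\): \(\begin{aligned} J(\boldsymbol {y},\lambda) &= \sum _{i=1}^{N_K}\sum _{j=1}^{C}\boldsymbol {y}_j^{(i)}\log \frac{\boldsymbol {y}_j^{(i)}}{\boldsymbol {y}_j}+\lambda \Big (\sum _{j=1}^{C}\boldsymbol {y}_j-1\Big), \\ \frac{\partial J}{\partial \boldsymbol {y}_j} &= -\frac{1}{\boldsymbol {y}_j}\sum _{i=1}^{N_K}\boldsymbol {y}_j^{(i)}+\lambda = 0 \ \Rightarrow \ \boldsymbol {y}_j = \frac{1}{\lambda }\sum _{i=1}^{N_K}\boldsymbol {y}_j^{(i)}. \end{aligned}\) Summing over j and using \(\sum _{j=1}^{C}\boldsymbol {y}_j^{(i)}=1\) gives \(\lambda =N_K\).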

By solving Equation (7), we have the formula to calculate the Kth category distribution: (8) \(\begin{equation} \boldsymbol {y}_j = \frac{1}{N_K}\sum ^{N_K}_{i=1}\boldsymbol {y}_j^{(i)}\ . \end{equation}\)

By concatenating the distribution of all categories, we obtain the final category distribution matrix: (9) \(\begin{equation} \boldsymbol {Y} = [\boldsymbol {Y}_1;\boldsymbol {Y}_2;\ldots ;\boldsymbol {Y}_C], \end{equation}\) where \(\boldsymbol {Y}_i\) is the distribution for the ith category. \(\boldsymbol {Y}\) is a square matrix. The ith row of \(\boldsymbol {Y}\) represents the distribution for the ith category. The jth column of \(\boldsymbol {Y}\) represents the jth component of each category distribution.

To improve the performance of the models on the classification task, the similarity matrix of our category distribution is calculated as soft targets to train the model. The new soft label matrix is calculated as (10) \(\begin{equation} \boldsymbol {S}_{ij} = \frac{e^{\boldsymbol {s}_{ij}/T}}{\sum _{m=1}^{C}e^{\boldsymbol {s}_{im}/T}}, \end{equation}\) where \(\boldsymbol {S}\) is the matrix of new soft targets, \(\boldsymbol {S}_{ij}\) is the element in the ith row and jth column of \(\boldsymbol {S}\), and \(\boldsymbol {s}_{ij}\) is the cosine similarity between \(\boldsymbol {Y}_i\) and \(\boldsymbol {Y}_j\). T is a temperature parameter that controls the confidence of the targets: a higher value of T produces a softer probability distribution over categories.
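The following sketch implements Equations (8) through (10) with NumPy. The function name and the default temperature are our own choices; probs are the soft labels predicted by the trained model, and every category is assumed to have at least one annotated instance:

    import numpy as np

    def category_soft_targets(probs, labels, C, T=2.0):
        # Eq. (8)-(9): the distribution of category K is the mean soft
        # label over all instances annotated as K; stacking the rows gives Y
        Y = np.stack([probs[labels == k].mean(axis=0) for k in range(C)])

        # cosine similarity between every pair of category distributions
        Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        s = Yn @ Yn.T

        # Eq. (10): row-wise softmax with temperature T
        e = np.exp(s / T)
        S = e / e.sum(axis=1, keepdims=True)
        return Y, S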

3.3 Algorithm

Our algorithm benefits the community in two aspects. On the one hand, the category relations can be revealed with our category distribution, although these categories are one-hot represented in the dataset. On the other hand, based on our category distribution, soft targets are calculated for further improving the model performance on the classification task.

It should be noted that our algorithm does not require any additional neural network module: the soft labels predicted by the model are, in turn, employed to improve the model performance on the classification task. The detailed steps are given in Algorithm 1.
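As a rough sketch of the procedure described above (Algorithm 1 itself is not reproduced here), the pipeline can be read as two training stages, reusing the category_soft_targets sketch from Section 3.2; train, predict_proba, and one_hot are hypothetical helpers standing in for ordinary training code, and the temperature is an example value:

    # stage 1: train with the original one-hot targets (hypothetical helpers)
    model = train(model, texts, targets=one_hot(labels, C))
    probs = predict_proba(model, texts)              # soft labels (Section 3.2)

    # derive category distribution Y and soft targets S, Eqs. (8)-(10)
    Y, S = category_soft_targets(probs, labels, C, T=2.0)

    # stage 2: fine-tune with the new soft target of each annotated category
    model = train(model, texts, targets=S[labels])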


4 EXPERIMENTS

In this section, we first validate the ability of our algorithm to improve the model performance on text classification. Then, experiments are conducted to detect the intrinsic quality of the learned category distribution in expressing category relations.

4.1 Experimental Setup

4.1.1 Datasets.

Five datasets that vary in domain, topic, and language are chosen to validate the effectiveness of our algorithm.

20NG (bydata version): This is an English news dataset that consists of 20 news topics. There are 11,314 samples for training and 7,532 samples for testing.

THUCNews: This is a Chinese news dataset proposed by Sun et al. [31]. There are 50,000, 10,000, and 5,000 samples for training, validation, and testing, respectively. There are 10 categories (sports, finance, real estate, home, education, technology, fashion, politics, game, and amusement) contained in the dataset.

NHKNews: This is a Japanese news dataset. There are 21,795 instances annotated with up to 10 topics (culture, sports, drama, information, anime, welfare, variety, report, education, and music) in this dataset.

KRNews: This is a Korean news dataset. There are 45,654 samples annotated with seven categories (science, economy, society, culture, world, sports, and politics) in this dataset.

FaizalNews: This is an Indonesian news classification dataset. There are 9,000 and 1,000 samples for training and testing, respectively. There are five categories (ball, health, finance, automotive, and property) contained in this dataset.

4.1.2 Models.

In this article, three typical neural network models are chosen to conduct experiments.

TextCNN: Kim [17] introduced the convolutional neural network structure for text classification. Unlike CNNs in image classification, the width of the convolution kernel is equal to the dimension of the word vectors. We adopt 300-dimensional randomly initialized word vectors in our experiments.

BiLSTM: The bidirectional long short-term memory model was proposed by Graves and Schmidhuber [11] as an improved version of the bidirectional recurrent neural network [30]. We adopt 300-dimensional randomly initialized word vectors in our experiments.

BERT: Bidirectional Encoder Representations from Transformers was introduced by Devlin et al. [6]. We fine-tune the BERT-base model on each dataset.

4.1.3 Settings.

For TextCNN, the width of the convolutional kernel is 100, which is equal to the dimension of the employed word vectors. The kernel heights are divided into three groups (2, 3, 4), with 64 channels in each group. We set the batch size and learning rate to 128 and 0.001, respectively. For BiLSTM, the batch size and learning rate are also 128 and 0.001, and the hidden layer size is set to 32 in each direction. For the BERT model, a fully connected layer is added on top of the pre-trained BERT-base model; the batch size and learning rate are set to 128 and 2e-5 for fine-tuning. For label smoothing, we set \(\alpha\) to 0.9. The Adam optimizer is employed to train all three neural models [18]. Our models are trained on an Intel(R) Xeon(R) E5-2620 CPU and a GeForce RTX 3090 GPU. For 20NG, THUCNews, and FaizalNews, we use the original data splits to train the models. For NHKNews and KRNews, which have no original split, we randomly split them into train, validation, and test sets with a ratio of 0.6:0.2:0.2.
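The following lines are a hypothetical sketch of the 0.6:0.2:0.2 random split used for NHKNews and KRNews; texts and labels are assumed arrays, and the seed is our own choice:

    import numpy as np

    rng = np.random.default_rng(0)
    idx = rng.permutation(len(texts))                      # shuffle instance indices
    n_tr, n_va = int(0.6 * len(idx)), int(0.2 * len(idx))
    train_idx = idx[:n_tr]
    valid_idx = idx[n_tr:n_tr + n_va]
    test_idx = idx[n_tr + n_va:]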

4.2 Improvements on Text Classification

4.2.1 Test Performance.

Three metrics (accuracy, recall, and macro-F1) are employed to evaluate the models. Table 1 shows the test performance of the hard label, label smoothing, and our algorithm with three models on five datasets. Our algorithm generally outperforms the hard label and label smoothing, with the most obvious improvement on TextCNN. Comparing the three base models, BERT outperforms TextCNN and BiLSTM on all five datasets; correspondingly, the improvement of our algorithm on BERT is smaller than on TextCNN and BiLSTM.

Table 1. Test Performance on Different Datasets

Models     | 20NG              | THUCNews          | NHKNews           | KRNews            | FaizalNews
           | acc   rec   f1    | acc   rec   f1    | acc   rec   f1    | acc   rec   f1    | acc   rec   f1
TextCNN+HL | 82.87 82.81 82.78 | 88.20 88.17 88.17 | 60.39 58.40 58.66 | 72.85 71.91 72.19 | 90.21 89.80 89.94
TextCNN+LS | 83.22 83.33 83.19 | 88.18 88.18 88.17 | 61.24 57.61 59.09 | 72.61 72.24 72.34 | 89.23 88.50 88.68
TextCNN+CD | 83.86 83.61 83.70 | 88.82 88.83 88.81 | 61.31 59.13 59.88 | 73.01 72.90 72.79 | 90.47 90.40 90.43
BiLSTM+HL  | 76.24 76.48 76.23 | 87.82 87.74 87.76 | 57.78 56.74 56.63 | 69.11 68.72 68.75 | 79.90 80.10 79.86
BiLSTM+LS  | 77.50 77.46 77.42 | 87.98 87.90 87.91 | 57.05 55.97 56.21 | 68.84 67.77 68.10 | 80.92 80.70 80.75
BiLSTM+CD  | 77.27 77.39 77.22 | 88.29 88.04 88.08 | 58.78 56.98 57.67 | 69.42 68.95 69.08 | 80.80 80.50 80.45
BERT+HL    | 92.24 92.09 92.14 | 97.18 97.18 97.18 | 76.07 73.94 74.88 | 80.58 80.43 80.47 | 92.72 92.60 92.64
BERT+LS    | 92.17 92.29 92.21 | 97.22 97.21 97.21 | 77.60 74.12 75.54 | 80.87 80.61 80.69*| 92.53 92.40 92.44
BERT+CD    | 92.48 92.66 92.56*| 97.23 97.23 97.23*| 76.46 76.42 76.34*| 80.86 80.50 80.63 | 93.09 92.80 92.87*

  • HL, LS, and CD are the abbreviations of hard label, label smoothing, and category distribution, respectively. An asterisk (*) marks the best macro-F1 result for each dataset.

Table 2 shows the detailed test performance on each category of the 20NG dataset with the TextCNN model. Our algorithm outperforms label smoothing in 14 of the 20 categories. Three major categories in 20NG contain only one sub-category (alt.atheism, misc.forsale, and soc.religion.christian); our algorithm outperforms label smoothing in all three, with improvements ranging from 0.20 to 4.02 points in macro-F1.

Table 2. Test Performance of Each Category on the 20NG Dataset with the TextCNN Model

Category                 | Hard Label        | Label Smoothing   | Category Distribution
                         | acc   rec   f1    | acc   rec   f1    | acc   rec   f1
alt.atheism              | 90.51 90.51 90.51 | 91.85 90.51 91.18 | 95.42 91.24 93.28*
comp.graphics            | 71.68 68.89 70.25*| 68.36 67.22 67.79 | 67.18 72.78 69.87
comp.os.ms-windows.misc  | 68.02 74.44 71.09 | 78.95 75.00 76.92*| 79.04 73.33 76.08
comp.sys.ibm.pc.hardware | 72.22 66.82 69.42 | 72.64 68.22 70.36 | 70.14 72.43 71.26*
comp.sys.mac.hardware    | 74.24 76.17 75.19 | 79.69 79.27 79.48*| 73.85 74.61 74.23
comp.windows.x           | 73.41 67.55 70.36 | 67.31 74.47 70.71 | 71.81 71.81 71.81*
misc.forsale             | 74.16 85.16 79.28 | 71.36 83.52 76.96 | 80.11 81.87 80.98*
rec.autos                | 82.63 84.62 83.61 | 82.41 85.58 83.96 | 81.06 88.46 84.60*
rec.motorcycles          | 89.01 89.47 89.24 | 91.89 89.47 90.67 | 92.97 90.53 91.73*
rec.sport.baseball       | 82.49 89.50 85.85*| 83.73 87.50 85.57 | 83.25 87.00 85.09
rec.sport.hockey         | 88.61 89.50 89.05*| 87.88 87.00 87.44 | 88.61 89.50 89.05*
sci.crypt                | 95.92 91.71 93.77 | 98.94 91.22 94.92*| 93.50 91.22 92.35
sci.electronics          | 78.95 78.57 78.76*| 80.95 72.86 76.69 | 78.95 78.57 78.76*
sci.med                  | 83.14 78.14 80.56 | 81.97 81.97 81.97*| 84.02 77.60 80.68
sci.space                | 89.09 91.59 90.32*| 87.73 90.19 88.94 | 89.25 89.25 89.25
soc.religion.christian   | 92.23 88.56 90.36 | 94.27 90.05 92.11 | 95.24 89.55 92.31*
talk.politics.guns       | 87.88 90.62 89.23 | 91.44 89.06 90.24*| 86.87 89.58 88.21
talk.politics.mideast    | 95.92 92.16 94.00 | 95.92 92.16 94.00 | 95.94 92.65 94.26*
talk.politics.misc       | 90.62 84.30 87.35 | 81.72 88.37 84.92 | 90.00 88.95 89.47*
talk.religion.misc       | 76.67 77.97 77.31 | 75.38 83.05 79.03 | 80.00 81.36 80.67*
macro average            | 82.87 82.81 82.78 | 83.22 83.33 83.19 | 83.86 83.61 83.70*

  • An asterisk (*) marks the best macro-F1 result for each category.

4.2.2 Analysis.

As mentioned above, our soft targets are calculated from the learned category distribution, which means our algorithm improves the model by considering category relations. Therefore, for datasets whose categories are easy to distinguish, the improvement of our algorithm is limited. This can be seen in Table 1: the five topics annotated in FaizalNews are much easier to distinguish than the categories of the other datasets, and all three baseline models already perform well with hard labels. As a result, the improvement of our algorithm on FaizalNews is not as significant as on the other datasets.

On the contrary, our algorithm is especially useful for datasets whose category boundaries are hard to define. This can also be seen in Table 1: NHKNews contains substantial noise, and the incorrectly annotated instances make the label relations fuzzy. As a result, the improvement of our algorithm on NHKNews is more significant than on the other datasets. It is worth pointing out that the easier it is to define the boundaries between categories, the easier the dataset is to classify. For datasets whose label boundaries are unclear, our algorithm helps discover label relations and improve model performance; it is therefore more helpful for hard datasets than for easy ones.

4.3 Tolerance to Noisy Data

It is hard to annotate all samples correctly, especially when the categories are similar to each other, so learning from noisy data is a problem of great practical importance. However, deep neural networks can fit noisy labels, which is very harmful to generalization [43]. In this section, we show that our approach can improve the performance of neural networks by reducing their confidence in learning noisy data.

To better show the performance of our method on noisy data, we construct a series of noisy datasets based on THUCNews. For each noisy dataset, only the training data are randomly re-labeled with a certain noise proportion, and the test data remain unchanged. We construct four noisy datasets with different noise proportions (5%, 10%, 20%, and 30%). We choose TextCNN, BiLSTM, and BERT as base prediction models, and employ three metrics (accuracy, recall, and macro-F1) to evaluate model performance. The detailed results are listed in Table 3.
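A minimal sketch of the label corruption step, under the assumption that a noise fraction of training labels is replaced with uniformly random categories (the paper does not specify whether the original label is excluded when re-sampling):

    import numpy as np

    def corrupt_labels(labels, C, noise=0.1, seed=0):
        # randomly re-label a `noise` fraction of the training instances
        rng = np.random.default_rng(seed)
        noisy = labels.copy()
        picked = rng.choice(len(labels), size=int(noise * len(labels)), replace=False)
        noisy[picked] = rng.integers(0, C, size=len(picked))
        return noisy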

Table 3. Test Performance on Noisy Data of the THUCNews Dataset

Models     | 5% Noise          | 10% Noise         | 20% Noise         | 30% Noise
           | acc   rec   f1    | acc   rec   f1    | acc   rec   f1    | acc   rec   f1
TextCNN+HL | 87.20 87.15 87.17 | 84.71 84.64 84.66 | 83.24 83.20 83.20 | 81.89 81.85 81.86
TextCNN+LS | 87.31 87.28 87.27 | 85.09 85.03 85.05 | 83.56 83.52 83.53 | 82.57 82.51 82.52
TextCNN+CD | 88.02 88.00 87.99 | 86.48 86.47 86.45 | 84.79 84.72 84.73 | 83.16 83.15 83.13
BiLSTM+HL  | 81.12 81.11 81.10 | 79.61 79.55 79.57 | 74.07 74.05 74.05 | 69.09 69.01 69.02
BiLSTM+LS  | 82.13 82.04 82.05 | 80.68 80.67 80.65 | 74.77 74.69 74.72 | 70.91 70.81 70.84
BiLSTM+CD  | 82.59 82.55 82.54 | 80.87 80.80 80.81 | 75.62 75.52 75.55 | 71.34 71.33 71.32
BERT+HL    | 97.07 97.06 97.06 | 96.67 96.67 96.67 | 96.56 96.55 96.55 | 95.67 95.66 95.66
BERT+LS    | 97.00 97.00 97.00 | 96.60 96.59 96.59 | 96.82 96.81 96.81*| 95.87 95.86 95.86
BERT+CD    | 97.14 97.13 97.13*| 96.93 96.92 96.92*| 96.67 96.66 96.66 | 96.03 96.01 96.01*

  • HL, LS, and CD represent hard label, label smoothing, and category distribution, respectively. The percentage is the proportion of training samples that are randomly re-labeled. An asterisk (*) marks the best macro-F1 result for each noise proportion.

As the noise proportion increases, the test accuracy, recall, and macro-F1 of all models decrease. On all four noisy datasets and all three models, our algorithm generally outperforms the hard label and label smoothing. Among the three neural models, the macro-F1 score of BiLSTM drops 12.08 points from 5% to 30% noise, which indicates that BiLSTM is the most sensitive to noise, whereas the macro-F1 score of BERT drops only 1.40 points. As a pre-trained language model, BERT demonstrates strong robustness to noise.

As listed in Table 3, our algorithm is especially useful on TextCNN and BiLSTM. As the noise proportion increases from 5% to 30%, our algorithm outperforms the hard label by 1.44 to 2.30 macro-F1 points on BiLSTM. The baselines of the BERT model are so high that there is little room for improvement; nonetheless, the improvement of our method over the hard label grows from 0.07 to 0.35 macro-F1 points as the noise proportion increases. The noisy experiments demonstrate the effectiveness of our algorithm on all three modern neural networks.

4.4 The By-Product: Category Distribution

Existing classification datasets regard categories as mutually independent. However, detecting category relations is very important in many fields [2, 7]. In this section, we probe the intrinsic quality of the proposed category distribution in expressing category relations from two aspects: arrangement and clustering. We employ the 20NG dataset, as it contains 7 major categories and 20 sub-categories.

4.4.1 Arrangement of Category Distribution.

The dimension of the category distribution is equal to the number of categories. To better show the arrangement of the categories, we first use singular value decomposition (SVD) [34] to reduce the dimension of the category distribution from 20 to 2. Then, the two-dimensional vectors are replaced with their rank orders, which preserves the relative relations among them. The resulting two-dimensional vectors are displayed in Figure 2, where all sub-categories that belong to the same major category are represented with the same color.
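A sketch of this projection with NumPy and SciPy; Y is the 20-by-20 category distribution matrix from Equation (9), and mean-centering before the SVD is our own assumption:

    import numpy as np
    from scipy.stats import rankdata

    U, sigma, Vt = np.linalg.svd(Y - Y.mean(axis=0), full_matrices=False)
    coords = U[:, :2] * sigma[:2]        # keep the top-2 singular directions
    # replace each coordinate with its rank order along each axis
    ranks = np.column_stack([rankdata(coords[:, d]) for d in range(2)])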

Fig. 2.

Fig. 2. Category distribution obtained by our algorithm on the 20NG dataset. (a) Visualization of category distribution. There are 7 major categories and 20 sub-categories contained in the dataset. Seven major categories are represented with seven colors, respectively. (b) The cluster dendrogram of category distribution. The dendrogram is constructed using linkage clustering.

There are four major categories (comp, rec, sci, and talk) that contain multiple sub-categories. They are colored with brown, green, purple, and blue. For these four major categories, each one is linearly separated from the other three. Although the categories are annotated with one-hot vectors, our proposed category distribution can still well express potential relations among major categories.

As for the major category talk (in blue), the sub-category talk.religion.misc (top left in blue) is far away from the other three sub-categories, which belong to talk.politics; this is consistent with the fact that talk.religion and talk.politics cover different sub-topics. As for alt.atheism (top left in cyan) and soc.religion.christian (top left in gold), they are very close to each other although they belong to different major categories. Moreover, talk.religion.misc is close to both soc.religion.christian and alt.atheism. This accords with the fact that talk.religion.misc, alt.atheism, and soc.religion.christian are all highly related to religion.

The sub-category sci.electronics (bottom middle in green) is close to the major category comp (bottom right in brown), which is consistent with the close relation between electronics and computers. As for misc.forsale (middle right in coral), it is interesting that it is located between comp (in brown) and rec (in purple) but far away from talk.politics and the religion-related categories. This is reasonable: the sub-categories in comp and rec concern products that can be traded, whereas religion and politics are cultural concepts that cannot.

From this experiment, we can conclude that our category distribution can not only well express the relations among major categories but also well capture the relations among sub-categories.

4.4.2 Clustering of Category Distribution.

In this section, we perform a cluster analysis on our category distribution. We use the linkage function in the scipy package, choosing the Ward algorithm and Euclidean distance as the function parameters.
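A sketch of this step; Y is the category distribution matrix from Equation (9), and category_names (the 20 topic names) is an assumed list:

    from scipy.cluster.hierarchy import dendrogram, linkage

    # Ward linkage with Euclidean distance over the category distributions
    Z = linkage(Y, method="ward", metric="euclidean")
    dendrogram(Z, labels=category_names)   # produces the cluster dendrogram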

The cluster dendrogram of category distribution can be seen in Figure 2. Although the clustering results are not completely consistent with human clustering results, we can still find several common features.

All sub-categories that belong to comp are colored red, which demonstrates that our category distribution can well distinguish the categories related to the computer topic. Moreover, sci.electronics is also marked in red, which means sci.electronics is close to comp; this accords with the results of the previous experiment.

All four sub-categories that belong to rec are colored in purple. In addition, two sub-categories in the sci topic and one sub-category in talk are also marked in purple. This indicates that two sub-categories in different major categories can be close to each other, which further suggests the complexity of clustering the categories.

As shown in Figure 2, alt.atheism, soc.religion.christian, and talk.religion.misc are very close to each other and colored orange. These three categories seem to be far away from the other categories. It is reasonable as they are highly related with the religion topic while the others are not. sci seems to be the most irregular major category. Although sci contains four sub-categories, these sub-categories are not clustered together.


5 LIMITATIONS

In this article, we discuss how to extract category relations from text classification datasets and further improve the classification performance of neural network models. However, it should be noted that the extracted category relations can only reflect the relations in the data space, not the semantic space. Moreover, the category relations may change with the choice of dataset and classification model. How to obtain dataset-independent category relations in semantic space remains a challenge.


6 CONCLUSION AND FUTURE WORK

In this article, we argue that label smoothing is unable to express category relations well. To address this problem, we propose an algorithm that obtains a category distribution to reveal category relations. Based on the learned category distribution, new soft targets are calculated for further model fine-tuning. Experimental results demonstrate the effectiveness of our algorithm in improving model classification performance and of the learned category distribution in expressing underlying category relations. Moreover, our algorithm does not require any additional neural network module and can be easily employed with existing neural network models.

There are two avenues of future work we would like to explore. On the one hand, existing deep models tend to be overconfident, and training with soft labels can reduce model confidence in making predictions; it would be interesting to investigate the ability of category distribution in confidence calibration. On the other hand, category distribution, like label smoothing, is useful only on single-label datasets, and it would be very meaningful to extend these methods to multi-label datasets.


REFERENCES

  [1] Jacob Abernethy, Olivier Chapelle, and Carlos Castillo. 2008. Web spam identification through content and hyperlinks. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web. 41–44.
  [2] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2013. Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 819–826.
  [3] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. 2015. Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 7 (2015), 1425–1438.
  [4] Chen Chen, Haobo Wang, Weiwei Liu, Xingyuan Zhao, Tianlei Hu, and Gang Chen. 2019. Two-stage label embedding via neural factorization machine for multi-label classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33(1). 3304–3311.
  [5] Jan Chorowski and Navdeep Jaitly. 2017. Towards better decoding and language model integration in sequence to sequence models. In Proc. Interspeech 2017. 523–527.
  [6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
  [7] Paul Ekman and Richard J. Davidson (Eds.). 1994. The Nature of Emotion: Fundamental Questions. Oxford University Press.
  [8] Richard S. Forsyth and David I. Holmes. 1996. Feature-finding for text classification. Literary and Linguistic Computing 11, 4 (1996), 163–174.
  [9] Chuanxing Geng, Sheng-Jun Huang, and Songcan Chen. 2020. Recent advances in open set recognition: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 10 (2020), 3614–3631.
  [10] Xin Geng. 2016. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering 28, 7 (2016), 1734–1748.
  [11] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18, 5–6 (2005), 602–610.
  [12] Biyang Guo, Songqiao Han, Xiao Han, Hailiang Huang, and Ting Lu. 2021. Label confusion learning to enhance text classification models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 12929–12936.
  [13] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. 1321–1330.
  [14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
  [15] Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European Conference on Machine Learning. Springer, 137–142.
  [16] Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, and Sung Hyon Myaeng. 2006. Some effective techniques for naive Bayes text classification. IEEE Transactions on Knowledge and Data Engineering 18, 11 (2006), 1457–1466.
  [17] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1746–1751.
  [18] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR'15). 1–15.
  [19] Assaf Kron. 2019. Rethinking the principles of emotion taxonomy. Emotion Review 11, 3 (2019), 226–233.
  [20] Bing Liu and Lei Zhang. 2012. A survey of opinion mining and sentiment analysis. In Mining Text Data. Springer, 415–463.
  [21] Yukun Ma, Erik Cambria, and Sa Gao. 2016. Label embedding for zero-shot fine-grained named entity typing. In Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers (COLING'16). 171–180.
  [22] Clara Meister, Elizabeth Salesky, and Ryan Cotterell. 2020. Generalized entropy regularization or: There's nothing special about label smoothing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 6870–6886.
  [23] Rafael Müller, Simon Kornblith, and Geoffrey E. Hinton. 2019. When does label smoothing help? Advances in Neural Information Processing Systems 32 (2019), 4694–4703.
  [24] Kamal Nigam, John Lafferty, and Andrew McCallum. 1999. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, Vol. 1. 61–67.
  [25] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. 2000. Text classification from labeled and unlabeled documents using EM. Machine Learning 39, 2 (2000), 103–134.
  [26] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2227–2237.
  [27] Mary Phuong and Christoph Lampert. 2019. Towards understanding knowledge distillation. In International Conference on Machine Learning. 5142–5151.
  [28] James A. Russell and James H. Steiger. 1982. The structure in persons' implicit taxonomy of emotions. Journal of Research in Personality 16, 4 (1982), 447–469.
  [29] Walter J. Scheirer, Lalit P. Jain, and Terrance E. Boult. 2014. Probability models for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 11 (2014), 2317–2324.
  [30] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
  [31] Maosong Sun, Jingyang Li, Zhipeng Guo, Z. Yu, Y. Zheng, X. Si, and Z. Liu. 2016. THUCTC: An efficient Chinese text classifier. GitHub repository (2016).
  [32] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR'16). IEEE, 2818–2826.
  [33] Nidhi Vyas, Shreyas Saxena, and Thomas Voice. 2020. Learning soft labels via meta learning. arXiv preprint arXiv:2009.09496 (2020).
  [34] Michael E. Wall, Andreas Rechtsteiner, and Luis M. Rocha. 2003. Singular value decomposition and principal component analysis. In A Practical Approach to Microarray Data Analysis. Springer, 91–109.
  [35] Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2321–2331.
  [36] Jin Wang, Zhongyuan Wang, Dawei Zhang, and Jun Yan. 2017. Combining knowledge with deep convolutional neural networks for short text classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 2915–2921.
  [37] Xiangyu Wang and Chengqing Zong. 2021. Distributed representations of emotion categories in emotion space. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2364–2375.
  [38] Rui Xia, Feng Xu, Chengqing Zong, Qianmu Li, Yong Qi, and Tao Li. 2015. Dual sentiment analysis: Considering two sides of one review. IEEE Transactions on Knowledge and Data Engineering 27, 8 (2015), 2120–2133.
  [39] Rui Xia, Chengqing Zong, and Shoushan Li. 2011. Ensemble of feature sets and classification algorithms for sentiment classification. Information Sciences 181, 6 (2011), 1138–1152.
  [40] Yiming Yang. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval 1, 1 (1999), 69–90.
  [41] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32 (2019).
  [42] Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33(1). 7370–7377.
  [43] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2021. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64, 3 (2021), 107–115.
  [44] Chang-Bin Zhang, Peng-Tao Jiang, Qibin Hou, Yunchao Wei, Qi Han, Zhen Li, and Ming-Ming Cheng. 2021. Delving deep into label smoothing. IEEE Transactions on Image Processing 30 (2021), 5984–5996.
  [45] Honglun Zhang, Liqiang Xiao, Wenqing Chen, Yongkun Wang, and Yaohui Jin. 2018. Multi-task label embedding for text classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 4545–4553.
  [46] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems 28 (2015), 649–657.
  [47] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. 2010. Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics 1, 1–4 (2010), 43–52.
  [48] Chengqing Zong, Rui Xia, and Jiajun Zhang. 2021. Text Data Mining. Springer.
