Abstract
In this paper we propose a method for representing documents in a lower-dimensional semantic space, based on a modified Reduced k-means (RKM) method that penalizes clusterings that are distant from the classification of training documents given by experts. RKM simultaneously clusters documents and extracts factors. Documents represented in the vector space model are projected onto the extracted factors and clustered in the semantic space in a semi-supervised way: the penalization guides the clustering toward the classification given by experts, which improves classification performance on test documents. Classification performance is tested with logistic regression and support vector machines (SVMs) on classes of the Reuters-21578 data set. Representing documents by the RKM method with penalization improves the average precision of classification by SVMs for the 25 largest classes of the Reuters collection by about 5.5%, at the same level of average recall, in comparison to the basic representation in the vector space model. For classification by logistic regression, representation by RKM with penalization improves average recall by about 1% in comparison to the basic representation.
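To make the "simultaneous clustering and factor extraction" idea concrete, the following is a minimal sketch of plain, unpenalized Reduced k-means fitted by alternating least squares, minimizing ||X - U F A'||² over a membership matrix U, centroids F, and orthonormal loadings A. It is not the authors' implementation: the function name `reduced_kmeans`, its parameters, the random initialization, and the handling of empty clusters are assumptions made for illustration. The penalty toward the expert classification is deliberately left out, because the abstract does not give its form.

```python
import numpy as np

def reduced_kmeans(X, n_clusters, n_factors, n_iter=100, seed=0):
    """Plain (unpenalized) Reduced k-means sketch; hypothetical helper.

    X is an (n_docs, n_terms) array. Returns the cluster labels, the
    loadings A (n_terms, n_factors) with orthonormal columns, and the
    centroids F (n_clusters, n_factors) in the reduced space.
    """
    n_docs, n_terms = X.shape
    if n_factors > min(n_clusters, n_terms):
        raise ValueError("n_factors must not exceed min(n_clusters, n_terms)")
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=n_docs)  # random start (assumption)
    for _ in range(n_iter):
        # Cluster means in the full term space; an empty cluster borrows a
        # random document as its centroid (assumption) and gets zero weight.
        counts = np.bincount(labels, minlength=n_clusters)
        M = np.zeros((n_clusters, n_terms))
        for k in range(n_clusters):
            if counts[k] > 0:
                M[k] = X[labels == k].mean(axis=0)
            else:
                M[k] = X[rng.integers(n_docs)]
        # Loadings: leading right singular vectors of the count-weighted
        # centroid matrix, i.e. leading eigenvectors of M' diag(counts) M.
        _, _, Vt = np.linalg.svd(np.sqrt(counts)[:, None] * M,
                                 full_matrices=False)
        A = Vt[:n_factors].T
        F = M @ A          # centroids in the reduced space
        Y = X @ A          # documents projected onto the factors
        # Reassign each document to its nearest centroid in the reduced space.
        d2 = ((Y[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, A, F
```

Under these assumptions, `X @ A` gives the lower-dimensional document representation that would then be passed to a classifier such as logistic regression or an SVM. A penalized variant would add a term to the objective that grows as the clustering departs from the expert labels of the training documents.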
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
About this paper
Cite this paper
Dobša, J., Kiers, H.A.L. (2023). Improving Classification of Documents by Semi-supervised Clustering in a Semantic Space. In: Brito, P., Dias, J.G., Lausen, B., Montanari, A., Nugent, R. (eds) Classification and Data Science in the Digital Age. IFCS 2022. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-031-09034-9_14
DOI: https://doi.org/10.1007/978-3-031-09034-9_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-09033-2
Online ISBN: 978-3-031-09034-9
eBook Packages: Mathematics and Statistics (R0)