Improving Classification of Documents by Semi-supervised Clustering in a Semantic Space

Dobša, Jasminka; Kiers, Henk A. L.

doi:10.1007/978-3-031-09034-9_14

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Included in the following conference series:

Conference of the International Federation of Classification Societies

Abstract

We propose a method for representing documents in a lower-dimensional semantic space, based on a modified Reduced k-means (RKM) method that penalizes clusterings that are distant from the classification of training documents given by experts. RKM simultaneously clusters documents and extracts factors. Documents represented in the vector space model are projected onto the extracted factors and clustered in the resulting semantic space. Because the penalization guides the clustering toward the expert classification, the clustering is semi-supervised, which improves classification performance on test documents. Classification performance is tested with logistic regression and support vector machines (SVMs) on classes of the Reuters-21578 data set. For the 25 largest classes of the Reuters collection, representing documents by RKM with penalization improves the average precision of SVM classification by about 5.5% at the same level of average recall, compared with the basic representation in the vector space model. For logistic regression, representation by RKM with penalization improves average recall by about 1% compared with the basic representation.
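
The abstract does not state the objective function or the form of the penalty, so the following is only a minimal sketch under stated assumptions: a plain alternating least-squares Reduced k-means in NumPy, plus a hypothetical penalty `lam` added to the assignment cost whenever a labelled training document is placed in a cluster other than its expert class. The function name `reduced_kmeans`, the parameters `labels` and `lam`, and the convention that cluster j corresponds to class j are illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def reduced_kmeans(X, k, q, labels=None, lam=0.0, n_iter=50, seed=0):
    """Alternating least-squares sketch of Reduced k-means (RKM).

    X      : (n, p) array, one row per document (ideally column-centred).
    k      : number of clusters.
    q      : number of extracted factors (q <= p).
    labels : optional length-n int array; class index in [0, k) for
             labelled rows, -1 for unlabelled rows (ASSUMED convention).
    lam    : HYPOTHETICAL penalty for assigning a labelled row to a
             cluster different from its class; lam=0 gives plain RKM.
    Returns (assign, A, C): cluster index per row, (p, q) orthonormal
    loadings, and (k, q) centroids in the reduced space.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    labels = np.full(n, -1) if labels is None else np.asarray(labels)
    idx = np.flatnonzero(labels >= 0)  # rows with an expert class

    # Start from the expert classes where known, random clusters elsewhere.
    assign = np.where(labels >= 0, labels, rng.integers(0, k, size=n))

    # Assumed penalty matrix: lam everywhere for labelled rows,
    # except in the column of their own class.
    penalty = np.zeros((n, k))
    penalty[idx] = lam
    penalty[idx, labels[idx]] = 0.0

    for _ in range(n_iter):
        # Cluster means in the original space; an empty cluster is
        # re-seeded with a random row so the loop can continue.
        sizes = np.bincount(assign, minlength=k)
        means = np.zeros((k, p))
        for j in range(k):
            if sizes[j] > 0:
                means[j] = X[assign == j].mean(axis=0)
            else:
                means[j] = X[rng.integers(n)]

        # Loadings: top-q eigenvectors of the size-weighted
        # between-cluster scatter matrix.
        B = means.T @ (sizes[:, None] * means)
        _, eigvecs = np.linalg.eigh(B)  # eigenvalues in ascending order
        A = eigvecs[:, ::-1][:, :q]
        C = means @ A  # centroids projected on the factors

        # Reassign each row to the nearest centroid in the reduced
        # space, adding the assumed penalty for labelled rows.
        scores = X @ A
        dist = ((scores[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        new_assign = np.argmin(dist + penalty, axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign

    return assign, A, C
```

In this sketch, test documents would be represented by projecting their vectors with `X_test @ A` before being passed to a classifier. Other penalty forms are possible, and the chapter itself should be consulted for the one actually used.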

Author information

Authors and Affiliations

Faculty of Organization and Informatics, University of Zagreb, Pavlinska 2, 40000, Varaždin, Croatia
Jasminka Dobša
Department of Psychology, University of Groningen, Grote Kruisstraat 2/1, 9712 TS, Groningen, The Netherlands
Henk A. L. Kiers

Corresponding author

Correspondence to Jasminka Dobša.

Editor information

Editors and Affiliations

Faculty of Economics, University of Porto, Porto, Portugal
Paula Brito
Business Research Unit, University Institute of Lisbon, Lisbon, Portugal
José G. Dias
Department of Mathematical Sciences, University of Essex, Colchester, UK
Berthold Lausen
Department of Statistical Sciences "Paolo Fortunati", University of Bologna, Bologna, Italy
Angela Montanari
Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
Rebecca Nugent

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Dobša, J., Kiers, H.A.L. (2023). Improving Classification of Documents by Semi-supervised Clustering in a Semantic Space. In: Brito, P., Dias, J.G., Lausen, B., Montanari, A., Nugent, R. (eds) Classification and Data Science in the Digital Age. IFCS 2022. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-031-09034-9_14

DOI: https://doi.org/10.1007/978-3-031-09034-9_14
Published: 08 December 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-09033-2
Online ISBN: 978-3-031-09034-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
