Character confusion versus focus word-based correction of spelling and OCR variants in corpora

Reynaert, Martin W. C.

doi:10.1007/s10032-010-0133-5

Original Paper
Open access
Published: 03 November 2010

Volume 14, pages 173–187, (2011)

International Journal on Document Analysis and Recognition (IJDAR)

Abstract

We present a new approach based on anagram hashing to handle lexical variation in large and noisy text collections globally. Lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string, some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible particular character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting errors, and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (TICCL) is compared to its focus word-based counterpart and evaluated on 6 years’ worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed over its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper ‘Het Volk’ show that the system is not sensitive to domain variation.
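
The global, character confusion-based lookup described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: the anagram key is taken to be the sum of each character's code point raised to the fifth power, and the names `anagram_key`, `build_index`, `edit_distance` and `pairs_for_confusion` are hypothetical, chosen for illustration only. The idea shown is that a given confusion (here OCR reading "e" as "c") corresponds to one fixed difference between anagram keys, so a single pass over the corpus index retrieves every word pair displaying that confusion.

```python
from collections import defaultdict


def anagram_key(word: str, power: int = 5) -> int:
    """Order-independent key: sum of code points raised to a fixed power.

    The exponent 5 is an assumption made for this sketch.
    """
    return sum(ord(ch) ** power for ch in word)


def build_index(vocabulary):
    """Map each anagram key to the set of corpus word types sharing it."""
    index = defaultdict(set)
    for word in vocabulary:
        index[anagram_key(word)].add(word)
    return index


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, used to verify retrieved pairs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def pairs_for_confusion(index, wrong: str, right: str, max_dist: int = 2):
    """Find all (variant, candidate) pairs that display one confusion.

    The confusion 'wrong -> right' shifts the anagram key by a constant,
    so each key in the index needs only one extra lookup.
    """
    delta = anagram_key(right) - anagram_key(wrong)
    pairs = []
    for key, variants in index.items():
        # .get() avoids inserting new keys while iterating over the index
        for candidate in index.get(key + delta, ()):
            for variant in variants:
                if variant != candidate and edit_distance(variant, candidate) <= max_dist:
                    pairs.append((variant, candidate))
    return sorted(pairs)


vocabulary = ["regering", "rcgering", "kamer", "kamcr", "minister"]
index = build_index(vocabulary)
print(pairs_for_confusion(index, "c", "e"))
```

In this sketch the cost of handling one confusion is a single sweep over the distinct keys, independent of how many words exhibit it. A focus word-based variant would instead loop over each word and try every confusion value for that word, which is the local strategy the abstract contrasts with the global one.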

Acknowledgements

We are grateful to our anonymous reviewers for their rightful criticisms of our first draft. We would like to thank our contacts at the KB for their support and patience: Paul Doorenbosch, Astrid Verheusen, Tineke Koster and Evelien Ket. Heartfelt thanks to scientific programmer Ko van der Sloot at ILK, whose reimplementation of our basic ideas demonstrated to us their essence. Early TICCL prototypes were developed within a Netherlands Organization for Scientific Research (NWO) Exact Sciences Hefboom project. The production version of TICCL was commissioned by the Koninklijke Bibliotheek - Den Haag. Development continues under the Stevin project SoNaR (STE07014). TICCL was turned into the online processing system TICCLops with funding from CLARIN-NL (CLARIN-NL-09-011).

Open Access

This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Author information

Authors and Affiliations

Tilburg Centre for Cognition and Communication, Tilburg University, Kamer D 342, P.O. Box 90153, 5000 LE, Tilburg, Netherlands
Martin W. C. Reynaert

Corresponding author

Correspondence to Martin W. C. Reynaert.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License (https://creativecommons.org/licenses/by-nc/2.0), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

About this article

Cite this article

Reynaert, M.W.C. Character confusion versus focus word-based correction of spelling and OCR variants in corpora. IJDAR 14, 173–187 (2011). https://doi.org/10.1007/s10032-010-0133-5

Received: 16 December 2009
Revised: 06 August 2010
Accepted: 12 October 2010
Published: 03 November 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s10032-010-0133-5
