Efficiently Finding Near Duplicate Figures in Archives of Historical Documents
Abstract
The increasing interest in archiving all of humankind’s cultural artifacts has resulted in the digitization of millions of books, and soon a significant fraction of the world’s books will be online. Most of the data in historical manuscripts is text, but there is also a significant fraction devoted to images. This fact has driven much of the recent increase in interest in query-by-content systems for images. While querying/indexing systems can undoubtedly be useful, we believe that the historical manuscript domain is finally ripe for true unsupervised discovery of patterns and regularities. To this end, we introduce an efficient and scalable system that can detect approximately repeated occurrences of shape patterns both within and between historical texts. We show that this ability to find repeated shapes allows automatic annotation of manuscripts, and allows users to trace the evolution of ideas. We demonstrate our ideas on datasets of scientific and cultural manuscripts dating back to the fourteenth century.
Keywords
References
[1] X. Bai, X. Yang, L. Latecki, and W. Liu, “Learning context sensitive shape similarity by graph transduction,” IEEE TPAMI, 2009.
[2] D. H. Ballard, “Generalizing the Hough transform to detect arbitrary shapes,” Pattern Recognition, vol. 13, 1981, pp. 111-22.
http://dx.doi.org/10.1016/0031-3203(81)90009-1
[3] S. S. Bukhari, F. Shafait, and T. M. Breuel, “Improved document image segmentation algorithm using multiresolution morphology,” Document Recognition and Retrieval, 2011, pp. 1-10.
[4] B. J. Burke, Book of Orders of Knighthood and Decorations of Honour of all Nations, London: Hurst and Blackett, 1858.
[5] O. Chum, J. Philbin, M. Isard, and A. Zisserman, “Scalable near identical image and shot detection,” CIVR, 2007, pp. 549-556.
[6] O. Chum, J. Philbin, and A. Zisserman,“Near Duplicate Image Detection: min-Hash and tf-idf Weighting,” BMVC, 2008.
[7] C. Boutell, A Manual of Heraldry, Historical and Popular, Winsor and Newton, 1863.
[8] C. Davenport, British Heraldry, Methuen, London, 1921.
[9] C. Davenport, English heraldic book-stamps, figured and described, London: Archibald Constable. ltd, 1909.
[10] C. R. Dod, and R. P. Dod, Dod’s Peerage, Baronetage and Knightage of Great Britain and Ireland for 1915, London: Simpkin, Marshall, Hamilton, Kent. ltd, 1915.
[11] E. E. Dorling, Leopards of England, and other papers on heraldry, Constable & Company limited, London, 1913.
[12] R. Duda, and P. Hart, “Use of the Hough transform to detect lines and curves in pictures,” Comm. ACM, vol. 15, 1, 1972, pp. 11-15.
http://dx.doi.org/10.1145/361237.361242
[13] A. Fornés, J. Lladós, and G. Sanchez, “Old Handwritten Musical Symbol Classification by a Dynamic Time Warping Based Method,” Graphics Recognition, 2007.
[14] B. Gatos, I. Pratikakis, and S. J. Perantonis, “An adaptive binarisation technique for low quality historical documents,” Workshop on Document Analysis Systems, 2004.
http://dx.doi.org/10.1007/978-3-540-28640-0_10
[15] C. Grana, D. Borghesani, and R. Cucchiara. “Automatic segmentation of digitalized historical manuscripts,” Multimedia Tools Applications, vol. 55, 3, 2011, pp. 483- 506.
http://dx.doi.org/10.1007/s11042-010-0561-8
[16] P. V. C. Hough, Method and mean for recognizing complex pattern. USA patent 3069654, 1996.
[17] E. Kavallieratou and E. Stamatatos, “Adaptive binarization of historical document images,” ICPR, 2006, pp.742–745.
[18] Y. Ke, R. Sukthankar, L. Huston, “An efficient parts-based near-duplicate and sub-image retrieval system,” ACM Multimedia, 2004, pp. 869-876.
[19] E. Keogh, L. Wei, X. Xi, M. Vlachos, S. Lee, and P. Protopapas, “Supporting exact indexing of arbitrarily rotated shapes and periodic time series under Euclidean and warping distance measures,” VLDB Journal, vol. 18, 3, 2009, pp. 611-30.
http://dx.doi.org/10.1007/s00778-008-0111-4
[20] T. Koch-Grünberg, Sudamerikanische Felszeichnungen (South American petroglyphs), Berlin, E. Wasmuth A-G, 1907.
[21] J. Mas, G. Sanchez, and J. Llados, “An Incremental Parser to Recognize Diagram Symbols and Gestures represented by Adjacency Grammars,” Graphics Recognition, 2006, pp. 252-263.
[22] A. Mueen, E. Keogh, and N. Shamlo, “Finding Time Series Motifs in Disk-Resident Data,” ICDM, 2009, pp. 367-376.
[23] A. Pritchard, A history of Infusoria, including Desmidiaceae and Diatomaceae, British and foreign. Ed. IV. 968. London, 1861.
[24] G. Ramponi, F. Stanco, W. D. Russo, S. Pelusi, and P. Mauro, Digital automated restoration of manuscripts and antique printed books, 2005, EVA.
[25] J. V. Richardson Jr., Bookworms: The Most Common Insect Pests of Paper in Archives, Libraries, and Museums.
[26] G. Sanchez, et al., “A platform to extract knowledge from graphic documents. application to an architectural sketch understanding scenario,” DAS, 2004, pp. 389-400.
[27] K. B. Schroeder, et al. “Haplotypic Background of a Private Allele at High Frequency in the Americas,” Molecular Biology and Evolution, 2009, pp. 995-1016.
http://dx.doi.org/10.1093/molbev/msp024
PMid:19221006 PMCid:2734135
[28] W. Smith, A synopsis of the British Diatomaceae: with remarks on their structure, function and distribution, pp. [V]-XXXIII, pp. 1-89, 31 pls. London, 1853.
[29] Smith, G. A. and Turner, W. G. Indian Rock Art of Southern California with Selected Petroglyph Catalog, San Bernardino County, 1975.
[30] T. Rakthanmanon, Q. Zhu, and E. J. Keogh. “Mining Historical Documents for Near-Duplicate Figures,” ICDM, 2011, pp. 557-566.
[31] M. Tompa and J. Buhler, “Finding motifs using random projections,” Computational Molecular Biology, 2001, pp. 67-74.
[32] W. West and G. S. West, A Monograph of the British Desmidiaceae, vols. I–V, Ray Soc, London, 1904.
http://dx.doi.org/10.5962/bhl.title.1619
[33] H. J. Wolfson and I. Rigoutsos, “Geometric Hashing: An Overview,” IEEE Computer Science, vol. 4, 4, 1997.
[34] X. Xi, E. Keogh,L. Wei, and A. Mafra-Neto, “Finding Motifs in a Database of Shapes,” SIAM Conference on Data Mining, 2007.
[35] Q. Zhu, X. Wang, E. Keogh, and S. H. Lee, “Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs,” SIGKDD, 2009.
[36] Mining Historical Archives for Near-Duplicate Figures http://www.youtube.com/watch?v=QYY8A6CwS-A 2011.
[37] Project Website: http://www.cs.ucr.edu/~rakthant/DocMotif
Full Text: PDF


