Design and Analysis of an Effective Corpus for Evaluation of Bengali Text Compression Schemes
Abstract
In this paper, we propose an effective platform for evaluating Bengali text compression schemes, together with a novel scheme for constructing a Bengali text compression corpus. We present a methodical study of the approaches used to formulate text corpora for data compression and introduce a corpus named Ekushe-Khul for evaluating Bengali text compression schemes. The corpus is designed using Type-to-Token Ratio as the selection criterion, along with a number of secondary considerations. The paper also presents a mathematical analysis of data compression performance in relation to the structural aspects of corpora, and a comprehensive analysis of the evolving criteria for text compression corpora, including related issues in designing dictionary-based compression. The proposed corpus is effective for evaluating the compression efficiency of small and medium-sized Bengali text files.


