Access Restriction Open

Author Chaudhuri, Anirban Ray ♦ Singh, Debnath ♦ Nasipuri, Mita ♦ Basu, Dipak Kumar
Source Inflibnet's Institutional Repository
Content type Text
Publisher INFLIBNET Centre
File Format PDF
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science ♦ Library & information sciences
Subject Keyword Indian Scripts ♦ Desktop Publishing ♦ Page Layout Analysis ♦ Optical Character Recognition ♦ Document Reconstruction ♦ Encoding Standard ♦ Indian Language
Abstract Transforming a scanned paper document into an editable form suitable for further processing, such as desktop publishing or archiving in a digital library, is a complex process. It requires solving several problems: document analysis, in which a Page Layout Analyzer (PLA) acquires knowledge of the document layout, followed by document recognition, which mainly comprises text recognition by Optical Character Recognition (OCR). A third important problem is document reconstruction: transforming the content into an electronically editable format while keeping the original layout intact. Core OCR modules exist for several Indian scripts, but no such document reconstruction system is available for them. The document reconstruction system reported in this paper is the first of its kind for Indian scripts, and it addresses document reconstruction for Bengali document images. The system combines the document layout extracted by a PLA in a graphical user interface (GUI) with the results of the text recognition steps performed by OCR to transform paper documents into Rich Text Format.
ISBN 8190207903
Education Level UG and PG
Learning Resource Type Article