Image Understanding and Pattern Recognition Lab
German Research Center for Artificial Intelligence
Kaiserslautern, Germany

Ambrish Dantrey, B. Tech. III year, E&CE
Indian Institute of Technology, Roorkee
Roorkee, India

Supervisors: Faisal Shafait, Ilya Mezhirov
Reviewer: Prof. Dr. Thomas Breuel
Start Date for Internship: 15th May, 2007
End Date for Internship: 27th July, 2007
Report Date: 27th July, 2007

Preface
This report documents the work done during my summer internship at the Image Understanding and Pattern Recognition (IUPR) Lab, Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany, under the supervision of Prof. Dr. Thomas Breuel. The report first gives an overview of the tasks completed during the internship, with technical details. The results obtained are then discussed and analyzed. The report also elaborates on future work that can be pursued as an advancement of the current work. I have tried my best to keep the report simple yet technically correct, and I hope I have succeeded in the attempt.
Ambrish Dantrey
Simply put, I could not have done this work without the generous help I received from the whole of IUPR. The work culture at IUPR is really motivating. Everybody here is such a friendly and cheerful companion that work stress never gets in the way. I would especially like to thank Dr. Thomas Breuel and Dr. Daniel Keysers for providing good ideas to work on. Not only did they advise me on my project, but listening to their discussions in the IPeT meetings also evoked a keen interest in image analysis. I am also highly indebted to my supervisors, Faisal Shafait and Ilya Mezhirov, who seemed to have solutions to all my problems.
Author
This report presents the three tasks completed during the summer internship at IUPR:
1. Detection of headlines in document images using black run-lengths, and evaluation of OCRopus performance in detecting headlines
2. Reengineering of the zone-classification module
3. Evaluation of the performance of different segmentation algorithms
All these tasks were completed successfully, and the results were in line with expectations. Headline detection achieved an error rate of 2.85%, against 6.52% for the previously used methods. During the evaluation of segmentation algorithms, XYcut was found to gain a lot from noise cleanup, an interesting result because it strengthens the case for XYcut as a suitable segmentation method for OCRopus. The reengineering and porting of the zone-classification module to OCRopus makes text/image segmentation available in OCRopus should it be required in future.
Author
OCRopus: Introduction
Though the field of optical character recognition (OCR) is considered to be widely explored, developing an efficient system for use in real-world situations still remains a challenge. OCRopus is a state-of-the-art document analysis and OCR system being developed at IUPR, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multilingual capabilities. As this is a very large project, I was assigned the tasks of developing tools for layout analysis and evaluation.
The following goals were set as I proceeded in my work:
1. Conversion of ground-truth data in the MARG database from XML format to the hOCR microformat.
2. Development of a rule-based headline detection method using the median black run-length of the lines.
3. Development of segmentation-classification module and ...