Need for Customized Soundex based Algorithm on Indian Names for Phonetic Matching

G. Christopher Jaisunder; Israr Ahmed; R. K. Mishra

G. Christopher Jaisunder, National Informatics Centre, Department of Electronics and Information Technology (DeitY), Ministry of Communication and Information Technology, Government of India, New Delhi - 110003, Delhi, India
Israr Ahmed, National Informatics Centre, Department of Electronics and Information Technology (DeitY), Ministry of Communication and Information Technology, Government of India, New Delhi - 110003, Delhi, India
R. K. Mishra, Department of Computer Science, Jamia Millia Islamia, New Delhi - 110025, Delhi, India

Abstract

In any digitization program, reproducing handwritten demographic data is a challenging job, particularly for records
from previous decades. Digitizing an individual’s past records has now become essential. In
areas such as financial inclusion, border security, driving licenses, passport issuance, weapon licenses, banking, health care
and social welfare benefits, an individual’s earlier case history is a mandatory part of the decision-making process. Documents are
scanned and stored systematically, and every scanned document is tagged with a proper key. Documents are then retrieved
by their assigned key for data entry through a software program or package. The difficulty here is that
the data, particularly critical personal data such as the name and father’s name, may not be legible, and the
data entry operators type the characters as they understand them. The chance of error is high, with name variations in
the form of duplicate characters, abbreviations, omissions, ignored spaces between names and wrong spellings. The challenge is
that data retrieval over these key fields may not return proper results because of wrong data entry. We need to explore the opportunities and challenges in defining effective strategies to execute this job without compromising the quality and quantity
of the matches. This scenario calls for an appropriate string matching algorithm with phonetic matching. The algorithm is to be defined according to the nature, type and region of the data domain so that the search is based on phonetics rather
than simple string comparison. In this paper, we explain the need for a customized Soundex-based algorithm for phonetic
matching over the prevalent misspelt, incomplete, repetitive and partial data.
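
To make the contrast between phonetic matching and plain string comparison concrete, the following is a minimal sketch of the classic (American) Soundex encoding in Python. It is not the customized algorithm this paper argues for; the function name `soundex` and the sample names are illustrative assumptions only.

```python
# Classic American Soundex: a sketch for illustration, NOT the customized
# algorithm proposed in this paper. Function name and sample names are assumptions.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character classic Soundex code, or '' if no ASCII letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    result = [first]                 # the first letter is kept as-is
    prev = CODES.get(first, "")      # its code still suppresses an adjacent duplicate
    for ch in letters[1:]:
        if ch in "HW":
            continue                 # H and W do not separate equal codes
        code = CODES.get(ch, "")     # vowels and Y map to "" and reset prev
        if code and code != prev:
            result.append(code)
        prev = code
    return "".join(result)[:4].ljust(4, "0")


# Spelling variants that a plain string comparison would treat as different:
print(soundex("Mishra"), soundex("Misra"))             # M260 M260
print(soundex("Ahmed"), soundex("Ahmad"))              # A530 A530
# The retained first letter keeps these apart despite a similar sound:
print(soundex("Christopher"), soundex("Kristopher"))   # C623 K623
```

The last pair shows one known limitation of classic Soundex: because the first letter is kept unencoded, variants that differ in their initial letter receive different codes even when they sound alike.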