Part of Speech Recognizer

  • Topic: Natural language processing, Part-of-speech tagging, Lexical category
  • Pages: 9 (3200 words)
  • Published: March 11, 2013
Improving Identifier Informativeness using Part of Speech Information

Dave Binkley, Matthew Hearn, Dawn Lawrie
Loyola University Maryland, Baltimore MD 21210-2699, USA
{binkley, lawrie}

Keywords: source code analysis tools, natural language processing, program comprehension, identifier analysis

Recent software development tools have exploited the mining of natural language information found within software and its supporting documentation. To make the most of this information, researchers have drawn upon the work of the natural language processing community for tools and techniques. One such tool provides part-of-speech information, which finds application in improving the searching of software repositories and extracting domain information found in identifiers. Unfortunately, the natural language found in software differs from that found in standard prose. This difference potentially limits the effectiveness of off-the-shelf tools. The presented empirical investigation finds that this limitation can be partially overcome, resulting in a tagger that is up to 88% accurate when applied to source code identifiers. The investigation then uses the improved part-of-speech information to tag a large corpus of over 145,000 field names. From patterns in the tags, several rules emerge that seek to improve structure-field naming.

1 Introduction
Software engineering can benefit from leveraging tools and techniques of other disciplines. Traditionally, natural language processing (NLP) tools solve problems by processing the natural language found in documents such as news articles and web pages. One such NLP tool is a part-of-speech (POS) tagger. Tagging is, for example, crucial to Named-Entity Recognition [3], which enables information about a person to be tracked within and across documents. Many POS taggers are built using machine learning based on newswire training data. Conventional wisdom is that these taggers work well on newswire and similar artifacts; however, their effectiveness degrades as the input moves further away from the highly structured sentences found in traditional newswire articles.

The text available in source-code artifacts, in particular a program’s identifiers, has a very different structure. For example, the words of an identifier rarely form a grammatically correct sentence. This raises an interesting question: can an existing POS tagger be made to work well on the natural language found in source code? Better POS information would aid existing techniques that have used limited POS information to successfully improve retrieval results from software repositories [1, 11] and to investigate the comprehensibility of source code identifiers [4, 6].

Fortunately, machine learning techniques are robust and, as reported in Section 2, good results are obtained using several sentence-forming templates. This initial investigation also suggests rules specific to software that would improve tagging. For example, the type of a declared variable can be factored into its tags. As an example application of POS tagging for source code, the tagger is then used to tag over 145,000 structure-field names. Equivalence classes of tags are then examined to produce rules for the automatic identification of poor names (as described in Section 3) and to suggest improved names, the latter being left to future work.

Figure 1. Process for POS tagging of field names: Source Code ⇒ Source Code Mark-up ⇒ Extract Field Names ⇒ Split Field Names ⇒ Apply Template ⇒ Part of Speech Tagging.
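
The "split field names" and "apply template" stages of the process in Figure 1 can be illustrated with a minimal Python sketch. It assumes field names follow underscore or camelCase conventions. The function names and the template strings are hypothetical placeholders chosen for illustration; this preview does not list the templates used in the study. The resulting pseudo-sentence would then be passed to an off-the-shelf POS tagger, a step not shown here.

```python
import re
from typing import Dict, List

# Hypothetical sentence-forming templates (placeholders for illustration;
# not taken from the study). Each wraps the split words in a
# sentence-like context that a newswire-trained tagger may handle better.
TEMPLATES: Dict[str, str] = {
    "sentence": "{words}.",
    "list_item": "- {words}",
    "verb": "Please, {words}.",
    "noun": "{words} is a thing.",
}

# Zero-width boundaries: lower/digit followed by upper (fileName),
# or an acronym followed by a capitalized word (HTTPResponse).
_CAMEL_BOUNDARY = re.compile(
    r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
)


def split_identifier(name: str) -> List[str]:
    """Split a field name into lowercase words on underscores,
    other non-word characters, and camelCase boundaries."""
    words: List[str] = []
    for part in re.split(r"[_\W]+", name):
        if not part:
            continue
        words.extend(w.lower() for w in _CAMEL_BOUNDARY.split(part) if w)
    return words


def apply_template(words: List[str], template: str) -> str:
    """Embed the split words in the named sentence-forming template."""
    return TEMPLATES[template].format(words=" ".join(words))


if __name__ == "__main__":
    words = split_identifier("fileNameLength")
    print(words)                          # ['file', 'name', 'length']
    print(apply_template(words, "noun"))  # file name length is a thing.
```

Keeping the templates in a dictionary makes it straightforward to run the same field names through each template and compare the tagger's output across them.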

2 Part-of-Speech Tagging
Before a POS tagger’s output can be used as input to downstream SE tools, the POS tagger itself needs to be vetted. This section describes an experiment performed to test the accuracy of POS tagging on field names mined from source code. The process used for mining and tagging the fields is described first, followed by the empirical results from the experiment. Figure 1 shows the pipeline used for...