Recent software development tools have exploited the mining of natural language information found within software and its supporting documentation. To make the most of this information, researchers have drawn upon the work of the natural language processing community for tools and techniques. One such tool provides part-of-speech information, which has been applied to improve the searching of software repositories and to extract domain information found in identifiers. Unfortunately, the natural language found in software differs from that found in standard prose. This difference potentially limits the effectiveness of off-the-shelf tools. The presented empirical investigation finds that this limitation can be partially overcome, resulting in a tagger that is up to 88% accurate when applied to source code identifiers. The investigation then uses the improved part-of-speech information to tag a large corpus of over 145,000 field names. From patterns in the tags, several rules emerge that seek to improve structure-field naming.
Source Code ⇒ Source Code Mark-up ⇒ Extract Field Names ⇒ Split Field Names ⇒ Apply Template ⇒ Part of Speech Tagging
Figure 1. Process for POS tagging of field names.

The text available in source-code artifacts, in particular a program’s identifiers, has a very different structure. For example, the words of an identifier rarely form a grammatically correct sentence. This raises an interesting question: can an existing POS tagger be made to work well on the natural language found in source code? Better POS information would aid existing techniques that have used limited POS information to successfully improve retrieval results from software repositories [1, 11] and to investigate the comprehensibility of source code identifiers [4, 6]. Fortunately, machine learning techniques are robust and, as reported in Section 2, good results are obtained using several sentence-forming templates. This initial investigation also suggests software-specific rules that would improve tagging. For example, the type of a declared variable can be factored into its tags. As an example application of POS tagging for source code, the tagger is then used to tag over 145,000 structure-field names. Equivalence classes of tags are then examined to produce rules for the automatic identification of poor names (as described in Section 3); suggesting improved names is left to future work.
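The idea of wrapping a split field name in a sentence-forming template can be illustrated with a minimal sketch. The function names `split_identifier` and `apply_template`, the splitting heuristic, and the two template strings are hypothetical and chosen only for illustration; they are not taken from the study. The resulting sentence would then be passed to an off-the-shelf POS tagger, a step omitted here.

```python
import re

# Hypothetical templates (assumptions for illustration only): each one
# wraps the words of a field name in sentence-like context for a tagger.
TEMPLATES = {
    "sentence": "{words}.",
    "noun": "{words} is a thing.",
}


def split_identifier(name):
    """Split a field name on underscores and camelCase boundaries."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs (e.g. "XML"), capitalized or lowercase words, digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def apply_template(words, template="sentence"):
    """Embed the split words in the chosen sentence-forming template."""
    return TEMPLATES[template].format(words=" ".join(words))


if __name__ == "__main__":
    for field in ["maxItemCount", "next_free_slot", "XMLParser"]:
        print(apply_template(split_identifier(field), "noun"))
```

Under these assumptions, a field such as `maxItemCount` becomes the word sequence "max item count" before the template is applied, which gives a tagger trained on prose some grammatical context to work with.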
Software engineering can benefit from leveraging tools and techniques of other disciplines. Traditionally, natural language processing (NLP) tools solve problems by processing the natural language found in documents such as news articles and web pages. One such NLP tool is a part-of-speech (POS) tagger. Tagging is, for example, crucial to Named-Entity Recognition, which enables information about a person to be tracked within and across documents. Many POS taggers are built using machine learning based on newswire training data. Conventional wisdom is that these taggers work well on newswire and similar artifacts; however, their effectiveness degrades as the input moves further away from the highly structured sentences found in traditional newswire articles.
2 Part-of-Speech Tagging
Before a POS tagger’s output can be used as input to downstream SE tools, the POS tagger itself needs to be vetted. This section describes an experiment performed to test the accuracy of POS tagging on field names mined from source code. The process used for mining and tagging the fields is described first, followed by the empirical results from the experiment. Figure 1 shows the pipeline used for...