Part of Speech Recognizer

Improving Identifier Informativeness using Part of Speech Information
Dave Binkley, Matthew Hearn, Dawn Lawrie
Loyola University Maryland, Baltimore MD 21210-2699, USA
{binkley, lawrie}@cs.loyola.edu, mthearn@loyola.edu
Keywords: source code analysis tools, natural language processing, program comprehension, identifier analysis

Abstract
Recent software development tools have exploited the mining of natural language information found within software and its supporting documentation. To make the most of this information, researchers have drawn upon the work of the natural language processing community for tools and techniques. One such tool provides part-of-speech information, which finds application in improving the searching of software repositories and extracting domain information found in identifiers. Unfortunately, the natural language found in software differs from that found in standard prose. This difference potentially limits the effectiveness of off-the-shelf tools. The presented empirical investigation finds that this limitation can be partially overcome, resulting in a tagger that is up to 88% accurate when applied to source code identifiers. The investigation then uses the improved part-of-speech information to tag a large corpus of over 145,000 field names. From patterns in the tags, several rules emerge that seek to improve structure-field naming.

Source Code ⇒ Source Code Mark-up ⇒ Extract Field Names ⇒ Split Field Names ⇒ Apply Template ⇒ Part of Speech Tagging

Figure 1. Process for POS tagging of field names.

The text available in source-code artifacts, in particular a program’s identifiers, has a structure very different from that of standard prose. For example, the words of an identifier rarely form a grammatically correct sentence. This raises an interesting question: can an existing POS tagger be made to work well on the natural language found in source code? Better POS information would aid existing techniques that have used limited POS information to successfully improve retrieval
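The middle steps of Figure 1 (splitting field names and applying a template) can be illustrated with a minimal sketch. It assumes the field names have already been extracted from the marked-up source, and it stops before the tagging step, which would be handled by an external POS tagger. The function names, the splitting heuristic, and the template string `"{words}."` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import re


def split_field_name(name):
    """Split a field name into lowercase words.

    Assumed heuristic: break on any non-letter character (underscores,
    digits), then on camelCase boundaries and runs of capitals.
    """
    words = []
    for part in re.split(r"[^A-Za-z]+", name):
        # "XMLParser" -> "XML", "Parser"; "maxSize" -> "max", "Size"
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def apply_template(words, template="{words}."):
    """Embed the split words in a (hypothetical) sentence-like template
    so that a prose-trained tagger receives something closer to a sentence."""
    return template.format(words=" ".join(words))


def prepare_for_tagging(field_names):
    """Map each field name to the text that would be passed to a POS tagger."""
    return {name: apply_template(split_field_name(name)) for name in field_names}


if __name__ == "__main__":
    for name, text in prepare_for_tagging(["maxBufferSize", "next_node_ptr"]).items():
        print(name, "->", text)
```

Keeping the template as a parameter makes it easy to compare how different surrounding text affects the tags an off-the-shelf tagger assigns to the same words.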

References
[1] S. L. Abebe and P. Tonella. Natural language parsing of program element names for concept extraction. In 18th IEEE International Conference on Program Comprehension. IEEE, 2010.
[2] K. Atkinson. Spell checking oriented word lists (SCOWL).
[3] E. Boschee, R. Weischedel, and A. Zamanian. Automatic information extraction. In Proceedings of the International Conference on Intelligence Analysis, 2005.
[4] B. Caprile and P. Tonella. Restructuring program identifier names. In ICSM, 2000.
[5] M. L. Collard, H. H. Kagdi, and J. I. Maletic. An XML-based lightweight C++ fact extractor. In 11th IEEE International Workshop on Program Comprehension, pages 134–143, 2003.
[6] E. Høst and B. Østvold. The programmer’s lexicon, volume I: The verbs. In International Working Conference on Source Code Analysis and Manipulation, Beijing, China, September 2008.
[7] E. W. Høst and B. M. Østvold. Debugging method names. In ECOOP 09. Springer Berlin / Heidelberg, 2009.
[8] J. Jiang and C. Zhai. Instance weighting for domain adaptation in NLP. In ACL 2007, 2007.
[9] D. Lawrie, D. Binkley, and C. Morrell. Normalizing source code vocabulary. In Proceedings of the 17th Working Conference on Reverse Engineering, 2010.
[10] L. Shen, G. Satta, and A. K. Joshi. Guided learning for bidirectional sequence classification. In ACL 07. ACL, June 2007.
[11] D. Shepherd, Z. P. Fry, E. Hill, L. Pollock, and K. Vijay-Shanker. Using natural language program analysis to locate and understand action-oriented concerns. In AOSD 07. ACM, March 2007.
[12] K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL 2003, 2003.

4 Related Work
This section briefly reviews three projects that use POS information. Each uses an off-the-shelf POS tagger or lookup table. First, Høst et al. study the naming of Java methods using a lookup table to assign POS tags [7]. Their aim is to find what they call “naming bugs” by checking whether a method’s name properly reflects its implementation. Second, Abebe and Tonella study class, method, and attribute names using a POS tagger based on a modification of minipar to formulate domain concepts [1]. Nouns in the identifiers are examined to form ontological relations between concepts. In a case study, their approach improved concept searching. Finally, Shepherd et al. consider finding concepts in code using natural language information [11]. The resulting Find-Concept tool locates action-oriented concerns more effectively than the other tools and with less user effort. This is made possible by POS information applied to source code.
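To make the lookup-table style of tagging mentioned above concrete, the following is a minimal sketch of the general idea only. The table contents, the tag names, and the function `tag_words` are invented for illustration; they are not the table or tag set used in [7].

```python
# Hypothetical word -> POS lookup table (illustrative entries only).
POS_TABLE = {
    "get": "verb",
    "set": "verb",
    "is": "verb",
    "user": "noun",
    "name": "noun",
    "empty": "adjective",
}


def tag_words(words, table=POS_TABLE, default="unknown"):
    """Assign each word the tag stored in the table, ignoring case.

    Words missing from the table receive the default tag; a lookup
    table has no context, so every occurrence of a word gets the same tag.
    """
    return [(word, table.get(word.lower(), default)) for word in words]


if __name__ == "__main__":
    print(tag_words(["get", "user", "name"]))
```

The sketch also shows the main limitation of a pure lookup: a word such as "set", which can be a noun or a verb, always receives the single tag stored in the table, which is one reason a context-aware tagger can be attractive.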
