Advanced Knowledge Extraction from Webpages Using Natural Language Processing

Published: May 21, 2012

Dominic George, Rivin Jose, Raul Pinto, Suman Raina and Amiya Kumar Tripathy
kyo_dom@hotmail.com, rivinjose@gmai.com, raulpin101@gmail.com, aktripathy@iitb.ac.in, hi_sumanbhat@yahoo.com
Computer Engineering Department, Don Bosco Institute of Technology

Abstract

Information is being added to the World Wide Web at an unprecedented rate and with little organization. This disorganization limits access to data. The time required to find useful data has increased because the number of web pages of lesser significance has also grown exponentially. Through ‘Advanced Knowledge Extraction from WebPages using Natural Language Processing’ (AKEWNLP), the effective time required to find useful information can be significantly lowered. With ever-increasing data on the World Wide Web, AKEWNLP provides the only sustainable option for making optimum use of data resources.

Index Terms: Knowledge Extraction, Web Mining, Intelligent Search, Natural Language Processing.

1. Introduction

The Internet is composed of billions of pages, amounting to an astronomically large volume of data. The bulk of this data is not useful in its raw form. Hence there is a gap between the availability and the usability of data. The aim of a knowledge acquisition system is to create a filtration module that differentiates between useful data and unwanted noise.

The currently prevailing model of using the Internet for work is: [1] Access a search engine. [2] Enter a query. [3] The search engine returns a list of probable desired results. This list is produced by the page-ranking mechanism adopted by the search engine and is not necessarily in line with the requirements of the user. [4] The user has to visit several pages individually and manually separate the beneficial information from the rest. [5] This process is both time-consuming and inefficient.

We propose an alternative method to analyze data and extract useful information: [1] Enter a query. [2] The query is analyzed and every feasible path for returning a result is considered. [3] The most suitable method is then adopted. [4] Raw data is then scanned. [5] Depending on the nature of the result required, levels of filtration are applied. [6] Once the filtration has been applied, only a single value is returned to the user. [7] Thus the user no longer has to screen results individually.
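The proposed steps can be illustrated with a minimal sketch. The source does not say how the query is analyzed or what the filtration levels are, so everything here is an assumption made for illustration: the tiny stop-word list, the function names `analyze_query` and `answer`, and the choice of keyword overlap as the filtering criterion.

```python
import re

# Assumed, deliberately tiny stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "is", "of", "what", "in", "to", "and"}


def tokens(text):
    """Lower-case word tokens of a piece of text."""
    return re.findall(r"\w+", text.lower())


def analyze_query(query):
    """Step [2]: reduce the query to its content words (hypothetical analysis)."""
    return [w for w in tokens(query) if w not in STOP_WORDS]


def answer(query, passages):
    """Steps [4] to [6]: scan raw passages, filter them, return a single value."""
    keywords = set(analyze_query(query))
    # Filtration level 1 (assumed): drop passages sharing no keyword with the query.
    candidates = [p for p in passages if keywords & set(tokens(p))]
    if not candidates:
        return None
    # Filtration level 2 (assumed): keep only the passage with the largest overlap.
    return max(candidates, key=lambda p: len(keywords & set(tokens(p))))


passages = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "Bananas are yellow.",
]
print(answer("What is the capital of France?", passages))
```

Returning one passage instead of a ranked list mirrors step [6]; a real system would need a far richer analysis than keyword overlap.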

Information extraction aims to give people a more effective means of acquiring information, so that they can cope with the serious challenge brought about by the information explosion. Web information extraction is the process of extracting a specific category of information from a Web page, converting it into structured data, and writing it to a database that serves user queries. It is therefore necessary to design and develop a system that can automatically extract information from web pages. Text is the most important part of a Web page, so the information extracted is the text of the pages. The information users need is extracted from a large number of pages and then reorganized and put to use.
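As a rough sketch of this process, the following uses only the Python standard library to pull the visible text out of a page and write it to a database. The table layout, the function names, and the example URL are assumptions for illustration, not details taken from the system described here.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text content of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def store_page(conn, url, html):
    """Write the extracted text to a (hypothetical) 'pages' table."""
    conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)")
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
        (url, extract_text(html)),
    )
    conn.commit()


html = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
conn = sqlite3.connect(":memory:")
store_page(conn, "http://example.com/", html)  # placeholder URL
```

Once the text is in the table, user queries can be answered from the database without fetching the pages again.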

Using this system we have been able to improve the relative speed and efficiency of the knowledge extraction process. This in turn makes the system suitable for use in search engines, to give better results for user searches, and in other database-oriented systems where frequent searches are required.

2. Methodology and Design of System
Extracting text from the Web is a huge undertaking. Taking time and technology into account, it can be simplified according to the following scheme. First, the source files of the relevant web pages, those that match the key words, are stored in the message digest. Second, the files in the message digest are scanned and matched one by one to extract the important information of interest to the user....
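The two steps of this scheme might look roughly like the sketch below. The source does not define the structure of the message digest, so it is modeled here as a plain dictionary from page identifier to text; that model, the whole-word matching rule, and the names `build_store` and `scan_store` are all assumptions.

```python
import re


def words(text):
    """Set of lower-case word tokens in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))


def build_store(pages, keywords):
    """Step 1: keep only pages that contain at least one key word."""
    wanted = {k.lower() for k in keywords}
    return {url: text for url, text in pages.items() if wanted & words(text)}


def scan_store(store, keywords):
    """Step 2: scan stored pages one by one and collect matching sentences."""
    wanted = {k.lower() for k in keywords}
    hits = []
    for url, text in store.items():
        # Naive sentence split on ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if wanted & words(sentence):
                hits.append((url, sentence.strip()))
    return hits


pages = {
    "page_a": "Mumbai is a city. It has a large port.",
    "page_b": "Cats sleep a lot.",
}
store = build_store(pages, ["port"])
print(scan_store(store, ["port"]))
```

Matching on whole words instead of substrings keeps a key word such as "port" from matching "important".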