May Phyu Htun
Computer University (Mandalay)
The large size and dynamic nature of the Web highlight the need for continuous support and updating of Web-based information retrieval systems. Crawlers facilitate this process by following the hyperlinks in Web pages to automatically download a partial snapshot of the Web. Traversing the web graph in breadth-first search order is a good crawling strategy. This system is intended to study a crawling infrastructure and the basic concepts of Web crawling. A web crawler application is then implemented using the breadth-first search technique. Breadth-first crawling checks each link on a page before proceeding to the next page. Thus, it crawls each link on the first page, then each link on the page reached through the first page's first link, and so on, until each level of links has been exhausted. While crawling the links of a URL address, the web pages are saved locally in a folder in MHTML format (Single File Web Page).
The Web is a very large collection of pages, and search engines serve as the primary mechanism for discovering its content. To provide search functionality, search engines use crawlers that automatically follow links to web pages and extract their content. Web crawlers are programs that exploit the graph structure of the Web to move from page to page. In their infancy, such programs were also called wanderers, robots, spiders, fish, and worms, words that are quite evocative of Web imagery. Crawling can be viewed as a graph search problem. The Web is seen as a large graph with pages as its nodes and hyperlinks as its edges. A web crawler moves from node to node by means of the hyperlinks that each node contains and that define the edges of the web graph. Therefore, many algorithms used in graph searching appear in web crawling, often in adapted form. Traversing the web graph in breadth-first search order is a good crawling strategy, as it tends to discover high-quality pages early in the crawl. In its simplest form, a crawler starts from a seed page and then uses the external links within it to reach other pages. The process repeats, with the new pages offering more external links to follow, until a sufficient number of pages have been identified or some higher-level objective is reached. There is a continual need for crawlers to help applications stay current as new pages are added and old ones are deleted or moved. Given a set of starting URLs, the web crawler downloads the corresponding documents. Saving a document in the ordinary Web page format creates a folder that contains an .htm file and all supporting files, such as images, sound files, cascading style sheets, scripts, and more. This format is suitable when the saved page is to be edited with FrontPage or another HTML editor and then posted to an existing Web site.
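The breadth-first traversal described above can be sketched in a few lines. This is a minimal illustration, not the implementation of the system: the names `bfs_crawl`, `get_links`, `max_pages`, `TOY_WEB`, and `toy_links` are hypothetical, and a small in-memory link graph stands in for real HTTP fetching and HTML parsing so that the example runs without network access.

```python
from collections import deque


def bfs_crawl(seed, get_links, max_pages=100):
    """Return pages in the order a breadth-first crawler would visit them.

    `get_links` is an assumed callable that maps a URL to the list of URLs
    it links to; a real crawler would fetch and parse the page here.
    """
    frontier = deque([seed])  # FIFO queue of discovered but unvisited URLs
    seen = {seed}             # every URL ever placed on the frontier
    order = []                # visit order, one entry per crawled page

    while frontier and len(order) < max_pages:
        url = frontier.popleft()      # oldest URL first gives level order
        order.append(url)
        for link in get_links(url):
            if link not in seen:      # never enqueue the same URL twice
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical toy link graph: page name -> pages it links to.
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["A"],
    "E": [],
}


def toy_links(url):
    """Look up outgoing links in the toy graph (stand-in for fetch + parse)."""
    return TOY_WEB.get(url, [])


if __name__ == "__main__":
    print(bfs_crawl("A", toy_links))  # ['A', 'B', 'C', 'D', 'E']
```

Because the frontier is a first-in, first-out queue, every page at one link distance from the seed is visited before any page at the next distance. Replacing the queue with a stack would turn the same loop into a depth-first crawl.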
An HTML document saved in MHTML format integrates inline graphics, applets, linked documents, and other supporting items referenced in the document. MHTML combines all the supporting files in the folder into one large file. This is convenient, but it may not be supported by all browsers.
The rest of this paper is organized as follows. Section (2) describes the related works of the system. Section (3) explains the general architecture of the system. Section (4) explains the design of the system. The implementation of the system is discussed in Section (5). The results of the system are evaluated in Section (6). Finally, the conclusion follows in Section (7).
Related works of the system
Gautam Pant, Padmini Srinivasan, and Filippo Menczer described how the crawler maintains a list of unvisited URLs called the frontier. Each crawling loop involves picking the next URL to crawl from the frontier, fetching the page corresponding to the URL through HTTP, parsing the retrieved page to extract the...