Software Architecture and Design Illuminated
Huỳnh Mạnh Đức
Nguyễn Phan Thanh
Đỗ Quyết Thắng
Phạm Tiến Mạnh
Trần Trọng Thức
A company would like to build a list of English vocabulary to help people learn the language. The company knows several online sources of English vocabulary, one of which is http://www.manythings.org/vocabulary/lists/l/. Your crawler is required to crawl the vocabulary data only. The company wants a crawler system that is easy to maintain, so that it can be adapted when requirements change to cover other information sites. It must be controllable both from a command prompt and from a graphical user interface. The system must prevent duplication in the collected data and infinite loops while downloading. It will run on a Linux/Unix operating system. Defining the data structure is part of the design task.
Report Requirement
Investigate the architecture of a crawler as described on Wikipedia and describe it in your report. (1 point)
Figure 1: High-level architecture of a standard Web crawler
The basic architecture of a standard crawler consists of four modules, including a multi-threaded downloader.
Each module has its own role and cooperates with the others. Together they implement the following steps:
• Acquire the URL of the web document to process from the Queue
• Download the web document
• Parse the document's content to extract the set of URL links to other resources and update the Queue
• Store the web document for further processing
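The four steps above can be sketched as a single loop. This is only a minimal single-threaded illustration, not the design of the report: the names `crawl`, `fetch`, `LinkExtractor` and `max_pages` are hypothetical, and the download step is passed in as a function so the loop can be tried without network access. A set of already-seen URLs is one simple way to meet the requirement of avoiding duplicated data and infinite loops.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Run the four-step loop; `fetch(url)` returns HTML text or None.

    `fetch` and `max_pages` are assumptions made for this sketch.
    """
    queue = deque([seed_url])
    seen = {seed_url}   # URLs already queued: prevents duplicates and loops
    store = {}          # url -> document text, kept for further processing

    while queue and len(store) < max_pages:
        url = queue.popleft()            # 1. acquire a URL from the queue
        html = fetch(url)                # 2. download the document
        if html is None:
            continue
        parser = LinkExtractor()         # 3. parse and extract links
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts so that
            # the same page is not queued under two different spellings.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)       # ... and update the queue
        store[url] = html                # 4. store the document
    return store
```

For a quick check, `fetch` can be the `get` method of a dictionary that maps URLs to HTML strings. A real downloader would replace it with an HTTP client, and would probably also restrict the queue to the target site.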
What we have described so far is the high-level architecture of a standard crawler. To make it more specific, each module needs to be broken down into parts with detailed functions. We will do this in the context of the specific requirements.
Besides having a good crawling strategy, a crawler also needs a highly optimized architecture. A basic...