This report illustrates the architecture and design of a web spider program akin to Google. To implement the illustrated architecture, the HTTP Client interface is simulated. The scope of the implementation is limited to demonstrating the functionality that extracts embedded URIs from a web resource's HTML source code and builds a connectivity matrix of inlinks and outlinks to/from each URI. Eigensystem analysis of the connectivity matrix for the web spider program test sample is conducted to find the dominant eigenvalue and the corresponding eigenvector. The eigenvector provides the page rank of the web pages crawled by the web spider program.
The objective of this project is to develop a web spider program akin to Google search. The Web Spider parses the HTML markup in the web page source code and creates a queue of all @href links in the file. The spider visits the page URLs inserted in the queue sequentially. A connectivity matrix is created for the inlinks and outlinks of all web pages visited by the spider. Eigensystem analysis of the connectivity matrix is performed to determine the page ranks of web page URLs in the Search Engine Result Page.
Chapter 2 of the report describes the software architecture of the system components identified to build a web spider program. The three system components identified are the HTTP Client, the Web Spider and the Web Resource Database. The HTTP Client is simulated, and a C++ implementation of the other two components is developed.
Chapter 3 of the report presents the detailed design of the simulated HTTP Client and Server interfaces, the Web Spider program flow and the Web Resource Database elements.
Chapter 4 of the report describes the implementation of important design and analysis components: how the HTTP interface is simulated, how the connectivity matrix is built, how an eigensystem is defined and how the page rank is determined.
Chapter 5 presents the actual result of the eigensystem analysis for the test sample and also illustrates a snapshot of the connectivity matrix on a Microsoft Windows XP Professional console window.
A web spider encompasses the following tasks:
- Create an object instance for every unique web resource
- Download the web resource HTML source code file
- Parse the web resource file to find new outlinks
- Create a connectivity matrix for all unique URIs found by the web spider.
Referring to Figure 1, the URI at the arrowhead end is an outlink of the web resource at the tail end, and the URI at the tail end is an inlink of the web resource at the arrowhead end. Outlinks are anchor links embedded in the source content and programmed with the HTML @href attribute; they connect one web resource to another. Inlinks of a web resource are the URIs of web resources that embed the former's URI as an outlink.
The tasks described above are performed by two separate threads that share the web resource database.
HTTP Client – This program fetches a web resource from the Web Resource Database and downloads the file corresponding to this web resource from the internet. The thread downloads a web resource from the internet only if the time since last download is greater than or equal to the refresh interval.
Web Spider – This program parses the source code file associated with a web resource to find outlinks embedded in the file. This program builds an inlink and outlink connectivity matrix for a web resource. The program starts with the creation of first web resource instance for the input URI (Web Crawler, 2008).
Web Resource Database – A web resource is an entity with a unique Uniform Resource Identifier (URI). There is an instance of a web resource object for each unique URI in the Web Resource Database. The Web Resource Database is a singly linked list of web resource objects. The database is shared between two processes, the HTTP Client and the Web Spider. Both processes perform read and write operations on the database, so mutual exclusion is required to protect the critical section of the code, i.e. the Web Resource Database. The mutual exclusion between the two threads is provided with semaphores.
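As an illustration of the mutual exclusion described above, the following sketch guards a minimal version of the global web resource list. A std::mutex stands in for the Win32 semaphore, and the Node type and addUnique name are illustrative, not the actual implementation:

```cpp
#include <mutex>
#include <string>

// Minimal stand-in for a web resource node on the singly linked list.
struct Node { std::string uri; Node* next; };

Node*      gWRList = nullptr;  // global singly linked web resource list
std::mutex gWRLock;            // binary semaphore protecting the list

// Every read/write of the shared list happens inside the critical
// section, so the two threads never traverse or modify it concurrently.
void addUnique(const std::string& uri) {
    std::lock_guard<std::mutex> guard(gWRLock);  // enter critical section
    for (Node* n = gWRList; n; n = n->next)
        if (n->uri == uri) return;               // URI already present
    gWRList = new Node{uri, gWRList};            // prepend new resource
}                                                // lock released on exit
```

Both threads would call such a guarded accessor; the lock is held only for the duration of the list operation.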
The high level functionality of the two programs is illustrated in the figure below.
The HTTP interface for downloading the file from the web server is not implemented for the project. Therefore the following tasks are simulated (Fielding, 2007):
The mapping of a domain name to the web site root directory is held on the host web server, which maps the URI in the HTTP Request GET message to a file path in the web site root directory. This is simulated by creating a table, Tbl Domain Dir, of domain-name-to-root-directory mappings.
A Search Engine sends an HTTP Request GET message to retrieve a web resource from the web server. Content retrieval with an HTTP Request is simulated by mapping the URI to a file path on a local server; this mapping is based on the data present in the table Tbl Domain Dir.
The Search Engine processes the received HTTP GET Response message and stores the received web content in a file on a local server of the Search Engine. This is simulated by storing the file name in the web resource object.
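The simulated retrieval can be sketched as follows. The table contents, the uriToFilePath name and the fallback to index.html for a bare domain are assumptions made for illustration:

```cpp
#include <map>
#include <string>

// Tbl Domain Dir: domain name -> local root directory standing in for
// the web site (values here are the sample inputs from the report).
std::map<std::string, std::string> tblDomainDir = {
    {"www.example.com", "c:/temp/example"}
};

// Simulated HTTP GET: map a URI such as
// http://www.example.com/m1.html to c:/temp/example/m1.html.
std::string uriToFilePath(const std::string& uri) {
    const std::string scheme = "http://";
    if (uri.compare(0, scheme.size(), scheme) != 0)
        return "";                                   // only "http" URIs
    std::string rest  = uri.substr(scheme.size());   // host[/path]
    std::string::size_type slash = rest.find('/');
    std::string host = rest.substr(0, slash);
    std::string path = (slash == std::string::npos)
                           ? "/index.html"           // assumed default page
                           : rest.substr(slash);
    std::map<std::string, std::string>::const_iterator it =
        tblDomainDir.find(host);
    return (it == tblDomainDir.end()) ? "" : it->second + path;
}
```

The returned path is then read from disk in place of an HTTP Response body.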
The web spider program is written in the C++ programming language. Refer to Figure 2 for the flow chart of the web spider thread. The web spider program is a single-threaded Win32 console application.
The program accepts the start URL and the related inputs:
- Enter domain name : www.example.com
- Enter directory path : c:/temp/example
- Enter start URI : http://www.example.com
Display the connectivity matrix of the input web site.
- >URL :
- >File :
- >IN URL :
- >OUT URL :
The internal errors encountered by the Web spider program in crawling the web site or parsing the HTML source code of a web page are displayed on the console window.
The web spider program sequentially retrieves a web resource object from the WRDB and processes the HTML source file of the web resource. After processing all the web resource records in the global web resource list, the web spider program starts over again from the first web resource listNode.
Since the HTML source file corresponding to a web resource URI may be updated on the World Wide Web, this file is intermittently refreshed by the HTTP Client. To process the updated HTML source file, the web spider thread must re-visit the listNode corresponding to the web resource, and hence it iterates over the global web resource list.
Web Resource Database
The Web Resource Database (WRDB) is a singly linked list of C++ objects that are instances of class listNode. This global web resource list has a listNode for every unique web resource found by the web spider program. When an HTML file corresponding to a URI is parsed to find embedded @href values, the former URI is called the referringURI and the @href URI is called thisURI. The structure of the listNode is:
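The report does not reproduce the actual declaration, so the following is a plausible sketch of listNode based on the fields the surrounding text mentions (URI, local file name, refresh time and the two link lists); the field names are illustrative:

```cpp
#include <string>
#include <ctime>

struct linkNode;  // entry on an inlink/outlink list (described next)

// Plausible sketch of class listNode, one instance per unique URI.
class listNode {
public:
    std::string uri;          // unique URI of the web resource
    std::string fileName;     // local HTML source file for the URI
    std::time_t lastDownload; // time of the last (simulated) download
    linkNode*   inlinks;      // web resources embedding this URI as @href
    linkNode*   outlinks;     // web resources this URI embeds as @href
    listNode*   next;         // next node on the global web resource list
};
```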
The structure of a linkNode on web resource inlinks and outlinks list is:
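Again, the original declaration is not shown in the report; a minimal sketch consistent with the description (a pointer to the linked resource's listNode plus the list link) is:

```cpp
struct listNode;  // web resource node on the global list

// Plausible sketch of linkNode: each inlink/outlink entry only holds a
// pointer back to the listNode of the linked web resource, so all
// information about that resource is reachable through it.
struct linkNode {
    listNode* resource;  // the linked web resource's listNode
    linkNode* next;      // next entry on this inlink or outlink list
};
```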
Add new web resource – When a web resource URI is found in the source code of a web page, the uniqueness of thisURI is determined by comparing thisURI with the URIs of all web resources present in the WRDB. If no comparison matches, thisURI is unique, and a new web resource listNode instance for thisURI is created and added to the global web resource list (gWRList).
Since the HTTP Client thread is not implemented, the URI-to-file-name binding is stored in the web resource when the web resource is created. The HTML files corresponding to the URIs are present in the test example directory.
Add new inlink – The referringURI is an inlink to thisURI. Since thisURI may be embedded as an @href value in many web pages, thisURI can have more than one inlink. For any URI it must be possible to obtain all information about the web resources that provide inlinks to it. Therefore, when an embedded @href URI is found, an instance of linkNode is created for the referringURI and attached to the inlink list of thisURI. The linkNode contains the pointer to the listNode of the referringURI, and hence all information about the referringURI web resource is reachable from the inlink list of thisURI.
Example: In figure 1, an arrow from blog.html to index.html suggests that, blog.html is a referringURI and index.html is thisURI, i.e. blog.html is an inlink to index.html.
Add new outlink – The thisURI is an outlink from the referringURI. Since more than one @href may be embedded in a web page, the referringURI can have more than one outlink. For any URI it must be possible to obtain all information about the web resources that are its outlinks. Therefore, when an embedded @href URI is found, an instance of linkNode is created for thisURI and attached to the outlink list of the referringURI. The linkNode contains the pointer to the listNode of thisURI, and hence all information about the thisURI web resource is reachable from the outlink list of the referringURI.
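Both list updates for a single embedded @href can be sketched together. The simplified node types and the connect helper below are illustrative, not the actual code:

```cpp
struct linkNode;

// Simplified web resource node: just a URI and the two link lists.
struct listNode {
    const char* uri;
    linkNode*   inlinks;
    linkNode*   outlinks;
};

// Each link entry points back to the linked resource's listNode.
struct linkNode {
    listNode* resource;
    linkNode* next;
};

// One embedded @href wires up both connectivity lists: referringURI
// gains an outlink to thisURI, and thisURI gains an inlink from
// referringURI.
void connect(listNode* referring, listNode* thisNode) {
    thisNode->inlinks   = new linkNode{referring, thisNode->inlinks};
    referring->outlinks = new linkNode{thisNode, referring->outlinks};
}
```

For the example in Figure 1, connect(&blog, &index) would record blog.html as an inlink of index.html and index.html as an outlink of blog.html.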
The HTTP interface simulation is executed in the following manner:
A web site with the domain name www.example.com is created and stored in the directory C:\temp\example.
The domain name binding to the web site root directory is done in the main function of the web spider program. Both domain name and web site root directory path are provided as user input.
The connectivity matrix of the web pages visited by a Search Engine spider is a matrix where each cell represents a connection between the two nodes that identify its row and column. The cell value is either zero or non-zero, determined by the absence or presence of a hyperlink between the row and column nodes: the value is non-zero if the two nodes are connected and zero if they are not (Rodrigue, 2008).
If web page i has li ≥ 1 links to other web pages and web page i links to web page j, then the element in row i and column j of H is Hij = 1/li. Otherwise Hij = 0 (Wills, 2006).
The inlinks and outlinks of the sample web pages as found by the spider are listed in the table below.
Node | URI                                  | File Name     | OUT Links  | IN Links
N1   | http://www.example.com               | index.html    | N2, N3, N4 | N4
N2   | http://www.example.com/m1.html       | m1.html       | N3, N4     | N1, N3
N3   | http://www.example.com/services.html | services.html | N2         | N1, N2
N4   | http://www.example.com/blog.html     | blog.html     | N1         | N1, N2
The connectivity matrix of the directed graph in Figure 1 is a 4 × 4 matrix.
In order to find an eigenvector for this connectivity matrix an eigensystem is defined as follows:
The damped matrix used in the eigensystem is calculated from the connectivity matrix as:

G = αH + (1 − α)βv^T

where α is a damping factor, β is a column vector of all ones and v is the personalization vector. The personalization vector gives the probabilities with which a random web user selects the nodes in the row. The damping factor models the random web user's behaviour in the selection of a web page: a fraction α of web users select web pages according to the probabilities given in the connectivity matrix, while the remaining (1 − α) select a different strategy.
The Power Method of vector iteration is applied to determine the eigenvalue and eigenvector (Riordan, 2008; Garcia, 2006).
If H is an operator that acts upon a vector x such that H(x) = k * x, then x has been preserved by H apart from a scalar multiplier k: it is k times bigger than it was before H acted upon it. Therefore x is called an eigenvector of H with eigenvalue k (Riordan, 2008).
An initial eigenvector X is chosen for the calculation of the eigenvalue; its components must sum to one, i.e. X must satisfy:

Σi Xi = 1
The convergence criterion selected for the Power Method is λ = 1.
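A minimal sketch of the Power Method iteration follows; the names and the 1-norm scaling are illustrative, and the original program's exact stopping test is not reproduced here:

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Power Method: repeatedly apply x <- x*G (row vector times matrix),
// rescaling each pass. The scaling factor converges to the dominant
// eigenvalue and x to the corresponding eigenvector.
double powerMethod(const std::vector<std::vector<double>>& G,
                   std::vector<double>& x, int iterations) {
    double lambda = 0.0;
    std::size_t n = x.size();
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> y(n, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                y[j] += x[i] * G[i][j];            // y = x * G
        lambda = 0.0;                              // 1-norm of y as scale
        for (std::size_t j = 0; j < n; ++j) lambda += std::fabs(y[j]);
        for (std::size_t j = 0; j < n; ++j) y[j] /= lambda;
        x = y;
    }
    return lambda;  // estimate of the dominant eigenvalue
}
```

For a row-stochastic G and a probability vector x, the scale factor stays at 1, matching the convergence criterion above.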
Page rank determination
The page rank of a web page is an indication of the importance of that page compared to the other web pages found for the user's search query. The web pages are listed on the Search Engine Result Page in descending order of page rank value, i.e. the page with the highest page rank value is listed first.
The page rank of a web page is given by the value of the corresponding element in the eigenvector. If two web pages have the same page rank, some other factor, such as content newness (i.e. the publication date), may be selected to determine the order of the web page URIs in the Search Engine Result Page.
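The ordering step can be sketched as a simple descending sort of (URI, eigenvector component) pairs; the rankPages helper and the sample values in the test are illustrative:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Order web page URIs by their eigenvector component, highest first,
// as on a Search Engine Result Page.
std::vector<std::string> rankPages(
        std::vector<std::pair<std::string, double>> pages) {
    std::sort(pages.begin(), pages.end(),
              [](const std::pair<std::string, double>& a,
                 const std::pair<std::string, double>& b) {
                  return a.second > b.second;  // highest page rank first
              });
    std::vector<std::string> order;
    for (const std::pair<std::string, double>& p : pages)
        order.push_back(p.first);
    return order;
}
```

A tie-break on publication date, as suggested above, would simply extend the comparison function.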
Results & Analysis
The following data is used for the eigensystem analysis:
Damping factor α = 0.85
Personalization vector v =
Initial Eigenvector X =
This personalization vector has been chosen on the assumption that most random web users will find the web site home page first and may not necessarily traverse other links on the home page.
After three (3) iterations of the Power Method it is determined that the eigenvalue 0.9375 is the highest value. The eigenvector at this value is. The page rank of the web pages in www.example.com is determined as (1, 3, 3, 2) in the order N1, N2, N3 and N4 respectively. Based on this data it is also concluded that the blog is the second most popular web page on the web site after the home page.
The web spider program builds the same connectivity matrix whichever of the four test sample HTML file URIs is given as the start URI input. The following is a snapshot of the console window on a Microsoft Windows XP Professional system:
The implementation of the web spider program in the C++ programming language has been a learning exercise in the HyperText Transfer Protocol (HTTP), the functionality of a web spider and the eigensystem used to determine the page rank of the web pages found by the web spider. In order to present a working implementation of the web spider architecture drawn in this report, the functionality of the web spider was limited to the analysis of HTML elements with the @href attribute. In order to create a complete connectivity matrix of the web resources available on the World Wide Web, the HTML elements that carry URI-valued attributes such as @src, @cite and @href must also be considered.
To do so, when the HTML source code file is parsed, these elements must not be skipped and their URIs must be retrieved to form the complete connectivity matrix. Also, this implementation of the web spider program considers only "http" URIs; the scope of the web spider can be expanded to consider "https" and other session protocols.
The eigensystem akin to Google's is defined in order to calculate the page rank of the unique URIs found by the web spider. The analysis of the connectivity matrix brings forth interesting data about user behaviour. When a personalization vector is chosen with the assumption that only the web site home page is visited by random web users, and the Power Method is applied to the connectivity matrix to determine the eigenvalue and eigenvector, it is found that the blog has a higher rank than the other two web pages, services.html and m1.html. Based on the application of the damping factor to the connectivity matrix, it can be concluded that the random behaviour of web users affects the page rank of blog.html. A possible reason for this effect on the page rank could be a syndication service, such as an Atom/RSS feed, available for the blog.
A Guide to Harvard Referencing. (2005). Leeds Metropolitan University. Web.
Centrality. (2008). Wikipedia. Web.
Fielding, R. et al. (2007). Hypertext Transfer Protocol – HTTP/1.1. W3C. Web.
Garcia, E. (2006). Matrix Tutorial 3: Eigenvalues and Eigenvectors. Mi Islita. Web.
Raggett, D., Le Hors, A. & Jacobs, I. (1999). HTML 4.01 Specification. W3C. Web.
Riordan, S. (2008). What are Eigenvalues? Physlink. Web.
Rodrigue, J.-P. (2008). Connectivity Matrix. Dept. of Economics & Geography, Hofstra University. Web.
Web crawler. (2008). Wikipedia. Web.
Wills, R. S. (2006). Google's PageRank: The Math Behind the Search Engine. North Carolina State University. Web.