Welcome to WebGrapher. Below you will find instructions for the installation and use of WebGrapher, as well as a short summary of WebGrapher's license (for full licensing terms, please see the license/ directory within the downloaded package). This readme file is always available from the WebGrapher Web site.
In order to install and run WebGrapher you must have Java 1.5 and AT&T's GraphViz installed. Both of these programs can be downloaded from the WebGrapher Web site. In addition, we recommend the following specifications for the computer on which you run WebGrapher:
Below is a short description of the different WebGrapher downloads.
WebGrapher with Complete Source
Everything can be compiled from the root directory via the "compile" shell program or the "compile.bat" program if you're in Windows. The "cleanup" or "cleanup.bat" file will remove all compiled .class files. The remaining shell and batch files will be described below.
WebGrapher, Binaries Only
Download this version if you just want to run the program and don't need the program source.
Both of these versions are zip files. Simply download the zip file and then unzip it in a convenient directory. If you're using the source version, then you need to run the "compile" or "compile.bat" program to compile the files before continuing. Also don't forget that you must have Java 1.5 and AT&T's GraphViz installed.
When you unzipped the zip file, it created a WebGrapher directory. If you're not already in the root of this directory, go there now. You'll notice, in addition to several directories, some runnable files called "run-crawler", "run-display" and "FULL-EXAMPLE". There are in fact two versions of each of these files: one with a ".bat" extension, and one without. The ones without the ".bat" extension are shell files and should be used within a Unix/Linux environment. The batch file versions should be used under Windows. From this point forward, I won't distinguish between the two (I'll only give the name of the file without any extension), but you should use the appropriate file depending upon your OS.
WebGrapher has two main components: a Web crawler component that crawls a Web site looking for links and reports any errors that it finds, and a graphing component. You will always need to run the Web crawler portion of the program; however, the graphing portion is optional. Some users may only want to know about errors on their Web site, and for them, using the crawler component is enough. Others may want to see a visual depiction of their Web site, in which case they will need to use both the crawler and the graphing components.
Using the Web Crawler
If you run the "run-crawler" program, you'll see the crawler's syntax. It should look like the following:

You'll notice that the crawler takes a number of parameters. You can pass these parameters on the command line to run-crawler, or alternatively you can make another shell/batch file that you click on that calls run-crawler. In either case, you need to pass at least two parameters so that the crawler knows what to do:
The URL of the Web site that you want to crawl is required. This must be an absolute URL. It can be a simple domain name, such as "http://www.mysite.com", or it can be a particular page in a domain, such as "http://www.mysite.com/drafts/thirdpage.htm". In either case, the crawler will start at the URL you give and then "spread out" from there, following links from your initial URL to other pages on your Web site.
The crawler will find links that go to other domain names, but it will not crawl those links; it stays on the initial domain. For example, given "http://www.mysite.com", it may find links to "http://www.mysite.com/page1" and "http://www.mysite.com/page1/movies/movies.htm" and scan those files for more links, but if it finds a link to "http://www.anothersite.com", it will not scan it. In other words, it won't leave your domain.
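The "stay on the initial domain" rule described above can be sketched in a few lines of Python. This is only an illustration, not the crawler's own code: the helper name same_host is hypothetical, and comparing host names exactly is an assumption (the readme does not say, for instance, whether "mysite.com" and "www.mysite.com" count as the same domain).

```python
from urllib.parse import urlsplit


def same_host(start_url, link):
    """Return True when link points at the same host as start_url.

    Assumption: hosts are compared exactly, so "mysite.com" and
    "www.mysite.com" are treated as different hosts.
    """
    return urlsplit(start_url).hostname == urlsplit(link).hostname


start = "http://www.mysite.com"
for link in (
    "http://www.mysite.com/page1",
    "http://www.mysite.com/page1/movies/movies.htm",
    "http://www.anothersite.com",
):
    # Links on another host are still recorded, but never scanned.
    action = "scan for more links" if same_host(start, link) else "record only"
    print(link, "->", action)
```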
Also, the crawler respects the "no robots" standard. The "no robots" standard allows a Web site owner to tell robots scanning her Web site that she doesn't want them scanning certain files or directories. The crawler will detect robots.txt files and respect the restrictions they specify. If you are the owner of a Web site and you want the crawler to scan files/directories forbidden by a "robots.txt" file, then you'll have to temporarily modify that file in order for the crawler to scan them.
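To see what such exclusion rules do in practice, here is a small sketch using Python's standard urllib.robotparser module. It is not how WebGrapher itself checks the rules; the two rule lines and the helper name allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion rules: every robot is asked to stay out of /drafts/.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /drafts/",
])


def allowed(url):
    """Return True when the rules above let a generic robot fetch url."""
    return rules.can_fetch("*", url)


for url in (
    "http://www.mysite.com/page1",
    "http://www.mysite.com/drafts/thirdpage.htm",
):
    print(url, "->", "scan" if allowed(url) else "skip")
```

With rules like these, a well-behaved crawler would skip everything under /drafts/ until the owner relaxes the rule.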
The output filename is required. As the crawler scans links, it adds these links to an output file that is used by the graphing program to display a visual depiction of your Web site. You need to specify a filename even if you don't plan on viewing the visual graph right away. (You may find you want to view it later, in which case, the output will be saved.)
You can enter any filename, but the file that is actually created will have a ".dot" extension on the end. You should not add this extension to the filename; the program will do it automatically. Note also that the crawler will not prompt before overwriting a previous file. If you enter a filename and an output file with that name already exists, it will be overwritten.
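For readers who have not seen a dot file before, the sketch below writes a tiny one in Python. It only illustrates the general GraphViz dot syntax and the two behaviors described above (the ".dot" extension is appended, and an existing file is overwritten without a prompt). The function name write_dot is hypothetical, and the exact attributes that WebGrapher writes into its output are not documented here.

```python
import os
import tempfile


def write_dot(base_name, edges):
    """Write (source, target) pairs as a dot digraph; return the file path.

    The ".dot" extension is appended to base_name, and mode "w" replaces
    any existing file of that name without asking.
    """
    path = base_name + ".dot"
    with open(path, "w", encoding="utf-8") as out:
        out.write("digraph site {\n")
        for source, target in edges:
            out.write(f'    "{source}" -> "{target}";\n')
        out.write("}\n")
    return path


edges = [("http://www.mysite.com", "http://www.mysite.com/page1")]
path = write_dot(os.path.join(tempfile.mkdtemp(), "mysite"), edges)
with open(path, encoding="utf-8") as result:
    print(result.read())
```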
The crawler scans links in breadth-first-search rather than depth-first-search order. That means that it will add every link on a page to its output before retrieving the next page, and the next page it retrieves will be on the same "branching level". It is smart enough to not get stuck in circular links (A->B->A). Keep this search order in mind as we explore some of the other options that you can pass to the crawler. Only the URL and filename are required; the options below are all optional.
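The breadth-first order and the protection against circular links can be sketched as follows. This is a simplified model, not the crawler's code: the Web site is a plain dictionary mapping each page to the links on it, the function name crawl_order is hypothetical, and the max_links argument is an assumed stand-in for a limit on the number of links found.

```python
from collections import deque


def crawl_order(links, start, max_links=1_000_000):
    """Return the (page, target) links in the order a breadth-first crawl finds them."""
    found = []
    visited = {start}       # pages already queued, so A->B->A cannot loop
    queue = deque([start])  # pages waiting to be retrieved, oldest first
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if len(found) >= max_links:
                return found
            found.append((page, target))  # every link on the page is recorded
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return found


# A links to B and C; B links back to A (a circular link) and on to D.
site = {"A": ["B", "C"], "B": ["A", "D"], "C": [], "D": []}
print(crawl_order(site, "A"))
# [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'D')]
```

Both links on page A are recorded before page B is retrieved, and the link from B back to A is recorded without A being visited a second time.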
The -max parameter specifies the maximum number of links you want to find. After the crawler encounters this maximum number, it will halt. For example, if you pass in -max=10, the crawler will find up to 10 links and then stop. It might stop before 10 if there are fewer than 10 links on the Web site, but it will not exceed 10. It's often a good idea to specify this parameter if you're crawling a really huge site and you don't need to scan every link, so that the crawler finishes in a reasonable amount of time and the output file does not grow too large. If -max is not given, then the default maximum of one million links is used. Also, -max cannot exceed one million, so at present the crawler cannot be used to scan more than a million links at a time. (And quite frankly, you probably wouldn't want to.)
The -br parameter tells the crawler the maximum number of broken links it should find before halting. So for example, if -br=5, then the crawler will find up to 5 broken links before halting; it will of course report every broken link to stderr. If unspecified, the crawler will find up to the maximum number of links (one million if unspecified). Note that a broken link is a link where the target is missing. So for example, if the crawler finds a link "./mypage.htm" and there is no such page on your Web site, then it will report it as a broken link.
The -inv parameter works the same way as the -br parameter, except that it tracks the number of invalid links. An invalid link is a link whose syntax is incorrect. For example, if the link includes some invalid character, it will be reported as an invalid link. By definition, invalid links are also broken links, since a link that is formatted improperly could never "work"; but we report them as invalid because that gives you more information--it lets you know something has been entered wrong, rather than it being just a matter of a missing file.
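The difference between a broken link and an invalid link can be sketched like this. The rules below are assumptions made for illustration only: "invalid" is taken to mean that the link contains a character that RFC 3986 does not permit in a URI, and a set of known pages stands in for actually requesting the target over the network. The crawler's real checks may differ.

```python
import re
from urllib.parse import urljoin

# Characters RFC 3986 permits in a URI reference (unreserved, reserved, "%").
_ALLOWED = re.compile(r"^[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*$")


def classify(base, link, existing):
    """Classify link, found on page base, as "invalid", "broken" or "ok".

    existing is the set of absolute URLs that really exist; it replaces
    the network request a real crawler would make.
    """
    if not _ALLOWED.match(link):
        return "invalid"  # the syntax is wrong, so it could never work
    if urljoin(base, link) not in existing:
        return "broken"   # the syntax is fine, but the target is missing
    return "ok"


pages = {"http://www.mysite.com/index.htm"}
for link in ("./index.htm", "./mypage.htm", "./my page<1>.htm"):
    print(link, "->", classify("http://www.mysite.com/", link, pages))
```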
The -q option means "quiet". By default, the crawler will report every link it finds to stdout. If you specify -q, then the crawler will only report processing info, such as finding a broken or invalid link, or alternatively finding no errors at all. But error information (broken and invalid links) will always be reported, regardless of whether or not -q is set.
The -i option forces the crawler to add links in the output file for images. By default, the crawler will not add links to images to the output file (nor will it report them). The reason is that most Web pages have many, many images, and including them all in the output file tends to clutter up the resulting graph. But you can force the inclusion of image links by using this option. Also, at present you need to include the -i option if you want the crawler to check for missing (broken) or invalid image links.
For the exact syntax of the crawler, see the picture above. Under Windows, make sure to put every parameter you pass inside quote marks.
The WebGrapher team would like to thank Jef Poskanzer, whose crawler program has been integrated here, with only modest modifications. Thank you Jef.
Using the Display Program
If you run "run-display", you'll see an error message that says "You must supply the name of the input file (something.dot)." In fact, that's the only parameter you need to supply. Recall that the .dot file is the output produced by the crawler, so you need to have run the crawler at some point before running the display program.
The format of the filename parameter you pass to "run-display" must either be an absolute path to the file, or a relative path from the perspective of the display program. For an example of using a relative path, see the source for the included "FULL-EXAMPLE" program. The FULL-EXAMPLE program runs a complete example, from crawling the Web and producing an output file, to viewing that output file with the display program.
After passing a valid filespec to the display program, the display program will parse the dot file and produce a visual graph representing the scan of your Web site. You may see a "reformatting layout--please wait" message when the program first starts. It shouldn't take long. Then an overview picture of your graph will be displayed. This is the "birds-eye" view; it shows you the general layout of your graph (of your Web site), but not the details of the links. Note that the program will start out as a small window, but you can maximize the window by using the icon at the upper right of the title bar with the arrow pointing northeast. Do so now.
Next, take the mouse, move it over the links, and hold it on top of one for a few seconds. It's best if you've just run the FULL-EXAMPLE program and are now looking at the resulting graph. Move your mouse until it's on top of the "big black rectangle". The black rectangle represents your root node, the initial URL that you entered into the crawler. If you hold the mouse over it, you should see "Node: http://www.yahoo.com" pop up for the root node in the FULL-EXAMPLE graph. These little pop-ups are called tooltips.
Next, move your mouse over to some of the circular nodes at the right and hold it there. You'll see the details of those links as well. Those are the files/targets pointed to by the root node. Now, carefully move and hold your mouse over the edges that connect the different nodes. For instance, there should be an edge coming out of the root node and leading to some other node. If you put the mouse in the right place, you'll see the details of this edge (of this link). Note that the terminal node (target) and the edge may not report the information in exactly the same way. This is because nodes (the root node and all others) give the URL in absolute form, but the edge gives the link in its original form.
If the link (edge) is a relative link, it will appear in blue; otherwise, it's an absolute link and it will appear in black. If there are any broken links, they will appear as red rectangles, and invalid links will appear as orange rectangles. All other nodes are white circles.
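The reason a node and the edge leading to it can show different text is easy to demonstrate with Python's standard urljoin function. The sketch below is illustrative only: the function name describe_edge and the dictionary keys are hypothetical, and treating any link without a scheme (such as "http:") as relative is an assumption about how the distinction is made.

```python
from urllib.parse import urljoin, urlsplit


def describe_edge(page_url, href):
    """Show how one link could appear on its edge and on its target node."""
    is_relative = not urlsplit(href).scheme  # no "http:" part means relative
    return {
        "edge_label": href,                       # the link as written on the page
        "edge_color": "blue" if is_relative else "black",
        "node_label": urljoin(page_url, href),    # the same link in absolute form
    }


page = "http://www.mysite.com/drafts/thirdpage.htm"
print(describe_edge(page, "../page1"))
print(describe_edge(page, "http://www.mysite.com/page1"))
```

Both links lead to the same node, "http://www.mysite.com/page1", but the first edge keeps its original relative text "../page1".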
Now, very large graphs will contain a lot of nodes and edges, and often you'll want to zoom in on details. You can zoom in one level by right clicking on the "blank" part of the graph and choosing "Zoom In"; likewise, you can zoom out by right clicking and choosing "Zoom Out". To get to the highest zoom level (the greatest level of detail), right click on a blank part of the graph and choose "Reset Zoom". To get back to the birds-eye view, right click and choose "Scale to Fit". If at any time the graph is bigger than the window, there will be scroll bars that you can use to move around the graph. Also, tooltips are active at all times unless you turn them off (via a right-click option).
There are other options available from the right-click menu, but they are not supported at this time and your results may vary. The right-click Print option may or may not work; however, there are four buttons at the left-hand side of the window, one of which is labeled "Print". You can click on this to produce a PostScript file that can be printed using any PostScript program. Note that the printout of a graph is large, and the resulting printout will include several pages that you have to assemble to recreate the graph. Each page has a number on the corner (x,y) that indicates its position in the final output; using these coordinates, you can tape all the pages together to create a print version of your graph. You can quit the display program at any time by clicking the "Quit" button on the left or the close icon at the top right of the window.
The display program used is Grappa, a program by John Mocenigo of AT&T, modified only very slightly. The WebGrapher team would like to thank John and AT&T for this awesome program.
This display program is not the only one you can use. The GraphViz software comes with a program called Dotty that you can also use to read your "filename.dot" output file. With Dotty, you can also move nodes around, delete nodes and add nodes. So Dotty is a nice tool for editing; but we have found that the Grappa program integrated into WebGrapher produces a nicer-looking graph. There are also other programs on the Internet that can read dot files; any program that supports the dot-file standard should be able to read the dot file produced by WebGrapher.
This is a short summary of WebGrapher's license. For the full license, please refer to the documentation provided in the license/ directory in your download.
WebGrapher v0.1 was created by the team Group 5 in class CIS422, a class offered in the Spring of 2005 by the Computer and Information Science department at the University of Oregon. This class was taught by Dr. Stuart Faulk, who is in no way responsible for the code, but the developers are grateful for his guidance. The members of Group 5 are:
Tim Barker
WebGrapher is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. To view a copy of this license, visit this link or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA. The WebGrapher program integrates software by Jef Poskanzer, John Mocenigo and AT&T; as a user of WebGrapher, you agree to the licenses for these software programs as well; see the license/ directory in the download for more details.
We hope that you find this product useful and welcome any comments or suggestions. Please contact Mike here.