Broken Links Retrieval: 1 - Presentation
by Lionel
Part 1 - Presentation
Abstract: in this article, we present a method for identifying
broken links in a Domino Intranet web site. Our method relies on a Java
program that automatically collects links and checks their validity. Let's
start by explaining the need for link verification, the different methods
for checking links, and the choice of tool.
1. The Scenario
Our web site belongs to an Intranet Portal. It is based on a publication
template that allows authorized end-users to publish various pieces of
information: files (Word, Excel, PDF, ...), web pages (html, images, ...),
texts, and links. Publication occurs through forms and, despite basic
field validation, users are free to enter the content they want.
The links that we want to verify are not those that are part of the publishing
application, such as a link to a view that shows recent updates. A broken
link of that kind should be treated as a bug. Resolving it is a
well-known process and is outside the scope of this article.
We will only focus on links entered by end-users in their publishing
activity.
2. Understanding Broken Links
Usually, broken links correspond to HTTP requests that return status
code 404 (Not Found). To learn more about status codes, see the
W3C web site (for example: HTTP Status Codes).
In this article, we assume that all status codes equal to or higher
than 400 indicate broken links, except code 401. 401 (Unauthorized)
means that the user must first authenticate in order to view the page.
A 401 code makes it impossible to decide whether the link is valid or not.
Such links should be reported as warnings, not as failures.
Including 5xx errors in our definition of broken links might be regarded
as very restrictive. In fact, we extend the definition of a broken link
to everything that results in a failure for the end-user. If a user cannot
see a page that he or she is authorized to view, we regard this as a broken link.
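The rule above fits in a few lines of Java. The following is only a sketch of that rule: the class name LinkStatus and the Verdict enum are illustrative names of our own, not part of Domino or of any library.

```java
/** Classifies an HTTP status code with the rule described in this section. */
class LinkStatus {

    /** Illustrative names: OK, WARNING (cannot decide), BROKEN (failure). */
    enum Verdict { OK, WARNING, BROKEN }

    static Verdict classify(int statusCode) {
        if (statusCode == 401) {
            // Authentication required: we cannot tell whether the link is valid.
            return Verdict.WARNING;
        }
        if (statusCode >= 400) {
            // 4xx and 5xx both end in a failure for the end-user.
            return Verdict.BROKEN;
        }
        return Verdict.OK;
    }

    public static void main(String[] args) {
        int[] samples = {200, 302, 401, 404, 500};
        for (int code : samples) {
            System.out.println(code + " -> " + classify(code));
        }
    }
}
```

Keeping the rule in one method makes it easy to tighten or relax later, for example if 5xx errors turn out to be too noisy on a given server.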
3. Overview of our Method
The most practical method for verifying the validity of a link is
to send the HTTP request and analyze the status code returned in the response.
By automating this activity, we can dramatically improve the overall quality
of the Intranet sites' content.
The basic description of an automatic process is:
- For each page: collect the links inside the page
- For each link: send the request and store the status code (we work
by exception, retaining only the errors)
- After completion, prepare personalized reports that contain the errors
and a link for correction
- Notify authors or editorial managers by e-mail
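The first two steps of this process can be sketched as a pair of nested loops. This is a simplified illustration, not the final program: the class name LinkAudit is hypothetical, and the status lookup is passed in as a function (an assumption made here so that the sketch runs without a network). Reporting and e-mail notification are left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Sketch of the audit loop: pages -> links -> status codes, errors only. */
class LinkAudit {

    /**
     * @param linksByPage page identifier mapped to the links found in that page
     * @param statusOf    returns the HTTP status code for a link (stubbed below)
     * @return for each page with at least one problem, the failing links and
     *         their status codes (everything >= 400, so 401 warnings included)
     */
    static Map<String, Map<String, Integer>> audit(
            Map<String, List<String>> linksByPage,
            ToIntFunction<String> statusOf) {
        Map<String, Map<String, Integer>> report = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> page : linksByPage.entrySet()) {
            for (String link : page.getValue()) {
                int status = statusOf.applyAsInt(link);
                if (status >= 400) {
                    // Work by exception: successful links are not stored.
                    report.computeIfAbsent(page.getKey(), k -> new LinkedHashMap<>())
                          .put(link, status);
                }
            }
        }
        return report;
    }

    public static void main(String[] args) {
        Map<String, List<String>> linksByPage = new LinkedHashMap<>();
        linksByPage.put("page-1", Arrays.asList(
                "http://www.my.com/index.htm", "http://www.my.com/missing.htm"));
        // Fake lookup for the demo: pretend one link is gone.
        ToIntFunction<String> fakeStatus =
                link -> link.endsWith("/missing.htm") ? 404 : 200;
        System.out.println(audit(linksByPage, fakeStatus));
    }
}
```

Grouping the errors by page is what later makes personalized reports possible, since each page has an identifiable author.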
4. Collecting data
We can collect links in two different ways. First, we can retrieve
data from the Domino environment: we can imagine an agent that opens each
Notes document, looks at specific fields and extracts the URLs to check.
This works perfectly for single-value text fields. But it becomes difficult
when the information is located in multi-valued or rich text fields: in
that case, we need an HTML parser, which is not trivial to write.
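To give an idea of what is involved, here is a deliberately naive sketch that pulls href values out of an HTML fragment with a regular expression. It is not a real HTML parser: it only handles quoted attribute values on a single line, and it picks up every href attribute, whatever the tag. The class name HrefExtractor is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Naive href extraction; a sketch only, not a substitute for an HTML parser. */
class HrefExtractor {

    // Matches href="..." or href='...' (quoted values only), ignoring case.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2).trim()); // group 2 is the value between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://www.my.com/index.htm\">Home</a> "
                + "<A HREF='/news.htm'>News</A></p>";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

Unquoted attributes, links built by scripts and malformed markup all escape this pattern, which is why a proper parser is preferable for rich text content.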
The second solution for collecting links consists of opening a page through
an HTTP request and getting back an array of links. This is what you get if
you use the native Domino method getDocumentByURL (defined in both
LotusScript and Java). I do not use this method for two reasons: in Java
it does not work in standalone applications; and in LotusScript, it generates
memory problems that cause a server crash on my machines!
A better alternative is to use an HTTP Client to retrieve
the URLs. This term refers to any software application that can send
HTTP requests and receive responses, such as a web browser.
You can develop a Java HTTP Client by using a set of existing classes and
customizing them to your needs.
The only problem with the HTTP Client is the high number of links returned:
most of them are out of our scope because they are part of the application's
features, such as a "Print this document" link, for example.
In this article, we decided to mix the two solutions: data will be collected
both from the NotesDatabase and from the HTTP response (the HTML page).
5. Verifying links
The HTTP Client verifies links by sending the HTTP request and analyzing
the returned response. It seems simple, but many problems can occur. The
risk is reporting more errors than really exist, simply because the HTTP Client
cannot interpret the URL correctly.
This can happen in the following situations:
- Relative links (e.g. /index.htm instead of http://www.my.com/index.htm)
- JavaScript links (e.g. <a href="javascript: document.location='me.htm';">)
- Redirected pages
A proxy server can also be a problem if the HTTP Client is not able to
go through the proxy. These potential issues will govern the choice of
an HTTP Client.
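The first two situations can be handled before any request is sent. The sketch below resolves a relative link against the address of the page it was found in, using the standard java.net.URI class, and sets JavaScript links aside because they cannot be checked with a plain HTTP request. The class name LinkNormalizer is hypothetical, and skipping empty or unparsable values is our own assumption. Redirects and proxies are not covered here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

/** Sketch: turn an href found in a page into an absolute, checkable URI. */
class LinkNormalizer {

    private static final String JS_PREFIX = "javascript:";

    /**
     * @param pageUri absolute address of the page that contains the link
     * @param href    raw value of the href attribute
     * @return the absolute URI to check, or empty if the link cannot be checked
     */
    static Optional<URI> normalize(URI pageUri, String href) {
        String trimmed = href.trim();
        // JavaScript links would need a script engine to be interpreted.
        boolean isJavaScript =
                trimmed.regionMatches(true, 0, JS_PREFIX, 0, JS_PREFIX.length());
        if (trimmed.isEmpty() || isJavaScript) {
            return Optional.empty();
        }
        try {
            // resolve() leaves absolute links untouched and completes relative ones.
            return Optional.of(pageUri.resolve(new URI(trimmed)));
        } catch (URISyntaxException e) {
            return Optional.empty(); // not parsable as a URI: cannot be checked
        }
    }

    public static void main(String[] args) {
        URI page = URI.create("http://www.my.com/news/page.htm");
        System.out.println(normalize(page, "/index.htm"));
        System.out.println(normalize(page, "javascript: document.location='me.htm';"));
    }
}
```

A link returned as empty is not necessarily broken. It is simply outside what this kind of check can decide, so it should not be counted as an error.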
6. Choosing the best Java HTTP Client
We decided to use Java for developing the client. The reasons are quite
simple: Java is supported by Domino, it offers good support for network
programming (especially with TCP/IP) and, last but not least, a large amount
of free code is available on the Internet. Our intention was to minimize
the coding activity.
Many free Java HTTP Clients are already available. J2SE
includes its own implementation in the java.net package (HttpURLConnection).
Unfortunately this class is very basic, and it is difficult, though possible,
to cover the issues from section 5 with it.
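For comparison, here is a minimal sketch of a status check written with the plain J2SE class. It is not the approach retained in this series. The class name BasicLinkChecker, the GET method and the 5-second timeouts are our own assumptions. To go through a proxy, you would pass a java.net.Proxy of type HTTP built from your proxy's host and port instead of Proxy.NO_PROXY.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URI;
import java.net.URL;

/** Sketch of a link check using only java.net.HttpURLConnection. */
class BasicLinkChecker {

    /** Configures a connection without sending anything yet. */
    static HttpURLConnection prepare(URL url, Proxy proxy) throws IOException {
        // The cast is only valid for http and https URLs.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
        conn.setRequestMethod("GET");
        conn.setInstanceFollowRedirects(true); // let the class follow redirects
        conn.setConnectTimeout(5000);          // milliseconds, arbitrary value
        conn.setReadTimeout(5000);
        return conn;
    }

    /** Sends the request and returns the HTTP status code. */
    static int statusOf(URL url, Proxy proxy) throws IOException {
        HttpURLConnection conn = prepare(url, proxy);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BasicLinkChecker <absolute http url>");
            return;
        }
        URL url = URI.create(args[0]).toURL();
        System.out.println(args[0] + " -> " + statusOf(url, Proxy.NO_PROXY));
    }
}
```

This covers the simple cases, but JavaScript links, form-based logins and page parsing all have to be written by hand, which is what motivates looking at higher-level clients.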
After a few queries, I discovered the following HTTP Clients:
You can also buy a product from Nogoop
software. I ran some tests and chose HttpUnit. My criteria were
ease of use and the available documentation.
It is now time to get into the technical details. First we have to download,
install and configure HttpUnit. Then we'll run some tests to make sure
our implementation works. Let's go!