Domino Références: French-language resource site for Lotus Notes/Domino developers

 

Broken Links Retrieval: 1 - Presentation
by Lionel

Part 1 - Presentation


Abstract: In this article, we present a method for identifying broken links on a Domino Intranet web site. Our method relies on a Java program that automatically collects links and checks their validity. Let's start by explaining the need for link verification, the different methods for checking links, and the choice of tool.

1. The Scenario
Our web site belongs to an Intranet Portal. It is based on a publication template that allows authorized end-users to publish various pieces of information: files (Word, Excel, PDF, ...), web pages (HTML, images, ...), texts, and links. Publication occurs through forms and, despite a basic field validation process, users are free to enter the content they want.

The links that we want to verify are not those that are part of the publishing application itself, such as a link to a view that shows recent updates. A broken link of that kind should be treated as a bug. Fixing it is a well-known process and is outside the scope of this article.

We will only focus on links entered by end-users in their publishing activity.

 

2. Understanding Broken Links
Usually, broken links correspond to HTTP requests that return a status code 404 (Not Found). If you want to know more about status codes, go to the W3C web site (for example: HTTP Status Codes).

In this article, we assume that all status codes equal to or higher than 400 indicate broken links, except the code 401. 401 (Unauthorized) means that the user first needs to authenticate in order to view the page. A 401 code makes it impossible to decide whether the link is valid or not, so it should be reported as a warning message, not as a failure.

Counting 5xx errors as broken links might be regarded as overly strict. In fact, we extend the definition of a broken link to anything that results in a failure for the end-user. If users cannot see a page they are authorized to view, then we regard this as a broken link.
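
The rule described in this section can be sketched in a few lines of Java. This is only an illustration: the class name LinkStatus and the three outcome names are hypothetical and do not come from the original program, and the enum syntax assumes Java 5 or later.

```java
/** Hypothetical helper: maps an HTTP status code to the three outcomes described above. */
final class LinkStatus {

    enum Outcome { OK, WARNING, BROKEN }

    static Outcome classify(int statusCode) {
        if (statusCode == 401) {
            // Authentication is required: the link's validity cannot be decided.
            return Outcome.WARNING;
        }
        if (statusCode >= 400) {
            // 4xx and 5xx alike: the end-user does not get the page.
            return Outcome.BROKEN;
        }
        return Outcome.OK;
    }
}
```

With this sketch, a 404 or a 500 is classified as BROKEN, a 401 as WARNING, and anything below 400 as OK.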

 

3. Overview of our Method
The most practical method for verifying the validity of a link is to send the HTTP request and analyze the status code of the response. By automating this activity, we can dramatically improve the overall quality of the content on the Intranet sites.

An automated process can be outlined as follows:

  • For each page: collect the links inside the page
  • For each link: send the request and store the status code (we work by exception, retaining only the errors)
  • After completion, prepare personalized reports that contain the errors and a link for correction
  • Notify authors or editorial managers by e-mail
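
The first three steps above can be sketched as a small Java loop. This is a rough outline under stated assumptions, not the program presented in this series: the names LinkAudit, Failure, collectLinks and fetchStatus are hypothetical, the two functions stand in for whatever actually collects links and sends requests, and the e-mail notification step is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical outline of the automated process; all names are illustrative. */
final class LinkAudit {

    /** One retained error: the page it was found on, the link, and the status code. */
    static final class Failure {
        final String page;
        final String link;
        final int status;

        Failure(String page, String link, int status) {
            this.page = page;
            this.link = link;
            this.status = status;
        }
    }

    /**
     * @param pageToAuthor page URL mapped to the author (or editorial manager) to notify
     * @param collectLinks assumed function returning the links found in a page
     * @param fetchStatus  assumed function returning the HTTP status code for a link
     * @return one list of errors per author, ready to be turned into a report
     */
    static Map<String, List<Failure>> run(Map<String, String> pageToAuthor,
                                          Function<String, List<String>> collectLinks,
                                          Function<String, Integer> fetchStatus) {
        Map<String, List<Failure>> reportByAuthor = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : pageToAuthor.entrySet()) {
            String page = entry.getKey();
            for (String link : collectLinks.apply(page)) {
                int status = fetchStatus.apply(link);
                // Work by exception: keep only codes >= 400 (a 401 would be shown as a warning).
                if (status >= 400) {
                    reportByAuthor
                        .computeIfAbsent(entry.getValue(), author -> new ArrayList<>())
                        .add(new Failure(page, link, status));
                }
            }
        }
        return reportByAuthor;
    }
}
```

Passing the collecting and checking steps in as functions keeps the loop independent of the HTTP Client that is eventually chosen.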

 

4. Collecting data
We can collect links in two different ways. First, we can retrieve data from the Domino environment: imagine an agent that opens each Notes document, looks at specific fields and extracts the URLs to check. This works perfectly for single-value text fields. But it becomes difficult when the information is located in multi-valued or rich text fields: in that case, we need an HTML parser, which is not trivial to write.

The second solution for collecting links consists in opening a page through an HTTP request and getting back an array of links. This is what you get if you use the native Domino method getDocumentByURL (available in both LotusScript and Java). I do not use this method for two reasons: in Java, it does not work in standalone applications; and in LotusScript, it causes memory problems that crash the server on my machines!

A better alternative is to use an HTTP Client to retrieve the URLs. The term refers to any software application that can send HTTP requests and receive responses, such as a web browser. You can develop a Java HTTP Client by starting from an existing set of classes and adding your own requirements.

The only problem with the HTTP Client is the high number of links returned: most of them are out of our scope because they are part of the application's features, such as a "Print this document" link, for example.
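
To make the problem concrete, here is a minimal sketch of collecting links from an HTML page and discarding those that belong to the application. It is an assumption-laden illustration, not the approach used later in this series: the class LinkExtractor and the idea of excluding links by URL fragment are hypothetical, and a regular expression is only a crude substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pulls href values out of HTML and drops application links. */
final class LinkExtractor {

    // Crude pattern for quoted href attributes; unquoted attributes are not handled.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * @param html              the page source returned by the HTTP Client
     * @param excludedFragments assumed list of URL fragments that identify application links
     */
    static List<String> extract(String html, List<String> excludedFragments) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            String href = matcher.group(1).trim();
            if (href.isEmpty()) {
                continue;
            }
            boolean excluded = false;
            for (String fragment : excludedFragments) {
                if (href.contains(fragment)) {
                    excluded = true; // part of the application, out of scope
                    break;
                }
            }
            if (!excluded) {
                links.add(href);
            }
        }
        return links;
    }
}
```

The exclusion list has to be maintained by hand, which is one reason a purely HTML-based collection is hard to keep accurate.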

In this article, we decided to mix the two solutions: data will be collected both from the NotesDatabase and from the HTTP response (the HTML page).

 

5. Verifying links
The HTTP Client verifies links by sending the HTTP request and analyzing the returned response. It seems simple, but many problems can occur. The risk is to report more errors than really exist, simply because the HTTP Client is not able to interpret the URL correctly.

This can happen in the following situations:

  • Relative links (e.g. /index.htm instead of http://www.my.com/index.htm)
  • JavaScript links (e.g. <a href="javascript: document.location='me.htm';">)
  • Redirected pages

A proxy server can also be a problem if the HTTP Client is not able to go through the proxy. These potential issues will govern the choice of an HTTP Client.
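
As an illustration of the first two situations, the sketch below resolves a relative link against the URL of the page it was found on and sets JavaScript links aside. It relies only on the standard java.net.URI class; the class name LinkNormalizer and its method are hypothetical, and redirects and proxies are deliberately not covered here.

```java
import java.net.URI;
import java.util.Optional;

/** Hypothetical sketch: turns an href into an absolute URI, or skips it. */
final class LinkNormalizer {

    private static final String JS_PREFIX = "javascript:";

    /**
     * @param pageUrl absolute URL of the page that contains the link
     * @param href    the raw href value found in that page
     * @return the absolute URI to request, or empty if the link cannot be checked
     */
    static Optional<URI> toAbsolute(String pageUrl, String href) {
        String trimmed = href.trim();
        if (trimmed.isEmpty()
                || trimmed.regionMatches(true, 0, JS_PREFIX, 0, JS_PREFIX.length())) {
            // JavaScript links cannot be requested directly: skip rather than report an error.
            return Optional.empty();
        }
        try {
            // "/index.htm" found on http://www.my.com/docs/page.htm becomes http://www.my.com/index.htm
            return Optional.of(URI.create(pageUrl).resolve(trimmed));
        } catch (IllegalArgumentException e) {
            // Not a syntactically valid URI reference (for example, it contains spaces).
            return Optional.empty();
        }
    }
}
```

Skipped links could be listed separately in the report so that nothing is silently ignored.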

 

6. Choosing the best Java HTTP Client
We decided to use Java for developing the client. The reasons are quite simple: Java is supported by Domino, it offers good support for networking (especially TCP/IP) and, last but not least, a large amount of free code is available on the Internet. Our intention was to minimize the coding effort.

Many free Java HTTP Clients are already available. J2SE includes its own class in the java.net package (HttpURLConnection). Unfortunately this class is very basic and it is difficult, though possible, to cover the issues from section 5.
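
For reference, a bare status check with HttpURLConnection might look like the sketch below. It is not the client retained in this article: the class name StatusProbe, the use of a HEAD request and the five-second timeouts are all assumptions made for the example, and some servers may not answer HEAD requests the way they answer GET.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical sketch: returns the HTTP status code for an absolute URL. */
final class StatusProbe {

    static int fetchStatus(String absoluteUrl) throws IOException {
        HttpURLConnection connection =
            (HttpURLConnection) URI.create(absoluteUrl).toURL().openConnection();
        try {
            connection.setRequestMethod("HEAD");         // assumption: the body is not needed
            connection.setInstanceFollowRedirects(true);  // follow redirects for this request
            connection.setConnectTimeout(5000);           // milliseconds, arbitrary value
            connection.setReadTimeout(5000);
            return connection.getResponseCode();
        } finally {
            connection.disconnect();
        }
    }
}
```

Everything else (collecting links from the page, resolving relative URLs, handling JavaScript links) would have to be written by hand on top of this call.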

After a few queries, I discovered the following HTTP Clients:

You can also buy a product from Nogoop software. I ran some tests and chose HttpUnit. My criteria were ease of use and the available documentation.

 


It's now time to get into the technical details. First we have to download, install and configure HttpUnit. Then we'll run some tests to make sure our implementation works. Let's go!

Last updated: 18/09/2003
Design: Lionel, 2001-2002