Control in the WWW Hyperspace

Massimo Marchiori
The World Wide Web Consortium (W3C)
MIT
Laboratory for Computer Science
545 Technology Square, NE43-350
Cambridge, MA 02139, USA
massimo@w3.org

Position:

The power of the World Wide Web lies in its distributed character.
The weakness of the World Wide Web lies in its distributed character.
In this position paper we focus on the second point: we attempt to cope with the
apparently unsolvable problem of the lack of general control in the World Wide Web,
concentrating on the hyper structure problem.

Control

The World Wide Web has revolutionized the way people access information. In a sense, the web is a collection of a multitude of spatially distributed databases. On the one hand, its superior flexibility lies in the opportunity to overcome spatial barriers and to jump freely from one site to another via hyper links, with a simple mouse click. On the other hand, its power lies in its size and variety, since it collects an enormous number of different sites, as reported by every recent estimate (cf. [4]). However, this great power also raises what is the major problem of the web: navigating in hyperspace is becoming more and more difficult, since the web is nowadays very poorly connected (see for instance [1]). The problem appears intractable: the power of the web lies in its distributed character, and so no global control over it is possible.

In this position paper we propose a solution to this issue, and test its effectiveness via small-scale simulations called web arenas. It is true that there is no global control over the users who maintain sites on the web, but there is nevertheless a way to encourage such users to improve the navigational structure of the web: the idea is to provide suitable ``bonuses'' to the maintainers of web pages, so as to foster navigation cooperation among sites. But who is going to provide such ``bonuses'', and what form can they take? As market studies clearly indicate, in order to survive in the WWW information jungle, web users have to rely almost exclusively on search engines (automatic catalogs of the web) and repositories (human-made collections of links, usually organized by topic). In turn, repositories now rely on search engines themselves to keep their databases up to date. Thus, the crucial component in the information management chain is the search engine.

Therefore, the idea is that these bonuses should be provided through the search engines' scores: if a web page improves the web structure in some sense, that page will get a higher rank. Indeed, search engines have become so important in the advertisement market that it has become essential for companies to have their pages listed in the top positions of search engines, in order to get significant web-based promotion. Starting with the pioneering work of Rhodes ([5]), this phenomenon is now growing at such a rate that it has caused serious problems for search engines (see e.g. [4]), and it has revolutionized web design companies, which are now specifically asked not only to design good web sites, but also to make them rank high in search engines. A vast number of new companies have been founded just to make customers' web pages as visible as possible. More and more companies, like Exploit, Allwilk, Northern Webs, Ryley & Associates, PlanetOcean, SignPost, Did-It, Mentor Marketing, etc., explicitly study ways to make a page rank high in search engines. OpenText went as far as selling ``preferred listings'', i.e. guaranteeing that a particular entry stays in the top ten for some time (for a discussion of the effects of such a policy, see for instance [7]).

We have studied the effects of search engine bonuses on the navigational structure of the web. We will report on extensive tests with different bonuses, measuring their effect on navigation in the WWW, including a novel cooperation bonus that provides a remarkably good solution. These tests shed new light on the bonus approach, and show that the cooperation bonus ranks by far as the best method. The tests also include the important visibility bonus, which is currently implemented by many search engines: we show that the effects of this bonus on the global navigability of the World Wide Web are deleterious, and so search engines should avoid using it.
 

Navigation in the WWW

In order to be really useful, highways in the World Wide Web must not be built by chance. That is to say, there should be a rationale behind building a piece of the information highway: if a user is looking at a web object because he is interested in some particular information, he should (also) be offered links pointing to objects that offer related information. Thus, given two web objects A and B (belonging to two different sites), we say that there is navigation cooperation from A to B if a user who navigates to A seeking specific information can proceed with his search by navigating to B. The prime ingredient for navigation cooperation is therefore the hyper link from A to B (however, the presence of such a link does not automatically imply navigation cooperation, as we will see later). Note that the notion of (navigation) cooperation is not necessarily symmetric: one object can cooperate with another, but the converse may not hold.

A cooperative link potentially improves the connectivity of the web, and the usefulness of navigation for users. However, in order to quantify how fruitful the usage of cooperative links can be, we need a way to measure the navigability of a web structure. First, we need what is called a categorization (also called classification) of the web objects. A categorization assigns each web object to a certain category. Thus, for example, we could have as categories of interest Computers and Music, with the intended meaning of indicating those web objects dealing with computers and music, respectively. Then, a categorization would consist of a set of web objects classified in the category Computers (those web objects that we classify as pertaining to computers), and a set of web objects classified in the category Music (those web objects that we classify as pertaining to music).

Once we have a categorization of the web objects, it is intuitively clear how to measure the navigability of a web structure: a user looking for information in a particular category must be able to navigate easily through all the objects belonging to that category when starting from one of these objects. This means that such objects must be tightly connected by hyper links, and that they should not be too heavily connected to objects belonging to other categories, otherwise one can get lost while navigating. These informal provisos can be expressed as follows. Let S be a subset of a web structure W: then, the cohesion of S can be measured as the difference between the percentage of ``intra'' connectivity (to what extent the elements of the subset are connected to each other), and the percentage of ``inter'' connectivity (to what extent the subset is connected with the rest of the web). So, now we have all the tools to measure how fruitful navigation in a web structure can be: the navigability of a web structure W w.r.t. a categorization Cat is the average cohesion of the categories of Cat.
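
The definitions of cohesion and navigability can be illustrated with a minimal sketch. The text above does not fix how the two percentages are normalized, so the following choices are assumptions made only for this example: a web structure is a set of directed links between objects, ``intra'' connectivity is the fraction of all possible directed links inside S that are actually present, and ``inter'' connectivity is the fraction of all possible directed links between S and the rest of W (in either direction) that are present. The function names, the object names and the sample data are hypothetical.

```python
def cohesion(links, subset, universe):
    """Cohesion of `subset` within `universe`: intra minus inter connectivity.

    `links` is an iterable of (source, target) pairs (directed hyper links).
    The normalization used here is an assumption, not taken from the paper.
    """
    links = set(links)                      # ignore duplicate links
    subset = set(subset)
    rest = set(universe) - subset
    universe = subset | rest

    # Number of directed links that could exist inside S, and between S and W \ S.
    intra_possible = len(subset) * (len(subset) - 1)
    inter_possible = 2 * len(subset) * len(rest)

    # Links with both endpoints in S (self-links are ignored).
    intra = sum(1 for a, b in links if a in subset and b in subset and a != b)
    # Links with exactly one endpoint in S and the other elsewhere in W.
    inter = sum(
        1 for a, b in links
        if a in universe and b in universe and (a in subset) != (b in subset)
    )

    intra_pct = intra / intra_possible if intra_possible else 0.0
    inter_pct = inter / inter_possible if inter_possible else 0.0
    return intra_pct - inter_pct


def navigability(links, categorization):
    """Average cohesion over the categories of `categorization`.

    `categorization` maps a category name to the set of objects in it.
    """
    if not categorization:
        raise ValueError("categorization must contain at least one category")
    universe = set().union(*categorization.values())
    total = sum(cohesion(links, objs, universe) for objs in categorization.values())
    return total / len(categorization)


# Hypothetical toy web structure with the two example categories.
example_categorization = {
    "Computers": {"c1", "c2"},
    "Music": {"m1", "m2"},
}
example_links = {("c1", "c2"), ("c2", "c1"), ("m1", "m2"), ("c1", "m1")}

if __name__ == "__main__":
    print(navigability(example_links, example_categorization))
```

Under these assumptions, Computers has both of its possible internal links and one of the eight possible links to or from the rest of the structure, so its cohesion is 1.0 - 0.125 = 0.875; Music has one of two internal links, so its cohesion is 0.5 - 0.125 = 0.375; the navigability of the toy structure is their average, 0.625. A different normalization would change the numbers but not the idea: links inside a category raise navigability, and links across categories lower it.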
 

Impulse

As said, we have measured the impact of suitable bonuses on the global navigational structure of the WWW. In order to do that, we have employed small-scale simulations of the World Wide Web, so-called Web arenas. A Web arena is in a sense a small closed world in which users, the ``players'', can play by creating new web objects and modifying them. The only limitation is that the hyper structure must stay within the arena, that is to say, users are only entitled to hyper-reference objects internal to the Web arena. We assigned to each member of the population some specific categories drawn from the Excite ontology. The important point was that only the name of each category was given, and no mention at all was made of the ontology itself. This means not only that each player was unaware of the existence of the ontology, but also that only the particular category name was provided, and not the whole classification. In order to play a web arena game, there is also the need for a goal. We played the game several times: each time, a specific search engine for the web arena was used. The goal for each player was to rank high in the search engine for each of the categories he was provided with, just as if he were responsible for a site in the ``real'' WWW and wanted to have his site noticed. Note that, just as in the WWW case, the maintainer of a site devoted to a certain topic does not have a ``sure'' way to guarantee that his site will rank high for all the users who use a search engine to look for information related to his site (in our terminology, pertinent to that category). Thus, all one can do is try to rank high for many keywords that are in all likelihood representative of, or related to, the given category.

We studied the behavior of four kinds of bonuses.

The obtained results were rather striking and instructive.
In the first case, the situation resembled the current situation in the World Wide Web: the outcome of the simulation was a web arena with very poor connectivity and (what is worse) very low navigability too.
In the second case, connectivity stayed very poor, and navigability was worse than in the ``no-bonus'' case.
In the third case, connectivity increased a lot, and navigability improved considerably.
In the fourth case, the progress made with the cooperation bonus was lost: connectivity increased greatly, but this time there was no rationale for providing ``navigationally useful'' links. The trend was even to link sites not belonging to the same market niche, thus further ruining navigability.

In the presentation, we will describe in detail the cooperation bonus solution and the simulation results, provide insights into the practical meaning of these results, and discuss the impact of this solution on search engine developers and on the web advertisement market.

References

[1] Tim Bray. Measuring the Web. Fifth International World Wide Web Conference, Paris, May 1996.

[2] Massimo Marchiori. The Quest for Correct Information on the Web: Hyper Search Engines. Sixth International World Wide Web Conference, Santa Clara, California, April 1997.

[3] Massimo Marchiori. Security of World Wide Web Search Engines. Reliability, Quality and Safety of Software-Intensive Systems. Chapman & Hall, 1997.

[4] Nielsen Media Research. Web Audience Measurement: Issues, Challenges and Solutions. IPQC Conference on Performance Measurement for Web Sites, San Francisco, 1996.

[5] Jim Rhodes. How to Promote Your Business Web Pages.

[6] Danny Sullivan. Webmaster's Guide to Search Engines. Calafia Consulting, 1997.

[7] Nick Wingfield. Engine sells results, draws fire. C|net Inc., June 1996.