Control in the WWW Hyperspace
Massimo Marchiori
The World Wide Web Consortium (W3C)
MIT
Laboratory for Computer Science
545 Technology Square, NE43-350
Cambridge, MA 02139, USA
massimo@w3.org
Position:
The power of the World Wide Web lies in its distributed character.
The weakness of the World Wide Web lies in its distributed character.
In this position paper we focus on how to solve this second issue; that
is to say, we attempt to cope with the apparently unsolvable problem of
the lack of general control in the World Wide Web, focussing on the hyper
structure problem.
Control
The World Wide Web has revolutionized the way people can access information.
In a sense, the web is a collection of a multitude of spatially distributed
databases. On the one hand, its superior flexibility lies in the opportunity
to overcome spatial barriers, and to freely jump from one site to another
via hyper links, with a simple mouse click. On the other hand, the power
of the World Wide Web lies in its size and variety, since it collects
an enormous number of different sites, as reported by every recent estimate
(cf. [4]).
However, this great power also raises what is the major problem of the
web: navigating in hyper space is becoming more and more difficult, since
the web is nowadays very poorly connected (see for instance [1]).
The problem appears unattackable: the power of the web lies
in its distributed character, and so there is no possible global control
over it. In this position paper we propose a solution to this issue, and
test its effectiveness via small-scale simulations called web arenas.
It is true that there is no global control over users maintaining sites on
the web, but there is nevertheless a way to encourage such users to
improve the web navigational structure: the idea is to provide suitable
``bonuses'' to the maintainers of web pages so as to foster navigation cooperation
among sites. But who is going to provide such ``bonuses'', and what form
can they take? As market studies clearly indicate, in order to survive
in the WWW information jungle, web users have to rely almost exclusively
on search engines (automatic catalogs of the web) and repositories (human-compiled
collections of links, usually topic-based). In turn, repositories are now
themselves relying on search engines to keep their databases up-to-date.
Thus, the crucial component in the information management chain is given
by search engines. Therefore, the idea is that these bonuses should be
provided through the search engines' scores: if the web structure is in some
sense improved by a web page, that page will get a higher rank. Indeed, search
engines have become so important in the advertising market that it has
become essential for companies to have their pages listed in the top positions
of search engines, in order to get significant web-based promotion. Starting
with the pioneering work of Rhodes ([5]),
this phenomenon is now growing at such a rate that it has caused serious
problems for search engines (see e.g. [4]),
and has revolutionized web design companies, which are now specifically
asked not only to design good web sites, but also to make them rank high
in search engines. A vast number of new companies were born just to make
customer web pages as visible as possible. More and more companies, like
Exploit, Allwilk, Northern Webs, Ryley & Associates, PlanetOcean, SignPost,
Did-It, Mentor Marketing, etc., explicitly study ways to make a page rank
high in search engines. OpenText went as far as selling ``preferred listings'', i.e.
guaranteeing that a particular entry stays in the top ten for some time (for a
discussion on the effects of such a policy, see for instance [7]).
We have studied the effects of search engine bonuses on the navigational
structure of the web. We will report on extensive tests with different
bonuses, measuring their effect on navigation in the WWW,
including a novel cooperation bonus that provides a remarkably
good solution. These tests shed new light on the bonus approach, and show
that the cooperation bonus ranks by far as the best method. The tests also
include the important visibility bonus, which is currently implemented
by many search engines: we show that the effects of this bonus on the
global navigability of the World Wide Web are deleterious, and so its use
should be avoided by search engines.
Navigation in the WWW
In order to be really useful, highways in the World Wide Web must not be
built by chance. That is to say, there should be a rationale when building
a piece of the information highway: if a user is looking at a web object
because he is interested in some particular information, he should (also)
be offered links pointing to objects offering related information.
Thus, given two web objects A and B (belonging to two different
sites), we say that there is navigation cooperation from a web object
A to another web object B if a user navigating to A
in search of specific information can proceed with his search by navigating
to B. The prime ingredient for navigation cooperation is therefore
the hyper link from A to B (however, the presence
of such a link does not automatically imply navigation cooperation, as
we will see later). Note that the notion of (navigation) cooperation is
not necessarily symmetric: one object can cooperate with another, but the
converse may not hold.
A cooperative link potentially improves web connectivity, and
the usefulness of navigation for users. However, in order to quantify how
fruitful the use of cooperative links can be, we need a way to measure
the navigability of a web structure. First, we need what is called
a categorization (also called classification) of the web
objects. A categorization assigns each web object to a certain category.
Thus, for example, we could have as categories of interest Computers
and Music, with the intended meaning of indicating those web objects
dealing with computers and music, respectively. Then, a categorization
would be a set of web objects classified in the category Computers
(those web objects that we classify as pertaining to computers), and a
set of web objects classified in the category Music (those web objects
that we classify as pertaining to music). Once we have a categorization
of the web objects, it is clear how to intuitively measure the navigability
of a web structure: a user looking for information in a particular category
must be able to navigate easily through all the objects belonging to that
category when starting from one of these objects. This means such objects
must be tightly connected by hyper links, and that they should not be too
heavily connected to objects belonging to other categories, otherwise one
can get lost while navigating. These informal provisos can be expressed
as follows. Let S be a subset of a web structure W: then,
the cohesion of S can be measured as the difference between
the percentage of ``intra'' connectivity (to what extent the elements of
the subset are connected to each other), and the percentage of ``inter''
connectivity (to what extent the subset is connected with the rest of the
web). So, now we have all the tools to measure how fruitful the navigation
in a web structure can be: the navigability of a web structure
W w.r.t. a categorization Cat is the average of the cohesions of the
categories of Cat.
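As an illustration, the two definitions above can be sketched in Python. The text does not say how the ``intra'' and ``inter'' percentages are normalized, so this sketch makes an assumption: each is taken as the fraction of possible directed links (between distinct objects) that are actually present. The function names, the representation of links as (source, target) pairs, and the toy data are all hypothetical.

```python
def cohesion(links, subset, universe):
    """Intra-connectivity minus inter-connectivity of `subset` within `universe`.

    Assumption: both percentages are fractions of the possible directed
    links between distinct objects; the result lies in [-1, 1].
    """
    everything = set(universe)
    inside = set(subset) & everything
    outside = everything - inside
    # Keep only distinct links between two different objects of the structure.
    edges = {(a, b) for a, b in links
             if a != b and a in everything and b in everything}
    intra = sum(1 for a, b in edges if a in inside and b in inside)
    inter = sum(1 for a, b in edges if (a in inside) != (b in inside))
    intra_possible = len(inside) * (len(inside) - 1)
    inter_possible = 2 * len(inside) * len(outside)
    intra_pct = intra / intra_possible if intra_possible else 0.0
    inter_pct = inter / inter_possible if inter_possible else 0.0
    return intra_pct - inter_pct


def navigability(links, categorization, universe=None):
    """Average cohesion over the categories of `categorization`.

    `categorization` maps a category name to the set of objects in it.
    If `universe` is omitted, it is taken to be the union of all categories.
    """
    if not categorization:
        return 0.0
    if universe is None:
        universe = set().union(*categorization.values())
    scores = [cohesion(links, members, universe)
              for members in categorization.values()]
    return sum(scores) / len(scores)


# Toy structure: two categories with one link crossing between them.
example_links = {("c1", "c2"), ("c2", "c1"), ("m1", "m2"), ("c1", "m1")}
example_cat = {"Computers": {"c1", "c2"}, "Music": {"m1", "m2"}}
print(navigability(example_links, example_cat))  # prints 0.625
```

In this toy structure Computers has cohesion 0.875 (fully connected inside, one of eight possible crossing links present) and Music has cohesion 0.375, giving the average of 0.625. Other normalizations are equally compatible with the informal definition and would yield different values.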
Impulse
As said, we have measured the impact of suitable bonuses on the global navigational
structure of the WWW.
In order to do that, we have employed small-scale simulations of the
World Wide Web, so-called Web arenas.
Web arenas are in a sense a small closed world in which users, the "players",
can play by creating new web
objects and modifying them. The only limitation is that the hyper structure
must stay within the arena, that is to say,
users are only entitled to hyper-reference objects internal to the
Web arena. We assigned to each member of the population some specific categories
drawn from the Excite ontology. The important thing was that only the name
of the category was given, but no mention at all was made of the
ontology itself. This means not only that each player was unaware of
the existence of the ontology, but also that only the particular category
name was provided, and not the whole classification. In order to play a
web arena game, there is also the need for a goal. We played the game several
times: each time, a specific search engine for the web arena was used.
The goal for each player was to rank high in the search engine, for each
of the categories he was provided with, just as if he were responsible
for a site in the ``real'' WWW, and wanted to have his site noticed. Note
that, just as in the WWW case, a maintainer of a site devoted to a certain
topic does not have a ``sure'' way to ensure that his site will have
a high rank for all the users using a search engine and looking for information
related to his site (in our terminology, pertinent to that category). Thus,
all one can do is try to rank high for many keywords that are
in all likelihood representative of or related to the given category.
We studied the behavior of four kinds of bonuses.
- The first one was the no-bonus: that is to say, no bonuses were
  used. In order to get a situation as realistic as possible, we just employed
  as search engine a basic module (actually part of a bigger search engine
  being developed by the author) performing classic weighted scoring based
  on frequencies and counts (this performs roughly as well as the search
  engines available today). The same module was then reused in the other
  three web arena games by adding a specific bonus on top of its score
  function.
- The second one was the visibility bonus, which gives a bonus to
  a web object in proportion to the number of links that point to it. This
  bonus is of particular importance (cf. [6])
  because it is currently employed by many search engines like WebCrawler,
  Excite, Lycos and Magellan (although not with the purpose of improving the
  web structure, but just to enhance the evaluation of the score).
- The third one was the cooperation bonus, a bonus that tries to quantify
  the effectiveness of the hyperlink for navigation, by making use of the
  so-called hyper-measure of information first introduced in [2].
- The fourth one was the connectivity bonus, which just naïvely
  assigns a bonus to each link (this can be seen as a degenerate case of
  the cooperation bonus, where the cooperation specification is always satisfied).
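The relation between these bonuses can be sketched as follows. This is only an illustration under stated assumptions, not the scoring code used in the experiments: the function names, the `weight` parameter, and the additive combination with the base score are hypothetical, and the `cooperates` predicate is a stand-in for the hyper-measure of information of [2], which is not reproduced here.

```python
def visibility_bonus(page, links, weight=1.0):
    """Bonus proportional to the number of links pointing to `page`."""
    return weight * sum(1 for _, target in links if target == page)


def cooperation_bonus(page, links, cooperates, weight=1.0):
    """Bonus for each outgoing link of `page` that satisfies `cooperates`.

    `cooperates(source, target)` is a hypothetical predicate standing in
    for the cooperation specification.
    """
    return weight * sum(1 for source, target in links
                        if source == page and cooperates(source, target))


def connectivity_bonus(page, links, weight=1.0):
    """Degenerate cooperation bonus: every outgoing link counts."""
    return cooperation_bonus(page, links, lambda source, target: True, weight)


def score_with_bonus(base_score, bonus):
    """Assumption: the bonus is simply added on top of the base score."""
    return base_score + bonus


# Toy data: a stand-in predicate that treats a link as cooperative
# when both ends belong to the same category.
category = {"a": "Music", "b": "Music", "c": "Computers"}
links = {("a", "b"), ("a", "c"), ("c", "a")}


def same_category(source, target):
    return category[source] == category[target]


print(visibility_bonus("a", links))                  # prints 1.0
print(cooperation_bonus("a", links, same_category))  # prints 1.0
print(connectivity_bonus("a", links))                # prints 2.0
```

Writing the connectivity bonus as a call to the cooperation bonus with an always-true predicate mirrors the remark that it is a degenerate case. The no-bonus game corresponds to using the base score alone.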
The obtained results were rather striking and instructive.
In the first case, the situation resembled the current state
of the World Wide Web: the outcome of the simulation was a web arena with
very poor connectivity and (what is worse) very low navigability
too.
In the second case, connectivity stayed very poor, and navigability
was worse than in the ``no-bonus'' case.
In the third case, connectivity increased a lot, and navigability
improved considerably.
In the fourth case, the progress made with the cooperation bonus was
lost: connectivity increased greatly, but this time there was no rationale
for providing ``navigationally useful'' links. The trend was even to link
sites not belonging to the same market niche, thus further ruining navigability.
In the presentation, we will describe in detail the cooperation bonus
solution and the simulation results, provide insights on the practical
meaning of such results, and discuss the impact of this solution on
search engine developers and on the web advertising market.
References
[1] TIM BRAY.
    Measuring the Web.
    Fifth International World Wide Web Conference, May, Paris, 1996.
[2] MASSIMO MARCHIORI.
    The Quest for Correct Information on the Web: Hyper Search Engines.
    Sixth International World Wide Web Conference, April, Santa
    Clara, California, 1997.
[3] MASSIMO MARCHIORI.
    Security of World Wide Web Search Engines.
    Reliability, Quality and Safety of Software-Intensive Systems.
    Chapman & Hall, 1997.
[4] NIELSEN MEDIA RESEARCH.
    Web Audience Measurement: Issues, Challenges and Solutions.
    IPQC Conference on Performance Measurement for Web Sites, San
    Francisco, 1996.
[5] JIM RHODES.
    How to Promote Your Business Web Pages.
[6] DANNY SULLIVAN.
    Webmaster's Guide to Search Engines.
    Calafia Consulting, 1997.
[7] NICK WINGFIELD.
    Engine sells results, draws fire.
    C|net Inc., June, 1996.