T. Beauvisage. The dynamics of personal territories on the Web. Proceedings of the 19th ACM conference on Hypertext and hypermedia (Hypertext 2009), Torino (Italy), June-July 2009, pp. 25-34.
The Dynamics of Personal Territories on the Web
Thomas Beauvisage
Orange Labs 38-40, rue du Général Leclerc 92794 Issy-Les-Moulineaux – France (33)1 45 29 58 11
thomas.beauvisage@orange-ftgroup.com
ABSTRACT
In this paper, we present a long-term study of user-centric Web traffic data collected in 2000-2002 and 2005-2006 from two large representative panels of French Internet users. Our work focuses on the dynamics of personal territories on the Web and their evolution between 2000 and 2006. At the session level, we distinguish four profiles of browsing dynamics in 2005-2006, and point out the growing dichotomy between straight routine sessions and exploratory browsing. At a global level, we observe that although each individual’s corpus of visited sites is constantly growing, his browsing practices are structured around routine, well-known sites which operate as link providers to new sites. We argue that this tension between the known and the unknown is constitutive of Web practices and is a fundamental property of personal Web territories.
Categories and Subject Descriptors
H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia – User issues.
General Terms
Measurement, Experimentation, Human Factors.
Keywords
User-centric Traffic Data, Web Usage Mining, Traffic Analysis, Browsing Behaviors, Usage Territories.
1. INTRODUCTION
Using the Web for personal or professional purposes has become quite common. The growth of broadband Internet access and the development of online services have contributed to this evolution: Web 2.0 has increased communication and interaction capabilities, along with e-commerce development, while traditionally stand-alone applications like audio players or games are now available online. Thus, within the past ten years, the Web has become a natural and common resource in daily life practices. Therefore, the fine description of Web browsing practices has become a matter of interest for researchers as well as for Internet players, and many research programs have addressed this issue in the field of Computer Science and HCI. Web Usage Mining (WUM) has become a research field in its own right. In the field of WUM, three main directions can be distinguished: 1) cognitive and formal aspects of navigation investigated by Cognitive Science; 2) server-side click-through data analysis, interested in specific behaviours on Web sites, especially search engines; and 3) user-centric approaches, mainly focusing on laboratory experiments with small groups of participants.
An important distinction between these three research directions relates to the data source, i.e. user-centric or server-centric traffic traces. Due to the easy access to server logs, a large number of works have focused on the description and analysis of Web trails from a server-centric point of view. Although they are not appropriate to describe global usages, they are often grounded on large-scale observation, and have proposed valuable methods to describe browsing dynamics. As for user-centric studies in natural situation, they have the great advantage of embracing the whole browsing activity. However, they remain rare and mainly focused on the question of revisitation patterns in a descriptive approach.
In this study, we wish to combine the benefits of both methodological approaches. We have the advantage of working on two long-term datasets collected in 2000-2002 (34 months) and 2005-2006 (19 months) from large and representative cohorts of French Internet users in natural situation. These unique data allow us to observe, in a plain user-centric approach, the long-term evolution of Web practices and of navigation dynamics and morphology. Therefore we can describe the structure of personal Web territories and the dynamics of their constitution within daily practices.
In the following sections, we first present the existing works on Web browsing and detail the underlying methods and paradigms. Secondly, we describe our naturalistic data collection method and the data preparation processes we conducted. Afterwards, we present our findings on the evolution of browsing behaviors since 2000, by combining the description of the dynamics of sessions and the structure of Web territories.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HT’09, June 29–July 1, 2009, Torino, Italy. Copyright 2009 ACM 978-1-60558-486-7/09/06...$5.00.
2. RELATED WORK
2.1 Web Browsing and Cognitive Science
Cognitive Science proposes a model of human activity as information processing, which has been followed by many studies in the fields of Information Retrieval and Artificial Intelligence. It was first applied to hypermedia systems in the 1980s: the objective was to establish a model of user navigation in a given hypermedia system in order to draw recommendations and guidelines for hypermedia design [13]. At the minimal level of the hypertext browsing activity, the objective is to describe the shape and dynamics of the user’s navigation through a given hypertext. The study conducted by Canter et al. [15] is frequently cited in this context. The authors distinguish four basic forms in navigation: path, ring, loopiness and spike. These forms are recognized and combined to produce six indicators which give a formal description of navigation. This early work is often referred to by studies on Web navigation in the field of Cognitive Psychology. These works often propose taxonomies of navigation strategies, but no general consensus can be seen on the definition of these strategies [9]. As a matter of fact, the key question here is the interpretation of the navigation patterns observed in various contexts of activity and performed by diverse individual profiles.
A whole series of works on Web navigation preserved these paradigms inherited from the studies on hypermedia. Grounded in Cognitive Science, they focused on the modeling of the user in the situation of browsing, and the discovery of underlying “cognitive structures” and “mental models”. This approach led in most cases to user-centric studies on restricted samples in “laboratory situation”, either focusing on the “usability” of a given Web site, or on the question of the “confusion” of users “lost in hyperspace” [41]. As a matter of fact, these early works directly inherit paradigms from Information Retrieval and Artificial Intelligence: Web pages are seen as documents containing “informational molecules”, and user activity is considered as a set of tasks which can be handled with a problem-solving approach. In this context, a number of studies deal with information seeking strategies and with the use of certain kinds of sites, e.g. [3, 18] on knowledge workers, [26, 31] on search strategies, or [3, 23] on the differences between visual or verbal cognitive structures. Therefore, these works gain knowledge on specific behaviors and situations: people are observed “in laboratory”, and perform actions predefined by researchers who want to validate certain assumptions.
An important part of these studies also aims to build models of Web surfing: [33] presented a model of “navigation through information”, MESA (Model for Evaluating Site Architecture), built upon a model of navigation and the structure of the Web site; [10] proposed a method of analysis, CWW (Cognitive Walkthrough for the Web), adapting existing Cognitive Science methods to the Web in order to measure the usability of a site; [22] established SNIF-ACT, a cognitive model of navigation in information retrieval. In all cases, the user is modeled as performing a task, and following hyperlinks is treated as a decision made to accomplish this task. These approaches have the advantage of addressing the complexity of navigation in hypertext and its meaning in a given context. However, they remain grounded in the cognitive science paradigms: contents, modes of access and browsing activities are considered through the prism of information retrieval and, by extension, Web contents are valued from this single angle. This approach postulates an equivalence between patterns of navigation, tasks and motivation of the user, and therefore does not embrace the diversity of available Web contents and services, or the corresponding diversity of their usages in daily practices.
2.2 Server-Centric Works
There is a relatively abundant literature dealing with the analysis of user paths on the basis of Web server logs. Based on the task recognition techniques mentioned above, or involving statistical analysis based on Markov chains, analysis of time series and data mining tools, these works try to exploit browsing traces to improve user experience and knowledge on a Web site. Early server-centric studies mainly focused on the search for recurrent browsing patterns and Web trails within a given site [38]. They apply data mining techniques to Web logs, and therefore make heavy use of statistical analysis and graph theory. Three studies conducted in the late 1990s illustrate this particular trend: first, [11] proposed a modeling of navigation in the form of a Hypertext Probabilistic Grammar (HPG), which makes it possible to mine Web session databases and to identify recurrent sequences. At the same time, and pursuing the same objectives, [20] released Webmin, a piece of software which applies data mining techniques to the study of Web uses; finally, [37] developed Web Usage Miner, a piece of software capable of aggregating navigation trees on a site, and providing an SQL-like language, Mint, to search for navigation patterns.
These early works were naturally carried on in the 2000s, supported by industrial needs for Web personalization. As shown in [14], the Adaptive Web area deals today with a wide range of problems, including improving the architecture and design of Web sites, detailed analysis of the audience (most visited pages, sessions segmentation by type of visit), optimization of servers and adaptive content. In the meantime, a number of more descriptive site-centric studies were conducted over the past ten years. Particular attention was paid to search engine logs: the goal of the analysis is to improve the answers on the basis of the keywords searched, the reformulations, and the results pages consulted, e.g. [12, 32].
However, the site-centric positioning of these works makes them unable to go beyond local correlations between certain types of sites and variables related to the user, and to embrace the diversity of practices. To address this issue, some works have tried to include in the analysis some factors relative to the page content. For instance, [1] add to the "classical" statistical analysis of log files some "concepts" attached to each page. In the same vein, [25] presents an approach to navigation combining site-centric usage data (server logs), page content and site topology (structure of links between pages); [17] proposes to identify the social profiles and the corresponding demographics of a Web site's visitors; [27] assesses the ability to predict the age and gender of a visitor from visited content and browsing structure on a general portal. These studies are interesting in that they try to mix information on the visited contents and on the browsing shape and structure. However, they remain dependent on the server-centric approach: as shown by [35], which collected browsing data at the intermediate level of a content provider, the server-centric view is partial and biased, and the conclusions that can be drawn from such approaches should always be taken with caution, especially when they tend to draw general models of browsing behavior.
2.3 User Activity in Natural Situation
The collection of user-centric data in natural situation was the object of a few works in the early 2000s, and has met with renewed interest in recent years, with the need for fine-grained data on the use of Web 2.0 services. Along with browsing pattern analysis, new issues are addressed on Web audience monetization and loyalty. As evidence of this trend, the WWW 2006 Conference workshop on “Logging Traces of Web Activity: The Mechanics of Data Collection” paid much attention to technical choices for collecting user-centric traces, and to data preparation and cleaning. In parallel, techniques for gathering user-centric data are growing [29, 34]. Despite their high descriptive potential for Web usage mining, the studies based on the analysis of browsing traces collected on the
Table 1. User-centric log-based Web browsing studies summary

| Ref. | Authors | Data collection | Nb. users | Duration | Tracking method | Focus |
|---|---|---|---|---|---|---|
| [16] | Catledge and Pitkow | 1994 | 107 | 3 weeks | XMosaic modification | Cutting logs into sessions; evolution of page vocabulary, access modes to Web pages (Forward, Back, Open URL, Link, Anchor…) |
| [21] | Cunha et al. | 1994-1995 | 37 | 5 weeks | Mosaic modification | Generalized power-law distribution in Web data (documents size, user requests); self-similarity in traffic data |
| [39] | Taucher and Greenberg | 1995 | 23 | 6 weeks | XMosaic modification | Page access methods; revisitation mechanisms description and modeling; recommendations for Web browsers history mechanisms design |
| [18] | Choo et al. | 1998 | 34 | 2 weeks | Custom Web tracker | Seeking behaviors of knowledge workers; focus on significant information seeking episodes for modeling |
| [19] | Cockburn and McKenzie | 1999-2000 | 17 | 4 months | Netscape history file | Page revisitation description; growth of URL vocabulary, temporal aspects of revisitation, bookmarks usage patterns |
| [5] | Beaudouin and Assadi | 2000 | 1,140 | 12 months | PC probe | Search engines usage profiles |
| [6] | Beauvisage | 2000-2002 | 3,372 / 597 | 12 / 34 months | PC probe | Global Internet usage segmentation (Web, mail, chat, peer-to-peer); Web browsing dynamics; link between page content and sessions morphology; Web users browsing profiles |
| [4] | Assadi et al. | 2002 | 72 | 6 months | PC probe | Focus on Digital Libraries users; global Web practices; access modes to digital contents; combination of sites in DL access |
| [24] | Hawkey and Inkpen | 2004 | 20 | 1 week | IE tracking add-on | Evidences on page view, sessions cutting, browser window usage and speed of browsing |
| [40] | Weinreich et al. | 2004-2005 | 25 | 3 months (avg.) | Local Web proxy + Firefox add-on | Focus on revisitation: user actions (Back, Link…) when revisiting pages, targeted Web sites categories, temporal aspects of revisitation |
| [2, 3] | Adar et al. | 2006 | 612,000 | 5 weeks | Windows Live Toolbar | Focus on page revisitation; categorization of page according to revisitation patterns |
| this study | Beauvisage | 2000-2002, 2005-2006 | 597 / 1434 | 34 / 19 months | PC probe | Web territories dynamics: user, sessions and content dynamics |
user side in natural situation are rare. The main reason for that is the difficulty and cost of implementing such methods. We counted only about a dozen studies conducted under this methodology, listed in Table 1. In order to track Web browsing, most of these studies make use of browser add-ons. These modules record the addresses of visited pages, and sometimes even more precise interactions with the user interface. With a browser-oriented view, the collected data lose coverage because they cannot contextualize browsing within global PC activity, but they gain accuracy in the description of UI interactions thanks to data such as actions on menus.
These works share some common findings on Web browsing. First, browsing follows a power-law distribution: the majority of requests address a small part of Web documents and pages, while a large number of pages are rarely seen. Besides, the authors note the steady growth of the "vocabulary" of visited pages along with the concentration of traffic on a small number of frequently revisited pages. In addition, most of these studies conducted interviews with users in order to gather further information on the context in which the pages are revisited or on the strategies used by participants when searching the Web.
These studies are valuable in that they deal with “real-world” navigation on the side of the user. They provide important results in terms of statistics and methodology. Two criticisms can be made, however. On the one hand, the data often concern a limited number of users and/or cover only a relatively limited duration. On the other hand, the content of pages is never discussed, although it seems particularly interesting to know what relationship may exist between browsing strategies, navigation patterns and the visit of new sites, services and themes on the Web. In this paper, we wish to overcome these two difficulties by offering a description of browsing patterns based on PC probe data collected over a long period of observation.
2.4 Our Approach of “Web Territories”
In this paper, we wish to contribute to Web browsing description on the basis of automatically collected data from a large, representative and well described panel. Many of the existing studies focus on search strategies, or on specific groups of users (typically low-income users). For instance, the HomeNetToo project [28] combined interviews, direct observation, and automatically recorded activity traces, but these traces concerned only Internet activity. Besides, the automatic gathering of usage traces avoids the lack of precision of declarative, self-estimated consumption and behavior. Thus, we aim at providing a global insight on browsing dynamics.
Previous work on large-scale browsing traces concentrated on the concept of “revisitation”, and provided valuable descriptions of patterns. We wish to go further, and therefore propose the concept of “usage territory” to describe browsing dynamics. These “territories” reflect everyone’s browsing behavior along three dimensions. Firstly, the dynamics of navigation, referring to the session’s rhythm, shape, length and complexity. Secondly, the dynamics of the user: his navigation history, preferences, routine or discovery practices. And finally, the dynamics of the content: pages, services, games, etc. In practice, browsing behaviors cross and embed these three dimensions: the shape of the session, the knowledge of the visited contents and the kind of accessed content are linked. We wish to embrace these three dimensions with the concept of “territory”: the collection of daily practices draws the frontiers of each user’s Web landscape within the wide range of available sites.
This global approach led us to base our work on statistical classification tools in order to provide a global description of the three dimensions of Web territories: 1) at the session level, classifying Web sessions according to their topology; 2) at the user level, qualifying Web site profiles according to the frequency and regularity of individual visits; and 3) crossing this information with content description to draw a raw taxonomy of sites based on how they appear in daily practices.
3. METHODOLOGY
3.1 Panels and Probe
Our first methodological choice involved working on qualified panels. Unlike site-centric approaches, which have great difficulty in identifying the people behind IP addresses or cookies and find it even harder to categorize them, audience panels provide access to information about all the individuals in the household. This panel approach was deployed within two research projects conducted in France. In both cases, the panels were representative of the French Internet population:
- 2000-2002 panel: 597 individuals observed for 34 months, from January 2000 to October 2002. This dataset was collected within the SensNet project, aiming at describing global Internet usage and practices [8].
- 2005-2006 panel: 1,434 individuals from 661 households, observed for 19 months, from April 2005 to October 2006. This dataset is related to a wider research project named Entrelacs that aimed at analyzing the interweaving of ICT and sociability in France [36].
These two projects relied on the combination of various data sources: individual questionnaires, telephone and Internet traffic measures and qualitative studies. Such a combined approach is highly relevant to understand the patterns of mediated interpersonal communication, as shown in [30]. This material is quite unique considering the long period of observation, the number of households and users in the panel, and its representativeness of the French Internet-equipped population. Individuals are described by the conventional socio-demographic variables concerning them and their household. Households were also described by their audiovisual and communication equipment ownership. Finally, information on their Internet connection was collected, such as date of first connection, type of Internet connection (PSTN, broadband) and place of connection.
3.2 Web Usage Data Characteristics
All the Internet-equipped households of the panel installed a tracking application on their computers. We used a modified version of the tracking technology owned by NetRatings for its audience measurement. The modification concerns user personal identification, which is done by a pop-up window asking the user to identify himself in the list of household members. This identification window appears at PC start-up, and after every 30 minutes of inactivity on the PC. Once running, the probe appeared as an icon in the system tray of the task bar. The monitoring performed by the probe could be stopped by right-clicking this icon, but we have good reasons to believe that this functionality was hardly used.
The NetRatings probe is an unobtrusive proprietary panellist software which “silently” gathers and relays Internet usage data (such as the use of chat, e-mail, instant messaging applications, forums, audio, video download and peer-to-peer applications) which pass through the TCP/IP layer. It consists of a small computer application, and has no impact on regular use of the user’s computer; it automatically starts up when the computer is booted up. The data are collected and analyzed at the network layer level, which stands between the different applications accessing the network and the remote servers. At regular intervals, the recorded data are sent automatically via an Internet connection to a server without disturbing the user, and are then validated and loaded into a database.
The Web traffic data we gathered had already been pre-processed in two ways. Firstly, the probe combines several methods so that each log entry corresponds to an explicit page view from the user, and to ensure proper page accounting:
- request filtering: not all HTTP requests are logged, e.g. image files, style sheets, JavaScript, etc. are filtered out, and only successful HTTP return codes are traced;
- frameset and iframe reduction: a single request is logged;
- focus and UI interaction: requests are logged only when the browser has the focus and the user performs a browsing action.
The second pre-processing concerns the identification of sessions, i.e. the need to recognize coherent sequences of user activity within traffic data. Here, the industry standard of 30 minutes of Web inactivity is applied to divide the activity log continuum into sessions. Therefore, our raw data consist of a list of visited URLs for each user, each with a timestamp and a session id.
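As an illustration of the session identification rule, the following minimal sketch cuts one user's page views into sessions whenever the gap between two consecutive views reaches 30 minutes. It is not the probe's actual implementation: the `(timestamp, url)` record layout, the function name `split_sessions` and the choice of treating a gap of exactly 30 minutes as a break are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Inactivity threshold stated in the text: 30 minutes of Web inactivity.
SESSION_GAP = timedelta(minutes=30)


def split_sessions(page_views, gap=SESSION_GAP):
    """Group one user's page views into sessions.

    page_views: iterable of (timestamp, url) pairs (hypothetical layout).
    Returns a list of sessions, each a list of (timestamp, url) pairs.
    Assumption: a pause of at least `gap` starts a new session.
    """
    sessions = []
    previous_time = None
    for timestamp, url in sorted(page_views):
        # Open a new session on the first view or after a long enough pause.
        if previous_time is None or timestamp - previous_time >= gap:
            sessions.append([])
        sessions[-1].append((timestamp, url))
        previous_time = timestamp
    return sessions


# Hypothetical log: two views, a 35-minute pause, then two more views.
example_views = [
    (datetime(2005, 4, 1, 10, 0), "http://a.example/"),
    (datetime(2005, 4, 1, 10, 10), "http://b.example/"),
    (datetime(2005, 4, 1, 10, 45), "http://a.example/"),
    (datetime(2005, 4, 1, 10, 50), "http://c.example/"),
]
example_sessions = split_sessions(example_views)
```

With this hypothetical log, the 35-minute pause separates the four page views into two sessions of two views each; the session id mentioned in the text would simply be the index of each session in the returned list.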
4. TRAFFIC DATA PREPARATION
4.1 Site Identification
Our main data preparation task consisted in Web site identification. We needed to determine quite precisely which different sites had been visited during the sessions, and the Domain Name information does not always provide the right answer to this question. Indeed, the concept of "site", although intuitive, turns out to be quite problematic for the analysis. The technical definition that associates a site with a DNS is certainly valid in most cases, but comes up against two obstacles:
- dispersion and aliases: the important Web sites (portals, e-commerce…) provide localized versions, which can either span multiple TLDs, such as Google (google.com, google.fr…), or use subdomains, like Yahoo (www.yahoo.com for the US, fr.yahoo.com for the French version);
- aggregation: conversely, the Domain Name may be far too general. This is particularly the case for some personal Web sites or for blogs which are not self-hosted. For instance, a Geocities personal site address looks like ‘www.geocities.com/Heartland/5978’, and a blog hosted on Blogger looks like ‘secretsweb.blogspot.com’. In both cases, a DNS rule for site identification would aggregate all the Geocities personal Web sites and all the Blogspot blogs as one, ‘geocities.com’ and ‘blogspot.com’.
To deal with this issue, we propose the concept of editorial site: we consider a site to be a publication area with a single editorial entity, whether that is an individual, an organisation or a company. The definition here is less economic or legal than author-based and, as such, personal sites must be distinguished from their host. To calculate this editorial site, we developed a platform named CatService [7]. Basically, CatService requires a set of pattern matching rules, built with the help of a formalism based on regular expressions. These rules enable us to associate a class of URLs with a portal type, a portal name and a service, and to calculate the editorial site relative to each URL. The reference base and rules are built manually by the users of the application. Thus, for each URL, the editorial site identification process follows three steps:
1. if a hosting service (personal Web sites, blogs) is associated with the URL, specific site identification rules are applied, customized for each hosting service provider;
2. else, if a portal is associated with the URL, it is used as the identifier. For instance, ‘fr.yahoo.com’ is identified as Yahoo France, while other ‘yahoo.com’ URLs are associated with Yahoo US;
3. otherwise, the DNS is used as the editorial Web site identifier.

Table 2. Site identification methods results (2005-2006)

| Identification source | % sites | % viewed pages |
|---|---|---|
| 1. Hosting service | 27% | 2% |
| 2. Portal | 0.1% | 25% |
| 3. DNS | 73% | 73% |

As shown in Table 2, hosted Web sites account for 27% of the final number of sites, which validates the necessity of our method.
4.2 Topological and Temporal Indicators
The temporal dimension is one of the fundamental elements in browsing analysis. It is important on two levels: firstly, it involves calculating visiting times and the time spent on each page or each site. Secondly, it is important to examine the order in which the contents are accessed and its value in the browsing dynamics. To do so, we have developed statistical indicators capable of representing the browsing dynamics. We propose a simple, robust approach based on the construction of indicators enabling us to represent the “patterns” and the rhythm of browsing. The indicators must enable us to represent certain specific aspects of the session:
- the linearity (each page/site is seen only once),
- the number and length of detours,
- the importance of these detours in terms of session temporality,
- distinguish and quantify the "fixation points" of the session.
Our commitment to simplicity and robustness leads us to consider the sequential dimension only partially. Only complex graph analysis tools would hold together, in the same statistical object, Web browsing form and content. In the indicators that we propose, sequentiality is analyzed independently from content. We formalize Web sessions as a sequence of symbols that represent the visited pages. Using this representation, we propose the following indicators:
• N: length of the session (number of steps)
• n: number of unique elements seen in the session
• $r = n / N$: average browsing linearity rate, equal to 1 if the session is linear, becoming closer to 0 as this linearity diminishes
• R: number of elements revisited, i.e. seen more than once
• $c = (N - n) / R$: average number of revisits per element revisited.
This last indicator represents the concentration of revisits on one or several elements and enables "star-like" browsing to be identified: in the example in Figure 1, in both sessions N=12, n=9, and r=0.75, but c is different (3 vs. 1).
Figure 1. Concentration of revisits
We also build indicators that take into account periods spent on each element visited, whilst considering browsing sequences:
- T: total duration of the session; average and mean periods spent on each step;
- T1: the time spent on elements of the session seen once;
- $d = T1 / T$: proportion of the time spent on elements seen once in the entire session. This indicator is close to the linearity rate r, but is applied to duration: it is equal to 1 if the session is linear, and 0 if it is not at all linear.
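The topological and temporal indicators of this section can be sketched in a few lines. The code below is only an illustrative reading of the definitions N, n, r, R, c, T, T1 and d, not the authors' implementation: the function name `session_indicators`, the representation of a session as two parallel lists, and the conventions chosen for the undefined cases (c when no element is revisited, d when the total duration is zero) are assumptions.

```python
from collections import Counter


def session_indicators(steps, durations):
    """Compute the sequence indicators of one session.

    steps: visited elements (pages or sites) in browsing order.
    durations: time spent on each step, same length as `steps`.
    """
    if not steps or len(steps) != len(durations):
        raise ValueError("steps and durations must be non-empty and aligned")
    counts = Counter(steps)
    N = len(steps)                                   # length of the session
    n = len(counts)                                  # unique elements
    R = sum(1 for k in counts.values() if k > 1)     # elements seen more than once
    T = sum(durations)                               # total duration
    # Time spent on elements that appear exactly once in the session.
    T1 = sum(t for s, t in zip(steps, durations) if counts[s] == 1)
    return {
        "N": N,
        "n": n,
        "r": n / N,
        "R": R,
        # Assumption: c is set to 0.0 when nothing is revisited (R = 0).
        "c": (N - n) / R if R else 0.0,
        "T": T,
        "T1": T1,
        # Assumption: d is left undefined (None) when the session has no duration.
        "d": T1 / T if T else None,
    }


# Two hypothetical sessions with N=12 and n=9, in the spirit of Figure 1:
# revisits concentrated on one element versus spread over three elements.
star = list("ABACADAEFGHI")
spread = list("ABACDCEFEGHI")
star_ind = session_indicators(star, [10] * 12)
spread_ind = session_indicators(spread, [10] * 12)
```

Both hypothetical sessions share the same N, n and r, and differ only in c (3 for the star-like session, 1 for the other), which is the contrast the concentration indicator is meant to capture.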
We also wanted qualitative information on the way that pages are revisited. Particularly, we wanted to measure the use of the browser Back function. To do this, we identified Back sequences and isolated them from the rest of the session. We therefore produced a new series of indicators relative to the use of Back and to the sessions where Back was removed:
- B: number of Back-type sequences, whatever their length;
- Nb: the length of the browsing path (number of steps) once the Back sequences have been removed;
- $b = Nb / N$: proportion of the Back actions in the total number of steps in the session. The closer the indicator approaches to 0, the more space Back actions take in the session.
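The text does not specify how Back sequences are detected in the logs, so the following sketch relies on an explicit assumption: a step counts as a Back action when it returns to the element located just before the current one on the browsing path (as in A, B, A), and consecutive Back steps form a single Back-type sequence. The function name `back_indicators` and the reading of Nb as N minus the number of Back steps are likewise hypothetical.

```python
def back_indicators(steps):
    """Compute B, Nb and b for one session under a stack-based Back model.

    steps: visited elements (pages or sites) in browsing order.
    Assumption: returning to the previous element on the path is a Back step.
    """
    if not steps:
        raise ValueError("steps must be non-empty")
    path = []            # current browsing path, used like a history stack
    back_steps = 0       # total number of steps identified as Back actions
    B = 0                # number of Back-type sequences, whatever their length
    in_back_run = False
    for element in steps:
        if len(path) >= 2 and element == path[-2]:
            # The user returns to the element before the current one: Back.
            path.pop()
            back_steps += 1
            if not in_back_run:
                B += 1
            in_back_run = True
        else:
            path.append(element)
            in_back_run = False
    N = len(steps)
    Nb = N - back_steps  # path length once Back steps are removed (assumption)
    return {"B": B, "Nb": Nb, "b": Nb / N}


# Hypothetical session: forward A, B, C, then two Back steps, then D.
example_back = back_indicators(["A", "B", "C", "B", "A", "D"])
```

In this hypothetical session, the two consecutive returns (to B, then to A) are counted as one Back-type sequence of length two, so B=1, Nb=4 and b is about 0.67; a session without any Back step gives b=1.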
Quantifying Back-type actions is interesting for two reasons. At the page level, it corresponds to the use of one of the browser functions and shows a way of using the interface and the progress of the browsing. At the site level, the correspondence with the browser function is only relevant if one single page is seen on each site in the back-and-forth sequence; it refers less to a function of the user interface than to a stronger identification of pivotal sites within star-type browsing. We calculated these indicators on both page-level and site-level representations of the sessions, and used them to classify Web sessions according to their dynamics, as we will show in the next section.
5. BROWSING DYNAMICS
5.1 From Browsing 1.0 to Browsing 2.0
The Web is at the center of PC uses at home: in 2005-2006, it represented 43% of individual PC usage time on average. In 2005-2006, French Internet users spent almost four hours per week browsing the Web on average. This time budget devoted to the Web has changed significantly since 2000, as evidenced by comparing the 2005-2006 data with those of the 2000-2002 panel.
Figure 2. Mean number of sessions per user, 2000-2006
Between 2002 and 2005, the average number of Web sessions per user more than doubled (see Figure 2). Meanwhile, Internet users spend more time surfing in 2006 than five years earlier, from less than one hour per week at the beginning of 2000, to more than two hours in late 2002 and four hours in 2005/2006. This trend also has implications for the conduct of the sessions themselves, with a decline in the length of sessions, dropping from 45 minutes in 2002 to half an hour three years later (see Figure 3).
Figure 3. Mean session duration, 2000-2006 (in minutes)
This trend reflects the profound evolution of the Internet over the last six years: after a phase of ICT adoption dominated by low-speed access, with an increase in the length of sessions over the period 2000-2002, Web practices are now stabilized and inserted in daily practices, grounded on widespread broadband access (90% in December 2006). At the same time, the contents and services available on the Web have changed too: we can refer to the Web 2.0 sites, but also to the maturity and diversification of the other sites offering online shopping, audio and video services, geographical research, etc.
We find a corresponding evolution in the form and dynamics of Web sessions. We established a classification of Web sessions in five groups based on their topology, i.e. the topological indicators described above. We have projected the 2005-2006 sessions onto this classification, in order to see changes in the groups’ distribution in 2006 (see Table 3).

Table 3. Five session profiles built in 2000-2002 and projection on 2005-2006 sessions

| Profile | Description | 2000/2002 proportion | 2005/2006 projection |
|---|---|---|---|
| Lightning sessions | Fully linear, 1-2 sites, 1-3 min. | 15% | 19% |
| Targeted sessions | Linear at site level, 1-2 sites, some few back actions at page level | 22% | 14% |
| Backbone sessions | 3-10 sites, 4-13 min, a backbone of visited sites with few “side steps” | 22% | 21% |
| Hub sessions | Low linearity, >5 sites, >35 min, back actions concentrated on a small set of sites | 29% | 32% |
| Scattered sessions | Very low linearity, lots of back actions, >35 min, >10 sites | 13% | 13% |
Some continuity seems to emerge between 2000 and 2006, since the proportion of each session profile hardly changed between the two periods. The only notable difference concerns the lightning sessions, whose share rose from 15% to 19%, to the detriment of targeted sessions. This evolution, parallel to the overall reduction in the average duration of sessions, suggests an increase in occasional uses of Web resources, supported by the spread of broadband Internet access. However, are these five session profiles, built on 2000-2002 data, still valid after four years of evolution of the Web and of Internet users’ practices? If we conduct a new classification of sessions on the same topological variables with the 2005-2006 data, we get some interesting differences in the construction of the groups, which show that the five 2002 profiles are not quite suited to describing the sessions of 2006. Firstly, for optimal group coherence in 2006, there are not five but four profiles (see Table 4).

Table 4. 2005-2006 Web sessions distribution in the 2002 and 2006 session profiles
Rows: 2000-2002 profiles; columns: 2005-2006 profiles.

| 2000-2002 profiles | lightning | targeted | backbone | exploratory | Total |
|---|---|---|---|---|---|
| lightning | 17% | 1% | | | 19% |
| targeted | | 14% | | | 14% |
| backbone | | | 21% | 1% | 21% |
| hub | | | 15% | 18% | 32% |
| scattered | | | 1% | 13% | 13% |
| Total | 17% | 14% | 37% | 31% | 100% |

Within these four groups, two are almost identical to those built in 2002: the simple and direct ones, lightning sessions (17%) and targeted sessions (14%). The changes have occurred in longer sessions and more complex Web activity. These complex sessions are split into two groups in 2005-2006: on the one hand, a category of backbone sessions (37%) is structurally similar to the 2002 profile but lengthens in time and in number of sites visited. On the other hand, a new category of exploratory sessions (31%) emerges, where the user often comes back to previous steps, visits many sites (more than ten in 70% of these sessions vs. 28% overall) and devotes much time to navigation (more than half an hour).
Therefore, the transformation of navigational practices between 2000-2002 and 2005-2006 has induced a stronger opposition between simple routine sessions and long, complex ones. The global shortening of Web sessions benefits one-off uses of Web resources: lightning sessions, the shortest and most direct ones, constituted 15% of sessions in 2000-2002 and accounted for 19% in 2005-2006. In the meantime, the targeted sessions, still linear but longer, have seen their share drop from 22% to 14%. The other main evolution concerns longer navigation trails: in 2005-2006, when Web sessions last longer than 15 minutes, they tend to organize themselves around a central backbone, with either minimal actions (backbone sessions) or more complicated nested loops and circles (exploratory sessions) around this central axis. The efficiency of search engines certainly contributes to this trend: people now scan fewer result pages from search engines than five years ago, and especially than before the arrival of Google on this market.
The distribution of the four session profiles in 2005-2006 is not the same for all individuals: some prefer straight, short sessions, while others have developed more exploratory behaviors. The orientation of users toward a particular kind of session is strongly linked to their overall Web usage intensity (see Figure 4). As users increase their Web activity, exploratory behaviors tend to occupy a more important place in their navigation.

Figure 4. Individual Web usage intensity and session profiles (2005-2006)
Share of each session profile by quartile of individual Web usage intensity:

| Session profile | 1st quartile | 2nd quartile | 3rd quartile | 4th quartile |
|---|---|---|---|---|
| exploratory | 23% | 25% | 29% | 33% |
| backbone | 35% | 39% | 38% | 37% |
| targeted | 15% | 16% | 15% | 14% |
| lightning | 28% | 20% | 18% | 16% |

5.2 Usage Territories on the Web
These elements on browsing dynamics observed at the session level are complemented by findings at the global level of personal territories on the Web. By “personal territories”, we refer to the Web as it is performed by user visits. Each individual’s browsing activity constitutes his personal corpus of sites: how is it built? How does it evolve over time? How do users’ visits shape this fluctuating usage landscape? When considering each user’s visited sites, we observed that personal Web territories are structured around a small set of sites visited regularly, occupying the majority of browsing time, while most of the sites visited by an individual were never revisited thereafter. These “ephemeral sites” constitute the majority of distinct visited sites (78% in 2000-2002, 73% in 2005-2006), while they occupy only a small share of browsing time (24% in 2000-2002, 19% in 2005-2006).
Finally, for each user, only a small set of sites creates real loyalty and revisitation. However, this concept of loyalty should not be confused with the regularity of visits to a site by a given user: a visit to a site must be linked to a specific activity, and in this context, regularity can be very different from frequency. Thus, a user can always visit the same site for a specific task, but this task may occur only occasionally or rarely: online tax return services provide a typical example of this situation. In contrast, Web users may visit some sites intensively over a short period only, e.g. real estate ad sites. In order to reflect loyalty to sites with a user-centric approach, we propose to distinguish the site span, i.e. the number of days between the first and last visit to a site by a given individual, and the intensity of visits to a site over its “active” time span for the user. We believe that the intensity should not be taken globally over the observation period, but should be relative to the “active” period of the site for a user. Thus, in order to be able to compare individual territories despite the variety of individual practices (e.g. number of sessions, visited sites), we calculate two ratios for each [user-site] couple:
- site span: the number of “active weeks” including an access to the site, divided by the total number of weeks of observation.
- intensity: the number of sessions including an access to the site, divided by the number of sessions over the site span period when the site is “active” for the individual.
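The two ratios defined above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function name `site_ratios`, the input layout (one date and one set of visited sites per session) and the reading of the “active” period as the interval from the first to the last visit are assumptions made for the example.

```python
from datetime import date


def site_ratios(sessions, site, obs_start, obs_end):
    """Return (site_span, intensity) for one [user-site] couple, or None.

    sessions: list of (session_date, set_of_visited_sites) for ONE user
              (hypothetical layout, chosen for this sketch).
    obs_start, obs_end: first and last day of the observation period.
    """
    # Dates of the sessions that include at least one access to the site.
    visit_dates = sorted(d for d, sites in sessions if site in sites)
    if not visit_dates:
        return None

    first, last = visit_dates[0], visit_dates[-1]

    # Assumption: the "active" period runs from the first to the last visit.
    # Durations are counted in days; converting both to weeks (dividing by 7)
    # would cancel out in the ratio.
    active_days = (last - first).days + 1
    observed_days = (obs_end - obs_start).days + 1
    site_span = active_days / observed_days

    # Intensity: sessions including the site, divided by all of the user's
    # sessions that fall inside the active period of that site.
    sessions_in_period = sum(1 for d, _ in sessions if first <= d <= last)
    intensity = len(visit_dates) / sessions_in_period

    return site_span, intensity


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_sessions = [
        (date(2005, 1, 1), {"news.example"}),
        (date(2005, 1, 10), {"news.example", "shop.example"}),
        (date(2005, 1, 20), {"shop.example"}),
        (date(2005, 1, 25), {"news.example"}),
    ]
    print(site_ratios(toy_sessions, "news.example",
                      date(2005, 1, 1), date(2005, 4, 10)))
```

Relating intensity to the active period rather than to the whole observation window is what lets a site used heavily for one month score high on intensity while scoring low on site span.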
The two ratios were discretized into three classes (low / medium / high), with boundaries at 0.05 and 0.1 for the intensity, and at 0.33 and 0.5 for the site span, which roughly correspond to 6 and 9 months. We can then project each individual corpus of visited sites onto ten categories: nine categories crossing intensity and span classes, plus a special category for ephemeral sites visited in one session only.

Table 5. Average structure of Web usage territories in 2005-2006

Number of sites:

| Intensity | Large site span | Middle site span | Low site span |
|---|---|---|---|
| high | 0.8% | 0.4% | 7% |
| middle | 0.5% | 0.3% | 3% |
| low | 5% | 3% | 8% |

Ephemeral sites: 73%

Browsing time:

| Intensity | Large site span | Middle site span | Low site span |
|---|---|---|---|
| high | 32% | 6% | 16% |
| middle | 5% | 1% | 3% |
| low | 10% | 3% | 4% |

Ephemeral sites: 19%

Key: on average, 0.8% of the Web sites visited by an individual are visited intensively and over a large period of time; these sites represent 32% of the global browsing time of the user.

This crossing of intensity and site span reflects the structure of individual usage territories on the Web. This map of usage territories is particularly interesting when comparing the number of sites in each category and the time spent on these sites (see Table 5). On average, in 2005-2006, the time spent on the Web was mainly devoted to the categories of sites involving intensive visits (54%) and/or high span (47%), while they concern less than 10% of the visited sites. Three areas of this map deserve special attention:
- routine sites involve both wide span and high intensity (0.8% of sites, 32% of the browsing time). At the global level, the Web sites found in this category are major Internet players, including Web service providers: Google, MSN, French ISP sites (Orange, Free, etc.), and popular e-commerce Web sites (eBay, CDiscount). However, when examining routine sites individual by individual, we discover that aside from these widely used sites, each user also includes in his routine behaviors some specific sites, such as online trading, blogs, or community news sites.
- seasonal sites involve wide span and low intensity of visits (5% of the sites, 10% of the browsing time). This category refers mainly to Web services whose use is inherently occasional, e.g. PagesJaunes (directory enquiries), SNCF (French railway), or Amazon, but also to some content-oriented sites such as Wikipedia, AlloCine (movie trailers) or Doctissimo (medical information).
- transient sites are visited intensively over a short period (7% of the sites, 16% of the browsing time). They correspond to a targeted need or search on the Web: they are not “active” for more than a month, but are present in more than half of the sessions in that period. Unlike the two previous categories, sites associated with this kind of territory are very diverse, and refer to a wide variety of Web content and services, depending mostly on individual tastes and needs.
We have seen that the transformation of browsing practices between 2000-2002 and 2005-2006 was accompanied by a doubling of the browsing time and of the number of sessions per user. Can we find a corresponding evolution in the structure of individual usage territories on the Web? We reproduced our usage territory categorization method on the 2000-2002 data, but given the importance of the duration of observation in our calculation, we limited the analysis to a similar 19-month period, from April 2001 to October 2002 (see Table 6 below). The comparison between the two periods reveals a surprising result: although the average number of sites seen by a visitor was multiplied by 2.4, Web usages in 2006 were more concentrated on routine sites. The routine sites gathered a quarter of browsing time in 2001-2002, and a third in 2006; their number also increased from 3.7 to 9.6 on average. The seasonal sites followed a similar evolution, from 3 to 10% of browsing time on average. Meanwhile, the low-span transient sites decreased from 25% to 16% of browsing time.
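The projection of one individual's visited sites onto the ten categories, and the per-category shares of sites and browsing time of the kind reported in Table 5, can be sketched as follows in Python. The thresholds come from the text; everything else is an assumption made for illustration: the function names, the input record layout, the use of the same three labels ("low", "middle", "high") for both ratios, and the choice to put a value lying exactly on a boundary into the upper class.

```python
from collections import defaultdict

# Class boundaries given in the text.
INTENSITY_BOUNDS = (0.05, 0.1)
SPAN_BOUNDS = (0.33, 0.5)


def level(value, bounds):
    """Map a ratio to 'low', 'middle' or 'high'."""
    lower, upper = bounds
    # Assumption: a value exactly on a boundary goes to the upper class.
    if value < lower:
        return "low"
    if value < upper:
        return "middle"
    return "high"


def categorize(site_span, intensity, n_sessions):
    """Return 'ephemeral' or an (intensity_class, span_class) pair."""
    # Ephemeral sites are those visited in one session only.
    if n_sessions == 1:
        return "ephemeral"
    return (level(intensity, INTENSITY_BOUNDS), level(site_span, SPAN_BOUNDS))


def territory_structure(sites):
    """Share of sites and of browsing time per category, for ONE user.

    sites: list of dicts with keys 'site_span', 'intensity', 'n_sessions'
           and 'seconds' (hypothetical record layout for this sketch).
    Returns {category: (share_of_sites, share_of_time)}.
    """
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for s in sites:
        cat = categorize(s["site_span"], s["intensity"], s["n_sessions"])
        counts[cat] += 1
        seconds[cat] += s["seconds"]

    total_sites = len(sites)
    total_seconds = sum(seconds.values())
    return {
        cat: (
            counts[cat] / total_sites,
            seconds[cat] / total_seconds if total_seconds else 0.0,
        )
        for cat in counts
    }
```

With this labelling, `("high", "high")` corresponds to routine sites, `("low", "high")` to seasonal sites and `("high", "low")` to transient sites. The sketch returns the structure for a single individual; an average structure over a panel would be obtained by averaging these shares over users.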
Table 6. Average structure of Web usage territories in 2001-2002

Number of sites:

| Intensity | Large site span | Middle site span | Low site span |
|---|---|---|---|
| high | 1.0% | 0.9% | 10% |
| middle | 0.4% | 0.4% | 3% |
| low | 2% | 2% | 4% |

Ephemeral sites: 78%

Browsing time:

| Intensity | Large site span | Middle site span | Low site span |
|---|---|---|---|
| high | 26% | 10% | 25% |
| middle | 3% | 1% | 3% |
| low | 3% | 2% | 2% |

Ephemeral sites: 24%

Thus, within four years of Web practices, the dichotomy between routine usages and exploratory behavior has strengthened. Nevertheless, the steady share of ephemeral sites points out the importance of the Web as a source of discovery and renewal for everyday practices. Indeed, we observe a strong complementarity between the known and the unknown in daily practices. This can be seen at the global level: ephemeral sites are continuously present in browsing behaviors over the 19 months of observation, and should be considered as a constitutive part of Web practices. It is also the case at the session level: routine sites are over-represented in targeted sessions, but are also consulted in exploratory sessions involving the discovery of new contents. We could observe that routine territories involve the well-known Web giants, but also topic-specific sites corresponding to the user’s interests. The detailed observation of these specific sites shows that they all provide dynamic and/or social contents and services. Their attractiveness seems to come from the continuous renewal of their content, as well as their ability to point out new resources on the Web for the user.
6. CONCLUSION
Our study, conducted on a representative French panel over a large period, sheds light on the mechanisms of computer adoption in daily life. Our work is based on the longest user-centric navigation logs to date, collected in 2000-2002 (34 months) and 2005-2006 (19 months) from representative cohorts of Internet users. These unique data allow us to observe, in a fully user-centric approach, the long-term evolution of practices and of navigation dynamics and morphology. We therefore focused on the structure of personal Web territories and the dynamics of their constitution within daily practices. We believe that the usage behaviors and trends observed here in France can be extended to the majority of mature Internet markets such as North America, Western Europe or North East Asia (Japan, South Korea). Indeed, the French Internet market has developed strongly in the past ten years, and was already almost mature at the beginning of the observation. The broadband equipment rate at home in France was 70% in April 2005, and 90% in December 2006.
Previous user-centric studies on Web browsing had found, in the early days of the Web and on short periods of observation, that navigation led users to constantly discover new sites, but rarely to visit them more than once. They clearly distinguished this behavior in the corpus of pages and sites of every user. Our findings from long-term observation of representative panels confirm and refine these initial observations. First, loyalty does not necessarily imply frequency: a user may visit the same sites in a given context, while this context occurs rarely. Booking a train ticket, looking for a job and planning a trip are all activities that, while not common, can be self-similar in context. Thus, our results suggest that, in a user-centric view, some new indicators of loyalty can be set up by distinguishing visiting intensity and site span: any Web site would wish to be visited both intensively and over a long period, but not all sites can aspire to such a position in usage territories, depending on their content. Secondly, in a context of perpetual renewal of sites, content and services, such as Web 2.0 sites, social platforms or Web-based applications, users need to adapt their behavior to this trend and are potentially in a perpetual state of learning and discovery. The structure of personal territories on the Web demonstrates the strong grounding of practices on familiar tools and services, but also the importance of the ephemeral sites. Thus, while “wild surfing” behaviors are rare in practice, they certainly represent the attractive part of Web practices and explain its success. This tension between the known and the unknown appears to be constitutive of Web practices.