08 August

Emergence of Consensus and Shared Vocabularies in Collaborative Tagging Systems

A well-known scientific publication on how tags in social bookmarking services self-organize into "categories" shared by users. The study was conducted on data from the Delicious service in 2009. The quotes below are an excellent basis on which to build and test your own hypotheses.


Classification, on the other hand, “involves the orderly and systematic assignment of each entity to one and only one class within a system of mutually exclusive and non-overlapping classes; it mandates consistent application of these principles within the framework of a prescribed ordering of reality” [Jacob 2004]


Other authors argue that tagging enables users to order and share data more efficiently than using classification schemes; the free-association process involved in tagging is cognitively much more simple than are decisions about finding and matching existing categories [Butterfield 2004].


Again, by stable we do not mean that users stop tagging the resource, but instead that users collectively settle on a group of tags that describe the resource well and new users mostly reinforce already-present tags with the same frequency as they are represented in the existing distribution.
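
One way to make this notion of stability concrete is to compare a resource's tag distribution after the first N assignments with its final distribution. The sketch below is only an illustration, not code from the paper: it uses Kullback-Leibler divergence as the distance measure, and the function names, the smoothing constant, and the toy tag stream are all assumptions.

```python
from collections import Counter
import math


def tag_distribution(tags, vocabulary, smoothing=1e-9):
    """Return relative tag frequencies over a fixed vocabulary.

    The tiny smoothing constant (an assumption of this sketch) keeps
    every probability positive so the logarithm below is defined.
    """
    counts = Counter(tags)
    total = sum(counts[t] + smoothing for t in vocabulary)
    return {t: (counts[t] + smoothing) / total for t in vocabulary}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def stability_trace(tag_stream, step):
    """Divergence of each growing prefix from the final distribution.

    tag_stream is the chronological list of tags applied to one resource.
    """
    vocabulary = sorted(set(tag_stream))
    final = tag_distribution(tag_stream, vocabulary)
    trace = []
    for end in range(step, len(tag_stream) + 1, step):
        prefix = tag_distribution(tag_stream[:end], vocabulary)
        trace.append(kl_divergence(prefix, final))
    return trace


# Hypothetical stream: a noisy start, then users repeat the same mix.
stream = ["art", "chaos", "design"] + ["art", "art", "design", "art"] * 10
trace = stability_trace(stream, step=10)
```

If the distribution is stabilizing in the sense of the quote, the values in `trace` should shrink toward zero as more users tag the resource.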


Mika [2005] addresses the problem of extracting taxonomic information from tagging systems in the form of Semantic Web ontologies. The article extends the traditional model of taxonomies by incorporating a social dimension, thus establishing an essential connection between tagging and the techniques developed in the Semantic Web arena.


Huberman [2006], which also make use of del.icio.us data. They show the majority of sites reach their peak popularity, the highest frequency of tagging in a given time period, within ten days of being saved on del.icio.us (67% in their dataset), though some sites are “rediscovered” by users (about 17% in their dataset), suggesting stability in most sites but some degree of “burstiness” in the dynamics that could lead to cyclical patterns of stability characteristic of chaotic systems.


Therefore, while tags assigned to resources are accurate, their distributions may not be suitable to make a significant impact on search performance. This is somewhat in line with our findings: while tags converge relatively fast to stable power law distributions (cf. Section 2), the top of these distributions may contain common (or obvious) tags. A solution to this problem (also suggested in Heymann et al. [2008]) may be a better mechanism for recommending tags.


The shared tag vocabularies (cf. Section 5 of this article) are not fully-fledged formal Semantic Web ontologies, but they can also be useful structures for many information retrieval applications, even without additional formalization.


When a search is complete and a resource of interest is found, collaborative tagging often requires the user to tag the resource in order to store the result in his or her personal collection. This causes a feedback cycle. These characteristics motivate many systems like del.icio.us and it is well-known that feedback cycles are one ingredient of complex systems [Bar-Yam 2003], giving further indication that a power law in the tagging distribution might emerge.
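
The feedback cycle described here can be illustrated with a toy simulation in which each new assignment either repeats an earlier one or introduces a brand-new tag. This is a generic rich-get-richer sketch, not the model used in the paper; the function name, the `p_new` probability, and the tag labels are assumptions made for illustration.

```python
import random
from collections import Counter


def simulate_tagging(n_events, p_new=0.05, seed=0):
    """Simulate tag assignments with a rich-get-richer feedback loop."""
    rng = random.Random(seed)
    history = ["tag0"]  # every tag assignment so far, in order
    next_id = 1
    for _ in range(n_events - 1):
        if rng.random() < p_new:
            # A user invents a tag nobody has used before.
            history.append(f"tag{next_id}")
            next_id += 1
        else:
            # Picking uniformly from past assignments selects a tag with
            # probability proportional to how often it was already used.
            history.append(rng.choice(history))
    return Counter(history)


counts = simulate_tagging(5000)
ranked = counts.most_common()  # (tag, count) pairs, most used first
```

Plotting the counts in `ranked` against their rank on log-log axes is a quick way to check whether such a loop produces the heavy-tailed shape the authors expect.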


This pattern where the top tags are considerably more popular than the rest of the tags indicates a fundamental effect of the way tags are distributed in individual websites which is independent of the content of individual websites.


However, for virtually all of the sites in the data set considered, the proportion of times a tag from the top 25 positions is used relative to the total number of times that a resource is tagged did stabilize over time. So, while the total number of tags per resource grows continuously, the relative frequency of the tags in the top of the tag distribution compared to those in the long tail does stabilize to a constant ratio.
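
The ratio discussed in this quote is easy to track on your own data. The following is a minimal sketch, assuming a chronological list of tags for a single resource; the function name, the `step` parameter, and the toy stream are hypothetical and not taken from the paper.

```python
from collections import Counter


def top_k_share(tag_stream, k=25, step=100):
    """Share of all assignments that fall on the current top-k tags,
    measured after every `step` assignments."""
    counts = Counter()
    shares = []
    for i, tag in enumerate(tag_stream, start=1):
        counts[tag] += 1
        if i % step == 0:
            top = sum(c for _, c in counts.most_common(k))
            shares.append(top / i)
    return shares


# Toy stream with three tags; k=2 stands in for the paper's top 25.
stream = ["a", "a", "b", "c"] * 50
shares = top_k_share(stream, k=2, step=50)
```

A flat sequence of `shares` values corresponds to the "constant ratio" the authors describe, even though the total number of assignments keeps growing.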


Some nodes are much larger than others which again shows that taggers prefer to use general, heavily used tags (e.g., the tag “art” was used 25 times more than “chaos”).


We acknowledge that this is a restricted definition: in some applications, especially Semantic Web approaches, we would also like to know precisely how these terms are related. This type of structural information is difficult to extract only from tags, given the simple structure of folksonomies. Nevertheless, our approach could still prove useful in such applications: for example, one could construct the set of related terms as a first rough step and then a human expert (or, perhaps, another [semi]-automated method) could be used to add more detail to the extracted vocabulary set.
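
A "first rough step" of this kind could be as simple as counting which tags appear together on the same bookmark. The sketch below is an assumption-laden illustration, not the extraction method from the paper: the function names, the `min_count` threshold, and the sample posts are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(posts):
    """Count how often each pair of tags appears on the same post."""
    pairs = defaultdict(Counter)
    for tags in posts:
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[a][b] += 1
            pairs[b][a] += 1
    return pairs


def related_terms(pairs, seed, min_count=2):
    """Tags that co-occur with `seed` at least `min_count` times,
    most frequent first."""
    return [t for t, c in pairs[seed].most_common() if c >= min_count]


# Hypothetical posts: each inner list is the tag set of one bookmark.
posts = [
    ["complexity", "chaos", "fractals"],
    ["complexity", "chaos", "art"],
    ["complexity", "networks"],
    ["art", "design"],
]
pairs = cooccurrence(posts)
related = related_terms(pairs, "complexity")
```

The resulting list is exactly the kind of raw candidate set that a human expert or a second method would then refine.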


Another potential application is in selecting terms for sponsored search auctions. Some keywords (tags) bring a high value to advertisers, and knowing all the related keywords in a category that people can potentially use in search can be very useful information for an advertiser.


We see the emergence of stable power law distributions as an aspect of what may be seen as collective consensus around the categorization of information driven by tagging behaviors.
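
To check whether a tag distribution of your own looks like a power law, a quick first test is the slope of the rank-frequency curve on log-log axes. The sketch below uses ordinary least squares, which is only a rough estimate and not necessarily the procedure used in the paper; the function name and the synthetic frequencies are assumptions.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    All frequencies must be positive; rank 1 is the most used tag.
    """
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic frequencies that follow f(r) = 1000 * r**-1.2 exactly.
freqs = [1000 * r ** -1.2 for r in range(1, 26)]
slope = loglog_slope(freqs)
```

A roughly straight line with a stable negative slope across snapshots would be consistent with the "stable power law" the authors report; a curved line suggests another kind of distribution.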


Finally, we show that vocabularies extracted from collaborative tagging data can be significantly richer, at least for some domains, than the ones that can be extracted from general search engine query logs.


Another important direction of work would be examining the effects of using specialized sub-communities of users in the study of convergence of tag distributions and resulting information structures, rather than the entire user population as in this article. As shown by Heymann et al. [2008], del.icio.us is not dominated by a small number of core users, but other tagging sites may be. We know relatively little about how user concentration might influence the types of information structures that can be derived from tags. Furthermore, the shared vocabulary used by a specialized subcommunity of users may differ considerably to that of a larger user base.