Fast and effective hierarchical clustering of XML documents by structure

Ritacco E.
Co-first author
2010-01-01

Abstract

A new parameter-free approach to clustering XML documents by structure is proposed. The idea is to consider various forms of structural patterns occurring in the XML documents in order to form a hierarchy of nested clusters. At each level of the hierarchy, the clusters explain how the XML documents can be grouped on the basis of common structural patterns of the form considered at that level. This explanation is progressively refined at the next level, where another type of structural pattern is used to divide the individual clusters from the level above into subgroups, revealing meaningful structural differences that were not captured before. Each cluster in the hierarchy is summarized through a novel technique into a corresponding representative, which provides a clear and differentiated understanding of the structural information within the cluster. A comparative evaluation conducted over both real-world and synthetic XML data demonstrates the quality of the results of the devised approach in terms of effectiveness and cluster summarization.

Use this identifier to cite or link to this document: https://hdl.handle.net/11390/1248970