Home Welcome Introduction Offer Exchange Student Report Member Cooperation Contact UsBBS
Work Report
TAO CHENG
Zhejiang University/Computer Science


Our work during the two-month stay in the Knowledge and Database Systems Laboratory in National Technical University of Athens is mainly on improving the paper "Clustering XML Documents by Structure".

We read quite a lot of papers on this topic. New ideas came into our mind when reading these papers. After discussion, we agreed on to make some modifications to the original paper, such as adding the classifying part and etc. After two-month's work, the improvements we made are as follows:

1. Add the pre-processing part. This procedure results in great reduction of computing time and also improves the clustering accuracy.

2. Reduce the complexity of Tree Edit Algorithm. The original algorithm from Selcow works in exponential time. Our new approach works in quadric time. This also results in a dramatic reduction in computing time.

3. Change the stopping rule. The original stopping rule was from intuitive. We applied the C-index stopping rule that is found in L. J. Hubert and J. R. Levin's paper. This approach proved to be effective.

4. Add the classifying part. This part mainly deals with the problem upon the arrival of new XML documents. The K-NN algorithm we use proved to be effective.

5. Add the mapping clusters with DTDs part. In this part, we use an application DDbE to extract a DTD for every derived cluster and then use a DTD parser to get a tree structure for every DTD. By using the Tree Edit Algorithm, we are able to map the clusters with the original DTDs.

In the last week, we together worked on writing a technical report on the whole project. We described the whole project in detail and gave out the experimental results and our explanation. The name of the technical report is "Clustering and Classifying XML Documents"

My work during the stay in the lab mainly covers the following parts.

1. Get the idea of pre-processing after discussion with Theodore and Klaas. Implement the pre-processing algorithm that is comprised of two parts: nesting reduction and repetition reduction.

2. Detect the complexity of the original Tree Edit Algorithm. Implement the new algorithm that reduces the complexity from exponential to quadric.

3. Work together with Klaas to implement the clustering algorithm. More specially, I finished the part to get a Minimum Spanning Tree from the original graph.

4. Using the application DDbE to derive DTD for every cluster.

5. Using the DTD parse to get a tree structure for a given DTD.

6. Run experiments. More specifically on Timing analysis and evaluating the accuracy of clustering.

7. Work together with Klaas to write the technical report. "Clustering and Classifying XML Documents"

The following graph shows the whole project. And under the graph, we give a short description for every single part.

Clustering and Classifying XML Document

Parse XML file:

We use XML SaxParser to parse the XML files. The Parser constructs a tree structure for every XML document.

Pre-processing:

First phase: Deal with the nesting reduction problem.
Second phase: Deal with the repetition reduction problem.
After these 2 phrases, we get a summary tree for every XML document.

This summary tree will better represent the structure rules the XML document conforms to. Because we would like to cluster by similar DTDs this summary tree will generate more correct results.

Calculate similarity distance between summary trees:

We are now applying Andres's algorithm to calculate the similarity distance between summary trees. (The only difference is that we disallow the tree insertion and deletion operation.) The complexity of this algorithm is quadric.

Clustering algorithm

We use the Single Link Clustering algorithm. And then we apply the C-index stopping rule to get the right number of clusters.

Classification Algorithm

We use the K-nearest-neighbor algorithm. This means that we take K XML files from each cluster and compare them to the new file. After this we rank all the XML files and choose the cluster that the highest rankings.

Evaluation of clustering results

Here we try to specify a good method for evaluating our clustering results.

Extract DTD

We will use a software package DDBEV that can generate a DTD that will conform to all XMLs in each cluster.

DTD Parser

After we have 1 DTD for each cluster we will extract a tree structure from this with a java program we found on the Internet. Generating this tree we will loose information, for example the | , * and ? operations. Because the structure weighs more on the similarity instead of the operations this will hopefully not be a problem.

Compare Original and Cluster DTD Trees

Using this simplified tree we can now also compare the original DTD and the DTD from each cluster using our Tree Edit Distance Algorithm.


 

Copyright©2002 IAESTE CHINA All Rights Reserved Contact us at:webmaster@iaeste-china.org