|
Our work during the two-month stay in the Knowledge and Database
Systems Laboratory in National Technical University of Athens is
mainly on improving the paper "Clustering XML Documents by
Structure".
We read quite a lot of papers on this topic. New ideas came into
our mind when reading these papers. After discussion, we agreed
on to make some modifications to the original paper, such as adding
the classifying part and etc. After two-month's work, the improvements
we made are as follows:
1. Add the pre-processing part. This procedure results in great
reduction of computing time and also improves the clustering accuracy.
2. Reduce the complexity of Tree Edit Algorithm. The original algorithm
from Selcow works in exponential time. Our new approach works in
quadric time. This also results in a dramatic reduction in computing
time.
3. Change the stopping rule. The original stopping rule was from
intuitive. We applied the C-index stopping rule that is found in
L. J. Hubert and J. R. Levin's paper. This approach proved to be
effective.
4. Add the classifying part. This part mainly deals with the problem
upon the arrival of new XML documents. The K-NN algorithm we use
proved to be effective.
5. Add the mapping clusters with DTDs part. In this part, we use
an application DDbE to extract a DTD for every derived cluster and
then use a DTD parser to get a tree structure for every DTD. By
using the Tree Edit Algorithm, we are able to map the clusters with
the original DTDs.
In the last week, we together worked on writing a technical report
on the whole project. We described the whole project in detail and
gave out the experimental results and our explanation. The name
of the technical report is "Clustering and Classifying XML
Documents"
My work during the stay in the lab mainly covers the following
parts.
1. Get the idea of pre-processing after discussion with Theodore
and Klaas. Implement the pre-processing algorithm that is comprised
of two parts: nesting reduction and repetition reduction.
2. Detect the complexity of the original Tree Edit Algorithm. Implement
the new algorithm that reduces the complexity from exponential to
quadric.
3. Work together with Klaas to implement the clustering algorithm.
More specially, I finished the part to get a Minimum Spanning Tree
from the original graph.
4. Using the application DDbE to derive DTD for every cluster.
5. Using the DTD parse to get a tree structure for a given DTD.
6. Run experiments. More specifically on Timing analysis and evaluating
the accuracy of clustering.
7. Work together with Klaas to write the technical report. "Clustering
and Classifying XML Documents"
The following graph shows the whole project. And under the graph,
we give a short description for every single part.
Clustering and Classifying XML Document
Parse XML file:
We use XML SaxParser to parse the XML files. The Parser constructs
a tree structure for every XML document.
Pre-processing:
First phase: Deal with the nesting reduction problem.
Second phase: Deal with the repetition reduction problem.
After these 2 phrases, we get a summary tree for every XML document.
This summary tree will better represent the structure rules the
XML document conforms to. Because we would like to cluster by similar
DTDs this summary tree will generate more correct results.
Calculate similarity distance between summary trees:
We are now applying Andres's algorithm to calculate the similarity
distance between summary trees. (The only difference is that we
disallow the tree insertion and deletion operation.) The complexity
of this algorithm is quadric.
Clustering algorithm
We use the Single Link Clustering algorithm. And then we apply
the C-index stopping rule to get the right number of clusters.
Classification Algorithm
We use the K-nearest-neighbor algorithm. This means that we take
K XML files from each cluster and compare them to the new file.
After this we rank all the XML files and choose the cluster that
the highest rankings.
Evaluation of clustering results
Here we try to specify a good method for evaluating our clustering
results.
Extract DTD
We will use a software package DDBEV that can generate a DTD that
will conform to all XMLs in each cluster.
DTD Parser
After we have 1 DTD for each cluster we will extract a tree structure
from this with a java program we found on the Internet. Generating
this tree we will loose information, for example the | , * and ?
operations. Because the structure weighs more on the similarity
instead of the operations this will hopefully not be a problem.
Compare Original and Cluster DTD Trees
Using this simplified tree we can now also compare the original
DTD and the DTD from each cluster using our Tree Edit Distance Algorithm.
|