Related work

Status of XML diff - a preliminary overview

XML documents can be represented as a tree (attributes are not taken into consideration). An XML differencing algorithm can therefore be represented as a differencing between two tree structures. In general, the difference between two trees (i.e. the delta) is represented as a list of operations that when applied to the first tree, gives as result the second tree. There are several algorithms described in the literature, which have their specific properties. Roughly put and algorithm may be efficient calculating the diff between large trees, while ignoring the number of operations needed. Alternatively, an algorithm may produce quality diff, which is characterized by minimizing the number of operations needed to transform the first tree into the second tree. For the efficient algorithms typically a limited number of operations are considered (e.g. create node, delete node). For the quality algorithms, typically a larger set of operations is considered (e.g. move node).

As far as we could oversee, there is no off-the-shelve differencing tool/library that everybody uses. There is an initiative of Nokia to standardize the xml diff (http://tools.ietf.org/html/rfc5261). And there are several relatively recent implementations floating around (e.g. http://code.google.com/p/fc-xmldiff/).

Although an xml-diff is feasible, a merge of two documents into a third (which is typically what we want) is more complicated. As far as we could oversee there are few stable implementations that do this. There are a few papers on the topic (see below).

Quality diffs are typically better suited for human consumption, but the algorithms are expensive. However, in case of Scenari, fragments are relatively small so performance is not really an issue. Furthermore, an idea could be to use the available schema to improve the quality and/or efficiency of the algorithm. As far as we know this has not been tried before and could be beneficial to the (expensive) move operation.

Similarly, visualization of diffs is an open topic. Typically, it is simply avoided in end-user applications because the results are hard to interpret. A possible idea we have been experimenting with is to present the generated html for two atomic fragments to a user. Since fragments are typically relatively small the generated html is small too and can be presented on a single screen without the need for scrolling. The composite fragments could be visualized using a graph visualization.