From: Paul Rohr (paul@abisource.com)
Date: Sat May 18 2002 - 13:40:39 EDT
While I haven't had time to carefully read and think through all the design 
work going on in the "revision marks" threads, I'm happy to see that people 
are seriously engaging with the complexity of the problem. 
The following conceptual overview of how the piece table works might be 
helpful for folks who aren't familiar with it. 
piece table concepts
--------------------
If you dig into the implementation of the piece table, you'll note that it 
includes:
  I. the entire document as originally loaded, 
  II. all new content & formatting added during the edit session, 
  III. an audit history of all the changes, in order, and 
  IV. bookkeeping information to construct a "current" view of the document. 
Of these, the rest of AbiWord spends most of its time dealing with IV.  III 
is created in response to specific edit methods, and is also somewhat 
visible via the undo/redo mechanism that allows you to replay them.  By 
comparison, the first two are pretty low-level and the rest of the AbiWord 
never sees them.  
For example, the drawing code just references IV, as do the exporters.  
revision marks
--------------
AFAICT, the most useful thing about revision marks is the ability to tell 
the difference between two states of a document:
  - what it was like when someone loaded it, and 
  - what it was like when they saved it.  
In short, while you do want to know overall what the diff between those two 
states was -- content that got added, deleted, moved, or reformatted by a 
specific contributor -- you *don't* need to know what all of the specific 
typing glitches in between were.  For example, if an editor retyped a 
sentence fifteen different ways before saving it, that's just not useful 
information.  [Plus which, it's a pain in the ass to keep track of it.] 
the insight
-----------
If you look at how the piece table is constructed in memory, there are 
basically five chunks of memory. 
  - a contiguous buffer of the original document's text (read-only)
  - a contiguous buffer of all new text typed (append-only)
  - the original document's properties (again, read-only)
  - new properties added (again, append-only)
  - a heck of a lot of bookkeeping to tell which parts of each are being 
    used in the current version. 
In short, I believe that this is enough information to detect all of the 
relevant changes at export time. 
caveat
------
The following example just shows how to detect text-level changes.  There's 
definitely an additional level of complexity needed to capture formatting 
changes, too.  
(Indeed, FWIW, I sometimes think that we should take advantage of the fact 
that our file format is text-based, and use a tailored XML-aware version of 
diff to handle all of the low-level work here.)
BTW, *do* people use revision marks to keep track of formatting changes too? 
 What does the UI look like? 
for example
-----------
Say that the original document was:
  The quick brown fox jumped over the lazy dog.
After a few cuts and pastes, plus a little typing, the new document reads:
  The lazy green fox slept next to the quick dog.  
In short, the text stored in the piece table can be divided into the 
following five equivalence classes.  The first three subdivide the content 
from the original document as follows:
1. text that didn't change
  The quick brown fox jumped over the lazy dog.
  ^^^^            ^^^^            ^^^^     ^^^^
2. text that's no longer used
  The quick brown fox jumped over the lazy dog.
            ^^^^^^    ^^^^^^^^^^^^
3. text that got moved:
  The quick brown fox jumped over the lazy dog.
      ^^^^^                           ^^^^
The other two equivalence classes come from the "other half" of the piece 
table, namely:
4.  new text that got used
  sloppieysleppt on top of ofver greenmowvemauve next to
          ^^^^ ^                 ^^^^^           ^^^^^^^
5.  new text that got abandoned
  sloppieysleppt on top of ofver greenmowvemauve next to
  ^^^^^^^^    ^ ^^^^^^^^^^^^^^^^^     ^^^^^^^^^^^
Yes, that's actually what the piece table looks like at the lowest levels.  
[Indeed, it's usually even messier.]  There are two big streams of text:
  - one as imported
  - one of all the characters typed in this edit session
All of the rest of the PT machinery just assembles together the relevant 
pieces of the text as needed.  It's a really, really cool data structure.  
so what?
--------
Take a look at the five equivalence classes in the above example.  When we 
currently persist the file format, the piece table extracts an ordered 
sequence of fragments of text that corresponds to the following equivalence 
classes:
  1. unchanged text
  3. moved text
  4. new text
Even if we're supporting revision marks, there's still no reason to export 
the following equivalence class:
  5. new-and-abandoned
However, we *will* want to export the other four, and differentiate them 
accordingly:
  1. unchanged text
  2. deleted text
  3. moved text
  4. new text
My hope is that when revision tracking is turned on, we can algorithmically 
identify which equivalence class a given frag is in, and then emit and flag 
them, in order, at export time.  For example, the following are easy:
  #4 is used, and comes from the append buffer
  #1 is used, and comes from the read-only buffer
  #2 is *not* used, and comes from the read-only buffer
The first two (#4 and #1) are easy to detect and sequence properly.  #2 can 
be inferred (by a brute force scan, if needed), although getting them 
ordered properly will be more work.  
Thus, the only screw case will be #3.  As with #1 and #4, sequencing them is 
easy -- that's what the PT is already doing for us -- but it's not 
immediately obvious what the best way is to differentiate #3 from #1.  
Export-time alternatives might include:
  - scanning change records, 
  - maintaining a stack to detect reordered frags, or 
  - something more clever.  
To be honest, I forget how Jeff and I actually coded the copy/paste logic.  
It very well may be that equivalence class #3 gets decomposed into a pair of 
add/delete operations.  If so, then this isn't even an issue. 
next steps
----------
The above explanation just addresses whether and how to detect changes 
between the version loaded and the version about to be saved.  It does NOT 
address:
  - how that should be reflected in the file format, NOR
  - how small a subset of CVS's branching & merging features to reinvent. 
As we've seen, the latter can quickly get way too complex.  Still, I hope 
that people find this a useful starting point. 
Enjoy!
Paul
This archive was generated by hypermail 2.1.4 : Sat May 18 2002 - 13:43:12 EDT