Reference Management in Silva (done)

Note: this is an old text written by Martijn Faassen in 2004. The underlying infrastructure has improved since then but it's still good food for thought.

Goals

In this section, we identify goals, constraints and questions for the reference management system.

  • When an object is moved, all references to it should still be correct.
  • A user should be able to quickly identify broken references in a document. As soon as a broken reference is created, this should be indicated. Broken references should however still be accepted, as the user could correct matters shortly.
  • A user should be able to easily see whether an object is being referenced, so that the user can decide not to delete/close a document after all.
  • A user should be able to get an overview of all objects that reference the current object.
  • A user should be able to get an overview of all objects being referenced by the current object, and perhaps an easy UI to modify them.
  • Reference management should work with workflow and the versioning system. The details of what this requirement means are still to be worked out. Are references to versions or to versioned content objects?
  • We prefer readable hyperlinks to opaque links containing long numbers in public pages as well as in the editor.
  • Reference management should work with relative links.
  • Ghosted content in a ghost folder should ideally be referencing other ghosted content in the same folder, not the original objects.
  • Reference management should work with caching. A potential problem is the caching of references that once worked but now are broken.
  • Reference management should work with security; no information should be exposed that a user with a role cannot access.
  • What does it mean to place a reference to a container object?
  • What does automatic reference generation, such as that done by the Table of Contents, mean in this system?
  • Preview-mode links should ideally point to other preview-mode pages, not public pages.
  • Provide a good upgrade-path from current unmanaged Silva to ref-managed Silva.
  • Recover well from situations where references in the database have become broken for arbitrary reasons.
  • Tackle issues concerning previewing versus viewing content. In 'preview mode', all references should point to other preview versions, if possible. The exact semantics of this needs to be decided, but if all references are under control of the reference management system this should at least be implementable.
  • How does the reference management system interact with (deep) virtual hosting? A reference that points outside a virtual host cannot be made to work unless special measures are taken. What will we do?
  • A common case is where a copy of a tree is made as a backup. For instance, an organization may want to copy the current guide and call the copy 'Guide 2004'. Further edits will be made in the original. This operation itself would not have problems (as long as internal links in the copy still work). Another operation would be to leave the original version alone and create a new copy to work in. In this case, the organization will want to update the links in the site to the new copy, either immediately or at a later date. Facilities need to be made available to make this possible.
  • Support must be given for URL stability. While a reference management system makes it easy to change around the internal links in a site, people will still want to bookmark pages and link into them from other sites. People will want to make sure that URLs don't break even if large sections of the site are moved or copied to new versions. This can be tackled by leaving ghosts or ghost folders in place that always point to the "current" version of the information.
  • Thought needs to be given to the management of references outside the site; i.e. external URLs.
  • Do we need support for explicit breaking of references? In some cases, we want to move an object away, knowing that we'll place a new object in the original place. We want the references to point to that. We could support this either by breaking the references and having them somehow be reestablished by the new object, or by having a facility in the UI to make these references point to the new object.
  • We need to think about what references look like in the XML representation. Do we save the reference ids or the paths, or both? Saving the reference ids and importing them again leads to problems, as it could lead to duplicate unique ids, as well as to broken references or, even worse, references to strange places based purely on accidental similarities in site-unique ids.

System Components

In this section, we will identify (potential) components that are needed to support the reference management system.

Reference database

A reference database is needed that maps which object is referencing what. It should be possible to quickly identify all objects that are referencing a particular object.

This database should be persistent, as it should not be gone when the system restarts. It should also be ZEO-shareable.

It seems all right to share this database between multiple instances of Silva in a Zope. It could even be external to Silva altogether, in for instance a relational database or REST web system.

In addition, it may be useful to have a unique id <-> path database, depending on what approach we take.
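
As an illustration of the lookup requirement above, here is a minimal in-memory sketch of a two-way reference index. The class name ReferenceIndex, its methods and the string ids are all hypothetical; a real implementation would need to be persistent and ZEO-shareable, which this sketch does not attempt.

```python
from collections import defaultdict


class ReferenceIndex:
    """In-memory sketch of a two-way reference index keyed by unique id."""

    def __init__(self):
        self._outgoing = defaultdict(set)  # source id -> ids it references
        self._incoming = defaultdict(set)  # target id -> ids referencing it

    def add(self, source_id, target_id):
        self._outgoing[source_id].add(target_id)
        self._incoming[target_id].add(source_id)

    def remove(self, source_id, target_id):
        self._outgoing[source_id].discard(target_id)
        self._incoming[target_id].discard(source_id)

    def references_from(self, source_id):
        return sorted(self._outgoing.get(source_id, ()))

    def references_to(self, target_id):
        return sorted(self._incoming.get(target_id, ()))


index = ReferenceIndex()
index.add("doc-a", "doc-c")
index.add("doc-b", "doc-c")
print(index.references_to("doc-c"))    # ['doc-a', 'doc-b']
print(index.references_from("doc-a"))  # ['doc-c']
```

Keeping both directions means "what references me" can be answered without scanning every object, at the cost of updating two mappings on each change.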

Event system

To keep references up to date in the light of the host of events in the system that could modify them, an event system is the most robust approach. With an event system, the place in the code where events are sent can be decoupled from the actions taken in response. This allows extensibility.

Proposed is the use of the Zope 3 event system architecture, as exposed by Five.
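
The decoupling described above can be sketched with a tiny publish/subscribe registry. This is plain Python written for illustration and is not the Zope 3 or Five API; the names ObjectMoved, subscribe, notify and update_path are assumptions made for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ObjectMoved:
    """Hypothetical event: an object changed location."""
    object_id: str
    old_path: str
    new_path: str


_subscribers = defaultdict(list)  # event class -> list of handlers


def subscribe(event_type, handler):
    _subscribers[event_type].append(handler)


def notify(event):
    # The sender only announces the event; it knows nothing about handlers.
    for handler in _subscribers[type(event)]:
        handler(event)


paths = {"42": "/site/old/page"}


def update_path(event):
    paths[event.object_id] = event.new_path


subscribe(ObjectMoved, update_path)
notify(ObjectMoved("42", "/site/old/page", "/site/new/page"))
print(paths["42"])  # /site/new/page
```

The code that moves an object only calls notify; an extension can register further handlers without that code changing.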

Link integrity during entry

Currently the editors have a 'liberal' link system, that allows the entry of links to relative paths that only work due to acquisition; these wrong links are actually stored and "corrected" during display.

Something could be worked out so that only links that truly reference something in the system can be stored. This way, link integrity in the database can be established.
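
One way such a check might look is sketched below: a link is resolved strictly relative to the document's container and accepted only if that exact path exists. The function resolve_strict and the set of known paths are hypothetical stand-ins for a lookup in the real object tree.

```python
import posixpath


def resolve_strict(known_paths, document_path, link):
    """Resolve link relative to the document, without any acquisition.

    Returns the absolute path if it names an existing object, else None.
    """
    base = posixpath.dirname(document_path)
    target = posixpath.normpath(posixpath.join(base, link))
    return target if target in known_paths else None


known = {"/site/docs/a", "/site/docs/b", "/site/images/logo"}

print(resolve_strict(known, "/site/docs/a", "b"))            # /site/docs/b
# With acquisition this link would be found at /site/images/logo;
# the strict check reports it as unresolved instead.
print(resolve_strict(known, "/site/docs/a", "images/logo"))  # None
```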

User interface

In this section, we will identify (potential) UI aspects that will be affected by the reference management system.

References screen

Each content object will gain a references screen that will show the list of references to this object, as well as the references that this object makes to other objects.

Is this screen per version or per object? For outgoing references per-version may make more sense, for incoming references per-object may be better.

Publication/contents screen

It should be easy to see whether a content object has references to it, to help editors in their decision to close or delete objects.

Editor hints

Whenever a broken reference is created, it would be nice to display this directly in the editor (different link style). This is hard for client-side editors such as Kupu however, so perhaps a 'check references' button would be easier to implement.

Do we add this ability to both Kupu and the forms editor or only Kupu?

Public rendering

Ignoring caching problems for the moment, it should be possible to render broken links differently than actual functional links. This way, these broken links could be made 'non-clickable', instead of what we do now, which is to "correct" them to avoid cycles that confuse web crawling robots.

Other gains

Besides avoiding broken references in a site, a reference management system has other potential gains.

Smarter caching

An event system could help in making pages more cacheable. If events are sent out for any structural change, pages with a table of contents or links in the layout templates could conceivably become permanently cacheable, as explicit invalidation messages could then be used to notify a caching system such as Squid to stop caching a page.

Ranking system

A reference management system could be mined for 'ranking' objects, similar to the way Google's PageRank works. This could help with search results.
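
As a rough illustration, the simplest possible ranking just counts incoming references. This is much cruder than PageRank, which also weighs the rank of the referencing pages; the function name and the (source, target) pair format are assumptions for this sketch.

```python
from collections import Counter


def rank_by_incoming(references):
    """Order targets by how many references point at them."""
    counts = Counter(target for _source, target in references)
    return counts.most_common()


references = [("a", "c"), ("b", "c"), ("a", "b")]
print(rank_by_incoming(references))  # [('c', 2), ('b', 1)]
```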

Some designs

Below we will sketch out some designs to solve particular problems in the system.

Relative linking

When a copy of a container tree is made, it can contain references that point to other objects in it, and it can also contain references that point to objects outside it. If a normal reference update strategy is taken, the inside-linking references will in the copy now point to the original. This is probably an undesired behavior. Inside-linking references should point inside the copy. This means that the references are actually pointing at a new object.

In Silva now, this problem is solved by using relative hyperlinks for such linking. These links in a copy of a tree will "automatically" point to the copied objects. In a version of Silva where reference management is in effect, we would like these relative links to be updated automatically if what they reference actually moves by itself, so these links need to be under control of the reference management system.

An algorithm for this could be as follows:

  • Copy original onto copy. New references are made in copy for any reference that was in original.
  • Find all references in copy that point inside the original.
  • For all these references, calculate equivalent object in copy. This can be done by calculating the relative path into the original and then traversing it into the copy. Change references in copy to point to the new location.

Now, all references in the copy should be pointing to the right thing again. An efficient implementation of this algorithm requires a fast way to find all references inside a container.
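
The steps above can be sketched as follows, assuming for illustration that references are stored as a plain mapping from source path to a list of target paths. The function names and the example paths are hypothetical.

```python
import posixpath


def is_inside(path, root):
    """True if path is root itself or lies below it."""
    return path == root or path.startswith(root.rstrip("/") + "/")


def copy_references(references, original, copy):
    """Return the references for the copied tree, retargeting inside links.

    references maps a source path to the list of paths it references.
    """
    copied = {}
    for source, targets in references.items():
        if not is_inside(source, original):
            continue
        # Step 1: every reference in the original gets a twin in the copy.
        new_source = posixpath.join(copy, posixpath.relpath(source, original))
        new_targets = []
        for target in targets:
            if is_inside(target, original):
                # Steps 2 and 3: take the path relative to the original
                # and traverse it from the root of the copy instead.
                relative = posixpath.relpath(target, original)
                target = posixpath.normpath(posixpath.join(copy, relative))
            new_targets.append(target)
        copied[posixpath.normpath(new_source)] = new_targets
    return copied


refs = {
    "/site/guide/intro": ["/site/guide/chapter1", "/site/news"],
    "/site/about": ["/site/guide/intro"],
}
result = copy_references(refs, "/site/guide", "/site/guide-2004")
print(result)
# {'/site/guide-2004/intro': ['/site/guide-2004/chapter1', '/site/news']}
```

The reference to /site/news points outside the copied tree and is left alone; only the inside-linking reference is retargeted. This sketch scans every reference, which is exactly the cost that a fast "all references inside a container" query would avoid.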

Open question: what should be the relative referencing policy for Ghosts and Ghost Folders? What does the user expect?

Preview mode

Silva objects come in multiple versions. In "preview mode", the non-published version (if any) will be displayed for all objects. All links in these objects will be to other non-published versions, if any, so that a user in preview mode will stay in preview mode wherever he goes. Even when no non-published version exists and only a published version is available, the preview rendering of such a version will only offer links to other previews.

In order to accomplish this, link rendering needs to be different in preview mode than in normal public mode. Once you are on a preview display of an object, any link away from that object should be going to another preview display. Since a reference management system has control over all links in the system, this should be doable.

Ideally all links generated by layout templates should also be under control of this system, as otherwise any click on a link in the layout template will still force an exit from the preview display. It should be made easy for a layout designer to make the layout template do the right thing. An alternative that is simpler is to turn off layout templates in preview mode altogether, perhaps replacing them with a standard layout template. This would make things easier, though at the risk of actually defeating the purpose of preview mode -- in a preview, a user would like to see the site as closely as possible to how it would look when published.

Smarter caching

References can come in multiple types:

  • hard reference. This reference can be broken. It'll update automatically when the referenced object is moved.
  • soft reference. This reference cannot exist in a broken state. It's there because of a table of contents or navigation logic. When the referenced object is moved, the reference is automatically dropped, though it may be reestablished.

Managing soft references is important if we need to build "perfect" smart caching. When an object that is soft-referenced changes its title or moves, all the caches for the objects that reference it (soft or hard) will be invalidated.

An example of a cache that operates on a soft-referencing structure right now in Silva is the sidebar cache in the Silva UI.

A hard-reference is placed whenever a reference is actually made by the user. This reference consists of a hyperlink or ghost object. Soft references are not explicitly made by the user, but are instead made by rules. A table of contents for instance makes automatic soft-references to all published content from its level on up to a certain depth. A navigation template's behavior can also be described in rules.

Rules are active in areas in the site; typically from one location down, until perhaps overridden by another rule deeper down. These rules need to be stored in a per-site rule database that can be efficiently queried. Whenever an operation takes place that could affect a reference, any rules that would fire need to be looked up and be activated. These will then update the reference database, adding or removing soft references where necessary.
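
The "from one location down, until overridden deeper down" lookup can be sketched as a walk from an object's path up towards the root. The function active_rule, the rule names and the dictionary-based rule store are hypothetical; a real rule database would need an efficiently queryable persistent structure.

```python
def active_rule(rules, path):
    """Return the rule set at the deepest location containing path.

    rules maps a location path to a rule name; a rule deeper in the
    tree overrides one set higher up.
    """
    parts = path.strip("/").split("/")
    # Walk from the object's own path up towards the site root.
    for depth in range(len(parts), 0, -1):
        location = "/" + "/".join(parts[:depth])
        if location in rules:
            return rules[location]
    return rules.get("/")


rules = {
    "/site": "toc-depth-2",
    "/site/archive": "no-toc",
}
print(active_rule(rules, "/site/docs/manual"))     # toc-depth-2
print(active_rule(rules, "/site/archive/2003/a"))  # no-toc
```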

Cache invalidation itself could also be implemented as a rule, though this would be a rule that updates the cache database, not the reference database. This way, cache invalidation strategies could be different in different subsections of the site. This is a rather different design strategy though compared to the strategy where the cache invalidation strategy is simple, but where soft-references are maintained.

Algorithm: temporary relative paths

Turn all references into relative paths in the copy. Then turn relative paths back again into unique id references.

Relative paths may also be used to solve the ghost folder problem. If links are rendered as relative links if possible, a ghost folder could just take the rendering of the original, and the link would still resolve to a ghosted object.
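
A minimal sketch of this round trip, assuming a unique id <-> path table is available after the copy has been made. The tables path_of and id_of, the helper names and the integer ids are all made up for this example.

```python
import posixpath

# Hypothetical unique id <-> path tables after the copy has been made.
path_of = {
    1: "/site/guide/intro",
    2: "/site/guide/chapter1",
    3: "/site/guide-2004/intro",
    4: "/site/guide-2004/chapter1",
}
id_of = {path: uid for uid, path in path_of.items()}


def to_relative(source_path, target_id):
    """Express a unique id reference as a path relative to the source."""
    start = posixpath.dirname(source_path)
    return posixpath.relpath(path_of[target_id], start)


def to_unique_id(source_path, relative):
    """Resolve a relative path from the source back to a unique id."""
    start = posixpath.dirname(source_path)
    return id_of[posixpath.normpath(posixpath.join(start, relative))]


# The original intro (id 1) references chapter1 (id 2).
relative = to_relative(path_of[1], 2)
# Resolving the same relative path from the copied intro (id 3)
# yields the copied chapter1 rather than the original.
new_target = to_unique_id(path_of[3], relative)
print(relative, new_target)  # chapter1 4
```

A reference that points outside the copied tree becomes a path such as "../news", which resolves to the same outside object as long as the copy sits at the same depth as the original.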

Algorithm: knowing whether document cache is up to date

A document can currently be cached. In the context of reference management, a reference to document foo/bar can suddenly point to foo/baz. The rendering of the document will need to be invalidated.

A document could know whether its reference rendering is incorrect if the unique-id facility kept a timestamp for the last update of the unique-id. This timestamp could be compared with the timestamp of the creation of the cache.
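
The comparison could look roughly like this, assuming the unique-id facility exposes a last-update time per id. The function cache_is_fresh, the mapping last_updated and the numeric timestamps are hypothetical.

```python
def cache_is_fresh(cache_created, referenced_ids, last_updated):
    """True if no referenced unique id changed after the cache was built.

    last_updated maps a unique id to the time its location last changed.
    """
    return all(last_updated[uid] <= cache_created for uid in referenced_ids)


last_updated = {"foo-bar": 100.0, "news": 50.0}
print(cache_is_fresh(120.0, ["foo-bar", "news"], last_updated))  # True

last_updated["foo-bar"] = 130.0  # foo/bar was moved to foo/baz
print(cache_is_fresh(120.0, ["foo-bar", "news"], last_updated))  # False
```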

Development plan

Phase I: event system

An event system is needed to track all the changes in a Silva site that pertain to reference management. This includes object moves, copies, changes to content, changes to titles and more. Using an event system means that the handling of the event can be loosely coupled from the place where the event is sent, keeping the codebase more clean and allowing code defined in Silva extensions to react to Silva events as well.

Silva will need to be modified so it sends out events.

A system needs to be in place that can listen to these events and store information based on that in a central ZODB-based repository.

The aim is to base this system on the Zope 3 event system, and include it into Five. Event-sending is technology that can be readily included into Zope 2 with Five, and this has in fact already been accomplished. Handling the events so that information can be maintained in the ZODB needs more work however.

Phase II: unique id facility

Based on the event system, we will create a unique id facility. Each Silva object that needs to become managed by the reference management system needs an id that is unique to that Silva instance. When an object is moved, the id should remain the same. Ids can be resolved programmatically to find the object again.

Zope 3 contains a unique id facility as well, but to enable this in Zope 2 it will need some modifications.
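
The behaviour wanted from such a facility can be illustrated with a small registry. This is a sketch written for this text, not the Zope 3 facility; the class UniqueIds and its methods are hypothetical, and the moved method stands in for a handler that would be driven by move events.

```python
import itertools


class UniqueIds:
    """Sketch of a registry mapping stable ids to current paths."""

    def __init__(self):
        self._next = itertools.count(1)
        self._path_by_id = {}
        self._id_by_path = {}

    def register(self, path):
        uid = next(self._next)
        self._path_by_id[uid] = path
        self._id_by_path[path] = uid
        return uid

    def moved(self, old_path, new_path):
        # The id stays the same; only the path it resolves to changes.
        uid = self._id_by_path.pop(old_path)
        self._path_by_id[uid] = new_path
        self._id_by_path[new_path] = uid

    def resolve(self, uid):
        return self._path_by_id[uid]


ids = UniqueIds()
uid = ids.register("/site/docs/manual")
ids.moved("/site/docs/manual", "/site/archive/manual")
print(uid, ids.resolve(uid))  # 1 /site/archive/manual
```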

Phase III: introduction of references

Silva Documents and other referencing objects will be adjusted so that next to traditional (relative) hyperlinks, they can also contain references.

To a public viewer of the object nothing changes -- unique-id based references will in fact be rendered as hyperlinks as well. This is done to preserve the user experience of readable, meaningful URLs in Silva.

An author now has the choice, when creating a reference to another document, between using a URL-based hyperlink or a reference number. The author can easily switch back and forth between the two.

The advantage of using a reference number is that when an object so referenced moves, the reference will not break.

The advantage of using URLs is that relative URLs and external URLs are still possible. Relative URLs are important when closely related documents link to each other. When these objects are copied together, the user expects the copies to point to each other, not back to the originals. While this is solvable within the reference management system, we will not solve this yet in phase III. Instead we advise the use of relative hyperlinks for this use case for the time being.

In this phase, a database of references is not yet required -- just a database of unique ids and where the objects with these unique ids are.

Interlude: a word on ghosts and ghost folders

We hope that ghosts and ghost folders can be implemented in better ways once the reference management system is in place. References in ghost folders pose problems similar to those of references in copies of related objects. Items in folders that refer to each other will need to be ghosts that refer to each other in a ghost folder. With reference management in place, the default behavior would be having them point to the originals. A separate project we foresee is a change in the way ghost folders work to make this possible.

Phase IV: full reference management

Now all references in the site are under the control of the reference management system. There is a complete database of all references, and efficient querying facilities for it. URL-based references to other resources managed by the reference management system are not possible anymore -- these are automatically translated to references internally.

The UI will now offer facilities to see what is referencing a particular content object.

The problem of copying objects that link to each other, mentioned in phase III, will also be solved: see the section about relative linking for a possible algorithm.

The exposure of reference information does not have to be rolled out in one big step. There is a lot of information and a lot of places in the user interface (contents screen, publication screen, per-object "what references me" screen, document editor) where this information could be useful. Exposing more and more of this information can be rolled out over multiple versions of Silva.

Phase V: Soft references

Soft, rule-based references are introduced to enable smart caching and potentially large performance and scalability gains.

Author

Martijn Faassen, 9 September 2004