Affects Version/s: ManifoldCF 1.7
Fix Version/s: ManifoldCF 1.7
Component/s: Framework agents process
In some cases, the documents that are indexed may be virtual children of the documents that are queued. A good example is an RSS feed, where all of the data being indexed comes from the feed itself.
In order to implement this, the following changes would be required:
(1) IProcessActivity.ingestDocument() has a variant which allows you to include a virtual child document identifier in addition to the main document identifier.
(2) IIncrementalIngester's addOrReplaceDocument receives TWO document keys: one for the main (queued) document identifier, and one for the child virtual document identifier.
(3) IIncrementalIngester has two new methods: beginDocument() and endDocument(), both of which take a main (queued) document identifier as an argument.
(4) The ingeststatus table has two additional columns: a state, and a child key.
(5) The flow is: at beginDocument() time, put all records relating to a document into a "processing" state. Documents that are seen during processing have their state changed. Documents never encountered are deleted at the end.
(6) An incremental decision not to update an output record will STILL require that the record be touched and its state set.
(7) DocumentIngest records for the entire set of children will be fetched when the document is queued.
(8) The getDocumentVersions() method must be modified so that it can return version strings for all children, although there can be "shortcuts" as well (where a single version string applies to all children).
(9) The decision about whether to refetch a document is based on the returned version strings and on those fetched by the stuffer thread.
(10) Similarly, processDocuments() receives version strings for all virtual children.
(11) There is no need to actively reset the state of document records on restart; the current logic should be robust enough to be able to generate the required deletions.
(12) Deleting a document deletes ALL child virtual documents. This happens within the incremental ingester.
(13) The requeuing interval must be computed across all children, taking the minimum, since there's no requirement that an ingeststatus record exist for the parent.
(14) All other logic, including making sure only one agent operates on a URL at a time, is the same.
(15) Interrupting the delete phase is…
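
To make the proposed flow in (2), (5), (6) and (12) concrete, here is a minimal sketch in Java. It is an in-memory stand-in for the ingeststatus table, not ManifoldCF code: the class name VirtualChildTracker, the State enum, the Row type, and all method signatures are assumptions made for illustration. Only the method names beginDocument(), endDocument() and addOrReplaceDocument come from the list above, and deleteDocument() is a hypothetical name for the delete in (12).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the ingeststatus table; not ManifoldCF's API.
final class VirtualChildTracker {
  // The proposed "state" column from item (4).
  enum State { PROCESSING, SEEN }

  // One record: the child's last indexed version string plus its state.
  private static final class Row {
    final String version;
    State state;
    Row(String version, State state) {
      this.version = version;
      this.state = state;
    }
  }

  // main (queued) document id -> child virtual document id -> record
  private final Map<String, Map<String, Row>> rows = new HashMap<>();

  // Item (5): put every existing record for the parent into the "processing" state.
  void beginDocument(String parentId) {
    Map<String, Row> children = rows.get(parentId);
    if (children == null) {
      return;
    }
    for (Row row : children.values()) {
      row.state = State.PROCESSING;
    }
  }

  // Items (2) and (6): returns true if the child was (re)indexed. Even when the
  // version is unchanged and nothing is sent to the output, the record is still
  // touched and its state set.
  boolean addOrReplaceDocument(String parentId, String childId, String version) {
    Map<String, Row> children = rows.computeIfAbsent(parentId, k -> new HashMap<>());
    Row existing = children.get(childId);
    if (existing != null && existing.version.equals(version)) {
      existing.state = State.SEEN;
      return false;
    }
    children.put(childId, new Row(version, State.SEEN));
    return true;
  }

  // Item (5): delete the children that were never encountered; returns their ids.
  List<String> endDocument(String parentId) {
    List<String> deleted = new ArrayList<>();
    Map<String, Row> children = rows.get(parentId);
    if (children == null) {
      return deleted;
    }
    Iterator<Map.Entry<String, Row>> it = children.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, Row> entry = it.next();
      if (entry.getValue().state == State.PROCESSING) {
        deleted.add(entry.getKey());
        it.remove();
      }
    }
    return deleted;
  }

  // Item (12): deleting the parent deletes ALL of its child virtual documents.
  List<String> deleteDocument(String parentId) {
    Map<String, Row> children = rows.remove(parentId);
    return children == null ? new ArrayList<>() : new ArrayList<>(children.keySet());
  }
}
```

In this model, a pass that is interrupted between beginDocument() and endDocument() leaves some records in the "processing" state, and the next beginDocument() simply resets every record for that parent before the children are seen again. That is one way to read (11), under the assumption that a restart always reprocesses the parent from the beginning.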