Thursday, July 19, 2007

Scope of Indexing

Two weeks ago, Chick was at the point of building an XSL transform to index MODS records, so he asked us what the project's indexing scope would be. As an initial response, I drafted a page of XPath notations for the DLF MODS Guidelines Levels of Adoption. I used XPath because it is a language for locating information in an XML document by navigating through its elements and attributes. In other words, XPath lets someone see where in the MODS record the subject heading comes from and which elements make up the name field. Why did I use the five Levels of Adoption for MODS? It was a place to start, and the Metadata Working Group has given the process a lot of thought.
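To make the idea concrete, here is a small sketch in Python of XPath-style paths pointing at parts of a MODS record. The sample record and its values are invented for illustration, and the paths are not taken from the page of notations I drafted. Python's standard library supports only a subset of XPath, which is enough for this purpose.

```python
import xml.etree.ElementTree as ET

# The MODS version 3 namespace, bound to the prefix "mods" for use in paths.
NS = {"mods": "http://www.loc.gov/mods/v3"}

# A made-up record, used only for illustration.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Letters from the Front</title></titleInfo>
  <name type="personal"><namePart>Doe, Jane</namePart></name>
  <originInfo><dateIssued encoding="w3cdtf">1918</dateIssued></originInfo>
  <subject><topic>World War, 1914-1918</topic></subject>
</mods>"""

record = ET.fromstring(SAMPLE)


def values(path):
    """Return the text of every element matched by a path relative to <mods>."""
    return [el.text for el in record.findall(path, NS)]


titles = values("mods:titleInfo/mods:title")
subjects = values("mods:subject/mods:topic")
# A predicate on @type narrows the match to personal names only.
personal_names = values("mods:name[@type='personal']/mods:namePart")
# Attributes are read from the matched element rather than from its text.
date_encodings = [
    el.get("encoding")
    for el in record.findall("mods:originInfo/mods:dateIssued", NS)
]

print(titles, subjects, personal_names, date_encodings)
```

Each path reads like a set of directions from the top of the record down to the value that would feed an index.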

Our next step, and one the SWG began this week, is to draft a list of high-level display labels for the fielded search limits on an "Advanced Search" page so that Chick can begin developing advanced searching capabilities. We will begin by focusing on the facet categories/fields (e.g., Subject, Date, Title): which to include, how to label and order them, etc. We plan to spend time on the facet values (the actual strings in each field) once we begin getting feedback from the rapid prototyping assessment of the portal.

Between these two methods (using XPath notations and fielded search limits), we are describing how to build the indexes and how to search them. We are trying to achieve a level of indexing transparency to let the contributors know how we will index their collections with the MODS they provide. As David Reynolds puts it, "most of the guidelines would not result in an index, but rather, they would be used by data providers to check the quality and completeness of their MODS records against our standard. For example, the @encoding in the element mods/originInfo/dateIssued would not be used to create an index, but it would be a test of how well the date in question would conform to most of the records in Aquifer. Recommendations such as using @type on would not be used to generate an index per se, but would be used to filter results on a name index."
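The kind of quality check David describes for @encoding might look something like the following Python sketch. It does not build an index; it only reports dates that lack the attribute. The function name and the sample record are hypothetical, and this is not the project's actual validation code.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}


def dates_missing_encoding(mods_xml):
    """Return the text of each mods/originInfo/dateIssued that has no @encoding."""
    record = ET.fromstring(mods_xml)
    return [
        el.text
        for el in record.findall("mods:originInfo/mods:dateIssued", NS)
        if el.get("encoding") is None
    ]


# A made-up record with one encoded date and one free-text date.
CHECK_SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <originInfo>
    <dateIssued encoding="w3cdtf">1918</dateIssued>
    <dateIssued>circa 1918</dateIssued>
  </originInfo>
</mods>"""

missing = dates_missing_encoding(CHECK_SAMPLE)
print(missing)
```

A data provider could run a check like this against their own records before contributing them.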

In our case, this means that XSLT will transform each MODS record into the XML format that Solr accepts for adding documents to an index (mapping MODS elements to the fields defined in the Solr schema).
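As a rough illustration of that mapping, the sketch below uses Python in place of XSLT to turn a MODS record into Solr's XML update format (an add element containing a doc with named field elements). The field names on the left of FIELD_MAP are assumptions made for this example, not the project's actual Solr schema, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical Solr field names mapped to the MODS paths that would feed them.
FIELD_MAP = {
    "title": "mods:titleInfo/mods:title",
    "name": "mods:name/mods:namePart",
    "date": "mods:originInfo/mods:dateIssued",
    "subject": "mods:subject/mods:topic",
}


def mods_to_solr_add(mods_xml):
    """Build a Solr <add><doc>...</doc></add> message from one MODS record."""
    record = ET.fromstring(mods_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for field_name, path in FIELD_MAP.items():
        for el in record.findall(path, NS):
            # Skip empty elements so no blank fields are sent to the index.
            if el.text and el.text.strip():
                field = ET.SubElement(doc, "field", {"name": field_name})
                field.text = el.text.strip()
    return ET.tostring(add, encoding="unicode")


# A made-up record, used only for illustration.
MODS_SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Letters from the Front</title></titleInfo>
  <name type="personal"><namePart>Doe, Jane</namePart></name>
  <originInfo><dateIssued encoding="w3cdtf">1918</dateIssued></originInfo>
  <subject><topic>World War, 1914-1918</topic></subject>
  <subject><topic>Correspondence</topic></subject>
</mods>"""

solr_xml = mods_to_solr_add(MODS_SAMPLE)
print(solr_xml)
```

Keeping the mapping in one table makes it easy to show contributors exactly which MODS elements end up in which index field.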

Thanks to David Reynolds for taking the time to review and correct my first attempt at XPath syntax.

Thanks to Steve Toub for clarifying the difference between Facets and Facet Values.