Yannis Velegrakis ,
Vassilis Christophides ,
Panos Constantopoulos
Computer Science Department, University of Toronto,
Toronto, Ontario, Canada M5S-3H5
velgias@cs.toronto.edu
Institute of Computer Science, FORTH,
Vassilika Vouton, P.O.Box 1385, GR 711 10, Heraklion, Greece
Department of Computer Science, University of Crete, GR 71409,
Heraklion, Greece
{christop, panos}@ics.forth.gr
Z39.50 is a client/server protocol widely used in digital libraries and museums for searching and retrieving information spread over a number of heterogeneous sources. To overcome semantic and schematic discrepancies among the various data sources the protocol relies on a world view of information as a flat list of fields, called Access Points (AP). One of the major issues for building Z39.50 wrappers is to map this unstructured list of APs to the underlying source data. Unfortunately, existing Z39.50 wrappers have been developed from scratch and they do not provide abstract mapping languages with verifiable properties. In this paper, we advocate a Description Logic (DL) framework for the declarative specification of Z39.50 wrappers. We claim that the conceptualization of AP mappings enables a formal validation of the translation quality between the source and the Z39.50 view of information and therefore ensures the quality of the retrieved data (i.e. accuracy, consistency, completeness, etc.). Our contribution is twofold : (i) we propose a DL-based toolkit for the declarative specification of Z39.50 wrappers; and (ii) we enrich the generated Z39.50 wrappers with a number of added-value services (e.g. conceptual structuring of flat Z39.50 vocabularies).
KEYWORDS: Z39.50, Description Logics, Information Retrieval, Information Integration, Wrappers, Data Quality.
With the advances in digital processing and communication technologies an increasing number of organizations and individuals are using the Internet for publishing, broadcasting, and exchanging information all over the world. The ability to share, interpret, and manipulate information from multiple sources is a fundamental requirement for large scale applications e.g., digital libraries and museums. A widely used protocol for searching and retrieving information in a distributed environment is Z39.50 [2]. To achieve interoperability [43], the protocol (Z39.50 Version 3) relies on (i) standard messages, formats, and procedures governing the communication of Z39.50 clients and servers (system interoperability), (ii) a world view of information as a flat vocabulary of fields, called Access Points that abstracts representational details of source data (semantic and schematic interoperability), and (iii) basic textual search primitives to express Boolean queries in the form of field-value pairs (functional interoperability).
Sources should then wrap their actual data organization, format and access methods according to the Z39.50 specifications for an application, function, or community, as described in the various profiles (i.e. metadata) proposed by national or international bodies (e.g., Library of Congress, CIMI, etc.). It should be stressed that the quality of the established mappings between the source and the Z39.50 view of information is fundamental in order to ensure the quality of the retrieved data (i.e. accuracy, consistency, completeness, etc.). Unfortunately, most of the time, Z39.50 wrappers are developed using some programming language and they do not provide abstract mapping languages with verifiable properties [44, 11, 45]. In this paper, we advocate a Description Logic framework [9] (such as proposed in the context of the DARPA KSE [41]) for the declarative specification of Z39.50 wrappers using high-level concept languages. We claim that modeling the required mappings as first-class citizens, instead of hard coding them in the wrappers (i) allows the formal validation of the translation quality with respect to the AP semantics defined in a Z39.50 profile (e.g., equivalent, partially overlapped, etc.), and (ii) opens unexpected opportunities to tackle a number of Z39.50 pending issues (e.g., query failures due to unsupported APs, metadata retrieval, multiple answer sets handling, etc.).
Building a wrapper for an information source according to a Z39.50 profile (e.g., for digital libraries [34, 33], museums [46, 20], scientific and technical databases [21, 25], etc.) implies the translation of (i) the Z39.50 Access Points (AP) to the underlying source data structure and semantics, (ii) the Z39.50 Boolean filters to the source query primitives, and (iii) the returned source data from their original format to a predefined Z39.50 record syntax (e.g. GRS-1, US-MARC, XML). For loosely structured sources (e.g., Information Retrieval Systems) wrapping is relatively simple. It essentially requires to define some renaming mappings from the APs to the source data attributes, fields, tags, etc. (e.g., AP AU to field author, etc.). However, for highly structured sources (e.g., Database Management Systems, Knowledge Base Systems) the translation process is considerably more complex. This is mainly due to the fact that there exists a significant mismatch between the Z39.50 flat view of information and the underlying source data model and query language (e.g. relation or class based). In this context, what is really needed is to define for each AP a view on the source data.
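For the loosely structured case described above, the renaming step can be sketched as a simple lookup table. This is only an illustration: the AU-to-author pair comes from the text, while the TI entry and the function name are hypothetical.

```python
# Renaming table from Z39.50 APs to source fields. "AU" -> "author" is the
# example given in the text; the "TI" entry is a hypothetical addition.
AP_TO_FIELD = {
    "AU": "author",
    "TI": "title",
}


def rename_terms(terms):
    """Translate (AP, value) search terms into (source field, value) pairs."""
    translated = []
    for ap, value in terms:
        if ap not in AP_TO_FIELD:
            raise KeyError(f"unsupported access point: {ap}")
        translated.append((AP_TO_FIELD[ap], value))
    return translated
```

Such a table cannot express the case where one AP corresponds to several classes or to a join, which is why a view per AP is needed for structured sources.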
To address this issue we introduce an intermediate level between the Z39.50 and the source world, based on advanced knowledge representation and reasoning support, specifically Description Logics (DL). DL provide declarative languages to represent and reason about interrelated sets of objects using modeling primitives such as concepts, roles, and individuals (concepts form a subsumption taxonomy having a bottom and top). Starting from a set of primitive concepts and roles representing source conceptualization, we capture the semantics of the AP mappings as derived concepts formed by primitive ones and standard DL concept operators [5]. Since DL can serve both as knowledge representation languages and as query languages [8, 42, 14], derived concepts essentially act as views [15] against which Z39.50 queries are evaluated with source data. Our contribution is twofold : (i) we propose a toolkit for the declarative specification of Z39.50 wrappers using standard DL reasoning mechanisms [23] (i.e., Concept Satisfiability, Subsumption and Instance Checking); and (ii) we enrich Z39.50 wrappers with a number of added-value services as described in the following :
The rest of the paper is organized as follows. In Section 2 we give an example of a cultural information source and describe the difficulties encountered in wrapping its structured contents according to a digital museum Z39.50 profile. In Section 3 we briefly recall the core Description Logic (DL) model and we show how it can be applied for the declarative specification of Z39.50 AP mappings. Section 4 presents the Z39.50 query processing in our DL framework and Section 5 elaborates on the added-value Z39.50 wrapping services we offer. The architecture of the developed wrapper toolkit is presented in Section 6. Finally, we conclude and discuss future work in Section 7.
In this section we describe the contents and structure of a cultural information source that will be used as running example in the rest of the paper. We focus on the mismatch between the conceptualization of our test database and the conceptualization of information in a Z39.50 profile for Digital Museums [46, 20], as well as on the consequent problems we have encountered in order to develop a Z39.50 wrapper in the context of the AQUARELLE and CIMIzit projects [38, 37].
As a testbed we use the CLIO cultural documentation system, developed at the Institute of Computer Science, Foundation for Research and Technology-Hellas (FORTH-ICS) in close cooperation with the Benaki Museum, Athens and the Historical Museum of Crete, Heraklion. CLIO supports the recording and management of an evolving body of knowledge about ensembles of cultural goods and addresses the needs of museum curators and researchers. It focuses on extensibility of knowledge, multiplicity of representation, and management of imprecise and incomplete information. The functional kernel of CLIO is the Semantic Index System (SIS) developed by FORTH-ICS [22]. SIS is a persistent storage system based on the object-oriented semantic network data model TELOS [39].
Figure 1 illustrates some features of our example data source, inspired by the CLIO system, namely simple and multiple classification as well as multi-valued and optional attributes. A museum object is represented as an instance of the class ``MuseumObject''. It may have (optional attributes) an owner (class ``Owner'') and be constructed with the use of one or more (multi-valued attributes) materials (class ``Material'') and techniques (class ``Technique''). Each museum object is associated with a series of events (class ``Event'') characterized by their kind, date and involved actor. For instance, the saber of Androutsos (a hero of the 1821 Hellenic Revolution) is made of shaped silver (multiple instantiation) and it was constructed by Filimon in 1815. Although not illustrated in our example, SIS-TELOS also supports simple and multiple inheritance, unbounded classification allowing the definition of meta-schemata, and treats attributes as first class citizens classified on their own.
Figure 1: An Example of a Cultural Information Source
Z39.50 [2] is a session oriented and stateful application protocol, based on the client/server architecture. To overcome semantic and schematic discrepancies among the various data sources, Z39.50 relies on a common model of the information shared by all clients and servers. It consists of a flat list of fields, called Access Points (AP) (or more precisely Use Attributes), on which queries are expressed. For instance, in the CIMI [20] or AQUARELLE [46] profiles, the supplied APs correspond to general information categories like People (specific persons or cultural groups), Dates of many sorts (including dates of creation, acquisition, exhibition), Places (e.g. place of creation, places associated with an event, galleries, provenance), Subject (exact description of depicted material), Style (including movement and period), Method (including process and techniques), Material, etc. [31].
This vocabulary of fields is first employed by a client in order to search and identify records from the underlying sources and next, to retrieve some or all of them. Z39.50 queries are formulated using Boolean connectors (and, or, and-not), search terms (i.e. Use attribute-value pairs), and qualifiers specifying lexicographical comparisons (e.g., greater than), truncations (e.g. right or left), etc. Going back to our cultural scenario, the following query searches for all the museum objects related to Androutsos that were created after 1887 :
Q1: PersonalName=``Androutsos'' and (DateOfCreation=1887 Relation=``GreaterThan'')
According to Figure 1, the person Androutsos might be the creator (i.e. the actor involved in a creation event), or the owner of the object. This implies that a query on the AP PersonalName should be translated by the wrapper into queries on the source Actor and Owner classes. Furthermore, a query on the AP DateOfCreation should be translated into queries on the Time_Span class and the associated Object_Event and Kind classes. Finally, the returned museum objects information, should be formatted/converted by the wrappers according to a common agreed record syntax (e.g. GRS-1, US-MARC, XML) and structure (e.g., elements ObjectId, Title, Creator, CurrentLocation, etc.).
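The translation of Q1 described above can be illustrated with a small sketch over in-memory data. The data layout, the function names, and the second museum object are assumptions made for illustration; the sketch does not reproduce the actual CLIO schema or wrapper.

```python
# Toy museum objects loosely following Figure 1. The saber entry reflects the
# running example (ownership by Androutsos is assumed); the "portrait" entry
# is hypothetical and added for contrast.
museum_objects = [
    {
        "id": "saber",
        "owner": "Androutsos",
        "events": [{"kind": "Creation", "date": 1815, "actor": "Filimon"}],
    },
    {
        "id": "portrait",
        "owner": None,
        "events": [{"kind": "Creation", "date": 1890, "actor": "Androutsos"}],
    },
]


def matches_personal_name(obj, name):
    """PersonalName is looked up in both the owner and the event actors."""
    return obj["owner"] == name or any(e["actor"] == name for e in obj["events"])


def created_after(obj, year):
    """DateOfCreation only considers events whose kind is Creation."""
    return any(e["kind"] == "Creation" and e["date"] > year for e in obj["events"])


def q1():
    """Q1: PersonalName = Androutsos and DateOfCreation > 1887."""
    return [
        o["id"]
        for o in museum_objects
        if matches_personal_name(o, "Androutsos") and created_after(o, 1887)
    ]
```

Even in this toy setting a single AP touches several source classes, and the result is expressed in terms of museum objects rather than names or dates.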
We believe that the underlying Z39.50 information model is more suitable to query loosely structured text bases than highly structured data sources. Indeed, due to the significant mismatch between the Z39.50 and the source information model, most of the existing structure and semantic richness of the sources is not taken into account during querying while the wrapping process becomes considerably more complicated. It becomes clear that a query on an AP may be translated to (i) a query on one or more classes or attributes (in cases of object-oriented sources); (ii) on one or more fields and joins of relations (in cases of relational sources); or (iii) to any other complex query in the source native query language. In this context, nothing guarantees that the semantics of the specified view correspond exactly to the semantics of the AP in the Z39.50 profile: it may be included in the original AP semantics, partially overlapped, etc. This is typically the kind of information that is missing from existing Z39.50 wrappers in order to ensure the quality of the retrieved data (i.e. accuracy, consistency, completeness, etc.). Two Z39.50 wrapping issues are worth further elaboration and they will be addressed in the subsequent sections.
Description Logics (DL), also known as terminological logics, have been intensively studied for more than a decade in the field of Knowledge Representation and Reasoning Systems (KRRS). DL provide declarative languages for representing and reasoning about classes of objects and their relationships, encompassing other well-known formalisms such as entity-relationship or class inheritance models [17]. Recently DL have received considerable attention in the context of information integration systems [3, 35, 16, 27, 6] since they have been shown to provide flexible formalisms to model and reason over a large number of views used for data integration [36]. We follow the same approach to declaratively define the required AP mappings as views over source data. It should be stressed that, compared to previous work on data integration, our context is quite different: (i) Z39.50 wrapping involves only one source at a time (vs. mediation of several sources); (ii) the Z39.50 world view of information is intrinsically flat (vs. middleware structured models); and (iii) Z39.50 wrappers support some query processing, essentially to perform set queries (vs. simple translations of queries and data). In the sequel, we briefly recall the core DL model that we use to cope with the various Z39.50 wrapping issues presented in the previous section and provide Z39.50 wrappers with formally verifiable mapping specifications.
The main modeling primitives of Description Logics (DL) are concepts, roles, and individuals. A concept describes a class of elements (individuals) in the domain of interest and is defined by the conditions that must be satisfied by the elements in the class. A role describes a relationship between two individuals. The two basic components of a DL system are the terminological box (TBox) and the assertional box (ABox). The former contains the concepts (intensional knowledge) and the latter contains the individuals (extensional knowledge). There exist two types of concepts: Primitive and Derived. The definition of a primitive concept specifies only the necessary conditions for an individual to be an instance of it. On the other hand, the definition of a derived concept states the necessary and sufficient conditions for an individual to be an instance of it. This implies that an individual has to be explicitly declared as an instance of a primitive concept, while instances of derived concepts are inferred by the DL system.
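The distinction between asserted and inferred instances can be sketched as follows. This is an illustrative toy, not a DL reasoner: derived concepts are restricted to unions of other concepts, and the individuals and dictionary layout are assumptions.

```python
# ABox: explicit assertions exist for primitive concepts only
# (illustrative individuals).
abox = {
    "Actor": {"Filimon"},
    "Owner": {"Androutsos"},
}

# TBox-view part: each derived concept is defined, here as a union of
# other concepts (a deliberately restricted form of definition).
derived = {
    "PersonalName": ("Actor", "Owner"),
}


def instances(concept):
    """Asserted instances for primitive concepts, inferred ones for derived."""
    if concept in abox:
        return set(abox[concept])
    return set().union(*(instances(part) for part in derived[concept]))
```

No individual is ever declared to be a PersonalName; its membership follows from the definition alone.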
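The interpretation of a DL knowledge base is a pair $\mathcal{I} = (\Delta^{\mathcal{I}}, \cdot^{\mathcal{I}})$, where $\Delta^{\mathcal{I}}$ denotes a non-empty set of values (the domain) and $\cdot^{\mathcal{I}}$ an interpretation function, mapping every concept to a subset of $\Delta^{\mathcal{I}}$, every role to a subset of $\Delta^{\mathcal{I}} \times \Delta^{\mathcal{I}}$, and every individual to an element of $\Delta^{\mathcal{I}}$ such that $a^{\mathcal{I}} \neq b^{\mathcal{I}}$ for different individuals a, b (Unique Name Assumption). Intuitively, the interpretation of a concept C (denoted as $C^{\mathcal{I}}$) is the set of objects that are known to belong to that concept. A concept C is said to be subsumed by another concept D (denoted as $C \sqsubseteq D$) if and only if $C^{\mathcal{I}} \subseteq D^{\mathcal{I}}$. Based on this subsumption relation, a set of concepts can form a taxonomy having a bottom ($\bot$) and top ($\top$) concept.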
Table 1: Concept and Role forming operators
The part of the TBox that contains the primitive concepts is called the schema part while derived concepts form the view part [15]. The TBox-schema part consists of a finite set of axioms having one of the forms: $A \sqsubseteq C$, $R \sqsubseteq C \times D$, where A, C, D are primitive concepts, and R is a role (note that roles have restricted to and from values). The TBox-view part consists of a finite set of concept definitions having the form $A \doteq E$ where A is a derived concept and E is a concept expression formed by other concepts and the operators shown in Table 1. In the next subsection we will explain these operators through examples in order to define the required mappings of Z39.50 APs to the source data. Note that cycles in concept definitions are not allowed (see [40] for formal definitions). The ABox is defined from a finite set of declarations having one of the forms: C(a) and R(a,b). The first one (unary predicates) declares that individual a belongs to the interpretation of the primitive concept C and the second one (binary predicates) declares that there exists a role R from a to b (belonging respectively to the interpretations of concepts C and D in the definition of R). The main reasoning services [23] offered by a DL system are the following:
In the sequel we present how high level DL concept languages can be used by a wrapper toolkit to declaratively define the required mappings of Z39.50 APs to the source data. In a very natural way, source structure and semantics can be represented as primitive concepts and roles, while the AP mappings as derived concepts (i.e. views) defined on top. Figure 2 illustrates the primitive concepts (TBox-schema part) representing our cultural source schema given in Figure 1 while the derived concepts (TBox-view part) correspond to the established mappings of the CIMI-AQUARELLE profile APs [20, 46]. The data of our cultural source correspond to the individuals ( ABox) of the DL System. Note that this is only a logical view of information from the Z39.50 wrappers (see Section 6) and there is no need to actually load source data into the DL system (virtual Abox). In the following examples we illustrate the expressive power of the proposed DL concept language (see Table 1) to capture the various kinds of translations involved in Z39.50 wrapping for structured sources (see Section 2).
Similarly, the mapping of the AP MaterialAndTechnique is defined as :
Furthermore, mappings of abstract APs like Who describing any personal or corporate name that can be found in a source, are defined by using other AP concepts such as :
Finally, APs like Any, for full-text queries are easily mapped by considering the definitions of abstract APs like Who, What, When and Where (4W APs).
Figure 2: Modeling a Cultural Information Source and Z39.50 APs mappings in DL
The above expression has three parts: (i) the bracket expression corresponds to a concept having as interpretation only the individual ``Creation'', i.e. subsumed by Kind, (ii) the parenthesis expression represents the related creation Events, and (iii) the whole expression captures the Dates associated with these events. Note that the restriction of a role to and from values obviates the need to verify that the returned individuals actually belong to the interpretation of Date.
This implies that CorporateName, Location and Collection are concepts whose interpretation contains only one individual respectively ``Benaki Museum'', ``Benaki Museum Athens'' and ``Gun Collection''.
(or )
In both cases, wrappers are able to smoothly incorporate unsupported APs into the query processing.
To conclude this section we note that modeling the AP mappings as DL derived concepts allows Z39.50 wrappers with formally verifiable properties to be developed. More precisely, (i) APs whose meaning is not at all or only implicitly represented in the source can be effectively mapped to avoid embarrassing query failures; and (ii) using DL reasoning services like Concept Satisfiability we can infer whether some or all of the AP mappings are ill-defined or whether the profile used is inappropriate for a specific source. These added-value services are quite useful for profile developers, Z39.50 wrapper administrators and end-users.
Having defined the mappings of the Z39.50 APs as derived concepts on top of a source schema (i.e. views), we now focus on Z39.50 query processing in our DL framework.
Since DL can serve both as a knowledge representation language and as a query language [8, 42, 14], Z39.50 queries can also be modeled as derived concepts. More precisely, a query can be seen as a description of the necessary and sufficient conditions that have to be satisfied by the individuals forming its answer set, i.e. its interpretation. Conversely, primitive (i.e., source) or derived concepts (i.e., AP mappings) can be used for data querying by considering their interpretation. In the sequel, we present how the Z39.50 Boolean filters can be (i) translated by the wrappers using the same DL concept language employed to map the Z39.50 APs, and (ii) rewritten by taking into account the defined AP views and the fixed central concept of the data actually returned by a source (see Section 2).
As we have seen in Section 2, Z39.50 queries are essentially composed of search terms with APs and qualifiers for comparisons, truncations, etc. which are combined using Boolean connectors. Consider, for instance, the following simple query (i.e. no qualifiers) :
Q2: PersonalName = ``Androutsos''
Recall that PersonalName is an AP, mapped as a derived concept to the Actor and Owner concepts, and ``Androutsos'' is a value considered as an individual (a). Q2 can be translated into a basic query to the DL knowledge base using the Instance Checking reasoning service :
If the individual ``Androutsos'' is in the interpretation of the concept PersonalName (i.e. the union of the Actor and Owner interpretations), the knowledge base returns a positive answer and the answer set (i.e., query interpretation) contains only the individual ``Androutsos''. Else the answer set will be empty. More formally, core Z39.50 queries can be defined as DL derived concepts (Tbox-query part) that will be interpreted with source individuals (Abox) in the following way :
Note that query answering relies here on some form of closed world assumption [29]. In the style of [24] we make the realistic assumption about complete knowledge of the DL extensional part (i.e., source data) and thus consider in the interpretation of concepts only their known individuals.
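Under this closed world reading, the evaluation of a simple search term such as Q2 can be sketched as a membership test over known individuals. The data and function names are illustrative assumptions, not the toolkit's interface.

```python
# Known individuals of the source (illustrative ABox).
abox = {
    "Actor": {"Filimon"},
    "Owner": {"Androutsos"},
}


def personal_name_interpretation():
    """PersonalName is interpreted as the union of Actor and Owner."""
    return abox["Actor"] | abox["Owner"]


def instance_check(individual, interpretation):
    """Closed-world check: the answer set is {individual} or empty."""
    return {individual} if individual in interpretation else set()
```

An individual that is not recorded in the source simply yields an empty answer set; nothing is inferred about unknown individuals.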
Now let us see how we can express Z39.50 queries using relation or truncation qualifiers like, for instance :
Q3: PersonalName=``Andr'' Truncation=``Right''
These search operators are not directly expressible in a standard DL framework, but they can be captured as external functions. The DL operator TEST-C allows test functions defined outside the DL system to be called. This operator is essentially an escape mechanism from the limits of DL expressiveness, allowing individuals to be manipulated using external functions written in some programming language (see e.g., CLASSIC [12]). A test function f takes an individual as argument and returns TRUE if the individual satisfies the conditions specified in the body of the function, and FALSE otherwise. The interpretation of the expression TEST-C(f) is then the set of all individuals for which f returns TRUE. Assuming that the various Z39.50 search operators are supported by the underlying source and defined as test functions in the wrapper (e.g., rtrunc, etc.), Q3 can be translated as :
where the test function passed to TEST-C is a source function performing right truncation on the string ``Andr''.
Finally, capturing the Z39.50 Boolean connectors and, or, and and-not is straightforward using the concept forming operators $\sqcap$, $\sqcup$, and $\neg$ (see Table 1).
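At the level of interpretations, the three connectors reduce to set operations over answer sets. The following sketch assumes answer sets are plain Python sets of known individuals; the sample data is illustrative.

```python
def z_and(left, right):
    """and: intersection of the two interpretations."""
    return left & right


def z_or(left, right):
    """or: union of the two interpretations."""
    return left | right


def z_and_not(left, right):
    """and-not: individuals of `left` that do not belong to `right`."""
    return left - right


# Illustrative answer sets of two search terms.
names = {"Androutsos", "Filimon"}
owners = {"Androutsos"}
```

For example, `z_and_not(names, owners)` keeps only the individuals that appear as names but not as owners.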
Unfortunately, the above translation into DL is not sufficient to express the exact semantics of Z39.50 queries as defined in a profile. We have seen in Section 2 that the result of a Z39.50 query is the set of related individuals belonging to a central concept of interest (e.g., the root of museum objects in our cultural scenario), rather than the set of individuals that belong to given AP derived concepts and satisfy the search conditions. To cope with this problem we need to define the central concept in the TBox as a derived concept and then introduce concept path expressions connecting, through roles, the individuals of the central concept with the various AP concepts involved in a query. For instance, for the AP derived concept DateOfCreation used in Q1 we can consider the following path (see Figure 2) :
Since DateofCreation is only a simple case and AP derived concepts are usually defined by more complex concept expressions (e.g. PersonalName), what is really needed is to declare, for each of the involved primitive concepts (e.g. Actor, Owner), the corresponding paths to the central concept e.g., (see Figure 2) :
The same approach is followed in order to consider the paths of composite APs (e.g., the 4W APs) defined in terms of others. More formally :
These paths are then used during Z39.50 query translation to capture the exact answer set with individuals of the central concept. More precisely, we consider the following translation steps :
In Section 3 we showed the benefits of modeling Z39.50 AP mappings as DL concepts (i.e. views) for formally validating the Z39.50 wrapping quality. In this section we focus on the capability of DL-based wrappers to reason about the relationships between AP views as well as between these views and Z39.50 queries also represented as DL concepts. Specifically, we show (a) how a flat Z39.50 list of APs can be organized in a subsumption taxonomy, thus rendering their underlying source-specific conceptual structure; and (b) how Z39.50 queries can be optimized with respect to their intensional semantics without accessing actual source data (virtual ABox).
Despite the simplified world view of information as a flat list of APs, Z39.50 profiles are usually developed according to an implicit conceptual structure of the information requested by the users. Indeed, the APs defined in a profile represent real world entities for a particular application, function, or community, at various abstraction levels and with different relationships between them. For example, in the CIMI-AQUARELLE profile [20, 46] we can observe a wide range of APs : from very abstract APs like Any, to general ones like What, Who, When and Where (4W APs), down to more specific ones like Date or DateOfCreation. Making their relationships explicit in the context of a specific source is very useful for both end-users and third-party metadata providers. It essentially makes it possible to understand why the conceptual structures of information in a source and a profile differ, in order to improve the design of APs, query precision, interpretation of results, etc.
Figure 3: Structuring a Flat Vocabulary of Z39.50 APs
We rely on the DL Subsumption Checking reasoning service to organize in a taxonomy the derived concepts capturing the AP mappings for a source. For instance, given the definition of Date and DateOfCreation (see Section 3) it can be inferred that $\mathit{DateOfCreation} \sqsubseteq \mathit{Date}$ (see [47] for formal definitions and the subsumption algorithm used). In the simplest case the subsumption relationships are a direct consequence of the definitions of composite AP concepts, as for instance the 4W APs.
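For the simplest case, where composite APs are defined purely as unions of other concepts, a sufficient subsumption test can be sketched by unfolding definitions. This is not the subsumption algorithm of [47]; the definitions below are simplified assumptions (CorporateName is treated as an atomic name) and the test ignores all other concept operators.

```python
# Simplified, union-only definitions (acyclic, as required of the TBox).
definitions = {
    "PersonalName": {"Actor", "Owner"},
    "Who": {"PersonalName", "CorporateName"},
}


def expand(concept):
    """Unfold a concept into the set of undefined names it is a union of."""
    if concept not in definitions:
        return {concept}
    atoms = set()
    for part in definitions[concept]:
        atoms |= expand(part)
    return atoms


def subsumed_by(sub, sup):
    """Sufficient test for sub being subsumed by sup in the union-only case."""
    return expand(sub) <= expand(sup)
```

Applying `subsumed_by` to every pair of APs yields the edges of a taxonomy in which, for example, PersonalName is placed below Who.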
Figure 3 illustrates the subsumption taxonomy of several CIMI-AQUARELLE APs as they are mapped to our example source ( Tbox-view part). This taxonomy serves as advanced knowledge support about wrapped sources (i.e. metadata) which can be exploited off-line or on-line. In the latter case the Z39.50 Explain service can be used. Note that accessing and exchanging source metadata is not a simple task due to the different technologies (DBMS, KBS, etc.) employed by the sources and the various implementation choices made by wrapper administrators. We believe that a DL concept language can also be used to facilitate metadata retrieval (i.e. AP mappings) in a way commonly understood by all clients and independent from the underlying source/wrapper technology.
In Section 4 we have seen that DL concept languages used to capture the schema of a source and define Z39.50 APs mappings as views on top of it, can also be employed to express the Z39.50 queries against these views. Not surprisingly Z39.50 queries can then be classified into the concept taxonomy using the subsumption relationships between them and the other primitive or derived concepts (Tbox). The first benefit from this classification is to determine if a Z39.50 query can be effectively evaluated against the existing source schema and AP views. Indeed, after the translation of Z39.50 queries into a canonical DL form, wrappers are able to check whether the description (intension) of a query is contradictory without accessing the source data (ABox). For instance, the following query can be detected as inconsistent since it uses the AP ProtectionStatus mapped to the bottom concept.
Q4: PersonalName = ``Androutsos'' and ProtectionStatus = ``Preserved''
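A sufficient check for this kind of inconsistency can be sketched without touching any data. Queries are represented here as nested tuples, and the mapping table is a simplified stand-in for the TBox-view part; both are assumptions for illustration. The check only propagates the bottom concept through the connectors and does not detect other sources of unsatisfiability.

```python
BOTTOM = "BOTTOM"

# Simplified AP mappings: only whether an AP is mapped to bottom matters here.
ap_mapping = {
    "PersonalName": "Actor OR Owner",
    "ProtectionStatus": BOTTOM,
}


def is_contradictory(query):
    """True if the query is unsatisfiable because of APs mapped to bottom."""
    op = query[0]
    if op == "term":
        _, ap, _value = query
        return ap_mapping.get(ap) == BOTTOM
    if op == "and":
        return is_contradictory(query[1]) or is_contradictory(query[2])
    if op == "or":
        return is_contradictory(query[1]) and is_contradictory(query[2])
    if op == "and-not":
        return is_contradictory(query[1])
    raise ValueError(f"unknown connector: {op}")


q4 = (
    "and",
    ("term", "PersonalName", "Androutsos"),
    ("term", "ProtectionStatus", "Preserved"),
)
```

Replacing the connector of Q4 with or would make the query satisfiable again, since only one branch is mapped to bottom.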
If now a query is semantically well-defined it can be appropriately classified by determining the set of its immediate subsumers and subsumees, i.e. the concepts found above or below in the taxonomy. This classification opens interesting optimization opportunities since it induces a set of semantic transformations in order to locate the exact place of concepts in the taxonomy [7]. Consider, for instance, the following query :
Q5: PersonalName=``Androutsos'' or Who=``Androutsos''
Since the derived concept Who subsumes PersonalName, Q5 will be rewritten into the following semantically equivalent query that will be actually evaluated with source data :
Q5': Who = ``Androutsos''
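The rewriting of Q5 into Q5' can be sketched as a local simplification of a disjunction. The subsumption pairs are assumed to have been computed beforehand by the DL system; the function name and term representation are hypothetical.

```python
# (narrower AP, broader AP) pairs, assumed precomputed by subsumption checking.
subsumptions = {("PersonalName", "Who")}


def simplify_or(left, right):
    """Drop the narrower term of a disjunction over the same search value."""
    (ap_l, val_l), (ap_r, val_r) = left, right
    if val_l == val_r:
        if (ap_l, ap_r) in subsumptions:
            return [right]
        if (ap_r, ap_l) in subsumptions:
            return [left]
    return [left, right]


q5 = (("PersonalName", "Androutsos"), ("Who", "Androutsos"))
q5_prime = simplify_or(*q5)
```

The simplification is only valid when both terms search for the same value; otherwise both terms are kept.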
Recall that according to the semantics of Z39.50 queries, the result is always composed of individuals of a central concept such as MuseumObject. Therefore Z39.50 queries like Q5 are always classified under the central concept, which is defined in the TBox-view part as a derived concept (see Section 4). This enables an intelligent caching of query results [26, 4] by the wrappers and a consequent optimization of Z39.50 queries. If the concept representing a query is found to be equivalent to one already existing in the taxonomy, the interpretation of that concept can be returned as the answer set instead of evaluating the query. This is the case of Q5, assuming that the equivalent query Q5' has been previously evaluated and cached. Otherwise, the interpretations of all the immediate subsumers have to be checked against the query conditions. This is extremely useful, as Z39.50 is a stateful protocol and queries are quite often simple refinements of previously issued queries, for example :
Q6: Q5' and When = 1815
In this case Q5' subsumes Q6 and only the second part of the query needs to be evaluated by the underlying source (intersection is performed locally by the wrapper). Finally, the results of Q6 could be also cached in the wrapper. This implies that the interpretation of concept Q5' will contain only its proper individuals i.e. those not belonging to the interpretations of its immediate subsumees like Q6. Note that supporting several query answer sets proves to be quite expensive with current implementations of Z39.50 wrappers [44, 11, 45].
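The refinement scenario can be sketched with a small result cache. The source is simulated by a dictionary and all object identifiers are made up; the sketch only shows that the cached part of a refined query is not sent to the source again.

```python
# Simulated source answers per search term (illustrative data).
toy_source = {
    ("Who", "Androutsos"): {"saber", "portrait"},
    ("When", 1815): {"saber", "icon"},
}

cache = {}
source_calls = []  # records which terms actually reached the source


def evaluate_at_source(term):
    source_calls.append(term)
    return set(toy_source[term])


def answer(term):
    """Return the answer set of a term, consulting the cache first."""
    if term not in cache:
        cache[term] = evaluate_at_source(term)
    return cache[term]


def refine(cached_term, extra_term):
    """Evaluate `cached_term and extra_term`, intersecting locally."""
    return answer(cached_term) & answer(extra_term)


q5_prime_result = answer(("Who", "Androutsos"))
q6_result = refine(("Who", "Androutsos"), ("When", 1815))
```

After both queries, the term of Q5' has been evaluated at the source only once, and the intersection for Q6 is computed in the wrapper.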
The architecture of the DL-based Toolkit we have developed [47] for the specification of Z39.50 wrappers is shown in Figure 4. It is composed of the following five modules.
Figure 4: The Z39.50 Wrapper Toolkit Architecture
Figure 5: A Part of the Z39.50 Wrapper Configuration File
All modules are operational, although Module 4 currently supports only the DL Instance Checking service and sources built on top of SIS-Telos [22]. Due to the similarities between the DL and SIS-Telos query models, the translation of the resulting DL query expressions into our cultural source is straightforward. We plan to extend this interface of Module 4 to other data source technologies, especially relational and object DBMSs (SQL, OQL), as already studied in [10, 13, 32].
To conclude, the modular architecture of the proposed toolkit significantly reduces wrapper development and maintenance costs. First, the DL-based Module 4 can be reused in order to wrap the same source according to multiple, possibly overlapping profiles. This is not possible with the majority of the existing wrappers [44, 11, 45] where the AP mappings are hard-coded. In our approach, the profile becomes a characteristic of the client query, rather than a characteristic of the source. Second, the same Z39.50 server can support several wrapped sources. This is due to the fact that Modules 1, 2 and 3 need not be aware of the Z39.50 AP (or Element) mappings to the various source data. This information is required only by Module 4, i.e. the source wrapper. Hence, a server can simultaneously support sources of different technologies, as well as Z39.50 profiles with different AP mappings in each data source.
In this work we have addressed the declarative specification of Z39.50 wrappers. We have presented a wrapper generation toolkit based on DL concept languages in order to map the Z39.50 world view of information to the underlying source data structure and semantics. The proposed DL mapping language offers a number of advantages : (i) the required views over source data can be easily defined while a wide range of intrinsic Z39.50 translation cases can be expressed (unlike standard DBMS query languages such as SQL); (ii) it comes equipped with verifiable properties that allow the Z39.50 wrapping quality to be formally validated and therefore ensure the quality of retrieved data; (iii) it enables reasoning about the relationships between the defined views, thus rendering to Z39.50 profile developers, end-users, etc., the source-specific conceptual structure of the Z39.50 information view; and (iv) it can also serve to translate Z39.50 queries, which opens interesting opportunities for semantic query optimization and caching of results, useful in view of the stateful nature of the protocol.
Currently, the developed toolkit supports only the DL Instance Checking service for evaluating queries and sources built on top of SIS-Telos [22]. We plan to complete the implementation of the toolkit in order to provide full-fledged DL reasoning services. There is on-going study of the available DL systems for possible integration in our toolkit. Furthermore, we intend to validate our approach with several Z39.50 profiles and extend the wrapping facilities to other data source technologies (e.g., DBMS, IRS, etc.). Last but not least, we plan to apply the ideas presented in this paper at a higher level of information integration, in order to build intelligent mediators instead of wrappers [1].
We are grateful to the AQUARELLE and CIMI Consortiums for their technical support during this project. We also thank A. Analyti, D. Plexousakis and M. Dörr for helpful comments on a preliminary version of this paper.
Declarative Specification of Z39.50 Wrappers using Description Logics
This document was generated using the LaTeX2HTML translator Version 96.1 (Feb 5, 1996) Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The translation was initiated by Yannis Velegrakis on Wed Apr 14 11:18:03 EDT 1999