USING MULTIPLE SEMANTIC MEASURES FOR COREFERENCE RESOLUTION IN ONTOLOGY POPULATION

The problem of populating an ontology consists in adding to it new, domain-specific content from an input expressed, in particular, in a natural language. We focus on an important aspect of the ontology population process: finding and resolving coreferences, i.e., similar mentions of entities in the input text. Our contribution is a novel formal framework that extends the state-of-the-art approaches to coreference resolution by using multiple semantic similarity properties in the resolution process, i.e., we extend the list of the ontological properties used for coreference resolution with additional properties such as inverse, symmetry, intersection, union, etc. We use the proposed framework to improve our previously proposed algorithm for coreference resolution used in our general approach to text analysis and information extraction for populating subject domain ontologies. We describe a multi-agent implementation of our information extraction system and we show that using additional semantic similarity measures for evaluating coreferential candidates improves the quality of the coreference resolution process, especially for complex objects whose coreferencing has not yet been studied in detail. Copyright © Research Institute for Intelligent Computer Systems, 2017. All rights reserved.


INTRODUCTION
The process of ontology population is the actively studied problem of adding new instances of concepts to the ontology. This process is a part of ontology acquisition [18] from a domain-specific content that is most often represented in a natural language. In this context, the solution for the ontology population task is interrelated with the elaboration of natural language processing (NLP) techniques applied in the process of information extraction (IE), with coreference resolution as one of the most challenging NLP tasks.
In linguistics, a reference is a relation of a text expression with some non-linguistic object or circumstances in the real or abstract world. The coreference resolution problem is to link a particular text mention of a non-linguistic entity to its other mentions in the same text. Traditionally, the process of coreference resolution consists of two main tasks: 1) the detection of entity mentions that are candidates for coreference, and 2) the pairwise comparison of candidate mentions in order to make the decision on candidate admissibility (whether the pair is valid or not) using some criteria.
The contribution of this paper is a formal framework for the broad use of properties of ontology classes and relations in the coreference resolution process. We exploit these properties for evaluating the semantic coreference similarity in the integral evaluation of coreference similarity. We use the proposed framework to improve our coreference resolution algorithm suggested in [6] for making the decision on the candidate admissibility, which is used in our general approach to text analysis and information extraction for populating a subject domain ontology. In our approach, the following IE tasks are performed: the preliminary extraction of subject domain terms from a given text [14]; the segmentation of the text into formal and genre fragments (sentences, sections, headlines, etc.) [22]; the construction of objects corresponding to instances of a subject domain ontology from the terms [4] and the coreference resolution [6]; the lexical and syntactic disambiguation [5]; the update of the ontology with the processed objects (planned as future work). In our framework, the coreference resolution problem means detecting whether some group of retrieved objects refers to a particular ontology instance.
There are several basic approaches to coreference resolution proposed in the literature. The most important trends in the field can be found in the comprehensive surveys [3,16,17,19]. These trends can be categorized into rule-based and machine learning approaches. Early coreference resolution systems (dating back to the 1970s and 1980s) are called "rule-based" as they rely on hand-coded heuristics that specify whether two expressions can or cannot corefer [1,2,11,24]. The better term for this trend is "linguistic approach" [3] as it incorporates a lot of domain and linguistic knowledge: syntactic constraints, semantic features and preferences, and discourse-oriented theories, such as the Centering model [7], which can predict the focus of attention and the choice of a referring expression for a sentence. Theoretical models consider integrated knowledge sources and reveal factors that help to remove unlikely candidates until the minimal set of plausible candidates is obtained, and then make use of the center or focus, or other preferences. Modern theories investigating the multiplicity of factors involved in the coreference phenomenon (such as the notion of Referent activation based on a discourse structure, antecedent syntactic or semantic role, animacy, etc. proposed in [12]) were used directly or indirectly in [13]. The mid-to-late 1990s gave rise to "corpus-based" (a.k.a. machine learning) approaches which were inspired by the emergence of more powerful automatic parsers and taggers, and corpora annotated with coreference information to be used as training data [8,23]. Paper [3] gives a survey of machine learning based techniques with respect to the coreference resolution task, starting from a simple statistical naive Bayes-based model and moving to methods using decision trees, conditional random fields and others. Unfortunately, in limited subject domains (for example, technical documents) representative training text corpora usually do not exist. In these cases, it is reasonable to use classical rule-based methods.
In the context of ontology population, the rule-based approach called "ontology-driven IE" is of particular significance. In this approach, IE and ontology population are closely interrelated: an ontology is used to represent the IE process output while the ontology structure and knowledge represented in it help to solve IE domain-specific subtasks [15]. In [10,25] the coreference resolution task is discussed with respect to both intra- and cross-document analysis. In both papers the ontology-level information is used to determine ontology object identity and similarity: they can be calculated using the object's own features' values and the values of features of other objects that are connected with this object by semantic relations. The approach to coreference resolution in [10] allows only certain types of named entities (persons, organizations, etc.), and the feature values comparison is made by direct string matching without use of any similarity measure. To avoid identification errors, they use a special hand-crafted database that contains validated objects with no duplicates: the identifiers (feature values) of the extracted objects are compared with the identifiers of objects in the database. In [25], the process consists of two consecutive steps. The first step deals with the coreference factors at the text level (such as string similarity) and produces typed entity and relation instances that are mapped into an RDF graph. Afterwards a semantic coreference algorithm runs on the RDF graph to revise the results of the text-based step: instances are merged if they belong to the same class in the domain ontology and their string similarity is higher than a predefined threshold. However, these approaches to coreference resolution provide insufficient completeness, in particular, due to the poor use of the features of ontology classes and relations. They take into account coincidence of classes and relations of coreferential candidates for the resolution, i.e., they use only the identity property of ontology elements.
There exist several attempts to apply distributed or agent-based techniques to the coreference resolution task, in particular [20,26]. In [20] the coreference resolution factors (recency, number agreement, gender agreement, etc.) are grouped in sets as constraint sources corresponding to the known partial theories of coreference. In [26] a common constraint agent allows for morphological agreement and semantic consistency, while different coreference types (where a candidate may be a name alias, nominal predicate, appositional, definite, demonstrative, or bare noun phrase) are charged to special agents. In both papers, agents correspond to the coreference resolution factors. The detection of coreference candidates is done sequentially. In [20] the agents make the decision about admissibility of a particular candidate in parallel, and in [26] the agents compose the system of sequential decision filters. Unfortunately, due to a low degree of concurrency, the performance of the coreference resolution in these agent systems is close to the performance of the sequential resolution process.
Our approach to coreference resolution [6] is rule-based, because we deal with limited subject domains. Our proposed algorithm is ontology-driven as it strongly relies on the structure of the underlying predefined domain ontology. We focus on full lexical items (nominals and names), since they bear more semantic clues than pronominals for making comparisons with ontology classes and instances. Ambiguities occurring at the linguistic level are resolved at the ontology level. We use a similarity measure to compare potential coreferential objects within the group. The detection and resolution of the coreference use the ontology properties of classes and the similarity measure. Unlike the previous ontology-driven approaches, our evaluation of the measure is not limited to string similarity and the identity property of ontology elements: the notion of similarity integrates textual factors (such as text distance and context dependence) with the factors based on the ontological properties of instances' attributes (class hierarchy, composition, transitivity, etc.). We use special class agents to detect and resolve the coreference candidates using the similarity measure. Our agents work in parallel, which speeds up the process in comparison with the sequential and multi-agent approaches mentioned above.
In this paper, we propose extending the list of the ontological properties used for coreference resolution with additional properties such as inverse, symmetry, intersection, union, etc. Using these extra properties for evaluating coreference similarity improves the quality of the resolution process. Such an evaluation method can be applied to any ontology-driven approach. Our way of using the ontology structure allows one to resolve coreferences more precisely even for complex objects such as descriptions of events and situations presented as ontology polyadic relations. To the best of our knowledge, coreferencing such complex objects has not been studied in detail yet.
The rest of the paper is organized as follows. In Section 2, we give some background definitions and formally state the problem of coreference resolution. Section 3 defines the semantic similarity measure in detail and gives some examples of its evaluation. Section 4 outlines our approach to multi-agent information extraction, gives the description of the process of the coreference resolution and presents the algorithm of computing the combined semantic similarity measure. In the concluding Section 5, we discuss future work.

BASIC DEFINITIONS
Let us consider an ontology of some particular subject domain, together with the ontology population rules, semantic and syntactic models for the language of the subject domain, and the term vocabulary. We assume that input data are provided as a finite natural language text, information from which is used for populating our ontology. We consider an OWL-like ontology representation [9]. In the following, we list some properties of classes and attributes which are well-known in the area of ontology and description logics. We will use them in the process of detection and resolution of coreferences. This list does not claim to be comprehensive. The use of these properties for evaluating the semantic coreferential similarity improves the precision and recall of coreference resolution. We can evaluate the degree of identity/similarity of coreferential candidates using the fact that the data/relation attributes of these coreferential candidates are related by some of these relations and their values are consistent. In this paper, combinations of the properties are not considered, except the refinement relation which is the combination of the composition and inclusion relations. We use the standard notions of class and attribute inheritance relations. The relations on relation attributes correspond to the standard definitions of ontology relations between classes. We extend the standard list of properties with the refinement relation as the combination of the composition and inclusion relations, because in many practical cases of ontology relations the strict inclusion of the relation composition is required in the comparison of coreferential candidates. For example, using the attribute relation live_in ∘ include ⊏ appear_in we can deduce that if somebody lives in a house then they can appear in a room of the house, but the opposite assertion does not hold, i.e., in some sense, the attribute include refines live_in.
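The live_in ∘ include ⊏ appear_in example can be checked mechanically on a toy extension of the three relations. The following is a minimal sketch, assuming that relations are modelled as finite sets of pairs and that ∘ is read left to right (first live_in, then include); the function name `compose` and all individuals (`ann`, `bob`, `house_1`, etc.) are hypothetical and serve illustration only.

```python
def compose(first, second):
    """Return {(x, z)} such that (x, y) is in `first` and (y, z) is in `second`."""
    return {(x, z) for (x, y1) in first for (y2, z) in second if y1 == y2}


# Hypothetical toy extensions of the three relations (not taken from the paper).
live_in = {("ann", "house_1")}
include = {("house_1", "kitchen"), ("house_1", "hall")}
appear_in = {("ann", "kitchen"), ("ann", "hall"), ("bob", "hall")}

composed = compose(live_in, include)

# Strict inclusion: every composed pair is in appear_in, but not vice versa.
holds = composed < appear_in
print(sorted(composed), holds)
```

In this sketch `bob` appears in the hall without living in the house, which is what makes the inclusion strict: the opposite assertion indeed does not hold.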

An ontology O of a subject domain
For the specific goal of this paper, evaluating the semantic coreference similarity, we introduce the following new notions. For classes and attributes, we take into account the hierarchical structure implied by the inheritance relation. Let

The hierarchical group of the set C is
$$\mathrm{Hi}(C) = \bigcup_{c' \in C} \mathrm{Hi}(c').$$
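The union above can be sketched in a few lines. This is only an illustration, assuming that the hierarchical group Hi(c) of each single class is already available as a set; the mapping `hi`, the function name `hierarchical_group` and the class names are hypothetical.

```python
def hierarchical_group(classes, hi):
    """Hi(C): the union of Hi(c') over all classes c' in C."""
    group = set()
    for c in classes:
        group |= hi[c]
    return group


# Hypothetical per-class hierarchical groups, assumed to be given.
hi = {
    "heater": {"device", "heater", "steam_heater"},
    "valve": {"device", "valve"},
}

result = hierarchical_group({"heater", "valve"}, hi)
print(sorted(result))
```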
For cases when properties of attributes in Definition 1 are unknown for a given ontology to be populated, we use the necessary conditions of the properties for evaluating the semantic coreferential similarity. The following proposition formulates these conditions in a constructive way. We denote the necessary condition of a property x by x . The proof follows from Definition 1.
We define a set A of information objects (i-objects) retrieved from input data and corresponding to ontology instances. Every information object a ∈ A has the form $(c_a, Dat_a, Rel_a, G_a, P_a)$, where
- $c_a$ is the ontology class of the i-object;
- $Dat_a$ is the set of data attributes $\alpha_a = (\alpha, Val_{\alpha_a})$, where the name $\alpha \in Dat_{c_a}$;
- $Rel_a$ is the set of relation attributes, where the name $\rho \in Rel_{c_a}$, and $V_{\rho_a}$ is the set of i-objects of a class $c_{\rho_a}$ from $C_{\rho_a}$;
- $G_a$ is the grammar information (morphological and syntactic features);
- $P_a$ is the structural information (a set of positions in the input data).
We denote by $Atr_a = Dat_a \cup Rel_a$ the set of all attributes. Note that the properties of natural language processing may cause key attributes of i-objects to be assigned many values. Such ambiguities are resolved after the coreference resolution process is finished. Every i-object corresponds to some ontology instance in a natural way as follows. Let $a = (c_a, Dat_a, Rel_a, G_a, P_a)$ be an i-object, then its corresponding ontology instance is $a' = (c_a, Dat_{a'}, Rel_{a'})$, and every $\alpha \in Dat_{a'}$ has value(s) in $V_{\alpha_a}$ and every $\rho \in Rel_{a'}$ has values in $V_{\rho_a}$.
For defining the problem of coreference resolution formally, we introduce the following collative relations on i-objects a,b ∈ A: , where ⊆ r is defined in the next paragraph.Further we say just co-candidates instead of coreferential candidates.
We define for i-objects the following notions, taking into account i-objects' co-candidates.Let a, b, c ∈ A, and X, Y ⊂ A.

- The coreferential group (co-group) of the i-object a
- Coreferential conflict: i-objects a and b are in the coreferential conflict with respect to i-object c, written a ↭_c b

The coreferential conflict means that some i-object is a co-candidate for two non-coreferential i-objects. The coreference resolution problem is to detect if given i-objects correspond to the same ontology instance. Our algorithm for coreference resolution discussed in Section 4 constructs conflict-free co-groups of co-candidates. This construction uses the coreference similarity of i-objects for resolving coreferential conflicts. The measure of coreference similarity for i-objects a and b is denoted as cs(a,b). If a ↭_c b, then we say that the coreferential conflict is resolved to a iff cs(a,c) > cs(b,c).
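The resolution rule for a conflict over a shared co-candidate c can be written down directly. The sketch below assumes that the similarity cs is supplied as a callable; the function name `resolve_conflict`, the string identifiers and the scores are hypothetical, and returning `None` on a tie is an assumption, since the rule as stated only covers strict inequality.

```python
def resolve_conflict(a, b, c, cs):
    """Return the i-object (a or b) to which the conflict over c is resolved.

    The conflict is resolved to a iff cs(a, c) > cs(b, c), and symmetrically
    for b. A tie returns None (an assumption; the rule leaves it open).
    """
    score_a, score_b = cs(a, c), cs(b, c)
    if score_a > score_b:
        return a
    if score_b > score_a:
        return b
    return None


# Hypothetical similarity scores for illustration only.
scores = {frozenset({"a", "c"}): 0.8, frozenset({"b", "c"}): 0.5}


def cs(x, y):
    """Look up the symmetric similarity of two i-object identifiers."""
    return scores[frozenset({x, y})]


winner = resolve_conflict("a", "b", "c", cs)
print(winner)
```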
The semantic measure is discussed in the next section in detail, while the other three measures are briefly explained here. The context measure of similarity C(a,b) takes into account the information connectivity of i-objects in a given text. This measure depends on the number of i-objects which directly or indirectly use (1) attribute values from both a and b, and (2) attribute values borrowed by a from b, and by b from a, for the evaluation of their own attributes. The position measure of similarity P(a,b) takes into account various forms of closeness of i-objects in an input text. This measure depends on the number of segments, co-candidates in the conflict, and lexemes placed between the positions of a and b. The grammar measure of similarity G(a,b) is based on the standard linguistic features such as gender, number, person, etc. The details of these measures' definitions can be found in [6].

THE SEMANTIC MEASURE OF COREFERENCE SIMILARITY
The semantic measure of coreference similarity takes into account the attribute similarity of i-objects. This measure combines 11 types of similarity, which we summarize in Table 1. These types correspond to the properties introduced in Definition 1. In Table 1, letter x denotes the type of similarity: $x \in \{d, r, \sqcap, \sqcup, \circ, \triangleright, \smile, \sqsubseteq, *, t, s\}$. The ontology condition x is composed of the condition on the attributes and the corresponding necessary condition x from Proposition 1. This necessary condition is used when the properties of attributes in Definition 1 are unknown for the ontology being populated. The value condition x = ($S_x \neq \emptyset \wedge E_x = \emptyset$), where $S_x$ is the set of similar values and $E_x$ is the set of common values in the three cases of similarity (in other cases $E_x$ is not necessary to define). The x-similarity condition is x = x ∧ x . The power of similarity with respect to attributes $\gamma_a$ and $\delta_b$ is $sim(\gamma_a, \delta_b)$. For a relation attribute γ, we introduce the inverse cardinality $ic(\gamma) = cardinality(\gamma^{\smile})$, where cardinality is the standard numeric property of ontology relations [9]. The value of ic(γ) characterizes how many distinct instances may or must be related with the same instance by the relation corresponding to γ. This value is used in the computation of the power of similarity.
Following Table 1, we consider that for the i-objects a and b the attribute $\gamma_a$ is x-similar to attribute $\delta_b$ iff x holds, and the power of the x- r, ⊓, ⊔, ∘, ⊳, ⌣, ⊑

The proof of the proposition is based on Definition 1 and Proposition 1.
Table 1. The types of semantic similarity

Let us illustrate the introduced framework with a practice-relevant example with co-candidates whose attributes are related by the refinement and composition relations. The ontology's domain of our example is the area of Technical Documentation for Industrial Process Control (TDIPC). We consider the natural language description of a bottle-filling system example from [21].
Let us discuss the following fragment of the text that demonstrates the refinement similarity in the description:

A filler tank holds fluid. In this system, the fluid is heated and maintained at 100 degrees Celsius. Although this might typically be performed with a PID implementation, in this case the steam valve is opened and steam is inserted into the tank when the temperature falls below 100 degrees, and closes when the temperature reaches 110 degrees Celsius.
These i-objects are co-candidates, because they have the identical class heater, and the key data attribute type is not defined for a. The refinement similarity of these i-objects is $sim(\rho_a, \xi_b) = 1$, because the values of the attributes are consistent ($S_{\triangleright} \neq \emptyset$), the ontology of TDIPC contains the refinement relation: heat = heat ⊳ inside, and the inverse cardinality of heat is equal to 1. Note that the previous approaches for coreference resolution, e.g., from [10,25], would miss this coreference, because they consider coreferential candidates only with identical (possibly after some normalization) key attributes, but the key attributes of the example i-objects are different.

Our next example text fragment illustrates the composition similarity:
There is a valve in the bottom of the filler tank that is opened when an empty bottle is present, the fluid is present, and the fluid is at or above 100 degrees. A photosensor attached to the filler tank determines when the bottle is full.
For this text fragment, our algorithm of text analysis creates the following i-objects:

The approach to coreference resolution from [25] can consider the example i-objects in this case as potential coreferents due to the coincidence of their classes, which are treated as key characteristics, but this coreference will not be established, because the example i-objects have different names and values of the attribute relations. The approach to coreference resolution from [10] would not consider the example i-objects as potential coreferents, because they have no key attributes defined.
Summarizing, the previously suggested approaches would miss some coreferents which our approach would consider; this demonstrates the higher degree of completeness of our approach to the coreference resolution as compared to related work.

A MULTI-AGENT APPROACH TO COREFERENCE RESOLUTION IN THE INFORMATION EXTRACTION
The coreference similarity measures, including the semantic measure from the previous section, are used in our coreference resolution algorithm, which is part of our general approach to information extraction (IE) for the ontology population outlined below. In this section, we sketch the approach as a whole, and we provide informal descriptions of the actions of agents that execute our multi-agent algorithm of the coreference resolution.
The input of our IE-system comprises: an ontology of some particular subject domain, the ontology population rules, semantic and syntactic models for the language of the subject domain, the term vocabulary, and input data as a finite natural language text. The output is the ontology populated by information from the text.
Our IE-system consists of the following five sequential modules.
1. The module of lexical analysis executes a preliminary extraction of subject domain terms from a given text [14]. This module takes the semantic and syntactic models, the term vocabulary, and the input text, and it produces the terminological cover (the set of lexical objects without structural information). Every lexical object has the same structure as i-objects: it stores the grammatical and structural information, but it has exactly one data value in a data domain from $D_O$, and its class is a semantic class of the term vocabulary.
2. The segmentator module performs segmentation of the text into formal and genre fragments (sentences, sections, headlines, etc.) [22]. This module receives the semantic and syntactic models, and the input text as the input, and its output is the segment cover representing text decomposition into formal and genre subunits.
3. The main analysis module constructs objects, corresponding to instances of the subject domain ontology, from the terms [4], and resolves coreference [6]. The input for this module is the terminological cover with the structural information from the segment cover, and the analysis rules which implement semantic and syntactic models and ontology population rules. They are formulated by experts taking into account the ontology and language of the subject domain. This module produces the set of i-objects with resolved coreference and unresolved lexical/syntactical ambiguity.
4. The disambiguation module resolves lexical and syntactic ambiguity [5]; it takes the output of the main analysis module as its input, and yields the set of i-objects without ambiguities.
5. The population module updates the ontology with the processed objects (planned as future work). The module's input is the output of the disambiguation module and the given ontology, and its output is this ontology populated by information from the text.

Let us describe the main analysis module which performs the coreference resolution. The main analysis module performs two tasks in parallel: construction of i-objects and coreference resolution.
In the constructing process, the module generates new information based on information (attribute values) taken from i-objects and lexical objects using the analysis rules. This information is used to define new attribute values of existing i-objects and to generate new i-objects. Following the analysis rules, the module takes into consideration only linguistically and ontologically compatible sets of i-objects. Using information from one i-object for another i-object sets the information connection between these i-objects labeled by this information. These connections keep the history of the evolution of an i-object. They are used by the disambiguation module for evaluating the integration of the i-object, i.e., the amount of information related to the i-object in the text. This construction process terminates when new information cannot be generated.
For the coreference resolution task the main analysis module constructs and updates the co-groups of i-objects in parallel with constructing i-objects. The coreferential conflict resolution in the co-groups based on the similarity of i-objects is performed after the termination of constructing i-objects. This process results in conflict-free co-groups. The attribute values and information connections of i-agents in these co-groups are joined in the main analysis module for the further processes of the lexical/syntactical disambiguation and ontology population. One advantage of our approach to coreference resolution is that this joining improves the quality of the disambiguation and population processes, because it allows the corresponding modules to take into consideration all information about objects accessible from the text. Another advantage is that using the multiple similarity measures of the coreferents allows us to more precisely estimate the integration of i-objects into a given text than in our previous work [5].
In our multi-agent framework, we assign a separate agent for every i-object, every analysis rule, and every ontology class. These agents perform the following tasks of text analysis for ontology population: creating/updating i-objects and coreference resolution in parallel by i-agents, rule agents, and class agents, and then the ambiguity resolution by i-agents. These agents communicate and exchange data for executing their tasks. There is also an auxiliary agent: the master agent detects terminations and coordinates all other agents in the disambiguation process. For the details of creating i-objects and disambiguation, see [4,5]. The result of agent interactions is the system of i-objects without coreferences and without lexical or syntactical ambiguities. All agents execute their protocols in parallel until, from time to time, it happens that none of the agents can proceed. Such termination events are detected by the master agent. We use our original algorithm for termination detection, which is based on activity counting. After detecting termination, the master agent sends coordination signals, depending on the task performed, to other agents. Our system of agents is dynamic: the rule agents can create new information agents, the class agents can kill the i-agents by joining duplicates, ontological equivalents, and co-candidates (at the end of the coreference resolution process), and, in the disambiguation process, the master agent can kill the i-agents whose i-objects are weakly integrated in a given text. The agents are connected by duplex channels. The master agent is connected with all agents, and the i-agents are connected with their rule agents, class agents, and successors/predecessors by information connections. We assume that messages are transmitted instantly via a reliable medium and stored in channels until being read.
Let us briefly describe the process of coreference resolution by the class agents. Here, we do not distinguish an i-agent from its i-object if there is no ambiguity. Every class agent performs the following tasks: 1) creating the co-group for every newly born i-agent; 2) updating the co-group for every i-agent in the case of its key attribute update; 3) regulating the attribute exchange between i-agents; 4) computing the measures of the coreference similarity for i-objects in co-groups by formulas from Sections 2 and 3; 5) resolving the coreferential conflicts using the calculated measures; and 6) generating for every conflict-free co-group the integrating i-object (with the corresponding i-agent) by joining the i-objects from the co-group. Every class agent acts at its level of the class hierarchy, i.e., in processing pairs of i-agents (testing for collative relations in the creation/update of co-groups, computing the similarity measures, etc.), at least one i-agent must be in the class of this class agent. The higher class agents use results of constructing conflict-free groups from the lower class agents. The details of the coreference resolution process are described in [6], where we use a simpler semantic measure of coreference similarity in task 4 and the computation of the measure is not discussed. Here we use a more complex and precise semantic measure, so it is reasonable to describe its computation in detail.
Let us describe how the semantic similarity

This proposition is the base for the following algorithm of computing the semantic similarity measure in the case of absence of a specification for the ontology properties of relation attributes. Using the implication chains from the proposition allows us to avoid computing the truth values of the same conditions many times. We introduce the following notation for conjunctions of Boolean formulas: $\varphi_x = \varphi_y \wedge \varphi_{x/y}$. The following procedure SiMeasure(a,b) returns the semantic similarity measure S(a,b) = SimMes for i-objects $a = (c_a, Dat_a, Rel_a)$ and $b = (c_b, Dat_b, Rel_b)$. In the procedure, function $sim_x(\gamma, \delta)$ returns the power of the corresponding similarity using the formulas from Table 1.

else if t/r (ρ, ξ) then S = S + sim_t(ρ, ξ); N++; continue;
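The surviving fragment of the procedure suggests an accumulation pattern: for each pair of attributes, the first similarity type whose condition holds contributes its power to a running sum S, a counter N is incremented, and the loop moves on. The following is a minimal sketch of that pattern only, not the authors' SiMeasure procedure. The function name `semantic_similarity`, the representation of conditions and powers as callables, the toy checks, and in particular the final normalization S/N are all assumptions.

```python
def semantic_similarity(pairs, checks):
    """Accumulate similarity powers over attribute pairs.

    pairs  -- iterable of attribute pairs (rho, xi)
    checks -- ordered list of (condition, sim) callables, one per similarity type
    """
    total, count = 0.0, 0  # correspond to S and N in the fragment
    for rho, xi in pairs:
        for condition, sim in checks:
            if condition(rho, xi):
                total += sim(rho, xi)
                count += 1
                break  # first matching type wins; go to the next pair
    # Assumption: the measure is the average power over the matched pairs.
    return total / count if count else 0.0


# Hypothetical toy checks: identical names, then names sharing a first letter.
checks = [
    (lambda r, x: r == x, lambda r, x: 1.0),
    (lambda r, x: r[0] == x[0], lambda r, x: 0.5),
]
pairs = [("heat", "heat"), ("hold", "have"), ("fill", "open")]

measure = semantic_similarity(pairs, checks)
print(measure)
```

Ordering the checks from the strongest condition to the weakest mirrors the idea of the implication chains: once a stronger condition has been evaluated, weaker ones for the same pair need not be tested again.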

CONCLUSION
Our main contribution in this paper is a formal framework for coreference resolution in the process of ontology population. The novelty of the suggested framework is the use of multiple properties of ontology classes and relations for solving the coreference resolution problem. Using multiple properties provides a significantly more precise and complete coreference identification due to taking into account more similarity factors than just the elements' equality as done in previous work. The properties used in our framework include class and attribute hierarchy, intersection, union, composition, refinement, inverse, inclusion, reflexive-transitive closure, transitivity, and symmetry. We describe in detail how these properties are efficiently used in evaluating the semantic similarity of coreferential candidates. This evaluation is integrated into our multi-agent system of information extraction (IE) from texts in a natural language, which significantly speeds up the IE process as compared to a sequential implementation.
As shown in Sections 3 and 4, our approach has several advantages over the previous work: 1) it provides a higher degree of completeness regarding the considered coreferents, 2) it improves the quality of the disambiguation and population processes, because it allows the corresponding modules to take into consideration all information about objects accessible from the text, and 3) using the multiple similarity measures of the coreferents allows us to more precisely estimate the integration of i-objects into a given text than in our own previous work [5].
In future work, we plan to extend the above list of properties used in our framework with their meaningful combinations that appear in the practice of information extraction. While the presented properties are defined for binary ontology relations, we intend to specify them for n-ary ontology relations, which represent situations and events of the real world. These properties will additionally improve the quality of coreference resolution. To better estimate the impact of the semantic similarity on the integrated evaluation of coreference similarity, we will investigate the frequency and significance of using particular ontology properties for defining the corresponding coefficients in the similarity evaluation formula.
The ontology O includes the following elements: 1) a finite nonempty set C_O of classes for representing the concepts of the subject domain, 2) a finite set D_O of data domains, and 3) a finite set of attributes with names in Atr_O = Dat_O ∪ Rel_O, each of which has values in some data domain from D_O (data attributes in Dat_O) or has values that are instances of some classes (relation attributes in Rel_O, which model binary relations).
Every class c ∈ C_O is defined by a tuple of attributes: c = (Dat_c, Rel_c), where every data attribute α ∈ Dat_c ⊆ Dat_O has the domain d_α ∈ D_O with values in V_{d_α}, and every relation attribute ρ ∈ Rel_c ⊆ Rel_O has values from classes C_ρ ⊆ C_O. We denote the class of an attribute γ by c_γ. The set of all class attributes is denoted by Atr_c = Dat_c ∪ Rel_c. This set includes the nonempty set of key attributes Atr^K_c. The key attributes can be data or relation attributes.
We say that a is an instance of the class c_a = (Dat_{c_a}, Rel_{c_a}) (a ∈ c_a) iff a = (c_a, Dat_a, Rel_a), where every data attribute in Dat_a has a name α_a ∈ Dat_{c_a} with the values V_{α_a} from V_{d_{α_a}}, and every relation attribute in Rel_a has a name ρ_a ∈ Rel_{c_a} with the values V_{ρ_a} that are instances of the classes from C_ρ. The data key attributes are always one-valued, i.e., every key attribute of every ontology instance may have only a single value. The relation key attributes correspond to bijective relations.
We consider an ontology without data and class synonyms, i.e., ∀ α_1, α_2 ∈ Dat_O: d_{α_1} ≠ d_{α_2} and ∀ c_1, c_2 ∈ C_O: Atr_{c_1} ≠ Atr_{c_2}. The information content IC_O of the ontology O is a set of instances of the classes from O. The ontology population problem is to compute an information content for a given ontology from the given input data.
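The definitions above can be mirrored in a small data-structure sketch. The Python below is only an illustration under stated assumptions: the names OntClass, Instance, and is_well_formed are hypothetical, data domains are not modelled, and a key attribute is allowed to be absent from an instance (at most one value rather than exactly one), since i-objects extracted from text may leave key attributes undefined.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class OntClass:
    """A class c = (Dat_c, Rel_c) with a nonempty set of key attributes."""
    name: str
    data_attrs: FrozenSet[str]  # Dat_c
    rel_attrs: FrozenSet[str]   # Rel_c
    key_attrs: FrozenSet[str]   # Atr^K_c, a subset of Dat_c ∪ Rel_c

    def __post_init__(self) -> None:
        attrs = self.data_attrs | self.rel_attrs
        if not self.key_attrs or not self.key_attrs <= attrs:
            raise ValueError("key attributes must be a nonempty subset of Atr_c")


@dataclass
class Instance:
    """An instance a = (c_a, Dat_a, Rel_a) with multi-valued attributes."""
    cls: OntClass
    data: Dict[str, List[object]] = field(default_factory=dict)
    rels: Dict[str, List["Instance"]] = field(default_factory=dict)

    def is_well_formed(self) -> bool:
        # Attribute names must come from the class definition.
        names_ok = (set(self.data) <= self.cls.data_attrs
                    and set(self.rels) <= self.cls.rel_attrs)
        # Data key attributes are one-valued (here: at most one value).
        keys_ok = all(len(self.data[k]) <= 1
                      for k in self.cls.key_attrs if k in self.data)
        return names_ok and keys_ok
```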


duplication: a and b are duplicates (a = b) iff Atr^K_a = Atr^K_b and P_a = P_b;
ontological equivalence: a and b are ontological equivalents (a ≡ b) iff Atr^K_a = Atr^K_b and P_a ≠ P_b;
coreference: a and b are coreferential candidates (a ≈ b) iff c_a ≃_i c_b, and Atr
The measure of the coreference similarity cs(a,b) is calculated as the normalized sum of four measures, namely the semantic measure S(a,b), the context measure C(a,b), the position measure P(a,b), and the grammar measure G(a,b), as follows: cs(a,b) = ¼ (S(a,b) + C(a,b) + P(a,b) + G(a,b)). We leave for future work a more precise estimation of the contribution of each component to this measure, which may change the corresponding coefficients in the formula.
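As a worked illustration of the formula for cs(a,b), the following Python sketch computes the equally weighted average of the four measures. The function name and the sample values are hypothetical; how S, C, P, and G themselves are obtained is outside this sketch.

```python
def coreference_similarity(semantic: float, context: float,
                           position: float, grammar: float) -> float:
    """cs(a,b) = 1/4 * (S + C + P + G), with equal coefficients."""
    return 0.25 * (semantic + context + position + grammar)


# Illustrative values only, not results from the paper.
cs = coreference_similarity(semantic=1.0, context=0.5, position=0.5, grammar=0.0)
```

With these sample inputs the four measures sum to 2.0, so cs evaluates to 0.5.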
, *, t, s}. If for the attributes of co-candidates a and b the semantic similarity condition ⋀_{x∈X} x holds, then these co-candidates correspond to the same ontology instance with the integral accuracy cs(a,b), which uses the semantic similarity powers sim_x through the semantic similarity measure S(a,b).
a = bottle( … ρ_a = open (gate: valve)), b = bottle( … ξ_b = fill_from (reservoir: tank)), valve = gate( … π = in_bottom (reservoir: tank)). These i-objects are co-candidates, because they have the identical class bottle and their key attributes are not defined. The compositional similarity of these i-objects is sim(ρ_a, ξ_b) = 1, because 1) the values of the attributes are consistent (S_∘ ≠ ∅), 2) the ontology of TDIPC contains the compositional relation fill_from = open ∘ in_bottom, and 3) the inverse cardinalities of fill_from and open are equal to 1.
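The consistency check in this example can be sketched in Python as follows. This is an illustration under an assumption: S_∘ is read here as the set of values reachable from a via open and then in_bottom, intersected with the fill_from values of b; the precise definition is given elsewhere in the paper. The function compose_values and the dictionary encoding of attribute values are hypothetical.

```python
from typing import Dict, Set, Tuple

# Attribute values of the i-objects from the example, keyed by (object, relation).
attrs: Dict[Tuple[str, str], Set[str]] = {
    ("a", "open"): {"valve"},
    ("b", "fill_from"): {"tank"},
    ("valve", "in_bottom"): {"tank"},
}


def compose_values(obj: str, first: str, second: str,
                   table: Dict[Tuple[str, str], Set[str]]) -> Set[str]:
    """Values reached by following `first` from obj and then `second`."""
    reached: Set[str] = set()
    for middle in table.get((obj, first), set()):
        reached |= table.get((middle, second), set())
    return reached


# Assumed reading of S_∘: composition from a, intersected with b's fill_from values.
s_comp = compose_values("a", "open", "in_bottom", attrs) & attrs[("b", "fill_from")]
consistent = bool(s_comp)
```

The sketch covers only the consistency condition; the value sim(ρ_a, ξ_b) = 1 additionally depends on the cardinalities and the formula from Table 1, which are not reproduced here.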
SiMeasure(a,b) ::
int S = 0; N = 0;
forall α ∈ Dat_a, β ∈ Dat_b
  if d(α,β) then S = S + sim_d(α, β); N++;
forall ρ ∈ Rel_a, ξ ∈ Rel_b
measure is computed. In this paper we consider relation attributes that have exactly one property from Definition 1. If these properties are specified for the given ontology (for example, summarized in some table RP), then the algorithm for computing the semantic similarity measure is trivial: it simply checks in the table RP whether the given relations ρ and ξ satisfy some property and computes the corresponding similarity measure sim(ρ,ξ) by the formulas from Table 1. If the table of properties RP does not exist, then the algorithm must check the necessary conditions from Proposition 1. To reduce the cost of checking the necessary conditions, it is reasonable to organize them into "implication" chains. The following proposition describes three such chains. The proof follows directly from the definitions of the necessary conditions.
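The trivial case, in which a table RP of relation properties is available, can be sketched in Python as a dictionary lookup. Everything specific here is an assumption: the encoding of RP as a mapping from a pair of relation names to a single property tag, the function name similarity_from_table, and the caller-supplied similarity functions standing in for the formulas of Table 1.

```python
from typing import Callable, Dict, Optional, Tuple

PropertyTable = Dict[Tuple[str, str], str]           # hypothetical encoding of RP
SimFunctions = Dict[str, Callable[[str, str], float]]  # stand-ins for Table 1


def similarity_from_table(rho: str, xi: str,
                          rp: PropertyTable,
                          sim: SimFunctions) -> Optional[float]:
    """Look up the property of (rho, xi) in RP and apply its similarity function.

    Returns None when RP records no property for the pair, in which case
    the pair contributes nothing to the semantic similarity measure.
    """
    prop = rp.get((rho, xi))
    if prop is None:
        return None
    return sim[prop](rho, xi)
```

Because each pair of relations is assumed to have exactly one property, a single lookup per pair is enough and no conditions need to be evaluated.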