Abstract
In this chapter, we introduce recent progress on the alignment inference stage of entity alignment.
1 Introduction
Matching data instances that refer to the same real-world entity is a long-standing problem. It establishes the connections among multiple data sources and is critical to data integration and cleaning [33]. Therefore, the task has been actively studied; for instance, in the database community, various entity matching (EM) (and entity resolution (ER)) strategies are proposed to train a (supervised) classifier to predict whether a pair of data records match [9, 33].
Recently, owing to the emergence and proliferation of knowledge graphs (KGs), matching entities in KGs has drawn much attention from both academia and industry. Distinct from traditional data matching, it brings its own challenges. In particular, it underlines the use of KGs’ structures for matching and manifests unique data characteristics, e.g., imbalanced class distribution and scarce attributive textual information. As a consequence, although following the traditional EM pipeline is viable, it is difficult to train an effective classifier that can infer the equivalence between KGs’ entities. Hence, much effort has been dedicated specifically to matching entities in KGs, a task also referred to as entity alignment (EA).
Early solutions to EA are mainly unsupervised [22, 42], i.e., no labeled data is assumed. They utilize discriminative features of entities (e.g., entity descriptions and relational structures) to infer equivalent entity pairs, but are hampered by the heterogeneity of independently constructed KGs [44].
To mitigate this issue, recent solutions to EA employ a few labeled pairs as seeds to guide the learning and prediction [8, 14, 27, 37, 47]. In short, they embed the symbolic representations of KGs as low-dimensional vectors such that the semantic relatedness of entities is captured by the geometric structure of the embedding space [4], where the seed pairs are leveraged to produce unified entity representations. In the testing stage, they match entities based on the unified entity embeddings. These methods are coined embedding-based EA methods and have exhibited state-of-the-art performance on existing benchmarks.
To be more specific, the embedding-based EA pipeline can be roughly divided into two major stages, i.e., representation learning and matching KGs in entity embedding spaces (or embedding matching for short). While the former encodes the KG structures into low-dimensional vectors and establishes connections between independent KGs via the calibration or transformation of (seed) entity embeddings [44], the latter computes pairwise scores between source and target entities based on such embeddings and then makes alignment decisions according to the pairwise scores. Although this field has been actively explored, existing efforts are mainly devoted to the representation learning stage [24, 26, 61], while embedding matching received little attention until very recently [29, 53]. The majority of existing EA solutions adopt a simple algorithm to realize this stage, i.e., DInf, which first leverages common similarity metrics such as cosine similarity to calculate the pairwise similarity scores between entity embeddings and then matches a source entity to its most similar target entity according to the pairwise scores [47]. Nevertheless, such an intuitive strategy can merely reach local optima for individual entities and completely overlooks the (global) interdependence among the matching decisions for different entities [55].
To address the shortcomings of DInf, advanced strategies are devised [12, 44, 49, 53, 55, 56]. While some of them inject the modeling of global interdependence into the computation of pairwise scores [12, 44, 53], some directly improve the alignment decision-making process by imposing collective matching constraints [49, 55, 56]. These efforts demonstrate the significance of matching KGs in entity embedding spaces from at least three major aspects: (1) It is an indispensable step of EA, which takes as input the entity embeddings (generated by the representation learning stage), and outputs matched entity pairs. (2) Its performance is crucial to the overall EA results, e.g., an effective algorithm can improve the alignment results by up to 88% [53]. (3) It empowers EA with explainability, as it unveils the decision-making process of alignment. We use the following example to further illustrate the significance of the embedding matching process.
Example Figure 4.1 presents three representative cases of EA. The KG pairs to be aligned are first encoded into embeddings via the representation learning models. Next, the embedding matching algorithms produce the matched entity pairs based on the embeddings. In the most ideal case where the two KGs are identical, e.g., case (a), an ideal representation learning model would embed equivalent entities into exactly the same place in the low-dimensional space, and the simple DInf algorithm would attain perfect results. Nevertheless, in the majority of practical scenarios, e.g., cases (b) and (c), the two KGs exhibit high structural heterogeneity. Thus, even an ideal representation learning model might generate different embeddings for equivalent entities. In this case, adopting the simple DInf strategy is likely to produce false entity pairs, such as \((u_5, v_3)\) in case (b).
Worse still, as pointed out in previous works [44, 59], existing representation learning methods for EA cannot fully capture the structural information (possibly due to their inner design mechanisms or their incapability of dealing with scarce supervision signals). Under these settings, e.g., case (c), the distribution of entity embeddings in the low-dimensional space becomes irregular, and the simple embedding matching algorithm DInf falls short, i.e., producing incorrect entity pairs \((u_3, v_1)\) and \((u_5, v_1)\). Thus, in these practical cases, an effective embedding matching algorithm is crucial to inferring the correct matches. For instance, by exploiting a collective embedding matching algorithm that imposes the 1-to-1 alignment constraint, the correct matches, i.e., \((u_3, v_3)\) and \((u_5, v_5)\), are likely to be restored.
While the study of matching KGs in entity embedding spaces is rapidly progressing, there is no systematic survey or comparison of these solutions [44]. We do notice several survey papers covering embedding-based EA frameworks [44, 52, 57,58,59], but they all only briefly introduce the embedding matching module (mostly merely mentioning the DInf algorithm). In this chapter, we aim to fill this gap by surveying current solutions for matching KGs in entity embedding spaces and providing a comprehensive empirical evaluation of these methods with the following features:
(1) Systematic Survey and Fair Comparison
Albeit essential to the alignment performance, existing embedding matching strategies have not yet been directly compared. Instead, they are integrated with representation learning models and then evaluated and compared with each other (as a whole). This, however, cannot provide a fair comparison of the embedding matching strategies themselves, since the differences among them can be offset by other influential factors, such as the choice of representation learning models or input features. Therefore, in this chapter, we exclude irrelevant factors and provide a fair comparison of current matching algorithms for KGs in entity embedding spaces at both the theoretical and empirical levels.
(2) Comprehensive Evaluation and Detailed Discussion
To fully appreciate the effectiveness of embedding matching strategies, we conduct extensive experiments over a wide range of EA settings, i.e., with different representation learning models, with various input features, and on datasets at different scales. We also analyze the complexity of these algorithms and evaluate their efficiency/scalability under each experimental setting. Based on the empirical results, we discuss the strengths and weaknesses of each algorithm.
(3) New Experimental Settings and Insights
Through empirical evaluation and analysis, we discover that the current mainstream evaluation setting, i.e., 1-to-1 constrained EA, oversimplifies real-life alignment scenarios. We therefore identify two experimental settings that better reflect the challenges in practice, i.e., alignment with unmatchable entities and a new setting of non-1-to-1 alignment. We compare the embedding matching algorithms under these challenging settings to provide further insights.
Contributions
We make the following contributions:
-
We systematically and comprehensively survey and compare state-of-the-art algorithms for matching KGs in entity embedding spaces (Sect. 4.3).
-
We evaluate and compare the state-of-the-art embedding matching algorithms on a wide range of EA datasets and settings and reveal their strengths and weaknesses. The code for these algorithms is organized and integrated into an open-source library, EntMatcher, publicly available at https://github.com/DexterZeng/EntMatcher (Sect. 4.4).
-
We identify experimental settings that better mirror real-life challenges and construct a new benchmark dataset, where deeper insights into the algorithms are obtained via empirical evaluations (Sect. 4.5).
-
Based on our evaluation and analysis, we provide useful insights into the design trade-offs of existing works and suggest promising directions for the future development of matching KGs in entity embedding spaces (Sect. 4.6).
2 Preliminaries
In this section, we first present the task formulation of EA and its general framework. Next, we introduce the studies related to alignment inference and clarify the scope of this chapter. Finally, we present the key assumptions of embedding-based EA.
2.1 Task Formulation and Framework
Task Formulation
A KG \(\mathcal {G}\) is composed of triples \(\{(s, p, o)\}\), where \(s, o \in \mathcal {E}\) represent entities and \(p \in \mathcal {P}\) denotes the predicate (relation). Given a source KG \(\mathcal {G}_s\), a target KG \(\mathcal {G}_t\), the task of EA is formulated as discovering new (equivalent) entity pairs \(\mathcal {M} = \{(u, v)| u \in \mathcal {E}_s, v \in \mathcal {E}_t, u\Leftrightarrow v\}\) by using pre-annotated (seed) entity pairs \(\mathcal {S}\) as anchors, where \(\Leftrightarrow \) represents the equivalence between entities and \(\mathcal {E}_s\) and \(\mathcal {E}_t\) denote the entity sets in \(\mathcal {G}_s\) and \(\mathcal {G}_t\), respectively.
General Framework
The pipeline of state-of-the-art embedding-based EA solutions can be divided into two stages, i.e., representation learning and embedding matching, as shown in Fig. 4.2. The general algorithm can be found in Algorithm 1.
Algorithm 1: General algorithm of embedding-based EA
The majority of studies on EA are devoted to the representation learning stage. They first utilize KG embedding techniques such as TransE [4] and GCN [20] to capture the KG structure information and generate entity structural representations. Next, based on the assumption that equivalent entities from different KGs possess similar neighboring KG structures (and in turn similar embeddings), they leverage the seed entity pairs as anchors and progressively project individual KG embeddings into a unified space through training, resulting in the unified entity representations \(\boldsymbol {E}\). There have already been several survey papers concentrating on representation learning approaches for EA, and we refer the interested readers to these works [2, 44, 57, 59].
Next, we introduce the embedding matching process, the focus of this chapter, as well as its related works.
2.2 Related Work and Scope
Matching KGs in Entity Embedding Spaces
After obtaining the unified entity representations \(\boldsymbol {E}\), where equivalent entities from different KGs are assumed to have similar embeddings, the embedding matching stage (also frequently referred to as the alignment inference stage [44]) produces alignment results by comparing the embeddings of entities from different KGs. Concretely, it first calculates the pairwise scores between source and target entity embeddings according to a specific metric. The pairwise scores are then organized in matrix form as \(\boldsymbol {S}\). Next, according to the pairwise scores, various matching algorithms are put forward to align entities. The most common algorithm is Greedy, described in Algorithm 2. It directly matches a source entity to the target entity that possesses the highest pairwise score according to \(\boldsymbol {S}\). Over the last few years, advanced solutions [12, 15, 28, 29, 44, 49, 53, 55, 56, 60] have been devised to improve the embedding matching performance, and in this work, we focus on surveying and comparing these algorithms for matching KGs in entity embedding spaces.
Algorithm 2: Greedy\((\mathcal {E}_s,\mathcal {E}_t, \boldsymbol {S})\)
1 for \(u \in \mathcal {E}_s\) do
2 \( v^* = \operatorname *{\arg \max }_{v\in \mathcal {E}_t}\boldsymbol {S}(u, v)\);
3 \(\mathcal {M} \gets \mathcal {M}\cup \{(u,v^*)\}\);
4 return \(\mathcal {M}\);
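In matrix terms, Greedy is simply a row-wise argmax over the score matrix. The sketch below is an illustrative NumPy rendering (not the chapter's reference implementation), with entities identified by their row/column indices:

```python
import numpy as np

def greedy_match(S):
    """Algorithm 2 (Greedy): match each source entity (row of S) to the
    target entity (column) with the highest pairwise score."""
    return [(u, int(np.argmax(S[u]))) for u in range(S.shape[0])]

# Note the flaw discussed in the text: several source entities may be
# greedily matched to the same target, as both rows prefer column 0 here.
S = np.array([[0.90, 0.80],
              [0.85, 0.20]])
print(greedy_match(S))   # [(0, 0), (1, 0)]
```

This conflict (two sources claiming one target) is exactly what the collective strategies introduced later are designed to resolve.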
Matching KGs in Symbolic Spaces
Before the emergence of embedding-based EA, there have already been many conventional frameworks that match KGs in symbolic spaces [17, 41, 42]. While some are based on equivalence reasoning mandated by OWL semantics [17], some leverage similarity computation to compare the symbolic features of entities [42]. However, these solutions are not comparable to algorithms for matching KGs in entity embedding spaces, as (1) they cover both the representation learning and embedding matching stages in embedding-based EA and (2) the required inputs are different from those of embedding matching algorithms. Thus, we do not include them in our experimental evaluation, while they have already been compared in the survey papers covering the overall embedding-based EA frameworks [44, 59].
The matching of relations (or ontology) between KGs has also been studied by prior symbolic works [41, 42]. Nevertheless, compared with entities, they are usually in smaller amounts, of various granularities [36], and under-explored in embedding-based approaches [51]. Hence, in this work, we exclude relevant studies on this topic and focus on the matching of entities.
The task of entity resolution (ER) [9, 16, 35], also known as entity matching, deduplication, or record linkage, can be regarded as the general case of EA [59]. It assumes that the input is relational data, where each data object usually has a large amount of textual information described in multiple attributes. Nevertheless, in this chapter, we focus on EA approaches, which strive to align KGs and mainly rely on graph representation learning techniques to model the KG structure and generate entity structural embeddings for alignment. Therefore, the discussion of and comparison with ER solutions is beyond the scope of this work.
Matching Data Instances via Deep Learning
Entity matching (EM) between databases has also been greatly advanced by utilizing pre-trained language models for expressive contextualization of database records [10, 33]. These deep learning (DL)-based EM solutions devise end-to-end neural models to learn to classify an entity pair as matching or non-matching and then feed the test entity pairs into the trained models to obtain classification results [5, 25, 33]. Nevertheless, this procedure is different from the focus of our study, as both its training and testing stages involve representation learning and matching. Besides, these solutions are not suitable for matching KGs in entity embedding spaces, since (1) they require adequate labeled data to train the neural classification models, but the training data in EA is much smaller than the test data, which could result in overfitting; (2) they would suffer from severe class imbalance in EA, where an entity and all of its nonequivalent entities in the other KG constitute many negative samples, while there is usually only one positive sample for this entity; and (3) they depend on the attributive textual information of data records for training, while EA underlines the use of KG structure, which provides much fewer useful features for model training. In the experiments, we adapt DL-based EM models to tackle EA, and the results are not promising. This will be further discussed in Sect. 4.4.3.
Existing Surveys on EA
There are several survey papers covering EA frameworks [44, 52, 57,58,59]. Some articles provide high-level discussion of embedding-based EA frameworks, experimentally evaluate and compare these works, and offer guidelines for potential practitioners [44, 58, 59]. Specifically, Zhao et al. propose a general EA framework to encompass existing works and then evaluate them under a wide range of settings. Nevertheless, they only briefly mention DInf and SMat in the embedding matching stage [59]. Sun et al. survey EA approaches and develop an open-source library to evaluate existing works. However, they merely introduce DInf, SMat, and Hun. and overlook the comparison among these algorithms. Besides, they point out that current approaches devote their main efforts to learning expressive embeddings to capture entity features but ignore the alignment inference (i.e., embedding matching) stage [44]. Zhang et al. empirically evaluate state-of-the-art embedding-based EA methods in an industrial context and particularly investigate the influence of the sizes of and biases in seed mappings. They evaluate each method as a whole and do not mention the embedding matching process [58].
Two recent survey papers include the latest efforts on embedding-based EA and give more self-contained explanation on each technique. Zhang et al. provide a tutorial-type survey, while for embedding matching, they merely introduce the nearest neighbor search strategy, i.e., DInf [57]. Zeng et al. mainly introduce representation learning methods and their applications on EA, but they neglect the embedding matching stage [52].
In all, existing EA survey articles focus on the representation learning process and briefly introduce the embedding matching module (mostly only mentioning the DInf algorithm), while in this work, we systematically survey and empirically evaluate the algorithms designed for the embedding matching process in KG alignment and present comprehensive results and insightful discussions.
Scope of This Work
This study aims to survey and empirically compare the algorithms for matching KGs in entity embedding spaces, i.e., various implementations of Embedding_Matching( ) in Algorithm 1, on a wide range of EA experimental settings.
2.3 Key Assumptions
Notably, existing embedding-based EA solutions rest on a fundamental assumption: equivalent entities in different KGs possess similar (ideally, isomorphic) neighboring structures. Under this assumption, effective representation learning models would transform the structures of equivalent entities into similar entity embeddings. Thus, based on the entity embeddings, the embedding matching stage would assign higher (resp., lower) pairwise similarity scores to equivalent (resp., nonequivalent) entity pairs and finally make accurate alignment decisions through coordination based on the pairwise scores.
Besides, current EA evaluation settings assume that the entities in different KGs conform to the 1-to-1 constraint. That is, each \(u \in \mathcal {E}_s\) has one and only one equivalent entity \(v \in \mathcal {E}_t\) and vice versa. However, we contend that this assumption is in fact impractical and provide detailed experiments and discussions in Sect. 4.5.2.
3 Alignment Inference Algorithms
In this section, we introduce the algorithms for alignment inference, i.e., Embedding_Matching( ) in Algorithm 1.
3.1 Overview
We first provide an overview and comparison of matching algorithms for KGs in entity embedding spaces in Table 4.1. As mentioned in Sect. 4.2, embedding matching comprises two stages—pairwise score computation and matching. The baseline approach DInf adopts existing similarity metrics to calculate the similarity between entity embeddings and generate the pairwise scores in the first stage, and then it leverages Greedy for matching. In pursuit of better alignment performance, more advanced embedding matching strategies have been put forward. While some (i.e., CSLS, RInf, and Sink.) optimize the pairwise score computation process and produce more accurate pairwise scores, others (i.e., Hun., SMat, and RL) take into account the global alignment dynamics during the matching process, rather than greedily pursuing the local optimum for each entity, so that more correct matches can be generated through coordination under the global constraint.
We further identify two notable characteristics of matching KGs in entity embedding spaces, i.e., whether the matching leverages the 1-to-1 constraint and the direction of the matching. Regarding the former, Hun. and SMat explicitly exert the 1-to-1 constraint on the matching process. RL relaxes the strict 1-to-1 constraint by allowing non-1-to-1 matches. The greedy strategies normally do not take this constraint into consideration, except for Sink., which implicitly implements the 1-to-1 constraint in a progressive manner when calculating the pairwise scores. As for the direction of matching, Greedy only considers a single direction at a time and overlooks the influence of the reverse direction. Thus, the resultant source-to-target alignment results are not necessarily equal to the target-to-source ones. By improving the pairwise score computation, CSLS, RInf, and Sink. actually model and integrate the bidirectional alignments, though they still adopt Greedy to produce the final results. For the non-greedy methods, Hun. and SMat fully consider the bidirectional alignments and produce a matching agreed upon by both directions, while RL is unidirectional.
Next, we describe these methods in detail.
3.2 Simple Embedding Matching
DInf is the most common implementation of Embedding_Matching( ), described in Algorithm 3. Assume both KGs contain n entities. The time and space complexity of DInf is \(O(n^2)\).
Algorithm 3: \(\textsf { DInf} (\mathcal {E}_s,\mathcal {E}_t, \boldsymbol {E})\)
1 Derive similarity matrix \(\boldsymbol {S}\) based on \(\boldsymbol {E}\);
2 \(\mathcal {M} \gets \) Greedy\((\mathcal {E}_s,\mathcal {E}_t,\boldsymbol {S})\);
3 return \(\mathcal {M}\);
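Line 1 of DInf is typically instantiated with cosine similarity. The following minimal NumPy sketch (an illustration, not the EntMatcher implementation) assumes the embeddings are row matrices, L2-normalizes them so that the inner product equals cosine similarity, and then applies Greedy:

```python
import numpy as np

def dinf(E_s, E_t):
    """DInf (Algorithm 3): cosine-similarity matrix between source and
    target embeddings, followed by Greedy (row-wise argmax)."""
    A = E_s / np.linalg.norm(E_s, axis=1, keepdims=True)
    B = E_t / np.linalg.norm(E_t, axis=1, keepdims=True)
    S = A @ B.T                       # |Es| x |Et| pairwise cosine scores
    return [(u, int(np.argmax(S[u]))) for u in range(S.shape[0])]

# Toy example: each source embedding points along a distinct axis.
E_s = np.array([[1.0, 0.0], [0.0, 1.0]])
E_t = np.array([[2.0, 0.1], [0.1, 3.0]])
print(dinf(E_s, E_t))   # [(0, 0), (1, 1)]
```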
3.3 CSLS Algorithm
The cross-domain similarity local scaling (CSLS) algorithm [23] is introduced to mitigate the hubness and isolation issues of entity embeddings in EA [44]. The hubness issue refers to the phenomenon where some entities (known as hubs) frequently appear as the most similar entity of many other entities in the vector space, while the isolation issue means that some outlier embeddings are isolated from any point clusters. Accordingly, CSLS increases the similarity scores associated with isolated entity embeddings and conversely decreases those of vectors lying in dense areas [23]. Formally, the CSLS pairwise score between source entity u and target entity v is:
\(\boldsymbol {S}_{\textsf { CSLS} }(u,v) = 2\boldsymbol {S}(u,v) - \phi (u) - \phi (v), \qquad (4.1)\)
where \(\boldsymbol {S}\) is the similarity matrix derived from \(\boldsymbol {E}\) using similarity metrics, \(\phi (u) = \frac {1}{k}\sum _{v^\prime \in \mathcal {N}_u}\boldsymbol {S}(u,v^\prime )\) is the mean similarity score between the source entity u and its top-k most similar entities \(\mathcal {N}_u\) in the target KG, and \(\phi (v)\) is defined similarly. The mean similarity scores of all source and target entities are denoted in vector form as \(\phi _s\) and \(\phi _t\), respectively. To generate the matched entity pairs, Greedy is further applied to the CSLS matrix (i.e., \(\boldsymbol {S}_{\textsf { CSLS} }\)). Algorithm 4 describes the detailed procedure of CSLS.
Algorithm 4: CSLS\((\mathcal {E}_s,\mathcal {E}_t, \boldsymbol {E},k)\)
1 Derive similarity matrix \(\boldsymbol {S}\) based on \(\boldsymbol {E}\);
2 Calculate the mean values of top-k similarity scores of entities in \(\mathcal {E}_s\) and \(\mathcal {E}_t\), resulting in \(\phi _s\) and \(\phi _t\), respectively;
3 \(\boldsymbol {S}_{\textsf { CSLS} } = 2\boldsymbol {S} - \phi _s -\phi _t^\top \);
4 \(\mathcal {M} \gets \) Greedy\((\mathcal {E}_s,\mathcal {E}_t,\boldsymbol {S}_{\textsf { CSLS} })\);
5 return \(\mathcal {M}\);
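Lines 2–3 of Algorithm 4 can be vectorized directly. The sketch below (illustrative, not the library code) computes \(\phi_s\) and \(\phi_t\) as the mean of each entity's top-k similarity scores and then matches greedily on the rescaled matrix:

```python
import numpy as np

def csls_match(S, k=10):
    """CSLS (Algorithm 4): penalize entities lying in dense areas (hubs)
    by subtracting their mean top-k similarity, then match greedily."""
    k = min(k, min(S.shape))
    phi_s = np.sort(S, axis=1)[:, -k:].mean(axis=1)   # length |Es|
    phi_t = np.sort(S, axis=0)[-k:, :].mean(axis=0)   # length |Et|
    S_csls = 2 * S - phi_s[:, None] - phi_t[None, :]
    return [(u, int(np.argmax(S_csls[u]))) for u in range(S.shape[0])]

# Target 0 acts as a hub: both sources are drawn to it under plain Greedy,
# but CSLS down-weights it and recovers the 1-to-1 matching.
S = np.array([[0.90, 0.30],
              [0.85, 0.82]])
print(csls_match(S, k=1))   # [(0, 0), (1, 1)]
```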
Complexity
The time and space complexity is \(O(n^2)\). Practically, it requires more time and space than DInf, as it needs to generate the additional CSLS matrix.
3.4 Reciprocal Embedding Matching
Zeng et al. [53] formulate the EA task as a reciprocal recommendation process [38] and offer a reciprocal embedding matching strategy, RInf, to model and integrate the bidirectional preferences of entities when inferring the matching results. Formally, it defines the preference score of source entity u toward target entity v as:
where \(\boldsymbol {S}\) is the similarity matrix derived from \(\boldsymbol {E}\), \(0\leq p_{u,v} \leq 1\), and a larger \(p_{u,v}\) denotes a higher degree of preference. The matrix forms of the source-to-target and target-to-source preference scores are denoted as \(\boldsymbol {P}_{s,t}\) and \(\boldsymbol {P}_{t,s}\), respectively. Next, RInf converts each preference matrix \(\boldsymbol {P}\) into a ranking matrix \(\boldsymbol {R}\) and then averages the two ranking matrices, resulting in the reciprocal preference matrix \(\boldsymbol {P}_{s \leftrightarrow t}\) that encodes the bidirectional alignment information. Finally, it adopts Greedy to generate the matched entity pairs.
Algorithm 5: \(\textsf { RInf} (\mathcal {E}_s,\mathcal {E}_t, \boldsymbol {E})\)
1 Derive similarity matrix \(\boldsymbol {S}\) based on \(\boldsymbol {E}\);
2 for \(u \in \mathcal {E}_s\) do
3 for \(v \in \mathcal {E}_t\) do
4 Calculate \(p_{u,v}\) and \(p_{v,u}\) (cf. Eq. (4.2));
5 Collect the preference scores, resulting in \(\boldsymbol {P}_{s,t}\) and \(\boldsymbol {P}_{t,s}\);
6 Convert \(\boldsymbol {P}_{s,t}\) and \(\boldsymbol {P}_{t,s}\) into \(\boldsymbol {R}_{s,t}\) and \(\boldsymbol {R}_{t,s}\), respectively;
7 \(\boldsymbol {P}_{s \leftrightarrow t} = (\boldsymbol {R}_{s,t} + \boldsymbol {R}_{t,s}^\top )/2\);
8 \(\mathcal {M} \gets \) Greedy\((\mathcal {E}_s,\mathcal {E}_t, -\boldsymbol {P}_{s \leftrightarrow t})\);
9 return \(\mathcal {M}\);
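The core of RInf is lines 6–8: converting the two preference matrices to rank matrices and averaging them. Since the exact preference definition is given by Eq. (4.2), the sketch below simply takes the two preference matrices as inputs (for a quick test, the raw similarity matrix and its transpose can stand in for them) to illustrate the rank-and-average step:

```python
import numpy as np
from scipy.stats import rankdata

def rinf_match(P_st, P_ts):
    """RInf, lines 6-8 of Algorithm 5: convert preference matrices into
    rank matrices, average them, and match greedily. A smaller entry of
    the reciprocal matrix P_rec means a more mutually preferred pair."""
    R_st = rankdata(-P_st, axis=1)    # rank of each target for each source
    R_ts = rankdata(-P_ts, axis=1)    # rank of each source for each target
    P_rec = (R_st + R_ts.T) / 2       # reciprocal preference matrix
    # Greedy on -P_rec, i.e., pick the best (lowest) average rank per row.
    return [(u, int(np.argmin(P_rec[u]))) for u in range(P_rec.shape[0])]
```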
Complexity
Algorithm 5 describes the detailed procedure of RInf. The time complexity is \(O(n^2\lg n)\) [53]. The space complexity is \(O(n^2)\). Practically, it requires more space than DInf and CSLS, due to the computation of the similarity, preference, and ranking matrices. Notably, two variant methods, RInf-wr and RInf-pb, have been proposed to reduce the memory and time consumption brought by the reciprocal modeling. More details can be found in [53].
3.5 Embedding Matching as Assignment
Some very recent studies [29, 49] propose to model the embedding matching process as the linear assignment problem. They first use similarity metrics to calculate pairwise similarity scores based on \(\boldsymbol {E}\). Then they adopt the Hungarian algorithm [21] to solve the task of assigning source entities to target entities according to the pairwise scores. The objective is to maximize the sum of the pairwise similarity scores of the final matched entity pairs while observing the 1-to-1 assignment constraint. In this work, we use the Hungarian algorithm implemented by Jonker and Volgenant [18] and denote it as Hun.\((\mathcal {E}_s,\mathcal {E}_t,\boldsymbol {E})\).
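In practice this assignment can be solved with off-the-shelf routines such as SciPy's `linear_sum_assignment` (a Jonker-Volgenant-style solver). The toy case below, where Greedy would match both sources to the same target, shows the 1-to-1 objective at work:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Both source entities prefer target 0 individually; Greedy matches both
# to it. The assignment solver instead maximizes the TOTAL score under
# the 1-to-1 constraint: 0.80 + 0.85 = 1.65 beats 0.90 + 0.20 = 1.10.
S = np.array([[0.90, 0.80],
              [0.85, 0.20]])
rows, cols = linear_sum_assignment(S, maximize=True)
matches = [(int(u), int(v)) for u, v in zip(rows, cols)]
print(matches)   # [(0, 1), (1, 0)]
```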
Besides, the Sinkhorn operation [31] (or Sink. for short) is also adopted to solve the assignment problem [12, 15, 29], which converts the similarity matrix \(\boldsymbol {S}\) into a doubly stochastic matrix \(\boldsymbol {S}_{sinkhorn}\) that encodes the entity correspondence information. Specifically,
\(\boldsymbol {S}_{sinkhorn} = Sinkhorn^l(\boldsymbol {S}) = \Gamma _c(\Gamma _r(Sinkhorn^{l-1}(\boldsymbol {S}))), \qquad (4.3)\)
where \(Sinkhorn^0(\boldsymbol {S}) = \exp (\boldsymbol {S})\) and \(\Gamma _c\) and \(\Gamma _r\) refer to the column-wise and row-wise normalization operators of a matrix, respectively. Since the number of iterations l is limited, the Sinkhorn operation can only obtain an approximate 1-to-1 assignment solution in practice [29]. Then \(\boldsymbol {S}_{sinkhorn}\) is forwarded to Greedy to obtain the alignment results.
Algorithm 6: Sink.\((\mathcal {E}_s,\mathcal {E}_t, \boldsymbol {E}, l)\)
1 Derive similarity matrix \(\boldsymbol {S}\) based on \(\boldsymbol {E}\);
2 \(\boldsymbol {S}_{sinkhorn} = Sinkhorn^l(\boldsymbol {S})\) (cf. Eq. (4.3));
3 \(\mathcal {M} \gets \) Greedy\((\mathcal {E}_s,\mathcal {E}_t,\boldsymbol {S}_{sinkhorn})\);
4 return \(\mathcal {M}\);
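A minimal NumPy sketch of the Sinkhorn operation follows (illustrative only; practical implementations typically work in the log domain for numerical stability):

```python
import numpy as np

def sinkhorn(S, l=10):
    """Sink. (Algorithm 6): start from exp(S) and alternately normalize
    rows and columns for l iterations, approaching a doubly stochastic
    matrix that encodes a soft 1-to-1 assignment."""
    M = np.exp(S)
    for _ in range(l):
        M = M / M.sum(axis=1, keepdims=True)   # row-wise normalization
        M = M / M.sum(axis=0, keepdims=True)   # column-wise normalization
    return M
```

On a conflict case where Greedy would match both sources to target 0 (e.g., scores `[[9.0, 8.0], [8.5, 2.0]]`), the row-wise argmax of the Sinkhorn output recovers the 1-to-1 solution, matching source 0 to target 1 and source 1 to target 0.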
Complexity
For Hun., the time complexity is \(O(n^3)\), and the space complexity is \(O(n^2)\). Algorithm 6 describes the procedure of Sink.. The time complexity of Sink. is \(O(ln^2)\) [29], and the space complexity is \(O(n^2)\). In practice, both algorithms require more space than DInf, since they need to store the intermediate results.
3.6 Stable Embedding Matching
In order to consider the interdependence among alignment decisions, Zeng et al. [55] and Zhu et al. [60] formulate the embedding matching process as the stable matching problem [13]. It has been proved that, for any two sets of members of the same size, each of whom provides a ranking of the members in the opposing set, there exists a bijection of the two sets such that no pair of members from opposite sides would prefer to be matched to each other rather than to their assigned partners [11]. Specifically, these works first produce the similarity matrix \(\boldsymbol {S}\) based on \(\boldsymbol {E}\) using similarity metrics. Next, they generate the rankings of members in the opposing set according to the pairwise similarity scores. Finally, they use the Gale-Shapley algorithm [40] to solve the stable matching problem. This procedure is denoted as SMat\((\mathcal {E}_s,\mathcal {E}_t,\boldsymbol {E})\).
Complexity
SMat has a time complexity of \(O(n^2\lg n)\) (since for each entity, the ranking of entities on the opposite side needs to be computed) and a space complexity of \(O(n^2)\).
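A compact sketch of SMat via the Gale-Shapley procedure is given below (illustrative, assuming a square score matrix): source entities propose in descending similarity order, and each target tentatively keeps the best proposal seen so far.

```python
def stable_match(S):
    """SMat via Gale-Shapley: sources propose to targets in order of
    similarity; a target keeps the best proposer seen so far. S is a
    square score matrix (list of lists or 2-D array)."""
    n = len(S)
    # Preference list of each source: targets sorted by descending score.
    prefs = [sorted(range(n), key=lambda v: -S[u][v]) for u in range(n)]
    next_prop = [0] * n      # next preference index each source proposes to
    partner = [None] * n     # partner[v] = source currently held by target v
    free = list(range(n))    # sources not yet matched
    while free:
        u = free.pop()
        v = prefs[u][next_prop[u]]
        next_prop[u] += 1
        if partner[v] is None:
            partner[v] = u                 # v accepts its first proposal
        elif S[u][v] > S[partner[v]][v]:
            free.append(partner[v])        # v trades up; old partner is freed
            partner[v] = u
        else:
            free.append(u)                 # v rejects u; u proposes elsewhere
    return [(partner[v], v) for v in range(n)]
```

Note the contrast with Hun.: stable matching guarantees no blocking pair rather than a maximum total score, so the two algorithms can return different 1-to-1 matchings on the same score matrix.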
3.7 RL-Based Embedding Matching
The embedding matching process is cast as a classic sequence decision problem by [56]. Given a sequence of source entities (and their embeddings), the goal is to decide to which target entity each source entity should be aligned. This work devises a reinforcement learning (RL)-based framework to learn to optimize the decision-making for all entities, rather than optimizing every single decision separately. More details can be found in the original paper [56], and we use RL\((\mathcal {E}_s,\mathcal {E}_t,\boldsymbol {E})\) to denote it.
Complexity
Note that it is difficult to deduce the time complexity for this neural RL model. Instead, we provide the empirical time costs in the experiments. The space complexity is \(O(n^2)\).
4 Main Experiments
In this section, we compare the algorithms for matching KGs in entity embedding spaces on the mainstream EA evaluation setting (1-to-1 alignment).
4.1 EntMatcher: An Open-Source Library
To ensure comparability, we re-implemented all compared algorithms in Python under a unified framework and established an open-source library, EntMatcher. The architecture of the EntMatcher library is presented in the blue block of Fig. 4.3; it takes as input the unified entity embeddings \(\boldsymbol {E}\) and produces the matched entity pairs. It has the following three major features:
Loosely Coupled Design
There are three independent modules in EntMatcher, and we have implemented the representative methods in each module. Users are free to combine the techniques in each module to develop new approaches, or to implement their new designs by following the templates in modules.
Reproduction of Existing Approaches
To support our experimental study, we tried our best to re-implement all existing algorithms by using EntMatcher. For instance, the combination of cosine similarity, CSLS, and Greedy reproduces the CSLS algorithm in Sect. 4.3.3; and the combination of cosine similarity, None, and Hun. reproduces the Hun. algorithm in Sect. 4.3.5. The specific hyper-parameter settings are elaborated in Sect. 4.4.2.
Flexible Integration with Other Modules in EA
EntMatcher is highly flexible, which can be directly called during the development of standalone EA approaches. Besides, users may also use EntMatcher as the backbone and call other modules. For instance, to conduct the experimental evaluations in this work, we implemented the representation learning and auxiliary information modules to generate the unified entity embeddings \(\boldsymbol {E}\), as shown in the white blocks of Fig. 4.3. More details are elaborated in the next subsection. Finally, EntMatcher is also compatible with existing open-source EA libraries (that mainly focus on representation learning) such as OpenEAFootnote 6 and EAkit.Footnote 7
4.2 Experimental Settings
The current EA evaluation setting assumes that the entities in the source and target KGs are 1-to-1 matched (cf. Sect. 4.2.3). Although this assumption simplifies the real-world scenarios, where some entities are unmatchable or might be aligned to multiple entities on the other side, it indeed reflects the core challenge of EA. Therefore, following existing literature, we mainly compare the embedding matching algorithms under this setting and postpone the evaluation on the challenging real-life scenarios to Sect. 4.5.
Datasets
We used popular EA benchmarks for evaluation: (1) DBP15K, which comprises three multilingual KG pairs extracted from DBpedia [1], English to Chinese (\({\mathtt {DBP15K}_{\mathtt {ZH-EN}}}\)), English to Japanese (\({\mathtt {DBP15K}_{\mathtt {JA-EN}}}\)), and English to French (\({\mathtt {DBP15K}_{\mathtt {FR-EN}}}\)); (2) SRPRS, which is a sparser dataset that follows real-life entity distribution, including two multilingual KG pairs extracted from DBpedia, English to French (\({\mathtt {SRPRS}_{\mathtt {EN-FR}}}\)) and English to German (\({\mathtt {SRPRS}_{\mathtt {EN-DE}}}\)), and two mono-lingual KG pairs, DBpedia to Wikidata [46] (\({\mathtt {SRPRS}_{\mathtt {DBP-WD}}}\)) and DBpedia to YAGO [43] (\({\mathtt {SRPRS}_{\mathtt {DBP-YG}}}\)); and (3) DWY100K, which is a larger dataset consisting of two mono-lingual KG pairs: DBpedia to Wikidata (D-W) and DBpedia to YAGO (D-Y). The detailed statistics can be found in Table 4.2, where the numbers of entities, relations, triples, gold links, and the average entity degree are reported. Regarding the gold alignment links, we adopted 70% as test set, 20% for training, and 10% for validation.
Hardware Configuration and Hyper-Parameter Setting
Our experiments were performed on a Linux server that is equipped with an Intel Core i7-4790 CPU running at 3.6GHz, NVIDIA GeForce GTX TITAN X GPU, and 128 GB RAM. We followed the configurations presented in the original papers of these algorithms and tuned the hyper-parameters on the validation set. Specifically, for CSLS, we set k to 1, except on the non-1-to-1 setting where we set it to 5. Similarly, regarding RInf, we changed the maximum operation in Eq. (4.2) to top-k average operation on the non-1-to-1 setting, where k is set to 5. As to Sink., we set l to 100. For RL, we found the hyper-parameters in the original paper could already produce the best results and directly adopted them. The rest of the approaches, i.e., DInf, Hun., and SMat, do not contain hyper-parameters.
Evaluation Metric
We utilized F1 score as the evaluation metric, which is the harmonic mean between precision and recall, where the precision value is computed as the number of correct matches divided by the number of matches found by a method, and the recall value is computed as the number of correct matches found by a method divided by the number of gold matches. Note that recall is equivalent to the Hits@1 metric used in some previous works.
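The metric, as defined above, can be sketched as follows, assuming the predicted and gold matches are given as sets of (source, target) pairs; the function name is illustrative:

```python
def evaluate(predicted, gold):
    """Precision, recall, and F1 over sets of (source, target) pairs,
    as defined above. When every source entity receives exactly one
    prediction and all test entities are matchable, recall coincides
    with the Hits@1 metric."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```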
Representation Learning Models
Since representation learning is not the focus of this work, we adopted two frequently used models, i.e., RREA [30] and GCN [47]. Concretely, GCN is one of the simplest models, which uses graph convolutional networks to learn the structural embeddings, while RREA is one of the best-performing solutions, which leverages relational reflection transformation to obtain relation-specific entity embeddings.
Auxiliary Information for Alignment
Some works leverage the auxiliary information in KGs (e.g., entity attributes, descriptions, and pictures) to complement the KG structure. Specifically, this auxiliary information is first encoded into low-dimensional vectors and then fused with the structural embeddings to provide more accurate entity representations for the subsequent embedding matching stage [44]. Although EA underlines the use of graph structure for alignment [59], for a more comprehensive evaluation, we examined the influence of auxiliary information on the matching results by following previous works and using entity name embeddings to facilitate alignment [34, 59]. We also combined these two channels of information with equal weights to generate the fused similarity matrix for matching.Footnote 8
Similarity Metric
After obtaining the unified entity representations \(\boldsymbol {E}\), a similarity metric is required to produce pairwise scores and generate the similarity matrix \(\boldsymbol {S}\). Frequent choices include the cosine similarity [6, 30, 45], the Euclidean distance [7, 24], and the Manhattan distance [48, 50]. In this work, we followed mainstream works and adopted the cosine similarity.
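A minimal sketch of this step, assuming the embeddings are given as NumPy arrays with one entity per row:

```python
import numpy as np

def cosine_similarity_matrix(E_s, E_t):
    """Pairwise cosine similarity between source embeddings E_s
    (n x d) and target embeddings E_t (m x d): normalize each row
    to unit length, then a single matrix product yields S (n x m)."""
    E_s = E_s / np.linalg.norm(E_s, axis=1, keepdims=True)
    E_t = E_t / np.linalg.norm(E_t, axis=1, keepdims=True)
    return E_s @ E_t.T
```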
4.3 Main Results and Comparison
We first evaluate with only structural information and report the results in Table 4.3, where R- and G- refer to using RREA and GCN to generate the structural embeddings, respectively, and DBP and SRP denote DBP15K and SRPRS, respectively. Next, we supplement with name embeddings and report the results in Table 4.4, where N- and NR- refer to only using the name embeddings and fusing name embeddings with RREA structural representations, respectively. Note that, on existing datasets, all the entities in the test set can be matched, and all the algorithms are devised to find a target entity for each test source entity. Hence, the number of matches found by a method equals the number of gold matches, and consequently the precision value is equal to the recall value and the F1 score [56].
Overall Performance
First, before delving into the individual embedding matching algorithms, we analyze the general results. Specifically, using RREA to learn structural representations brings better performance than using GCN, showcasing that representation learning strategies are crucial to the overall alignment performance. When introducing the entity name information, we observe that this auxiliary signal alone can already provide a very accurate signal for alignment. This is because the equivalent entities in different KGs of current datasets share very similar or even identical names. After fusing the semantic and structural information, the alignment performance is further lifted, with most of the approaches exceeding 0.9 in terms of the F1 score.
Effectiveness Comparison of Embedding Matching Algorithms
From the tables, it is evident that: (1) Overall, Hun. and Sink. attain much better results than the other strategies. Specifically, Hun. takes full account of the global matching constraints and strives to reach a globally optimal matching under the objective of maximizing the sum of pairwise similarity scores. Moreover, the 1-to-1 constraint it exerts aligns with the present evaluation setting, where the source and target entities are 1-to-1 matched. Sink., on the other hand, implicitly implements the 1-to-1 constraint during pairwise score computation and still adopts Greedy to produce the final results, where there might exist non-1-to-1 matches. (2) DInf attains the worst performance. This is because it directly adopts the similarity scores, which suffer from the hubness and isolation issues [44]. Besides, it leverages Greedy, which merely reaches the local optimum for each entity. (3) The performance of RInf, CSLS, SMat, and RL is comparable. RInf and CSLS improve upon DInf by mitigating the hubness issue and enhancing the quality of pairwise scores. SMat and RL, on the other hand, improve upon DInf by modeling the interactions among the matching decisions of different entities.
Furthermore, we conduct a deeper analysis of these approaches and identify the following patterns:
Pattern 1. If the highest pairwise similarity scores of source entities are close, RInf and CSLS (resp., SMat and RL) attain relatively better (resp., worse) performance. Specifically, in Table 4.3, where RInf consistently (and CSLS sometimes) attains superior results to SMat and RL, the average standard deviation (STD) values of the top five pairwise similarity scores of source entities (cf. Fig. 4.4) are very small, unveiling that the top scores are close and difficult to differentiate. In contrast, in Table 4.4, where SMat and RL outperform RInf and CSLS, the corresponding STD values are relatively large. This is because RInf and CSLS aim to make the scores more distinguishable, and hence they are more effective in cases where the top similarity scores are very close (i.e., low STD values). On the contrary, when the top similarity scores are already discriminating (e.g., Table 4.4), RInf and CSLS become less useful, while SMat and RL can still make improvements by using the global constraints to force deviations from local optima.
Pattern 2. On sparser datasets, the superiority of Sink. and Hun. over the rest of the methods becomes less significant. This is based on the observation that on SRPRS, other matching algorithms (RInf in particular) attain much closer performance to Sink. and Hun. Such a pattern could be attributed to the fact that, on sparser datasets, entities normally have fewer connections with others, i.e., a lower average entity degree (in Table 4.2), where representation learning strategies might fail to fully capture the structural signals for alignment and the resultant pairwise scores become less accurate. These inaccurate scores could mislead the matching process and hence limit the effectiveness of the top-performing methods, i.e., Sink. and Hun. In other words, sparser KG structures are more likely to (partially) break the fundamental assumption on KG structure similarity (cf. Sect. 4.2.3).
Efficiency Analysis
We compare the time and space efficiency of these methods on the medium-sized datasets in Fig. 4.5. Since the costs on KG pairs from the same dataset are very similar, we report the average time and space costs under each setting in the interest of space.
Specifically, we observe that: (1) The simple algorithm DInf is the most efficient approach. (2) Among the advanced approaches, CSLS is the most efficient one, closely trailing DInf. (3) The efficiency of RInf and Hun. is comparable. While Hun. consumes relatively less memory space than RInf, its time efficiency is less stable, and it tends to run slower on datasets with less accurate pairwise scores. (4) The space efficiency of Sink. is close to that of RInf and Hun., whereas it has much higher time costs, which largely depend on the value of l. (5) RL is the least time-efficient approach, while SMat is the least space-efficient algorithm. RL requires more time on datasets with less accurate pairwise scores, where its pre-processing module fails to produce promising results [56]. The memory consumption of SMat is high, as it needs to store a large amount of intermediate matching results. Overall, we can conclude that advanced embedding matching algorithms generally require more time and memory space, among which the methods incorporating global matching constraints tend to be the least efficient.
Comparison with DL-Based EM Approaches
We utilize the deepmatcher Python package [33], which provides built-in neural networks and utilities for training and applying state-of-the-art deep learning models for entity matching, to address EA. Specifically, we use the structural and name embeddings to replace the attributive text inputs in deepmatcher, respectively, and then train the neural model with labeled data. For each positive entity pair, we randomly sample ten negative ones. In the testing stage, for each source entity, we feed the pairs formed by it and every target entity into the trained classifier and regard the entity pair with the highest predicted score as the result.
In the final results, only a handful of entities are correctly aligned, showing that DL-based EM approaches cannot handle EA. This can be ascribed to the insufficient labeled data, the imbalanced class distribution, and the lack of attributive text information, as discussed in Sect. 4.2.2.
4.4 Results on Large-Scale Datasets
Next, we provide the results on the relatively larger dataset, i.e., DWY100K, which can also reflect the scalability of these algorithms. The results are presented in Table 4.5.Footnote 9 The general pattern is similar to that on G-DBP (i.e., using GCN on DBP15K), where Sink. and Hun. obtain the best results, followed by RInf. The performance of CSLS and RL is close, outperforming DInf by over 20%.
We compare the efficiency of these algorithms in Table 4.5, where \(\bar {T}\) refers to the average time cost and Mem. denotes whether the memory space required by the model can be covered by our experimental environment.Footnote 10 We observe that, given larger datasets, most of the performant algorithms have poor efficiency and scalability (e.g., RInf, Sink., and Hun.). Note that in [53], two variants of RInf, i.e., RInf-wr and RInf-pb, are proposed to improve its scalability at the cost of a small performance drop, which is empirically validated in Table 4.5. This also reveals that more scalable algorithms for matching KGs in entity embedding spaces should be devised.
4.5 Analysis and Insights
We provide further experiments and discussions in this subsection.
On Efficiency and Scalability
The simple algorithm DInf is the most efficient and scalable one, as it merely involves the most basic computation and matching operations. CSLS is slightly less efficient than DInf due to the update of pairwise similarity scores. It also has good scalability. Although RInf adopts a similar idea to CSLS, it involves an additional ranking process, which brings much more time and memory consumption, making it less scalable. Sink. repeatedly conducts the normalization operation, and thus its time efficiency is mainly determined by the value of l. Its scalability is also limited by its memory consumption, since it needs to store intermediate results, as revealed in Table 4.5.
Regarding the methods that exert global constraints, Hun. is efficient on medium-sized datasets, while it is not scalable due to the high time complexity and memory space consumption. SMat is space-inefficient even on the medium-sized datasets, making it not scalable. In comparison, RL has more stable time and space costs and can scale to large datasets, and the main influencing factor is the accuracy of pairwise scores. This is because RL has a pre-processing step that filters out confident matched entity pairs and excludes them from the time-consuming RL learning process [56]. More confident matched entity pairs would be filtered out if the pairwise scores are more accurate.
On Effectiveness of Improving Pairwise Score Computation
We compare and discuss the strategies for improving the pairwise score computation, i.e., CSLS, RInf, and Sink.
Both CSLS and RInf aim to mitigate the hubness and isolation issues in the raw pairwise scores (from different starting points). Particularly, we observe that, by setting k (in Eq. (4.1)) of CSLS to 1, the difference between RInf and CSLS is reduced to the extra ranking process of RInf, and the results in Table 4.3 and 4.4 validate that this ranking process can consistently bring better performance. This is because the ranking operation can amplify the difference among the scores and prevent such information from being lost after the bidirectional aggregation [53]. However, it is noteworthy that the ranking process brings much more time and memory consumption, as can be observed from the empirical results.
Then we analyze the influence of k value in CSLS. As shown in Fig. 4.6, a larger k leads to worse performance. This is because a larger k implies a smaller \(\phi \) value in Eq. (4.1) (where the top-k highest scores are considered and averaged), and the resultant pairwise scores become less distinctive. This also validates the effectiveness of the design in RInf (cf. Eq. (4.2)), where only the maximum value is considered to compute the preference score. Nevertheless, in Sect. 4.5.2, we reveal that setting k to 1 is only useful in the 1-to-1 alignment setting.
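For illustration, below is a sketch of the CSLS adjustment following its standard definition, in which the average top-k neighborhood similarity is subtracted from both sides to penalize "hub" entities that are close to everything; the exact form of Eq. (4.1) may differ in detail, so treat this as an assumption rather than a transcription:

```python
import numpy as np

def csls(S, k=1):
    """CSLS-adjusted scores from a raw similarity matrix S (n x m):
    each score is doubled, then discounted by the average of the k
    highest scores in its row (source-side hubness) and in its
    column (target-side hubness)."""
    top_rows = np.sort(S, axis=1)[:, -k:].mean(axis=1)  # phi for sources
    top_cols = np.sort(S, axis=0)[-k:, :].mean(axis=0)  # phi for targets
    return 2 * S - top_rows[:, None] - top_cols[None, :]
```

With k = 1 each entity's own maximum is subtracted, which is why, as noted above, only the extra ranking step separates RInf from CSLS in this configuration.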
As for Sink., it adopts an extreme approach to optimize the pairwise scores, which encourages each source (resp., target) entity to have only one positive pairwise score with a target (resp., source) entity and 0's with the rest of the target (resp., source) entities. Thus, as l increases, it progressively and implicitly implements the 1-to-1 alignment constraint during the pairwise score computation, which is particularly useful in the present 1-to-1 evaluation setting of EA. In Fig. 4.7, we further examine the influence of l in Eq. (4.3) on the alignment results of Sink., which meets our expectation: the larger the l value, the better the distribution of the resultant pairwise scores fits the 1-to-1 constraint, and thus the higher the alignment performance. Nevertheless, a larger l also implies a longer processing time. Therefore, by tuning on the validation set, we set l to 100 to strike a balance between effectiveness and efficiency.
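A minimal sketch of this iterative normalization, assuming a softmax-style initialization with a temperature tau (the temperature and its value are assumptions for illustration, not taken from the text):

```python
import numpy as np

def sinkhorn(S, l=100, tau=0.05):
    """l rounds of alternating row/column normalization on
    exp(S / tau). As l grows, the matrix approaches a doubly
    stochastic one, implicitly pushing the scores toward a
    1-to-1 assignment before Greedy is applied."""
    P = np.exp(S / tau)
    for _ in range(l):
        P = P / P.sum(axis=1, keepdims=True)  # rows sum to 1
        P = P / P.sum(axis=0, keepdims=True)  # columns sum to 1
    return P
```

The trade-off discussed above is visible here directly: each extra iteration sharpens the scores toward the 1-to-1 constraint but adds a full pass over the \(n \times n\) matrix.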
On Effectiveness of Exerting Global Constraints
Next, we compare and discuss the methods that exert global constraints on the embedding matching process, i.e., Hun., SMat, and RL.
It is evident that Hun. is the most performant approach, as it fits well with the present EA setting and can secure an optimal solution toward maximizing the sum of pairwise scores. Specifically, the current EA setting has two notable assumptions (cf. Sect. 4.2.3). With these two assumptions, EA can be transformed into the linear assignment problem, which aims to maximize the sum of pairwise scores under the 1-to-1 constraint [29]. Thus, the algorithms for solving the linear assignment problem, e.g., Hun., can attain remarkably high performance on EA. However, these two assumptions do not necessarily hold in all scenarios, which could influence the effectiveness of Hun. For instance, as revealed in Pattern 2, on sparse datasets (e.g., SRPRS), the neighboring structures of some equivalent entities are likely to differ, which limits the effectiveness of Hun. In addition, the 1-to-1 alignment constraint is not necessarily true in practice, which will be discussed in Sect. 4.5.
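As an illustration of why the global 1-to-1 constraint helps, the following toy example (using SciPy's `linear_sum_assignment`, a standard solver for the linear assignment problem, as a stand-in for Hun.) shows a case where greedy row-wise matching collides on one target while the optimal assignment does not:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy similarity matrix: greedy matching sends both source 0 and
# source 1 to target 0 (0.90 and 0.85 are their row maxima), whereas
# the optimal 1-to-1 assignment pairs 0-0 and 1-1, maximizing the
# total score (0.90 + 0.80 = 1.70 > 0.10 + 0.85 = 0.95).
S = np.array([[0.90, 0.10],
              [0.85, 0.80]])
rows, cols = linear_sum_assignment(S, maximize=True)
matches = list(zip(rows, cols))
```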
In comparison, SMat merely aims to attain a stable matching, where the resultant entity pairing could be sub-optimal under present evaluation setting. RL, on the other hand, relaxes the 1-to-1 constraint and only deviates slightly from the greedy matching, and hence the results are not very promising.
Overall Comparison and Conclusion
Finally, we compare the algorithms all together and draw the following conclusions under the 1-to-1 alignment setting: (1) The best performing methods are Hun. and Sink.. Nevertheless, they have low scalability. (2) CSLS and RInf achieve the best balance between effectiveness and efficiency. While CSLS is more efficient, RInf is more effective. (3) SMat and RL tend to attain better results when the accuracy of the pairwise scores is high. Nevertheless, they require relatively more time.
Combining Embedding Matching Algorithms
As described above, CSLS, RInf, and Sink. mainly improve the computation of pairwise scores, while Hun., SMat, and RL exert the global constraints during the matching process. Thus, by using the EntMatcher library, we aim to investigate whether the combination of these strategies would lead to better matching performance.
The results are reported in Table 4.6, where we can observe that: (1) Hun. is already effective given the raw pairwise scores, and using CSLS or RInf to improve the pairwise scores does not help and can even degrade performance. (2) For SMat and RL, using CSLS or RInf to improve the raw pairwise scores consistently leads to better results. (3) Looking from the other side, while applying Hun., SMat, or RL on top of CSLS improves its performance, such additional operations degrade the results of RInf. This is because, by modeling entity preference and converting scores to rankings, RInf has already discarded information contained in the original pairwise scores, and exerting global constraints upon the reciprocal preference scores is no longer beneficial.
Hence, we can conclude that, generally speaking, combining the algorithms designed for different embedding matching stages leads to better alignment results, except for the combinations involving RInf or Hun.
Alignment Results Analysis
We further analyze the sets of matched entity pairs produced by the compared algorithms. Specifically, we examine the difference of the correct results found by different methods and report the pairwise difference ratios in the heat map of Fig. 4.8. The difference ratio is defined as \(|\mathcal {C}-\mathcal {R}|/|\mathcal {C}|\), where \(\mathcal {C}\) and \(\mathcal {R}\) denote the correct aligned entity pairs produced by the corresponding algorithms in the column and row, respectively.
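The ratio can be computed directly from the two sets of correct pairs; a minimal sketch (the function name is illustrative):

```python
def difference_ratio(col_correct, row_correct):
    """|C - R| / |C|: the fraction of the column algorithm's correct
    pairs that the row algorithm fails to find. 0 means the row
    algorithm recovers everything the column algorithm gets right."""
    return len(col_correct - row_correct) / len(col_correct)
```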
From Fig. 4.8a, we can observe that: (1) The elements in the matrix (except those on the secondary diagonal) are above 0, showing that these matching algorithms produce complementary correct matches. (2) The results of Hun. and Sink., of CSLS and RInf, and of SMat and RL are similar; that is, the algorithms in each pair produce very similar correct matches (i.e., with low difference ratios and light colors). (3) The columns of Hun., Sink., and RInf have relatively darker colors, revealing that they tend to discover matches that other methods fail to detect.
We further select three representative methods, i.e., RInf, Hun., and SMat, and provide a more detailed analysis in Fig. 4.8b. It is obvious that these approaches do produce complementary results, and it calls for an ensemble framework that integrates the alignment results produced by different matching algorithms.
5 New Evaluation Settings
In this section, we conduct experiments on settings that can better reflect real-life challenges.
5.1 Unmatchable Entities
Current EA literature largely overlooks the unmatchable issue, where a KG contains entities that the other KG does not contain. For instance, when aligning YAGO 4 and IMDB, only 1% of entities in YAGO 4 are film-related and possibly have equivalent entities in IMDB, while the other 99% of entities in YAGO 4 necessarily have no match in IMDB [59]. Hence, we aim to evaluate the embedding matching algorithms in terms of dealing with unmatchable entities.
Datasets and Evaluation Settings
Following [54], we adapt the KG pairs in DBP15K to include unmatchable entities, resulting in DBP15K+. The detailed construction procedure can be found in [54].
As for the evaluation metric, we follow the main experimental setting and adopt the F1 score. Unlike 1-to-1 alignment, there exist unmatchable entities in this adapted dataset, and the precision and recall values are not necessarily equivalent, since some methods would also align unmatchable entities.
Notably, the original setting of SMat and Hun. requires that the numbers of entities on the two sides be equal. Thus, we add dummy nodes to the side with fewer entities to satisfy this requirement and then apply SMat and Hun. The corresponding results are reported in Table 4.7.
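A minimal sketch of this dummy-node padding, assuming the constant score assigned to dummy cells is a tunable choice (0 here, as an illustration):

```python
import numpy as np

def pad_with_dummies(S, dummy_score=0.0):
    """Pad a rectangular similarity matrix S (n x m, n != m) to a
    square one by appending dummy rows or columns filled with a
    constant low score, so that 1-to-1 matchers such as Hun. and
    SMat apply; matches involving a dummy index are afterwards
    interpreted as 'unmatchable'."""
    n, m = S.shape
    size = max(n, m)
    padded = np.full((size, size), dummy_score)
    padded[:n, :m] = S
    return padded
```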
Alignment Results
Table 4.7 shows that Hun. attains the best results, followed by SMat. Their superior results are partially due to the addition of dummy nodes, which mitigates the unmatchable issue to a certain degree. The results of RInf and Sink. are close, outperforming CSLS and RL. DInf still achieves the worst performance.
Besides, by comparing the results on DBP15K+ with those on the original dataset DBP15K (cf. Table 4.3), we observe that: (1) After including the unmatchable entities, the F1 scores of all methods drop. This is because most current embedding matching algorithms are greedy, i.e., they retrieve a target entity for each source entity (including the unmatchable ones), which leads to a very low precision. For the rest of the methods, e.g., Hun. and SMat, the unmatchable entities also mislead the matching process and thus affect the final results. (2) Unlike on DBP15K, where the performance of Sink. and Hun. is close, on DBP15K+, Hun. largely outperforms Sink., as Hun. does not necessarily align a target entity to each source entity and hence has a higher precision. (3) Overall, existing algorithms for matching KGs in entity embedding spaces lack the capability of dealing with unmatchable entities.
5.2 Non-1-to-1 Alignment
Next, we study the setting where the source and target entities do not strictly conform to the 1-to-1 constraint, so as to better assess these matching algorithms for KGs in entity embedding spaces. Non-1-to-1 alignment is common in practice, especially when two KGs contain entities at different granularities, or when one KG is noisy and contains duplicate entities. To the best of our knowledge, we are among the first to identify and investigate this issue.
Dataset Construction
Present EA benchmarks are constructed according to the 1-to-1 constraint. Thus, in this work, we establish a new dataset that involves non-1-to-1 alignment relationships. Specifically, we obtain the pre-annotated linksFootnote 11 between Freebase [3] and DBpedia [1] and preserve the entities that are involved in 1-to-many, many-to-1, and many-to-many alignment relationships. Then, we retrieve the relational triples that contain these entities from respective KGs, which also introduces new entities.
Next, we detect the links among the newly added entities and add them into the alignment links. Finally, the resultant dataset, FB_DBP_MUL, contains 44,716 entities, 164,882 triples, and 22,117 gold links, among which 20,353 are non-1-to-1 links and 1,764 are 1-to-1 links.Footnote 12 The specific statistics are also presented in Table 4.2.
Evaluation Settings
To keep the integrity of the links among entities, we sample the training, validation, and test sets from the gold links according to the principle that the links involving the same entity should not be distributed among different sets. The size of the final training, validation, and test sets is approximately 7:1:2. We compare the entity pairs produced by embedding matching algorithms against the gold test links and report the precision (P), recall (R), and F1 values.
Alignment Results
It is evident from Table 4.8 that, compared with 1-to-1 alignment, the results change significantly on the new dataset. Specifically: (1) RInf and CSLS attain the best F1 scores, although the results are not very promising (e.g., with an F1 score lower than 0.1 when using GCN). (2) Sink. and Hun. achieve much worse results compared with their performance on the 1-to-1 alignment datasets. (3) The results of SMat and RL are even inferior to those of the simple baseline DInf. The main reason for these changes is that the non-1-to-1 alignment links pose great challenges to existing embedding matching algorithms. Specifically, DInf, CSLS, RInf, Sink., and RL align only one target entity (the one with the highest score) to a given source entity, but fail to discover other alignment links that also involve this source entity. SMat and Hun. impose the 1-to-1 constraint during the matching process, which falls short in the non-1-to-1 setting, thus leading to inferior results. Therefore, this calls for the study of embedding matching algorithms targeted at non-1-to-1 alignment.
Discussion of the k Value in CSLS and RInf
In Fig. 4.9, we report the performance of CSLS and RInf given varying k values on FB_DBP_MUL. It shows that, generally, a larger k leads to better results. This is because, on the non-1-to-1 setting, an entity is likely to be matched to several entities on the other side, where it is more appropriate to consider the top-k values, rather than the sole maximum value, when refining the pairwise scores.
6 Summary and Future Direction
In this section, we summarize the observations and insights made from our evaluation and provide possible future research directions.
(1) The Investigation into Matching KGs in Embedding Spaces Has Not Yet Made Substantial Progress
Although there are a few algorithms tailored for matching KGs in embedding spaces, e.g., CSLS, RInf, and RL, under the most popular EA evaluation setting (with the 1-to-1 alignment constraint), they are outperformed by a classic general matching algorithm, i.e., Hun. Hence, there is still much room for improving the matching of KGs in embedding spaces.
(2) No Existing Embedding Matching Algorithm Prevails Under All Experimental Settings
The strategies designed to solve the linear assignment problem attain the best performance under the 1-to-1 setting, while they fall short on more practical and challenging scenarios since the new settings (e.g., non-1-to-1 alignment) no longer align with the conditions of these optimization algorithms. Similarly, although the methods for improving the computation of pairwise scores achieve superior results in the non-1-to-1 alignment scenario, they are outperformed by other solutions under the unmatchable setting. Therefore, each evaluation setting poses its own challenge to the embedding matching process, and currently there is no consistent winner.
(3) The Adaptation from General Matching Algorithms Requires Careful Design
Among the embedding matching algorithms, Hun. and SMat are general matching algorithms that have been applied to many other related tasks. Although directly adopting these general strategies to tackle EA is simple and effective, they might well fall short in some scenarios, as the alignment on KGs possesses its own challenges, e.g., the matching is not necessarily 1-to-1 constrained, or the pairwise scores are inaccurate. Thus, it is suggested to take full account of the characteristics of the alignment settings when adapting other general matching algorithms to cope with matching KGs in entity embedding spaces.
(4) Scalability and Efficiency Deserve More Attention
Existing advanced embedding matching algorithms have poor scalability, due to the additional resource-consuming operations that contribute to the alignment performance, such as the ranking process in RInf and the 1-to-1 constraint exerted by Hun. and SMat. Besides, space efficiency is also a critical issue. As shown in Sect. 4.4.4, most of the approaches have rather high memory costs on large-scale datasets. Therefore, considering that in practice there are many more entities, the scalability and efficiency issues should be considered during algorithm design.
(5) The Practical Evaluation Settings Are Worth Further Investigation
Under the unmatchable and non-1-to-1 alignment settings, the performance of existing algorithms is not promising. A possible future direction is to introduce the notion of probability and leverage the probabilistic reasoning frameworks [19, 39], which have higher flexibility, to produce the alignment results.
7 Conclusion
This chapter conducts a comprehensive survey and evaluation of algorithms for matching KGs in entity embedding spaces. We evaluate seven state-of-the-art strategies in terms of effectiveness and efficiency on a wide range of datasets, including two experimental settings that better mirror real-life challenges, and we identify the strengths and weaknesses of these algorithms under different settings. We hope the experimental results will help researchers and practitioners put forward more effective and scalable embedding matching algorithms.
Notes
- 1.
In the rest of the chapter, we use EA to refer to embedding-based EA solutions and conventional EA for the early solutions.
- 2.
Indeed, there are a few exceptions, which instead learn a mapping function between the individual embedding spaces [44]. However, their subsequent steps still map entities across spaces and operate in a "unified" one, e.g., the space of target entity embeddings.
- 3.
Under certain metrics such as cosine similarity (resp., Euclidean distance), the larger (resp., smaller) the pairwise scores, the higher the probability that two entities are equivalent. In this work, w.l.o.g., we adopt the former expression and consider that higher pairwise scores are preferred.
- 4.
- 5.
The codes are publicly available at https://github.com/DexterZeng/EntMatcher.
- 6.
- 7.
- 8.
Note that we only reported the results of fusing entity name embeddings with RREA structural embeddings. The results of combining entity name embeddings with GCN embeddings exhibited similar patterns and were omitted in the interest of space.
- 9.
We cannot provide the results of SMat, as it requires extremely large memory space and cannot work under our experimental environment.
- 10.
Note that for algorithms with memory space costs exceeding our experimental environment (except for SMat), there is additional swap area in the hard drive for them to finish the program (which usually takes much longer time).
- 11.
- 12.
FB_DBP_MUL is publicly available at https://github.com/DexterZeng/EntMatcher.
References
S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. G. Ives. Dbpedia: A nucleus for a web of open data. In ISWC, pages 722–735, 2007.
M. Berrendorf, E. Faerman, V. Melnychuk, V. Tresp, and T. Seidl. Knowledge graph entity alignment with graph convolutional networks: Lessons learned. In ECIR, volume 12036, pages 3–11, 2020.
K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD, pages 1247–1250, 2008.
A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
U. Brunner and K. Stockinger. Entity matching with transformer architectures - A step forward in data integration. In EDBT, pages 463–473, 2020.
Y. Cao, Z. Liu, C. Li, Z. Liu, J. Li, and T. Chua. Multi-channel graph neural network for entity alignment. In ACL, pages 1452–1461, 2019.
M. Chen, Y. Tian, K. Chang, S. Skiena, and C. Zaniolo. Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In IJCAI, pages 3998–4004, 2018.
M. Chen, Y. Tian, M. Yang, and C. Zaniolo. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In IJCAI, pages 1511–1517, 2017.
V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, and K. Stefanidis. An overview of end-to-end entity resolution for big data. ACM Comput. Surv., 53(6):127:1–127:42, 2021.
A. Doan, P. Konda, P. S. G. C., Y. Govind, D. Paulsen, K. Chandrasekhar, P. Martinkus, and M. Christie. Magellan: toward building ecosystems of entity matching solutions. Commun. ACM, 63(8):83–91, 2020.
J. Doerner, D. Evans, and A. Shelat. Secure stable matching at scale. In SIGSAC Conference, pages 1602–1613, 2016.
M. Fey, J. E. Lenssen, C. Morris, J. Masci, and N. M. Kriege. Deep graph matching consensus. In ICLR, 2020.
D. Gale and L. S. Shapley. College admissions and the stability of marriage. The American Mathematical Monthly, 69(1):9–15, 1962.
C. Ge, X. Liu, L. Chen, B. Zheng, and Y. Gao. Largeea: Aligning entities for large-scale knowledge graphs. CoRR, abs/2108.05211, 2021.
C. Ge, X. Liu, L. Chen, B. Zheng, and Y. Gao. Make it easy: An effective end-to-end entity alignment framework. In SIGIR, pages 777–786, 2021.
C. Ge, P. Wang, L. Chen, X. Liu, B. Zheng, and Y. Gao. Collaborem: A self-supervised entity matching framework using multi-features collaboration. IEEE Transactions on Knowledge and Data Engineering, pages 1–1, 2021.
E. Jiménez-Ruiz and B. C. Grau. Logmap: Logic-based and scalable ontology matching. In ISWC, pages 273–288. Springer, 2011.
R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38(4):325–340, 1987.
A. Kimmig, A. Memory, R. J. Miller, and L. Getoor. A collective, probabilistic approach to schema mapping. In ICDE, pages 921–932, 2017.
T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR. OpenReview.net, 2017.
H. W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1–2):83–97, 1955.
S. Lacoste-Julien, K. Palla, A. Davies, G. Kasneci, T. Graepel, and Z. Ghahramani. Sigma: simple greedy matching for aligning large knowledge bases. In SIGKDD, pages 572–580. ACM, 2013.
G. Lample, A. Conneau, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data. In ICLR, 2018.
C. Li, Y. Cao, L. Hou, J. Shi, J. Li, and T. Chua. Semi-supervised entity alignment via joint knowledge embedding model and cross-graph model. In EMNLP, pages 2723–2732, 2019.
Y. Li, J. Li, Y. Suhara, A. Doan, and W. Tan. Deep entity matching with pre-trained language models. Proc. VLDB Endow., 14(1):50–60, 2020.
X. Lin, H. Yang, J. Wu, C. Zhou, and B. Wang. Guiding cross-lingual entity alignment via adversarial knowledge embedding. In ICDM, pages 429–438, 2019.
B. Liu, H. Scells, G. Zuccon, W. Hua, and G. Zhao. Activeea: Active learning for neural entity alignment. In EMNLP, pages 3364–3374, 2021.
X. Mao, W. Wang, Y. Wu, and M. Lan. Boosting the speed of entity alignment 10 ×: Dual attention matching network with normalized hard sample mining. In WWW, pages 821–832, 2021.
X. Mao, W. Wang, Y. Wu, and M. Lan. From alignment to assignment: Frustratingly simple unsupervised entity alignment. In EMNLP, pages 2843–2853, 2021.
X. Mao, W. Wang, H. Xu, Y. Wu, and M. Lan. Relational reflection entity alignment. In CIKM, pages 1095–1104, 2020.
G. E. Mena, D. Belanger, S. W. Linderman, and J. Snoek. Learning latent permutations with gumbel-sinkhorn networks. In ICLR, 2018.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, volume 48, pages 1928–1937, 2016.
S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In SIGMOD, pages 19–34, 2018.
T. T. Nguyen, T. T. Huynh, H. Yin, V. V. Tong, D. Sakong, B. Zheng, and Q. V. H. Nguyen. Entity alignment for knowledge graphs with multi-order convolutional networks. IEEE Transactions on Knowledge and Data Engineering, pages 1–1, 2020.
G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas. Blocking and filtering techniques for entity resolution: A survey. ACM Comput. Surv., 53(2):31:1–31:42, 2020.
H. Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web, 8(3):489–508, 2017.
S. Pei, L. Yu, and X. Zhang. Improving cross-lingual entity alignment via optimal transport. In S. Kraus, editor, IJCAI, pages 3231–3237, 2019.
L. A. S. Pizzato, T. Rej, T. Chung, I. Koprinska, and J. Kay. RECON: a reciprocal recommender for online dating. In RecSys, pages 207–214. ACM, 2010.
J. Pujara, H. Miao, L. Getoor, and W. W. Cohen. Large-scale knowledge graph identification using PSL. In AAAI, 2013.
A. E. Roth. Deferred acceptance algorithms: history, theory, practice, and open questions. Int. J. Game Theory, 36(3–4):537–569, 2008.
P. Shvaiko and J. Euzenat. Ontology matching: State of the art and future challenges. IEEE Trans. Knowl. Data Eng., 25(1):158–176, 2013.
F. M. Suchanek, S. Abiteboul, and P. Senellart. PARIS: probabilistic alignment of relations, instances, and schema. Proc. VLDB Endow., 5(3):157–168, 2011.
F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In WWW, pages 697–706, 2007.
Z. Sun, Q. Zhang, W. Hu, C. Wang, M. Chen, F. Akrami, and C. Li. A benchmarking study of embedding-based entity alignment for knowledge graphs. Proc. VLDB Endow., 13(11):2326–2340, 2020.
B. D. Trisedya, J. Qi, and R. Zhang. Entity alignment between knowledge graphs using attribute embeddings. In AAAI, pages 297–304, 2019.
D. Vrandecic and M. Krötzsch. Wikidata: a free collaborative knowledgebase. Commun. ACM, 57(10):78–85, 2014.
Z. Wang, Q. Lv, X. Lan, and Y. Zhang. Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP, pages 349–357, 2018.
Y. Wu, X. Liu, Y. Feng, Z. Wang, R. Yan, and D. Zhao. Relation-aware entity alignment for heterogeneous knowledge graphs. In IJCAI, pages 5278–5284, 2019.
K. Xu, L. Song, Y. Feng, Y. Song, and D. Yu. Coordinated reasoning for cross-lingual knowledge graph alignment. In AAAI, pages 9354–9361, 2020.
H. Yang, Y. Zou, P. Shi, W. Lu, J. Lin, and X. Sun. Aligning cross-lingual entities with multi-aspect information. In EMNLP, pages 4430–4440, 2019.
J. Yang, D. Wang, W. Zhou, W. Qian, X. Wang, J. Han, and S. Hu. Entity and relation matching consensus for entity alignment. In CIKM, pages 2331–2341. ACM, 2021.
K. Zeng, C. Li, L. Hou, J. Li, and L. Feng. A comprehensive survey of entity alignment for knowledge graphs. AI Open, 2:1–13, 2021.
W. Zeng, X. Zhao, X. Li, J. Tang, and W. Wang. On entity alignment at scale. VLDB J., page in press, 2021.
W. Zeng, X. Zhao, J. Tang, X. Li, M. Luo, and Q. Zheng. Towards entity alignment in the open world: An unsupervised approach. In DASFAA 2021, volume 12681, pages 272–289, 2021.
W. Zeng, X. Zhao, J. Tang, and X. Lin. Collective entity alignment via adaptive features. In ICDE, pages 1870–1873. IEEE, 2020.
W. Zeng, X. Zhao, J. Tang, X. Lin, and P. Groth. Reinforcement learning-based collective entity alignment with adaptive features. ACM Trans. Inf. Syst., 39(3):26:1–26:31, 2021.
R. Zhang, B. D. Trisedya, M. Li, Y. Jiang, and J. Qi. A comprehensive survey on knowledge graph entity alignment via representation learning. CoRR, abs/2103.15059, 2021.
Z. Zhang, H. Liu, J. Chen, X. Chen, B. Liu, Y. Xiang, and Y. Zheng. An industry evaluation of embedding-based entity alignment. In COLING (Industry Track), pages 179–189, 2020.
X. Zhao, W. Zeng, J. Tang, W. Wang, and F. Suchanek. An experimental study of state-of-the-art entity alignment approaches. IEEE TKDE, pages 1–1, 2020.
R. Zhu, M. Ma, and P. Wang. RAGA: relation-aware graph attention networks for global entity alignment. In PAKDD, volume 12712, pages 501–513, 2021.
Y. Zhu, H. Liu, Z. Wu, and Y. Du. Relation-aware neighborhood matching model for entity alignment. In AAAI, pages 4749–4756, 2021.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
Cite this chapter
Zhao, X., Zeng, W., Tang, J. (2023). Recent Advance of Alignment Inference Stage. In: Entity Alignment. Big Data Management. Springer, Singapore. https://doi.org/10.1007/978-981-99-4250-3_4
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-4249-7
Online ISBN: 978-981-99-4250-3