Abstract
Over the last few years, a substantial body of research has been devoted to learning better KG representations to facilitate entity alignment (EA). In this chapter, we summarize recent progress in the representation learning stage of EA and provide a detailed empirical evaluation to reveal the strengths and weaknesses of current solutions.
1 Overview
To better understand current advanced representation learning methods, we propose a general framework to describe these methods, which includes six modules, i.e., pre-processing, messaging, attention, aggregation, post-processing, and loss function. In pre-processing, the initial entity and relation representations are generated. Then, KG representations are obtained via a representation learning network, which usually consists of three steps, i.e., messaging, attention, and aggregation. Among them, messaging aims to extract the features of the neighboring elements, attention aims to estimate the weight of each neighbor, and aggregation integrates the neighboring information with attention weights. Through the post-processing operation, the final representations are obtained. The whole model is then optimized by the loss function in the training stage.
More specifically, we summarize ten representative methods in terms of these modules in Table 3.1.
-
In the pre-processing module, there are mainly two ways to obtain the initial representations: some methods utilize pre-trained models to embed names or descriptions into initial representations, while others generate initial structural representations through GNN-based networks.
-
In the messaging module, linear transformation is the most frequently used strategy, which makes use of a learnable matrix to transform neighboring features. Other methods include extracting neighboring features by concatenating multi-head messages, directly utilizing neighboring representations, etc.
-
In the attention module, the main focus is the computation of similarity. Most methods concatenate the representations and multiply by a learnable attention vector to calculate attention weights, while some use the inner product of entity representations to compute similarity.
-
In the aggregation module, almost all methods aggregate 1-hop neighboring entity or relation information, while a few works propose to combine multi-hop neighboring information. Some use a set of randomly chosen entities, i.e., an anchor set, to obtain position-aware representations.
-
In the post-processing module, most of the methods enhance final representations by concatenating the outputs of all layers of GNN. Besides, some methods propose to combine the features adaptively via strategies such as the gate mechanism [10].
-
In terms of the loss function, the majority of methods utilize the margin-based loss during training. Some additionally add the TransE [1] loss, while others improve the margin loss using LogSumExp and normalization operations or utilize the Sinkhorn [3] algorithm to calculate the loss.
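Most of these loss functions share the same margin-based ranking skeleton: push each negative pair at least a margin farther apart than its positive pair. A minimal numpy sketch (the function name and toy distances are illustrative, not taken from any specific paper):

```python
import numpy as np

def margin_ranking_loss(pos_dist, neg_dist, margin=3.0):
    """Margin-based ranking loss: each negative pair should be at least
    `margin` farther from alignment than its positive pair."""
    return np.maximum(0.0, pos_dist - neg_dist + margin).mean()

# Toy distances: aligned (positive) pairs should be close,
# corrupted (negative) pairs far apart.
pos = np.array([0.2, 0.5])
neg = np.array([4.0, 2.0])
loss = margin_ranking_loss(pos, neg, margin=3.0)
```

The later variants (LogSumExp, loss normalization, Sinkhorn) all build on this basic ranking objective.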
2 Models
We use Eq. (3.1) to characterize the core procedure of representation learning:
where \(\mathbf {Messaging}\) aims to extract the features of neighboring elements, \(\mathbf {Attention}\) aims to estimate the weight of each neighbor, and \(\mathbf {Aggregation}\) integrates the neighborhood information with attention weights.
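The three steps of Eq. (3.1) can be sketched in a few lines of numpy. The linear-transformation messaging and concatenation-style attention below are the most common instantiations in Table 3.1; all names and shapes are illustrative:

```python
import numpy as np

def gnn_layer(h_i, neighbors, W, v):
    """One layer of Eq. (3.1): messaging -> attention -> aggregation."""
    # Messaging: linear transformation of each neighbor's features.
    msgs = np.stack([W @ h_j for h_j in neighbors])
    # Attention: learnable vector applied to concatenated representations.
    scores = np.array([v @ np.concatenate([h_i, h_j]) for h_j in neighbors])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax-normalized weights
    # Aggregation: attention-weighted sum of the messages.
    return alpha @ msgs
```

Each model below can be read as a particular choice for these three components.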
Next, we briefly introduce recent advances in representation learning for EA in terms of the modules summarized in Table 3.1.
2.1 AliNet
It aims to aggregate multi-hop structural information for learning entity representations [12].
Aggregation
This work devises a multi-hop aggregation strategy. For 2-hop aggregation, Aggregate is denoted as:
where \(\mathcal {N}_2\) denotes the 2-hop neighbors.
Then, it aggregates the multi-hop aggregation results to generate the entity representation. Aggregating 1-hop and 2-hop information is denoted as:
where \(g(\boldsymbol {h}_{i,2}^l) = \sigma (\boldsymbol {M}\boldsymbol {h}_{i,2}^l + \boldsymbol {b})\), which is the gate to control the influences of different hops. \(\boldsymbol {M}\) and \(\boldsymbol {b}\) are learnable parameters.
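A hedged numpy sketch of this gated combination (the sigmoid gate follows the description above; the exact mixing form and parameter shapes are illustrative):

```python
import numpy as np

def gated_combine(h1, h2, M, b):
    """Combine 1-hop (h1) and 2-hop (h2) aggregation results with a gate
    g = sigmoid(M h2 + b) that controls the influence of 2-hop information."""
    g = 1.0 / (1.0 + np.exp(-(M @ h2 + b)))   # gate values in (0, 1)
    return g * h2 + (1.0 - g) * h1
```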
Attention
Regarding the attention weight, it assumes that not all distant entities contribute positively to the characterization of the target entity representation, and the softmax function is used to produce the attention weights:
where \(c_{ij}^l = LeakyReLU((\boldsymbol {M}_1^l\boldsymbol {h}_i^l)^T \boldsymbol {M}_2^l\boldsymbol {h}_j^l)\), and \(\boldsymbol {M}_1, \boldsymbol {M}_2\) are two learnable matrices.
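In numpy, this weight computation can be sketched as follows (the LeakyReLU slope and dimensions are illustrative):

```python
import numpy as np

def alinet_attention(h_i, neighbors, M1, M2, slope=0.2):
    """Softmax over c_ij = LeakyReLU((M1 h_i)^T M2 h_j) for each neighbor j."""
    c = np.array([(M1 @ h_i) @ (M2 @ h_j) for h_j in neighbors])
    c = np.where(c > 0, c, slope * c)        # LeakyReLU
    e = np.exp(c - c.max())
    return e / e.sum()                       # attention weights sum to 1
```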
Messaging
The extraction of the features of neighboring entities is implemented as a simple linear transformation: \(\mathbf {Messaging}(i,j) = \boldsymbol {W}_q^l \boldsymbol {h}_j^{l-1}\), where \(\boldsymbol {W}_q\) denotes the weight matrix for the q-hop aggregation.
Post-processing
The representations of all layers are concatenated to produce the final entity representation:
Loss Function
The loss function is formulated as:
where \(\mathcal {A}^-\) is the set of negative samples, obtained through random sampling. \(||\cdot ||\) denotes the L2 norm. \([\cdot ]_+ = \max (0,\cdot )\).
2.2 MRAEA
It proposes to utilize the relation information to facilitate the entity representation learning process [8].
Pre-processing
Specifically, it first creates an inverse relation for each relation, resulting in the extended relation set \(\mathcal {R}\). Then, it generates the initial features for entities by averaging and concatenating the embeddings of neighboring entities and relations:
where the embeddings of entities and relations are randomly initialized.
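A hedged sketch of this initialization (the index arrays and dimensions are illustrative; the exact formulation is in the MRAEA paper):

```python
import numpy as np

def init_entity_feature(ent_emb, rel_emb, nbr_ents, nbr_rels):
    """Concatenate the mean of neighboring entity embeddings with the mean
    of linked relation embeddings (both tables randomly initialized)."""
    return np.concatenate([ent_emb[nbr_ents].mean(axis=0),
                           rel_emb[nbr_rels].mean(axis=0)])
```

With randomly initialized tables of dimension \(d_e\) and \(d_r\), the resulting initial feature has dimension \(d_e + d_r\).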
Aggregation
The aggregation is a simple combination of the extracted features and the weights:
where \(\sigma \) is implemented as ReLU.
Attention
It augments the common self-attention mechanism to include relation features:
where \(\mathcal {M}_{i,j}\) represents the set of linked relations that connect \(e_i\) to \(e_j\). Notably, it also adopts the multi-head attention mechanism to obtain the representation.
Messaging
The features of neighboring entities are the corresponding features from the pre-processing stage.
Post-processing
Finally, the outputs from different layers are concatenated to produce the final entity representations:
Loss Function
The loss function is formulated as:
where \(dis(\cdot , \cdot )\) is the Manhattan distance between two entity representations. \(e_i^\prime \) and \(e_j^\prime \) represent the negative instances.
2.3 RREA
It proposes to use relational reflection transformation to aggregate features for learning entity representations [9].
Aggregation
The entity representations are denoted as:
where \(\mathcal {N}_{e_i}^e\) and \(\mathcal {R}_{ij}\) represent the neighboring entity and relation sets, respectively.
Attention
\(\mathbf {Attention}(i,j,k)\) denotes the weight coefficient computed by:
where \(\beta _{ijk}^l = \boldsymbol {v}^T [\boldsymbol {h}_{e_i}^{l} || \boldsymbol {M}_{r_k}\boldsymbol {h}_{e_j}^{l}|| \boldsymbol {h}_{r_k}]\). \(\boldsymbol {v}\) is a trainable vector. \(\boldsymbol {M}_{r_k}\) is the relational reflection matrix of \(r_k\). We leave out the details of the relational reflection matrix in the interest of space; they can be found in the original paper.
Messaging
The features of neighboring entities are the corresponding features from the pre-processing stage:
where \(\boldsymbol {M}_{r_k}\) is the relational reflection matrix of \(r_k\).
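Following the RREA paper, the relational reflection matrix takes the Householder form \(\boldsymbol M_{r} = \boldsymbol I - 2\boldsymbol h_{r}\boldsymbol h_{r}^{T}\) with a unit-norm relation embedding \(\boldsymbol h_{r}\), which makes it orthogonal and hence norm-preserving. A quick numpy check (names illustrative):

```python
import numpy as np

def reflection_matrix(h_r):
    """Relational reflection matrix M_r = I - 2 h_r h_r^T (h_r unit-norm),
    following the RREA paper; orthogonal, so it preserves vector norms."""
    h_r = h_r / np.linalg.norm(h_r)
    return np.eye(h_r.size) - 2.0 * np.outer(h_r, h_r)

M = reflection_matrix(np.array([3.0, 4.0]))
```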
Post-processing
Then, the outputs from different layers are concatenated to produce the output vector:
Finally, it concatenates the entity representation with its neighboring relation embeddings to obtain the final entity representation:
Loss Function
The loss function is formulated as:
where \(dis(\cdot , \cdot )\) is the Manhattan distance between two entity representations. \(e_i^\prime \) and \(e_j^\prime \) represent the negative instances generated by nearest neighbor sampling.
2.4 RPR-RHGT
This work introduces a meta path-based similarity framework for EA [2]. It considers the paths that frequently appear in the neighborhoods of pre-aligned entities to be reliable. We omit the generation of these reliable paths in the interest of space; it can be found in Sect. 3.3 of the original paper.
Pre-processing
Specifically, it first generates relation embeddings by aggregating the representations of neighboring entities:
where \(\mathcal {H}_r\) and \(\mathcal {T}_r\) denote the set of head entities and tail entities that are connected with relation r.
Aggregation
The entity representation is obtained by averaging the messages from neighborhood entities with the attention weights:
where \(\oplus \) denotes the overlay operation.
Attention
The multi-head attention is computed as:
where \(K^i(h) = K\_Linear^i(\boldsymbol {e}_h^{l-1})\), \(Q^i(t) = Q\_Linear^i(\boldsymbol {e}_t^{l-1})\), \(RN(h)\) represents the neighborhood entities of h, \(\boldsymbol {a}\) denotes the learnable attention vector, \(h_n\) is the number of attention heads, and \(d/h_n\) is the dimension per head.
Messaging
The multi-head message passing is computed as:
where \(V\_Linear^i\) is a linear projection of the tail entity, which is then concatenated with the relation representation.
Post-processing
This work also combines the structural representations with name features using the residual connection:
where \(A\_Linear\) and \(N\_Linear\) are linear projections. Correspondingly, based on the relation structure \(\mathcal {T}_{rel}\) and path structure \(\mathcal {T}_{path}\), it generates the relation-based embeddings \(\boldsymbol {E}_{rel}\) and the path-based embeddings \(\boldsymbol {E}_{path}\).
Loss Function
Finally, the margin-based ranking loss function is used to formulate the overall loss function:
where the distance is measured by the Manhattan distance and \(\theta \) is the hyper-parameter that controls the weights of relation loss and path loss.
2.5 RAGA
It proposes to adopt the self-attention mechanism to spread entity information to the relations and then aggregate relation information back to entities, which can further enhance the quality of entity representations [17].
Pre-processing
In the pre-processing module, the pre-trained vectors are used as input and then forwarded to a two-layer GCN with highway network to encode structure information. We leave out the implementation details in the interest of space, which can be found in Sect. 4.2 in the original paper.
Aggregation
In RAGA, there are three main GNN networks. Denote the initial representation of entity i as \(\boldsymbol h_i\), which is generated in pre-processing module. The first GNN network obtains relation representation by aggregating all of its connected head entities and tail entities. For relation k, the aggregation of its connected head entities is computed as follows:
where \(\sigma \) is the ReLU activation function, \(\mathcal H_{r_k}\) is the set of head entities for relation \(r_k\), and \(\mathcal T_{e_ir_k}\) is the set of tail entities for head entity \(e_i\) and relation \(r_k\). The aggregation of all tail entities \(\boldsymbol r_k^t\) can be computed through a similar process, and the relation representation is obtained as \(\boldsymbol r_k=\boldsymbol r_k^h+\boldsymbol r_k^t\).
Then, the second GNN network generates relation-aware entity representation through aggregating relation information back to entities. For entity i, the aggregation of all its outward relation embeddings is computed as follows:
where \(\mathcal {T}_{e_i}\) is the set of tail entities for head entity \(e_i\) and \(\mathcal {R}_{e_ie_j}\) is the set of relations between head entity \(e_i\) and tail entity \(e_j\). The aggregation of inward relation embeddings \(\boldsymbol h_i^t\) is computed through a similar process. Then the relation-aware entity representations \(\boldsymbol {h}_i^{rel}\) can be obtained by concatenation: \(\boldsymbol h_i^{rel}=\left [\boldsymbol h_i\Vert \boldsymbol {h}_i^h\Vert \boldsymbol {h}_i^t\right ]\).
Finally, the third GNN takes as input the relation-aware entity representations and makes aggregation to produce the final entity representations:
Attention
Corresponding to three GNN networks, there are three attention computations in RAGA. In the first GNN, to compute the attention weights, representations of head entity and tail entity are linearly transformed, respectively, and then concatenated:
where \(\boldsymbol a_1\) is the learnable attention vector.
In the second GNN, representations of entity and its neighboring relations are directly concatenated:
where \(\boldsymbol a_2\) is the learnable attention vector.
The computation of attention in the third GNN, i.e., \(\mathbf {Attention}_3\), is similar to Eq. (3.28), except that it concatenates the entity with its neighboring entity instead of a relation.
Messaging
Only the first GNN utilizes linear transformation as the messaging approach:
where \(\boldsymbol W\) can refer to \(\boldsymbol W^h\) or \(\boldsymbol W^t\) depending on the aggregation of head or tail entities.
Post-processing
The final enhanced entity representation is the concatenation of outputs of the second and the third GNNs:
Loss Function
The loss function is formulated as:
where \(T_{e_i,e_j}^{\prime }\) is the set of negative samples for \(e_i\) and \(e_j\), \(\lambda \) is the margin, and \(dis()\) is defined as the Manhattan distance.
2.6 Dual-AMN
Dual-AMN proposes to utilize both intra-graph and cross-graph information for learning entity representations [7]. It constructs a set of virtual nodes, i.e., proxy vectors, through which the messaging and aggregation between graphs are conducted.
Aggregation
Dual-AMN uses two GNN networks to learn intra-graph and cross-graph information, respectively. Firstly, it utilizes relation projection operation in RREA to obtain intra-graph embeddings:
where \(\sigma \) is the tanh activation function and \(\boldsymbol h_{e_i}^l\) represents the output of l-th layer. Then the multi-hop embeddings are obtained by concatenation:
Secondly, it constructs a set of virtual nodes \(\mathcal S_p=\{\boldsymbol q_1,\boldsymbol q_2,\dots ,\boldsymbol q_n\}\), namely, the proxy vectors, which are randomly initialized. The cross-graph aggregation is computed as:
Attention
For intra-graph information learning, the attention weights are calculated as:
where \(\boldsymbol {v}^T\) is a learnable attention vector and \(\boldsymbol h_{r_k}\) is the representation of relation \(r_k\), which is randomly initialized with He initialization [4].
For cross-graph information learning, the attention weights are computed by the similarity between entity and proxy vectors:
Messaging
For the first GNN, the messaging is the same as RREA, which utilizes a relational reflection matrix to transform neighbor embeddings.
For the second GNN, the features of neighboring entities are represented as the difference between entity and proxy vectors:
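A hedged sketch of this cross-graph step (softmax attention from entity-proxy similarity and difference messages; the names and the inner-product similarity are illustrative choices):

```python
import numpy as np

def cross_graph_embed(h_e, proxies):
    """Aggregate difference messages (h_e - q) over proxy vectors q,
    weighted by softmax similarity between the entity and each proxy."""
    sim = proxies @ h_e                  # similarity to each proxy vector
    alpha = np.exp(sim - sim.max())
    alpha /= alpha.sum()                 # softmax attention weights
    return alpha @ (h_e[None, :] - proxies)
```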
Post-processing
For the final entity embeddings, the gate mechanism is used to combine intra-graph and cross-graph representations:
where \(\boldsymbol M\) and \(\boldsymbol b\) are the gate weight matrix and gate bias vector.
Loss Function
Firstly, it calculates the original margin loss as follows:
Inspired by batch normalization [5], which reduces internal covariate shift, it proposes a normalization step that fixes the mean and variance of the sample losses, transforming \(l_o(e_i,e_j,e_j^{\prime })\) into \(l_n(e_i,e_j,e_j^{\prime })\) and reducing the dependence on the scale of the margin hyper-parameter. Finally, the overall loss function is defined as follows:
where P is the set of positive samples and \(E_1\) and \(E_2\) are the sets of entities in two knowledge graphs, respectively.
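The normalization step from \(l_o\) to \(l_n\) can be sketched as standardizing the batch of per-sample losses (a hedged sketch; the exact reduction that follows, e.g., the LogSumExp aggregation, is detailed in the Dual-AMN paper):

```python
import numpy as np

def normalize_losses(l_o, eps=1e-8):
    """Transform raw sample losses l_o into l_n with zero mean and unit
    variance, reducing the dependence on the margin's scale."""
    return (l_o - l_o.mean()) / (l_o.std() + eps)
```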
2.7 ERMC
This work proposes to jointly model and align entities and relations and meanwhile retain their semantic independence [14].
Pre-processing
For pre-processing, it obtains names or descriptions of entities and relations as the inputs for BERT [6] and adds an MLP layer to construct initial representations, which are denoted as \(\boldsymbol x^{e(0)}\) and \(\boldsymbol x^{r(0)}\) for each entity and relation, respectively.
Aggregation
Given an entity e, the model first aggregates the embeddings of entities that point to e:
where \(\sigma (\cdot )\) contains normalization, dropout, and activation operations. Similarly, the model aggregates the embeddings of entities that e points to, the embeddings of relations that point to e, and the embeddings of relations that e points to, producing \(\boldsymbol h_{\mathcal N_i^r}^{e(l+1)}\), \(\boldsymbol h_{\mathcal N_o^e}^{e(l+1)}\), and \(\boldsymbol h_{\mathcal N_o^r}^{e(l+1)}\), respectively. The model also aggregates the embeddings of entities that point to a relation r or r points to, so as to produce the relation embeddings \(\boldsymbol h_{\mathcal N_i^e}^{r(l+1)}\) and \(\boldsymbol h_{\mathcal N_o^e}^{r(l+1)}\), respectively.
Messaging
Given an entity e, the messaging process of the entities that point to e is implemented as a simple linear transformation: \(\mathbf {Messaging}(i)=\boldsymbol W_{e_i}^{e(l)}\boldsymbol x^{e_i(l)}\), where \(\boldsymbol x^{e_i(l)}\) is the node representation in the last layer and \(\boldsymbol W_{e_i}^{e(l)}\) is a learnable weight matrix that aggregates the inward entity features. The messaging process of other operations is implemented similarly.
Post-processing
The final representation of entity e is formulated as follows:
And the final representation of relation r is formulated similarly:
The graph embedding \(\boldsymbol H\in \mathbb R^{(|E|+|R|)\times d}\) is the concatenation of all entities and relations’ representations.
Loss Function
Denote \(\boldsymbol H_s\) and \(\boldsymbol H_t\) as the representations of two graphs, respectively. The similarity matrix is computed as:
where \(s_{i,j}\in \boldsymbol S\) is a real number that denotes the correlation between entity \(e_s^i\) (from source graph) and \(e_t^j\) (from target graph), or the correlation between relation \(r_s^i\) (from source graph) and \(r_t^j\) (from target graph). The other elements are set to \(-\infty \) to mask the correlation between entity and relation across different graphs. The final loss function is formulated as follows:
where \((e_s^i,e_t^j)\) and \((r_s^i,r_t^j)\) are pre-aligned entity and relation pairs and \(\lambda \in [0,1]\) is a hyper-parameter.
2.8 KE-GCN
It combines GCNs and advanced KGE methods to learn the representations, where a novel framework is put forward to realize the messaging and aggregation modules in representation learning [15].
Aggregation
Denoting \(\boldsymbol h_v^l\) as the embedding of entity v at layer l, the entity updating rules are:
where \(\mathcal N_{\mathrm {in}}(v)=\{(u,r)\vert u\stackrel {r}{\rightarrow }v\}\) is the set of inward entity-relation neighbors of entity v, while \(\mathcal N_{\mathrm {out}}(v)=\{(u,r)\vert u\stackrel {r}{\leftarrow }v\}\) is the set of outward neighbors of v. \(\boldsymbol W_0^l\) is a linear transformation matrix. \(\sigma (\cdot )\) denotes the activation function for the update. The embedding of relation is updated through a similar process.
Messaging
It considers GCN as an optimization process, where the messaging process is implemented as a partial derivative:
where \(\boldsymbol h_r^l\) represents the embedding of relation r at layer l and \(\boldsymbol W_r^l\) is a relation-specific linear transformation matrix. \(f(\boldsymbol h_u^l,\boldsymbol h_r^l,\boldsymbol h_v^l)\) is the scoring function that measures the plausibility of triple \((u,r,v)\). Thus, \(\boldsymbol m_v^{l+1}+\boldsymbol W_0^l\boldsymbol h_v^l\) in Eq. (3.46) can be regarded as the gradient ascent to maximize the sum of scoring function. For example, if \(f(\boldsymbol h_u^l,\boldsymbol h_r^l,\boldsymbol h_v^l)=(\boldsymbol h_u^l)^T\boldsymbol h_v^l\), Eq. (3.47) becomes equivalent to the common linear transformation \(\boldsymbol W_r^l\boldsymbol h_u^l\).
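To make the example in the text concrete: with \(f(\boldsymbol h_u,\boldsymbol h_r,\boldsymbol h_v)=(\boldsymbol h_u)^T\boldsymbol h_v\), the partial derivative with respect to \(\boldsymbol h_v\) is \(\boldsymbol h_u\), so the message reduces to \(\boldsymbol W_r\boldsymbol h_u\). A small sketch (names illustrative):

```python
import numpy as np

def ke_gcn_message(h_u, h_r, h_v, W_r, grad_f):
    """KE-GCN messaging: W_r times the partial derivative of the scoring
    function f with respect to the target entity embedding h_v."""
    return W_r @ grad_f(h_u, h_r, h_v)

# For f(h_u, h_r, h_v) = h_u^T h_v, df/dh_v = h_u, recovering W_r h_u.
grad_dot = lambda h_u, h_r, h_v: h_u
```

Other KGE scoring functions (e.g., TransE-style) plug in by supplying their own gradient.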
Loss Function
Denote the training set as \(S=\{(u,v)\}\); this model utilizes margin-based ranking loss for optimization:
where \(S_{(u,v)}^{\prime }\) denotes the set of negative entity alignments constructed by corrupting \((u,v)\), i.e., replacing u or v with a randomly chosen entity in the graph. \(\gamma \) represents the margin hyper-parameter separating positive and negative entity alignments.
2.9 RePS
It encodes position and relation information for aligning entities [13].
Aggregation
Firstly, to encode position information, k subsets of nodes (referred to as anchor sets) are randomly sampled, where the \(i^{th}\) anchor set is a collection of \(l_i\) nodes (anchors). Then for entity v, the aggregation process is formulated as:
where \(\boldsymbol h_v^l\) represents the embedding of entity v from layer l, \(\psi _i\) is the \(i^{th}\) anchor set, and \(g(\boldsymbol X)=\sigma (\boldsymbol W_1\boldsymbol X+\boldsymbol b_1)\), where \(\boldsymbol W_1\) and \(\boldsymbol b_1\) are trainable parameters and \(\sigma \) is the activation function.
To encode relation information, a simple relation-specific GNN is used:
where \(c_v\) is the learnable coefficient for entity v and \(\mathcal N_v\) is the set of neighboring entities of v. \(f(\boldsymbol X)=\boldsymbol W_2\boldsymbol X+\boldsymbol b_2\), where \(\boldsymbol W_2\) and \(\boldsymbol b_2\) are learnable parameters.
Messaging
To ensure similar entities in two graphs have similar representations, the relation-enriched distance function is defined as follows:
where \(f(r,\mathcal {K}\mathcal {G}_i)\) is the frequency of relation r in \(\mathcal {K}\mathcal {G}_i\) and \(P_q(u,v)\) is the list of relations in the \(q^{th}\) path between u and v. Thus, \(pd(u,v)\) aims to find the shortest path between u and v in which the relations appear less frequently. Then the messaging function is formulated as follows:
where \(\psi _{i,j}\) is the \(j^{th}\) entity in the \(i^{th}\) anchor set.
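The relation-frequency-weighted distance \(pd(u,v)\) described above can be sketched as a Dijkstra search in which each edge costs the frequency of its relation, so paths through rare relations are preferred (a hedged sketch; the exact formulation is in the RePS paper):

```python
import heapq

def path_distance(adj, freq, src, dst):
    """Sketch of pd(u, v): Dijkstra where an edge labeled with relation r
    costs freq[r], preferring paths through rarely occurring relations.
    `adj` maps a node to a list of (neighbor, relation) pairs."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, r in adj.get(u, []):
            nd = d + freq[r]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")
```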
For relation-aware embedding, it sums up the neighboring representations with relation-specific weights:
where \(c_{r_{v,i}}\) is the learnable coefficient for relation r connecting v and i.
Post-processing
The final representation of v is computed as:
where \(g(\boldsymbol h_{v_p}^l)=\sigma (\boldsymbol W_3\boldsymbol h_{v_p}^l+\boldsymbol b_3)\) learns the relative importance. \(\boldsymbol W_3\) and \(\boldsymbol b_3\) are trainable parameters and \(\sigma \) is the activation function.
Loss Function
It introduces a novel knowledge-aware negative sampling (KANS) technique to generate hard negative samples. For each tuple \((v,v')\) in S, the negative instances for v are sampled from set \(\Phi _v\), where \(\Phi _v\) is the set of entities which share at least one (relation, tail) pair or (relation, head) pair with \(v'\). The model is trained by minimizing the following loss:
where \(\beta \) is a weighting parameter and \(\gamma \) is the margin.
2.10 SDEA
SDEA utilizes BiGRU to capture correlations among neighbors and generate entity representations [16].
Pre-processing
It devises an attribute embedding module to capture entity associations via entity attributes. Specifically, given an entity \(e_i\), it concatenates the names and descriptions of its attributes, denoted as \(S(e_i)\). Then \(S(e_i)\) is fed into the BERT model to generate the attribute embedding \(\boldsymbol H_a(e_i)\). We omit the implementation details in the interest of space; they can be found in Section III of the original paper.
Aggregation
It aggregates the neighboring information utilizing attention mechanism:
Since SDEA treats the neighborhood as a sequence, t represents the t-th neighboring entity of \(e_i\), and \(\mathbf {Messaging}()\) is computed through a BiGRU.
Attention
SDEA computes attention via simple inner product:
where \(\hat {\boldsymbol h}\) is the global attention representation, which is obtained after feeding the output of the last unit of the BiGRU, denoted as \(\boldsymbol h_n\), into an MLP layer.
Messaging
Different from other models, SDEA captures the correlation between neighbors in the messaging module, and all neighbors of entity \(e_i\) are regarded as an input sequence of the BiGRU model. Given entity \(e_i\), let \(\boldsymbol x_t\) denote the t-th input embedding (i.e., the attribute embedding of \(e_i\)’s t-th neighbor, as described in the pre-processing module) and \(\boldsymbol h_t\) denote the output of the t-th hidden unit. The process of the BiGRU is formulated as follows:
where \(\boldsymbol r_t\) is the reset gate that drops the unimportant information and \(\boldsymbol z_t\) is the update gate that combines the important information. \(\boldsymbol {W}, \boldsymbol U, \boldsymbol b\) are learnable parameters. \(\tilde {\boldsymbol h}_t\) is the hidden state. \(\sigma \) is the sigmoid function and \(\phi \) is the hyperbolic tangent. \(\odot \) is the Hadamard product.
For the BiGRU, there are outputs in two directions, \(\overleftarrow {\boldsymbol h_t}\) and \(\overrightarrow {\boldsymbol h_t}\), and the final output of the BiGRU, namely, the output of the messaging module, is the sum of the two directions: \(\mathbf {Messaging}(i)=\overleftarrow {\boldsymbol h_t}+\overrightarrow {\boldsymbol h_t}\).
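The gate equations described above correspond to a standard GRU cell. A hedged numpy sketch of one step (the parameter layout and the \((1-\boldsymbol z_t)\)/\(\boldsymbol z_t\) mixing convention are illustrative; conventions vary between papers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step: reset gate r_t drops unimportant information, update
    gate z_t mixes the previous state with the candidate state h~_t.
    W, U, b hold the per-gate parameters under keys "r", "z", "h"."""
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ (r_t * h_prev) + b["h"])
    return (1.0 - z_t) * h_prev + z_t * h_tilde
```

Running the sequence once forward and once in reverse, then summing the two outputs per step, yields the BiGRU messaging described above.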
Post-processing
After obtaining the attribute embedding \(\boldsymbol H_a(e_i)\) and the relational embedding \(\boldsymbol H_r(e_i)\), they are concatenated and forwarded to another MLP layer, resulting in \(\boldsymbol H_m(e_i)=MLP([\boldsymbol H_a(e_i)\Vert \boldsymbol H_r(e_i)])\). Finally, \(\boldsymbol H_a(e_i)\), \(\boldsymbol H_r(e_i)\), and \(\boldsymbol H_m(e_i)\) are concatenated to produce \(\boldsymbol H_{ent}(e_i)=[\boldsymbol H_r(e_i)\Vert \boldsymbol H_a(e_i)\Vert \boldsymbol H_m(e_i)]\), which is used in alignment stage.
Loss Function
The model uses the following margin-based ranking loss as the loss function to train attribute embedding module:
where D is the training set; \(\boldsymbol H_a\) and \(\boldsymbol H_a^{\prime }\) are attribute embeddings of source graph and target graph, respectively; and \(\beta >0\) is the margin hyper-parameter used for separating positive and negative pairs.
The training of relation embedding module uses a margin-based ranking loss similar to Eq. (3.59), where the embedding \(\boldsymbol H_a(e_i)\) is replaced by \([\boldsymbol H_r(e_i)\Vert \boldsymbol H_m(e_i)]\).
3 Experiments
In this section, we first conduct an overall comparison experiment to reveal the effectiveness of state-of-the-art representation learning methods. Then we conduct further experiments in terms of the six modules of representation learning, so as to examine the effectiveness of various strategies.
3.1 Experimental Setting
Dataset
We use the most frequently used DBP15K dataset [11] for evaluation.
Baselines
For overall comparison, we select seven models, including AliNet [12], MRAEA [8], RREA [9], RAGA [17], SDEA [16], Dual-AMN [7], and RPR-RHGT [2]. We collect their source code and reproduce the results in the same setting. Specifically, to make a fair comparison, we modify and unify the alignment part of these models, forcing them to utilize the L1 distance and a greedy algorithm for alignment inference. We omit the comparison with the remaining models, as they do not provide source code and our implementations cannot reproduce their results. For the ablation and further experiments, we choose RAGA as the base model.
Parameters and Metrics
Since there are various kinds of hyper-parameters for different models, we only unify the common parameters, such as the margin \(\lambda =3\) in the margin loss function and the number of negative samples \(k=5\). For other parameters, we keep the default settings from the original papers.
Following existing studies, we use Hits@k (\(k=1\), 10) and mean reciprocal rank (MRR) as the evaluation metrics. The higher the Hits@k and MRR, the better the performance. In experiments, we report the average performance of three independent runs as the final result.
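For reference, Hits@k and MRR can be computed from a pairwise distance matrix as follows (a minimal sketch assuming the gold target of source entity i is target entity i; names are illustrative):

```python
import numpy as np

def hits_and_mrr(dist, ks=(1, 10)):
    """Hits@k and MRR for a distance matrix where dist[i, j] is the distance
    from source entity i to target entity j (gold match: j == i)."""
    ranks = []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])            # ascending: closest first
        ranks.append(int(np.where(order == i)[0][0]) + 1)
    ranks = np.array(ranks)
    hits = {k: float((ranks <= k).mean()) for k in ks}
    mrr = float((1.0 / ranks).mean())
    return hits, mrr
```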
3.2 Overall Results and Analysis
Firstly, we compare the overall performance of seven advanced models in Table 3.2, where the best results are highlighted in bold, and the second best results are underlined.
From the results, it can be observed that:
-
No model achieves state-of-the-art performance over all three KG pairs. This indicates that current advanced models have advantages and disadvantages in different situations.
-
SDEA achieves the best performance on ZH-EN and FR-EN, and RPR-RHGT leads on JA-EN. Considering that both models leverage pre-trained models to obtain initial embeddings and devise novel approaches to extract neighboring features, we may draw a preliminary conclusion that utilizing pre-trained models benefits the representation learning process and that an effective messaging approach is important to the overall results.
-
RAGA achieves the second best performance on JA-EN and FR-EN, and Dual-AMN attains the second best result on ZH-EN. Notably, RAGA also leverages a pre-trained model, which further validates the effectiveness of using pre-trained models for initialization. Dual-AMN uses proxy vectors that help capture cross-graph information and hence improve representation learning.
-
AliNet performs the worst over the three datasets. As AliNet is the only model that aggregates 2-hop neighboring entities, this may indicate that directly incorporating 2-hop neighboring information brings little benefit, which can also be observed in the further experiments on the aggregation module.
3.3 Further Experiments
To compare various strategies in each module of representation learning, we conduct further experiments using the RAGA model.
3.3.1 Pre-processing Module
RAGA takes pre-trained embeddings as input, which are forwarded to a two-layer GCN with highway network to generate initial representations. To examine the effectiveness of pre-trained embeddings and structural embeddings, we remove them, respectively, and then make comparison. Table 3.3 shows the results, where “w/o Pre-trained” represents removing pre-trained embeddings, “w/o GNN” represents removing GCN, and “w/o Both” represents removing the whole pre-processing module.
The results show that removing the structural features and the pre-trained embeddings significantly degrades the performance, and the model that completely removes the pre-processing module achieves the worst result. Hence, it is important to extract useful features to initialize the embeddings. Additionally, we can also observe that the semantic features in the pre-trained model are more useful than the structural vectors, which verifies the effectiveness of the prior knowledge contained in the pre-trained embeddings. Using structural embeddings for initialization is less effective, as the subsequent steps in representation learning also aim to extract the structural features to produce meaningful representations.
3.3.2 Messaging Module
For the messaging module, linear transformation is the most widely used approach. RAGA only utilizes linear transformation in its first GNN and does not use transformation in the other two GNNs. Thus, we design two variants: one that eliminates the linear transformation in the first GNN (“-Linear Transform”), resulting in a model without linear transformation at all, and the other one that adds linear transformation in the other two GNNs (“\(+\)Linear Transform”), resulting in a model that is fully equipped with linear transformation.
The results are presented in Table 3.4. Besides, we also report their convergence rates in Fig. 3.1.
It is evident that adding linear transformation to the rest of the GNNs improves the performance of RAGA, especially on the JA-EN and FR-EN datasets, where Hits@1 improves by 1.1% and 1.2%, respectively. Additionally, when linear transformation is removed, the performance drops significantly. Furthermore, Fig. 3.1 shows that linear transformation can also boost the convergence of the model, possibly due to the introduction of extra parameters.
3.3.3 Attention Module
For the attention module, there are two popular implementations, i.e., inner product and concatenation. To compare the two approaches, we replace the concatenation computation of RAGA with the inner product computation (“-Inner product,” which changes \(\boldsymbol {v}^T[\boldsymbol e_i\Vert \boldsymbol e_j]\) to \((\boldsymbol M_1\boldsymbol e_i)^T(\boldsymbol M_2\boldsymbol e_j)\), where \(\boldsymbol M_1, \boldsymbol M_2\) are learnable transformation matrices) and remove the attention mechanism (“w/o Attention,” where we do not compute attention coefficients and simply average the neighboring features), respectively, and then report the results.
As shown in Table 3.5, the two variant models perform almost the same as the original model. Considering the influence of the initial representations generated in the pre-processing module, we remove the pre-trained vectors from the pre-processing module and then conduct the same comparison. As shown in Table 3.6, removing the attention mechanism degrades the performance, so we may draw a preliminary conclusion that the attention mechanism plays a larger role in the absence of prior knowledge. As for the two attention computation strategies, inner product outperforms concatenation on the ZH-EN dataset but underperforms on the JA-EN and FR-EN datasets, which indicates that the two approaches contribute differently on different datasets.
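The two attention computations can be sketched side by side as follows. The dimensions, the randomly initialized parameters, and the softmax normalization over neighbors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
e_i = rng.normal(size=d)                 # center entity representation
neighbors = rng.normal(size=(3, d))      # neighbor representations e_j

def softmax(x):
    x = x - x.max()                      # shift for numerical stability
    ex = np.exp(x)
    return ex / ex.sum()

# Concatenation attention: score_ij = v^T [e_i || e_j],
# with a learnable attention vector v.
v = rng.normal(size=2 * d)
scores_cat = np.array([v @ np.concatenate([e_i, e_j]) for e_j in neighbors])
alpha_cat = softmax(scores_cat)

# Inner-product attention: score_ij = (M1 e_i)^T (M2 e_j),
# with learnable transformation matrices M1, M2.
M1 = rng.normal(size=(d, d))
M2 = rng.normal(size=(d, d))
scores_dot = np.array([(M1 @ e_i) @ (M2 @ e_j) for e_j in neighbors])
alpha_dot = softmax(scores_dot)
```

The “w/o Attention” variant corresponds to replacing either `alpha` with uniform weights `1 / len(neighbors)`.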
3.3.4 Aggregation Module
For the aggregation module, as RAGA incorporates both 1-hop neighbors and relation information to update entity representations, we examine two variants, i.e., adding 2-hop neighboring information (“-2hop”) and removing the relation representations (“w/o rel.”). The results are shown in Table 3.7.
We can observe that the performance decreases significantly after removing the relation representation learning. This shows that integrating relation representations can indeed enhance the learning ability of the model. Besides, the performance decreases slightly after adding the information of 2-hop neighboring entities, which suggests that 2-hop neighboring information can introduce noise, as not all entities are useful for aligning the target entity.
3.3.5 Post-processing Module
RAGA concatenates the relation-aware entity representations and the 1-hop aggregation results to produce the final representations. We examine two variants, i.e., “-highway,” which replaces concatenation with a highway network [10], and “w/o post-processing,” which removes the relation-aware entity representations (Table 3.8).
From the experimental results, it can be seen that removing the post-processing module decreases the performance, which indicates that the relation-aware representations can indeed enhance the final representations and improve the alignment performance. After replacing the concatenation operation with a highway network, the performance decreases on the JA-EN dataset and increases on the FR-EN dataset, which indicates that neither post-processing strategy has an absolute advantage over the other.
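For reference, a highway combination [10] learns a gate that mixes the input with the transformed representation, whereas concatenation simply stacks the two. The weights below are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway(x, h, w_g, b_g):
    """Highway combination: a learned gate g mixes the transformed
    representation h with the original input x."""
    g = sigmoid(x @ w_g + b_g)
    return g * h + (1.0 - g) * x

def concat_combine(x, h):
    """Concatenation combination, which doubles the dimension instead."""
    return np.concatenate([x, h], axis=-1)

rng = np.random.default_rng(3)
x = rng.normal(size=4)        # input representation
h = rng.normal(size=4)        # transformed representation
out_hw = highway(x, h, rng.normal(size=(4, 4)), rng.normal(size=4))
out_cat = concat_combine(x, h)
```

Note that the highway output keeps the input dimension, while concatenation doubles it, which is one practical consideration when swapping the two strategies.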
3.3.6 Loss Function Module
For the loss function, RAGA employs a margin-based loss in training. We consider two other popular choices, i.e., a TransE-based loss and a combined margin-based \(+\) TransE loss. Specifically, the TransE-based loss is formulated as \(l_E=\frac {1}{k}\sum _k\Vert h_k+r_k-t_k\Vert _1\), where \((h_k, r_k, t_k)\) is a randomly sampled triple.
From the results in Table 3.9, it can be seen that the model performance decreases after using or adding the TransE loss. This is mainly because the TransE assumption does not hold universally. For example, in the RAGA model used in this experiment, the relation representation is actually obtained by adding the head and tail entity representations, which conflicts with the TransE assumption.
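The two loss components discussed above can be sketched as follows. The TransE loss follows the \(\ell_1\) formulation given above; the margin value and the toy distances are illustrative assumptions:

```python
import numpy as np

def transe_loss(h, r, t):
    """TransE-based loss: mean L1 distance ||h + r - t||_1 over k triples."""
    return np.abs(h + r - t).sum(axis=1).mean()

def margin_loss(pos_dist, neg_dist, margin=1.0):
    """Margin-based loss: push aligned pairs closer than negative
    pairs by at least the margin."""
    return np.maximum(0.0, pos_dist - neg_dist + margin).mean()

# Toy triples satisfying h + r == t exactly, so the TransE loss is zero.
h = np.array([[1.0, 0.0], [0.0, 1.0]])
r = np.array([[0.0, 1.0], [1.0, 0.0]])
t = np.array([[1.0, 1.0], [1.0, 1.0]])
l_e = transe_loss(h, r, t)        # -> 0.0

# Margin loss on toy aligned/negative pair distances.
l_m = margin_loss(np.array([0.2, 0.9]), np.array([1.5, 1.0]), margin=1.0)
```

The combined margin-based \(+\) TransE objective simply sums the two terms, which is exactly the setting where the conflict with RAGA's relation modeling described above arises.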
4 Conclusion
In this chapter, we survey recent advances in the representation learning stage of EA. We propose a general framework for GNN-based representation learning models, which consists of six modules, and summarize ten recent works in terms of these modules. Extensive experiments are conducted to show the overall performance of each method and to reveal the effectiveness of the strategies in each module.
References
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. NIPS, 26, 2013.
Weishan Cai, Wenjun Ma, Jieyu Zhan, and Yuncheng Jiang. Entity alignment with reliable path reasoning and relation-aware heterogeneous graph transformer. In Lud De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 1930–1937. International Joint Conferences on Artificial Intelligence Organization, 7 2022. Main Track.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NIPS, 26, 2013.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026–1034, 2015.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456. PMLR, 2015.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
Xin Mao, Wenting Wang, Yuanbin Wu, and Man Lan. Boosting the speed of entity alignment 10\(\times \): Dual attention matching network with normalized hard sample mining. In WWW 2021, pages 821–832, 2021.
Xin Mao, Wenting Wang, Huimin Xu, Man Lan, and Yuanbin Wu. MRAEA: An efficient and robust entity alignment approach for cross-lingual knowledge graph. In WSDM, pages 420–428, 2020.
Xin Mao, Wenting Wang, Huimin Xu, Yuanbin Wu, and Man Lan. Relational reflection entity alignment. In CIKM, pages 1095–1104, 2020.
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
Zequn Sun, Wei Hu, and Chengkai Li. Cross-lingual entity alignment via joint attribute-preserving embedding. In ISWC(1), pages 628–644. Springer, 2017.
Zequn Sun, Chengming Wang, Wei Hu, Muhao Chen, Jian Dai, Wei Zhang, and Yuzhong Qu. Knowledge graph alignment network with gated multi-hop neighborhood aggregation. In AAAI, volume 34, pages 222–229, 2020.
Anil Surisetty, Deepak Chaurasiya, Nitish Kumar, Alok Singh, Gaurav Dhama, Aakarsh Malhotra, Ankur Arora, and Vikrant Dey. RePS: Relation, position and structure aware entity alignment. In WWW 2022, pages 1083–1091, 2022.
Jinzhu Yang, Ding Wang, Wei Zhou, Wanhui Qian, Xin Wang, Jizhong Han, and Songlin Hu. Entity and relation matching consensus for entity alignment. In CIKM, pages 2331–2341, 2021.
Donghan Yu, Yiming Yang, Ruohong Zhang, and Yuexin Wu. Knowledge embedding based graph convolutional network. In WWW, pages 1619–1628, 2021.
Ziyue Zhong, Meihui Zhang, Ju Fan, and Chenxiao Dou. Semantics driven embedding learning for effective entity alignment. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 2127–2140. IEEE, 2022.
Renbo Zhu, Meng Ma, and Ping Wang. RAGA: Relation-aware graph attention networks for global entity alignment. In Advances in Knowledge Discovery and Data Mining: 25th Pacific-Asia Conference, PAKDD 2021, Virtual Event, May 11–14, 2021, Proceedings, Part I, pages 501–513. Springer, 2021.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
Cite this chapter
Zhao, X., Zeng, W., Tang, J. (2023). Recent Advance of Representation Learning Stage. In: Entity Alignment. Big Data Management. Springer, Singapore. https://doi.org/10.1007/978-981-99-4250-3_3
Print ISBN: 978-981-99-4249-7
Online ISBN: 978-981-99-4250-3