Learning Representations for Hyper-Relational Knowledge Graphs

Harry Shomer^{1}, Wei Jin^{1}, Juanhui Li^{1}, Yao Ma^{2}, Jiliang Tang^{1}

^{1} Department of Computer Science, Michigan State University, Michigan, USA
^{2} Department of Computer Science, New Jersey Institute of Technology, New Jersey, USA
{shomerha, jinwei2, lijuanh1, tangjili}@msu.edu
yao.ma@njit.edu

Abstract

Knowledge graphs (KGs) have gained prominence for their ability to learn representations for uni-relational facts. Recently, research has focused on modeling hyper-relational facts, which move beyond the restriction of uni-relational facts and allow us to represent more complex, real-world information. However, existing approaches for learning representations on hyper-relational KGs mainly focus on enhancing the communication from qualifiers to base triples while overlooking the flow of information from base triples to qualifiers. This can lead to suboptimal qualifier representations, especially when a large number of qualifiers are present. This motivates us to design a framework that utilizes multiple aggregators to learn representations for hyper-relational facts: one from the perspective of the base triple and the other from the perspective of the qualifiers. Experiments demonstrate the effectiveness of our framework for hyper-relational knowledge graph completion across multiple datasets. Furthermore, we conduct an ablation study that validates the importance of the various components in our framework. The code to reproduce our results can be found at https://github.com/HarryShomer/QUAD.

1 Introduction

Figure 1: An example of a hyper-relational KG. The blue circles represent entities and the arrows represent relations. The dashed arrows represent qualifier relations while the solid arrows are base relations. The ??? entity represents the potential occupation of Richard Bachman (Novelist).

Knowledge graphs (KGs) are a collection of facts represented in a structured and graphical format. Facts are represented as a triple $(h, r, t)$ that connects two entities $h$ and $t$ (i.e. nodes) with a relation $r$ . KGs have become very popular recently with applications in language representation liu2020k, question answering huang2019knowledge, and recommendation wang2019kgat.

In traditional triple-based KGs, facts are represented as binary relations between entities, which often fall short in representing the complex nature of the data. To address this shortcoming, hyper-relational KGs were introduced, moving from uni-relational facts to facts with N-ary relations. In hyper-relational KGs, triples are often associated with relation-entity pairs, which are known as qualifiers. Qualifiers help qualify a given fact by providing more supporting information and are defined as an (entity, relation) pair that belongs to a triple. In Figure 1, the triple (Stephen King, Author Of, The Running Man) contains a single qualifier pair (Under Pseudonym, Richard Bachman). This qualifier pair provides more context to the base triple by telling us that Stephen King published the novel under the pseudonym Richard Bachman. Research on hyper-relational KGs guan_nalp; galkin2020message; yu_hytransformer; wang_gran has focused on learning representations for such hyper-relational graphs and examining how the addition of qualifier pairs can help boost the performance of knowledge graph completion.

However, the majority of existing approaches only consider the impact of the qualifiers on base triples while overlooking the flow of information from the base triples to the qualifiers. This can lead to suboptimal performance, especially when a large number of qualifiers are present. To illustrate its importance, we use Figure 1 as an example. In (Richard Bachman, Occupation, Novelist), suppose the occupation of Richard Bachman is missing and we are trying to predict (Richard Bachman, Occupation, ?). If no information spreads from the base triple to the qualifier entity Richard Bachman, predicting the missing link would be very hard, as no knowledge of Stephen King is transferred to the qualifier entity. Prior works such as NaLP guan_nalp and HINGE rosso_hinge both struggle to achieve this transfer of information due to the simplicity of their convolution-based frameworks. StarE galkin2020message ignores the flow of information from the base triples to the qualifiers. Although transformer-based frameworks yu_hytransformer; wang_gran model the mutual influence between base triples and qualifiers via self-attention, they inevitably ignore the structured nature of KGs.

Therefore, in this work, we aim to investigate the novel problem of learning representations for hyper-relational KGs by encouraging the mutual influence between base triples and qualifiers. Essentially, we are faced with the following challenge: how to properly enhance the influence from base triples to qualifiers while maintaining an effective impact from qualifiers to base triples. To address it, we propose a novel framework - QUalifier Aggregated Hyper-RelationAl KnowleDge Graphs (QUAD), which encourages influence in both directions. Specifically, QUAD utilizes two aggregators from different perspectives - a base aggregator and a qualifier aggregator. The base aggregator aggregates information for the base entities from the qualifiers while the qualifier aggregator performs the aggregation from the qualifier perspective. Inside the qualifier aggregator, we further propose the concept of a “qualifier triple” that allows us to easily aggregate information from a base triple through the qualifier relation. Following both aggregations, the entity and relation representations are passed to the decoder to perform knowledge graph completion. We show that our framework can achieve strong performance on multiple benchmark datasets. Our contributions can be summarized as follows:

We propose a novel architecture for hyper-relational knowledge graphs by introducing a graph encoder that aggregates information from the perspective of the qualifier entities.
We further show that several representative hyper-relational KG methods can be unified as special cases of QUAD.
Extensive experiments demonstrate the effectiveness of our framework against numerous baselines on multiple hyper-relational knowledge graphs.

Figure 2: An overview of QUAD. It takes the set of entity and relation embeddings and encodes them using two aggregators. We then mask the base and qualifier entities separately and pass the statement to the decoder to predict the most likely entity $\hat{e}$ for the mask.

2 Related Work

2.1 Knowledge Graph Embedding

Knowledge graph embeddings (KGE) use embeddings to represent the latent features of entities and relations in a KG. Many KGE techniques also employ a score function that produces a score for a given triple $(h, r, t)$ measuring the plausibility of the triple being true ji2021survey. Many different frameworks have been proposed in the literature. bordes2013translating proposed using a translational scoring function to model triples. yang2014embedding use a bilinear scoring function to score triples, utilizing a diagonal matrix to model relations. RotatE sun2019rotate scores a triple as a rotation in a complex space and ConvE dettmers2018convolutional does so using a convolutional neural network. Other works have focused on modeling knowledge graphs using graph neural networks. RGCN schlichtkrull2018modeling extends GCN kipf2016semi by using a relation-specific weight matrix when aggregating an entity’s neighbors. However, RGCN suffers from over-parameterization when there are many relations. To alleviate this concern, CompGCN vashishth2019composition proposes to use direction-specific weight matrices when aggregating instead of relation-specific weight matrices.

2.2 Hyper-Relational KGs

Several methods have been proposed that model hyper-relational facts as N-ary facts. wen2016_transh propose m-TransH, a method that builds on TransH wang2014knowledge by transforming each hyper-relational fact using a star-to-clique conversion. RAE zhang2018_rae builds upon m-TransH wen2016_transh by further considering the relatedness between two entities. NaLP guan_nalp uses a convolution-based framework to compute a relatedness vector for a triple and its qualifier pairs that can be used for prediction.

More recently, some work has modeled hyper-relational KGs from a purely hyper-relational viewpoint. rosso_hinge propose doing so using a convolutional framework. For a given hyper-relational fact, the base triple is convolved by itself and with each specific qualifier pair. The resulting feature vectors are then combined and used for prediction. StarE galkin2020message extends CompGCN vashishth2019composition by encoding the qualifiers for a specific triple and combining the result with the base relation of the triple. Using StarE galkin2020message as their foundation, yu_hytransformer replace the GNN aggregation with layer normalization ba2016layer to improve performance. Additionally, they mask the qualifier entities when training as a form of self-supervised learning (SSL). Lastly, wang_gran propose GRAN, a transformer-based architecture that employs edge-specific attention biases and masks all entities and relations in the sequence. What is missing from previous work is a clear flow of information from the base triples to the qualifiers. To address this, we propose a graph encoder to aggregate information from the perspective of the qualifiers, thereby creating better qualifier encodings.

3 Preliminaries

3.1 Knowledge Graphs

Let $G = \{V, R, E\}$ be a knowledge graph with nodes (i.e. entities) $V$ , edges $E$ , and relations $R$ . Each $e \in E$ represents a directed edge where two entities $u \in V$ and $v \in V$ are connected by a relation $r \in R$ . We also denote the edge as a triple $(v, r, u)$ . Further, we use $N_{v}$ to denote the neighboring entities and relations of a node $v$ .

CompGCN vashishth2019composition, a popular method for modeling knowledge graphs, utilizes a direction-specific weight matrix $W_{λ (r)}$ and a function $ϕ$ that combines the neighboring entity embedding $h_{u}$ and relation embedding $h_{r}$ of a given edge:

$$h_{v}^{(k + 1)} = f \left( \sum_{(u, r) \in N_{v}} W_{λ (r)}^{(k)} ϕ \left( h_{u}^{(k)}, h_{r}^{(k)} \right) \right), \tag{1}$$

where $f$ is a non-linear function (e.g. ReLU) and $λ (r)$ represents the direction of the relation, which can be one of: standard, inverse, or self-loop. Several functions are proposed for modeling the interaction of the embeddings in $ϕ$ , including subtraction, multiplication, and circular correlation vashishth2019composition. The relation embedding is updated through a transformation by a weight matrix $W_{rel}$ :

$$h_{r}^{(k + 1)} = W_{rel} h_{r}^{(k)} . \tag{2}$$
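As an illustration of Eqs. (1) and (2), the following is a minimal sketch of one CompGCN-style update for a single node, written in plain Python. The function names (`compose_sub`, `compgcn_node_update`, `update_relation`), the string keys for the relation directions, and the choice of subtraction for $ϕ$ are assumptions made for this example and are not taken from the CompGCN or QUAD implementations.

```python
def compose_sub(h_u, h_r):
    """Subtraction composition: phi(h_u, h_r) = h_u - h_r (one of several options)."""
    return [a - b for a, b in zip(h_u, h_r)]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(x):
    return [max(0.0, xi) for xi in x]


def compgcn_node_update(neighbors, W_by_direction, phi=compose_sub):
    """Eq. (1) for one node v.

    neighbors: list of (h_u, h_r, direction) tuples, one per edge in N_v.
    W_by_direction: maps a direction key (hypothetical names such as
    "standard", "inverse", "self_loop") to its weight matrix.
    """
    out_dim = len(next(iter(W_by_direction.values())))
    total = [0.0] * out_dim
    for h_u, h_r, direction in neighbors:
        message = matvec(W_by_direction[direction], phi(h_u, h_r))
        total = [t + m for t, m in zip(total, message)]
    return relu(total)


def update_relation(W_rel, h_r):
    """Eq. (2): the relation embedding is transformed by W_rel."""
    return matvec(W_rel, h_r)
```

Here the sum over the neighborhood is accumulated message by message, and the non-linearity $f$ is applied once at the end, mirroring the order of operations in Eq. (1).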

3.2 Hyper-Relational Knowledge Graphs

A hyper-relational knowledge graph can be seen as an extension of a standard knowledge graph where there is a set of qualifier pairs $\{(q v_{i}, q r_{i})\}$ , where $q v \in V$ and $q r \in R$ , associated with each triple $(v, r, u)$ . For simplicity we use $q$ to represent the set of neighboring qualifier pairs $(q v_{i}, q r_{i}) \in Q_{(v, r, u)}$ associated with a triple $(v, r, u)$ . We can therefore represent a hyper-relational fact as $(v, r, u, q)$ . We refer to hyper-relational facts as statements. Furthermore, we can represent the neighborhood for a qualifier entity $q v$ as $(v, r, u, q r) \in N_{q v}$ .
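To make these definitions concrete, the sketch below shows one possible in-memory representation of a statement $(v, r, u, q)$ and how the qualifier-entity neighborhood $N_{q v}$ could be collected from a list of statements. The `Statement` class, its field names, and the `qualifier_neighborhoods` helper are hypothetical and only serve this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Statement:
    """A hyper-relational fact (v, r, u, q). Field names are illustrative."""
    head: str
    relation: str
    tail: str
    # Each qualifier is stored as (qualifier relation, qualifier entity).
    qualifiers: Tuple[Tuple[str, str], ...] = ()


def qualifier_neighborhoods(
    statements: List[Statement],
) -> Dict[str, List[Tuple[str, str, str, str]]]:
    """Collect N_qv: for every qualifier entity qv, the (v, r, u, qr) tuples it appears in."""
    neighborhoods = defaultdict(list)
    for s in statements:
        for qr, qv in s.qualifiers:
            neighborhoods[qv].append((s.head, s.relation, s.tail, qr))
    return dict(neighborhoods)
```

For the statement in Figure 1, the neighborhood of Richard Bachman would contain the single tuple (Stephen King, Author Of, The Running Man, Under Pseudonym).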

A representative example for modeling hyper-relational KGs is StarE galkin2020message. It proposes a formulation based on CompGCN that incorporates an embedding $h_{q}$ , an encoded representation of the qualifier pairs for the base triple $(v, r, u)$ :

$$h_{v} = f \left( \sum_{(u, r) \in N_{v}} W_{λ (r)} ϕ \left( h_{u}, γ (h_{r}, h_{q}) \right) \right) . \tag{3}$$

The embedding $h_{q}$ is combined with the relation embedding $h_{r}$ through a function $γ$ that performs a weighted sum. yu_hytransformer show that replacing the graph aggregation with layer normalization can achieve comparable if not better performance than other frameworks. Furthermore they show that masking the qualifier entities during training can raise the test performance on knowledge graph completion.

3.3 Knowledge Graph Completion

Knowledge graph completion (i.e. link prediction) masks one of the two entities belonging to the triple and attempts to predict the correct entity. For example, given $(v, r, u)$ we would try to predict the correct entity for both $(v, r, ?)$ and $(?, r, u)$ . This is defined similarly for hyper-relational KGs where we are also provided with the qualifier information for the triple. Therefore, given $(v, r, u, q)$ we would try to predict the correct entity for both $(v, r, ?, q)$ and $(?, r, u, q)$ .
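The two queries generated per statement can be sketched as follows. This is only an illustration: the `MASK` token string and the `make_queries` helper are assumptions for the example, and statements are represented as plain tuples.

```python
MASK = "[MASK]"  # placeholder token; the exact string is an assumption


def make_queries(statement):
    """Return the two link-prediction queries for a statement (v, r, u, q).

    Each query is paired with the entity that should be predicted.
    The qualifier information q is kept in both queries.
    """
    v, r, u, q = statement
    tail_query = ((v, r, MASK, q), u)  # (v, r, ?, q)
    head_query = ((MASK, r, u, q), v)  # (?, r, u, q)
    return [tail_query, head_query]
```

For a standard triple without qualifiers, `q` is simply empty and the same two queries reduce to $(v, r, ?)$ and $(?, r, u)$.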

4 The Proposed Framework

In this section, we propose a new framework, QUAD, for learning representations of hyper-relational knowledge graphs. An overview of QUAD, which consists of two main components, is shown in Figure 2. The first component, detailed in Section 4.1, encodes the graph using two separate aggregations, while the second component, detailed in Section 4.2, decodes a given hyper-relational fact for a particular downstream task. In detail, we first pass the initial entity and relation embeddings to the encoder, which is composed of two graph encoders that aggregate information from the perspective of the base entities and qualifier entities, respectively. Once encoded, we create separate samples for each statement by masking each of the base and qualifier entities in turn. Each sample then gets passed to the decoder to predict the most likely entity $\hat{e}$ for the mask.

4.1 The Encoder

In this subsection, we define the encoder used in QUAD. The encoder is composed of two neighborhood aggregations: one that aggregates information for the base entities and one that does so for the qualifier entities. We refer to these two aggregators as the base aggregator and qualifier aggregator, respectively. The initial entity $E$ and relation $R$ embeddings are first passed to the base aggregator and then the qualifier aggregator. The encoded entity embedding $\hat{E}$ and relation embedding $\hat{R}$ can be expressed as follows:

$$E^{'}, R^{'} = \text{Base-Agg} (E, R, G), \tag{4}$$

$$\hat{E}, \hat{R} = \text{Qual-Agg} (E^{'}, R^{'}, G), \tag{5}$$

where $Base-Agg (\cdot)$ and $Qual-Agg (\cdot)$ are the aggregation functions in base aggregator and qualifier aggregator, respectively. In the following, we introduce the details of both aggregators.

4.1.1 The Base Aggregator

The base aggregator aims to aggregate information for the base entities. Concretely, it takes $E$ and $R$ as input and aggregates the neighborhood information for a given base entity $v$ using a function $ψ_{v}$ :

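$$h_{v} = \text{Aggregate} \left( ψ_{v} (h_{u}, h_{r}, h_{q}), \forall (u, r) \in N_{v} \right), \tag{6}$$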

where $h_{q}$ is the encoded representation for all the qualifier pairs belonging to the base triple $(v, r, u)$ . We utilize CompGCN vashishth2019composition as the Aggregate( $\cdot$ ) function. We can then rewrite Eq. (6) as follows,

$$h_{v} = f \left( \sum_{(u, r) \in N_{v}} W_{λ (r)} ψ_{v} (h_{u}, h_{r}, h_{q}) \right), \tag{7}$$

where the encoded qualifier representation $h_{q}$ is defined as in StarE galkin2020message as:

$$h_{q} = W_{q} \sum_{(q v, q r) \in Q_{(v, r, u)}} ϕ (h_{q r}, h_{q v}), \tag{8}$$

with $W_{q}$ as the projection matrix, $h_{q r}$ as the qualifier relation embedding, and $h_{q v}$ as the qualifier entity embedding. Note that the relation $h_{r}$ is updated through a linear transformation as shown in Eq. (2). StarE implements $ψ_{v}$ in Eq. (7) by combining the encoded qualifier information $h_{q}$ with the triple’s relation, i.e., $ψ_{v} (h_{u}, h_{r}, h_{q}) = ϕ (h_{u}, γ (h_{r}, h_{q}))$ as in Eq. (3). However, it is restricted by the assumption that qualifier information should be incorporated into the base relation embedding. Instead, we remove this restriction and combine the qualifier information with the output of the composition function $ϕ$ as follows:

$$ψ_{v} (h_{u}, h_{r}, h_{q}) = α ⊙ ϕ (h_{u}, h_{r}) + (1 - α) ⊙ h_{q}, \tag{9}$$

where $α \in [0, 1]$ is a hyperparameter that balances the contribution of the base triple information and the encoded qualifiers. In this way, the qualifier information encoded in $h_{q}$ can directly interact with both the base entity and relation.
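The difference between the two ways of injecting $h_{q}$ can be sketched in a few lines of Python. The StarE variant follows $ϕ (h_{u}, γ (h_{r}, h_{q}))$ with $γ$ as a weighted sum. The QUAD variant is written under the assumption that $ψ_{v}$ is a weighted sum of $ϕ (h_{u}, h_{r})$ and $h_{q}$ controlled by $α$, which is our reading of the description above. The function names and the use of element-wise multiplication for $ϕ$ are choices made for this example only.

```python
def phi_mult(a, b):
    """Element-wise product, one possible composition function phi."""
    return [x * y for x, y in zip(a, b)]


def weighted_sum(alpha, a, b):
    """alpha * a + (1 - alpha) * b, applied element-wise."""
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]


def psi_stare(h_u, h_r, h_q, alpha):
    """StarE-style: qualifiers are merged into the relation before composition."""
    return phi_mult(h_u, weighted_sum(alpha, h_r, h_q))


def psi_quad(h_u, h_r, h_q, alpha):
    """Assumed QUAD-style: qualifiers are merged with the output of phi."""
    return weighted_sum(alpha, phi_mult(h_u, h_r), h_q)
```

With $α = 1$ both variants ignore $h_{q}$ and reduce to $ϕ (h_{u}, h_{r})$. For smaller $α$, the second variant lets $h_{q}$ contribute independently of $h_{u}$, whereas in the first it is always modulated by $h_{u}$ through $ϕ$.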

4.1.2 The Qualifier Aggregator

The previously introduced base aggregator only considers the aggregation of information from the qualifiers to the base triple but not vice versa. Encoding base triple information in the qualifiers is advantageous as it can help learn better representations for the qualifiers. For example, in Figure 1 the base aggregation does not consider the flow of information from the triple (Stephen King, Author Of, The Running Man) to the qualifier entity Richard Bachman, resulting in limited information about the author being encoded in the qualifier entity. This makes it difficult to infer new facts where Richard Bachman is a base entity. To encode more information for a qualifier entity, we can aggregate information from its neighbors, i.e., the base triples it belongs to and the qualifier relations connecting it to those triples. Using our previous example, the qualifier entity Richard Bachman would aggregate information from the base triple (Stephen King, Author Of, The Running Man) and the qualifier relation Under Pseudonym.

Hence, the Qual-Agg( $\cdot$ ) function is designed to aggregate the neighborhood information for the qualifier entity $q v$ as follows:

$$h_{q v} = \text{Aggregate} \left( ψ_{q} (h_{v}, h_{r}, h_{u}, h_{q r}), \forall (v, r, u, q r) \in N_{q v} \right), \tag{10}$$

where $h_{q r}$ is the qualifier relation embedding and the function $ψ_{q}$ is used to combine the neighboring embeddings. For $ψ_{q}$ , we want the base triple embeddings $(h_{v}, h_{r}, h_{u})$ to be considered as a whole. Since the qualifiers serve as additional context for explaining the whole triple, we should aggregate information from the whole triple instead of treating $h_{v}, h_{r}, h_{u}, h_{q r}$ individually. To achieve this goal, we first consider that a qualifier entity $q v$ is linked to some base triple $t = (v, r, u)$ by the qualifier relation $q r$ . This can be seen as analogous to a standard triple where the base triple is the head, the qualifier relation is the relation, and the qualifier entity is the tail entity. For convenience, we can write this in a triple notation as $(t, q r, q v)$ , which we refer to as a qualifier triple. An example found in Figure 1 would be $t =$ (Stephen King, Author Of, The Running Man), $q r =$ Under Pseudonym, and $q v =$ Richard Bachman. Using these ideas, we can view $ψ_{q}$ as a function of the base triple $t$ and the qualifier relation $q r$ :

$$ψ_{q} (h_{v}, h_{r}, h_{u}, h_{q r}) = ϕ (h_{t}, h_{q r}), \tag{11}$$

where the embedding $h_{t}$ is the encoded representation of the base triple $t$ and $ϕ$ is defined similarly to the composition function used in CompGCN vashishth2019composition. The base triple representation $h_{t}$ is formulated as the linear projection of the concatenated embeddings of the base triple:

$$h_{t} = \text{Linear} (\text{Concat} (h_{v}, h_{r}, h_{u})) . \tag{12}$$
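The qualifier-triple view in Eqs. (10) to (12) can be sketched as follows. This is a simplified illustration in plain Python: the Aggregate function is reduced to a plain sum (the direction-specific weights and non-linearity are omitted), $ϕ$ is taken to be element-wise multiplication, and the names `W_t` and `b_t` for the parameters of the linear projection are hypothetical.

```python
def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def phi_mult(a, b):
    """Element-wise product used here as the composition function phi."""
    return [x * y for x, y in zip(a, b)]


def psi_q(h_v, h_r, h_u, h_qr, W_t, b_t):
    """Eqs. (11)-(12): encode the base triple as a whole, then compose with h_qr."""
    h_t = linear(W_t, b_t, h_v + h_r + h_u)  # list concatenation = Concat
    return phi_mult(h_t, h_qr)


def aggregate_qualifier_entity(neighborhood, W_t, b_t):
    """Simplified Eq. (10): sum the messages from every (v, r, u, qr) in N_qv.

    neighborhood: list of (h_v, h_r, h_u, h_qr) embedding tuples.
    """
    messages = [psi_q(h_v, h_r, h_u, h_qr, W_t, b_t)
                for h_v, h_r, h_u, h_qr in neighborhood]
    return [sum(column) for column in zip(*messages)]
```

Because $h_{t}$ is computed from the concatenation of all three base embeddings, the message sent to the qualifier entity depends on the base triple as a unit rather than on each element separately.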

4.2 The Decoder

Using the encoded representations of the entities and relations, i.e., $\hat{E}$ and $\hat{R}$ , each hyper-relational fact is passed to a decoder to make the final prediction. To decode each fact we utilize a transformer vaswani2017attention that employs an architecture similar to CoKE wang2019coke extended to include qualifier information. For a given input sample $S$ , we mask the entity token we are trying to predict. Eq. (13) is an example where we mask the object entity:

$$S = (h_{v}, h_{r}, [\text{Mask}], h_{q r_{1}}, h_{q v_{1}}, \dots) . \tag{13}$$

After being passed through the transformer, we extract the masked embedding and pass it through a fully-connected layer and score it against all possible entities. This is then passed through a sigmoid function with the highest scoring entity being chosen as our prediction.
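The final scoring step can be sketched as below. The transformer itself is omitted: the example assumes that the output vector at the masked position is already available, and the names `W_fc`, `b_fc`, and `predict_masked_entity` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(W, b, x):
    """Fully-connected layer W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def predict_masked_entity(mask_output, W_fc, b_fc, entity_embeddings):
    """Project the masked-position output, score it against every entity
    embedding with a dot product, and squash each score with a sigmoid.

    Returns the index of the highest-scoring entity and all scores.
    """
    projected = linear(W_fc, b_fc, mask_output)
    scores = [sigmoid(sum(p * e for p, e in zip(projected, emb)))
              for emb in entity_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Since the sigmoid is monotonic, the predicted entity is the same as the one with the largest raw dot product. The sigmoid matters mainly for training with a binary cross-entropy loss.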

4.2.1 Model Training

As the downstream task is to perform knowledge graph completion for entities in the base triples, we mask the head and tail entities for the fact $(h_{v}, h_{r}, h_{u}, q)$ where $q$ is the set of qualifier pairs associated with the triple. Then we try to predict the masked entities and minimize the loss $L_{base}$ using binary cross-entropy loss.

To further enhance the learned representations of the embeddings and exploit the qualifier information, we include an auxiliary task that masks and attempts to predict the qualifier entities for each fact, as proposed in Hy-Transformer yu_hytransformer. We refer to this loss as $L_{qual}$ and minimize it using the binary cross-entropy loss. Since the downstream task is solely to predict the missing base entities, we introduce a hyperparameter $β \in [0, 1]$ that balances the contribution of the qualifier entity loss. The total loss can then be written as:

$$L = L_{base} + β L_{qual} . \tag{14}$$
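A minimal sketch of the combined objective in Eq. (14) is given below, assuming that each prediction is a list of per-entity probabilities and each label a list of 0/1 targets. The helper names and the clamping constant `eps` are choices made for this example.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def total_loss(l_base, l_qual, beta):
    """Eq. (14): L = L_base + beta * L_qual."""
    return l_base + beta * l_qual
```

Setting `beta = 0` removes the auxiliary qualifier-prediction task entirely, while `beta = 1` weights both tasks equally.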

4.3 Parallel Architecture

We also consider a version of QUAD that combines the entity representations encoded by the two aggregation schemes in parallel instead of sequentially. Under this setting, both aggregate functions take the initial entity embedding matrix $E$ as input. The encoded entity representations output by the two encoders are then combined via a weight matrix $W_{p}$ and passed to the decoder. The relation embeddings are still passed sequentially as we found that this performed best. We believe that this version of our framework may be better at balancing the contribution of the base and qualifier aggregation for some datasets. We formulate the parallel encoding scheme as follows, where $E_{base}$ and $R_{base}$ are the output of $\text{Base-Agg} (\cdot)$ and $E_{qual}$ is the output of $\text{Qual-Agg} (\cdot)$ :

$$E_{base}, R_{base} = \text{Base-Agg} (E, R, G), \tag{15}$$

$$E_{qual}, \hat{R} = \text{Qual-Agg} (E, R_{base}, G), \tag{16}$$

$$\hat{E} = W_{p} \, \text{Concat} (E_{base}, E_{qual}) . \tag{17}$$
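The combination step in Eq. (17) can be illustrated as follows, assuming that the concatenation is performed per entity along the feature dimension and that $W_{p}$ maps the concatenated vector back to the embedding size. The function name `combine_parallel` is hypothetical.

```python
def combine_parallel(E_base, E_qual, W_p):
    """Eq. (17): concatenate the two encodings of each entity and project with W_p.

    E_base, E_qual: lists of d-dimensional entity embeddings (same order).
    W_p: matrix with d rows and 2d columns, given as a list of rows.
    """
    combined = []
    for e_base, e_qual in zip(E_base, E_qual):
        x = e_base + e_qual  # per-entity concatenation
        combined.append([sum(w * xi for w, xi in zip(row, x)) for row in W_p])
    return combined
```

As a sanity check, choosing $W_{p}$ so that it averages the two halves of the concatenated vector yields the mean of the base and qualifier encodings. A learned $W_{p}$ can instead weight the two encodings differently per dimension.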

4.4 Relationship to Other Frameworks

Method	WD50K (13.6)			Wikipeople (2.6)			JF17K (45.9)		
	MRR	H@1	H@10	MRR	H@1	H@10	MRR	H@1	H@10
NaLP-Fix	0.177	0.131	0.264	0.420	0.343	0.556	0.245	0.185	0.358
HINGE	0.243	0.176	0.377	0.476	0.415	0.585	0.449	0.361	0.624
StarE	0.349	0.271	0.496	0.491	0.398	0.648	0.574	0.496	0.725
Hy-Transformer	0.356	0.281	0.498	0.501	0.426	0.634	0.582	0.501	0.742
QUAD	0.348	0.270	0.497	0.466	0.365	0.624	0.582	0.502	0.740
QUAD (Parallel)	0.349	0.275	0.489	0.497	0.431	0.617	0.596	0.519	0.751

Table 1: Knowledge Graph Completion Results for Multiple Datasets

Method	WD50K (33)			WD50K (66)			WD50K (100)		
	MRR	H@1	H@10	MRR	H@1	H@10	MRR	H@1	H@10
NaLP-Fix	0.204	0.164	0.277	0.334	0.284	0.423	0.458	0.398	0.563
HINGE	0.253	0.190	0.372	0.378	0.307	0.512	0.492	0.417	0.636
StarE	0.331	0.268	0.451	0.481	0.420	0.594	0.654	0.588	0.777
Hy-Transformer	0.343	-	-	0.515	-	-	0.699	0.637	0.812
QUAD	0.349	0.286	0.470	0.515	0.456	0.623	0.703	0.638	0.820
QUAD (Parallel)	0.346	0.287	0.459	0.510	0.454	0.615	0.693	0.628	0.812

Table 2: Knowledge Graph Completion Results on WD50K Splits

In this subsection we provide a unified view of several popular hyper-relational KG methods by showing that they are special cases of our framework.

4.4.1 StarE

StarE galkin2020message uses a GNN to encode the KG and a transformer to perform knowledge graph completion. The GNN encoder is detailed in Eq. (3). We can show that StarE is a special case of QUAD where:

Information is only aggregated from the perspective of the base entities. $Qual-Agg (\cdot)$ is therefore equal to the identity function.
The function $ψ_{v}$ in the base aggregator is defined as $ϕ (h_{u}, α ⊙ h_{r} + (1 - α) ⊙ h_{q})$ .
StarE only masks the base entities so $β = 0$ .

4.4.2 Hy-Transformer

Hy-Transformer yu_hytransformer modifies StarE by replacing the GNN encoder with layer normalization layers. They also introduce an auxiliary task that masks the qualifier entities during training. We demonstrate that Hy-Transformer is a special case of QUAD where:

Both aggregate functions are equal to layer normalization layers for the entity and relation embeddings such that: $\hat{E} = \text{Layer-Norm} (E)$ and $\hat{R} = \text{Layer-Norm} (R)$ .
The loss balancing hyperparameter $β = 1$ .

4.4.3 GRAN

GRAN wang_gran uses a transformer with edge-specific attention biases. We demonstrate that GRAN is a special case of QUAD where:

Both aggregate functions are equal to the identity function.
The loss balancing hyperparameter $β = 1$ .
The transformer (decoder) considers edge-specific biases. The transformer used in QUAD is equivalent to the GRAN-complete model where the biases for the key and value matrices are set to zero $(e_{i j}^{K}, e_{i j}^{V}) = (0, 0)$ .

5 Experiment

In this section, we conduct experiments to demonstrate the effectiveness of our proposed framework QUAD. We first introduce the experimental settings and then compare the results of QUAD against the baselines on numerous benchmark datasets. Next we perform an ablation study to determine the importance of each component in our framework. Lastly, we perform some additional experiments to assess the impact of the loss balancing term on the performance.

5.1 Experimental Settings

5.1.1 Datasets

We consider three datasets for our experiments including JF17K wen2016_transh, Wikipeople guan_nalp, and WD50K galkin2020message. An issue with WD50K and Wikipeople is that only a small percentage of triples contain qualifiers, being 13.6% for WD50K and 2.6% for Wikipeople. We therefore also measure the performance on the WD50K splits introduced by galkin2020message that contain a higher percentage of triples with qualifiers. The three splits are WD50K (33), WD50K (66), and WD50K (100) with the number in parentheses representing the percentage of triples with qualifiers.

5.1.2 Baselines

We compare the results of our framework with other prominent hyper-relational baselines including NaLP-Fix rosso_hinge, HINGE rosso_hinge, StarE galkin2020message, and Hy-Transformer yu_hytransformer. Note that we do not include GRAN wang_gran as a baseline since (1) similar to Hy-Transformer, it is a transformer-based method; and (2) in addition to the auxiliary task, GRAN also masks the relations. Such a component could be incorporated into the proposed framework, which we leave as future work.

5.1.3 Evaluation Metrics

To evaluate the performance on the test set we report the mean reciprocal rank (MRR) and the percentage of top 1 and 10 hits (H@1 and H@10) when performing knowledge graph completion. We utilize the filtered setting introduced by bordes2013translating.
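For reference, the sketch below shows how these metrics can be computed from per-query ranks, together with a filtered rank that ignores other entities already known to be correct answers for the query. The function names and the dictionary-based score representation are assumptions for this example. Ties are resolved optimistically here, which is one of several possible conventions.

```python
def filtered_rank(scores, target, known_true=()):
    """Rank of `target` among all candidates, skipping other known-true entities.

    scores: dict mapping entity -> predicted score for one query.
    known_true: other entities that also form true facts with this query.
    """
    target_score = scores[target]
    rank = 1
    for entity, score in scores.items():
        if entity == target or entity in known_true:
            continue  # filtered setting: do not penalize other correct answers
        if score > target_score:
            rank += 1
    return rank


def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```

In the filtered setting, a correct entity that is outscored only by other correct entities still receives rank 1, so the metrics are not distorted by queries with several valid answers.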

5.2 Performance Comparison on Benchmarks

5.2.1 Performance on WD50K, Wikipeople and JF17K

In this subsection, we evaluate QUAD on the benchmark datasets and compare its performance to the aforementioned baselines. We first evaluate performance on the WD50K, Wikipeople, and JF17K datasets. The method QUAD is the original formulation of our framework while the method QUAD (Parallel) is the alternative formulation presented in Section 4.3. The results are shown in Table 1. For each dataset we include the percentage of triples with qualifiers in parentheses.

Evaluating the results in Table 1, we observe that the performance of QUAD varies by dataset. On JF17K, it achieves the best performance on all three metrics, including a 2.4% relative increase in MRR over the second-best model. On Wikipeople, its performance is similar to Hy-Transformer, while on WD50K it is slightly below the state of the art. We attribute this to both WD50K and Wikipeople containing a low percentage of triples with qualifier pairs, at 13.6% and 2.6%, respectively. On JF17K, which has a much higher percentage of triples with qualifiers at 45.9%, QUAD is able to outperform the baseline models. We therefore believe that QUAD shows its value on datasets with a higher ratio of qualifiers.

5.2.2 Varying the ratio of qualifiers

To test this hypothesis, we evaluate QUAD on the WD50K subsets introduced in Section 5.1.1. The percentage of triples with qualifier pairs for the three subsets is approximately 33%, 66%, and 100%, respectively. The results are presented in Table 2. From the results, we see state-of-the-art performance across all three datasets for each of the three evaluation metrics. Interestingly, as opposed to the results shown in Table 1, the original (non-parallel) version of QUAD performs best. We believe that the strong performance on the datasets with a more substantial percentage of qualifiers shows the utility of our framework, and that the relatively poorer performance on the Wikipeople and WD50K datasets results from them containing too few triples with qualifiers.

5.3 Ablation Study

In this subsection, we conduct an ablation study to determine the importance of each component in QUAD. We do so by considering three versions of QUAD: (1) one that does not mask the qualifier entities, (2) one without the qualifier aggregation component, and (3) one with neither qualifier masking nor qualifier aggregation. Evaluating the three ablated frameworks helps us ascertain both the individual and cumulative effect those two components have on the performance. Of most importance is the impact of the novel qualifier aggregation introduced in this paper. We report the results of this study on the WD50K (100) dataset under the original (non-parallel) setting. The results can be found in Table 3.

Method	MRR	H@1	H@10
w/o Qual Agg & Mask	0.658	0.591	0.781
w/o Qual Mask	0.677	0.613	0.794
w/o Qual Agg	0.696	0.628	0.820
QUAD	0.703	0.638	0.820

Table 3: Ablation Study on WD50K (100)

Evaluating the results in Table 3, we see that the removal of either or both of the two components leads to a degradation in performance. Ablating the qualifier aggregation, the main contribution of our framework, causes a 1% relative reduction in MRR and a 1.6% relative drop in H@1. Furthermore, removing just the qualifier masking results in a 3.7% relative reduction in MRR, and ablating both components leads to the largest drop, a 6.4% relative reduction in MRR. These results validate the importance of the qualifier aggregation as its removal leads to a degradation in performance.
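The percentages quoted above are relative changes with respect to the full model. As a quick check, they can be recomputed from the values in Table 3. The dictionary layout and the `relative_drop` helper are only for illustration.

```python
# (MRR, H@1) values copied from Table 3 (WD50K (100)).
TABLE_3 = {
    "w/o Qual Agg & Mask": (0.658, 0.591),
    "w/o Qual Mask": (0.677, 0.613),
    "w/o Qual Agg": (0.696, 0.628),
    "QUAD": (0.703, 0.638),
}


def relative_drop(full, ablated):
    """Percentage decrease of `ablated` relative to `full`."""
    return 100.0 * (full - ablated) / full


full_mrr, full_h1 = TABLE_3["QUAD"]
for name, (mrr, h1) in TABLE_3.items():
    if name != "QUAD":
        print(f"{name}: MRR -{relative_drop(full_mrr, mrr):.1f}%, "
              f"H@1 -{relative_drop(full_h1, h1):.1f}%")
```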

5.4 Effect of Loss Balancing Hyperparameter

In this subsection, we study the effect that the loss balancing hyperparameter $β$ has on the performance. To do so we train and evaluate multiple versions of our framework with values of $β$ in the set $\{0, 0.25, 0.5, 0.75, 1\}$ . We note that a value of $β = 0$ is equivalent to no qualifier masking while a value of $β = 1$ is equivalent to the loss used in Hy-Transformer yu_hytransformer. We perform this study on the three WD50K splits and JF17K to determine the impact the loss balancing hyperparameter has on a variety of datasets. We utilize the parallel setting when training on JF17K and our original formulation on the WD50K splits. For simplicity we report only the MRR performance in Table 4.

$β$	WD50K (33)	WD50K (66)	WD50K (100)	JF17K
0	0.33	0.491	0.677	0.592
0.25	0.346	0.507	0.690	0.596
0.50	0.348	0.514	0.697	0.592
0.75	0.348	0.514	0.701	0.589
1	0.349	0.515	0.703	0.583

Table 4: Loss Hyperparameter Study (MRR)

The results in Table 4 show multiple trends. For WD50K (100) there is a clear relationship between the value of $β$ and the performance: the higher the value of $β$ , the better the performance. This is likely due to the fact that every triple in the dataset has at least one qualifier pair. For both WD50K (33) and (66) the MRR improves as we increase $β$ from 0 to 0.50, but we see little to no improvement when increasing the value from 0.50 to 1. Lastly, for JF17K the MRR is maximized when $β = 0.25$ . Furthermore, for values of $β$ beyond 0.25 we see a noticeable drop in performance, resulting in an MRR that is 2.2% lower (in relative terms) at $β = 1$ compared to $β = 0.25$ . This shows the importance of balancing the magnitude of the auxiliary loss: depending on the dataset, weighting the two losses equally ( $β = 1$ ) can lead to a considerable degradation in performance.
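The trends described above can be read off Table 4 programmatically. The short sketch below selects the best $β$ per dataset and recomputes the relative gap on JF17K. The dictionary layout and the `best_beta` helper are assumptions for this example.

```python
# MRR values copied from Table 4; inner keys are the values of beta.
TABLE_4 = {
    "WD50K (33)": {0: 0.33, 0.25: 0.346, 0.5: 0.348, 0.75: 0.348, 1: 0.349},
    "WD50K (66)": {0: 0.491, 0.25: 0.507, 0.5: 0.514, 0.75: 0.514, 1: 0.515},
    "WD50K (100)": {0: 0.677, 0.25: 0.690, 0.5: 0.697, 0.75: 0.701, 1: 0.703},
    "JF17K": {0: 0.592, 0.25: 0.596, 0.5: 0.592, 0.75: 0.589, 1: 0.583},
}


def best_beta(mrr_by_beta):
    """Return the beta with the highest MRR."""
    return max(mrr_by_beta, key=mrr_by_beta.get)


jf17k = TABLE_4["JF17K"]
# Relative MRR gap on JF17K between the best beta and beta = 1, in percent.
jf17k_gap = 100.0 * (jf17k[best_beta(jf17k)] - jf17k[1]) / jf17k[best_beta(jf17k)]
```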

6 Conclusion

In this paper we introduce our framework QUAD for learning representations for hyper-relational knowledge graphs. Our framework is motivated by better encoding qualifier information for a given hyper-relational fact. To this end, we design a novel qualifier aggregation module to learn better encoded representations for the qualifiers. We also introduce a hyperparameter to balance the contribution of the auxiliary task in the loss. Experiments show that our framework performs well on several benchmark datasets compared to competitive baselines. Further experiments validate the importance of the various components in our framework and the need to balance the auxiliary loss. For future work we plan to explore how additional auxiliary information, such as text descriptions or literals, can help learn better representations of hyper-relational KGs.

References

Appendix A Infrastructure

All experiments were run on one 32 GB Tesla V100 GPU and implemented using PyTorch Geometric torch_geometric.

Appendix B Datasets

We conducted experiments on three datasets. These are JF17K wen2016_transh, Wikipeople guan_nalp, and WD50K galkin2020message. Table 5 details the statistics for each.

Dataset	#Statements	#Entities	#Relations
Wikipeople (2.6)	369,866	34,839	375
JF17K (45.9)	100,947	28,645	501
WD50K (33)	102,107	38,124	475
WD50K (66)	49,167	27,347	494
WD50K (100)	31,314	18,792	279
WD50K (13.6)	236,507	47,156	532

Table 5: Datasets. We note the approximate percentage of triples with qualifiers in parentheses.

Appendix C Hyperparameters

QUAD is trained for 500 epochs with a learning rate of 1e-4. The embedding dimension is 200 and the number of layers for the Base-Agg encoder is 2. For the transformer we set the number of layers to 2, the number of heads to 4, and dropout to 0.1, and the hidden dimension is tuned from {512, 768}. The batch size is tuned from {128, 256}, $α$ is tuned from [0, 1] in steps of 0.1, $β$ is tuned from {0.25, 0.5, 0.75, 1}, the number of layers for the Qual-Agg from {1, 2}, the encoding dropout from {0.1, 0.2, 0.3}, the learning rate decay from {None, 0.996, 0.9975, 0.999}, and the label smoothing from {0.1, 0.2, 0.4, 0.6, 0.8}. The Adam optimizer kingma2014adam is used in all experiments. For the composition function $ϕ$ we utilize the RotatE scoring function sun2019rotate. Under the parallel setting we tune an additional dropout layer, applied after we combine the representations of the two encoders, from {0.2, 0.3}.

Table 6 holds the hyperparameters for the general datasets and Table 7 for the three WD50K splits. The values are the same for the parallel and non-parallel versions of QUAD with the exception of the inclusion of the parallel dropout hyperparameter.

Hyperparameter	JF17K	Wikipeople	WD50K
Batch Size	128	128	128
Alpha	0.8	0.8	0.7
Beta	0.25	0	0.5
Qual Agg Layers	2	2	1
Encoder Dropout	0.1	0.1	0.2
Transformer Dim	768	512	768
LR Decay	0.999	0.999	0.9975
Label Smoothing	0.6	0.2	0.2
Parallel Dropout	0.2	0.2	0.2

Table 6: Hyperparameter values for the general datasets.

Hyperparameter	(33)	(66)	(100)
Batch Size	128	128	256
Alpha	0.7	0.7	0.6
Beta	1	1	1
Qual Agg Layers	1	1	1
Encoder Dropout	0.2	0.2	0.2
Transformer Dim	768	768	768
LR Decay	0.9975	0.9975	None
Label Smoothing	0.2	0.2	0.1
Parallel Dropout	0.3	0.3	0.3

Table 7: Hyperparameter values for the WD50K splits.