Introduction

Over recent years, systems where units interact in groups of two or more have been increasingly investigated. Such higher-order interactions have been observed in a wide variety of systems, including cellular networks1, drug recombination2, ecological communities3 and functional mapping of the human brain4.

These systems can be better described by hypergraphs, where hyperedges encode interactions among an arbitrary number of units5,6. Often, research in this area solely considers the topology of hypergraphs, that is, a set of nodes and their higher-order interactions. Many hypergraph datasets, however, include attributes that describe properties of nodes, such as the age of an individual, their job title in the context of workplace interactions, or the political affiliation of a voter. In this work, we consider how to extend the analysis of hypergraphs to incorporate this extra information.

We focus on the relevant task of community detection, where the goal is to cluster nodes in a hypergraph. Community detection algorithms solely based on interactions tend to cluster nodes based on notions of affinity between communities, cluster separation, or other arguments similar to those classically utilized on graphs7. However, one can assume that relevant information about the communities and the hyperedge formation mechanism is additionally contained in the attributes accompanying a dataset.

For instance, students in a school have been observed to be more likely to interact in groups that involve individuals from the same classes8. A similar observation was also made for dyadic networks, where incorporating node attributes helped in community detection and other related inference tasks, e.g., prediction of missing information9,10,11,12,13.

Several tools have been developed for community detection in higher-order data14,15,16,17. Methods based on statistical inference have established themselves as effective tools in this direction, as they are both mathematically principled and have a high computational efficiency18,19,20.

Here, we build on these approaches to incorporate node attributes into a community detection framework for higher-order interactions. More precisely, we follow the principles behind generative models for networks, which incorporate community structure by means of latent variables that are inferred directly from the observed interactions21,22,23 and extend them to incorporate extra information on nodes.

The model we propose has several desirable features. It is flexible, as it can be applied to both weighted and unweighted hypergraphs, it can incorporate different node attributes, categorical or binary, and it outputs overlapping communities, where nodes can belong to multiple groups simultaneously. Furthermore, the model does not assume any a priori correlation structure between the attributes and the communities. Rather, it infers such a connection directly from the data. The extent of this contribution can vary based on the dataset. In the favorable case where attributes are correlated well with the communities, our model exploits such additional information to improve community detection. This is particularly beneficial in situations where data is sparse or when data availability is limited to an incomplete set of observations. In less favorable situations where correlation is low (for instance when the attributes do not align with the mechanism generating higher-order interactions), the model can nevertheless either discard or downweigh this information.

In some cases, a system can be explained well by different community divisions. Our model allows selecting a particular community structure guided by the desired attribute, provided that it is informative, as measured automatically by fitting the data. This allows a practitioner to focus the analysis of group interactions on some particular node characteristic.

Finally, our model is computationally efficient, as it scales to large hypergraphs and large hyperedge sizes. This feature is particularly relevant in the presence of higher-order interactions, where the increased computational complexity limits the range of models that can be practically implemented into viable algorithms.

Few works investigate community detection in hypergraphs in the presence of node attributes24,25,26, and they are limited to clustering nodes without providing additional probabilistic estimates. Furthermore, they can be computationally burdensome, or they typically rely on stronger assumptions about the nature of the data (e.g., assuming real-valued weights) or the communities (e.g., each node can belong to only one group).

Results

The model

We propose a probabilistic model that incorporates both the structure of a hypergraph, i.e., the interactions observed in the data, and additional attributes (or covariates) on the nodes. These two types of information, which we call structural and attribute information, have been previously shown to be informative in modeling community structure in networks, when there is correlation to be exploited9,10,11,12.

We denote a hypergraph as H = (V, E, A), where V = {1, …, N} is a set of nodes, E is a set of observed hyperedges whose elements e ∈ E are arbitrary sets of two or more nodes in V, and A is a vector containing the weights of the hyperedges. In this work, we assume that weights are positive and integer quantities. Denoting Ω as the set of all possible hyperedges, we have that Ae is the weight of hyperedge e when e ∈ E, otherwise Ae = 0 if e ∈ Ω ∖ E. Given these definitions, the observed edge set E can equivalently be represented as E = {e ∈ Ω ∣ Ae > 0}. We represent the covariates on nodes as a matrix \(X\in {{\mathbb{R}}}^{N\times Z}\), where Z is the number of attributes, with entries Xiz equal to 1 if node i has attribute z and 0 otherwise. We note that a node can have several types of covariates, e.g., gender and age, which are then one-hot encoded as attributes.
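
To make the notation concrete, the sketch below shows one possible in-memory representation of these objects in Python; the variable names and toy values are illustrative only and are not taken from the released code.

```python
import numpy as np

# Toy hypergraph H = (V, E, A) with N = 6 nodes, stored as a list of
# hyperedges (tuples of node indices) and their positive integer weights A_e.
N = 6
hyperedges = [(0, 1, 2), (2, 3), (1, 3, 4, 5)]   # observed hyperedges e in E
A = np.array([2, 1, 3])                          # weights A_e > 0, one per observed hyperedge

# One-hot encoded node covariates: here a single categorical covariate
# ("class") with Z = 2 possible values; X[i, z] = 1 iff node i has attribute z.
Z = 2
class_labels = [0, 0, 1, 1, 0, 1]                # hypothetical labels, for illustration only
X = np.zeros((N, Z), dtype=int)
X[np.arange(N), class_labels] = 1
```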

We model the presence of structural information A and covariate information X probabilistically, assuming a joint probability of these two types of information that is mediated by a set of latent variables \(\theta=\left\{w,\beta,u\right\}\). Here, w and β are each specific to one of the two types of information, while the quantity u is a latent variable shared between the two. The presence of a shared u is key to coupling the two types of information and extracting valuable insights about the system. Formally, we assume

$$P(A,X\,| \,\theta )={P}_{A}(A\,| \,w,u)\,{P}_{X}(X\,| \,\beta,u).$$
(1)

This factorization assumes conditional independence between A and X, given the parameters θ, and is analogous to related approaches on graphs9,10. The factorization in Eq. (1) presents various advantages. First, the parameters in θ can provide interpretable insights about the mechanism driving hyperedge formation, as we show below. In our case, we focus on community structure, hence we model u to represent the community memberships of nodes. Second, it allows for efficient inference of the model parameters θ, as we show in the Methods section. Third, it allows predicting both A and X, which is relevant for example in the case of corrupted or missing data.

Having introduced the main structure of the model, we now describe the expressions of the two factors of the joint probability distribution in Eq. (1).

Modeling structural information

We model the structural information A by assuming that latent communities control the observed interactions. For this, we utilize the Hy-MMSBM probabilistic model19, which assumes mixed memberships where nodes can belong to multiple communities. This model flexibly captures various community structures (e.g., assortative, core-periphery, etc.), scales to large hyperedge sizes and allows incorporating covariates without compromising its computational efficiency, as we explain in the Methods section.

Assuming K overlapping communities, u is an N × K non-negative membership matrix, which describes the community membership for each node i = 1, …, N. A symmetric and non-negative K × K affinity matrix w controls the density of hyperedges between nodes in different communities. The hypergraph is modeled as a product of Poisson distributions as:

$${P}_{A}(A| u,w)={\prod}_{e\in \Omega }\,{{{\rm{Pois}}}}\left({A}_{e};\frac{{\lambda }_{e}}{{k}_{e}}\right),$$
(2)

where

$${\lambda }_{e}={\sum}_{i < j:i,j\in e}{u}_{i}^{T}w{u}_{j}={\sum}_{i < j:i,j\in e}{\sum}_{k,q=1}^{K}{u}_{ik}{u}_{jq}{w}_{kq}.$$
(3)

The term ke is a normalization constant, which can take on any positive value. In all our experiments we set its value to \({k}_{e}=\frac{| e| (| e| -1)}{2}\left(\begin{array}{c}N-2\\ | e| -2\end{array}\right)\), with ∣e∣ being the size of the hyperedge. Other parametrizations of the likelihood PA(A ∣ u, w) are possible, e.g., using different generative models for hypergraphs with community structure18,20, but it is not guaranteed that these would yield closed-form expressions and computationally efficient algorithms when incorporating additional attribute information in the probabilistic model. Similarly, in Eq. (2) we assumed conditional independence between hyperedges given the latent variables, a standard assumption in these types of models. Such a condition could in principle be relaxed following the approaches of refs. 27,28,29. We do not explore this here.
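
As an illustration, a minimal (non-optimized) computation of the Poisson rate λe of Eq. (3) and of the normalization ke used in our experiments could read as in the sketch below, assuming u and w are NumPy arrays of shape N × K and K × K; the function names are hypothetical.

```python
import numpy as np
from math import comb

def edge_rate(e, u, w):
    """lambda_e = sum over unordered pairs i < j in e of u_i^T w u_j (Eq. 3)."""
    nodes = list(e)
    lam = 0.0
    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            lam += u[nodes[a]] @ w @ u[nodes[b]]
    return lam

def normalization(e, N):
    """k_e = |e| (|e| - 1) / 2 * binom(N - 2, |e| - 2), the choice used in our experiments."""
    s = len(e)
    return s * (s - 1) / 2 * comb(N - 2, s - 2)

# The Poisson parameter of hyperedge e in Eq. (2) is then edge_rate(e, u, w) / normalization(e, N).
```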

Modeling attribute information

We model the covariates X assuming that the community memberships u regulate how these are assigned to nodes. We then assume that a K × Z matrix β with entries βkz regulates the contribution of attribute z to the community k. This parameter plays a similar role for the matrix X as the matrix w does for the vector A. We combine the matrix β with the community assignment u via a matrix product that yields the following Bernoulli probabilities:

$${\pi }_{iz}={\sum}_{k=1}^{K}{u}_{ik}\,{\beta }_{kz}.$$
(4)

We assume that attributes are conditionally independent given the parameters π, which allows flexibly modeling several discrete attributes at a time. This is implemented by assuming that each entry Xiz is extracted from a Bernoulli distribution with parameter πiz as:

$${P}_{X}(X| u,\beta )={\prod}_{i=1}^{N}{\prod}_{z=1}^{Z}{\pi }_{iz}^ {{\, x}_{iz}}{(1-{\pi }_{iz})}^{(1-{x}_{iz})}.$$
(5)

To ensure πiz ∈ [0, 1], we constrain uik ∈ [0, 1] and \({\sum}_{k=1}^{K}{\beta }_{kz}=1\),  ∀ z.
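
A direct transcription of Eqs. (4)-(5) is sketched below; it assumes u with entries in [0, 1] and β with columns summing to one, so that every πiz lies in [0, 1], and the small clipping constant is only for numerical safety.

```python
import numpy as np

def attribute_log_likelihood(X, u, beta, eps=1e-12):
    """Log of Eq. (5): Bernoulli likelihood of the attribute matrix X given u and beta."""
    pi = u @ beta                       # Eq. (4): pi_iz = sum_k u_ik beta_kz, shape N x Z
    pi = np.clip(pi, eps, 1.0 - eps)    # numerical safety at the boundaries 0 and 1
    return np.sum(X * np.log(pi) + (1 - X) * np.log(1.0 - pi))
```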

We focus here on discrete and unordered attributes. This covers many relevant scenarios, including the ones we study in the several real datasets below, e.g., roles of employees in a company or classes of students. Other specific cases could be treated using ideas and techniques similar to the ones we propose, by suitably modifying the distribution in Eq. (5). We give an example of imposing categorical attributes, where we explicitly enforce that a node taking one value of an attribute excludes all other possible values, in Supplementary Note C.

Inference of latent variables

Having defined the probabilistic model in Eq. (1) and the two distributions in Eqs. (2) and (5), our goal is now to infer the latent variables u, w and β, given the observed hypergraph A and the attributes X. To infer these values we consider maximum likelihood estimation and use an efficient expectation-maximization (EM) algorithm that exploits the sparsity of the dataset, as detailed in the Methods section. We combine the log-likelihoods of the two sources of information with a parameter γ that tunes their relative contribution, with extreme values γ = 0 ignoring the attributes and γ = 1 ignoring the structure, similarly to what has been done in attributed network models9,10,11, or in models for information retrieval from text30,31. In our experiments, we learn the γ hyperparameter from the data via cross-validation.

Overall, the inference routine scales favorably with both the system size and the size of the hyperedges, as each EM iteration has a complexity of \(O\left(K(K+Z)(N+| E| )\right)\), which is linear in the number of nodes and hyperedges. We refer to our model as HyCoSBM and make the code available online at github.com/badalyananna/HyCoSBM.

Detecting communities in synthetic networks

Our first experiments are tests on synthetic networks with known ground-truth community structure and attributes. We generate synthetic hypergraphs using Hy-MMSBM32 as implemented in the library HGX33. We select parameter settings where inference with Hy-MMSBM is not trivial, to better assess the influence of using attributes, see details in Supplementary Note A. After the networks are created, we generate discrete attributes that match the community membership a fraction ρ of the time, while the remaining fraction 1 − ρ are randomly generated. This allows us to vary the extent to which attributes correlate with communities and hence the difficulty of inferring the ground-truth memberships. We varied ρ ∈ [0.1, 0.9], with higher values implying that inference of communities is aided by more informative attributes.
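
The attribute-generation step can be summarized by the following sketch, assuming hard ground-truth labels in {0, …, K − 1} and Z = K attribute values; function and variable names are hypothetical.

```python
import numpy as np

def generate_attributes(gt_labels, Z, rho, seed=None):
    """One-hot attributes matching the ground-truth community with probability rho."""
    rng = np.random.default_rng(seed)
    gt_labels = np.asarray(gt_labels)
    N = len(gt_labels)
    attr = gt_labels.copy()
    shuffled = rng.random(N) > rho                  # roughly a fraction 1 - rho of the nodes
    attr[shuffled] = rng.integers(0, Z, size=shuffled.sum())
    X = np.zeros((N, Z), dtype=int)
    X[np.arange(N), attr] = 1                       # one-hot encoding
    return X
```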

As a performance metric, we measure the cosine similarity between the membership vectors recovered by our model and the ground truth ones. In Fig. 1 we can see that, when the attributes are correlated with ground truth communities, HyCoSBM performs better than using either of the two types of information alone. In addition, the performance of HyCoSBM increases monotonically with increasing correlation between attributes and ground truth. Although this is observed also when using attributes alone, the performance of HyCoSBM in recovering the ground truth communities is always higher.
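
For reference, one common way to compute this metric is the average row-wise cosine similarity between the inferred and ground-truth membership matrices, assuming the community columns have been aligned (e.g., via a permutation that maximizes the similarity); a minimal version is sketched below.

```python
import numpy as np

def mean_cosine_similarity(u_inferred, u_true, eps=1e-12):
    """Average cosine similarity between corresponding rows of two N x K membership matrices."""
    num = np.sum(u_inferred * u_true, axis=1)
    den = np.linalg.norm(u_inferred, axis=1) * np.linalg.norm(u_true, axis=1)
    return float(np.mean(num / (den + eps)))
```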

Fig. 1: Community detection in synthetic hypergraphs.

We show the cosine similarity between the communities inferred by the various algorithms and the ground truth communities in synthetic hypergraphs, with N = 500 and E = 2720. We show results for different numbers of communities K (from left to right). The number of attributes Z is selected to be equal to K, and the parameter γ is set equal to the fraction ρ of unshuffled attributes. We compare HyCoSBM with Hy-MMSBM, which serves as a baseline that only employs structural information. We also measure the cosine similarity between the attribute matrix X and the ground truth membership matrix u (Only attributes). Lines and shades around them are averages and standard deviations over 10 different network realizations.

This behavior is consistent across different values of K, with a larger performance gap between results at low and high ρ for larger K, where there are more choices to select from.

In short, these results demonstrate that the model is successfully using both attribute and structural information to improve community detection.

Results on empirical data

We analyze hypergraphs derived from empirical data drawn from social, political, and biological domains, as detailed in the Methods section. For each hypergraph we describe a different experiment, to illustrate various applications of our method. We select the number of communities K and the hyperparameter γ using 5-fold cross-validation. To assess the impact of using attributes, we compare HyCoSBM with three baselines: (i) Hy-MMSBM, which only utilizes the structural information in the hyperedges to detect mixed-membership communities; (ii) HyCoSBM with γ = 0, which is equivalent to not utilizing the attributes; (iii) HyCoSBM with community assignments u fixed to match the attributes, inferring only the w parameters, which tests how attributes alone perform. Notice that (i) and (ii) differ in that the membership vectors u are unconstrained in Hy-MMSBM, while they are restricted to uik ∈ [0, 1] in our model. In (iii), utilizing HyCoSBM or Hy-MMSBM is equivalent, since the two models coincide in the updates for w. The results of the following analyses are summarized in Table 1.

Table 1 AUC scores on real datasets

Additionally, in Supplementary Note D we show the advantage of using a hypergraph representation by comparing against results obtained by running a probabilistic model9 valid on attributed pairwise networks on a clique expansion, as an example dyadic representation of the datasets considered here (we refer to this approach as Clique-Exp). Notice that models valid only on pairwise data do not have a natural expression to measure the probability of a hyperedge of size larger than two. Hence one has to make an arbitrary choice on how to assign this probability from that obtained on pairwise edges. We show results for an example of this choice in the Supplementary Material. HyCoSBM shows a strong performance in predicting hyperedges, outperforming Clique-Exp in all datasets except two contact datasets of students in schools, where performance is similar. Importantly, Clique-Exp is limited when applied to a biological dataset with large hyperedges, as the corresponding clique expansion contains a much larger number of edges and thus creates a computational bottleneck. Overall, in the datasets considered here, we find no indication that resorting to dyadic clique expansions is advantageous, either in terms of prediction performance or of runtime efficiency.

Recovering interactions on contact dataset

In our first experiment we study human contact interactions, using the data obtained from wearable sensor devices in four settings8,34,35,36,37: students in a high school (High School) and a primary school (Primary School), co-workers in a workplace (Workplace) and patients and staff in a hospital (Hospital). Hyperedges represent a group of people that were in close proximity at some point in time. Each dataset contains attributes that describe either the classes, the departments, or the roles the nodes belong to.

We measure the ability of our model to explain group interactions by assessing its performance on a hyperedge prediction task. To this end, we infer the parameters using only a fraction of the hyperedges in the dataset. Then we utilize the held out hyperedges to measure the AUC metric, which represents the fraction of times the model predicts an observed interaction as more likely than a non-observed one (higher values mean better performance).

Models that do not utilize any attribute have been previously shown to perform well on such a task on these datasets18,19,20 when a large fraction of the dataset was given as input. Here, we vary the amount of structural information available to the algorithms more substantially, to assess their robustness in realistic situations where the full data is unavailable, and investigate how making use of attributes can compensate for this. To simulate this setting, we delete an increasing fraction of the existing hyperedges (keeping the hypergraph connected) and perform 5-fold cross-validation on the remaining dataset.

The results in Fig. 2 show a significant and monotonic drop in performance for Hy-MMSBM as we decrease the fraction of hyperedges, consequently reducing the amount of structural information available to the algorithm. In contrast, HyCoSBM maintains an almost constant and high performance, all the way down to having access only to 20% of the hyperedges, owing to its usage of the additional attribute information. In addition, even in the favorable setting when all hyperedges are available, HyCoSBM yields higher AUC in Workplace (with γ = 0.995), indicating that incorporating attributes can be beneficial even when robust results are obtained using structural information alone.

Fig. 2: Predicting interactions in close-proximity datasets with partial observations.

We show the performance of various methods in hyperedge prediction tasks, measured by AUC, as we vary the fraction of hyperedges made available to the algorithms. This plot shows that the performance of HyCoSBM remains high when fewer hyperedges are available in input, while that of the algorithms which do not use any attribute drops. Lines and shades around them are averages and standard deviations over 5 cross-validation folds.

Focusing on other datasets where HyCoSBM attains AUC similar to that of other algorithms when all the interactions are utilized, we still observe a difference in the types of communities detected. As an example, in the High School dataset the community assignments u inferred via Hy-MMSBM have cosine similarity of 0.59 with the class attribute of the nodes, as opposed to the cosine similarity of 0.94 observed for HyCoSBM.

These different levels of correlation between inferred communities and attributes, together with observing similar AUC (indicating a similar ability to explain the structural information), could be explained by the presence of competing network divisions, as already observed in network datasets12,38,39. Our model allows selecting among divisions, finding ones that correlate with the attribute of interest.

We highlight that, although the communities inferred by HyCoSBM correlate with the attributes, the two are not equivalent. In fact, we observe several cases where the number of detected communities differs from the number of attributes. For example, we observe cases where the model detects fewer communities than the number of attributes available. In Fig. 3, when 50% of the hyperedges are given in input, our model includes the nodes with attribute SFLE (green) within the community formed mainly by DISQ nodes (purple). This partition achieves a higher AUC than the model with community assignments fixed and equal to the attributes. In other cases, our model finds smaller communities within the bigger partitions determined by the attributes. We find such an example in the High School dataset in Supplementary Fig. 1, where HyCoSBM finds finer partitions (K = 11) than the one given by the Z = 9 classes, hierarchically splitting some classes into subgroups. The resulting partition attains a high AUC score. A high number of inferred communities is also observed with Hy-MMSBM, but, in this case, the AUC drops significantly, and the K = 30 communities inferred at 30% of the hyperedges are much more mixed between the classes. In short, the communities inferred by our model do not simply replicate the attributes. Rather, this additional information is used to infer a community structure that better explains the interactions observed in the data.

Fig. 3: Communities detected in a Workplace dataset from partial observations of close-proximity interactions.

We vary the fraction of hyperedges given in input to the algorithms (top: 100%, bottom: 50%) and compare the inferred communities against the attribute department (top left). The AUC barplot (bottom left) shows the performance of the models in hyperedge prediction. Bars and error bars are averages and standard deviations over 5 cross-validation folds. This plot shows that HyCoSBM is able to use the attributes effectively to keep performance high even at a low fraction of input observations.

Performance with uninformative attributes

In the previous sections, we have shown how attribute information can aid the recovery of effective communities and improve inference. In general, though, we cannot expect that any type of attribute added to a network dataset will help explain the observed structure. This may be the case for instance when an attribute is uncorrelated or weakly correlated with the hyperedges, as in the synthetic experiments described above when ρ is close to 0.1.

In this section we study the performance of HyCoSBM in this adversarial regime and show that, when attributes are uninformative, these are readily discarded by our model to only perform inference based on structural information.

To this end, we feed the sex and has facebook attributes, respectively from the Primary School and High School datasets, into our model. As we show in Fig. 4, the performance of HyCoSBM closely resembles that of the models that do not use any attribute in input, signaling that these attributes are not as informative as class to explain the observed group interactions. This is reinforced by a very low AUC for the model that fixes u as the attributes (red line).

Fig. 4: AUC on contacts dataset with partial hyperedges: uncorrelated attributes.

Using sex and has facebook as the attributes, the performance of all models drops as the hyperedges are removed. Lines and shades around them are averages and standard deviations over 5 cross-validation folds.

We further illustrate this point in four datasets of US representatives. Here, nodes are representatives (in the House of Representatives or in the Senate) and hyperedges represent co-sponsorship of bills (Bills datasets) or co-participation in a committee during a Congress meeting (Committees datasets). The attribute indicates whether the representative is associated with the Republican or Democratic party (Z = 2). In Table 2 we show that there is no advantage in using this binary attribute to explain either the co-sponsorship or the co-participation patterns, as the AUC is similar to that of models that do not use attribute information in input. As a confirmation, the value of γ obtained via cross-validation is equal to 0 in three out of four cases, and 0.1 in one case, showing that the algorithm tends to discard the attribute information and prefers to rely solely on structural data.

Table 2 AUC scores on co-sponsorship and co-participation datasets of US representatives

Improving prediction of Gene-Disease associations

Our next application is on a biological dataset containing Gene-Disease associations40. Here, nodes represent genes, and hyperedges represent a combination of genes specific to a disease. For each node, its Disease Pleiotropy Index (DPI) is available as an attribute, indicating the tendency of a gene to be associated with many types of diseases, with Z = 25 possible discrete values. The dataset is highly sparse, as many nodes are present only in one hyperedge. Previous results have shown that inferring missing associations improves considerably when using all hyperedges in the dataset19 (with AUC scores up to 0.84), compared to using only hyperedges up to size D = 2518. Here, we investigate whether these results can be further improved when additional information is available in the form of the DPI attribute. We find that running HyCoSBM achieves an AUC score of 0.9, indicating that this attribute is informative. Furthermore, we observe that the communities detected by HyCoSBM are similar to those obtained from the attributes, see Fig. 5A, but with a finer division into K = 30 communities, which is larger than the Z = 25 covariate categories.

Fig. 5: Cosine similarity and AUC in a Gene Disease dataset.

A Cosine similarity between the three types of communities: attribute, HyCoSBM and Hy-MMSBM. B AUC in predicting missing hyperedges. Bars and error bars are averages and standard deviations over 5 cross-validation folds. The membership u detected by HyCoSBM correlates with the DPI attribute and achieves higher AUC than both Hy-MMSBM and the model trained with u fixed as the attribute.

Recovering core-periphery structure with Enron Email dataset

In this section we focus on the application of our methodology to the Enron Email dataset41, where nodes represent employees of an organization and hyperedges represent email exchanges. In particular, the dataset comes with the annotation of nodes being either part of a “core”, which contains employees sending batch emails, or a “periphery”, containing the receivers. A hyperedge represents one email batch, and it contains both the core sender and all the periphery receivers. Here, we focus on the study of the core-periphery attribute to predict higher-order interactions in the data.

In this dataset, the core-periphery generative process behind the data is partially known. Hence, it does not come as a surprise that using the “core” and “periphery” labels improves inference and reconstruction. However, our results using HyCoSBM reveal both a more effective and a more nuanced interpretation than that given by using the labels alone. This is because HyCoSBM does not simply replicate the attributes of the nodes, but rather exploits them to achieve an improved inference. To test this hypothesis, we compare three inference scenarios: the vanilla HyCoSBM inference, a constrained version where the membership matrix u is fixed and equal to the core-periphery assignments, and the Hy-MMSBM algorithm. These achieve AUC scores of 0.99, 0.95, and 0.91, respectively.

As can be observed in Fig. 6, Hy-MMSBM finds an assortative structure dividing the nodes into 2 groups, which is also the number of attributes. Instead, HyCoSBM divides the nodes into three groups: groups 0 and 1, which interact with each other in a disassortative fashion, and group 2, which behaves assortatively. We also observe that the large majority of nodes with mixed membership spread across these three groups are core nodes, while periphery nodes mainly have a non-zero membership in group 1. As a result, HyCoSBM unveils a finer-grained division of the core, revealing patterns within it that cannot be inferred by observing the (hard) membership given by the attributes themselves. This is also shown by the inferred w matrix, where core nodes interact mainly with themselves and only partially with periphery nodes when we fix u equal to the labels.

Fig. 6: Communities and structure detected in the Enron-Email dataset.

A, B, C The u matrices inferred by Hy-MMSBM, HyCoSBM and the attributes matrix. The rows ui for Hy-MMSBM are shown normalized as ui/∑kuik for better clarity. D, E, F The w matrices inferred by Hy-MMSBM, HyCoSBM and by HyCoSBM when fixing u to the attributes matrix. G Zooming in the u matrix for HyCoSBM to highlight the mixed-membership of core nodes. We notice how core nodes have mixed-membership spanning two or three groups. Periphery nodes instead mostly belong only to Community 1 (not shown here).

In summary, HyCoSBM effectively leverages the data attributes to inform the inference procedure. It does so by exploiting the additional information to extract informative structure and unveiling finer structure than the one given by the observed attributes alone.

Predicting co-destination patterns in New York City taxi rides

As a final application, we consider a dataset of taxi rides in New York City42. We are interested in measuring patterns of similar destinations, based on travel demands. For this, we consider a given time window and a day of the week and build a hypergraph where nodes are dropoff locations and a hyperedge connects dropoffs that were reached by travellers starting from the same pickup location. Data of this form is often used in urban planning and to understand human mobility co-location patterns43,44. The only node attribute available from the data is the “Borough” type (the basic administrative unit in the city of New York), which we utilize in the following experiments. In addition to the existing five boroughs, the dataset also contains Newark airport as a location, which we assign to a sixth attribute value.

We study examples of such a network by considering the week from Saturday November 04 2023 until Tuesday November 14 2023, and building two hypergraphs relative to two different time windows: (i) Monday and Tuesday 06-07.11 between 17.00 and 20.00, and (ii) Saturday and Sunday 04-05.11 between 00.00 and 03.00. These two 3-h time windows are selected to consider diverse travel needs. We expect the first one to capture commuters, the second to capture entertainment and nightlife. We obtain two hypergraphs with N = 214 nodes, E = 523 and 476 hyperedges, and maximum hyperedge sizes of 132 and 125, respectively, see Table 3.

Table 3 Statistics of the hypergraphs obtained from NYC taxi rides

We assess how informative HyCoSBM is in representing co-destination taxi trips by comparing it with other approaches in the task of predicting future co-destination locations. Specifically, we train a model on the two datasets described above, and perform hyperedge prediction on analogous datasets built from taxi rides taking place in subsequent days of the same week, in two different time windows. While we expect travel demands to vary with time and day, we also expect correlations that can be exploited because of the intrinsic nature of different destination locations, which could attract similar types of passengers at different times and days. To better test this hypothesis, we devise an experiment where for each existing hyperedge e in a given test set (e.g., for the taxi trips of Wednesday and Thursday between 00.00 and 03.00), we extract a non-existing hyperedge \(\hat{e}\) by making a minimal change to e. Specifically, we select one node i ∈ e at random and switch it with another node j ∈ V ∖ e, also selected at random. We refer to this procedure as switch-one-out (SOO). In this way we make the task more difficult, as all nodes but one coincide in e and \(\hat{e}\). In terms of the Jaccard similarity, we have \(J(e,\hat{e})=| e\cap \hat{e}| /| e\cup \hat{e}|=(| e| -1)/(| e|+1)\). This construction of the negative test data aims at building challenging comparisons, as the prediction of true positives becomes more difficult due to e and \(\hat{e}\) being similar. We first run cross-validation on the training datasets and choose the best parameters. Then we analyze the results using test datasets generated from a different day.
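
A minimal sketch of the SOO procedure is given below; it assumes hyperedges are represented as sets of node indices in {0, …, N − 1}, and the function name is hypothetical.

```python
import random

def switch_one_out(e, N, rng=random):
    """Return a negative hyperedge obtained from e by swapping one member with an outside node."""
    e = set(e)
    i = rng.choice(sorted(e))                        # node to remove, chosen at random from e
    outside = [j for j in range(N) if j not in e]
    j = rng.choice(outside)                          # node to add, chosen at random from V \ e
    return (e - {i}) | {j}
```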

Observing Fig. 7, we find that HyCoSBM achieves a strong performance in predicting future co-destinations consistently across test time frames, significantly higher than that of the two comparison methods, Hy-MMSBM and Clique-Exp. The performance gap is larger when predicting co-destinations using the time window of Monday and Tuesday between 17.00 and 20.00 as training set, where the other two approaches have much lower AUC, with Hy-MMSBM attaining higher values than Clique-Exp. We find that HyCoSBM detects communities that are partially aligned with the “Borough” attribute (not shown here). Furthermore, we observe that various nodes are assigned mixed memberships spread over more than one community, and that several communities comprise nodes from different boroughs.

Fig. 7: Predicting co-destination taxi rides in New York City.

We report the AUC calculated running SOO hyperedge prediction on hypergraph test sets built considering taxi rides taking place in various time frames subsequent to the two used to train the algorithm. Test time frames with time window between 17.00-20.00 (A, B) and 00.00-03.00 (C, D), for the two training sets (left-right). Hatched bars denote the performance on the training sets; bars and error bars are averages and standard deviations over 10 random realizations of the SOO procedure.

Discussion

We have analyzed how node attributes can be used to guide investigations of higher-order data. We focused on the problem of community detection, introducing a mixed-membership probabilistic generative model for hypergraphs. Our model can explicitly incorporate both hyperedges and node attributes, and find more expressive community partitions by exploiting the combination of these information sources.

We have applied our model to a variety of social, political and biological hypergraphs, showing how prediction of missing interactions can be boosted by the addition of informative attributes, in particular in the regime of incomplete or noisy data. We have also illustrated various scenarios where attributes can be used to select between competing divisions, or cases where they are not informative and can be discarded.

There are a number of possible extensions of this work. One could include additional attribute types, such as attributes on hyperedges, continuous variables or vector variables, for instance considering recent approaches for attributed networks45. Similarly, one could consider alternative probabilistic expressions for the structural data, but this would require efforts to derive closed-form updates and maintain a low computational complexity. On a related note, our model is based on the assumption that attributes and structure are independent conditionally on the latent variables. This approach is rather general, as the latent variables can potentially take on different semantics. It would be interesting to study other types of dependencies between structure and attributes, as well as to investigate in more depth the validity of the conditional independence assumptions on both hyperedges and attributes. Finally, our model might be extended to consider dynamical hypergraphs, where communities and interactions can change in time, and to assess what role attributes play in this case.

Methods

Inference of the latent variables

The likelihood of HyCoSBM factorizes over all hyperedges e ∈ Ω, and single hyperedges are modeled with a Poisson distribution:

$${P}_{A}({A}_{e}| u,w)={{{\rm{Pois}}}}\left({A}_{e};\frac{{\lambda }_{e}}{{k}_{e}}\right).$$
(6)

Similarly, the probability of attributes factorizes into Bernoulli probabilities:

$${P}_{X}(X| u,\beta )={\prod }_{i=1}^{N}\mathop{\prod}_{z=1}^{Z}{\pi }_{iz}^{{\, x}_{iz}}{(1-{\pi }_{iz})}^{(1-{x}_{iz})}.$$
(7)

Under the Poisson distribution in Eq. (6), it can be shown that the log-likelihood LA(uw) of the full hypergraph evaluates to

$${L}_{A}(u,w)=-C{\sum}_{i < j\in V}{u}_{i}^{T}w{u}_{j}+{\sum}_{e\in E}{A}_{e}\log {\sum}_{i < j\in e}{u}_{i}^{T}w{u}_{j},$$
(8)

where \(C={\sum}_{d=2}^{D}\left(\begin{array}{c}N-2\\ d-2\end{array}\right)\frac{1}{{k}_{d}}\) and D is the maximum hyperedge size observed19. Instead, Eq. (7) yields the log-likelihood

$${L}_{X}(u,\beta )= {\sum }_{i=1}^{N}{\sum }_{z=1}^{Z}{x}_{iz}\log \left({\sum }_{k=1}^{K}{u}_{ik}\,{\beta }_{kz}\right)\\ +{\sum }_{i=1}^{N}{\sum }_{z=1}^{Z}(1-{x}_{iz})\log \left({\sum }_{k=1}^{K}(1-{u}_{ik})\,{\beta }_{kz}\right).$$
(9)

As we assumed conditional independence of the network part and the attributes part, the total log-likelihood becomes the sum of those two terms. In practice though, performance improves by introducing a balancing parameter γ ∈ [0, 1] that tunes the relative contribution of the two terms9,11,30,31, yielding a total log-likelihood as:

$$L(u,w,\beta )=(1-\gamma )\,{L}_{A}(u,w)+\gamma \,{L}_{X}(u,\beta ).$$
(10)

The value of γ is not known a priori, and it can be learned from the data using standard techniques for hyperparameter learning. In our experiments, we utilize cross-validation. The γ parameter is necessary to better balance the contribution of the structural and covariate information, as the magnitudes of the two log-likelihood terms can be on different scales, with the risk of biasing the total likelihood maximization towards one of the two terms. This balancing is also useful when attribute data are more (or less) reliable than structural data, for instance when we believe that one is less (or more) subject to noise. Furthermore, γ is reminiscent of the hyperparameters of approaches that adjust inference based on prior distributions on the community assignments, as done in some attributed network models, e.g., refs. 12,13.

We note here that the value of γ has a clear interpretation only for the extreme cases of 0 or 1, which discards entirely the contribution of one of the two terms. In all the other intermediate cases, its value is not simply interpreted as a percentage contribution of the attributes over the network. This is because γ balances the magnitudes of two likelihood terms. In general, the network part is much larger than the attribute one, which draws γ to values closer to 1, e.g., 0.995, to compensate for the difference in scales. This does not necessarily mean that the network information is barely used, but rather that it has to be rescaled to allow the attribute information to be effectively considered.

As a final remark, our definition of X allows modeling several discrete attributes at the same time, and the dimension Z is the total number of values, including all the attribute types. Formally, \(Z={\sum }_{p=1}^{P}{z}_{p}\), where P is the number of attribute types (e.g., age and class would give P = 2), and zp is the number of discrete values an attribute of type p can take. Alternatively, the presence of more than one attribute can be modeled by considering separate terms LX, each with a different multiplier γ. While this formulation would allow tuning the contribution of each attribute more specifically, it comes at the price of higher model complexity (in case of using different expressions for the LX) or higher computational complexity, as one needs to cross-validate more than one γ. We do not explore this here.
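
For concreteness, the sketch below assembles the objective of Eq. (10) from direct (non-optimized) transcriptions of Eqs. (8) and (9); the variable names are illustrative, hyperedges is a list of node tuples, A their weights, and u, w, beta NumPy arrays with the shapes defined above.

```python
import numpy as np

def structural_ll(hyperedges, A, u, w):
    """Eq. (8), assuming the normalization k_d = d (d - 1) / 2 * binom(N - 2, d - 2) used above."""
    N = u.shape[0]
    D = max(len(e) for e in hyperedges)
    # C = sum_{d=2}^{D} binom(N - 2, d - 2) / k_d; with this choice of k_d the binomial cancels.
    C = sum(2.0 / (d * (d - 1)) for d in range(2, D + 1))
    pairwise = u @ w @ u.T                                  # u_i^T w u_j for all pairs (i, j)
    first = -C * np.sum(np.triu(pairwise, k=1))             # sum over i < j in V
    second = 0.0
    for e, A_e in zip(hyperedges, A):
        nodes = list(e)
        lam = sum(pairwise[nodes[a], nodes[b]]
                  for a in range(len(nodes)) for b in range(a + 1, len(nodes)))
        second += A_e * np.log(lam)
    return first + second

def attribute_ll(X, u, beta, eps=1e-12):
    """Eq. (9)."""
    pos = np.clip(u @ beta, eps, None)
    neg = np.clip((1.0 - u) @ beta, eps, None)
    return np.sum(X * np.log(pos) + (1 - X) * np.log(neg))

def total_ll(hyperedges, A, X, u, w, beta, gamma):
    """Eq. (10): L = (1 - gamma) * L_A + gamma * L_X."""
    return (1.0 - gamma) * structural_ll(hyperedges, A, u, w) + gamma * attribute_ll(X, u, beta)
```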

Variational lower bound

To maximize the total log-likelihood in Eq. (10) we adopt a standard variational approach to lower bound the summation terms inside the logarithm. Introducing the probability distributions \({\rho }_{ijkq}^{(e)},{h}_{izk}\) and \({h}_{izk}^{{\prime} }\) and using Jensen’s inequality \(\log {\mathbb{E}}[x]\ge {\mathbb{E}}[\log x]\), we get the following lower bounds:

$${\sum}_{e\in E}{A}_{e}{\sum}_{i < j\in e}\log {\sum}_{k,q=1}^{K}\left({u}_{ik}{u}_{jq}{w}_{kq}\right)\ge {\sum}_{e\in E}{A}_{e}{\sum}_{i < j\in e}{\sum}_{k,q=1}^{K}{\rho }_{ijkq}^{(e)}\log \left(\frac{{u}_{ik}{u}_{jq}{w}_{kq}}{{\rho }_{ijkq}^{(e)}}\right);$$
(11)
$${\sum}_{i=1}^{N}{\sum}_{z=1}^{Z}{x}_{iz}\log \left({\sum}_{k=1}^{K}{u}_{ik}{\beta }_{kz}\right)\ge {\sum}_{i=1}^{N}{\sum}_{z=1}^{Z}{x}_{iz}{\sum}_{k=1}^{K}{h}_{izk}\log \left(\frac{{u}_{ik}{\beta }_{kz}}{{h}_{izk}}\right);$$
(12)
$${\sum}_{i=1}^{N}{\sum}_{z=1}^{Z}(1-{x}_{iz})\log \left({\sum}_{k=1}^{K}(1-{u}_{ik}){\beta }_{kz}\right)\ge {\sum}_{i=1}^{N}{\sum}_{z=1}^{Z}(1-{x}_{iz}){\sum}_{k=1}^{K}{h}_{izk}^{{\prime} }\log \left(\frac{(1-{u}_{ik}){\beta }_{kz}}{{h}_{izk}^{{\prime} }}\right);$$
(13)

with equality reached when

$${\rho }_{ijkq}^{(e)}=\frac{{u}_{ik}{u}_{jq}{w}_{kq}}{{\lambda }_{e}};$$
(14)
$${h}_{izk}=\frac{{\beta }_{kz}{u}_{ik}}{{\sum}_{{k}^{{\prime} }}{\beta }_{{k}^{{\prime} }z}{u}_{i{k}^{{\prime} }}};$$
(15)
$${h}_{izk}^{{\prime} }=\frac{{\beta }_{kz}(1-{u}_{ik})}{{\sum}_{{k}^{{\prime} }}{\beta }_{{k}^{{\prime} }z}(1-{u}_{i{k}^{{\prime} }})};$$
(16)

respectively.
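
A direct transcription of the optimal variational parameters in Eqs. (14)-(16) is sketched below; shapes follow the conventions above (u: N × K, w: K × K, β: K × Z), function names are hypothetical, and the small eps constants only guard against division by zero.

```python
import numpy as np

def update_rho(e, u, w, eps=1e-12):
    """Eq. (14): rho^{(e)}_{ijkq} for one hyperedge e, as a dict over pairs (i, j) with i < j."""
    nodes = sorted(e)
    lam = sum(u[i] @ w @ u[j]
              for a, i in enumerate(nodes) for j in nodes[a + 1:])
    return {(i, j): np.outer(u[i], u[j]) * w / (lam + eps)       # one K x K array per pair
            for a, i in enumerate(nodes) for j in nodes[a + 1:]}

def update_h(X, u, beta, eps=1e-12):
    """Eqs. (15)-(16): h_izk and h'_izk, returned as N x Z x K arrays."""
    num = u[:, None, :] * beta.T[None, :, :]                     # u_ik * beta_kz
    h = num / (num.sum(axis=2, keepdims=True) + eps)
    num_p = (1.0 - u)[:, None, :] * beta.T[None, :, :]           # (1 - u_ik) * beta_kz
    h_prime = num_p / (num_p.sum(axis=2, keepdims=True) + eps)
    return h, h_prime
```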

Plugging Eq. (11) into Eq. (8) yields a lower bound \({{{{\mathcal{L}}}}}_{A}\) of the structural log-likelihood

$${{{\mathcal{L}}}}_{A}(u,w,\rho )= -C\sum\limits_{i < j\in V}{u}_{i}^{T}w{u}_{j}\\ + \sum\limits_{e\in E}{A}_{e}\sum\limits_{i < j\in e}{\sum }_{k,q=1}^{K}{\rho }_{ijkq}^{(e)}\log \left(\frac{{u}_{ik}{u}_{jq}{w}_{kq}}{{\rho }_{ijkq}^{(e)}}\right).$$
(17)

Similarly, Eqs. (12)–(13) yield a lower bound \({{{{\mathcal{L}}}}}_{X}\) of the log-likelihood of the attributes:

$${{{\mathcal{L}}}}_{X}(u,\beta,h,{h}^{{\prime} })= {\sum }_{i=1}^{N}{\sum }_{z=1}^{Z}{x}_{iz}{\sum }_{k=1}^{K}{h}_{izk}\log \left(\frac{{u}_{ik}{\beta }_{kz}}{{h}_{izk}}\right)\\ + {\sum }_{i=1}^{N}{\sum }_{z=1}^{Z}(1-{x}_{iz}){\sum }_{k=1}^{K}{h}_{izk}^{{\prime} }\log \left(\frac{(1-{u}_{ik}){\beta }_{kz}}{{h}_{izk}^{{\prime} }}\right),$$
(18)

so that

$${{{\mathcal{L}}}}:=(1-\gamma ){{{{\mathcal{L}}}}}_{A}+\gamma {{{{\mathcal{L}}}}}_{X},$$
(19)

is a lower bound of the full log-likelihood.

Expectation-maximization

We now aim to optimize the variational lower bound in Eq. (19) with respect to the model parameters u, w and β. To account for the constraints on β and u, we introduce the Lagrange multipliers λ(β) and λ(u), obtaining the following objective:

$${{{{\mathcal{L}}}}}_{constr}:={{{\mathcal{L}}}}-{\sum}_{z=1}^{Z}{\lambda }_{z}^{(\beta )}\left({\sum}_{k=1}^{K}{\beta }_{kz}-1\right)-{\sum}_{i=1}^{N}{\sum}_{k=1}^{K}{\lambda }_{ik}^{(u)}{u}_{ik}.$$
(20)

We proceed as in the EM algorithm46, by alternating two optimization steps until convergence. In one step, we maximize Eq. (20) with respect to the model parameters u, w, β and the Lagrange multipliers λ(β), λ(u). In the other, we utilize the closed-form updates in Eqs. (14)–(16) for the variational parameters. The procedure is described in detail in Algorithm 1.

Differentiating the objective in Eq. (20) with respect to the parameters w, β and the multipliers λ(β) yields the following closed-form updates:

$${w}_{kq}=\frac{{\sum }_{e\in E}{A}_{e}{\sum }_{i < j\in e}\, {\rho }_{ijkq}^{(e)}}{C{\sum }_{i < j\in V}{u}_{ik}{u}_{jq}},$$
(21)
$${\beta }_{kz}=\frac{{\sum }_{i}({x}_{iz}{h}_{izk}+(1-{x}_{iz}){h}_{izk}^{{\prime} })}{{\sum }_{i,{k}^{{\prime} }}({x}_{iz}{h}_{iz{k}^{{\prime} }}+(1-{x}_{iz}){h}_{iz{k}^{{\prime} }}^{{\prime} })}.$$
(22)

Equation (21) is valid when γ ≠ 1 and Eq. (22) is valid when γ ≠ 0.
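
The closed-form updates of Eqs. (21)-(22) translate directly into the following readable but unoptimized sketch; rho_all is the per-hyperedge output of the variational step above, and the released implementation may use more efficient expressions.

```python
import numpy as np

def update_w(A, rho_all, u, C, eps=1e-12):
    """Eq. (21). rho_all[m] is the dict of K x K arrays for the m-th observed hyperedge."""
    N, K = u.shape
    num = np.zeros((K, K))
    for A_e, rho_e in zip(A, rho_all):
        for rho_kq in rho_e.values():
            num += A_e * rho_kq
    den = np.zeros((K, K))
    for i in range(N):                       # sum over pairs i < j in V of u_ik u_jq
        for j in range(i + 1, N):
            den += np.outer(u[i], u[j])
    return num / (C * den + eps)

def update_beta(X, h, h_prime):
    """Eq. (22); the returned columns sum to one, satisfying the constraint on beta."""
    num = np.einsum('iz,izk->kz', X, h) + np.einsum('iz,izk->kz', 1 - X, h_prime)
    return num / num.sum(axis=0, keepdims=True)
```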

To obtain the updates for u we distinguish two cases. In the case of γ ≠ 0, differentiating Eq. (20) with respect to uik yields the condition:

$${a}_{ik}\,{u}_{ik}^{2}-({a}_{ik}+{b}_{ik}+{c}_{ik})\,{u}_{ik}+{b}_{ik}=0,$$
(23)

where

$${a}_{ik}= (1-\gamma )\,C\sum\limits_{j\in V,j\ne i}{\sum }_{q=1}^{K}{u}_{jq}{w}_{kq},\\ {b}_{ik}= (1-\gamma )\sum\limits_{e\in E:i\in e}{A}_{e}\sum\limits_{j\ne i\in e}{\sum }_{q=1}^{K}{\rho }_{ijkq}^{(e)}+\gamma {\sum }_{z=1}^{Z}{x}_{iz}{h}_{izk},\\ {c}_{ik} = \gamma {\sum }_{z=1}^{Z}(1-{x}_{iz}){h}_{izk}^{{\prime} }.$$

The updated values for uik are found by numerically solving Eq. (23). We take the smallest root of Eq. (23), as this is guaranteed to be in (0, 1), as we show in Supplementary Note B. This update automatically yields a value of uik in [0, 1], therefore the constraints on u are inactive and we do not need to differentiate with respect to the Lagrange multipliers \({\lambda }_{ik}^{(u)}\).
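
In practice, Eq. (23) can be solved in closed form per entry: with the coefficients aik, bik, cik defined above (and assuming aik > 0), the smallest root of the quadratic is computed as in the sketch below, which works elementwise on arrays of coefficients.

```python
import numpy as np

def smallest_root(a, b, c):
    """Smallest root of a*u^2 - (a + b + c)*u + b = 0; it lies in (0, 1) (Supplementary Note B)."""
    s = a + b + c
    disc = np.sqrt(np.maximum(s * s - 4.0 * a * b, 0.0))   # clip tiny negative values from round-off
    return (s - disc) / (2.0 * a)
```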

In the case γ = 0, we differentiate Eq. (20) with respect to both uik and the Lagrangian multipliers \({\lambda }_{ik}^{(u)}\) to obtain the update

$${u}_{ik}=\frac{{\sum }_{e\in E:i\in e}{A}_{e}{\sum }_{j\ne i\in e}\mathop{\sum }_{q=1}^{K}{\rho }_{ijkq}^{(e)}}{C{\sum }_{j\in V,j\ne i}\mathop{\sum }_{q=1}^{K}{u}_{jq}{w}_{kq}+{\lambda }_{ik}^{(u)}},$$
(24)

which is exactly the same as that of the Hy-MMSBM model19, except that in our case we have \({\lambda }_{ik}^{(u)}\), which constrains uik ∈ [0, 1]. Thus, our model is as powerful as Hy-MMSBM when γ = 0, but, when the attributes correlate well with the communities, our model can utilize this information to boost performance. In practice, in the latter case, cross-validation would yield γ > 0.

The EM algorithm finds a local maximum for a given starting point, which is not guaranteed to be the global maximum. Therefore, the algorithm is run several times and the best parameters are chosen based on the run that gives the highest log-likelihood.

A pseudocode for the algorithmic implementation is given in Algorithm 1.

Algorithm 1

HyCoSBM: EM algorithm

Inputs: hypergraph A, covariates X, hyperparameters γ and K

Outputs: inferred (u, w, β)

u, w, β ← init(u, w, β) ⊳ Randomly initialize the parameters

while convergence not reached do

\(\rho,\,h,\,{h}^{{\prime} }\leftarrow update(\rho,\,h,\,{h}^{{\prime} })\) ⊳ Eqs. (14)–(16)

u ← update(u) ⊳Eq. (23) or Eq. (24)

if γ ≠ 1 then

w ← update(w) ⊳Eq. (21)

end if

if γ ≠ 0 then

β ← update(β) ⊳ Eq. (22)

end if

end while

Hyperedge prediction and cross-validation

For all experiments with real datasets we used 5-fold cross-validation with the test AUC as performance metric to select the hyperparameters K and γ. We varied \(K\in \left\{2,\ldots,30\right\}\) and γ ∈ [0.0, 1.0]. The set of hyperedges was split into 80% and 20% for training and testing. The AUC is calculated by comparing the Poisson probabilities assigned to a given existing hyperedge against that of a randomly generated hyperedge of the same size. Since comparing all possible pairs of observed-unobserved edges is unfeasible, we estimate the AUC via sampling. For every observed edge in the dataset, we draw an edge of the same size uniformly at random, and compute the relative Poisson probabilities. The resulting Poisson probabilities are saved in a vector R1 for the observed edges and R0 for the randomly generated ones. We then compute the AUC as

$$AUC=\frac{\sum ({R}_{1} \, > \, {R}_{0})+0.5\sum ({R}_{1}=={R}_{0})}{| {R}_{1}| },$$

where ∑(R1 > R0) stands for the number of times the Poisson probability of the positive hyperedge was higher than that of the negative one, ∑(R1 == R0) for the number of times they were equal, and the total number ∣R1∣ of comparisons made is equal to the number of hyperedges in the test set.
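
A minimal sketch of this sampled AUC estimate is given below; scores_pos and scores_neg are assumed to contain the Poisson probabilities R1 and R0 assigned by the model to the observed test hyperedges and to their randomly drawn counterparts of equal size.

```python
import numpy as np

def sampled_auc(scores_pos, scores_neg):
    """AUC estimate from paired positive/negative scores, as in the formula above."""
    R1, R0 = np.asarray(scores_pos), np.asarray(scores_neg)
    return (np.sum(R1 > R0) + 0.5 * np.sum(R1 == R0)) / len(R1)
```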