Ph.D. Theses

Discovering Communities by Optimizing Community Quality Metrics

By Mingming Chen
Advisor: Boleslaw K. Szymanski
November 11, 2015

Many networks contain community structure: groups of nodes within which connections are denser than between them. Detecting and characterizing such structure, known as community detection, is one of the fundamental problems in the study of network systems and has received considerable attention in recent years. Numerous techniques have been developed for efficient and effective community detection. The most popular approach has been to maximize the community quality metric known as Newman's modularity over all possible partitions of a network. This metric measures the difference between the fraction of all edges that lie within the actual communities and the expected fraction of such edges in a randomized graph with the same number of nodes and the same degree sequence. It is widely used to measure the strength of the community structure found by community detection algorithms.
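As an illustration of the metric described above, the following minimal sketch computes modularity for an undirected, unweighted graph without self-loops. It uses the standard per-community form (fraction of internal edges minus the squared fraction of edge endpoints in the community); the function name and the edge-list and dictionary inputs are assumptions made for this example, not notation from the thesis.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman's modularity of a disjoint partition.

    edges        -- list of (u, v) pairs, undirected, no self-loops (assumed)
    community_of -- dict mapping each node to its community label (assumed)
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")

    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # sum of node degrees in community c

    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1

    # Observed internal edge fraction minus its expectation in a random
    # graph with the same degree sequence, summed over communities.
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)
```

For two triangles joined by a single edge, treating each triangle as a community gives a positive score, while placing all nodes in one community gives zero.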

However, modularity maximization suffers from two opposite yet concurrent problems. In some cases, it tends to split large communities into smaller ones. In other cases, it tends to form large communities by merging communities that are smaller than a certain threshold, which depends on the total number of edges in the network and on the degree of inter-connectivity between the communities. The latter problem is well known in the literature as the resolution limit problem. To solve these two problems simultaneously, we propose a new community quality metric, which we term Modularity Density, as an alternative to modularity. First, we show that modularity decreased by Split Penalty, defined as the fraction of edges that connect nodes of different communities, resolves the issue of favoring small communities. Then, we demonstrate that incorporating community densities into modularity and Split Penalty eliminates the problem of favoring large communities, namely the resolution limit problem.
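The first step above, modularity decreased by Split Penalty, can be sketched as follows. This is only an illustration for undirected, unweighted graphs without self-loops, with Split Penalty taken literally as the fraction of edges whose endpoints lie in different communities. The density weighting that turns this quantity into Modularity Density is not reproduced here because its formula is not given in this abstract; the function name and input types are assumptions for the example.

```python
from collections import defaultdict


def modularity_and_split_penalty(edges, community_of):
    """Return (modularity, split_penalty) for a disjoint partition.

    edges        -- list of (u, v) pairs, undirected, no self-loops (assumed)
    community_of -- dict mapping each node to its community label (assumed)
    """
    m = len(edges)
    if m == 0:
        raise ValueError("both quantities are undefined without edges")

    internal = defaultdict(int)    # edges inside community c
    degree_sum = defaultdict(int)  # sum of node degrees in community c
    crossing = 0                   # edges joining two different communities

    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
        else:
            crossing += 1

    q = sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
            for c in degree_sum)
    split_penalty = crossing / m   # fraction of inter-community edges
    return q, split_penalty
```

Subtracting the second value from the first penalizes partitions that cut many edges, which is why over-splitting a network into tiny communities scores poorly under this combined quantity.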

In addition, modularity can only quantify the quality of disjoint communities. It is more realistic, however, to expect that nodes in real-world networks belong to more than one community, resulting in overlapping communities. In the past few years, several overlapping extensions of modularity have been proposed to measure the quality of overlapping community structure. These extensions differ only in the way they define the belonging coefficient and the belonging function, yet there is a lack of systematic comparison among them. To fill this gap, we survey the overlapping extensions of modularity and generalize them with a uniform definition that allows different belonging coefficients and belonging functions to be applied and the best of them to be selected. We also extend localized modularity, modularity density, and eight local community quality metrics so that they can be used for overlapping communities.

We then propose a novel fine-tuned disjoint community detection algorithm that repeatedly attempts to improve a quality metric by splitting and merging the communities of a given community structure. The algorithm can be used to optimize any community quality metric; in this thesis, however, we consider only modularity and modularity density.

Although community detection is one of the fundamental techniques of network science, the community structure discovered by community detection algorithms often does not reflect reality. The primary reason is the incompleteness and inaccuracy of current network data collection methods, which may cause datasets to appear less modular than the underlying networks really are. Thus, in this thesis we aim to recover or improve network community structure that may be hidden or impaired by missing edges or by incorrectly identified extraneous edges. To this end, we introduce a method for improving the network structure. This method uses the scores obtained from different link prediction techniques to replace a certain fraction of low-ranking existing links with the top-ranked predicted links.
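The replacement step described above might look roughly like the sketch below. It is a simplified illustration, not the procedure used in the thesis: the common-neighbors count stands in for the unspecified link prediction scores, the same number of links is added as removed, scores are computed once on the original graph, and ties are broken by node order. The function name and parameters are hypothetical, and the graph is assumed to be undirected with sortable node labels and no self-loops.

```python
from itertools import combinations


def rewire_by_link_prediction(edges, fraction):
    """Swap the lowest-scoring existing links for the top predicted links.

    edges    -- list of (u, v) pairs, undirected, no self-loops (assumed)
    fraction -- share of existing links to replace (hypothetical parameter)
    """
    edge_set = {frozenset(e) for e in edges}

    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    def score(pair):
        # Assumed stand-in predictor: number of common neighbors.
        u, v = tuple(pair)
        return len(neighbors[u] & neighbors[v])

    def order(pair):
        # Deterministic tie-break on the sorted endpoint pair.
        return tuple(sorted(pair))

    k = int(fraction * len(edge_set))
    candidates = [frozenset(p) for p in combinations(sorted(neighbors), 2)
                  if frozenset(p) not in edge_set]

    weakest = sorted(edge_set, key=lambda p: (score(p), order(p)))[:k]
    strongest = sorted(candidates, key=lambda p: (-score(p), order(p)))[:k]

    # Replace only as many links as there are predicted links available.
    k = min(len(weakest), len(strongest))
    rewired = (edge_set - set(weakest[:k])) | set(strongest[:k])
    return sorted(order(e) for e in rewired)
```

On two nearly complete four-node groups connected by a single bridge, replacing one link under these assumptions removes the bridge, which has no common neighbors, and adds a missing within-group link.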
