
Community Detection (Louvain)

Community detection in graphs means grouping vertices together so that vertices that are more closely connected (have more edges among them) end up in the same cluster. A commonly used algorithm for community detection is Louvain (https://arxiv.org/pdf/0803.0476.pdf) [5].

Summary of Results

The Gunrock implementation uses sort and segmented reduce to implement the Louvain algorithm, instead of the commonly used hash-table mapping. The GPU implementation is roughly 1.5X faster than the OpenMP implementation, and also faster than previous GPU implementations. It is still unknown whether the sort and segmented-reduce formulation maps the problem to the GPU better than a hash table would. The modularities produced by the GPU implementation are close to those of the serial implementation, and are sometimes better when the graph is larger. A custom hash table could potentially improve the running time. The GPU Louvain implementation should have moderate scalability across multiple GPUs in a DGX-1.

Summary of Gunrock Implementation

The commonly used approach to implementing the Louvain algorithm uses a hash table. However, the memory access pattern that results from a hash table is almost completely random, and not GPU-friendly (more in the Alternate approaches section below). Instead of using a hash table to accumulate the values associated with the same key, the Gunrock GPU implementation tries another method: sort all key-value pairs, and use segmented reduce to accumulate the values in contiguous segments. Because Louvain always visits all edges in the graph, there is no need to use Gunrock frontiers; the advance operator with the ALL_EDGES advance mode or a simple ForAll loop is sufficient. The pseudocode is listed below:

m2 <- sum(edge_weights);
//Outer-loop
Do
    // Pass-initialization, assign each vertex to its own community
    For each vertex v in graph:
        current_communities[v] <- v;
        weights_v2any      [v] <- 0;
        weights_v2self     [v] <- 0;
    For each edge e<v, u> in graph: //an advance operator on all edges
        weights_v2any[v] += edge_weights[e];
        if (v == u)
            weights_v2self[v] += edge_weights[e];
    For each vertex v in graph:
        weights_community2any[v] <- weights_v2any[v];
    pass_gain <- 0;

    // Modularity optimizing iterations
    Do
        // Get weights between vertex and community
        For each edge e<v, u> in graph:
            edge_pairs  [e] <- <v, current_communities[u]>;
            pair_weights[e] <- edge_weights[e];
        Sort(edge_pairs, pair_weights)
            by edge_pair.first, then edge_pair.second if tie;
        segment_offsets <- offsets of continuous edge_pairs;
        segment_weights <- SegmentedReduce(pair_weights, segment_offsets, sum);

        // Compute base modularity gains
        // if moving vertices out of current communities
        For each vertex v in graph:
            comm <- current_communities[v];
            w_v2comm <- Find weights from v to comm in segment_weights;
            gain_bases[v] = weights_v2self[v] - w_v2comm
                - (weights_v2any[v] - weights_community2any[comm])
                    * weights_v2any[v] / m2;

        // Find the max gains if moving vertices into adjacent communities
        For each vertex v in graph:
            max_gains[v] <- 0;
            next_communities[v] <- current_communities[v];
        For each seg<v, comm> segment:
            if (comm == current_communities[v])
                continue;
            gain <- gain_bases[v] + segment_weights[seg]
                - weights_community2any[comm] * weights_v2any[v] / m2;
            atomicMax(max_gains + v, gain);
            seg_gains[seg] <- gain;
        For each seg<v, comm> segment:
            if (seg_gains[seg] != max_gains[v])
                continue;
            next_communities[v] <- comm;

        // Update communities
        For each vertex v in graph:
            curr_comm <- current_communities[v];
            next_comm <- next_communities[v];
            if (curr_comm == next_comm)
                continue;

            atomicAdd(weights_community2any[next_comm],  weights_v2any[v]);
            atomicAdd(weights_community2any[curr_comm], -weights_v2any[v]);
            current_communities[v] <- next_comm;

        iteration_gain <- Reduce(max_gains, sum);
        pass_gain += iteration_gain;
    While iterations stop condition not met
    // End of modularity optimizing iterations

    // Contract the graph
    // renumber occupied communities
    new_communities <- Renumber(current_communities);
    For each edge e<v, u> in graph: //an advance operator on all edges
        edge_pairs[e] <- <new_communities[current_communities[v]],
                          new_communities[current_communities[u]]>
    Sort(edge_pairs, edge_weights) by pair.first, then pair.second if tie;
    segment_offsets <- offsets of continuous edge_pairs;
    new_graph.Allocate(|new_communities|, |segments|);
    new_graph.edges <- first pair of each segment;
    new_graph.edge_values <- SegmentedReduce(edge_weights, segment_offsets, sum);

While pass stop condition not met
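To make the sort + segmented reduce formulation concrete, below is a minimal NumPy sketch of the key step above (an illustration only, not Gunrock code; the function and array names are made up): key every edge by (source vertex, community of destination), sort the keys, and reduce the weights within each contiguous segment of identical keys.

```python
# Minimal NumPy sketch (illustration only, not Gunrock code) of the
# sort + segmented-reduce step that accumulates vertex-to-community weights.
import numpy as np

def vertex_community_weights(src, dst, weights, communities):
    """Return unique (vertex, community) pairs and the summed edge weight
    from each vertex into each adjacent community."""
    # Key each edge by (source vertex, community of the destination).
    keys = np.stack([src, communities[dst]], axis=1)
    # Sort by vertex first, then community (np.lexsort's last key is primary).
    order = np.lexsort((keys[:, 1], keys[:, 0]))
    keys, w = keys[order], weights[order]
    # Offsets of contiguous segments of identical (vertex, community) keys.
    is_segment_start = np.ones(len(keys), dtype=bool)
    is_segment_start[1:] = np.any(keys[1:] != keys[:-1], axis=1)
    segment_offsets = np.flatnonzero(is_segment_start)
    # Segmented reduce (sum) over the weights.
    segment_weights = np.add.reduceat(w, segment_offsets)
    return keys[segment_offsets], segment_weights

# Tiny example: 4 vertices; vertices 0,1 are in community 0, vertices 2,3 in 1.
src = np.array([0, 0, 0, 1, 2, 3])
dst = np.array([1, 2, 3, 2, 3, 0])
weights = np.ones(len(src))
communities = np.array([0, 0, 1, 1])
pairs, seg_w = vertex_community_weights(src, dst, weights, communities)
print(pairs)   # unique (vertex, community) pairs
print(seg_w)   # summed weight per pair; e.g. vertex 0 -> community 1 is 2.0
```

On the GPU, the same two steps map to CUB's radix sort and a load-balanced segmented reduce, as described in the Performance and Analysis and Alternate approaches sections below.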

How To Run This Application on DARPA's DGX-1

Prereqs/input

CUDA should have been installed, and $PATH and $LD_LIBRARY_PATH should have been set correctly to use CUDA. The current Gunrock configuration assumes Boost (1.58.0 or 1.59.0) and METIS are installed; if not, changes need to be made in the Makefiles. DARPA's DGX-1 had both installed when the tests were performed.

git clone --recursive https://github.com/gunrock/gunrock/
cd gunrock
git checkout dev-refactor
git submodule init
git submodule update
mkdir build
cd build
cmake ..
cd ../tests/louvain
make

At this point, there should be an executable louvain_main_<CUDA version>_x86_64 in tests/louvain/bin.

The datasets are assumed to have been placed in /raid/data/hive, and converted to proper matrix market format (.mtx). At the time of testing, ca, amazon, akamai, and pokec are available in that directory. ca and amazon are taken from PNNL's implementation, and originally use 0-based vertex indices; 1 is added to each vertex id to make them proper .mtx files.

The testing is done with Gunrock using the dev-refactor branch at commit 2699252 (Oct. 18, 2018), using CUDA 9.1 with NVIDIA driver 390.30.

Running the application

./bin/louvain_main_9.1_x86_64 --omp-threads=32 --iter-stats --pass-stats \
--advance-mode=ALL_EDGES --unify-segments=true --validation=each --num-runs=10 \
--graph-type=market --graph-file=/raid/data/hive/[DataSet]/[DataSet].mtx \
--jsondir=[LogDir] > [LogDir]/[DataSet].txt 2>&1

For example, when DataSet = akamai, and LogDir = eval/DGX1-P100x1, the command is

./bin/louvain_main_9.1_x86_64 --omp-threads=32 --iter-stats --pass-stats \
--advance-mode=ALL_EDGES --unify-segments=true --validation=each --num-runs=10 \
--graph-type=market --graph-file=/raid/data/hive/akamai/akamai.mtx \
--jsondir=eval/DGX1-P100x1 > eval/DGX1-P100x1/akamai.txt 2>&1

Output

The outputs are in the louvain directory. Look for the .txt files: the running time follows Run x elapsed:, and the number of communities and the resulting modularity are in the line that starts with Computed:. There are 12 runs in each .txt file: 1 single-thread CPU run for reference, 1 OpenMP multi-thread run (32 threads in the example, which may not be optimal), and 10 GPU runs.
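For convenience, here is a rough Python sketch for pulling these numbers out of a log file; it assumes the line formats described above, and the patterns may need adjusting to the actual output.

```python
# Rough sketch for extracting per-run timings and the "Computed:" summary
# lines from a Louvain log; assumes the line formats described above.
import re
import sys

def parse_log(path):
    elapsed, computed = [], []
    with open(path) as f:
        for line in f:
            m = re.search(r"Run \d+ elapsed:\s*([-+0-9.eE]+)", line)
            if m:
                elapsed.append(float(m.group(1)))
            if line.lstrip().startswith("Computed:"):
                computed.append(line.strip())
    return elapsed, computed

if __name__ == "__main__":
    times, results = parse_log(sys.argv[1])  # e.g. eval/DGX1-P100x1/akamai.txt
    print("elapsed:", times)
    print("\n".join(results))
```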

The output was compared against PNNL's results on the number of communities and modularity for the amazon and ca datasets. Note that PNNL's code does not count dangling vertices in communities, so the results listed below use the number of communities minus the number of dangling vertices; the dataset details can be found in the next section.

The modularity of resulting communities:

| DataSet | Gunrock GPU | OMP (32T) | Serial | PNNL (8T) | PNNL (serial) |
|---------|-------------|-----------|--------|-----------|---------------|
| amazon | 0.908073 | 0.925721 | 0.926442 | 0.923728 | 0.925557 |
| ca | 0.711971 | 0.730217 | 0.731292 | 0.713885 | 0.727127 |

Note that for these kinds of small graphs, more parallelism can hurt the resulting modularity. Multi-thread CPU implementations from both Gunrock and PNNL yield modularities a little lower than the serial versions, and the GPU implementation sees a ~0.02 drop. The reason could be concurrent updates to communities: vertex A moves to community C, thinking vertex B is in C, but B may have simultaneously moved to another community.

However, when input graphs are larger in size, this issue seems to disappear, and modularities from the GPU implementation are sometimes even better than the serial implementation. (See detailed results below.)

The number of resulting communities:

| DataSet | Gunrock GPU | OMP (32T) | Serial | PNNL (8T) | PNNL (serial) |
|---------|-------------|-----------|--------|-----------|---------------|
| amazon | 7667 | 213 | 240 | 298 | 251 |
| ca | 1120 | 616 | 617 | 654 | 623 |

More parallelism also affects the number of resulting communities. On these two small datasets, the GPU implementation produces significantly more communities than all CPU implementations; on large datasets, the differences in the number of communities are much smaller. We believe the reason may also be concurrent community updates, especially when whole-community migration happens: all vertices in community A decide to move to community B, and all vertices in community B decide to move to community A. In the serial implementation, these two communities would merge into one, but on the GPU, communities A and B may instead just swap their labels and never merge.

Performance and Analysis

We measure Louvain performance with three metrics: the number of resulting communities (#Comm), the modularity of the resulting communities (Q), and the running time (Time, in seconds). Higher Q and lower running time are better. A * indicates the graph is given as undirected; its edge count is taken after edge doubling and removal of self-loops and duplicate edges. Otherwise the graph is treated as directed, with self-loops and duplicate edges also removed. If edge weights are available in the input graph, they are used as given; otherwise, the initial edge weights are set to 1.
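For reference, Q is the standard modularity (Blondel et al. [5]):

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$

where $A_{ij}$ is the weight of the edge between vertices $i$ and $j$, $k_i = \sum_j A_{ij}$ is the total weight of edges attached to $i$, $c_i$ is the community of $i$, $\delta(c_i, c_j)$ is 1 when $c_i = c_j$ and 0 otherwise, and $2m = \sum_{i,j} A_{ij}$ is the total edge weight (the m2 in the pseudocode above).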

Details of the datasets:

| DataSet | #V | #E | #dangling vertices |
|---------|----|----|--------------------|
| ca | 108299 | 186878 | 85166 |
| preferentialAttachment | 100000 | 999970* | 0 |
| caidaRouterLevel | 192244 | 1218132* | 0 |
| amazon | 548551 | 1851744 | 213688 |
| coAuthorsDBLP | 299067 | 1955352* | 0 |
| webbase-1M | 1000005 | 2105531 | 2453 |
| citationCiteseer | 268495 | 2313294* | 0 |
| cnr-2000 | 325557 | 3128710 | 0 |
| as-Skitter | 1696415 | 22190596* | 0 |
| coPapersDBLP | 540486 | 30481458* | 0 |
| pokec | 1632803 | 30622564 | 0 |
| coPapersCiteseer | 434102 | 32073440* | 0 |
| akamai | 16956250 | 53300364 | 0 |
| soc-LiveJournal1 | 4847571 | 68475391 | 962 |
| channel-500x100x100-b050 | 4802000 | 85362744 | 0 |
| europe_osm | 50912018 | 108109320* | 0 |
| hollywood-2009 | 11399905 | 112751422* | 32662 |
| rgg_n_2_24_s0 | 16777216 | 265114400* | 1 |
| uk-2002 | 18520486 | 292243663 | 37300 |

Running time in seconds:

| GPU | DataSet | Gunrock GPU | Speedup vs. OMP | OMP | Serial |
|-----|---------|-------------|-----------------|-----|--------|
| P100 | ca | 0.108 | 0.24 | 0.026 | 0.065 |
| V100 | ca | 0.089 | 0.33 | 0.029 | 0.067 |
| V100 | preferentialAttachment | 0.076 | 1.26 | 0.096 | 0.235 |
| V100 | caidaRouterLevel | 0.063 | 1.03 | 0.065 | 0.229 |
| P100 | amazon | 0.160 | 1.27 | 0.203 | 0.648 |
| V100 | amazon | 0.122 | 1.62 | 0.198 | 0.631 |
| V100 | coAuthorsDBLP | 0.082 | 1.62 | 0.133 | 0.414 |
| V100 | webbase-1M | 0.107 | 1.57 | 0.168 | 0.318 |
| V100 | citationCiteseer | 0.074 | 1.50 | 0.111 | 0.432 |
| V100 | cnr-2000 | 0.235 | 0.57 | 0.133 | 0.388 |
| V100 | as-Skitter | 0.376 | 1.76 | 0.660 | 2.480 |
| V100 | coPapersDBLP | 0.358 | 1.22 | 0.437 | 1.860 |
| P100 | pokec | 0.929 | 1.34 | 1.244 | 6.521 |
| V100 | pokec | 0.624 | 1.74 | 1.083 | 6.110 |
| V100 | coPapersCiteseer | 0.353 | 1.11 | 0.391 | 1.592 |
| P100 | akamai | 1.278 | 5.13 | 6.560 | 14.427 |
| V100 | akamai | 0.934 | 6.79 | 6.343 | 13.266 |
| V100 | soc-LiveJournal1 | 1.548 | 2.59 | 4.016 | 16.311 |
| V100 | channel-500x100x100-b050 | 1.133 | 0.68 | 0.768 | 4.449 |
| V100 | europe_osm | 4.902 | 7.11 | 34.875 | 101.320 |
| V100 | hollywood-2009 | 1.230 | 1.40 | 1.721 | 9.419 |
| V100 | rgg_n_2_24_s0 | 3.378 | 0.79 | 2.664 | 17.816 |
| V100 | uk-2002 | 4.921 | 1.15 | 5.682 | 31.006 |

Resulting modularity:

| GPU | DataSet | Gunrock GPU | Gunrock - Serial | OMP | OMP - Serial | Serial |
|-----|---------|-------------|------------------|-----|--------------|--------|
| P100 | ca | 0.7112 | -0.0193 | 0.7302 | -0.0011 | 0.7313 |
| V100 | ca | 0.7166 | -0.0147 | 0.7290 | -0.0025 | 0.7313 |
| V100 | preferentialAttachment | 0.1758 | -0.1095 | 0.2287 | -0.0565 | 0.2852 |
| V100 | caidaRouterLevel | 0.8500 | +0.0065 | 0.8362 | -0.0073 | 0.8436 |
| P100 | amazon | 0.9081 | -0.0184 | 0.9257 | -0.0007 | 0.9264 |
| V100 | amazon | 0.9089 | -0.0175 | 0.9258 | -0.0006 | 0.9264 |
| V100 | coAuthorsDBLP | 0.8092 | -0.0179 | 0.8136 | -0.0135 | 0.8271 |
| V100 | webbase-1M | 0.8945 | -0.0613 | 0.9471 | -0.0087 | 0.9558 |
| V100 | citationCiteseer | 0.7888 | -0.0137 | 0.7605 | -0.0420 | 0.8025 |
| V100 | cnr-2000 | 0.8766 | -0.0031 | 0.8784 | -0.0013 | 0.8797 |
| V100 | as-Skitter | 0.8366 | +0.0234 | 0.8223 | +0.0091 | 0.8132 |
| V100 | coPapersDBLP | 0.8494 | +0.0003 | 0.8440 | -0.0051 | 0.8492 |
| P100 | pokec | 0.6933 | -0.0012 | 0.6914 | -0.0032 | 0.6945 |
| V100 | pokec | 0.6741 | -0.0204 | 0.6763 | -0.0183 | 0.6945 |
| V100 | coPapersCiteseer | 0.9075 | -0.0035 | 0.9059 | -0.0051 | 0.9110 |
| P100 | akamai | 0.9334 | +0.0329 | 0.9072 | +0.0067 | 0.9005 |
| V100 | akamai | 0.9333 | +0.0328 | 0.9074 | +0.0070 | 0.9005 |
| V100 | soc-LiveJournal1 | 0.7336 | +0.0097 | 0.5456 | -0.1782 | 0.7239 |
| V100 | channel-500x100x100-b050 | 0.9004 | +0.0498 | 0.9512 | +0.1007 | 0.8505 |
| V100 | europe_osm | 0.9979 | +0.0142 | 0.9844 | +0.0008 | 0.9836 |
| V100 | hollywood-2009 | 0.7432 | -0.0079 | 0.7502 | -0.0010 | 0.7511 |
| V100 | rgg_n_2_24_s0 | 0.9921 | +0.0026 | 0.9920 | +0.0024 | 0.9896 |
| V100 | uk-2002 | 0.9507 | -0.0098 | 0.9604 | +0.0000 | 0.9604 |

The number of resulting communities:

| GPU | DataSet | Gunrock GPU | OMP | Serial |
|-----|---------|-------------|-----|--------|
| P100 | ca | 1120 | 616 | 617 |
| V100 | ca | 1076 | 615 | 617 |
| V100 | preferentialAttachment | 18 | 14 | 39 |
| V100 | caidaRouterLevel | 410 | 467 | 745 |
| P100 | amazon | 7667 | 213 | 240 |
| V100 | amazon | 7671 | 233 | 240 |
| V100 | coAuthorsDBLP | 95 | 138 | 273 |
| V100 | webbase-1M | 4430 | 1469 | 1362 |
| V100 | citationCiteseer | 67 | 48 | 141 |
| V100 | cnr-2000 | 65621 | 59219 | 59253 |
| V100 | as-Skitter | 924 | 1945 | 2531 |
| V100 | coPapersDBLP | 70 | 111 | 237 |
| P100 | pokec | 154988 | 161709 | 166156 |
| V100 | pokec | 155100 | 162464 | 166156 |
| V100 | coPapersCiteseer | 108 | 110 | 358 |
| P100 | akamai | 90285 | 130639 | 145785 |
| V100 | akamai | 90245 | 127843 | 145785 |
| V100 | soc-LiveJournal1 | 506826 | 434272 | 447426 |
| V100 | channel-500x100x100-b050 | 24 | 54 | 12 |
| V100 | europe_osm | 17320 | 784171 | 828662 |
| V100 | hollywood-2009 | 12218 | 12593 | 12741 |
| V100 | rgg_n_2_24_s0 | 344 | 359 | 311 |
| V100 | uk-2002 | 2402560 | 2245355 | 2245678 |

Implementation limitations

Comparison against existing implementations

We compare against serial CPU and OpenMP implementations; both are Gunrock's CPU reference routines. PNNL's results and previously published work are also used as references, to make sure the resulting modularities and running times are reasonable.

Published results (timing and modularity) from previous work are summarized in the louvain_results.xlsx file in the louvain directory.

Performance limitations

The GPU performance bottleneck is the sort function, especially in the first pass. Sorting on the GPU is much faster than on the CPU, but the CPU implementations use hash tables, which may not be suitable for the GPU; the Alternate approaches section has more details on this.

Profiling the GPU Louvain implementation on a V100 with the akamai dataset, an iteration in the first pass takes 64.08 ms, of which the sort takes 42.05 ms, about two-thirds of the iteration time. The Louvain implementation uses CUB's radix sort pairs function; CUB is considered one of the best-performing GPU primitive libraries. Further profiling shows that during the sort kernel, the device memory controller is ~75% utilized; in other words, the sort is memory bound. This is expected: in each iteration of radix sort, the whole edge_pairs and pair_weights arrays are shuffled, at a cost on the order of $O(|E|)$, although the memory accesses are mostly coalesced. The memory system utilizations are as below (from the NVIDIA profiler):

| Type | Transactions | Bandwidth | Utilization |
|------|--------------|-----------|-------------|
| Shared Loads | 2141139 | 1168.859 GB/s | |
| Shared Stores | 2486946 | 1357.636 GB/s | |
| Shared Total | 4628085 | 2526.495 GB/s | Low |
| L2 Reads | 2533302 | 345.736 GB/s | |
| L2 Writes | 2831732 | 386.464 GB/s | |
| L2 Total | 5365034 | 732.200 GB/s | Low |
| Local Loads | 588 | 80.248 MB/s | |
| Local Stores | 588 | 80.248 MB/s | |
| Global Loads | 2541637 | 346.873 GB/s | |
| Global Stores | 2675820 | 365.186 GB/s | |
| Texture Reads | 2515533 | 1373.242 GB/s | |
| Unified Cache Total | 7734166 | 2085.462 GB/s | Idle to Low |
| Device Reads | 2469290 | 336.999 GB/s | |
| Device Writes | 2596403 | 354.347 GB/s | |
| Device Memory Total | 5065693 | 691.347 GB/s | High |
| PCIe Reads | 0 | 0 B/s | None |
| PCIe Writes | 5 | 682.381 kB/s | Idle to Low |

[Figure: Louvain_akamai — NVIDIA profiler memory utilization chart for Louvain on the akamai dataset]

It is clear that the bottleneck is the device memory: most data are read once and written once, with little opportunity for reuse. The kernel achieves 691 GB/s of bandwidth, ~77% of the 900 GB/s the 32 GB HBM2 provides. This high bandwidth utilization fits the memory access pattern: mostly regular and coalesced.

The particular issue here is not the kernel implementation itself, but the use of sorting: fully sorting the whole edge list is perhaps overkill. One possible improvement is to use a custom hash table on the GPU to replace the sort + segmented reduce part. The hash table could also cut the memory requirement by about 50%.

Next Steps

Alternate approaches

Things we tried that didn't work well

Segmented sort and CUB segmented reduce: the original idea for modularity optimization was to use CUB's segmented sort and segmented reduce within each vertex's neighborhood to get the vertex-community weights. However, CUB uses one block per segment in these two routines, which creates huge load imbalance, because neighbor-list lengths can differ widely. The solution is to 1) use vertex-community pairs as keys with CUB's unsegmented sort, and 2) implement a load-balanced segmented-reduce kernel. This solution reduces the sort-reduce time by a few times, and can be enabled by the --unify-segments flag on the command line.

STL map (std::map) on CPU: std::map has performance issues when used in a multi-threaded environment and yields poor scalability, possibly due to locking inside the STL implementation. The solution is to give each thread a flat array of size $|V|$ for the vertices it processes: within a thread, vertices are processed one at a time, and the maximum number of communities is $|V|$, so such an array can replace the hash table. This significantly reduces the running time even with a single thread, and the multi-thread scalability is also much better.

Vanilla hash table on GPU: this cannot be used, because the keys (vertex-community or community-community pairs) and the values (edge weights) are both 64-bit, and there is currently no support for atomic compare-and-exchange operations on 128-bit data. Even if that were available, a vanilla table only provides insert and query operations, not accumulation of values for the same key.
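As an aside, a minimal single-threaded sketch of the flat-array replacement described above (illustration only; the names are made up): accumulate the weights of the communities adjacent to the current vertex, hand them to the gain computation, then clear only the touched entries before moving to the next vertex.

```python
# Single-threaded sketch of replacing a per-vertex hash table with a flat
# length-|V| accumulator array (illustration only, not the Gunrock code).
def vertex_to_community_weights(neighbors, edge_weights, communities):
    num_vertices = len(communities)
    acc = [0.0] * num_vertices       # one slot per possible community id
    in_acc = [False] * num_vertices  # marks which slots are currently in use
    touched = []                     # community ids seen for the current vertex
    for v in range(num_vertices):
        for u, w in zip(neighbors[v], edge_weights[v]):
            c = communities[u]
            if not in_acc[c]:
                in_acc[c] = True
                touched.append(c)
            acc[c] += w
        # acc[c] now holds the v-to-c weight for every adjacent community c.
        yield v, {c: acc[c] for c in touched}
        for c in touched:            # O(degree) reset instead of O(|V|)
            acc[c] = 0.0
            in_acc[c] = False
        touched.clear()

# Tiny example: path graph 0-1-2, each vertex in its own community.
neighbors    = [[1], [0, 2], [1]]
edge_weights = [[1.0], [1.0, 1.0], [1.0]]
communities  = [0, 1, 2]
for v, w_v2comm in vertex_to_community_weights(neighbors, edge_weights, communities):
    print(v, w_v2comm)
```

The per-vertex reset cost is proportional to the vertex's degree, which is what makes the flat array cheaper than rebuilding a hash table for every vertex.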

Custom hash table

Hash tables can be used to accumulate the vertex-to-community weights during modularity optimization iterations, and to get the community-to-community edge weights during the graph contraction part of the algorithm. Previous implementations use hash tables as a common practice. However, as mentioned above, vanilla hash tables that store key-value pairs are not a good choice for Louvain.

Louvain not only needs to store the key-value pairs, but also to accumulate values associated with the same key. That key may be the same vertex-community pair for modularity optimization, or the same community-community pair for graph contraction. Ideally, the hash table would provide an insert-and-accumulate operation: if the key is not yet in the table, insert it with the given value; otherwise, atomically add the value to the one already stored. Combined with a way to enumerate all (key, accumulated value) pairs afterwards, this would make the table much more suitable for Louvain.

This kind of hash table removes the strong restriction, imposed by vanilla hash table implementations, that the key-value pair must support an atomic compare-and-exchange operation, and it is much friendlier to value accumulation. It would replace the sort + segmented reduce part, and could reduce the workload from about $O(6|E|)$ to $O(2|E|)$ and the memory requirement from $48|E|$ bytes to $24|E|$ bytes. However, the memory access pattern becomes irregular and uncoalesced, so it is still unknown whether it would actually yield a performance gain.
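As a sketch of the intended interface (single-threaded Python, illustration only; a GPU version would claim a slot with a 64-bit atomic compare-and-swap on the key and accumulate with an atomic add on the value, which is what avoids the 128-bit atomic requirement):

```python
# Single-threaded sketch of the desired insert-and-accumulate hash table
# interface; an illustration of the interface, not a proposed GPU kernel.
class AccumulatorHashTable:
    EMPTY = None

    def __init__(self, capacity):
        # Assumes the table never fills up (capacity > number of unique keys).
        self.keys = [self.EMPTY] * capacity
        self.vals = [0.0] * capacity
        self.capacity = capacity

    def accumulate(self, key, value):
        """Insert key with value, or add value to the existing entry."""
        slot = hash(key) % self.capacity
        while True:                               # linear probing
            if self.keys[slot] is self.EMPTY:     # GPU: atomicCAS on the 64-bit key
                self.keys[slot] = key
            if self.keys[slot] == key:
                self.vals[slot] += value          # GPU: atomicAdd on the value
                return
            slot = (slot + 1) % self.capacity

    def items(self):
        return [(k, v) for k, v in zip(self.keys, self.vals) if k is not self.EMPTY]

# Example: accumulate vertex-to-community weights keyed by (vertex, community).
table = AccumulatorHashTable(capacity=16)
for (v, comm), w in [((0, 3), 1.0), ((0, 3), 2.0), ((1, 3), 0.5)]:
    table.accumulate((v, comm), w)
print(table.items())   # two keys: (0, 3) accumulated to 3.0, (1, 3) to 0.5
```

Whether the irregular probing traffic of such a table beats the coalesced but larger traffic of sort + segmented reduce is exactly the open question raised above.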

Iteration and pass stop conditions

The concurrent nature of the GPU implementation makes the modularity-gain calculation inaccurate, and whole-community migrations (label swaps) are also observed. Because of these, the optimization iterations and the passes may run more times than needed, resulting in longer running times, especially in the first pass. Preliminary experiments that cap the number of iterations in the first pass can reduce the running time by about 40%, with the modularity roughly unchanged. The best cap is dataset-dependent, and more investigation is needed to understand how to set it.

Gunrock implications

The core of the Louvain implementation mainly uses all-edges advance, sort, segmented reduce, and for loops. The sort + segmented reduce operation is effectively a keyed segmented reduce; if that turns out to be a common operation in other algorithms, it could be made into a new operator. The all-edges advance is used quite often across applications, so wrapping it in a simpler operator interface could be helpful.

Notes on multi-GPU parallelization

When parallelizing across multiple GPUs, the community assignments of local vertices and their neighbors need to be available locally, either explicitly, by moving them with communication functions, or implicitly, via peer memory access or unified virtual memory. The communication volume is on the order of $O(|V|)$ and the computation workload is at least $O(|E|)$, so the scalability should be manageable.

When the number of GPUs is small, 1D partitioning can be used to divide the edges and replicate the vertices, so there is no need to convert vertex ids across GPUs. When the number of GPUs is large, the high/low-degree vertex separation and partitioning scheme can be used: edges are still distributed across GPUs, high-degree vertices are duplicated, and each low-degree vertex is owned by a single GPU. The boundary between the two partitioning schemes is still unclear, but the 8 GPUs in a DGX-1 can likely still be considered a small number, so the simple 1D partitioning should suffice.
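As a rough sketch of the simple 1D scheme (hypothetical helper, not Gunrock code): split the edge list into contiguous per-GPU chunks and keep the per-vertex community array replicated on (or accessible to) every GPU.

```python
# Rough sketch (hypothetical helper, not Gunrock code) of 1D edge partitioning
# with replicated vertex state: each GPU gets a contiguous chunk of the edge
# list; the communities[] array is replicated, so no vertex-id conversion is
# needed.
def partition_edges_1d(src, dst, weights, num_gpus):
    num_edges = len(src)
    chunk = (num_edges + num_gpus - 1) // num_gpus
    parts = []
    for g in range(num_gpus):
        lo, hi = g * chunk, min((g + 1) * chunk, num_edges)
        parts.append({"src": src[lo:hi], "dst": dst[lo:hi], "weights": weights[lo:hi]})
    return parts

# Example: split 10 edges across 4 GPUs.
edges = list(range(10))
print([len(p["src"]) for p in partition_edges_1d(edges, edges, edges, 4)])  # [3, 3, 3, 1]
```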

Notes on dynamic graphs

Louvain is not directly related to dynamic graphs, but it should be able to run on a dynamic graph, provided that all edges can still be accessed in the same way. The community assignment from the previous graph can be used as a good starting point, if the vertex ids are consistent and the graph has not changed dramatically.

Notes on larger datasets

The memory-usage bottleneck of the current implementation is the edge pair-weight sort, which accounts for about half of the memory usage. Replacing sort + segmented reduce with a custom hash table could significantly lower the memory usage.

Louvain needs to go over the whole graph once in each modularity optimization iteration. If the graph is larger than the combined GPU memory space, which forces streaming the graph in every iteration, the performance bottleneck becomes the CPU-GPU bandwidth, and that can be an order of magnitude slower than when the graph fits entirely in GPU memory. Considering that the OpenMP implementation is not that slow, using it may well be faster than moving the graph across the CPU-GPU bridge multiple times.

Notes on other pieces of this workload

All parts of Louvain are graph related, and fully implemented in Gunrock.

Research potential

Community detection on graphs is an interesting research topic in general, and others have targeted good GPU implementations of it. Our work is the first to map Louvain to sort and segmented reduce, so more comparisons are needed to know whether this mapping is better than the hash-table mapping. The custom hash table implementation is worth trying out and comparing against the current sort and segmented reduce implementation. Multi-GPU implementations, particularly the graph contraction part, are a place for research investigation: it may be better to contract the graph onto a single GPU, or even onto the CPU, once the graph becomes small; but where is the threshold, and how should the switch from multi-GPU to single-GPU or CPU be done?

Label propagation is another algorithm for community detection. It has a similar structure to Louvain, but has a simpler way to decide how to move the vertices. Comparing them side by side for both result quality (the modularity value) and computation speed (the running time) will be beneficial.

References

[1] Hao Lu, Mahantesh Halappanavar, Ananth Kalyanaraman. "Parallel Heuristics for Scalable Community Detection", https://arxiv.org/abs/1410.1237 (2015).

[2] Cheong C.Y., Huynh H.P., Lo D., Goh R.S.M. "Hierarchical Parallel Algorithm for Modularity-Based Community Detection Using GPUs". Euro-Par 2013.

[3] Md. Naim, Fredrik Manne, Mahantesh Halappanavar, and Antonio Tumeo. "Highly Scalable Community Detection Using a GPU", https://www.eecs.wsu.edu/~assefaw/CSC16/abstracts/naim-CSC16_paper_14.pdf (2016).

[4] M. Naim, F. Manne, M. Halappanavar and A. Tumeo, "Community Detection on the GPU," IPDPS '17.

[5] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, "Fast unfolding of communities in large networks". https://arxiv.org/abs/0803.0476 (2008).