---
toc_footers:
  - Gunrock: GPU Graph Analytics
  - Gunrock © 2018 The Regents of the University of California.
---
Scan statistics, as described in Priebe et al., is a generic method that computes a statistic over the neighborhood of each node in a graph and looks for anomalies in those statistics. In this workflow, we implement a specific version of scan statistics that computes the number of edges in the subgraph induced by the one-hop neighborhood of each node $u$ in the graph. It turns out that this statistic is equal to the number of triangles that node $u$ participates in plus the degree of $u$. Thus, we are able to implement scan statistics with relatively minor modifications to our existing Gunrock triangle counting (TC) application.
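To make this equivalence concrete, here is a tiny self-contained Python check (illustrative only, not part of the Gunrock code): for every node $u$ of a small example graph, the number of edges in the subgraph induced by $u$'s closed one-hop neighborhood equals $u$'s triangle count plus its degree.

```python
# Illustrative check (not Gunrock code): edges in the subgraph induced by
# {u} ∪ N(u) equals triangles(u) + degree(u).
from itertools import combinations

edges = {(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)}
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def scan_stat(u):
    """Number of edges in the subgraph induced by u's closed neighborhood."""
    nodes = adj[u] | {u}
    return sum(1 for a, b in combinations(sorted(nodes), 2) if b in adj[a])

def triangles(u):
    """Number of triangles node u participates in."""
    return sum(1 for a, b in combinations(sorted(adj[u]), 2) if b in adj[a])

for u in adj:
    assert scan_stat(u) == triangles(u) + len(adj[u])
```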
Summary of Results
Scan statistics applied to static graphs fits perfectly into the Gunrock framework. Using a combination of `ForAll` and `Intersection` operations, we outperform the parallel OpenMP CPU reference by up to 45.4x on the small Enron graph (provided as part of the HIVE workflows) and by up to 580x on larger graphs that feature enough computation to saturate the throughput of the GPU.
Algorithm: Scan Statistics
The input is an undirected graph without self-loops.
```
scan_stats = [len(graph.neighbors(u)) for u in graph.nodes]
for (u, v) in graph.edges:
    if u < v:
        u_neibs = graph.neighbors(u)
        v_neibs = graph.neighbors(v)
        for shared_neib in intersect(u_neibs, v_neibs):
            scan_stats[shared_neib] += 1
return argmax([scan_stats[node] for node in graph.nodes])
```
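The pseudocode above can be sketched as runnable Python. This is a minimal reference version, assuming nodes are labeled `0..n-1`; `scan_statistics` is an illustrative name, not a Gunrock API:

```python
# Runnable sketch of the scan-statistics pseudocode (nodes labeled 0..n-1).
def scan_statistics(n, edges):
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Initialize each node's statistic with its degree.
    scan_stats = [len(adj[u]) for u in range(n)]
    # For each edge (u, v), every common neighbor closes a triangle;
    # credit that common neighbor with one more triangle.
    for u in range(n):
        for v in adj[u]:
            if u < v:
                for shared_neib in adj[u] & adj[v]:
                    scan_stats[shared_neib] += 1
    # Return the node with the maximal scan statistic and the full array.
    return max(range(n), key=lambda x: scan_stats[x]), scan_stats
```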
Summary of Gunrock implementation
```
max_scan_stat = -1
node = -1
src_node_ids[nodes]
scan_stats[nodes]
ForAll(src_node_ids, scan_stats):
    fill scan_stats with the degree of each node
Intersection(src_node_ids, scan_stats):
    intersect the neighbor lists of both endpoints of each edge;
    add 1 to scan_stats[common_node] for each common_node
    found by the intersection
return scan_stats
```
How To Run This Application on DARPA's DGX-1
```shell
git clone --recursive https://github.com/gunrock/gunrock \
    -b dev-refactor
cd gunrock/tests/ss/
make clean && make
```
Running the application
Application specific parameters:
```shell
./bin/test_ss_main_10.0_x86_64 \
    --graph-type=market \
    --graph-file=./enron.mtx \
    --undirected --num_runs=10
```
The output of this app is an array of uint values: the scan statistic value for each node. The output file is in .txt format and contains these values.
We compare our GPU output with the HIVE CPU reference implementation implemented using OpenMP.
Details of the datasets:
Running time in milliseconds:
| GPU | Dataset | Gunrock GPU | Speedup vs. OMP | OMP |
|-----|---------|-------------|-----------------|-----|
Performance and Analysis
We measure the performance by runtime. The whole process runs on the GPU; we don't include data-copy time and regard it as part of preprocessing. The algorithm is deterministic, so it does not require any accuracy comparison.
Since the implementation only needs additional GPU memory proportional to the number of nodes (one counter per node), the largest dataset that can fit on a single GPU is limited primarily by the GPU's on-board memory size.
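A back-of-envelope sketch of that memory footprint, assuming 4-byte vertex ids, 4-byte uint counters, and CSR storage with both edge directions (these storage assumptions are illustrative, not a statement of Gunrock's exact layout):

```python
# Rough GPU memory estimate (assumptions: 4-byte vertex ids, 4-byte uint
# scan-stat counters, CSR storage with both edge directions materialized).
def gpu_bytes(num_nodes, num_undirected_edges):
    csr_offsets = 4 * (num_nodes + 1)
    csr_columns = 4 * (2 * num_undirected_edges)  # both directions stored
    scan_stats  = 4 * num_nodes                   # one uint counter per node
    return csr_offsets + csr_columns + scan_stats

# e.g. a graph with 100M nodes and 1B undirected edges:
print(gpu_bytes(100_000_000, 1_000_000_000) / 2**30, "GiB")
```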
Our implementation regards the graph as undirected and is limited to graphs of that type.
We currently use atomic adds to accumulate the number of triangles for each node; atomic operations are relatively slow.
The most time-consuming operation for scan statistics is the intersection operator. We believe we can improve its performance in future work.
Currently we divide the edge lists into two groups: (1) small neighbor lists and (2) large neighbor lists. We implement two kernels (`TwoSmall` and `TwoLarge`) that cooperatively compute intersections. Our `TwoSmall` kernel uses one thread to compute the intersection of a node pair. Our `TwoLarge` kernel partitions the edge list into chunks and assigns each chunk to a separate thread block; each block then uses the balanced-path primitive to cooperatively compute intersections. However, these two kernels do not cover the case of one small neighbor list and one large neighbor list. By using a 3-kernel strategy and carefully choosing a threshold value to divide the edge list into groups, we could process intersections with similar workloads together, gaining better load balancing and higher GPU resource utilization.
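The per-edge work in the small-list case is a linear two-pointer intersection of two sorted neighbor lists; the strategy above then amounts to bucketing edges by neighbor-list size. A Python sketch of both ideas (illustrative only; the real kernels are CUDA, and the threshold value here is made up):

```python
# Two-pointer intersection of two sorted neighbor lists, as a single
# thread of the TwoSmall kernel would perform it. (Illustrative Python.)
def intersect_sorted(a, b):
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common

# Bucketing edges by neighbor-list size, as in the two-kernel strategy
# (threshold=32 is an illustrative value, not Gunrock's actual cutoff):
def bucket_edges(edges, neighbors, threshold=32):
    small, large = [], []
    for u, v in edges:
        if len(neighbors[u]) <= threshold and len(neighbors[v]) <= threshold:
            small.append((u, v))   # handled by TwoSmall
        else:
            large.append((u, v))   # handled by TwoLarge
    return small, large
```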
Gunrock is well-suited for this implementation. The major challenge is the need for accumulation when doing intersection. Intersection was originally designed to count the total number of triangles. But in scan statistics, we instead need the triangle count for each node, which introduces an extra atomic add when doing the accumulation if multiple edges share the same node, since we count triangles in parallel for each edge.
Notes on multi-GPU parallelization
Non-ideal graph partitioning could be a bottleneck in a multi-GPU parallelization. Because we need information about each node's two-hop neighborhood, an unbalanced workload will decrease performance and increase communication costs.
Notes on dynamic graphs
Scan statistics as a workload certainly has a dynamic-graph component: in its general form, it takes a time series of graphs as input and tries to identify abnormal behavior in how the statistics change (as mentioned in the referenced papers). To support this general workload, we would need Gunrock to support dynamic graphs in its advance and intersection operators. The application could easily be changed to record the full history of scan statistics and then flag any significant changes given a certain threshold.
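The change-detection step could be sketched as follows. This is a hypothetical extension, not part of the current Gunrock app; `flag_changes` and the threshold semantics are assumptions for illustration:

```python
# Hypothetical sketch: record per-snapshot scan statistics and flag nodes
# whose statistic changes by more than a threshold between snapshots.
def flag_changes(history, threshold):
    """history: list of per-snapshot scan-stat arrays (same node order)."""
    flagged = set()
    for prev, curr in zip(history, history[1:]):
        for node, (p, c) in enumerate(zip(prev, curr)):
            if abs(c - p) > threshold:
                flagged.add(node)
    return flagged
```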
Notes on larger datasets
For a dataset that is too large to fit on a single GPU, we can leverage Gunrock's multi-GPU implementation to run on multiple GPUs in a single node. The implementation won't change much, since Gunrock already has good multi-GPU support. We expect performance and memory usage to scale linearly given a good graph partitioning method.
For a dataset that cannot fit onto multiple GPUs on a single node, we need distributed computation, which Gunrock doesn't support yet. However, we can leverage open-source libraries such as NCCL and Horovod that support this. Performance-wise, the way the graph is partitioned, as well as the properties of the graph, will affect the communication bottleneck. Since we need to calculate the total number of triangles each node is involved in, if we cannot fit the entire neighborhood of a node on a single node, we need other compute resources' help to compute its scan statistic. The worst-case scenario is a fully connected graph, in which case we must wait for the counting results from all other compute resources and then sum them up. In this case, with good load-balanced scheduling, we can potentially minimize the communication bottleneck and reduce latency.
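The partition-then-reduce scheme described above can be emulated in a single process. This is an illustrative sketch only (no actual distribution; each "worker" here sees the full adjacency, which a real distributed version would avoid):

```python
# Sketch: partition the edge list across "workers", let each compute its
# local per-node triangle contributions, then reduce the partial counts
# and add each node's degree. (Single-process emulation, illustrative only.)
def distributed_scan_stats(n, edges, num_workers=2):
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    chunks = [edges[i::num_workers] for i in range(num_workers)]
    partials = []
    for chunk in chunks:                  # each "worker" handles one chunk
        counts = [0] * n
        for u, v in chunk:
            for w in adj[u] & adj[v]:     # common neighbors close triangles
                counts[w] += 1
        partials.append(counts)
    # Final reduction: sum partial triangle counts and add each degree.
    return [len(adj[u]) + sum(p[u] for p in partials) for u in range(n)]
```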
Notes on other pieces of this workload
The other important piece of this work is statistical time series analysis on dynamic changes of the graph. We hope to add dynamic graph capability in the future.
This application leverages a classic triangle-counting problem to solve a more complex statistics problem. Instead of directly using the existing solution, we solve it from a different angle: rather than counting the total number of triangles in the graph, we instead count triangles from each node's perspective.
From this app, we see the flexibility of Gunrock's intersection operation. This operation could potentially be used in other graph analytics problems, such as community detection.