toc_footers: - Gunrock: GPU Graph Analytics - Gunrock © 2018 The Regents of the University of California.

search: true

## full_length: true

# Scan Statistics

Scan statistics, as described in Priebe et al., is the generic method that computes a statistic for the neighborhood of each node in the graph, and looks for anomalies in those statistics. In this workflow, we implement a specific version of scan statistics where we compute the number of edges in the subgraph induced by the one-hop neighborhood of each node $u$ in the graph. It turns out that this statistic is equal to the number of triangles that node $u$ participates in plus the degree of $u$. Thus, we are able to implement scan statistics by making relatively minor modifications to our existing Gunrock triangle counting (TC) application.

## Summary of Results

Scan statistics applied to static graphs fits perfectly into the Gunrock framework. Using a combination of `ForAll`

and Intersection operations, we outperform the parallel OpenMP CPU reference by up to 45.4 times speedup on the small Enron graph (provided as part of the HIVE workflows) and up to a 580 times speedup on larger graphs that feature enough computation to saturate the throughput of the GPU.

## Algorithm: Scan Statistics

Input is an undirected graph w/o self-loops.

```
scan_stats = [len(graph.neighbors(u)) for u in graph.nodes]
for (u, v) in graph.edges:
if u < v:
u_neibs = graph.neighbors(u)
v_neibs = graph.neighbors(v)
for shared_neib in intersect(u_neibs, v_neibs):
scan_stats[shared_neib] += 1
return argmax([scan_stats(node) for node in graph.nodes])
```

## Summary of Gunrock implementation

```
max_scan_stat = -1
node = -1
src_node_ids[nodes]
scan_stats[nodes]
ForAll (src_node_ids, scan_stats): fill scan_stats with the degree of each node.
Intersection (src_node_ids, scan_stats): intersect neighboring nodes of both
nodes of each edge; add 1 to scan_stats[common_node] for each common_node
we get from the intersection.
return [scan_stats]
```

## How To Run This Application on DARPA's DGX-1

### Prereqs/input

```
git clone --recursive https://github.com/gunrock/gunrock \
-b dev-refactor
cd gunrock/tests/ss/
make clean && make
```

### Running the application

Application specific parameters:

Example command-line:

```
./bin/test_ss_main_10.0_x86_64 \
--graph-type=market \
--graph-file=./enron.mtx \
--undirected --num_runs=10
```

### Output

The output of this app is an array of uint values: the scan statistics values for each node. The output file will be in `.txt`

format with the aforementioned values.

We compare our GPU output with the HIVE CPU reference implementation implemented using OpenMP.

Details of the datasets:

DataSet | $\cardinality{V}$ | $\cardinality{E}$ | #dangling vertices |
---|---|---|---|

enron | 15056 | 57074 | 0 |

ca | 108299 | 186878 | 85166 |

amazon | 548551 | 1851744 | 213688 |

coAuthorsDBLP | 299067 | 1955352* | 0 |

citationCiteseer | 268495 | 2313294* | 0 |

as-Skitter | 1696415 | 22190596* | 0 |

coPapersDBLP | 540486 | 30481458* | 0 |

pokec | 1632803 | 30622564 | 0 |

coPapersCiteseer | 434102 | 32073440* | 0 |

akamai | 16956250 | 53300364 | 0 |

soc-LiveJournal1 | 4847571 | 68475391 | 962 |

europe_osm | 50912018 | 108109320* | 0 |

hollywood-2009 | 11399905 | 112751422* | 32662 |

rgg_n_2_24_s0 | 16777216 | 265114400* | 1 |

Running time in milliseconds:

GPU | Dataset | Gunrock GPU | Speedup vs. OMP | OMP |
---|---|---|---|---|

P100 | enron | 0.461 |
45.4 | 20.95 |

P100 | ca | 0.219 |
71.6 | 15.681 |

P100 | amazon | 1.354 |
74.5 | 100.871 |

P100 | coAuthorsDBLP | 1.569 |
88.0 | 138.111 |

P100 | citationCiteseer | 3.936 |
65.22 | 256.694 |

P100 | as-Skitter | 111.738 |
579.22 | 64721.414 |

P100 | coPapersDBLP | 226.672 |
25.4 | 5766.4 |

P100 | pokec | 202.185 |
80.7 | 16316.474 |

P100 | coPapersCiteseer | 451.582 |
16.34 | 7378.188 |

P100 | akamai | 151.47 |
5.13 | 12596.36 |

P100 | soc-LiveJournal1 | 1.548 |
2.59 | 4.016 |

P100 | europe_osm | 9.59 |
153.35 | 1470.632 |

P100 | hollywood-2009 | 10032.46 |
18.76 | 188206.234 |

P100 | rgg_n_2_24_s0 | 539.559 |
29.45 | 15887.536 |

## Performance and Analysis

We measure the performance by runtime. The whole process runs on the GPU; we don't include data copy time and regard it as part of preprocessing. The algorithm is deterministic so does not require any accuracy comparison.

### Implementation limitations

Since the implementation only needs `number of nodes`

's size of memory allocated on the GPU, so the largest dataset that can fit on a single GPU is limited by GPU on-die memory size / number of nodes.

Our implementation regards the graph as undirected and is limited to graphs of that type.

### Performance limitations

We currently use atomic adds to accumulate the number of triangles for each node; atomic operations are relatively slow.

## Next Steps

### Alternate approaches

The most time-consuming operation for scan statistics is the intersection operator. We believe we can improve its performance in future work.

Currently we divide the edge lists into two groups:
(1) small neighbor lists and (2) large neighbor lists. We implement two kernels (`TwoSmall`

and `TwoLarge`

) that cooperatively compute intersections. Our `TwoSmall`

kernel uses one thread to compute the intersection of a node pair. Our `TwoLarge`

kernel partitions the edge list into chunks and assigns each chunk to a separate thread block. Then each block uses the balanced path primitive to cooperatively compute intersections. However, these two kernels do not cover the case of one small neighbor list and one large neighbor list. By using a 3-kernel strategy and carefully choosing a threshold value to divide the edge list into two groups, we could potentially process intersections with the same level of workload together, gaining better load balancing and a higher GPU resource utilization.

### Gunrock implications

Gunrock is well-suited for this implementation. The major challenge is the need for accumulation when doing intersection. Intersection was originally designed to count the total number of triangles. But in scan statistics, we instead need the triangle count for each node, which introduces an extra atomic add when doing the accumulation if multiple edges share the same node, since we count triangles in parallel for each edge.

### Notes on multi-GPU parallelization

Non-ideal graph partitioning could be a bottleneck on a multi-GPU parallelization. Because we need the info of each node's two-hop neighbors, an unbalanced workload will decrease the performance and increase the communication bottleneck.

### Notes on dynamic graphs

Scan statistics as a workload certainly has a dynamic-graph component: in its general form, it inputs a time-series graph and tries to identify any abnormal behavior in any statistics change (as mentioned in the referenced papers). To support this general workload, we would need Gunrock to support dynamic graphs in its advance and intersection operators. The application could be easily changed to record all history scan statistics and then to try to find any significant changes given a certain threshold.

### Notes on larger datasets

For a dataset which is too large to fit into a single GPU, we can leverage the multi-GPU implementation of Gunrock to make it work on multiple GPUs on a single node. The implementation won't change a lot since Gunrock already has good support in its multi-GPU implementation. We expect the performance and memory usage to scale linearly with good graph partionling method.

For a dataset that cannot fit onto multiple GPUs on a single node, we need a distributed level of computation, which Gunrock doesn't support yet. However, we can leverage open-source libraries such as NCCL and Horovod that support this. Performance-wise, the way of partitioning the graph as well as the properties of a graph will affect the communication bottleneck. Since we need to calculate the total number of triangles each node is involved in, if we cannot fit the entire neighborhood of a node on a single node, we need other compute resources' help to compute its scan statistic. The worst-case senario is if the graph is fully connected, in which case we must to wait for the counting results from all other compute resources and then sum them up. In this case, if we can do good load-balanced scheduling, we can potentially minimize the communication bottleneck and reduce latency.

### Notes on other pieces of this workload

The other important piece of this work is statistical time series analysis on dynamic changes of the graph. We hope to add dynamic graph capability in the future.

### Research value

This application leverages a classic triangle-counting problem to solve a more complex statistics problem. Instead of directly using the existing solution, we solve it from a different angle: rather than counting the total number of triangles in the graph, we instead count triangles from each node's perspective.

From this app, we see the flexibility of Gunrock's intersection operation. This operation could potentially be used in other graph analytics problems such as community detection, etc.

## References

Scan Statistics on Enron Graphs

Statistical inference on random graphs: Comparative power analyses via Monte Carlo