toc_footers:
  - Gunrock: GPU Graph Analytics
  - Gunrock © 2018 The Regents of the University of California.

search: true

full_length: true

GraphSearch

The graph search (GS) workflow is a walk-based method that searches a graph for nodes that score highly on some arbitrary indicator of interest.

The use case given by the HIVE government partner was sampling a graph: given some seed nodes, and some model that can score a node as "interesting", find lots of "interesting" nodes as quickly as possible. Their algorithm attempts to solve this problem by implementing several different strategies for walking the graph.

Use of these walk-based methods is motivated by the presence of homophily in many real-world social networks: we expect interesting people to have relationships with other interesting people.

Summary of Results

Graph search is a relatively minor modification to Gunrock's random walk application, and was straightforward to implement. Though random walks are a "worst case scenario" for GPU memory bandwidth, we still achieve 3--5x speedup over a modified version of the OpenMP reference implementation.

The original OpenMP reference implementation actually ran slower with more threads -- we fixed the bugs, but the benchmarking experience highlights the need for performant and hardened CPU baselines.

Until recently, Gunrock did not support parallelism within the lambda functions run by the advance operator, so neighbor selection for a given step in the walk is done sequentially. Methods for exposing more parallelism to the programmer are currently being developed via parallel neighbor reduce functions.

In an end-to-end graph search application, we'd need to implement the scoring function as well as the graph walk component. For performance, we'd likely want to implement the scoring function on the GPU as well, which makes this a good example of a "Gunrock+X" app, where we'd need to integrate the high-performance graph processing component with arbitrary user code.

Summary of Gunrock Implementation

The scoring model can be an arbitrary function (e.g., of node metadata). For example, if we were running GS on the Twitter friends/followers graph, the scoring model might be the output of a text classifier on each user's messages. Thus, we do not implement the scoring model in our Gunrock implementation -- instead, we read scores from an input file and access them as necessary.

GS is a generalization of a random walk implementation, where there can be more variety in the transition function between nodes.

The GS uniform mode is exactly a uniform random walk, so we can use the pre-existing Gunrock application. Given a node, we compute the node to walk to as:

import random
from math import floor

r = random.random()   # uniform in [0, 1); avoids an out-of-range index at r == 1
neighbors = graph.get_neighbors(node)
next_node = neighbors[floor(r * len(neighbors))]

Both the GraphSearch greedy and stochastic_greedy modes are small modifications of this transition function.

For greedy, we find the neighbor with maximum score:

neighbors = graph.get_neighbors(node)
next_node = neighbors[0]
next_node_score = scores[next_node]
for neighbor in neighbors:
    neighbor_score = scores[neighbor]
    if neighbor_score > next_node_score:
        next_node = neighbor
        next_node_score = neighbor_score

For stochastic_greedy, we sample neighbors with probability proportional to their scores -- e.g.:

r = random.random()   # uniform in [0, 1)

sum_neighbor_scores = 0
for neighbor in graph.get_neighbors(node):
    sum_neighbor_scores += scores[neighbor]

r *= sum_neighbor_scores

tmp = 0
for neighbor in graph.get_neighbors(node):
    tmp += scores[neighbor]
    if r < tmp:
        next_node = neighbor
        break

In Gunrock, we create a frontier containing all of the nodes we want to walk from. Then we map the transition function over the frontier using Gunrock's ForEach operator. Current nodes in the frontier are replaced with the chosen neighbor, and the walk is (optionally) recorded in an output array.
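A serial sketch of this driver loop may help make it concrete (hypothetical transition and seed_nodes names; in Gunrock, the inner loop body is what the ForEach operator runs in parallel, one thread per active walk):

# Serial sketch of the walk driver. `transition` is one of the three
# functions above; `seed_nodes`, `graph`, and `walk_length` are assumed.
frontier = list(seed_nodes)
walks = [[node] for node in frontier]      # optional record of each walk
for step in range(walk_length):
    for i, node in enumerate(frontier):    # ForEach maps over the frontier
        next_node = transition(graph, node)
        frontier[i] = next_node            # replace node with chosen neighbor
        walks[i].append(next_node)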

Because this is such a straightforward modification, we implement GS inside the existing Gunrock random walk (rw) application. GS just requires a couple of extra flags and one extra array of size |V| to store the node values.

How To Run This Application on DARPA's DGX-1

Prereqs/input

git clone --recursive https://github.com/gunrock/gunrock -b dev-refactor
cd gunrock/tests/rw/
cp ../../gunrock/util/gitsha1.c.in ../../gunrock/util/gitsha1.c
make clean
make

Running the application

Application specific parameters

  --walk-mode
      0 = uniform
      1 = greedy
      2 = stochastic_greedy
  --node-value-path
      If --walk-mode != 0, this is the path to node scores
  --store-walks
      0 = just do the walk -- don't actually store it anywhere
      1 = store walks in memory
  --walk-length
      Length of each walk
  --walks-per-node
      Number of walks to do per seed node
  --seed
      Seed for random number generator

Example Command

# generate random features
python random-values.py 39 > chesapeake.values

# uniform random
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file \
  ../../dataset/small/chesapeake.mtx --walk-mode 0 --seed 123

# greedy
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file \
  ../../dataset/small/chesapeake.mtx --node-value-path chesapeake.values \
  --walk-mode 1

# stochastic greedy
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file \
  ../../dataset/small/chesapeake.mtx --node-value-path chesapeake.values \
  --walk-mode 2 --seed 123
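For reference, the chesapeake.values file generated above is just a list of random node scores. A minimal sketch of what a script like random-values.py might look like (this is an assumption about the format -- one uniformly random score per node, one per line; the actual format is defined by the script that ships with the app):

# Hypothetical sketch of a score generator like random-values.py:
# emit one uniformly random score per node, one per line.
import random
import sys

num_nodes = int(sys.argv[1])   # e.g., 39 for chesapeake
for _ in range(num_nodes):
    print(random.random())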

Example Output

# ------------------------------------------------
# uniform random

Loading Matrix-market coordinate-formatted graph ...
Reading from ../../dataset/small/chesapeake.mtx:
  Parsing MARKET COO format
 (39 nodes, 340 directed edges)...
Done parsing (0 s).
  Converting 39 vertices, 340 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
__________________________
--------------------------
 Elapsed: 0.001907
Using advance mode LB
Using filter mode CULL
num_nodes=39
__________________________
0    0   0   queue3      oversize :  234 ->  682
0    0   0   queue3      oversize :  234 ->  682
0    1   0   queue3      oversize :  682 ->  1085
0    1   0   queue3      oversize :  682 ->  1085
0    5   0   queue3      oversize :  1085 ->     1166
0    5   0   queue3      oversize :  1085 ->     1166
--------------------------
Run 0 elapsed: 4.551888, #iterations = 10
[[0, 38, 8, 35, 11, 25, 13, 27, 37, 7, ],
[1, 34, 1, 38, 30, 38, 29, 37, 7, 37, ],
[2, 17, 2, 38, 4, 38, 10, 18, 14, 28, ],
...
[36, 33, 0, 22, 38, 27, 37, 18, 38, 8, ],
[37, 21, 31, 17, 25, 17, 18, 32, 37, 26, ],
[38, 7, 8, 34, 6, 5, 6, 5, 38, 19, ]]
-------- NO VALIDATION -----[rw] finished.
 avg. elapsed: 4.551888 ms
 iterations: 10
 min. elapsed: 4.551888 ms
 max. elapsed: 4.551888 ms
 load time: 60.925 ms
 preprocess time: 964.890000 ms
 postprocess time: 0.715017 ms
 total time: 970.350027 ms

# ------------------------------------------------
# greedy
# !! In this case, the output is formatted as `GPU_result:CPU_result`, for correctness checking

Loading Matrix-market coordinate-formatted graph ...
Reading from ../../dataset/small/chesapeake.mtx:
  Parsing MARKET COO format
 (39 nodes, 340 directed edges)...
Done parsing (0 s).
  Converting 39 vertices, 340 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
__________________________
--------------------------
 Elapsed: 0.085831
Using advance mode LB
Using filter mode CULL
num_nodes=39
__________________________
0    0   0   queue3      oversize :  234 ->  682
0    0   0   queue3      oversize :  234 ->  682
0    1   0   queue3      oversize :  682 ->  770
0    1   0   queue3      oversize :  682 ->  770
--------------------------
Run 0 elapsed: 0.695944, #iterations = 10
[[0:0, 22:22, 32:32, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, ],
[1:1, 22:22, 32:32, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, ],
[2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, ],
...
[36:36, 33:33, 36:36, 33:33, 36:36, 33:33, 36:36, 33:33, 36:36, 33:33, ],
[37:37, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, ],
[38:38, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, ]]
0 errors occurred.
[rw] finished.
 avg. elapsed: 0.695944 ms
 iterations: 10
 min. elapsed: 0.695944 ms
 max. elapsed: 0.695944 ms
 load time: 44.2419 ms
 preprocess time: 974.721000 ms
 postprocess time: 0.731945 ms
 total time: 976.338863 ms

# ------------------------------------------------
# stochastic_greedy
# Output same format as `uniform` above.
# No correctness checking is implemented due to stochasticity.

Expected Output

When run in --verbose mode, the app outputs the walks. When run in --quiet mode, it outputs performance statistics (e.g., total number of steps taken). If running greedy GraphSearch, the app also outputs the results of a correctness check. Correctness checks for uniform and stochastic_greedy are omitted because of their inherent stochasticity.

Validation

The correctness of the implementation has been validated in separate experiments, by checking that the output walks are valid (each step follows an edge of the graph) and that the distribution of transitions matches the transition function.
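A minimal version of such a check, in the same pseudocode style as above (helper names assumed), might look like:

# Validation sketch: every consecutive pair in a walk must be an edge,
# and the empirical successor distribution of a node should match the
# transition function (uniform, greedy argmax, or score-proportional).
from collections import Counter

def walk_is_valid(graph, walk):
    return all(b in graph.get_neighbors(a) for a, b in zip(walk, walk[1:]))

def successor_counts(walks, node):
    # empirical distribution of transitions taken out of `node`
    return Counter(b for w in walks for a, b in zip(w, w[1:]) if a == node)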

Performance and Analysis

Performance is measured by the runtime of the app, given an input graph and a set of application parameters (walk mode, walk length, walks per seed, etc.).

Implementation limitations

The output of the random walk is a dense array of size (# seeds) * (steps per walk) * (walks per seed). When we have a large graph or long walks or multiple walks per seed, this array may exceed the size of GPU memory.
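As a back-of-envelope example using the Twitter runs below (and assuming 4-byte vertex IDs, which depends on the build's vertex type):

# Size of the dense output array for one 128-step walk from every node:
num_seeds      = 7_199_978   # one walk per node of the Twitter graph
walk_length    = 128         # --walk-length
walks_per_seed = 1           # --walks-per-node
bytes_per_id   = 4           # assumption: 32-bit vertex IDs

total_bytes = num_seeds * walk_length * walks_per_seed * bytes_per_id
print(total_bytes / 2**30)   # ~3.4 GiB, before any other app state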

At the moment, we only support walks starting from all of the nodes in G. It would be straightforward to add a parameter that would allow the user to specify a smaller set of seed nodes.

This app can only be used on graphs that have scores associated with each node. In order to run benchmarks when scores are not available, we often assign uniformly random scores to nodes. The distribution of these scores may affect the runtime of the algorithm by changing data access patterns -- we test on the provided Twitter dataset, but do not have a variety of other node-attributed graphs to test on.

Comparison against existing implementations

We measure runtime on the HIVE graphsearch Twitter dataset. This graph has |V| = 9,291,392 nodes and |E| = 21,741,663 edges.

At a high level, the results show:

Variant             OpenMP w/ 64 threads   Gunrock GPU   Gunrock Speedup
Directed greedy     236 ms                 64 ms         3.7x
Directed random     158 ms                 34 ms         4.6x
Undirected random   3186 ms                630 ms        5.0x

The undirected random walks take roughly 10x longer because directed walks terminate when they encounter a node without any outgoing neighbors, and thus have an average length significantly shorter than the --walk-length parameter.
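The step counts reported by the Gunrock runs below make this concrete:

# Average walk length = total_steps_taken / number of seeds:
num_seeds = 7_199_978
print(16_530_404 / num_seeds)    # directed uniform: ~2.3 steps before a sink
print(914_397_206 / num_seeds)   # undirected uniform: 127 steps (full walk)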

Details and raw data follow.

HIVE Python reference implementation

We ran the HIVE Python reference implementation with the uniform transition function; the run took 41 seconds. Walks are done sequentially, so runtime scales linearly with the number of seeds. This implementation is substantially slower than even a single-threaded run of PNNL's OpenMP code, so we omit further analysis.

PNNL OpenMP implementation

We run the PNNL OpenMP implementation on the Twitter graph; per-thread results are shown in the table below.

We omit the greedy undirected case because the algorithm gets stuck jumping between a local maximum and its highest-scoring neighbor.

threads  method  directed?  nseeds   elapsed_sec  nsteps    steps_per_sec
1        greedy  yes        7199978  3.02876      16325873  5.39e+06
2        greedy  yes        7199978  2.83467      16325873  5.75e+06
4        greedy  yes        7199978  1.64405      16325873  9.93e+06
8        greedy  yes        7199978  0.870028     16325873  1.87e+07
16       greedy  yes        7199978  0.605769     16325873  2.69e+07
32       greedy  yes        7199978  0.43742      16325873  3.73e+07
64       greedy  yes        7199978  0.236701     16325873  6.89e+07
1        unif.   yes        7199978  14.6291      14510781  991915
2        unif.   yes        7199978  24.2175      14186833  585809
4        unif.   yes        7199978  25.1764      14487202  575427
8        unif.   yes        7199978  27.7312      13937449  502591
16       unif.   yes        7199978  30.5377      14062226  460488
32       unif.   yes        7199978  32.1057      13906144  433137
64       unif.   yes        7199978  31.2754      13876284  443680
1        unif.   no         100000   12.3982      12700000  1.02e+06
2        unif.   no         100000   19.7925      12700000  641658
4        unif.   no         100000   22.5432      12700000  563362
8        unif.   no         100000   26.1053      12700000  486491
16       unif.   no         100000   28.275       12700000  449160
32       unif.   no         100000   28.334       12700000  448224
64       unif.   no         100000   28.7419      12700000  441864

Note that we use fewer seeds for the undirected uniform case due to slow runtime.

Observe that the uniform modes scale very badly with the number of cores. After investigation, this was due to two issues: first, neighbors were being sampled incorrectly, which led to chaotic behavior; second, the app was using a slow random number generator with an excessive number of seed resets. We created a PR to fix those issues here.

After these fixes, runtimes were as follows:

threads  method  directed?  nseeds   elapsed_sec  nsteps     steps_per_sec
1        greedy  yes        7199978  3.02876      16325873   5.39e+06
2        greedy  yes        7199978  2.83467      16325873   5.75e+06
4        greedy  yes        7199978  1.64405      16325873   9.93e+06
8        greedy  yes        7199978  0.870028     16325873   1.87e+07
16       greedy  yes        7199978  0.605769     16325873   2.69e+07
32       greedy  yes        7199978  0.43742      16325873   3.73e+07
64       greedy  yes        7199978  0.236701     16325873   6.89e+07
1        unif.   yes        7199978  1.49886      16529694   1.10e+07
2        unif.   yes        7199978  1.60176      16533004   1.03e+07
4        unif.   yes        7199978  0.974128     16538957   1.69e+07
8        unif.   yes        7199978  0.455227     16534756   3.63e+07
16       unif.   yes        7199978  0.257524     16528617   6.42e+07
32       unif.   yes        7199978  0.155722     13906144   1.06e+08
64       unif.   yes        7199978  0.158828     16537488   1.04e+08
1        unif.   no         7199978  125.963      914397206  1.92e+08
2        unif.   no         7199978  78.927       914397206  1.80e+08
4        unif.   no         7199978  39.7097      914397206  2.96e+08
8        unif.   no         7199978  22.5195      914397206  6.35e+08
16       unif.   no         7199978  11.0047      914397206  1.12e+09
32       unif.   no         7199978  5.56317      914397206  1.85e+09
64       unif.   no         7199978  3.18615      914397206  1.82e+09

Note the improved runtimes and scaling. These experiments were run with this branch at commit 6c25a0687eecebfd4393e86fa4c7308d5594b73d.

All experiments are conducted on the HIVE DGX-1.

Gunrock GPU implementation

directed, greedy
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file dir_gs_twitter.mtx \
    --node-value-path gs_twitter.values \
    --walk-mode 1 \
    --walk-length 32 \
    --undirected=0 \
    --store-walks 0 \
    --quick \
    --num-runs 10
Loading Matrix-market coordinate-formatted graph ...
Reading from dir_gs_twitter.mtx:
  Parsing MARKET COO format
 (7199978 nodes, 21741663 directed edges)...
Done parsing (7 s).
  Converting 7199978 vertices, 21741663 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
==============================================
 advance-mode=LB
Using advance mode LB
Using filter mode CULL
Run 0 elapsed: 65.273046, #iterations = 32
Run 1 elapsed: 64.157963, #iterations = 32
Run 2 elapsed: 64.009190, #iterations = 32
Run 3 elapsed: 64.055920, #iterations = 32
Run 4 elapsed: 64.069033, #iterations = 32
Run 5 elapsed: 64.002037, #iterations = 32
Run 6 elapsed: 64.031839, #iterations = 32
Run 7 elapsed: 64.036846, #iterations = 32
Run 8 elapsed: 64.065933, #iterations = 32
Run 9 elapsed: 64.047098, #iterations = 32
Validate_Results: total_neighbors_seen=298668024
Validate_Results: total_steps_taken=16325873
-------- NO VALIDATION --------
[rw] finished.
 avg. elapsed: 64.174891 ms
 iterations: 32
 min. elapsed: 64.002037 ms
 max. elapsed: 65.273046 ms
 load time: 7086.91 ms
 preprocess time: 1016.620000 ms
 postprocess time: 101.121902 ms
 total time: 2073.837996 ms
directed, uniform
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file dir_gs_twitter.mtx \
    --node-value-path gs_twitter.values \
    --walk-mode 0 \
    --walk-length 128 \
    --undirected=0 \
    --store-walks 0 \
    --quick \
    --num-runs 10 \
    --seed 123
Loading Matrix-market coordinate-formatted graph ...
Reading from dir_gs_twitter.mtx:
  Parsing MARKET COO format
 (7199978 nodes, 21741663 directed edges)...
Done parsing (7 s).
  Converting 7199978 vertices, 21741663 directed edges ( ordered tuples) to CSR format...
Done converting (1s).
==============================================
 advance-mode=LB
Using advance mode LB
Using filter mode CULL
__________________________
Run 0 elapsed: 38.613081, #iterations = 128
Run 1 elapsed: 34.458876, #iterations = 128
Run 2 elapsed: 34.530163, #iterations = 128
Run 3 elapsed: 33.849001, #iterations = 128
Run 4 elapsed: 33.759117, #iterations = 128
Run 5 elapsed: 33.967972, #iterations = 128
Run 6 elapsed: 33.873081, #iterations = 128
Run 7 elapsed: 33.970118, #iterations = 128
Run 8 elapsed: 33.756971, #iterations = 128
--------------------------
Run 9 elapsed: 33.715963, #iterations = 128
Validate_Results: total_neighbors_seen=289124779
Validate_Results: total_steps_taken=16530404
-------- NO VALIDATION --------
[rw] finished.
 avg. elapsed: 34.449434 ms
 iterations: 128
 min. elapsed: 33.715963 ms
 max. elapsed: 38.613081 ms
 load time: 7176.17 ms
 preprocess time: 1016.720000 ms
 postprocess time: 101.902962 ms
 total time: 1781.071901 ms
undirected, uniform
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file undir_gs_twitter.mtx \
    --node-value-path gs_twitter.values \
    --walk-mode 0 \
    --walk-length 128 \
    --store-walks 0 \
    --quick \
    --num-runs 10 \
    --seed 123
Loading Matrix-market coordinate-formatted graph ...
Reading from undir_gs_twitter.mtx:
  Parsing MARKET COO format
 (7199978 nodes, 43483326 directed edges)...
Done parsing (7 s).
  Converting 7199978 vertices, 43483326 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
==============================================
 advance-mode=LB
Using advance mode LB
Using filter mode CULL
Run 0 elapsed: 636.021852, #iterations = 128
Run 1 elapsed: 631.129026, #iterations = 128
Run 2 elapsed: 631.053925, #iterations = 128
Run 3 elapsed: 631.713152, #iterations = 128
Run 4 elapsed: 631.028175, #iterations = 128
Run 5 elapsed: 631.374836, #iterations = 128
Run 6 elapsed: 631.196976, #iterations = 128
Run 7 elapsed: 632.030964, #iterations = 128
Run 8 elapsed: 631.026983, #iterations = 128
Run 9 elapsed: 630.996943, #iterations = 128
Validate_Results: total_neighbors_seen=75443835041
Validate_Results: total_steps_taken=914397206
-------- NO VALIDATION --------
[rw] finished.
 avg. elapsed: 631.757283 ms
 iterations: 128
 min. elapsed: 630.996943 ms
 max. elapsed: 636.021852 ms
 load time: 7705.9 ms
 preprocess time: 1010.830000 ms
 postprocess time: 102.057934 ms
 total time: 7755.448818 ms

Performance limitations

When we do a large number of walks and/or each walk is very long, there may not be enough GPU memory to store all of the walks. For now, we expose the --store-walks parameter -- when this is set to zero, each walk is discarded as it is computed and only its length is stored. A better solution that could be implemented in the future would be to move walks from GPU to CPU memory as they grow too large.

Optimization: In a directed walk, once we hit a node with no outgoing neighbors, we halt the walk. In the current Gunrock implementation, the enactor runs for a fixed number of iterations, regardless of whether any of the nodes are still active. It would be straightforward to add a check that terminates the app when no "living" nodes are left.
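Continuing the serial driver sketch from earlier (with a terminated walk marked by None; names are hypothetical), the check might look like:

# Early-exit sketch: stop the iteration loop once no walk is still active.
for step in range(walk_length):
    if all(node is None for node in frontier):
        break                                   # no "living" nodes left
    for i, node in enumerate(frontier):
        if node is None:
            continue                            # walk already hit a sink
        neighbors = graph.get_neighbors(node)
        frontier[i] = transition(graph, node) if neighbors else None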

Next Steps

Alternate approaches

The size of the output array may become a significant bottleneck for large graphs. However, since none of the transition functions depend on anything besides the current node, we could reasonably move the results of the walk from GPU to CPU memory every N iterations. Properly executed, this should eliminate the largest bottleneck without unduly impacting performance.
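A sketch of that scheme (all helper names here are hypothetical; in practice the copy would run on a separate CUDA stream to overlap with compute):

# Offload sketch: keep only a FLUSH_EVERY-step window of the walks on
# the GPU and append it to a host-side buffer every FLUSH_EVERY steps.
FLUSH_EVERY = 16                                      # hypothetical window size
gpu_window  = alloc_device(num_seeds * FLUSH_EVERY)   # hypothetical allocator
host_walks  = []
for step in range(walk_length):
    take_step(gpu_window, step % FLUSH_EVERY)         # hypothetical kernel call
    if (step + 1) % FLUSH_EVERY == 0:
        host_walks.append(copy_to_host(gpu_window))   # hypothetical D2H copy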

Gunrock implications

For the greedy and stochastic_greedy transition functions, we currently have to iterate sequentially over all of a node's neighbors. Simple wrappers for computing, e.g., the maximum of node scores across all of a node's neighbors could be helpful, both for ease of programming and for performance. Gunrock has a newly added NeighborReduce kernel that supports associative reductions -- it should be straightforward to implement (at least) the greedy transition function with this kernel. The stochastic_greedy transition function would require a more complex reduction along the lines of reservoir sampling.
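For instance, the greedy transition above is just an argmax over a node's neighbors -- an associative reduction that a NeighborReduce-style kernel could evaluate in parallel:

# Greedy transition as a single max-reduction over (score, neighbor)
# pairs, instead of the sequential loop shown earlier:
next_node = max(graph.get_neighbors(node), key=lambda v: scores[v])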

Notes on multi-GPU parallelization

If the graph is small enough to be duplicated on each GPU, the implementation is trivial: just do a subset of the walks on each GPU. The scalability will be perfect, as there is no communication involved at all.
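A sketch of that partitioning (hypothetical names):

# Replicated-graph multi-GPU sketch: each GPU gets an equal slice of the
# seeds and runs fully independent walks; no inter-GPU communication.
num_gpus = 4                                       # hypothetical
seed_slices = [seeds[g::num_gpus] for g in range(num_gpus)]
# launch the rw app on GPU g with seed set seed_slices[g]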

When the graph is distributed across multiple GPUs, we expect to have very poor scalability, as the ratio of computation to communication is very low. A more detailed discussion is available here.

Notes on dynamic graphs

This workflow does not have an explicit dynamic component. However, because steps only depend on the current node, the underlying graph could change during the walks.

Notes on larger datasets

The random accesses inherent to graph search make it a particularly difficult workflow for larger-than-GPU memory datasets. The most straightforward solution would be to let Unified Virtual Memory (UVM) in CUDA automatically handle memory movement, but we should expect to see a substantial reduction in performance.

Notes on other pieces of this workload

In real use cases, the scoring function would be computed lazily -- that is, we wouldn't have a precomputed array with scores for each of the nodes, and we would need to run the scoring function as the walk is running. Thus, it would be critical for us to be able to call the scoring function from within Gunrock quickly and without excessive programmer overhead.