toc_footers: - Gunrock: GPU Graph Analytics - Gunrock © 2018 The Regents of the University of California.

search: true

## full_length: true

# GraphSearch

The graph search (GS) workflow is a walk-based method that searches a graph for nodes that score highly on some arbitrary indicator of interest.

The use case given by the HIVE government partner was sampling a graph: given some seed nodes, and some model that can score a node as "interesting", find lots of "interesting" nodes as quickly as possible. Their algorithm attempts to solve this problem by implementing several different strategies for walking the graph.

`uniform`

: given a node`u`

, randomly move to one of`u`

's neighbors (ignoring scores)`greedy`

: given a node`u`

, walk to neighbor with maximum score`stochastic_greedy`

: given a node`u`

, choose neighbor to walk to with probability proportional to score

Use of these walk-based methods is motivated by the presence of homophily in many real world social networks: we expect interesting people to have relationships with interesting people.

## Summary of Results

Graph search is a relatively minor modification to Gunrock's random walk application, and was straightforward to implement. Though random walks are a "worst case scenario" for GPU memory bandwidth, we still achieve 3--5x speedup over a modified version of the OpenMP reference implementation.

The original OpenMP reference implementation actually ran slower with more threads -- we fixed the bugs, but the benchmarking experience highlights the need for performant and hardened CPU baselines.

Until recently, Gunrock did not support parallelism *within* the lambda functions run by the `advance`

operator, so neighbor selection for a given step in the walk is done sequentially. Methods for exposing more parallelism to the programmer are currently being developed via parallel neighbor reduce functions.

In an end-to-end graph search application, we'd need to implement the scoring function as well as the graph walk component. For performance, we'd likely want to implement the scoring function on the GPU as well, which makes this a good example of a "Gunrock+X" app, where we'd need to integrate the high-performance graph processing component with arbitrary user code.

## Summary of Gunrock Implementation

The scoring model can be an arbitrary function (e.g., of node metadata). For example, if we were running GS on the Twitter friends/followers graph, the scoring model might be the output of a text classifier on each users' messages. Thus, we do not implement the scoring model in our Gunrock implementation -- instead, we read scores from an input file and access them as necessary.

GS is a generalization of a random walk implementation, where there can be more variety in the transition function between nodes.

The GS `uniform`

mode is exactly a uniform random walk, so we can use the pre-existing Gunrock application. Given a node, we compute the node to walk to as:

```
r = random.uniform(0, 1)
neighbors = graph.get_neighbors(node)
next_node = neighbors[floor(r * len(neighbors))]
```

Both the `GraphSearch`

`greedy`

and `stochastic_greedy`

consist of small modifications to this transition function.

For `greedy`

, we find the neighbor with maximum score:

```
neighbors = graph.get_neighbors(node)
next_node = neighbors[0]
next_node_score = scores[next_node]
for neighbor in neighbors:
neighbor_score = scores[neighbor]
if neighbor_score > next_node_score:
next_node = neighbor
next_node_score = neighbor_score
```

For `stochastic_greedy`

, we sample neighbors proportional to their score -- e.g.:

```
sum_neighbor_scores = 0
for neighbor in graph.neighbors(node):
sum_neighbor_scores += scores[neighbor]
r *= sum_neighbor_scores
tmp = 0
for neighbor in graph.neighbors(node):
tmp += scores[neighbor]
if r < tmp:
next_node = neighbor
break
```

In Gunrock, we create a frontier containing all of the nodes we want to walk from. Then we map the transition function over the frontier using Gunrock's `ForEach`

operator. Current nodes in the frontier are replaced with the chosen neighbor, and the walk is (optionally) recorded in an output array.

Because this is such a straightforward modification, we implement GS inside of the existing random walk `rw`

Gunrock application. GS just requires adding a couple of extra flags and one extra array of size `|V|`

to store the node values.

## How To Run This Application on DARPA's DGX-1

### Prereqs/input

```
git clone --recursive https://github.com/gunrock/gunrock -b dev-refactor
cd gunrock/tests/rw/
cp ../../gunrock/util/gitsha1.c.in ../../gunrock/util/gitsha1.c
make clean
make
```

### Running the application

#### Application specific parameters

```
--walk-mode
0 = uniform
1 = greedy
2 = stochastic_greedy
--node-value-path
If --walk-mode != 0, this is the path to node scores
--store-walks
0 = just do the walk -- don't actually store it anywhere
1 = store walks in memory
--walk-length
Length of each walk
--walks-per-node
Number of walks to do per seed node
--seed
Seed for random number generator
```

#### Example Command

```
# generate random features
python random-values.py 39 > chesapeake.values
# uniform random
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file \
../../dataset/small/chesapeake.mtx --walk-mode 0 --seed 123
# greedy
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file \
../../dataset/small/chesapeake.mtx --node-value-path chesapeake.values \
--walk-mode 1
# stochastic greedy
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file \
../../dataset/small/chesapeake.mtx --node-value-path chesapeake.values \
--walk-mode 2 --seed 123
```

#### Example Output

```
# ------------------------------------------------
# uniform random
Loading Matrix-market coordinate-formatted graph ...
Reading from ../../dataset/small/chesapeake.mtx:
Parsing MARKET COO format
(39 nodes, 340 directed edges)...
Done parsing (0 s).
Converting 39 vertices, 340 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
__________________________
--------------------------
Elapsed: 0.001907
Using advance mode LB
Using filter mode CULL
num_nodes=39
__________________________
0 0 0 queue3 oversize : 234 -> 682
0 0 0 queue3 oversize : 234 -> 682
0 1 0 queue3 oversize : 682 -> 1085
0 1 0 queue3 oversize : 682 -> 1085
0 5 0 queue3 oversize : 1085 -> 1166
0 5 0 queue3 oversize : 1085 -> 1166
--------------------------
Run 0 elapsed: 4.551888, #iterations = 10
[[0, 38, 8, 35, 11, 25, 13, 27, 37, 7, ],
[1, 34, 1, 38, 30, 38, 29, 37, 7, 37, ],
[2, 17, 2, 38, 4, 38, 10, 18, 14, 28, ],
...
[36, 33, 0, 22, 38, 27, 37, 18, 38, 8, ],
[37, 21, 31, 17, 25, 17, 18, 32, 37, 26, ],
[38, 7, 8, 34, 6, 5, 6, 5, 38, 19, ]]
-------- NO VALIDATION -----[rw] finished.
avg. elapsed: 4.551888 ms
iterations: 10
min. elapsed: 4.551888 ms
max. elapsed: 4.551888 ms
load time: 60.925 ms
preprocess time: 964.890000 ms
postprocess time: 0.715017 ms
total time: 970.350027 ms
# ------------------------------------------------
# greedy
# !! In this case, the output is formatted as `GPU_result:CPU_result`, for correctness checking
Loading Matrix-market coordinate-formatted graph ...
Reading from ../../dataset/small/chesapeake.mtx:
Parsing MARKET COO format
(39 nodes, 340 directed edges)...
Done parsing (0 s).
Converting 39 vertices, 340 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
__________________________
--------------------------
Elapsed: 0.085831
Using advance mode LB
Using filter mode CULL
num_nodes=39
__________________________
0 0 0 queue3 oversize : 234 -> 682
0 0 0 queue3 oversize : 234 -> 682
0 1 0 queue3 oversize : 682 -> 770
0 1 0 queue3 oversize : 682 -> 770
--------------------------
Run 0 elapsed: 0.695944, #iterations = 10
[[0:0, 22:22, 32:32, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, ],
[1:1, 22:22, 32:32, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, ],
[2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, ],
...
[36:36, 33:33, 36:36, 33:33, 36:36, 33:33, 36:36, 33:33, 36:36, 33:33, ],
[37:37, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, ],
[38:38, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, ]]
0 errors occurred.
[rw] finished.
avg. elapsed: 0.695944 ms
iterations: 10
min. elapsed: 0.695944 ms
max. elapsed: 0.695944 ms
load time: 44.2419 ms
preprocess time: 974.721000 ms
postprocess time: 0.731945 ms
total time: 976.338863 ms
# ------------------------------------------------
# stochastic_greedy
# Output same format as `uniform` above.
# No correctness checking is implemented due to stochasticity.
```

#### Expected Output

When run in `--verbose`

mode, the app outputs the walks. When run in `--quiet`

mode, it outputs performance statistics (e.g., total number of steps taken). If running `greedy`

`GraphSearch`

, the app also outputs the results of a correctness check. Correctness checks for `uniform`

and `stochastic_greedy`

are omitted because of their inherent stochasticity.

### Validation

The correctness of the implementation has been validated in outside experiments, by making sure that the output walks are valid and the distribution of transitions is as expected.

## Performance and Analysis

Performance is measured by the runtime of the app, given

- an input graph
`G=(U, E)`

- set of seed nodes (hardcoded to all nodes in
`G`

) - number of walks per seed
- number of steps per walk
- a transition function (e.g.,
`uniform|greedy|stochastic_greedy`

)

### Implementation limitations

The output of the random walk is a dense array of size `(# seeds) * (steps per walk) * (walks per seed)`

. When we have a large graph *or* long walks *or* multiple walks per seed, this array may exceed the size of GPU memory.

At the moment, we only support walks starting from *all* of the nodes in `G`

. It would be straightforward to add a parameter that would allow the use to specify a smaller set of seed nodes.

This app can only be used for graphs that have scores associated w/ each node. In order to run benchmarks, if scores are not available we often assign uniformly random scores to nodes. The distribution of these scores may affect the runtime of the algorithm by changing data access patterns -- we test on the provided Twitter dataset, but do not have a variety of other node attributed graphs to test on.

### Comparison against existing implementations

We measure runtime on the HIVE graphsearch Twitter dataset. This graph has `|U|=9291392`

nodes and `|E|=21741663`

edges.

At a high level, the results show:

Variant | OpenMP w/ 64 threads | Gunrock GPU | Gunrock Speedup |
---|---|---|---|

Directed greedy | 236ms | 64ms |
3.7x |

Directed random | 158ms | 34ms |
4.6x |

Undirected random | 3186ms | 630ms |
5.0x |

The undirected random walks take \~ 10x longer because directed walks terminate when they encounter a node without any neighbors and thus have average length significantly shorter than the `--walk-length`

parameter.

Details and raw data follow.

#### HIVE Python reference implementation

We run the HIVE Python reference implementation w/ the following settings:

- undirected graph
- uniform transition function
- 1000 random seeds
- 128 steps per walk

With the `uniform`

transition function, the run took 41 seconds. Walks are done sequentially, so runtime will scale linearly with the number of seeds. This implementation is *substantially* slower than even a single-threaded run of PNNLs OpenMP code. Thus, we omit further analysis.

#### PNNL OpenMP implementation

We run the PNNL OpenMP implementation on the Twitter graph w/ the following settings:

- commit: 69864383f0fc0e8aace52be34b329a2f8a58afb6
- 1,2,4,8,16,32 or 64 threads
`greedy`

or`uniform`

transition function- directed or undirected graph

We omit the `greedy`

undirected case because the algorithm gets stuck jumping between a local maximum and its highest-scoring neighbor.

threads | method | directed? | nseeds | elapsed_sec | nsteps | steps_per_sec |
---|---|---|---|---|---|---|

1 | greedy | yes | 7199978 | 3.02876 | 16325873 | 5.39e+06 |

2 | greedy | yes | 7199978 | 2.83467 | 16325873 | 5.75e+06 |

4 | greedy | yes | 7199978 | 1.64405 | 16325873 | 9.93e+06 |

8 | greedy | yes | 7199978 | 0.870028 | 16325873 | 1.87e+07 |

16 | greedy | yes | 7199978 | 0.605769 | 16325873 | 2.69e+07 |

32 | greedy | yes | 7199978 | 0.43742 | 16325873 | 3.73e+07 |

64 | greedy | yes | 7199978 | 0.236701 |
16325873 | 6.89e+07 |

1 | unif. | yes | 7199978 | 14.6291 |
14510781 | 991915 |

2 | unif. | yes | 7199978 | 24.2175 | 14186833 | 585809 |

4 | unif. | yes | 7199978 | 25.1764 | 14487202 | 575427 |

8 | unif. | yes | 7199978 | 27.7312 | 13937449 | 502591 |

16 | unif. | yes | 7199978 | 30.5377 | 14062226 | 460488 |

32 | unif. | yes | 7199978 | 32.1057 | 13906144 | 433137 |

64 | unif. | yes | 7199978 | 31.2754 | 13876284 | 443680 |

1 | unif. | no | 100000 | 12.3982 |
12700000 | 1.024+06 |

2 | unif. | no | 100000 | 19.7925 | 12700000 | 641658 |

4 | unif. | no | 100000 | 22.5432 | 12700000 | 563362 |

8 | unif. | no | 100000 | 26.1053 | 12700000 | 486491 |

16 | unif. | no | 100000 | 28.275 | 12700000 | 449160 |

32 | unif. | no | 100000 | 28.334 | 12700000 | 448224 |

64 | unif. | no | 100000 | 28.7419 | 12700000 | 441864 |

Note that we use fewer seeds for the undirected uniform case due to slow runtime.

Observe that the `rand`

modes have very bad scaling as a function of cores. After investigation, this was due to two issues. First, the neighbors were being sampled incorrectly, which led to chaotic behavior. Second, the app was using a slow random number generator w/ an excessive number of seed resets. We created a PR to fix those issues here.

After these fixes, runtimes were as follows:

threads | method | directed? | nseeds | elapsed_sec | nsteps | steps_per_sec |
---|---|---|---|---|---|---|

1 | greedy | yes | 7199978 | 3.02876 | 16325873 | 5.39e+06 |

2 | greedy | yes | 7199978 | 2.83467 | 16325873 | 5.75e+06 |

4 | greedy | yes | 7199978 | 1.64405 | 16325873 | 9.93e+06 |

8 | greedy | yes | 7199978 | 0.870028 | 16325873 | 1.87e+07 |

16 | greedy | yes | 7199978 | 0.605769 | 16325873 | 2.69e+07 |

32 | greedy | yes | 7199978 | 0.43742 | 16325873 | 3.73e+07 |

64 | greedy | yes | 7199978 | 0.236701 |
16325873 | 6.89e+07 |

1 | unif. | yes | 7199978 | 1.49886 | 16529694 | 1.10e+07 |

2 | unif. | yes | 7199978 | 1.60176 | 16533004 | 1.03e+07 |

4 | unif. | yes | 7199978 | 0.974128 | 16538957 | 1.69e+07 |

8 | unif. | yes | 7199978 | 0.455227 | 16534756 | 3.63+07 |

16 | unif. | yes | 7199978 | 0.257524 | 16528617 | 6.42e+07 |

32 | unif. | yes | 7199978 | 0.155722 |
13906144 | 1.06e+08 |

64 | unif. | yes | 7199978 | 0.158828 | 16537488 | 1.04e+08 |

1 | unif. | no | 7199978 | 125.963 | 914397206 | 1.92e+08 |

2 | unif. | no | 7199978 | 78.927 | 914397206 | 1.80e+08 |

4 | unif. | no | 7199978 | 39.7097 | 914397206 | 2.96e+08 |

8 | unif. | no | 7199978 | 22.5195 | 914397206 | 6.35e+08 |

16 | unif. | no | 7199978 | 11.0047 | 914397206 | 1.12e+09 |

32 | unif. | no | 7199978 | 5.56317 | 914397206 | 1.85e+09 |

64 | unif. | no | 7199978 | 3.18615 |
914397206 | 1.82e+09 |

Note the improved runtimes and scaling. These experiments were run with this branch at commit 6c25a0687eecebfd4393e86fa4c7308d5594b73d.

All experiments are conducted on the HIVE DGX-1.

#### Gunrock GPU implementation

###### directed, greedy

```
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file dir_gs_twitter.mtx \
--node-value-path gs_twitter.values \
--walk-mode 1 \
--walk-length 32 \
--undirected=0 \
--store-walks 0 \
--quick \
--num-runs 10
```

```
Loading Matrix-market coordinate-formatted graph ...
Reading from dir_gs_twitter.mtx:
Parsing MARKET COO format
(7199978 nodes, 21741663 directed edges)...
Done parsing (7 s).
Converting 7199978 vertices, 21741663 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
==============================================
advance-mode=LB
Using advance mode LB
Using filter mode CULL
Run 0 elapsed: 65.273046, #iterations = 32
Run 1 elapsed: 64.157963, #iterations = 32
Run 2 elapsed: 64.009190, #iterations = 32
Run 3 elapsed: 64.055920, #iterations = 32
Run 4 elapsed: 64.069033, #iterations = 32
Run 5 elapsed: 64.002037, #iterations = 32
Run 6 elapsed: 64.031839, #iterations = 32
Run 7 elapsed: 64.036846, #iterations = 32
Run 8 elapsed: 64.065933, #iterations = 32
Run 9 elapsed: 64.047098, #iterations = 32
Validate_Results: total_neighbors_seen=298668024
Validate_Results: total_steps_taken=16325873
-------- NO VALIDATION --------
[rw] finished.
avg. elapsed: 64.174891 ms
iterations: 32
min. elapsed: 64.002037 ms
max. elapsed: 65.273046 ms
load time: 7086.91 ms
preprocess time: 1016.620000 ms
postprocess time: 101.121902 ms
total time: 2073.837996 ms
```

###### directed, uniform

```
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file dir_gs_twitter.mtx \
--node-value-path gs_twitter.values \
--walk-mode 0 \
--walk-length 128 \
--undirected=0 \
--store-walks 0 \
--quick \
--num-runs 10 \
--seed 123
```

```
Loading Matrix-market coordinate-formatted graph ...
Reading from dir_gs_twitter.mtx:
Parsing MARKET COO format
(7199978 nodes, 21741663 directed edges)...
Done parsing (7 s).
Converting 7199978 vertices, 21741663 directed edges ( ordered tuples) to CSR format...
Done converting (1s).
==============================================
advance-mode=LB
Using advance mode LB
Using filter mode CULL
__________________________
Run 0 elapsed: 38.613081, #iterations = 128
Run 1 elapsed: 34.458876, #iterations = 128
Run 2 elapsed: 34.530163, #iterations = 128
Run 3 elapsed: 33.849001, #iterations = 128
Run 4 elapsed: 33.759117, #iterations = 128
Run 5 elapsed: 33.967972, #iterations = 128
Run 6 elapsed: 33.873081, #iterations = 128
Run 7 elapsed: 33.970118, #iterations = 128
Run 8 elapsed: 33.756971, #iterations = 128
--------------------------
Run 9 elapsed: 33.715963, #iterations = 128
Validate_Results: total_neighbors_seen=289124779
Validate_Results: total_steps_taken=16530404
-------- NO VALIDATION --------
[rw] finished.
avg. elapsed: 34.449434 ms
iterations: 128
min. elapsed: 33.715963 ms
max. elapsed: 38.613081 ms
load time: 7176.17 ms
preprocess time: 1016.720000 ms
postprocess time: 101.902962 ms
total time: 1781.071901 ms
```

###### undirected, uniform

```
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file undir_gs_twitter.mtx \
--node-value-path gs_twitter.values \
--walk-mode 0 \
--walk-length 128 \
--store-walks 0 \
--quick \
--num-runs 10 \
--seed 123
```

```
Loading Matrix-market coordinate-formatted graph ...
Reading from undir_gs_twitter.mtx:
Parsing MARKET COO format
(7199978 nodes, 43483326 directed edges)...
Done parsing (7 s).
Converting 7199978 vertices, 43483326 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
==============================================
advance-mode=LB
Using advance mode LB
Using filter mode CULL
Run 0 elapsed: 636.021852, #iterations = 128
Run 1 elapsed: 631.129026, #iterations = 128
Run 2 elapsed: 631.053925, #iterations = 128
Run 3 elapsed: 631.713152, #iterations = 128
Run 4 elapsed: 631.028175, #iterations = 128
Run 5 elapsed: 631.374836, #iterations = 128
Run 6 elapsed: 631.196976, #iterations = 128
Run 7 elapsed: 632.030964, #iterations = 128
Run 8 elapsed: 631.026983, #iterations = 128
Run 9 elapsed: 630.996943, #iterations = 128
Validate_Results: total_neighbors_seen=75443835041
Validate_Results: total_steps_taken=914397206
-------- NO VALIDATION --------
[rw] finished.
avg. elapsed: 631.757283 ms
iterations: 128
min. elapsed: 630.996943 ms
max. elapsed: 636.021852 ms
load time: 7705.9 ms
preprocess time: 1010.830000 ms
postprocess time: 102.057934 ms
total time: 7755.448818 ms
```

### Performance limitations

- For the undirected uniform settings, profiling shows that 79% of compute time is spent in the
`ForAll`

operator and 20% is spent in the`curand`

random number generator. Device memory bandwidth in the`ForAll`

kernel is 193 GB/s. - For the directed greedy settings, profiling shows that 99.5% of compute time is spent in the
`ForAll`

operator. Device memory bandwidth in the`ForAll`

kernel is 136 GB/s.

When we do a large number of walks and/or the length of each walk is very long, there may not be enough GPU memory to store all of the walks in memory. For now, we expose the `--store-walks`

parameter -- when this is set to zero, the walk is discarded as it is computed and only the length of the walk is stored. A better solution that could be implemented in the future would be to move walks from GPU to CPU memory as they grow too large.

**Optimization:** In a directed walk, once we hit a node with no outgoing neighbors, we halt the walk. In the current Gunrock implementation, the enactor runs for a fixed number of iterations, regardless of whether any of the nodes are still active. It would be straightforward to add a check that terminates the app when no "living" nodes are left.

## Next Steps

### Alternate approaches

The size of the output array may become a significant bottleneck for large graphs. However, since all of the transition functions do not depend on anything besides the current node, we could reasonably move the results of the walk from GPU to CPU memory every N iterations. Properly executed, this should eliminate the largest bottleneck without unduely impacting performance.

### Gunrock implications

For the `greedy`

and `stochastic_greedy`

transition function, we have to sequentially iterate over all of a node's neighbors. Simple wrappers for computing, e.g., the maximum of node scores across all of a node's neighbors could be helpful, both for ease of programming and performance. Gunrock has a newly added `NeighborReduce`

kernel that supports associative reductions -- it should be straightforward to implement (at least) the `greedy`

transition function with this kernel. The `stochastic_greedy`

transition function would require a more complex reduction function along the lines of reservoir sampling.

### Notes on multi-GPU parallelization

If the graph is small enough to be duplicated on each GPU, the implementation is trivial: just do a subset of the walks on each GPU. The scalability will be perfect, as there is no communication involved at all.

When the graph is distributed across multiple GPUs, we expect to have very poor scalability, as the ratio of computation to communication is very low. A more detailed discussion is available here.

### Notes on dynamic graphs

This workflow does not have an explicit dynamic component. However, because steps only depend on the current node, the underlying graph could change during the walks.

### Notes on larger datasets

The random accesses inherent to graph search make it a particularly difficult workflow for larger-than-GPU memory datasets. The most straightforward solution would be to let Unified Virtual Memory (UVM) in CUDA automatically handle memory movement, but we should expect to see a substantial reduction in performance.

### Notes on other pieces of this workload

In real use cases, the scoring function would be computed lazily -- that is, we wouldn't have a precomputed array with scores for each of the nodes, and we would need to run the scoring function as the walk is running. Thus, it would be critical for us to be able to call the scoring function from within Gunrock quickly and without excessive programmer overhead.