Graph Projections

Given a (directed) graph G, graph projection outputs a graph H such that H contains edge (u, v) iff G contains edges (w, u) and (w, v) for some node w. That is, graph projection creates a new graph where nodes are connected iff they are neighbors of the same node in the original graph. Typically, the edge weights of H are computed via some (simple) function of the corresponding edge weights of G.

Graph projection is most commonly used when the input graph G is bipartite, with node sets U1 and U2 and directed edges (u, v). In this case, the operation yields a unipartite projection onto one of the node sets. However, graph projection can also be applied to arbitrary (unipartite) graphs.
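
As a small worked example of this definition, here is a sketch in plain Python (the tiny edge list and node names are made up for illustration):

from collections import defaultdict

# Tiny hypothetical graph: directed edges (w, u), i.e. "w points to u".
G_edges = [('w1', 'a'), ('w1', 'b'), ('w2', 'b'), ('w2', 'c')]

# Group each node's outgoing neighbors.
out_neighbors = defaultdict(set)
for w, u in G_edges:
    out_neighbors[w].add(u)

# H contains (u, v) iff some w points to both u and v; the weight counts such w's.
H = defaultdict(int)
for targets in out_neighbors.values():
    for u in targets:
        for v in targets:
            if u != v:
                H[(u, v)] += 1

print(dict(H))   # -> weights: (a,b)=1, (b,a)=1, (b,c)=1, (c,b)=1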

Summary of Results

Because graph projection has a natural representation in terms of sparse matrix operations, it gave us an opportunity to compare ease of implementation and performance between Gunrock and another UC-Davis project, GPU GraphBLAS.

Overall, we found that Gunrock was more flexible and more performant than GraphBLAS, likely due to better load balancing. However, in this case, the GraphBLAS application was substantially easier to program than Gunrock, and also allowed us to take advantage of some more sophisticated memory allocation methods available in the GraphBLAS cuSPARSE backend. These findings suggest that addition of certain commonly used API functions to Gunrock could be a fruitful direction for further work.

Summary of Gunrock Implementation

We implement two versions of graph projections: one using Gunrock and one using GraphBLAS.

Gunrock

First, we can compute graph projection in Gunrock via a single advance operation from all nodes w/ nonzero outgoing degree:

# Pseudocode for the advance operator applied to edge (src, dest): for every
# other out-neighbor `neib` of src, the output edge (dest, neib) gains weight 1
# in the dense, row-major |V|x|V| array H_edges.
def _advance_op(self, G, H_edges, src, dest):
    for neib in G.neighbors(src):
        if dest != neib:
            H_edges[dest * G.num_nodes + neib] += 1

That is, for each edge in the graph, we fetch the neighbors of the source node in G, then increment the weight of the edge between dest and each of those neighbors in H_edges.

Note that we have only implemented the unweighted case and a single method for computing the edge weights of H, but the extension to weighted graphs and different weighting functions would be straightforward.

We use a dense |V|x|V| array to store the edges of the output graph H. This is simple and fast, but uses an unreasonably large amount of memory (a graph with 60k nodes requires 16 GB). In the worst-case scenario, H may actually have all |V|x|V| possible edges, but any typical real-world graph has far fewer edges in practice.
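
As a rough check on that memory figure, assuming 4-byte entries (the implementation's actual value type may differ):

num_nodes = 60_000
bytes_per_entry = 4                              # assumed 32-bit weight/count per entry
dense_bytes = num_nodes * num_nodes * bytes_per_entry
print(dense_bytes / 1e9)                         # ~14.4 GB -- on the order of a P100's 16 GB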

GraphBLAS

Second, we implement graph projection as a single sparse matrix-matrix multiply in our GraphBLAS GPU library, which wraps and extends cuSPARSE.

Graph projection admits a simple linear algebra formulation. Given the adjacency matrix A of graph G, the projection is just:

H = matmul(transpose(A), A)

which can be concisely implemented via cuSPARSE's csr2csc and csrgemm functions.
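
For intuition, here is a minimal CPU sketch of the same formulation using scipy; this stands in for the cuSPARSE path, and the random matrix is illustrative only:

from scipy import sparse

# Random sparse adjacency matrix standing in for the (bipartite) input graph G.
A = sparse.random(1000, 500, density=0.01, format='csr', random_state=0)
A.data[:] = 1.0                  # unweighted case: treat every edge weight as 1

H = (A.T @ A).tocsr()            # H[u, v] = number of w with edges (w, u) and (w, v)
H.setdiag(0)                     # optionally drop self loops
H.eliminate_zeros()
print(H.shape, H.nnz)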

The csrgemm functions in cuSPARSE allocate memory more intelligently than we do above, on the order of the number of edges in the output. Thus, our GraphBLAS implementation can scale to substantially larger matrices than our Gunrock implementation. However, implementing graph projection via a single call to csrgemm requires both the input graph G and output graph H to fit in GPU memory (16 GB on the DGX-1). This limit can easily be hit, even for a moderately sized G, as the number of edges in H is often orders of magnitude larger than in G.

Thus, to scale to larger graphs, we implement graph projections via a chunked matrix multiply. Specifically, to compute matmul(X, Y) w/ X.shape = (n, m) and Y.shape = (m, k), we split X into c matrices (X_1, ..., X_c), w/ X_i.shape = (n / c, m). Then we compute matmul(X_i, Y) for each X_i, moving the output of each multiplication from GPU to CPU memory as we go. This implementation addresses the common case where we can fit both X and Y in GPU memory, but not matmul(X, Y). Obviously, the chunked matrix multiply incurs a performance penalty, but allows us to run graph projections of much larger graphs on the GPU.
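
A minimal CPU sketch of this chunking scheme, again with scipy standing in for the GPU matrix multiply (the function name and chunk count are illustrative, not the GraphBLAS API):

import numpy as np
from scipy import sparse

def chunked_matmul(X, Y, num_chunks):
    # Split X into row blocks, multiply each block, and stack the results.
    # On the GPU, each partial product would be copied back to CPU memory here.
    row_blocks = np.array_split(np.arange(X.shape[0]), num_chunks)
    return sparse.vstack([X[rows] @ Y for rows in row_blocks])

X = sparse.random(500, 2000, density=0.01, format='csr', random_state=0)
Y = sparse.random(2000, 500, density=0.01, format='csr', random_state=0)
H = chunked_matmul(X, Y, num_chunks=4)
print(H.shape, H.nnz)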

How To Run This Application on DARPA's DGX-1

Gunrock

Prereqs/input

git clone --recursive https://github.com/gunrock/gunrock -b dev-refactor
cd gunrock/tests/proj/
cp ../../gunrock/util/gitsha1.c.in ../../gunrock/util/gitsha1.c
make clean
make

Application specific parameters

None

Example Command

./bin/test_proj_9.1_x86_64 \
        --graph-type market \
        --graph-file ../../dataset/small/chesapeake.mtx

Example Output

Loading Matrix-market coordinate-formatted graph ...
Reading from ../../dataset/small/chesapeake.mtx:
  Parsing MARKET COO format edge-value-seed = 1539110067
 (39 nodes, 340 directed edges)...
Done parsing (0 s).
  Converting 39 vertices, 340 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
__________________________
--------------------------
 Elapsed: 0.026941
Using advance mode LB
Using filter mode CULL
__________________________
0    0   0   queue3      oversize :  234 ->  342
0    0   0   queue3      oversize :  234 ->  342
--------------------------
Run 0 elapsed: 0.199080, #iterations = 1
edge_counter=1372
0->1 | GPU=9.000000 CPU=9.000000
0->2 | GPU=1.000000 CPU=1.000000
0->3 | GPU=2.000000 CPU=2.000000
...
38->35 | GPU=28.000000 CPU=28.000000
38->36 | GPU=2.000000 CPU=2.000000
38->37 | GPU=18.000000 CPU=18.000000
======= PASSED ======
[proj] finished.
 avg. elapsed: 0.199080 ms
 iterations: 38594739
 min. elapsed: 0.199080 ms
 max. elapsed: 0.199080 ms
 src: 0
 nodes_visited: 38578864
 edges_visited: 38578832
 nodes queued: 140734466796512
 edges queued: 140734466795232
 load time: 85.5711 ms
 preprocess time: 955.861000 ms
 postprocess time: 3.808022 ms
 total time: 960.005045 ms

Expected Output

When run in verbose mode, the app outputs the weighted edgelist of the graph projection H. When run in quiet mode, it only outputs performance statistics and the results of a correctness check.

GraphBLAS

Prereqs/input

git clone --recursive https://github.com/bkj/graphblas_proj
cd graphblas_proj
make clean
make

Application specific parameters

--unweighted
 1 = convert entries of adjacency matrix to 1
 0 = leave entries of adjacency matrix as-is
--proj-debug
 1 = print debug information
 0 = don't print debug information
--print-results
 1 = print the edges of the projected graph
 0 = don't print edges
--onto-cols
 1 = given adjacency matrix A w/ `A.shape = (m, n)`,
     compute projection `H = matmul(transpose(A), A)` w/ `H.shape = (n, n)`
 0 = given adjacency matrix A w/ `A.shape = (m, n)`,
     compute projection `H = matmul(A, transpose(A))` w/ `H.shape = (m, m)`
--num-chunks
 <= 1 = do matrix multiply in one step
  > 1 = break matrix multiply into multiple chunks (eg, so we can compute
        projections larger than GPU memory)

Example Command

# generate some random data `data/X.mtx`
python data/make-random.py --seed 111 \
        --num-rows 1000 --num-cols 1000 --density 0.1

# run graph projection
./proj --X data/X.mtx --unweighted 1 --proj-debug 1

Example Output

proj.cu: loading data/X.mtx
  done
proj.cu: computing transpose
  done
proj.cu: computing projection
mxm analyze successful!
mxm compute successful!
  done
proj_num_edges          = 999946  # number of edges in H, including self loops
dim_out                 = 1000    # dimension of projected graph
proj_num_edges (noloop) = 998946  # number of edges in H, excluding self loops
timer                   = 208.98  # elapsed time (no IO)

Expected Output

The app will print the number of edges in the projected graph. Additionally, the edges of the projected graph can be printed by setting --print-results 1.

Validation

We compared the results of the Gunrock implementation to the HIVE reference implementation and the PNNL implementation. These two implementations vary slightly in their output (e.g., handling of self loops). We validated the correctness of our results against the HIVE reference implementation.

Performance and Analysis

Performance is measured by the runtime of the app on the datasets described in the Experiments section below.

Implementation limitations

Gunrock

The primary limitation of the current implementation is that it allocates a |V|x|V| array, where |V| is the number of nodes in the network. This means that the memory requirements of the app can easily exceed the memory available on a single GPU (16 GB on the DGX-1). The size of this array reflects the worst-case memory requirements of the graph projection workflow; while some graphs can become exceptionally large and dense when projected, we should be able to run the app on larger graphs if we store the output in a different data structure and/or allocate memory more efficiently.

Algorithms do exist for mitigating this issue: cuSPARSE's csrgemm method computes the row pointers in one pass, then allocates the exact amount of memory for H's column and value arrays, and only then computes the matrix product. An interesting future direction would be to integrate this sort of algorithm into Gunrock.

It may be possible to improve performance by making assumptions about the topology of the graph. Graph projection is often used for bipartite graphs, but this app does not make any assumptions about the topology of the graph. This choice was made in order to remain consistent with the HIVE reference implementation.

There are various ways that the edges of the output graph H can be weighted. We only implement graph projections for unweighted graphs: the weight of the edge (u, v) in the output graph H is the count of (incoming) neighbors that u and v have in common in the original graph G. Implementation of other weight functions would be fairly straightforward.

GraphBLAS

Currently, for the chunked matrix multiply, the CPU memory allocation and GPU to CPU memory copies for matmul(X_i, Y) block the computation of matmul(X_[i+1], Y). We could implement this in a non-blocking way using CUDA streams, but this would require some redesign of the GraphBLAS APIs.

Certain weighting functions can be implemented easily by applying a transformation to the values of the sparse adjacency matrix A, but others cannot. For instance, these weighting functions are easy to implement:

    weight_out = 1                             (by setting both matrices' entries to 1)
    weight_out = weight_edge_1                 (by setting one matrix's entries to 1)
    weight_out = weight_edge_1 / weight_edge_2 (by setting one matrix's entries to 1 / x)
    ...

while this function (from the HIVE reference implementation) is not easy to implement in the cuSPARSE plus/multiply semiring:

        weight_out = weight_edge_1 / (weight_edge_1 + weight_edge_2)

Implementation of additional semirings in GraphBLAS is currently in progress, and will extend the family of weightings we could support.
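
To illustrate the "easy" cases above: each amounts to pre-transforming the entries of one or both copies of A before the multiply, with the result summed over the shared neighbors w. A scipy sketch, for illustration only (not the GraphBLAS code):

from scipy import sparse

A = sparse.random(1000, 500, density=0.01, format='csr', random_state=0)
A.data += 1.0                                        # strictly positive weights for the 1/x case

A_ones = A.copy(); A_ones.data[:] = 1.0              # entries set to 1
A_inv  = A.copy(); A_inv.data = 1.0 / A_inv.data     # entries set to 1 / x

H_count = A_ones.T @ A_ones      # weight_out = 1 (summed over shared neighbors w)
H_w1    = A.T @ A_ones           # weight_out = weight_edge_1
H_ratio = A.T @ A_inv            # weight_out = weight_edge_1 / weight_edge_2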

Comparison against existing implementations

When the graph fits in GPU memory, the Gunrock implementation is approx. 5x faster than the GraphBLAS implementation and approx. 100x faster than PNNL's OpenMP CPU implementation w/ 64 threads. Somewhat surprisingly, PNNL's implementation is substantially slower than a single-threaded scipy sparse matrix multiplication.

When the graph does not fit in the Gunrock implementation's memory, our GPU GraphBLAS implementation is the fastest of the remaining implementations.

Existing implementations

PNNL

We compare our results against PNNL's OpenMP reference implementation. We make minor modifications to their code to handle unweighted graphs, in order to match the Gunrock and GraphBLAS implementations. That is, the weight of the edge in the output graph H is the number of shared neighbors in the original graph.

There is a --simple flag in PNNL's CLI, but examining the code reveals that it just changes the order of operations. Thus, all of our experiments are conducted with --simple=1, which is faster than --simple=0 due to better data access patterns.

Scipy

A very simple baseline is sparse matrix-matrix multiplication as implemented in the popular scipy Python package. This is a single-threaded C++ implementation with a Python wrapper. Note that this implementation comes with the same caveats about weighting functions as the Gunrock implementation.

Experiments

MovieLens

MovieLens is a bipartite graph G=(U, V, E) w/ |U|=138493, |V|=26744 and |E|=20000264. We report results on the full graph, as well as several random subgraphs.

For PNNL's OpenMP implementation, we report results using {1, 2, 4, 8, 16, 32, 64} threads.

In all cases, we project onto the node set V, producing a |V|x|V| graph.

Note: Small differences in the number of nonzero entries in the output (nnz_out) are due to small book-keeping differences (specifically, keeping or dropping self-loops). These differences do not have any meaningful impact on runtimes.

1M edge subgraph (|U|=6743, |V|=13950, |E|=1M)

| implementation | num_threads | nnz_out  | elapsed_seconds |
| -------------- | ----------- | -------- | --------------- |
| scipy          | 1           | 63104132 | 2.4912          |
| PNNL OpenMP    | 1           | 63090182 | 61.4852         |
| PNNL OpenMP    | 2           | 63090182 | 62.2842         |
| PNNL OpenMP    | 4           | 63090182 | 60.1542         |
| PNNL OpenMP    | 8           | 63090182 | 37.5853         |
| PNNL OpenMP    | 16          | 63090182 | 22.0257         |
| PNNL OpenMP    | 32          | 63090182 | 13.1482         |
| PNNL OpenMP    | 64          | 63090182 | 9.055           |
| Gunrock        | 1xP100 GPU  | 63090182 | 0.060           |
| GraphBLAS      | 1xP100 GPU  | 63090182 | 0.366           |
5M edge subgraph (|U|=34395, |V|=20402, |E|=5M)

| implementation | num_threads | nnz_out   | elapsed_seconds |
| -------------- | ----------- | --------- | --------------- |
| scipy          | 1           | 157071858 | 10.1052         |
| PNNL OpenMP    | 1           | 157051456 | 357.511         |
| PNNL OpenMP    | 2           | 157051456 | 309.723         |
| PNNL OpenMP    | 4           | 157051456 | 218.519         |
| PNNL OpenMP    | 8           | 157051456 | 113.987         |
| PNNL OpenMP    | 16          | 157051456 | 57.4606         |
| PNNL OpenMP    | 32          | 157051456 | 38.1186         |
| PNNL OpenMP    | 64          | 157051456 | 29.0056         |
| Gunrock        | 1xP100 GPU  | 157051456 | 0.3349          |
| GraphBLAS      | 1xP100 GPU  | 157051456 | 1.221           |
MovieLens-20M graph (|U|=138493, |V|=26744, |E|=20M)

| implementation | num_threads | nnz_out   | elapsed_seconds          |
| -------------- | ----------- | --------- | ------------------------ |
| scipy          | 1           | 286857534 | 39.181                   |
| PNNL OpenMP    | 1           | 286830790 | killed before finishing  |
| PNNL OpenMP    | 2           | 286830790 | 1109.32                  |
| PNNL OpenMP    | 4           | 286830790 | 727.224                  |
| PNNL OpenMP    | 8           | 286830790 | 358.708                  |
| PNNL OpenMP    | 16          | 286830790 | 188.701                  |
| PNNL OpenMP    | 32          | 286830790 | 102.964                  |
| PNNL OpenMP    | 64          | 286830790 | 163.731                  |
| Gunrock        | 1xP100 GPU  | 286830790 | out-of-memory            |
| GraphBLAS      | 1xP100 GPU  | 286830790 | 5.012                    |

Takeaway: When the graph is small enough, Gunrock graph projection is fastest, followed by GraphBLAS (approx. 5x slower). The PNNL OpenMP implementation is consistently and substantially slower than the single-threaded scipy implementation, even when using 32+ threads.

RMAT

Next we test on a scale 18 RMAT graph. This is not a bipartite graph, but the graph projection algorithm can still be applied.

This graph was chosen because it was used in benchmarks in PNNL's gitlab repo. However, their command line parameters appear to be incorrect, so our results here are substantially different than reported in their README.

RMAT-18 (|V|=174147, |E|=7600696)

| implementation | num_threads | nnz_out    | elapsed_seconds          |
| -------------- | ----------- | ---------- | ------------------------ |
| scipy          | 1           | 2973926895 | 150.869                  |
| PNNL OpenMP    | 1           | 2973752748 | killed before finishing  |
| PNNL OpenMP    | 2           | 2973752748 | killed before finishing  |
| PNNL OpenMP    | 4           | 2973752748 | 812.453                  |
| PNNL OpenMP    | 8           | 2973752748 | 677.582                  |
| PNNL OpenMP    | 16          | 2973752748 | 419.468                  |
| PNNL OpenMP    | 32          | 2973752748 | 369.278                  |
| PNNL OpenMP    | 64          | 2973752748 | 602.69                   |
| Gunrock        | 1xP100 GPU  | 2973752748 | out-of-memory            |
| GraphBLAS      | 1xP100 GPU  | 2973752748 | 26.478                   |

Takeaway: GraphBLAS is approx. 5.7x faster than scipy, the next fastest implementation. Again, the PNNL OpenMP implementation is substantially slower than the single-threaded scipy implementation.

When the dataset can fit into memory, Gunrock is ~4x faster than GraphBLAS. Since the two implementations use slightly different algorithms, it's hard to tell where the Gunrock speedup comes from. Our hunch is that Gunrock's superior load balancing gives better performance than GraphBLAS, but this is an interesting topic for further research.

Performance information

Gunrock

GraphBLAS

Next Steps

Alternate approaches

As mentioned above, it would be worthwhile to implement a Gunrock version that does not require allocating the |V|x|V| array. It should be possible to achieve this by implementing the same kind of two-pass approach that cuSPARSE uses for spgemm -- one pass computes the CSR offsets, then column and data values are inserted at the appropriate locations.
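
A minimal CPU sketch of that two-pass idea for the unweighted case (illustrative only; not the Gunrock or cuSPARSE code):

import numpy as np

def two_pass_project(out_neighbors, num_nodes):
    # out_neighbors[w] = list of nodes that w points to.
    # Pass 1: determine the output columns per row, so the CSR arrays
    # can be allocated to exactly the right size.
    rows = [set() for _ in range(num_nodes)]
    for targets in out_neighbors:
        for u in targets:
            rows[u].update(v for v in targets if v != u)
    indptr = np.zeros(num_nodes + 1, dtype=np.int64)
    indptr[1:] = np.cumsum([len(r) for r in rows])

    # Pass 2: fill the column indices (and, in a weighted version, the values).
    indices = np.empty(indptr[-1], dtype=np.int64)
    for u, r in enumerate(rows):
        indices[indptr[u]:indptr[u + 1]] = sorted(r)
    return indptr, indices

indptr, indices = two_pass_project([[0, 1], [1, 2]], num_nodes=3)
print(indptr, indices)   # rows 0, 1, 2 have columns {1}, {0, 2}, {1}

A real GPU implementation would recompute the per-row counts in a first kernel pass rather than materializing the intermediate sets, but the allocation pattern -- size the output exactly before filling it -- is the same.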

Gunrock implications

Gunrock does not natively support bipartite graphs, but it is straightforward for programmers to implement algorithms that expect bipartite graphs by keeping track of the number of nodes in each node set. However, for multi-GPU implementations, the fact that a graph is bipartite may be useful for determining the optimal graph partitioning.

Notes on multi-GPU parallelization

Multiple GPU support for GraphBLAS is on the roadmap. Unlike the Seeded Graph Matching problem -- which requires the mxm, mxv and vxm primitives, which in turn necessitates possible changes in data layout -- this problem only requires the mxm primitive, so multiple GPU support for graph projections is easier than for SGM.

Even though extending matrix multiplication to multiple GPUs can be straightforward, doing so in a backend-agnostic fashion that abstracts away the placement (i.e. which part of matrix A goes on which GPU) may still be quite challenging.

Further discussion can be found here.

Notes on dynamic graphs

This workflow does not have an explicit dynamic component. The graph projection operation seems like it would be fairly straightforward to adapt to dynamic graphs -- as nodes/edges are added to G, we create/increment the weight of the appropriate edges in H. However, this adds an additional layer of complexity to the memory allocation step, as we can't use the two-pass approach to allocate memory conservatively.
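
A minimal sketch of that incremental update, with H kept as a dictionary of edge weights (the data structures are hypothetical, for illustration only):

from collections import defaultdict

def insert_edge(out_neighbors, H, w, u):
    # Adding edge (w, u) to G creates/increments H edges between u and every
    # existing out-neighbor of w.
    for v in out_neighbors[w]:
        if v != u:
            H[(u, v)] += 1
            H[(v, u)] += 1
    out_neighbors[w].add(u)

out_neighbors, H = defaultdict(set), defaultdict(int)
for w, u in [('w1', 'a'), ('w1', 'b'), ('w2', 'b'), ('w2', 'c')]:
    insert_edge(out_neighbors, H, w, u)
print(dict(H))   # matches the projection computed from scratch on the same edges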

Notes on larger datasets

If the dataset were too big to fit into the aggregate GPU memory of multiple GPUs on a node, two directions could be taken to tackle these larger datasets:

Notes on other pieces of this workload

This workload does not involve any non-graph components.