toc_footers: - Gunrock: GPU Graph Analytics - Gunrock © 2018 The Regents of the University of California.

search: true

full_length: true


GraphSAGE is a way to fit graphs into a neural network: instead of getting the embedding of a vertex from all its neighbors' features as in conventional implementations, GraphSAGE selects some 1-hop neighbors, some 2-hop neighbors connected to those 1-hop neighbors, and computes the embedding based on the features of the 1-hop and 2-hop neighbors. The embedding can be considered as a vector containing hash values describing the interesting properties of a vertex.

During the training process, the adjacency matrix and features of the nodes in the graph are fed into a neural network. The parameters (the W arrays in the algorithm) are updated after each batch by the difference between the predicted labels and the real labels. The per-vertex features won't change, but the parameters will, so the GraphSAGE computation needs to be performed for each batch. Ultimately it should connect to the training process to complete a workflow. However, the training part is pure matrix operations, and the year 1 deliverable only focuses on the graph related portion, which is the GraphSAGE implementation.

Summary of Results

The vertex embedding part of the GraphSAGE algorithm is implemented in the Gunrock framework using custom CUDA kernels, utilizing block-level parallelism, that allow a shorter running time. For the embedding part alone, the GPU implementation is 7.5X to 15X on P100, and 20X to 30X on V100, faster than an OpenMP implementation using 32 threads. The GPU hardware, especially the memory system, has high utilizations from these custom kernels. It is still unclear how to expose block-level parallelism for more general usage in other applications in Gunrock.

Connecting the vertex embedding with the neural network training part, and making the GraphSAGE workflow complete, would be an interesting task for year 2. Testing on the complete workflow for prediction accuracy and running speed will be more meaningful.

Summary of Gunrock Implementation

Gunrock's implementation for the 1st year is only the embedding / inferencing phase, without the training phase. It is based on Algorithm 2 with K=2 of the GraphSAGE paper ("Inductive Representation Learning on Large Graphs",

Given a graph G, the inputs are per-vertex features, weight matrices W^k, and a non-linear activation function (ReLu); the output from GraphSAGE is the embedding vector of each vertex. The current Gunrock implementation randomly selects neighbors from the neighborhood, and simple changes to the code can enable other selection methods, such as weighted uniform, importance sampling (used by FaseGCN), or a random walk probability like DeepWalk or Node2Vec (used by PinSage). An aggregator is a function to accumulate data from the selected neighbors; the current implementation uses the Mean aggregator, and it can be changed to other accumulation functions easily.

The pseudocode of Gunrock's implementation follows. The current implementation uses custom CUDA kernels, because block-level parallelism is critical for faster running speed; all other functions, such as memory management, load and store routines, block level parallel reduction, and graph accesses (get degree, get neighbors, etc.) are provided by the framework or the utility functions of Gunrock. The GraphSAGE algorithm uses B2, B1 and B0 for the sources, the 1-hop neighbors and the 2-hop neighbors; for easier understanding of the code, we use sources, children and leafs for these three groups of vertices instead. To manage the memory usage of intermediate data, batches of source vertices are processed one by one, with each of size B.

source_start := 0;
While (source_source < V)
    num_sources := (source_start + B > V) ? V - source_start : B;

    // Kernel1: pick children and leafs
    For children# from 0 to num_children_per_source x num_sources:
        source := source_start + (children# / num_children_per_source);
        child := SelectNeighbor(source);
        leafs_feature := {0};
        For i from 1 to num_leafs_per_child:
            leaf := SelectNeighbor(child);
            leafs_feature += feature[leaf];

        children[children#] := child;
        leafs_features[children#] := leafs_feature;

    // Kernel2: Child-centric computation
    For children# from 0 to num_children_per_source x num_sources:
        child_temp := ReLu(concatenate(
            feature[children[children#]] x Wf1,
            Mean(leafs_features[children#]) x Wa1));
        child_temp /= sqrt(sum(child_temp));
        children_temp    [children# / num_children_per_source] += child_temp;
        children_features[children# / num_children_per_source] += feature[child];

    // Kernel3: Source-centric computation
    For source# from 0 to num_sources - 1:
        source := source_start + source#;
        source_temp := ReLu(concatenate(
            feature[source] x Wf1,
            Mean(children_features[source#] x Wa1)));
        source_temp /= sqrt(sum(source_temp));

        result := ReLu(concatenate(
            source_temp x Wf2,
            Mean(children_temp[source#]) x Wa2));
        result /= sqrt(sum(result));
        source_embedding[source#] := result;

    Move source_embedding from GPU to CPU;
    source_start += B;

How To Run This Application on DARPA's DGX-1


CUDA should have been installed; $PATH and $LD_LIBRARY_PATH should have been set correctly to use CUDA. The current Gunrock configuration assumes boost (1.58.0 or 1.59.0) and Metis are installed; if not, changes need to be made in the Makefiles. DARPA's DGX-1 has both installed when the tests are performed.

git clone --recursive \
  -b dev-refactor
cd gunrock
git submodule init
git submodule update
mkdir build
cd build
cmake ..
cd ../tests/sage

At this point, there should be an executable test_sage_<CUDA version>_x86_64 in tests/sage/bin.

The datasets are assumed to have been placed in /raid/data/hive, and converted to proper matrix market format (.mtx). At the time of testing, pokec, amazon, flickr, twitter and cit-Patents are available in that directory.

Note that GraphSage is an inductive representation learning algorithm, so it reasonable to assume that there is no dangling vertices in the graph. Before running GraphSAGE, please remove the dangling vertices from the graph. In the case when dangling vertices are present, the dangling vertices themselves will be treated as their neighbors.

The testing is done with Gunrock using dev-refactor branch at commit 0ed72d5 (Oct. 25, 2018), using CUDA 9.2 with NVIDIA driver 390.30 on a Tesla P100 GPU in the DGX-1 machine, and using CUDA 10.0 with NVIDIA driver 410.66 on a Tesla V100 GPU in Bowser (an machine used by the Gunrock team in UC Davis).

Running the application

Application specific parameters

The W arrays used by GraphSAGE should be produced by the training process; without the training part at the moment, we use some given datasets or randomly generate them if not available. The array input files are floating-point values in plain text format; examples can be found in the gunrock/app/sage directory of the gunrock repo.

--Wa1 : std::string, default =
        <weight matrix for W^1 matrix in algorithm 2, aggregation part>
         dimension 64 by 128 for pokec;
         It should be leaf feature length by a value you want for W2 layer
--Wa1-dim1 : int, default = 128
        Wa1 matrix column dimension
--Wa2 : std::string, default =
        <weight matrix for W^2 matrix in algorithm 2, aggregation part>
         dimension 256 by 128 for pokec;
         It should be child_temp length by output length
--Wa2-dim1 : int, default = 128
        Wa2 matrix column dimension
--Wf1 : std::string, default =
        <weight matrix for W^1 matrix in algorithm 2, feature part>
         dimension 64 by 128 for pokec;
         It should be child feature length by a value you want for W2 layer
--Wf1-dim1 : int, default = 128
        Wf1 matrix column dimension
--Wf2 : std::string, default =
        <weight matrix for W^2 matrix in algorithm 2, feature part>
         dimension 256 by 128 for pokec;
         It should be source_temp length by output length
--Wf2-dim1 : int, default = 128
        Wf2 matrix column dimension
--feature-column : std::vector<int>, default = 64
        feature column dimension
--features : std::string, default =
        <features matrix>
    dimension |V| by 64 for pokec;
--omp-threads : int, default = 32
    number of threads to run CPU reference
--num-children-per-source : std::vector<int>, default = 10
        number of sampled children per source
--num-leafs-per-child : std::vector<int>, default = -1
        number of sampled leafs per child; default is the same as num-children-per-source
--batch-size : std::vector<int>, default = 65536
        number of source vertex to process in one iteration

Example Command

./bin/test_sage_9.2_x86_64 \
 market /raid/data/hive/pokec/pokec.mtx  --undirected \
 --Wf1 ../../gunrock/app/sage/wf1.txt \
 --Wa1 ../../gunrock/app/sage/wa1.txt \
 --Wf2 ../../gunrock/app/sage/wf2.txt \
 --Wa2 ../../gunrock/app/sage/wa2.txt \
 --features ../../gunrock/app/sage/features.txt \
 --num-runs=10 \

When Wf1, Wa1, Wf2, Wa2, or features are not available, random values are used. The embeddings may be not useful at all, but the memory access patterns and the computation workload are the same, regardless of whether the random inputs are used. Using the random inputs enable us to test on any graph without requiring the associative arrays.


The outputs are in the sage directory; look for the .txt files. An example is shown below for the pokec dataset:

Loading Matrix-market coordinate-formatted graph ...
  ... Loading progress ...
Converting 1632803 vertices, 44603928 directed edges ( ordered tuples) to CSR format...Done (0s).
Degree Histogram (1632803 vertices, 44603928 edges):
    Degree 0: 0 (0.000000 %)
    Degree 2^0: 163971 (10.042301 %)
    Degree 2^1: 201604 (12.347111 %)
    Degree 2^2: 238757 (14.622523 %)
    Degree 2^3: 273268 (16.736128 %)
    Degree 2^4: 298404 (18.275567 %)
    Degree 2^5: 265002 (16.229882 %)
    Degree 2^6: 148637 (9.103180 %)
    Degree 2^7: 38621 (2.365319 %)
    Degree 2^8: 4089 (0.250428 %)
    Degree 2^9: 418 (0.025600 %)
    Degree 2^10: 23 (0.001409 %)
    Degree 2^11: 3 (0.000184 %)
    Degree 2^12: 3 (0.000184 %)
    Degree 2^13: 3 (0.000184 %)

 feature-column=64 num-children-per-source=10 num-leafs-per-child=-1
Computing reference value ...
rand-seed = 1540498523
CPU Reference elapsed: 25242.941406 ms.
Embedding validation: PASS
Using randomly generated Wf1
Using randomly generated Wa1
Using randomly generated Wf2
Using randomly generated Wa2
Using randomly generated features
Using advance mode LB
Using filter mode CULL
rand-seed = 1540498553
Run 0 elapsed: 2047.366858 ms, #iterations = 3190
... More runs
Run 9 elapsed: 2013.216972 ms, #iterations = 3190
Embedding validation: PASS
[Sage] finished.
 avg. elapsed: 2042.503190 ms
 iterations: 3190
 min. elapsed: 2009.524107 ms
 max. elapsed: 2243.710995 ms
 load time: 1600.46 ms
 preprocess time: 3997.880000 ms
 postprocess time: 2106.487989 ms
 total time: 26634.360075 ms

There is 1 OMP reference run on the CPU for each combination of {feature-column, num-children-per-source, num-leafs-per-child}, with the timing reported after CPU Reference elapsed. There are 10 GPU runs for each combination of the previous three parameters, plus batch-size. The computation workload of the GPU runs are the same as the reference CPU run, with different batch sizes. The GPU timing is reported after Run x elapsed:, and the average running time of the 10 GPUs is reported after avg. elapsed.

The mathematical formula of the GraphSAGE algorithm is relatively simple and repeats itself three times in the form of Normalize(C x Wf + Mean(D) x Wa). Because of this simplicity, it is still possible to verify the implementation by visually inspecting the code. The resulted embeddings are also checked for the L2 norm, which should be close to 1 for every vertex. Because the neighbor selection process is inherently random, it would be very difficult to do a number-by-number checking with other implementations, including the reference. A more meaningful regression test will look at the training or validation accuracy when the full workflow is completed, which is a possibility for future work in year 2.

Performance and Analysis

The OpenMP and GPU implementations are measured for runtime. It would be additionally useful to validate the successful rate of the whole pipeline with the training process in place (perhaps in year 2).

The datasets used for experiments are:

Dataset V E
flickr 105938 4633896
amazon 548551 1851744
pokec 1632803 44603928
cit-Patents 3774768 33037894
twitter 7199978 43483326
europe-osm 50912018 108109320

The running times in milliseconds are listed below, for both machines. F stands for the length of features, C stands for the number of children per source, the same as the number of leafs per child, and B notes the batch size that produces the shortest running time on GPU among {512, 1024, 2048, 4096, 8192, 16384}.

The running time on DGX-1 with Tesla P100 GPU:

Dataset F C B Gunrock GPU OpenMP Speedup
flickr 64 10 16384 113.431 1607.468 14.17
flickr 64 25 16384 248.523 2630.659 10.59
flickr 64 100 2048 1139.007 10702.981 9.40
flickr 128 10 16384 192.916 1800.150 9.33
flickr 128 25 8192 442.226 4022.799 9.10
flickr 128 100 2048 2116.955 20655.732 9.76
amazon 64 10 16384 579.342 8905.530 15.37
amazon 64 25 16384 1235.168 18034.229 14.60
amazon 64 100 4096 5486.910 45337.011 8.26
amazon 128 10 16384 976.759 9418.961 9.64
amazon 128 25 16384 2193.159 29645.709 13.52
amazon 128 100 4096 10112.228 80677.594 7.98
pokec 64 10 16384 1744.602 25242.941 14.47
pokec 64 25 16384 3806.404 34517.180 9.07
pokec 64 100 2048 17486.692 133920.000 7.66
pokec 128 10 16384 2950.606 41414.535 14.04
pokec 128 25 8192 6791.829 74983.391 11.04
pokec 128 100 2048 32000.378 297979.438 9.31
cit-Patents 64 10 16384 4005.860 52094.125 13.00
cit-Patents 64 25 16384 8541.500 93715.789 10.97
cit-Patents 64 100 4096 38494.688 293333.781 7.62
cit-Patents 128 10 16384 6784.752 95639.117 14.10
cit-Patents 128 25 8192 15250.170 146539.250 9.61
cit-Patents 128 100 4096 70074.883 530910.375 7.58
twitter 64 10 16384 7594.364 103753.961 13.66
twitter 64 25 16384 16212.790 162819.531 10.04
twitter 64 100 4096 73488.745 636224.625 8.66
twitter 128 10 16384 12753.125 147705.375 11.58
twitter 128 25 16384 28827.354 282497.438 9.80
twitter 128 100 4096 134027.966 1105741.500 8.25
europe_osm 64 10 16384 53521.982 611449.625 11.42
europe_osm 64 25 16384 113740.739 1016479.938 8.94
europe_osm 64 100 4096 509313.472 4441408.500 8.72

The running time on Bowser with a Tesla V100 GPU

Dataset F C B Gunrock GPU OpenMP Speedup vs. CPU Speedup vs. P100
flickr 64 10 16384 39.894 1094.377 27.43 2.90
flickr 64 25 8192 91.810 2177.953 23.72 2.76
flickr 64 100 1024 448.888 8641.063 19.25 2.57
flickr 128 10 16384 65.131 1849.658 28.40 2.95
flickr 128 25 4096 156.965 3981.435 25.37 2.84
flickr 128 100 512 802.966 16471.338 20.51 2.70
amazon 64 10 16384 196.079 6142.780 31.33 2.95
amazon 64 25 8192 423.145 12674.514 29.95 2.92
amazon 64 100 2048 1963.691 49205.359 25.06 2.79
amazon 128 10 16384 324.067 9978.198 30.79 3.10
amazon 128 25 8192 733.181 21726.697 29.63 2.99
amazon 128 100 2048 3361.894 86967.164 25.87 3.01
pokec 64 10 16384 602.116 17111.664 28.42 2.84
pokec 64 25 8192 1379.450 35887.129 26.02 2.71
pokec 64 100 1024 6793.011 140751.234 20.72 2.54
pokec 128 10 16384 1000.310 28987.953 28.98 2.96
pokec 128 25 4096 2366.202 63863.352 26.99 2.82
pokec 128 100 512 11859.789 265325.281 22.37 2.64
cit-Patents 64 10 16384 1374.782 38803.195 28.22 2.95
cit-Patents 64 25 8192 3006.717 78112.516 25.98 2.87
cit-Patents 64 100 2048 13910.529 291278.125 20.94 2.78
cit-Patents 128 10 16384 2276.261 65735.234 28.88 2.99
cit-Patents 128 25 8192 5201.184 141957.922 27.29 2.96
cit-Patents 128 100 2048 23922.527 560321.438 23.42 2.94
twitter 64 10 16384 2575.303 72372.484 28.10 2.91
twitter 64 25 8192 5658.345 148938.422 26.32 2.84
twitter 64 100 2048 26468.915 569925.500 21.53 2.77
twitter 128 10 16384 4270.454 125431.766 29.37 2.98
twitter 128 25 8192 9744.686 267841.563 27.49 2.93
twitter 128 100 2048 45604.444 1098039.375 24.08 2.93
europe_osm 64 10 16384 18440.253 497153.969 26.96 2.90
europe_osm 64 25 16384 39883.421 1008675.500 25.29 2.85
europe_osm 64 100 4096 184128.123 3825227.500 20.77 2.77

Implementation limitations

Comparison against existing implementations

Because computation on each vertex is independent from other vertices, GraphSAGE is an embarrassingly parallel problem when parallelized across vertices. Running a simple test with the pokec dataset, feature length as 64, num_children_per_source and num_leafs_per_child both at 10, the serial run (with omp-threads forced to 1) on the DGX-1 takes 366390.250 ms, as compared to 25242.941 ms using 32 threads; using 32 threads is about 14.5X faster than a single thread, which shows GraphSAGE scales pretty well on the CPU.

Comparing the running time of a 32-thread OpenMP and Gunrock's GPU implementation, the P100 is about 7.5X to 15X faster. Increasing the feature length from 64 to 128 roughly doubles the computation workload, and the speedup does not change very much for datasets with more than about 1M vertices. However, increasing the number of children per source and the number of leafs per child decreases the speedup. The OMP's running time increases slower than the number of neighbors selected, while the GPU's running time increases faster than the number of neighbors selected. This may be attributed to the cache effect: the parallelism on CPU is limited, so data reuse rate is high, even when the number of neighbors increases; however, the number of source or children processed on the GPU at the same time is much larger, with a much bigger working set, and this decreases the cache hit rate, so results in longer running time.

Also interesting is comparing the runtime on V100 and P100: V100 is about 3X faster than P100 when running GraphSAGE. This large performance difference is caused by the different limiting factors when running GraphSAGE on these two GPUs; details are in the the Performance limitations section. Compared to OpenMP, the V100 is about 20X to 30X faster.

Performance limitations

We profiled GraphSAGE with the pokec dataset on a Titan Xp GPU (profiling on P100 caused an internal error in the profiler itself; Titan Xp has roughly the same SM design as P100, but has only 30 SMs vs. 56 on the P100; P100 also has 16 GB HMB2 memory, and Titan Xp only has 12 GB GDDR5X; runtime on Titan Xp and P100 is similar). kernel2 takes up about 60% of the computation of an batch, 10.64 ms out of 17.76 ms for pokec with 64 feature length, 10 neighbors, and batch size at 16384. Kernel1 takes 1.56 ms, or 8.8%, and Kernel3 takes 5.52 ms, or 31.1%.

Further detail on the profile of Kernel2 on the Titan Xp shows the utilization of the memory system:

Type Transactions Bandwidth Utilization
Shared Loads 2129920 31.997 GB/s
Shared Stores 1638400 24.613 GB/s
Shared Total 3768320 56.611 GB/s Idle to low
Local Loads 0 0 GB/s
Local Stores 0 0 GB/s
Global Loads 2685009922 1575.871 GB/s
Global Stores 0 0 GB/s
Texture Reads 671252480 2521.025 GB/s
Unified Total 3356262402 4096.897 GB/s High
L2 Reads 264170192 992.145 GB/s
L2 Writes 10485773 39.381 GB/s
L2 Total 274655965 1031.526 GB/s Low to medium
Device Reads 2181833 8.194 GB/s
Device Writes 543669 2.042 GB/s
Device Total 2725502 10.236 GB/s Idle to low


It's clear that the unified cache is almost fully utilized, at 4 TBps out of the 5 TBps theoretical upper bound, and is the bottleneck. This is because the W arrays and the intermediate arrays are highly reusable.

Running the same experiment on the V100 shows a different picture within the memory system:

Type Transactions Bandwidth Utilization
Shared Loads 2138599 74.166 GB/s
Shared Stores 1640842 56.904 GB/s
Shared Total 3779441 131.070 GB/s Idle to low
Local Loads 0 0 GB/s
Local Stores 0 0 GB/s
Global Loads 419594240 3637.862 GB/s
Global Stores 0 0 GB/s
Texture Reads 177448960 6153.897 GB/s
Unified Total 597043200 9791.759 GB/s Medium
L2 Reads 8209326 71.174 GB/s
L2 Writes 5243176 45.458 GB/s
L2 Total 274655965 116.633 GB/s Idle to low
Device Reads 2402833 20.832 GB/s
Device Writes 571925 4.959 GB/s
Device Total 2974758 25.791 GB/s Idle to low


On the V100, kernel2 only takes 3.982 ms (about 60% of 6.67 ms per batch), and the unified-cache throughput increases to 9.8 TBps, more than double than on the Titan Xp. In fact, the theoretical upper bound of V100's L1 throughput is 28 TBps, resulted from doubling each SM's load throughput from 128 bytes per cycle to 256 bytes per cycle, and increasing the SM count to 80. This is the reason why the V100 can outperform P100 and Titan Xp by about 3X. The performance bottleneck is no longer the memory system, and switches to integer computations, which comes mainly from array index calculation. This particular kernel also takes up 32 registers per thread, which is the limit for full GPU occupancy. Storing intermediate index calculation results would help if the register usage is not so high.

Next Steps

Alternate approaches

Things we tried that didn't really work

An simple implementation that uses a thread to process the per-source or per-child computation is coded, but it runs about 10X slower than the current implementation that uses a block to process such units. One reason is the use of block-level parallel primitives, such as block reduce. Another reason is that by using a whole block, instead of a thread, to process the same computation, the working set is greatly reduced together with the parallelism, and the whole working set can fit into the cache system. When using higher parallelism, the working set is larger, and forced to be evicted into the global memory, and creates a bottleneck. Actually during the experiment, for the thread-level parallelism implementation, reducing the number of blocks of the kernel improves its running time, the opposite to normally what would be expected from running other kernels.

Gunrock implications

One thing that Gunrock does not provide, or intentionally hides, is block-level parallelism. However, it comes in handy when implementing the custom kernels for GraphSAGE: each block can process a one-dimension vector, with each thread holding one element, then use a block-level reduce to get the sum of those elements; this way is highly efficient, and actually reduces the parallelism and with it the size of the working set of data.

Notes on multi-GPU parallelization

The main memory usage is due to feature data, so it is critical to not duplicate features. The side effect is the computation needs to be divided into child-centric and source-centric parts, and exchange data in between. It should be scalable, and easy to implement.

Notes on dynamic graphs

GraphSAGE does not have a dynamic graph component, but it should able to work on a dynamic graph. Some of the data may be reusable if the graph has not been significantly changed, but the resulting memory requirement to store the intermediate data may make data reuse impossible.

Notes on larger datasets

If the dataset is so large that the graph and the per-vertex feature data are larger than the combined GPU memory, it's possible to accumulate the features of leafs and children on CPU, transfer the data on to GPU, and perform the computation. Because of the cost of the transfer, runtime will increase significantly as compared to the cases when the full data can fit in the GPU memory, but whether that will cause an increase above the OpenMP implementation is still unknown.

Notes on other pieces of this workload

The main part of GraphSAGE workflow is actually the training process, which will be outside of Gunrock, provided by TensorFlow, PyTorch or other machine learning libraries. How to connect the training with the Gunrock GPU implementation is the main task for this workload going forward.

Research potential

The GPU implementation of the embedding part runs a lot faster than on the CPU, and hits a few GPU hardware limitations. It should have comparable runtime to other GPU implementations. But it's only useful as a part of the whole workload and achieves comparable prediction accuracy as conventional / reference implementations.

An interesting question raised from the GraphSAGE GPU implementation is whether it is useful to expose block-level parallelism to higher-level programming models/APIs, and if so, how to do that. It's clear that working at the block level provides benefits, such as the ability to use block-level primitives like scan and reduce. But it also comes with costs, most importantly, requiring the programmer to have knowledge about the GPU hardware. It also reduces the portability of the implementation, because two-level parallelism may not exist on other processors.