# Geolocation
Geolocation infers user locations using the locations (latitude, longitude) of friends through spatial label propagation. Given a graph $G$, geolocation examines each vertex $v$'s neighbors and computes the spatial median of the neighbors' location list. The output is a list of predicted locations for all vertices with unknown locations.

## Summary of Results
Geolocation or geotagging is an interesting parallel problem because it is among the few that exhibit a dynamic parallelism pattern within the compute. The pattern is as follows: there is parallel compute across nodes, each node has some serial work, and within that serial work there are several parallel math operations. Even without leveraging dynamic parallelism within CUDA (kernel launches within a kernel), Geolocation performs well on the GPU because it mainly requires simple math operations instead of complicated memory movement schemes.
However, the challenge within the application is load balancing this simple compute such that each processor has roughly the same amount of work. Currently, in Gunrock, we map Geolocation using the `ForAll()` compute operator with optimizations to exit early (performing less work and fewer reads). Even without addressing the load-balancing issue with a complicated balancing scheme, on the HIVE datasets we achieve a 100x speedup with respect to the CPU reference code (implemented in C++ with OpenMP) and a ~533x speedup with respect to the GTUSC implementation. We improve upon the algorithm by avoiding a global gather and a global synchronize, and by using 3x less memory than the GTUSC reference implementation.

## Summary of Gunrock Implementation
We took two approaches to implementing Geolocation within Gunrock:

1. **[Fewer Reads] Global Gather:** uses two compute operators, both `ForAll()`s. The first `ForAll()` is a gather operation, collecting the locations of all of an active vertex $v$'s neighbors with known locations; the second `ForAll()` uses those values to compute the `spatial_center`, where the spatial center of a list of points is the center of those points on the earth's surface.

    ```python
    def gather_op(v):
        # first ForAll(): gather the known locations of v's neighbors
        for neighbor in G.neighbors(v):
            if isValid(neighbor.location):
                locations_list[v].append(neighbor.location)

    def compute_op(v):
        # second ForAll(): compute the spatial center of the gathered locations
        if not isValid(v.location):
            v.location = spatial_center(locations_list[v])
    ```

2. **[Less Memory] Repeated Compute:** skips the global gather and uses only one compute operator, a `ForAll()`, to find the spatial center of every vertex. During the spatial center computation, instead of iterating over only the valid neighbors (a valid neighbor is a neighbor with a known location), we iterate over all neighbors of each vertex, doing more random reads than the global gather approach but using 3x less memory.

    ```python
    def spatial_center(v):
        # iterate over all of v's neighbors; spatial_median() skips the
        # invalid locations instead of reading a pre-gathered list
        if not isValid(v.location):
            v.location = spatial_median(neighbors_list[v])
    ```

3. **[Optimization] Early Exit:** fuses the global gather approach with the repeated compute by performing one local gather for every vertex within the spatial center operator (without a costly device barrier) and exiting early if a vertex $v$ has only one or two valid neighbors:

    ```python
    def spatial_center(v):
        if not isValid(v.location):
            if v.valid_locations == 1:
                # adopt the single valid neighbor's location
                v.location = valid_neighbor[v].location
            elif v.valid_locations == 2:
                # midpoint of the two valid neighbors' locations
                v.location = mid_point(valid_neighbors[v].location)
            else:
                v.location = spatial_median(neighbors_list[v])
    ```
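
For the two-neighbor case, `mid_point` reduces to the midpoint of a great-circle arc. A minimal illustrative sketch, assuming the helper receives the two valid neighbors' (latitude, longitude) pairs in degrees (this function is hypothetical, not Gunrock's actual implementation):

```python
import math

def mid_point(p, q):
    # standard great-circle midpoint: treat both points as unit vectors,
    # average them, and project the result back onto the sphere
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    bx = math.cos(lat2) * math.cos(lon2 - lon1)
    by = math.cos(lat2) * math.sin(lon2 - lon1)
    lat_m = math.atan2(math.sin(lat1) + math.sin(lat2),
                       math.sqrt((math.cos(lat1) + bx) ** 2 + by ** 2))
    lon_m = lon1 + math.atan2(by, math.cos(lat1) + bx)
    return math.degrees(lat_m), math.degrees(lon_m)
```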

**Comparing Global Gather vs. Repeated Compute:**

| Approach | Memory Usage | Memory Reads/Vertex | Device Barriers | Largest Dataset (P100) |
|----------|--------------|---------------------|-----------------|------------------------|
| Global Gather | $O(3 \cdot \cardinality{E})$ | # of valid locations | 1 | ~160M edges |
| Repeated Compute | $O(\cardinality{E})$ | degree of vertex | 0 | ~500M edges |
Note: `spatial_median()` computes the center of a set of points on the earth's surface: given a set of points $Q$, it finds the point $p$ such that `sum([haversine_distance(p, q) for q in Q])` is minimized. See `gunrock/app/geo/geo_spatial.cuh` for details on the spatial median implementation.
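
To make the minimization concrete, here is a rough CPU-side sketch that approximates the spatial median with Weiszfeld-style iterations (inverse-distance re-weighting). It is only a sketch of the idea under those assumptions, not the `geo_spatial.cuh` kernel; the iteration count plays the role of the `--spatial-iter` parameter:

```python
import math

def haversine_distance(p, q, radius_km=6371.0):
    # great-circle distance between two (lat, lon) pairs given in degrees
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def spatial_median(points, spatial_iter=1000):
    # initial guess: component-wise mean of the points
    lat = sum(q[0] for q in points) / len(points)
    lon = sum(q[1] for q in points) / len(points)
    for _ in range(spatial_iter):
        # Weiszfeld step: re-weight each point by 1 / distance to the guess
        w = [1.0 / max(haversine_distance((lat, lon), q), 1e-12) for q in points]
        lat = sum(wi * q[0] for wi, q in zip(w, points)) / sum(w)
        lon = sum(wi * q[1] for wi, q in zip(w, points)) / sum(w)
    return lat, lon
```

For points clustered in a small region this converges close to the true spherical median; a production implementation would iterate on the sphere itself rather than in raw (lat, lon) coordinates.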

## How To Run This Application on DARPA's DGX-1

### Prerequisites

```shell
git clone --recursive https://github.com/gunrock/gunrock -b dev-refactor
cd gunrock/tests/geo/
make clean && make
```

### HIVE Data Preparation
Prepare the data; skip this step if you are just running the sample dataset. Assuming we are in the `tests/geo` directory:

```shell
export TOKEN= # get this authentication TOKEN from
              # https://api-token.hiveprogram.com/#!/user
wget --header "Authorization:$TOKEN" \
    https://hiveprogram.com/data/_v0/geotagging/instagram/instagram.tar.gz
tar -xzvf instagram.tar.gz && rm instagram.tar.gz
cd instagram/graph
cp ../../generate-data.py ./
python generate-data.py
```
This will generate two files, `instagram.mtx` and `instagram.labels`, which can be used as input to the geolocation app.

### Running the application
Application-specific parameters:

```
--labels-file    file name containing node ids and their locations.
--geo-iter       number of iterations to run geolocation (stop condition).
                 (default = 3)
--spatial-iter   number of iterations for the spatial median computation.
                 (default = 1000)
--geo-complete   run geolocation for as many iterations as required
                 to find locations for all nodes.
                 (default = false because it uses atomics)
--debug          debug the label values; prints the entire labels
                 array (latitude, longitude).
                 (default = false)
```
Example command line:

```shell
# geolocation.mtx is a graph based on the chesapeake.mtx dataset
./bin/test_geo_10.0_x86_64 --graph-type=market --graph-file=./geolocation.mtx \
    --labels-file=./locations.labels --geo-iter=2 --geo-complete=false
```
Sample input (labels):

```
% Nodes Latitude Longitude
39 2 2
1 37.7449063493 122.009432884
2 37.8668048274 122.257973253
4 37.869112506 122.25910604
6 37.6431858915 121.816156983
11 37.8652346572 122.250634008
19 38.2043433677 114.300341275
21 36.7582225593 118.167916598
22 33.9774659389 114.886512278
30 39.2598884729 106.804662071
31 37.880443573 122.230147039
39 9.4276164485 110.640705659
```
Sample output:

```
Loading Matrix-market coordinate-formatted graph ...
Reading from ./geolocation.mtx:
  Parsing MARKET COO format edge-value-seed = 1539674096
 (39 nodes, 340 directed edges)...
Done parsing (0 s).
  Converting 39 vertices, 340 directed edges (ordered tuples) to CSR format...
Done converting (0s).
Labels File Input: ./locations.labels
Loading Labels into an array ...
Reading from ./locations.labels:
  Parsing LABELS
 (39 nodes)
Done parsing (0 s).
Debugging Labels
(nans represent unknown locations)
locations[ 0 ] = < 37.744907 , 122.009430 >
locations[ 1 ] = < 37.866806 , 122.257973 >
locations[ 2 ] = < nan , nan >
locations[ 3 ] = < 37.869114 , 122.259109 >
...
locations[ 35 ] = < nan , nan >
locations[ 36 ] = < nan , nan >
locations[ 37 ] = < nan , nan >
locations[ 38 ] = < 9.427616 , 110.640709 >
__________________________
______ CPU Reference _____

Elapsed: 0.267029
Initializing problem ...
Number of nodes for allocation: 39
Initializing enactor ...
Using advance mode LB
Using filter mode CULL
nodes=39
__________________________
0 0 0 queue3 oversize : 234 -> 342
0 0 0 queue3 oversize : 234 -> 342

Run 0 elapsed: 11.322021, #iterations = 2
Node [ 0 ]: Predicted = < 37.744907 , 122.009430 > Reference = < 37.744907 , 122.009430 >
Node [ 1 ]: Predicted = < 37.866806 , 122.257973 > Reference = < 37.866806 , 122.257973 >
Node [ 2 ]: Predicted = < 9.427616 , 110.640709 > Reference = < 9.427616 , 110.640709 >
Node [ 3 ]: Predicted = < 37.869114 , 122.259109 > Reference = < 37.869114 , 122.259109 >
...
Node [ 35 ]: Predicted = < 37.864429 , 122.199409 > Reference = < 37.864429 , 122.199409 >
Node [ 36 ]: Predicted = < 23.755602 , 115.803055 > Reference = < 37.807079 , 122.134163 >
Node [ 37 ]: Predicted = < 37.053715 , 115.913658 > Reference = < 37.053719 , 115.913628 >
Node [ 38 ]: Predicted = < 9.427616 , 110.640709 > Reference = < 9.427616 , 110.640709 >
0 errors occurred.
[geolocation] finished.
avg. elapsed: 11.322021 ms
iterations: 2
min. elapsed: 11.322021 ms
max. elapsed: 11.322021 ms
load time: 68.671 ms
preprocess time: 496.136000 ms
postprocess time: 0.463009 ms
total time: 508.110046 ms
```

## Output
When `--quick` mode is disabled, the application runs the CPU reference implementation, which validates the GPU implementation's results by comparing the predicted latitude and longitude of each vertex against the CPU reference. Further correctness checking was performed by comparing results to the HIVE reference implementation.

The Geolocation application also supports `--quiet` mode, which skips the output and just reports the performance metrics (note: this still runs the CPU implementation in the background, without any output).

## Performance and Analysis
Runtime is the key metric for measuring Geolocation's performance. We also check the prediction accuracy of the labels, but accuracy is a correctness threshold rather than a performance metric: if a certain threshold is not met when comparing results to the CPU reference code, the output is considered incorrect and that run is invalid. Therefore, this report focuses on runtime.

### Implementation limitations
Geolocation is also one of the few applications that exhibit a dynamic parallelism pattern:

- parallel compute across the nodes,
- serial compute per node, and
- parallel compute within the serial compute per node.

One way to implement this would use the `ForAll()` operator for the parallel compute across the nodes, a simple while loop for the serial compute per node, and finally multiple `neighbor_reduce()` operators for the parallel work within the serial while loop (see the sketch below). Currently, we do not have a way to support this within Gunrock, but moving forward we can potentially leverage kernel launches within a kernel ("dynamic parallelism") to address this limitation.
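
As a minimal self-contained sketch of that nesting, plain sequential Python stands in for the operators below: the outer loop is what `ForAll()` would parallelize, and each inner reduction is what a hypothetical `neighbor_reduce()` would parallelize (a simple mean stands in for the full spatial median):

```python
def geolocation_pattern(neighbors, location, num_steps=3):
    # neighbors: adjacency lists; location: list of (lat, lon) or None
    for v in range(len(neighbors)):              # parallel across nodes
        if location[v] is not None:
            continue
        for _ in range(num_steps):               # serial work per node
            known = [location[u] for u in neighbors[v]
                     if location[u] is not None]
            if not known:
                break
            # parallel math per serial step: reductions over the neighbor list
            location[v] = (sum(p[0] for p in known) / len(known),
                           sum(p[1] for p in known) / len(known))
    return location

# toy usage: vertex 2 starts unknown and inherits from neighbors 0 and 1
locations = geolocation_pattern([[1, 2], [0, 2], [0, 1]],
                                [(37.74, -122.01), (37.87, -122.26), None])
```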

### Comparison against existing implementations

| GPU | Dataset | $\cardinality{V}$ | $\cardinality{E}$ | Iters | Spatial Iters | GTUSC, 16 threads (ms) | Gunrock CPU (ms) | Gunrock GPU (ms) |
|-----|---------|-------------------|-------------------|-------|---------------|------------------------|------------------|------------------|
| P100 | sample | 39 | 170 | 10 | 1000 | N/A | 0.144005 | 0.022888 |
| P100 | instagram | 23,731,995 | 82,711,740 | 10 | 1000 | 8009.491 | 1589.884033 | 15.113831 |
| V100 | twitter | 50,190,344 | 488,078,602 | 10 | 1000 | N/A | 9216.666016 | 46.108007 |
On a workload that fills the GPU, Gunrock outperforms GT's OpenMP C++ implementation by ~533x. Comparing Gunrock's GPU vs. CPU performance, the GPU version outperforms the CPU implementation by 100x. There is a lack of available datasets against which we can compare performance, so we use only the provided instagram and twitter datasets, plus a toy sample as a sanity check, on NVIDIA's P100 with 16 GB of global memory and V100 with 32 GB of global memory. All tested implementations meet the accuracy criterion, which is validated against the output of the original Python implementation.

- The HIVE reference implementation uses distributed PySpark.
- The GTUSC implementation uses C++ OpenMP.

### Performance limitations
As discussed later in the "Alternate approaches" section, the current implementation of geolocation uses a compute operator with minimal load balancing. In cases where the graph is not nicely distributed (i.e., vertex degrees vary widely), the entire application suffers significantly from load imbalance.

Profiling shows that 98.78% of GPU compute time goes to the `spatial_median` kernel, which gives us a clear place to focus our load-balancing efforts within the operator. Specifically, we must target the `for` loops that iterate over the neighbor list for the spatial center calculations.

## Next Steps

### Alternate approaches

- **Neighborhood Reduce w/ Spatial Center:** We can get better load balancing by leveraging a neighbor-reduce (`advance` operator + `cub::DeviceSegmentedReduce`) instead of a compute operator. In graphs where node degrees vary a lot, the compute operator will be significantly slower than a load-balanced advance + segmented reduce.
- **Push-Based Approach:** Instead of gathering the locations of all the neighbors of an active vertex, we could scatter the valid locations of all active vertices to their neighbors; this is a push approach vs. our current implementation's pull (see the sketch below). Like the global gather approach, a push-based geolocation could also suffer from load imbalance: some vertices will have to broadcast their valid locations to a long list of neighbors, while others will have only a few neighbors to update. A push-based approach will also require a device synchronize before the spatial center computation, but may perform better by using an `advance_op` with an atomic update (note: the pull is done using a `ForAll()`).
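
In the same pseudocode style as the earlier operators, the pull/push contrast might look like the following (the `atomic_append` helper is hypothetical):

```python
# pull (current): each vertex with an unknown location reads its neighbors
def pull_op(v):
    if not isValid(v.location):
        v.location = spatial_center(neighbors_list[v])

# push (alternative): each vertex with a known location writes to its
# neighbors; concurrent writers require an atomic update
def push_op(v):
    if isValid(v.location):
        for neighbor in G.neighbors(v):
            atomic_append(locations_list[neighbor], v.location)
```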

### Gunrock implications

- **The `predicted` atomic:** Geolocation, like some other applications, stops when the labels of all vertices have been predicted or determined; in Geolocation's case, once a location is predicted for every node, the algorithm converges. We currently implement this with a loop and an atomic. This needs to be more of a core operation (a mini-operator) such that a stop condition is met when `isValidCount(labels) == V`. Currently, we sidestep this issue by using a number-of-iterations parameter to determine the stop condition (see the sketch below).
- **Parallel > Serial > Parallel:** As discussed earlier, Gunrock currently doesn't have a way to express dynamic parallelism, or even a kernel launch within a kernel. In Geolocation's case, the minor parallel work inside the serial loop needs to become multiple neighbor reduces.
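
A hedged host-side sketch of the two stop conditions (`isValidCount` is the proposed mini-operator named above; `geo_iter` is the existing command-line parameter):

```python
# current: run a fixed number of iterations (sidesteps the atomic);
# desired: also stop as soon as every vertex has a valid location
for _ in range(geo_iter):
    ForAll(G.vertices, spatial_center)       # one prediction pass
    if isValidCount(locations) == num_vertices:
        break                                # all labels predicted: converged
```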

### Notes on multi-GPU parallelization
The challenging part of a multi-GPU Geolocation is obtaining an updated node location from a separate device when two vertices on different devices share an edge. An interesting approach here would be leveraging P2P memory bandwidth over the new NVLink connectors to exchange a small number of updates across NVLink's memory lanes; other options are simply using direct access or explicit data movement. This is detailed more in the scaling documentation, but the communication model for multi-GPU geolocation could work as follows:

```
do
    local Geolocation updates on local vertices;
    broadcast local vertices' updates;
while updates remain.
```

### Notes on dynamic graphs
Streaming graphs pose an interesting problem for Geolocation: if a new edge is introduced while we are predicting the location of a node, the location of the affected vertex has to be recomputed entirely. This can still be done iteratively: if a node is inserted as a neighbor of a vertex, that vertex's predicted location is marked invalid, and during the next iteration it is recomputed along with all the other invalid vertices (locations), as sketched below.
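
In pseudocode, the invalidation rule could look like this (`is_predicted` is a hypothetical flag distinguishing predicted locations from ground-truth labels):

```python
def on_edge_insert(u, v):
    # a new neighbor can shift the spatial median, so stale *predicted*
    # locations are invalidated; ground-truth labels stay untouched
    for w in (u, v):
        if isValid(w.location) and is_predicted(w):
            w.location = (float('nan'), float('nan'))  # recompute next iteration
```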

### Notes on larger datasets
If a dataset is larger than a single GPU's memory or multiple GPUs' aggregate memory, the straightforward solution would be to let CUDA's Unified Virtual Memory (UVM) handle memory movement automatically.

### Notes on other pieces of this workload
Geolocation calls many CUDA math functions (`sin`, `cos`, `atan`, `atan2`, `median`, `mean`, `fminf`, `fmaxf`, etc.). Some of these micro-workloads can also leverage the GPU's parallelism; for example, a mean could be implemented using a reduce (a sum followed by a divide by the count). We currently don't have these math operators exposed within Gunrock in a way that lets graph applications use them.

## Research Potential
Further research is required to study Geolocation's dynamic parallelism pattern, its memory access behavior, compute resource utilization, implementation details (API and core), and load-balancing strategies for dynamic parallelism on GPUs. Studying and understanding this pattern can allow us to create a more generalized approach for load balancing parallel > serial > parallel types of problems. It further raises the question of when dynamic parallelism is better than mapping an algorithm to a more conventional static approach (when such a mapping is possible).