toc_footers:  Gunrock: GPU Graph Analytics  Gunrock © 2018 The Regents of the University of California.
search: true
full_length: true
Sparse Fused Lasso
Given a graph where each vertex on the graph has a weight, sparse fused lasso (SFL), also named sparse graph trend filter (GTF), tries to learn a new weight for each vertex that is (1) sparse (most vertices have weight 0), (2) close to the original weight in the l2 norm, and (3) close to its neighbors' weight(s) in the l1 norm. This algorithm is usually used in main trend filtering (denoising). For example, an image (grid graph) with noisy pixels can be filtered with this algorithm to get a new image without the noisy pixels, which are "smoothed out" by its neighbors. https://arxiv.org/abs/1410.7690
Summary of Results
The SFL problem is mainly divided into two parts, computing residual graphs from maxflow and renormalizing the weights of the vertices. Maxflow is performed with the parallelizable lockfree pushrelabel algorithm. For renormalization: each vertex has to compute which communities it belongs to, and normalize the weights with other vertices in the same community. SFL iterates on maxflow and renormalization kernels with a global synchronization in between them until convergence. Current analysis show that maxflow is the bottleneck of the whole workflow, with over 90% of the runtime being spent on the maxflow kernels.
GPU SFL runs ~2 times slower than the CPU benchmark on the largest dataset provided. On smaller datasets, GPU SFL is much slower because there just isn't enough work to fill up a GPU and leverage the compute we have available. Analyzing the runs on the larger datasets, show that the parametric maxflow on the CPU converges much faster than our parallel pushrelabel max flow algorithm on the GPU. Investigating the parallelization of parametric maxflow is an interesting research challenge.
Summary of Gunrock Implementation
The loss function is $0.5 \cdot \sum((y'  y)^2) + \lambda_1 \cdot \sum(y_i'  y_j') + \lambda_2 \cdot \sum(y_i')$, where $y$ is the input weight (observation) for each vertex and $y'$ (fitted weight) is the new weight for each vertex. $\lambda_1$ and $\lambda_2$ are required parameters to process the graph in SFL.
The input graph is specified in two files. The first file contains the original vertices' weights and the second file contains the directed graph connectivity without weights (edge pairs only). These two files and a edge_regularization_strength ($\lambda_1$) of the directed graph edge weights are the input to preprocessing. Two extra vertices, source and sink, are added to the original graph as well. They serve as two "labels" of different segments on the graph. This results in a graph where edges that connect to the source or sink have edgeweights as in the vertices' weights
file, and the other edges are assigned an edge weight of $\lambda_1$.
The Gunrock implementation of this application has two parts. The first part is the maxflow algorithm. We choose a lockfree pushrelabel maxflow formulation, which is wellsuited for parallelization on a GPU with Gunrock. Computing a valid maxflow also implies a solution to the mincut problem, which is a segmentation of a graph into two pieces where the division is across a set of edges with the minimum possible weights. The output of this maxflow algorithm is (1) a residual graph where each edge weight is computed as residual_capacity
equals capacity  edge_flow
, and (2) a Boolean array that marks which vertices are reachable from the source after the mincut.
The second part is a renormalization of the residual graph and clustering based on reachability of the vertex. The renormalization is a process where (1) averages of the new weights of vertices that are grouped together as communities are computed, and (2) the new weights then subtract their own community averages. After the renormalization is done, this renormalized residual graph is passed into the maxflow again. Several iterations of maxflow then renormalization are needed before the new weights of different communities converge because vertices can be reassigned to different communities. In each of the SFL iterations, two nonoverlapped subgraphs will be generated by maxflow/mincut, and thus the big communities in the last SFL iteration will have split into small communities. The vertices in a specific community will have the same new weights assigned to them.
The outputs will be the normalized values assigned to each vertex.
Lastly, these values will be passed into a softthreshold function with $\lambda_2$ to achieve the sparse representation by dropping the small absolute values. More specifically, the new weight will be subtracted by $\lambda_2$ if the new weight is positive and larger than $\lambda_2$, or added by $\lambda_2$ if the new weight is negative and smaller than $\lambda_2$. If the new weight is in between $\lambda_2$ to $\lambda_2$, then the new weights will be 0.
Pseudocode for the core SFL algorithm is as follows:
Load the graph and normalize edge weights
Repeat iteration till convergence:
# Part 1: Maxflow
do
lockfree_op (numrepeats, V)
global_relabeling_heuristic // update heights of all vertices
while no more updates
# Lockfree PushRelabel operator
def lockfree_op (numrepeats, V):
for i from 1 to numrepeats do:
for each vertex v in V:
if v.excess > 0:
u := lowest_neighbor_in_residual_network (v)
if u.height < v.height:
Push (v, u)
else:
Relabel (v, u)
# Finding lowest neighbor in residual network phase
def lowest_neighbor_in_residual_network (v):
min_height := infinity
min_u := undefined
for each e<v, u> of v:
if e.residual_capacity <= 0:
continue;
if min_height > u.height:
min_height := u.height
min_u := u
return min_u
# Push phase
def Push (e<v, u>):
move := min(e.residual_capacity, v.excess)
v.excess = move
e.flow += move
u.excess += move
e.reverse.flow = move
# Relabel phase
def Relabel (e<v, u>):
v.height := u.height + 1
# Mincut
Run a BFS to mark the accessibilities of vertices from the source vertex
in the residue graph
# Part 2: renormalization
# Reset available community
for each community comm:
community_weights[comm] := 0
community_sizes [comm] := 0
next_communities [comm] := 0
# Accumulate the weights and count the number of vertices belong to the communities
for each vertex v:
if v is accessible from the source
comm := next_communities[curr_communities[v]];
if (comm == 0) update comm
community_sizes[comm]++
community_weights[comm] += weight[source > v]
else
community_weights[comm] = weight[v > sink]
community_sizes [comm] ++
# Normalize community
for each community comm:
community_weights[comm] /= community_sizes[comm]
community_accus [comm] += community_weights[comm]
# Update the residual graph
for each vertex v:
comm = curr_communities[v]
if (v is accessible from the source):
weight[source>v] = community_weights[comm]
if (weight[source>v] < 0):
Swap(weight[source>v], weight[v>sink])
else
weight[v>sink] += community_weights[comm]
if (weight[v>sink] < 0):
Swap(weight[source>v], weight[v>sink])
# End of Repeats
# Part 3 Sparsify community_accus by \lambda_2
for i in len(each community_accus):
if (community_accus < 0):
community_accus[i] := min(community_accus[i] + \lambda_2, 0)
else:
community_accus[i] := max(community_accus[i]  \lambda_2, 0)
return community_accus
How To Run This Application on DARPA's DGX1
Prereqs/input
CUDA should have been installed; $PATH
and $LD_LIBRARY_PATH
should have been
set correctly to use CUDA. The current Gunrock configuration assumes boost
(1.58.0 or 1.59.0) and Metis are installed; if not, changes need to be made in
the Makefiles. DARPA's DGX1 has both installed when the tests are performed.
git clone recursive https://github.com/gunrock/gunrock b devrefactor
cd gunrock/tests/gtf
make clean && make
At this point, there should be an executable gtf_main_<CUDA version>_x86_64
in tests/gtf/bin
.
The testing is done with https://github.com/wetliu/gunrock/tree/master using stable
branch at commit a2159ce060d4dbc89a88f8d5fb0fd0d571ba907b
(Nov. 14, 2018), using CUDA 10.1 with NVIDIA driver 390.30.
HIVE Data Preparation
Prepare the data; skip this step if you are just running the sample dataset.
Refer to parse_args()
in taxi_tsv_file_preprocessing.py
for dataset preprocessing options.
If you don't have the e
or n
files:
cd gunrock/tests/gtf/_data
export TOKEN= # get this Authentication TOKEN from https://apitoken.hiveprogram.com/#!/user
mkdir p _data
wget header "Authorization:$TOKEN" \
https://hiveprogram.com/data/_v0/sparse_fused_lasso/taxi/taxismall.tar.gz
tar xzvf taxismall.tar.gz && rm r taxismall.tar.gz
mv taxismall _data/
wget header "Authorization:$TOKEN" \
https://hiveprogram.com/data/_v0/sparse_fused_lasso/taxi/taxi1M.tar.gz
tar xzvf taxi1M.tar.gz && rm r taxi1M.tar.gz
mv taxi1M _data/
python taxi_tsv_file_preprocessing.py
Two files, e
and n
, are generated.
Set $\lambda_1$ (see equation above) in generate_graph.py. If you have the e
and n
files, generated by the above commands or copied from /raid/data/hive/gtf/{small, medium, large}
, you can do the following (/raid/data/hive/gtf/small/
as an example):
cd gunrock/tests/gtf/_data
cp /raid/data/hive/gtf/small/n .
cp /raid/data/hive/gtf/small/e .
python generate_graph.py
A file called std_added.mtx
is generated for Gunrock input.
Running the application
lambda_2 is the sparsity regularization strength
Sample command line with argument.
./bin/test_gtf_10.0_x86_64 market ./_data/std_added.mtx lambda_2 3
Please also note that we have set the number of MF iterations to 10000 as a default. In larger graphs, a larger number may be required.
Output
The code will output two files in the current directory. One is called output_pr.txt
(for CPU reference) and the other is called output_pr_GPU.txt
(for GPU SFL with pushrelabel backend).
Each vertex's new weight will be stored in each line of the two files. These outputs could be further processed into the resulting heatmap.
The printout after running gtf_main_<CUDA version>_x86_64
includes the timing of the application.
Sample txt
output:
0
0
0
0
0
0
11.375
0.307292
0
Sample output on console:
______CPU reference algorithm______ [9/1944]
offset is 58522 num edges 76366
!!!!!!!!!! avg is 0.876037
Iteration 0
...
Iteration 11

Elapsed: 5500.730991 ms
in side that weird function1
offset in GPU preprocess is 58522 num edges 76366
avg in GPU preprocess is 0.876037
Using advance mode LB
Using filter mode CULL
Using advance mode LB
Using filter mode CULL
offset is 58522 num edges 76366
h_community_accus is 0.876037
______GPU SFL algorithm____
enact calling successfully!!!!!!!!!!!

Run 0, elapsed: 3341.358185 ms, #iterations = 12
transfering to host!!!: 8924
[gtf] finished.
avg. elapsed: 3341.358185 ms
iterations: 0
min. elapsed: 3341.358185 ms
max. elapsed: 3341.358185 ms
load time: 138.952 ms
preprocess time: 342.184000 ms
postprocess time: 0.258923 ms
total time: 3845.412016 ms
Performance and Analysis
We measure the runtime and loss function $0.5 \cdot \sum((y'  y)^2) + \lambda_1 \cdot \sum(y_i'  y_j') + \lambda_2 \cdot \sum(y_i')$, where $y$ is the input weight (observation) for each vertex and $y'$ (fitted weight) is the new weight for each vertex.
Implementation limitations

Memory usage Each edge in the graph needs at least 28 bytes of GPU device memory: 64 bits for the residual_capacity value, 64 bits for the capacity value, 64 bits for the flow value, 16 bits for the index of the reverse edge. Each node in the graph needs at least 16 bytes of GPU device memory: 64 bits for the excess value, 16 bits for the height value and 16 bits for the index of the lowest neighbor.

Data type For edges we have the following arrays: residual_capacity, capacity, flow, reverse. For nodes we have the following arrays: height, excess, lowest_neighbor. Residual_capacity, capacity, flow and excess arrays and all computation around them are doubleprecision floating point values (64bit
double
in C/C++). We use an epsilon value of 1e6 to compare floatingpoint numbers to zero. Reverse, height, and lowest_neighbor arrays and all computation around them are 32bit integer values.
Comparison against existing implementations
Graphtv is an official implementation of the sparse fused lasso algorithm with a parametric maxflow backend. It is a CPU serial implementation https://www.cs.ucsb.edu/~yuxiangw/codes/gtf_code.zip. The Gunrock GPU runtime is measured between the application enactor and it is an output of the application.
All datasets in the table below are generated from taxismall.tar.gz
with different timestamps. $\sim 20K$node graph is used as a sample test, $\sim 300K$ is a mediumsized dataset, and the largest available is $\sim 600K$ nodes.
Dataset  time starts  time ends  $\cardinality{E}$  $\cardinality{V}$  Graphtv runtime  Gunrock GPU runtime 

NY Taxi20K  20110626 12:00:00  20110626 14:00:00  20349  8922  0.11s  4.98s 
NY Taxi300K  20110626 00:00:00  20110627 00:00:00  293259  107064  8.71s  143.91s 
NY Taxi600K  20110619 00:00:00  20110627 00:00:00  588211  213360  103.62s  211.07 
Dataset  time starts  time ends  $\cardinality{E}$  $\cardinality{V}$  Graphtv loss  Gunrock GPU loss 

NY Taxi20K  20110626 12:00:00  20110626 14:00:00  20349  8922  132789.32  132789.32 
NY Taxi300K  20110626 00:00:00  20110627 00:00:00  293259  107064  1094282.51  1094282.51 
NY Taxi600K  20110619 00:00:00  20110627 00:00:00  588211  213360  1670947.26  1670947.26 
Performance limitations
We observe a slowdown when we compare Graphtv and Gunrock's current SFL implementation on the three datasets provided. On the smaller datasets there just isn't enough work to fill up a GPU and leverage the compute we have available. Even on the larger dataset we do not see a speedup because of superior serial algorithm within the CPU implementation of maxflow that converges much earlier than our parallel pushrelabel max flow algorithm on the GPU.
We make a detailed performance analysis on the NY Taxi
dataset with time from 20110619 00:00:00
to 20110627 00:00:00
:

The maxflow algorithm uses serial global relabeling and gap heuristics performed in device memory by one CUDA thread. The profiling shows this heuristics computation takes about 42% of MF's whole computation time, while the rest (58%) is push relabel (precisely: Gunrock
RepeatFor
operator running the lockfree pushrelabel operator). Since maxflow takes up about 96% of SFL's computation time, this makes the SFL kernel launchoverheadbound. Further optimization of the maxflow algorithm is discussed below in the "maxflow optimization opportunities" paragraph. 
This dataset, with $\sim 213k$ vertices and $\sim 588k$ edges, is still too small to fill up the GPU. It makes the kernel launching overhead issue much worse than it would with larger graphs.

There are some engineering limitations in our current implementation:

The global relabeling and gap heuristics in maxflow are performed by one thread; although they only run once per a few 1000 iterations, these procedures take almost half the computation time.

The current mincut finding and renormalization are serial on GPU (i.e., only using a single thread to perform the computation).

Maxflow lessons learned

Our first approach to the maxflow problem was the pushrelabel algorithm in its simplest form. This implementation used 3 operators with 2 synchronizations between them. Because we had a large number of iterations of this algorithm, our computation was dominated by kernellaunch overhead.

To reduced the number of kernels and synchronizations, we implemented pushrelabel with a lockfree approach proposed by Bo Hong in "A Lockfree Multithreaded Algorithm for the Maximum Flow Problem" http://people.cs.ksu.edu/~wls77/weston/projects/cis598/hong.pdf. This modification allowed us to reduce the operators down to only one lockfree pushrelabel operator and one synchronization call.

Our next optimization was reducing the iteration count of pushrelabel, thus reducing the number of synchronizations. For this purpose, we used the Gunrock operator
RepeatFor
. TheRepeatFor
operator allowed us to merge several kernel launches into one stacked kernel with a global barrier between each kernel call. We have also experimented with the new CUDA 10 feature calledcudaGraphs
in attempt to reduce the overall kernel launch overhead by creating a task graph of all kernels, but quickly found that a stacked kernel approach worked the best. Now instead of launching hundreds of thousands of iterations (thus kernels) until convergence, we only have to perform tens of iterations with hundreds of thousands of repeated kernels within each iteration.
Next Steps
Maxflow optimization opportunities

We are looking at optimization to speed up the global reabelinggap heuristic. It is currently performed in the Device memory by one thread. This algorithm is based on BFS, which could be parallelized as well.

We would like to explore the parallelization of another approach, the Boykov Kolmogorov algorithm proposed by Yuri Boykov and Vladimir Kolmogorov in "An Experimental Comparison of MinCut/MaxFlow Algorithms for Energy Minimization in Vision" http://www.csd.uwo.ca/faculty/yuri/Papers/pami04.pdf. Instead of finding a new augmenting path or edge (pushrelabel) on the graph in each iteration, the Boykov Kolmogorov algorithm keeps the already found items in two search trees. We would like to use this idea in our pushrelabel implementation as well.
Gunrock implications
SFL is the first application in Gunrock that calls another application (maxflow). Some common data preprocessing on the CPU requires better designs of the APIs to facilitate this usage. For example, gtf_enactor
needs to call mf_problem.reset()
, but because the current maxflow code uses CPU to preprocess the graph, SFL has to transfer the data back and forth between CPU and GPU. Using GPU to preprocess the graph for maxflow would be preferable.
Notes on multiGPU parallelization
To parallelize the pushrelabel algorithm across multiple GPUs, all arrays related to the graph have to be stored on each GPU. Moreover, the GPUs have to update their adjacent neighbors' data. Because the pushrelabel algorithm needs to store at least three arrays of size $O(E)$ and three arrays of size $O(V)$, communicating so much data efficiently between GPUs is challenging.
The SFL renormalization should be able to be easily parallelized across different GPUs, because it is array operations only. However, extra data transfer is necessary if the graph is not copied across multiple GPUs.
Notes on dynamic graphs
Push relabel is not directly related to dynamic graphs. But it should be able to run on a dynamic graph, provided the source and the sink are given at the beginning of the algorithm and the way to access all the nodes and the edges is the same. Capacities of edges from the previous graph can be used as a good starting point, if the edges and the node ids are consistent and the graph is not dramatically changed.
However, SFL would be a significant challenge with dynamic graphs (the topology of the graph would change). The residual graph includes a swapping edge value (see pseudocode above) and we need to know characteristics of the new graph in order to allocate enough memory space for the new edges and vertices.
Notes on larger datasets
SFL renormalization can be done without having its temporary arrays on the same GPU, but extra communication costs are necessary if these arrays are in different GPUs' global memories. Unified memory can also be used to handle larger datasets by oversubscribing and paying the cost of CPU to GPU transfer. Gunrock needs no specific new support to support SFL renormalization on larger datasets.
Research opportunities
Prof. Sharpnack indicates that this implementation could be generalized to multigraph fused lasso. The idea is to set multiple edge values for the edges connecting to source/sink, while keeping the graph topology and edge values $(\lambda_1)$ for the edges in the original graph (excluding source and sink) the same.