toc_footers:
- Gunrock: GPU Graph Analytics
- Gunrock © 2018 The Regents of the University of California.

search: true

full_length: true

Sparse Fused Lasso

Given a graph where each vertex has a weight, sparse fused lasso (SFL), also called the sparse graph trend filter (GTF), learns a new weight for each vertex that is (1) sparse (most vertices have weight 0), (2) close to the original weight in the l2 norm, and (3) close to its neighbors' weights in the l1 norm. This algorithm is commonly used for trend filtering (denoising). For example, an image (a grid graph) with noisy pixels can be filtered with this algorithm to produce a new image in which the noisy pixels are "smoothed out" by their neighbors. https://arxiv.org/abs/1410.7690

Summary of Results

The SFL problem is divided into two main parts: computing residual graphs from maxflow, and renormalizing the weights of the vertices. Maxflow is performed with the parallelizable lock-free push-relabel algorithm. For renormalization, each vertex has to compute which community it belongs to and normalize its weight with the other vertices in the same community. SFL iterates on the maxflow and renormalization kernels, with a global synchronization between them, until convergence. Current analysis shows that maxflow is the bottleneck of the whole workflow, with over 90% of the runtime spent in the maxflow kernels.

GPU SFL runs ~2 times slower than the CPU benchmark on the largest dataset provided. On smaller datasets, GPU SFL is much slower because there just isn't enough work to fill up a GPU and leverage the compute we have available. Analyzing the runs on the larger datasets shows that the parametric maxflow on the CPU converges much faster than our parallel push-relabel maxflow algorithm on the GPU. Investigating the parallelization of parametric maxflow is an interesting research challenge.

Summary of Gunrock Implementation

The loss function is $0.5 \cdot \sum_i (y_i' - y_i)^2 + \lambda_1 \cdot \sum_{(i,j) \in E} |y_i' - y_j'| + \lambda_2 \cdot \sum_i |y_i'|$, where $y$ is the input weight (observation) for each vertex and $y'$ (the fitted weight) is the new weight for each vertex. $\lambda_1$ and $\lambda_2$ are required parameters for processing the graph with SFL.
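
As an illustration only, the following NumPy sketch evaluates this loss; the function and array names (y, y_fitted, edges) are assumptions for this example, not identifiers from the Gunrock code.

    import numpy as np

    def sfl_loss(y, y_fitted, edges, lambda_1, lambda_2):
        # y, y_fitted: 1-D arrays of observed and fitted per-vertex weights
        # edges: (E, 2) integer array of directed edge pairs (i, j)
        fit = 0.5 * np.sum((y_fitted - y) ** 2)
        fused = lambda_1 * np.sum(np.abs(y_fitted[edges[:, 0]] - y_fitted[edges[:, 1]]))
        sparse = lambda_2 * np.sum(np.abs(y_fitted))
        return fit + fused + sparse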

The input graph is specified in two files: the first contains the original vertices' weights, and the second contains the directed graph connectivity without weights (edge pairs only). These two files and an edge_regularization_strength ($\lambda_1$) for the directed graph's edge weights are the input to preprocessing. Two extra vertices, a source and a sink, are added to the original graph; they serve as two "labels" for different segments of the graph. The result is a graph in which edges that connect to the source or sink have edge weights taken from the vertices' weights file, and all other edges are assigned an edge weight of $\lambda_1$.
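
The sketch below illustrates this source/sink augmentation under one common convention (positive vertex weights attach to the source, negative ones to the sink); the actual convention and identifiers used by generate_graph.py may differ, and the vertex ids chosen here for the source and sink are assumptions.

    import numpy as np

    def build_st_graph(vertex_weights, edges, lambda_1):
        # vertex_weights: 1-D array of original vertex weights
        # edges: iterable of directed (i, j) pairs from the connectivity file
        n = len(vertex_weights)
        source, sink = n, n + 1                     # two extra "label" vertices (assumed ids)
        cap_edges = []
        for v, w in enumerate(vertex_weights):
            if w >= 0:
                cap_edges.append((source, v, w))    # source edge carries the vertex weight
            else:
                cap_edges.append((v, sink, -w))     # sink edge carries the negated vertex weight
        for i, j in edges:
            cap_edges.append((i, j, lambda_1))      # original edges get capacity lambda_1
        return cap_edges, source, sink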

The Gunrock implementation of this application has two parts. The first part is the maxflow algorithm. We choose a lock-free push-relabel maxflow formulation, which is well-suited for parallelization on a GPU with Gunrock. Computing a valid maxflow also implies a solution to the mincut problem, which is a segmentation of the graph into two pieces where the division is across a set of edges with the minimum possible total weight. The output of this maxflow algorithm is (1) a residual graph where each edge weight is computed as residual_capacity = capacity - edge_flow, and (2) a Boolean array that marks which vertices are reachable from the source after the mincut.
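
As a rough illustration of these two outputs, the sketch below computes residual capacities and then BFS reachability from the source; the data layout (per-edge dictionaries capacity and flow, adjacency lists adj) is an assumption for this example, not Gunrock's actual representation.

    from collections import deque

    def residual_and_reachable(capacity, flow, adj, source):
        # residual_capacity = capacity - edge_flow for every directed edge (u, v)
        residual = {e: capacity[e] - flow[e] for e in capacity}
        # BFS from the source over edges that still have residual capacity
        reachable = {v: False for v in adj}
        reachable[source] = True
        frontier = deque([source])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if residual.get((u, v), 0) > 0 and not reachable[v]:
                    reachable[v] = True
                    frontier.append(v)
        return residual, reachable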

The second part is a renormalization of the residual graph and a clustering based on the reachability of each vertex. The renormalization is a process where (1) the averages of the new weights of vertices grouped together as communities are computed, and (2) each vertex's community average is subtracted from its new weight. After the renormalization is done, the renormalized residual graph is passed into maxflow again. Several iterations of maxflow followed by renormalization are needed before the new weights of the different communities converge, because vertices can be reassigned to different communities. In each SFL iteration, maxflow/mincut generates two non-overlapping subgraphs, so the large communities from the previous SFL iteration split into smaller communities. All vertices in a given community are assigned the same new weight.
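
The averaging-and-centering step maps naturally onto simple array operations; the following NumPy sketch (with illustrative names, not Gunrock identifiers) shows one way to compute it.

    import numpy as np

    def renormalize(new_weights, communities):
        # communities: per-vertex integer community labels
        sums = np.bincount(communities, weights=new_weights)
        counts = np.maximum(np.bincount(communities), 1)   # avoid dividing by zero for unused labels
        means = sums / counts                              # average new weight per community
        return new_weights - means[communities], means     # centered weights and the community averages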

The outputs are the normalized values assigned to each vertex.

Lastly, these values are passed into a soft-threshold function with $\lambda_2$ to achieve the sparse representation by dropping small absolute values. More specifically, $\lambda_2$ is subtracted from the new weight if the new weight is positive and larger than $\lambda_2$, and $\lambda_2$ is added to the new weight if it is negative and smaller than $-\lambda_2$. If the new weight is between $-\lambda_2$ and $\lambda_2$, it is set to 0.
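
This is the standard soft-threshold operator; a minimal NumPy sketch of it (illustrative only) is:

    import numpy as np

    def soft_threshold(values, lambda_2):
        # shrink toward zero by lambda_2; anything in [-lambda_2, lambda_2] becomes exactly 0
        return np.sign(values) * np.maximum(np.abs(values) - lambda_2, 0.0)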

Pseudocode for the core SFL algorithm is as follows:

Load the graph and normalize edge weights

Repeat iteration till convergence:

    # Part 1: Maxflow
    do
        lockfree_op (num-repeats, V)
        global_relabeling_heuristic // update heights of all vertices
    while no more updates

    # Lock-free Push-Relabel operator
    def lockfree_op (num-repeats, V):
        for i from 1 to num-repeats do:
            for each vertex v in V:
                if v.excess > 0:
                    u := lowest_neighbor_in_residual_network (v)
                    if u.height < v.height:
                        Push (v, u)
                    else:
                        Relabel (v, u)

    # Finding lowest neighbor in residual network phase
    def lowest_neighbor_in_residual_network (v):
        min_height := infinity
        min_u := undefined
        for each e<v, u> of v:
            if e.residual_capacity <= 0:
                continue
            if min_height > u.height:
                min_height := u.height
                min_u := u
        return min_u

    # Push phase
    def Push (e<v, u>):
        move := min(e.residual_capacity, v.excess)
        v.excess -= move
        e.flow += move
        u.excess += move
        e.reverse.flow -= move

    # Relabel phase
    def Relabel (e<v, u>):
        v.height := u.height + 1

    # Min-cut
    Run a BFS to mark which vertices are accessible from the source vertex
    in the residual graph

    # Part 2: renormalization
    # Reset the per-community arrays
    for each community comm:
        community_weights[comm] := 0
        community_sizes  [comm] := 0
        next_communities [comm] := 0

    # Accumulate the weights and count the vertices belonging to each community
    for each vertex v:
        if v is accessible from the source:
            comm := next_communities[curr_communities[v]]
            if (comm == 0) update comm
            community_sizes  [comm] ++
            community_weights[comm] += weight[source -> v]
        else:
            comm := curr_communities[v]
            community_weights[comm] -= weight[v -> sink]
            community_sizes  [comm] ++

    # Normalize community
    for each community comm:
        community_weights[comm] /= community_sizes[comm]
        community_accus [comm] += community_weights[comm]

    # Update the residual graph
    for each vertex v:
        comm = curr_communities[v]
        if (v is accessible from the source):
            weight[source->v] -= community_weights[comm]
            if (weight[source->v] < 0):
                Swap(-weight[source->v], weight[v->sink])
        else:
            weight[v->sink] += community_weights[comm]
            if (weight[v->sink] < 0):
                Swap(weight[source->v], -weight[v->sink])

    # End of Repeats

# Part 3: Sparsify community_accus by \lambda_2
for each community i:
    if (community_accus[i] < 0):
        community_accus[i] := min(community_accus[i] + \lambda_2, 0)
    else:
        community_accus[i] := max(community_accus[i] - \lambda_2, 0)

return community_accus

How To Run This Application on DARPA's DGX-1

Prereqs/input

CUDA should be installed, and $PATH and $LD_LIBRARY_PATH should be set correctly to use CUDA. The current Gunrock configuration assumes Boost (1.58.0 or 1.59.0) and METIS are installed; if not, the Makefiles need to be changed accordingly. DARPA's DGX-1 had both installed when these tests were performed.

git clone --recursive https://github.com/gunrock/gunrock -b dev-refactor
cd gunrock/tests/gtf
make clean && make

At this point, there should be an executable gtf_main_<CUDA version>_x86_64 in tests/gtf/bin.

Testing was done with https://github.com/wetliu/gunrock/tree/master (stable branch) at commit a2159ce060d4dbc89a88f8d5fb0fd0d571ba907b (Nov. 14, 2018), using CUDA 10.1 with NVIDIA driver 390.30.

HIVE Data Preparation

Prepare the data; skip this step if you are just running the sample dataset.

Refer to parse_args() in taxi_tsv_file_preprocessing.py for dataset preprocessing options.

If you don't have the e or n files:

cd gunrock/tests/gtf/_data

export TOKEN= # get this Authentication TOKEN from https://api-token.hiveprogram.com/#!/user
mkdir -p _data
wget --header "Authorization:$TOKEN" \
  https://hiveprogram.com/data/_v0/sparse_fused_lasso/taxi/taxi-small.tar.gz
tar -xzvf taxi-small.tar.gz && rm -r taxi-small.tar.gz
mv taxi-small _data/

wget --header "Authorization:$TOKEN" \
  https://hiveprogram.com/data/_v0/sparse_fused_lasso/taxi/taxi-1M.tar.gz
tar -xzvf taxi-1M.tar.gz && rm -r taxi-1M.tar.gz
mv taxi-1M _data/

python taxi_tsv_file_preprocessing.py

Two files, e and n, are generated.

Set $\lambda_1$ (see the equation above) in generate_graph.py. If you have the e and n files, generated by the above commands or copied from /raid/data/hive/gtf/{small, medium, large}, you can do the following (using /raid/data/hive/gtf/small/ as an example):

cd gunrock/tests/gtf/_data
cp /raid/data/hive/gtf/small/n .
cp /raid/data/hive/gtf/small/e .
python generate_graph.py

A file called std_added.mtx is generated for Gunrock input.

Running the application

--lambda_2 is the sparsity regularization strength

A sample command line with arguments:

./bin/test_gtf_10.0_x86_64 market ./_data/std_added.mtx --lambda_2 3

Please also note that we have set the default number of maxflow (MF) iterations to 10000. For larger graphs, a larger number may be required.

Output

The code outputs two files in the current directory: output_pr.txt (the CPU reference) and output_pr_GPU.txt (GPU SFL with the push-relabel backend). Each vertex's new weight is stored on its own line in both files. These outputs can be further processed into the resulting heatmap. The printout after running gtf_main_<CUDA version>_x86_64 includes the timing of the application.

Sample txt output:

0
0
0
0
0
0
-11.375
-0.307292
0

Sample output on console:

______CPU reference algorithm______
offset is 58522 num edges 76366
!!!!!!!!!! avg is -0.876037
Iteration 0
...
Iteration 11
-----------------------------------
Elapsed: 5500.730991 ms
in side that weird function1
offset in GPU preprocess is 58522 num edges 76366
avg in GPU preprocess is -0.876037
Using advance mode LB
Using filter mode CULL
Using advance mode LB
Using filter mode CULL
offset is 58522 num edges 76366
h_community_accus is -0.876037
______GPU SFL algorithm____
enact calling successfully!!!!!!!!!!!
-----------------------------------
Run 0, elapsed: 3341.358185 ms, #iterations = 12
transfering to host!!!: 8924
[gtf] finished.
 avg. elapsed: 3341.358185 ms
 iterations: 0
 min. elapsed: 3341.358185 ms
 max. elapsed: 3341.358185 ms
 load time: 138.952 ms
 preprocess time: 342.184000 ms
 postprocess time: 0.258923 ms
 total time: 3845.412016 ms

Performance and Analysis

We measure the runtime and the loss function $0.5 \cdot \sum_i (y_i' - y_i)^2 + \lambda_1 \cdot \sum_{(i,j) \in E} |y_i' - y_j'| + \lambda_2 \cdot \sum_i |y_i'|$, where $y$ is the input weight (observation) for each vertex and $y'$ (the fitted weight) is the new weight for each vertex.

Implementation limitations

Comparison against existing implementations

Graphtv is an official implementation of the sparse fused lasso algorithm with a parametric maxflow backend. It is a serial CPU implementation: https://www.cs.ucsb.edu/~yuxiangw/codes/gtf_code.zip. The Gunrock GPU runtime is measured within the application enactor and is reported in the application's output.

All datasets in the table below are generated from taxi-small.tar.gz with different timestamps. The $\sim 20K$-edge graph is used as a sample test, the $\sim 300K$-edge graph is a medium-sized dataset, and the largest available has $\sim 600K$ edges.

| Dataset | Time start | Time end | $\vert E \vert$ | $\vert V \vert$ | Graphtv runtime | Gunrock GPU runtime |
|---------|------------|----------|-----------------|-----------------|-----------------|---------------------|
| NY Taxi-20K | 2011-06-26 12:00:00 | 2011-06-26 14:00:00 | 20349 | 8922 | 0.11s | 4.98s |
| NY Taxi-300K | 2011-06-26 00:00:00 | 2011-06-27 00:00:00 | 293259 | 107064 | 8.71s | 143.91s |
| NY Taxi-600K | 2011-06-19 00:00:00 | 2011-06-27 00:00:00 | 588211 | 213360 | 103.62s | 211.07s |

| Dataset | Time start | Time end | $\vert E \vert$ | $\vert V \vert$ | Graphtv loss | Gunrock GPU loss |
|---------|------------|----------|-----------------|-----------------|--------------|------------------|
| NY Taxi-20K | 2011-06-26 12:00:00 | 2011-06-26 14:00:00 | 20349 | 8922 | 132789.32 | 132789.32 |
| NY Taxi-300K | 2011-06-26 00:00:00 | 2011-06-27 00:00:00 | 293259 | 107064 | 1094282.51 | 1094282.51 |
| NY Taxi-600K | 2011-06-19 00:00:00 | 2011-06-27 00:00:00 | 588211 | 213360 | 1670947.26 | 1670947.26 |

Performance limitations

We observe a slowdown when we compare Graphtv against Gunrock's current SFL implementation on the three datasets provided. On the smaller datasets, there just isn't enough work to fill up a GPU and leverage the compute we have available. Even on the largest dataset we do not see a speedup, because the superior serial algorithm in the CPU maxflow implementation converges much earlier than our parallel push-relabel maxflow algorithm on the GPU.

We performed a detailed performance analysis on the NY Taxi dataset covering 2011-06-19 00:00:00 to 2011-06-27 00:00:00.

Maxflow lessons learned

Next Steps

Maxflow optimization opportunities

Gunrock implications

SFL is the first application in Gunrock that calls another application (maxflow). Some common data pre-processing currently done on the CPU requires better API design to facilitate this usage. For example, gtf_enactor needs to call mf_problem.reset(), but because the current maxflow code preprocesses the graph on the CPU, SFL has to transfer data back and forth between the CPU and the GPU. Preprocessing the graph for maxflow on the GPU would be preferable.

Notes on multi-GPU parallelization

To parallelize the push-relabel algorithm across multiple GPUs, all arrays related to the graph have to be stored on each GPU. Moreover, the GPUs have to update their adjacent neighbors' data. Because the push-relabel algorithm needs to store at least three arrays of size $O(|E|)$ and three arrays of size $O(|V|)$, communicating so much data efficiently between GPUs is challenging.

The SFL renormalization should be straightforward to parallelize across different GPUs, because it consists of array operations only. However, extra data transfer is necessary if the graph is not replicated across the GPUs.

Notes on dynamic graphs

Push-relabel is not directly related to dynamic graphs, but it should be able to run on a dynamic graph, provided the source and the sink are given at the beginning of the algorithm and the nodes and edges can be accessed in the same way. Edge capacities from the previous graph can be used as a good starting point if the edge and node ids are consistent and the graph has not changed dramatically.

However, SFL would be a significant challenge with dynamic graphs, because the topology of the graph would change. The residual graph involves swapping edge values (see the pseudocode above), and we would need to know the characteristics of the new graph in order to allocate enough memory for the new edges and vertices.

Notes on larger datasets

SFL renormalization can be done without keeping its temporary arrays on the same GPU, but extra communication costs are incurred if these arrays reside in different GPUs' global memories. Unified memory can also be used to handle larger datasets by oversubscribing GPU memory and paying the cost of CPU-to-GPU transfers. Gunrock needs no specific new support for SFL renormalization on larger datasets.

Research opportunities

Prof. Sharpnack indicates that this implementation could be generalized to multi-graph fused lasso. The idea is to set multiple edge values for the edges connecting to source/sink, while keeping the graph topology and edge values $(\lambda_1)$ for the edges in the original graph (excluding source and sink) the same.