# Application Classification

The application classification (AC) workflow is an implementation of probabilistic graph matching via belief propagation. The workflow takes two node- and edge-attributed graphs as input -- a data graph `G = (U_G, E_G)` and a pattern graph `P = (U_P, E_P)`. The goal is to find a subgraph `S` of `G` such that the dissimilarity between the node/edge features of `P` and `S` is minimized. The matching is optimized via loopy belief propagation, which consists of iteratively passing messages between nodes and then updating beliefs about the optimal match.

## Summary of Results

Application classification involves a number of dense-matrix operations, so it was not an obvious candidate for implementation in Gunrock. However, our GPU implementation, written in CUDA using the CUB library, shows substantial speedups (10-100x) over the multi-threaded OpenMP implementation.

That said, there are two neighbor-reduce operations that may benefit from the kind of load balancing implemented in Gunrock. Thus, it would be useful to either expose lightweight wrappers of high-performance Gunrock primitives for easy integration into outside projects *or* come up with a workflow inside of Gunrock that makes programming applications with lots of non-graph operations straightforward.

## Summary of CUDA Implementation

We implement application classification from scratch in CUDA using CUB rather than Gunrock. Application classification requires the following kernels:

- compute pairwise distance between rows of dense matrices
- normalize the columns of a dense matrix
- compute maximum of columns in a dense matrix
- gather/scatter rows of dense matrix
- sum/max neighborhood reduce on the data/pattern graphs

Apart from the last one, these kernels do not obviously map to Gunrock's advance/filter model.
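The neighborhood reduce (the last kernel above) is essentially a segmented reduction over per-edge values grouped by node. As a point of reference, here is a toy NumPy sketch of what it computes; the CSR offsets and edge values below are hypothetical, and this mirrors the operation rather than the actual CUDA kernel:

```python
import numpy as np

# Hypothetical CSR-style layout: node i owns edge rows offsets[i]:offsets[i+1]
# (the last segment runs to the end of the array).
offsets = np.array([0, 2, 5])  # segment starts for 3 nodes
edge_vals = np.array([[1., 0.],
                      [3., 2.],
                      [0., 5.],
                      [4., 1.],
                      [2., 2.],
                      [7., 6.]])  # one row per edge, one column per pattern edge

# Per-node max over each node's incident edges, columnwise.
# (Note: reduceat assumes non-empty segments, which holds here.)
node_max = np.maximum.reduceat(edge_vals, offsets, axis=0)
print(node_max.tolist())  # → [[3.0, 2.0], [4.0, 5.0], [7.0, 6.0]]
```

This is the shape of computation that CUB's segmented reduce (and Gunrock's load-balanced equivalent) handles on the GPU.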

Pseudocode for the core belief propagation algorithm is as follows:

```
for iteration in range(num_iterations):
    # Update edge messages
    edge_msg_f = diff_r[data_edges.srcs] - edge_distances # data.num_edges x pattern.num_edges
    edge_msg_r = diff_f[data_edges.srcs] - edge_distances # data.num_edges x pattern.num_edges

    # Normalize edge messages
    edge_msg_f = normprob(edge_msg_f)
    edge_msg_r = normprob(edge_msg_r)

    # Max of forward/backward messages
    f_null = columnwise_max(diff_f) # 1 x pattern.num_edges
    r_null = columnwise_max(diff_r) # 1 x pattern.num_edges
    for edge_idx, (src, dst) in enumerate(data_edges):
        max_r[src] = max(max_r[src], edge_msg_r[edge_idx], f_null)
        max_f[dst] = max(max_f[dst], edge_msg_f[edge_idx], r_null)

    # Update beliefs
    belief = -node_distances # data.num_nodes x pattern.num_nodes
    for edge_idx, (src, dst) in enumerate(pattern_edges):
        belief[:, dst] += max_f[:, edge_idx]
        belief[:, src] += max_r[:, edge_idx]

    # Normalize beliefs
    belief = normprob(belief)
    diff_f = belief[:, pattern_edges.dsts] - max_f # data.num_nodes x pattern.num_edges
    diff_r = belief[:, pattern_edges.srcs] - max_r # data.num_nodes x pattern.num_edges
```

where `normprob` is the column-wise log-softmax function. That is, since `belief` is `data.num_nodes x patt.num_nodes`, each node in the `pattern` graph gets assigned a probability distribution over nodes in the `data` graph.
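To make the pseudocode concrete, here is a hypothetical NumPy sketch of the loop on a tiny random problem, with `normprob` implemented as the column-wise log-softmax just described. The graph, feature distances, and iteration count are all made up; this mirrors the pseudocode (including the quirk that both edge messages index by `srcs`), not the actual CUDA kernels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: 8 data nodes/edges, 3 pattern nodes/edges.
data_src = np.array([0, 1, 2, 3, 4, 5, 6, 7])
data_dst = np.array([1, 2, 3, 4, 5, 6, 7, 0])
patt_src = np.array([0, 1, 2])
patt_dst = np.array([1, 2, 0])
V, PV, PE = 8, 3, 3

node_distances = rng.random((V, PV))              # data-node vs. pattern-node distances
edge_distances = rng.random((len(data_src), PE))  # data-edge vs. pattern-edge distances

def normprob(x):
    # column-wise log-softmax, shifted for numerical stability
    s = x - x.max(axis=0, keepdims=True)
    return s - np.log(np.exp(s).sum(axis=0, keepdims=True))

diff_f = normprob(-rng.random((V, PE)))
diff_r = normprob(-rng.random((V, PE)))

for _ in range(10):
    # update and normalize edge messages (both indexed by srcs, as in the pseudocode)
    edge_msg_f = normprob(diff_r[data_src] - edge_distances)
    edge_msg_r = normprob(diff_f[data_src] - edge_distances)
    # scatter-max of edge messages into per-node maxima, floored at the "null" max
    f_null = diff_f.max(axis=0, keepdims=True)
    r_null = diff_r.max(axis=0, keepdims=True)
    max_r = np.tile(f_null, (V, 1))
    max_f = np.tile(r_null, (V, 1))
    np.maximum.at(max_r, data_src, edge_msg_r)
    np.maximum.at(max_f, data_dst, edge_msg_f)
    # update and normalize beliefs
    belief = -node_distances
    for e in range(PE):
        belief[:, patt_dst[e]] += max_f[:, e]
        belief[:, patt_src[e]] += max_r[:, e]
    belief = normprob(belief)
    diff_f = belief[:, patt_dst] - max_f
    diff_r = belief[:, patt_src] - max_r

# each column of exp(belief) is a probability distribution over data nodes
print(np.allclose(np.exp(belief).sum(axis=0), 1.0))  # True
```

Each step here corresponds to one of the CUDA kernels listed above (pairwise distances, column normalization, scatter-max, neighborhood reduction).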

Our implementation is based on the PNNL implementation rather than the distributed GraphX reference implementation. On all of the graphs we've tested, the output of our implementation exactly matches the output of the PNNL code. According to PNNL, their implementation may give different results than the HIVE reference implementation (due to e.g., different normalization schemes).

## How To Run This Application on DARPA's DGX-1

### Prereqs/input

```
git clone --recursive \
https://github.com/owensgroup/application_classification
cd application_classification
make clean
make
```

### Running the application

#### Example Command

```
mkdir -p results
./main \
./data/georgiyData.Vertex.csv \
./data/georgiyData.Edges.csv \
./data/georgiyPattern.Vertex.csv \
./data/georgiyPattern.Edges.csv > results/georgiy
```

#### Example output

```
$ head -n 5 results/georgiy
-8.985810e+00
-3.859019e+01
-4.470994e+01
-1.673157e+01
-1.730952e+01
$ tail -n 5 results/georgiy
-2.886186e+01
-1.499165e+01
-4.034595e+01
-1.060496e+01
-7.684015e+00
$ cat results/georgiy | openssl md5
(stdin)= bd57a5126d5f943ad5c15408d410790d
```

### Expected Output

The output of the program is a `data.num_nodes x pattern.num_nodes` matrix, where entry `(i, j)` is the log-probability of assigning pattern node `j` to data node `i`; each column is thus a log-probability distribution over data nodes. This matrix is printed in row-major order as a file with `data.num_nodes * pattern.num_nodes` lines.
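Because the printing is row-major, line `k` of the output (0-indexed) maps back to a matrix entry with simple integer arithmetic; a small sketch (the helper name is ours):

```python
def line_to_entry(k, pattern_num_nodes):
    # row-major: line k holds matrix entry (i, j) = (k // cols, k % cols)
    return divmod(k, pattern_num_nodes)

# e.g. for georgiyData (10 pattern nodes), line 25 holds the entry for
# data node 2, pattern node 5
print(line_to_entry(25, 10))  # → (2, 5)
```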

In python, you could inspect the output like:

```
import numpy as np
# load results
x = open('./results/georgiy').read().splitlines()
x = [float(xx) for xx in x]
# reshape to (data.num_nodes, pattern.num_nodes)
x = np.reshape(x, (1000, 10))
# exponentiate
x = np.exp(x)
# columns should sum to 1
x.sum(axis=0)
# array([1.0000001 , 1.00000001, 1.00000004, 0.99999999, 0.99999988,
# 0.99999994, 0.99999985, 1.00000001, 1. , 0.99999988])
```

As previously mentioned, our output exactly matches the output of the PNNL implementation on all of the graphs we have tested.

## Performance and Analysis

We measure performance by runtime of the algorithm given

- a node- and edge-attributed data graph
- a node- and edge-attributed pattern graph

Though AC computes probabilities of matches between data and pattern nodes, this is a deterministic and parameterless algorithm, so we do not measure performance in terms of accuracy. Further, runtime does not depend on the *values* of the node/edge attributes, so we can reasonably run experiments using randomly generated values.

### Implementation limitations

As currently implemented, the algorithm allocates a number of arrays of floats:

- 3 arrays of size `data.num_edges * pattern.num_edges`
- 3 arrays of size `data.num_nodes * pattern.num_edges`
- 2 arrays of size `data.num_nodes * pattern.num_nodes`

The combined memory usage of these arrays will be the primary bottleneck for scaling. For example, if

- data.num_nodes = 6M
- data.num_edges = 18M
- pattern.num_nodes = 10
- pattern.num_edges = 20

then the memory footprint will be on the order of 13 GB. (*Note:* It's likely that reordering of operations could reduce the number of intermediate data structures required.)
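A quick back-of-the-envelope check of that figure, using the sizes above and assuming 8-byte values (with 4-byte floats the footprint would be roughly half):

```python
# Sizes from the example above.
data_nodes, data_edges = 6_000_000, 18_000_000
patt_nodes, patt_edges = 10, 20

n_values = (3 * data_edges * patt_edges    # edge-message arrays
            + 3 * data_nodes * patt_edges  # node-message arrays
            + 2 * data_nodes * patt_nodes) # belief arrays
print(n_values * 8 / 1e9)  # → 12.48 (GB), i.e. on the order of 13 GB
```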

AC is only applicable to graphs with node and edge features. However, the runtime of the algorithm depends only on the *dimension* of these features, not on their values. Thus, for benchmarking purposes, we can pick a `node_feature_dim` and `edge_feature_dim` and use randomly generated features. This is helpful because we do not have a good real-world dataset for benchmarking.

The algorithm requires both a data graph and a smaller pattern graph. Both the size and topology of the pattern graph may affect the runtime of the algorithm. However, we do not have examples of actual pattern graphs apart from `georgiyPattern`, so we are forced to generate them synthetically. Specifically, we do this by sampling a node `q`, sampling up to `max_pattern_nodes` of its neighbors `u` to form a set `Q`, and using the subgraph induced by `Q + q` as the pattern graph.
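A hypothetical sketch of this sampling procedure (the adjacency structure, helper name, and toy graph are illustrative only):

```python
import random

def sample_pattern(adj, max_pattern_nodes, seed=0):
    """Sample a seed node q, up to max_pattern_nodes of its neighbors,
    and return the induced subgraph on Q + q."""
    rng = random.Random(seed)
    q = rng.choice(sorted(adj))   # sample a seed node q
    nbrs = sorted(adj[q])
    Q = set(rng.sample(nbrs, min(max_pattern_nodes, len(nbrs))))
    keep = Q | {q}
    # induced subgraph: keep every edge with both endpoints in Q + q
    edges = [(u, v) for u in sorted(keep) for v in sorted(adj[u]) if v in keep]
    return keep, edges

# toy undirected graph as an adjacency dict
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
nodes, edges = sample_pattern(adj, max_pattern_nodes=2)
print(sorted(nodes), edges)
```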

We suspect the PNNL implementation may have a couple of minor bugs:

- The `CV` matrix is log-softmax normalized in the initialization phase, updated with values from `FMax` and `RMax`, and then log-softmax normalized again. This kind of double normalization seems strange, and is perhaps incorrect.
- In our `RepeatColumnsByDataEdges`, `FE` and `RE` both use the `src` of an edge, which conflicts with the algorithm description in the paper. One of these should probably be using the `src` and the other the `dst`.

These bugs are easy to fix, but we left them "as is" because a) we have no easy way to verify which is correct and b) we'd like consistency of results w/ the PNNL implementation.

### Comparison against existing implementations

*The original reference implementation consisted of a large amount of distributed Spark and GraphX Scala code. For ease of implementation, and to make sure performance comparisons are meaningful, we instead based our implementation on the PNNL ApplicationClassification OpenMP code.*

Overall, the CUDA implementation is between 10x and 100x faster than the best run of the PNNL OpenMP code.

We compare our CUDA implementation to PNNL's C++ OpenMP implementation on several different graphs:

###### georgiyData

- a small graph w/ real (source unknown) node/edge features (included in PNNL repo)
`|U|=1000 |E|=20135 node_feat_dim=13 edge_feat_dim=17`

###### rmat18

- a scale 18 RMAT synthetic graph w/ random node/edge features (included in PNNL repo)
`|U|=262145 |E|=2008660 node_feat_dim=13 edge_feat_dim=17`

###### JohnsHopkins_random

- Social network graph w/ random node/edge features
`|U|=5157 |E|=186572 node_feat_dim=12 edge_feat_dim=16`

For the first two, we use the `georgiyPattern` pattern graph included in the PNNL repo. For the latter, we generate a pattern graph using the neighborhood induction mentioned above.

#### georgiyData

implementation | threads | elapsed_ms |
---|---|---|
PNNL OpenMP | 1 | 1635.774136 |
PNNL OpenMP | 2 | 1405.072927 |
PNNL OpenMP | 4 | 1005.914927 |
PNNL OpenMP | 8 | 831.342936 |
PNNL OpenMP | 16 | 793.069839 |
PNNL OpenMP | 32 | 546.305180 |
PNNL OpenMP | 64 | 706.761122 |
Our CUDA | 1xP100 | 42.533 |

**Takeaway:** Our CUDA implementation is approximately 12.8x faster than the fastest OpenMP run (32 threads). However, this problem is quite small and both implementations run in under 1 second.

#### rmat18

implementation | threads | elapsed_ms |
---|---|---|
PNNL OpenMP | 1 | 113337.949038 |
PNNL OpenMP | 2 | 142036.607981 |
PNNL OpenMP | 4 | 109564.634800 |
PNNL OpenMP | 8 | 95680.689096 |
PNNL OpenMP | 16 | 87083.579063 |
PNNL OpenMP | 32 | 88772.798061 |
PNNL OpenMP | 64 | 82495.028973 |
Our CUDA | 1xP100 | 827.573 |

**Takeaway:** Our CUDA implementation is approximately 99x faster than the fastest PNNL run (64 threads). The absolute magnitude of the differences is much more substantial here -- the CUDA implementation runs in < 1 second while the OpenMP version runs in \~ 1.5 minutes.

#### JohnsHopkins_random

implementation | threads | elapsed_ms |
---|---|---|
PNNL OpenMP | 1 | 71190.566063 |
PNNL OpenMP | 2 | 44293.390989 |
PNNL OpenMP | 4 | 31736.392021 |
PNNL OpenMP | 8 | 24292.662144 |
PNNL OpenMP | 16 | 21556.239128 |
PNNL OpenMP | 32 | 17082.650900 |
PNNL OpenMP | 64 | 19473.608017 |
Our CUDA | 1xP100 | 450.897 |

**Takeaway:** Our CUDA implementation is approximately 37x faster than the fastest PNNL run (32 threads). Again, the absolute difference in runtimes is more substantial, with our code running in < 1 second vs. \~ 20 seconds.

### Performance limitations

Profiling (on the RMAT graph) reveals that the distribution of runtime over kernels is flatter in AC than for many applications -- often, a single kernel will account for > 80% of runtime, but here the most expensive kernel accounts for only 12.8% of compute time.

- 12.8% of time spent in `__reorderColumns`, a gather/scatter operation (memory bandwidth = 281 GB/s)
- 10.5% of time spent in `__transpose`
- 12.1% of time spent in `__rowSubLog`, an edgewise arithmetic operation
- 9.5% of time spent computing the `data.num_edges x pattern.num_edges` similarity matrix
- 8.1% of time spent doing neighborhood reductions (293 GB/s)

All of these are memory-bound operations.

As mentioned above, the memory use of the app could be reduced (if need be) by reducing the number of intermediate data structures. This would come at the cost of increased (re)computation time.

## Next Steps

### Alternate approaches + Gunrock implications

In the CUDA implementation, there are a number of places where we could take advantage of multiple streams to reduce runtimes. For example, in the initialization phase, we compute the pairwise distance between a) data/pattern node features and b) data/pattern edge features. These operations are completely independent of one another, and so could happen asynchronously on different streams.

The application was implemented outside of the Gunrock framework because it had a large number of (dense matrix) operations that are not explicitly supported by Gunrock, and a relatively small number of kernels that map well to Gunrock's advance/filter paradigm. However, the code uses CUB's `DeviceSegmentedReduce` a number of times -- Gunrock recently added a similar operator that is load-balanced for better performance. In the future, it would be worthwhile to see what kind of speedup we could get from the Gunrock version, which should be roughly a drop-in replacement.

### Notes on multi-GPU parallelization

Most of the kernels are either row or column operations (reductions) over dense matrices, and thus relatively easy to partition across multiple GPUs. They would either be embarrassingly parallel or require a per-GPU reduction followed by a reduction across GPUs. Replacing CUB's `DeviceSegmentedReduce` with Gunrock's implementation would give us multi-GPU support for the remaining kernel.

Alternatively, depending on the topology of the graph, we may be able to partition the data graph so that we can duplicate the pattern graph across nodes and run an independent instance of application classification on each partition. The partition would need to be constructed in a way that ensures that every subgraph is intact on *some* GPU, which implies some partial duplication of the data graph. If the data graph has a large diameter, or the pattern graph has a small diameter, this may be possible without excessive duplication. If the data graph has a small diameter, we may still be able to partition the graph by e.g. removing edges that are particularly dissimilar from edges in the pattern graph. This kind of approach is clearly very application specific, and may not be possible at all in some cases.

### Notes on dynamic graphs

In practice, it's likely that practitioners would like to run application classification on a dynamic graph (e.g., the HIVE use case was for detecting certain suspicious patterns of communication in a cybersecurity graph). However, it is not obvious how the current algorithm would be applied in a streaming fashion without relatively major modifications. It's more likely that the current AC algorithm would be applied to a dynamic graph via some kind of sliding window.

### Notes on larger datasets

We may be able to use a partitioning scheme like the one described in the multi-GPU section above to handle data graphs that are larger than GPU memory.

### Notes on other pieces of this workload

The documentation on the wiki includes discussion of the various featurization methods used to produce the node/edge attributes. These are beyond the scope of this work, but do include things such as computing degree, number of triangles, betweenness centrality, etc. If we wanted to build a high-performance end-to-end application classification system, we would want to implement some of these featurization methods in Gunrock as well.