Graph Coloring as a Gunrock Operator
A graph $G=(V,E)$ is comprised of a set of vertices $V$ together with a set of edges $E$, where $E \subseteq V \times V$. Graph coloring $C: V \rightarrow N$ is a function that assigns a color to each vertex that satisfies $C(v) \neq C(u)$ $\forall (v,u) \in E$. In other words, graph coloring assigns colors to vertices such that no vertex has the same color as a neighboring vertex.
Two characteristics of graph coloring that makes it an interesting operator are:
 Within a color, none of the vertices share an edge, and
 Each color represents a partitioning of vertices (or edges) of a whole graph.
(1) Asynchrony
Asynchrony is the key benefit of using graph coloring as an operator, as explained above, within a single color, none of the vertices share an edge with each other. "Frog: Asynchronous Graph Processing on GPU with Hybrid Coloring Model" does a great job showing that exploring a hybridsynchronousasynchronous model can actually result is some benefit for primitives such as SSSP, PR, BFS and CC.
Within a singleGPU, this will benefit by avoiding blockwide communication, for a multiGPU environment, we will avoid communicating across GPUs, and within a multinode, we will attempt to minimize communication across nodes. The reason why this is a hybrid asynchronous model is because there is still a synchronization step (called color merging) at the very end where you finally consider other groups of colors which do have edges to each other. We would generally want to avoid this last step as much as possible, and choose applications (primitives) that coloring makes sense to do as an operator.
Yuechao and I, Muhammad, have a workshop paper called "Synchronous vs. Asynchronous GPU Graph Frameworks" analyzing Groute (asynchronous processing model) and Gunrock (bulksynchronous model), and this approach will be a hybridsynchronous model, and will be worth comparing against what we have already analyzed in the past paper.
Adding this new synchronization model also shows the programmability of Gunrock as a graph processing model, and how it can be further abstracted away to a fully asynchronous model, but still have capabilities or switches to go back to BSP where it makes sense. This work will also be a stepping stone to comparing to what Yuxin is working on, which is asynchronous graph algorithms.
(2) MultiFrontiers (Colored Frontiers)
Currently within Gunrock, we mainly use a single input frontier and a single output frontier to express all our primitives and applications. A frontier in Gunrockworld is a group of vertices or edges, in our case we would utilize the graph coloring inherent property of partitioning the graph into colors to generate and build on a singlefrontier approach already present in Gunrock to do multifrontiers (colored frontiers). In an initial approach, we can take a single frontier as an input, and generate multifrontiers where each color represents a frontier.
In the Frog paper, each color (or frontier in Gunrock) is a separate kernel launch, and for very small frontiers they either rely on the CPU to do processing because the overhead of launching the kernel is more than the amount of parallel work available within it, or they just rely on GPU atomics (which according to them are not costly for a few operations  I don't agree with this btw). In Gunrock's case, we won't be launching a new kernel for every frontier that is generated using graph coloring, instead we can rely on multifrontier to be launched together in different schemes. These schemes will inherently answer the loadimbalance problem that is presented after we perform graph coloring:
 Threads/Warps/CTAs: This way we can continue to utilize asynchrony, without worrying about the corner case where we have frontiers with too few nodes that we need to either do them atomically or on the CPU.
if (colored_frontier.num_nodes < X)
launch work on Threads
if (colored_frontier.num_nodes > X && colored_frontier.num_nodes < Y)
launch work on Warps
if (colored_frontier.num_nodes > Y && colored_frontier.num_nodes < Z)
launch work on Blocks

Other gunrock loadbalancing strategies (LB and ALL_EDGES) are still applicable to multifrontier, by just considering each frontier the "subgroup" of a large frontier: LB

Loadbalance and Color: We can choose coloring heuristics such that the resulting frontiers we generate are already balanced. One of the example is considering CUDA today, each CUDA warp has 32 threads, and in the most simplest case we would like to assign each node to a thread within a warp with no communication to any other warps or threads outside of its own (asynchrony within a frontier), we can do coloring such that if a generated color contains 32 nodes, the next node tobeplaced within the same color will instead generate a new color for itself, therefore with a guarantee that each frontier must contain 32 nodes, along with satisfying the asynchrony condition. This can later be expanded to Cooperative Group (or symphonic CUDA), where a warp is not specifically defined to be 32 threads, it can be any arbitrary number. This is one of the many things we can try when coloring with quality instead of minimizing the color counts. This approach also removes the need to do loadbalancing separately, the generated colors will be loadbalanced.
API Design (IMPORTANT)
Have to spend some time thinking about this, in my pov, this will be the hardest thing to do right. We need to be able to capture existing operators for a fuse operation, along with support ways to express algorithms asynchronously (due to coloring).