Binary Decision Diagrams (BDDs) are commonly used in industry for model checking, such as in computer-aided design. We will implement a parallel BDD library that is optimized for many-core machines, such as the Xeon Phi.
A BDD is a data structure used to represent boolean functions in a graph structure. BDD's can be used to model state transformations and evaluate the validity of systems. For example, one use case for BDD's is verifying the correctness of concurrent programs to ensure that multiple threads cannot access a critical section at the same time. In general, evaluation for model checking involves taking an initial BDD and a set of transition rules and computing the "reachibility" of the states in an iterative fashion. Though there are existing existing packages that perform BDD evaluation, none of them utilize modern multi-core processors like the Xeon Phi. In order to take advantage of multi-core processors, we need to be able to represent a BDD in a structure that multiple processors can access and modify at one. In addition, we need to memoize the computation of different parts of the BDD to prevent duplicate work. These two goals can be met with a lockfree global hash, which reduces synchronization and communication costs compared to a lock-based data structure.
One of the big conceptual challenges for this problem is that neither of us have worked with BDD's before. To fully understand the project, we will first need to explore the literature covering BDD's and the algorithms that deal with evaluating them. Since there are no publicly available parallel BDD evaluators yet, we will have to base our code on research papers which have not actually implemented the any realistic system in practice. Another big challenge is getting a fast and correct lockfree hash table. Though there are numerous papers and existing lockfree hash tables, we will likely have to implement our own version to support memory management, resizing, and probing in the way that we need. Specifically, we will need to write our own garbage collection to delete unused BDD nodes to keep the memory footprint at a reasonable size.
We will utilize the Xeon Phi’s on the latedays cluster. We will be starting from scratch, though there is a similar sequential library that is commonly used, BuDDy. Our main literary resource will be Tom van Dijk’s master’s thesis, which presents a survey over the field of BDDs and provides proofs of several BDD operations. We would greatly benefit from access to a Knight’s Landing Xeon Phi, but Professor Kayvon has told us this is impossible.
Goals and Deliverables
Our project will have two main deliverables which are largely independent:
- First, we will have a standalone lockfree hash table implementation that supports memory management and resizing. This hash table will be benchmarked against existing locking and lockfree hash tables to ensure that our implementation is competitive with the state-of-the-art systems that exist.
- Second, we will have an system that takes an initial state (modeled as a BDD) and a transition relation (as a BDD) as input and evaluates the reachibility of the state. Users will be able to interface with our system through an API that allows them to manually construct the BDD's that they wish to evaluate. In order to demo this part of the project, we will write a few example programs that will use BDD's to check the correctness of some system (for example: deadlock checking, or n-Queens problem).
To properly demonstrate the parallelism and scalability of our library, which we intend to be able to handle BDDs with millions of nodes, we need to use hardware with many cores. The Xeon Phi provides a machine with 60 cores, along with a simple target to optimize for. This library will be implemented in C++14, as this is a reasonable language that people seem to like, and will be targeted at all standards-compliant operating systems, with preference given to Linux if we run into issues.
Week 1: Implement a sequential implementation of BDDs and ensure correctness.
Week 2: Implement a parallel hash table for memoization and reduction purposes.
Week 3: Modify the sequential implementation to be parallel. Debug until correct.
Week 4: Performance tweaking.
Week 5: Creation of cool looking test cases/demo.
Work So Far
So far, we have created a detailed specification for our project including the user-facing API and the internal structures we will be using. We have also mostly completed a sequential implementation of the BDD library, which allows the user to construct BDD's using logical operators and do simple analysis like "one SAT" (find one satisfying assignment for the BDD) and "count SAT" (find number of satisfying assignments for the BDD). Our BDD's are maintained in a canonical form in order to save memory.
We also gave considerable thought into designing the parallel hash table in order to implement the unique BDD table and the cache. This task turned out to be much more difficult than we initially expected for a few reasons:
- The hardware we had access to did not have 128-bit compare-and-swap, which makes it significantly harder to implement a lockfree hash table since we need to do compare-and-swaps on a pointer plus a counter.
- We realized that we would have to implement our own garbage collector over the cache, which means we would need to maintain reference counters in order to keep the memory footprint manageable.
We also spent a good amount of time trying to implement a fork-join parallelism model to handle work distribution. However, this turned out to be a dead-end because implementing such a system is significantly out-of-scope for the difficulty of a 418 project, especially when combined with the rest of the project. We ended up deciding to use the Cilk library instead.
- Verify the correctness of the sequential algorithm by implementing n-Queens (Chris)
- Finish work on the sequential RelProd algorithm, which is used for model checking (Chris)
- Finish the lockfree hash table for the cache and uniques table (Matt) 4/25:
- Finish the reachability algorithm for model checking (Chris)
- Finish cache and uniques table by adapting the lockfree hash table to the needs of the library (Matt) 4/29:
- Write some test cases to verify the correctness of the model checker (Chris)
- Finish adding fork-join parallelism via Cilk (Matt) 5/2:
- Benchmark against BuDDy and test the library to ensure correctness, debug as necessary (both)
- Gather data about the parallelism of our system, including the speedup and memory footprint (both) 5/6:
- Final fine tuning (both)
- Finish creating a final presentation for our project (both)
- Begin practicing for the presentation (both) 5/9:
- Finish final writeup (both)
Though we have given it much thought and done background research to ensure that it is reasonable, we have yet to actually test the code on a Xeon Phi using Cilk. Our hope is that Cilk will be enough for our fork-join needs though we a bit concerned that we will hit unforeseen roadblocks while actually implementing it in the Phi's. Our backup plan is to run the code on the GHC machines, which will give some benchmarks, though we may not be able to observe as large of a speedup.
Another big concern is the fact that we are somewhat blindly implementing BDD algorithms according to a few research papers that we have read. So far, things have been going reasonably well, but if we eventually get a correctness bug, it will be incredibly difficult to debug since the algorithms presented in the papers are quite complex and presented in a simplistic fashion. We will try to debug this by manually inspecting the BDD's, though this becomes unfeasible when we try to scale up. This may require help from Professor Bryant if we cannot solve the issue.
For our final presentation we hope to show:
- Implementing a simple BDD with our library to show how easy it is to use.
- Performance benchmarks for creating and operating on large BDD's compared against existing sequential libraries.
- STRETCH: Demonstration of a model checker for verifying the correctness of a piece of parallel code (i.e. deadlock detection).