Setting Options of the Compilation Backend
We explore the available options that can be specified when compiling a symbolic circuit. See the notebook on learning a probabilistic circuit for more details about symbolic circuit representations and their compilation. Currently, symbolic circuits can only be compiled using a PyTorch 2+ backend, which allows you to specify a few options, such as the semiring that defines how to evaluate sums and products and a couple of flags related to optimizations. Future versions of cirkit may include compilation backends other than PyTorch, each with its own set of features and compilation options. However, the philosophy of cirkit is to abstract away the design of circuits and their operators from the underlying implementation and deep learning library dependencies. This will foster opportunities arising from connecting different platforms and compiler toolchains, without affecting the rest of the library.
We start by instantiating a symbolic circuit for image data, as shown in the following code. Note that this is completely disentangled from the compilation step and the compilation options we explore next.
from cirkit.templates import data_modalities, utils
symbolic_circuit = data_modalities.image_data(
    (1, 28, 28),                # The shape of the image, i.e., (num_channels, image_height, image_width)
    region_graph='quad-graph',  # Select the structure of the circuit to follow the QuadGraph region graph
    input_layer='categorical',  # Use Categorical distributions for the pixel values (0-255) as input layers
    num_input_units=64,         # Each input layer consists of 64 Categorical input units
    sum_product_layer='tucker', # Use Tucker sum-product layers, i.e., alternate dense sum layers and kronecker product layers
    num_sum_units=64,           # Each dense sum layer consists of 64 sum units
    sum_weight_param=utils.Parameterization(
        activation='softmax',   # Parameterize the sum weights by using a softmax activation
        initialization='normal' # Initialize the sum weights by sampling from a standard normal distribution
    )
)
The Pipeline Context object
The most important object we introduce in this notebook is the pipeline context, which allows you to specify the compilation backend, as well as compilation options. Since we will use the PyTorch backend, we first set some random seeds and the device to use.
# Set random seeds and the torch device
import random
import numpy as np
import torch
# Set some seeds
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)
# Set the torch device to use
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # Fall back to the CPU if no GPU is available
In the next code snippet, we show how to instantiate a pipeline context using the PyTorch backend.
from cirkit.pipeline import PipelineContext
ctx = PipelineContext(
    backend='torch',  # Use the PyTorch backend with default compilation flags
)
By using this pipeline context, we can compile symbolic circuits as shown in the following code.
circuit = ctx.compile(symbolic_circuit)
An alternative way to compile circuits using a pipeline context is by combining the with statement and the compile function, as shown below.
from cirkit.pipeline import compile
with ctx:
    circuit = compile(symbolic_circuit)
    # Many circuits can possibly be compiled using the same pipeline context
    ...
The PyTorch backend allows you to specify three compilation options: (1) the semiring that specifies how to evaluate sum and product layers, (2) whether to fold the circuit computational graph so as to better exploit parallel architectures like GPUs, and (3) whether to optimize the layers and the parameters of each layer by enabling a number of optimization rules. Below, we discuss each of these compilation options.
Choosing a Semiring
By default, the semiring used is the usual one defined over the reals (called sum-product), i.e., the semiring $(\mathbb{R},+,\times)$, where $\mathbb{R}$ is the field of real numbers, and $+$ and $\times$ are the usual sum and products over reals. Another popular semiring is the log-sum-exp and sum semiring (called lse-sum), which ensures numerical stability by performing computations "in log-space". In fact, the lse-sum semiring is defined as $(\mathbb{R},\oplus,\otimes)$, where $\oplus$ is the log-sum-exp operation and $\otimes$ is the sum. By specifying lse-sum as semiring, sums compute log-sum-exp operations, while products compute sums, hence avoiding numerical issues such as underflows. A third available semiring is the complex-lse-sum semiring, which extends the lse-sum semiring to the field of complex numbers $(\mathbb{C},\oplus,\otimes)$, by making use of the complex extensions of logarithms and exponentials. This semiring is particularly useful to ensure numerical stability in the case of circuits with negative parameters.
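To make the difference between the two semirings concrete, the following minimal sketch uses only the Python standard library and does not depend on cirkit. The helper `lse` and the toy values (784 factors equal to 1/256, mixture weights 0.3 and 0.7) are assumptions chosen for illustration. It shows that a product of many small probabilities underflows in the sum-product semiring, while the same quantity stays representable in the lse-sum semiring.

```python
import math

def lse(log_values):
    """Log-sum-exp: the 'sum' of the lse-sum semiring, computed stably."""
    m = max(log_values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in log_values))

# Toy product of 784 factors, each equal to 1/256 (hypothetical values that
# mimic a uniform distribution over the pixels of a 28x28 image).
num_factors, p = 784, 1.0 / 256.0

# Sum-product semiring: multiplying the probabilities underflows to zero.
prob = 1.0
for _ in range(num_factors):
    prob *= p

# lse-sum semiring: the product becomes a sum of log-probabilities.
log_prob = sum(math.log(p) for _ in range(num_factors))

# A sum unit mixing two such products with weights 0.3 and 0.7:
# in log-space the weighted sum becomes a log-sum-exp.
log_mixture = lse([math.log(0.3) + log_prob, math.log(0.7) + log_prob])

print(prob)         # the probability is no longer representable
print(log_prob)     # a finite, large negative number
print(log_mixture)  # approximately equal to log_prob, as the weights sum to 1
```

The subtraction of the maximum inside `lse` is what keeps the exponentials in a safe numerical range.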
In the following code, we instantiate a pipeline context by specifying the lse-sum semiring.
ctx = PipelineContext(
    backend='torch',     # Use the PyTorch backend
    # Specify the backend compilation flags next
    # ---- Specify how to evaluate sum and product layers ---- #
    semiring='lse-sum',  # In this case we use the numerically-stable 'lse-sum' semiring (R, +, *),
                         # where + is the log-sum-exp operation and * is the sum operation.
    # -------------------------------------------------------- #
)
Next, we compile the circuit using this pipeline context.
%%time
circuit = ctx.compile(symbolic_circuit)
CPU times: user 4.46 s, sys: 998 ms, total: 5.46 s
Wall time: 5.38 s
circuit.to(device); # Move the compiled circuit parameters to the chosen device
Since we have chosen the lse-sum semiring, we expect the compiled circuit to output log-probabilities rather than probabilities. We can quickly check this by evaluating the circuit on some input and observing that the outputs are negative (i.e., they are log-likelihoods).
batch = torch.randint(256, size=(1, 784), device=device)
circuit(batch).item()
-4358.77685546875
In the next section of this notebook, we enable a couple of compilation flags that will speed up the feed-forward evaluation of a circuit. However, why would someone disable the optimizations in the first place? The answer is that disabling optimizations is great for debugging purposes. In fact, the PyTorch backend ensures a one-to-one correspondence between the layers in the symbolic circuit representation and the compiled layers, if no optimizations are enabled, thus simplifying debugging operations such as verifying the correctness of inputs and outputs of each layer separately.
Before proceeding to the next section, we benchmark the feed-forward evaluation of the circuit compiled with the default options, as it will serve as a reference when we enable folding and other optimizations.
%%timeit
batch = torch.randint(256, size=(128, 784), device=device)
circuit(batch)
if 'cuda' in str(device):
    torch.cuda.synchronize(device)
1.37 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Folding your Circuit
Circuits typically contain layers that can be evaluated independently of one another. Therefore, we can exploit powerful parallel architectures like GPUs to parallelize the computation of such layers. Enabling folding as a compilation option fuses layers of the same type (e.g., Kronecker product layers) that can be evaluated in parallel. By doing so, we obtain a much more efficient computational graph in PyTorch, with a negligible overhead in terms of compilation speed.
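The idea behind folding can be sketched with a toy example in plain Python. This is not how cirkit implements folding: `LayerSpec`, `fold_layers` and the grouping criterion (same depth, same kind, same shape) are hypothetical and only illustrate how many independent layers can collapse into a few folded ones.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical description of a symbolic layer (not a cirkit class)."""
    name: str
    kind: str     # e.g., 'categorical', 'kronecker', 'sum'
    shape: tuple  # (num_input_units, num_output_units)
    depth: int    # position in a topological layering of the circuit

def fold_layers(layers):
    """Group layers that share depth, kind and shape into a single fold."""
    folds = defaultdict(list)
    for layer in layers:
        folds[(layer.depth, layer.kind, layer.shape)].append(layer.name)
    # Sort by depth so that the folds can be evaluated bottom-up
    return sorted(folds.items(), key=lambda item: item[0][0])

# Toy circuit: 4 input layers, 2 Kronecker layers, 2 sum layers, 1 root sum layer
layers = (
    [LayerSpec(f'in{i}', 'categorical', (1, 64), 0) for i in range(4)]
    + [LayerSpec(f'kron{i}', 'kronecker', (64, 4096), 1) for i in range(2)]
    + [LayerSpec(f'sum{i}', 'sum', (4096, 64), 2) for i in range(2)]
    + [LayerSpec('root', 'sum', (64, 1), 3)]
)

folded = fold_layers(layers)
for (depth, kind, shape), names in folded:
    print(f'depth={depth} kind={kind} shape={shape} num_folds={len(names)}')
```

In this toy circuit, nine layers collapse into four folds. In a tensor library, each fold would typically store its parameters in one tensor with an extra leading "fold" dimension, so that all layers in the fold are evaluated by a single batched operation.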
To initialize a pipeline context that enables folding, we simply need to specify fold=True.
ctx = PipelineContext(
    backend='torch',     # Use the PyTorch backend
    # Specify the backend compilation flags next
    semiring='lse-sum',  # Use the 'lse-sum' semiring
    # --------- Enable circuit folding ---------- #
    fold=True,
    # ------------------------------------------- #
)
Next, we compile the same symbolic circuit and obtain a folded circuit.
%%time
folded_circuit = ctx.compile(symbolic_circuit)
CPU times: user 4.6 s, sys: 1.01 s, total: 5.62 s
Wall time: 5.54 s
folded_circuit.to(device); # Move the compiled circuit parameters to the chosen device
Note that the compilation procedure took a similar amount of time compared to the compilation with the default options shown above. In addition, we compare the number of layers of an "unfolded" circuit with the number of layers of a "folded" circuit.
print(f'Number of layers (fold=False): {len(circuit.layers)}')
print(f'Number of layers (fold=True): {len(folded_circuit.layers)}')
Number of layers (fold=False): 4163
Number of layers (fold=True): 26
The "folded" circuit has far fewer layers, since many of them have been fused together. For example, we can check that the first layer of the circuit computing Categorical likelihoods consists of many folds, as many as the number of variables modelling MNIST images.
folded_layer = next(folded_circuit.topological_ordering())
print(f'Type of the input folded layer: {folded_layer.__class__.__name__}')
print(f'Number of folded layers within it: {folded_layer.num_folds}')
Type of the input folded layer: TorchCategoricalLayer
Number of folded layers within it: 784
As we see in the next code snippet, enabling folding provided an (approximately) 18.1x speed-up for feed-forward circuit evaluations.
%%timeit
batch = torch.randint(256, size=(128, 784), device=device)
folded_circuit(batch)
if 'cuda' in str(device):
    torch.cuda.synchronize(device)
75.8 ms ± 7.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Optimizing the Circuit Layers
Some circuits have layers and parameterizations whose evaluation can be optimized. Enabling optimizations in a pipeline context tells the compiler to try matching a number of optimization patterns defined over the layers of the circuit. If an optimization pattern matches, then the compiler performs a number of operations to optimize the circuit structure.
A simple example of an optimizable circuit structure is one that alternates Kronecker product layers with Dense sum layers. The symbolic circuit we have built already has this kind of structure, as we have specified the tucker sum-product layer. We can verify this by observing the types of the layers of the folded circuit we compiled above.
print([layer.__class__.__name__ for layer in folded_circuit.layers])
['TorchCategoricalLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchKroneckerLayer', 'TorchSumLayer', 'TorchSumLayer']
In this case, we can fuse Kronecker and Dense layers into a single layer, which we call a Tucker layer, that performs the same computations using an efficient tensorized einsum operation. This optimization is why probabilistic circuit architectures like EinsumNetworks are much more efficient. Many other compilation rules are also currently supported by the PyTorch backend.
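The equivalence behind this fusion can be checked with a small sketch in plain Python, written for the usual sum-product semiring. The sizes `K` and `O` and the random values are assumptions for illustration, and the code does not reflect how cirkit implements its layers. The unfused version materializes the Kronecker product before applying the dense sum weights, while the fused version contracts the weights with both inputs directly.

```python
import random

random.seed(0)
K = 3  # Number of units of each input (kept tiny for readability)
O = 2  # Number of output sum units

x = [random.random() for _ in range(K)]  # Outputs of the left child
y = [random.random() for _ in range(K)]  # Outputs of the right child
# Dense sum weights over the K*K Kronecker product entries, one row per output
W = [[random.random() for _ in range(K * K)] for _ in range(O)]

# Unfused: materialize the Kronecker product, then apply the dense sum layer
kron = [xi * yj for xi in x for yj in y]  # Length K*K, entry (i, j) at i*K + j
unfused = [sum(w * k for w, k in zip(row, kron)) for row in W]

# Fused (Tucker): contract W, x and y directly, without storing `kron`
fused = [
    sum(W[o][i * K + j] * x[i] * y[j] for i in range(K) for j in range(K))
    for o in range(O)
]

print(unfused)
print(fused)  # Same values as `unfused`, up to floating-point rounding
```

In a tensor library, the fused contraction is typically written as a single einsum over the weight tensor and the two inputs, which avoids allocating the intermediate Kronecker product of size K*K for every element of the batch.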
The next piece of code shows how to enable optimizations in a pipeline context (i.e., specify optimize=True).
ctx = PipelineContext(
    backend='torch',     # Use the PyTorch backend
    # Specify the backend compilation flags next
    semiring='lse-sum',  # Use the 'lse-sum' semiring
    fold=True,           # Enable circuit folding
    # -------- Enable layer optimizations -------- #
    optimize=True,
    # -------------------------------------------- #
)
Next, we compile the same symbolic circuit and obtain an optimized circuit.
%%time
optimized_circuit = ctx.compile(symbolic_circuit)
CPU times: user 4.78 s, sys: 1.01 s, total: 5.79 s
Wall time: 5.71 s
optimized_circuit.to(device); # Move the compiled circuit parameters to the chosen device
Note that the compilation took just a little more time than the time for the folded circuit. Moreover, if we look at the list of layers, we observe that some of them are now Tucker layers, which can be much more efficient.
print([layer.__class__.__name__ for layer in optimized_circuit.layers])
['TorchCategoricalLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer', 'TorchTuckerLayer', 'TorchTuckerLayer', 'TorchSumLayer']
Finally, we benchmark the optimized circuit compiled in this way.
%%timeit
batch = torch.randint(256, size=(128, 784), device=device)
optimized_circuit(batch)
if 'cuda' in str(device):
    torch.cuda.synchronize(device)
38.6 ms ± 5.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note that we achieved an (approximately) 2.0x speed-up compared to the folded circuit compiled above, and an (approximately) 35.5x speed-up compared to the circuit compiled with no folding and no optimizations.
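The speed-ups quoted in this notebook follow directly from the mean timings reported by the three benchmarks. The short sketch below recomputes them. The variable names are chosen here for illustration, and the timings will differ on other hardware.

```python
# Mean timings reported by the %%timeit cells of this notebook, in milliseconds
unfolded_ms = 1370.0  # Default options (no folding, no optimizations)
folded_ms = 75.8      # fold=True
optimized_ms = 38.6   # fold=True and optimize=True

speedup_fold = unfolded_ms / folded_ms       # Folding vs. default options
speedup_opt = folded_ms / optimized_ms       # Optimizations vs. folding only
speedup_total = unfolded_ms / optimized_ms   # Both flags vs. default options

print(f'Folding:               {speedup_fold:.1f}x')
print(f'Optimizations:         {speedup_opt:.1f}x')
print(f'Folding+optimizations: {speedup_total:.1f}x')
```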