Probabilistic Integral Circuits¶

In this notebook, we will show an alternative way of learning the parameters of tensorized folded (probabilistic) circuits.

This technique is actually based on another model class called Probabilistic Integral Circuit (PIC), which extends Probabilistic Circuits (PCs) by adding integral units, which allow modelling continuous latent variables.

Fortunately enough, we do not need to fully understand PICs to apply them! In fact, from an application point of view, all we need to do is replacing every folded tensor parameter with a neural net whose output is an equally-sized tensor! Therefore, the actual parameters we are going to optimize are those of such neural nets, and not the original tensors. This is it -- nothing less, nothing more.

To showcase this alternative parameter learning scheme, we will first instantiate a folded circuit as shown in the learning a probabilistic circuit notebook.

In [1]:

Copied!





from cirkit.templates import utils, data_modalities
from cirkit.pipeline import compile
import random
import numpy as np
import torch


# Set some seeds
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)

# Set the torch device to use
device = torch.device('cuda')

symbolic_circuit = data_modalities.image_data(
    (1, 28, 28),                # The shape of MNIST image, i.e., (num_channels, image_height, image_width)
    region_graph='quad-graph',  # Select the structure of the circuit to follow the QuadGraph region graph
    input_layer='categorical',  # Use Categorical distributions for the pixel values (0-255) as input layers
    num_input_units=64,         # Each input layer consists of 64 Categorical input units
    sum_product_layer='cp',     # Use CP sum-product layers, i.e., alternate dense sum layers and hadamard product layers
    num_sum_units=64,           # Each dense sum layer consists of 64 sum units
    sum_weight_param=utils.Parameterization(
        activation='none',      # Do not use any parameterization
        initialization='normal' # Initialize the sum weights by sampling from a standard normal distribution
    )
)
circuit = compile(symbolic_circuit)
from cirkit.templates import utils, data_modalities
from cirkit.pipeline import compile
import random
import numpy as np
import torch


# Set some seeds
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed(42)

# Set the torch device to use
device = torch.device('cuda')

symbolic_circuit = data_modalities.image_data(
    (1, 28, 28),                # The shape of MNIST image, i.e., (num_channels, image_height, image_width)
    region_graph='quad-graph',  # Select the structure of the circuit to follow the QuadGraph region graph
    input_layer='categorical',  # Use Categorical distributions for the pixel values (0-255) as input layers
    num_input_units=64,         # Each input layer consists of 64 Categorical input units
    sum_product_layer='cp',     # Use CP sum-product layers, i.e., alternate dense sum layers and hadamard product layers
    num_sum_units=64,           # Each dense sum layer consists of 64 sum units
    sum_weight_param=utils.Parameterization(
        activation='none',      # Do not use any parameterization
        initialization='normal' # Initialize the sum weights by sampling from a standard normal distribution
    )
)
circuit = compile(symbolic_circuit)

The one above is the very same circuit from the learning a circuit notebook. Let's now print some stuff related to its first and second layer.

In [2]:

Copied!

print(circuit.nodes[0])
print(circuit.nodes[0].probs().shape)
print(circuit.nodes[0])
print(circuit.nodes[0].probs().shape)

TorchCategoricalLayer(
  folds: 784  variables: 1  output-units: 64
  input-shape: (784, 1, -1, 1)
  output-shape: (784, -1, 64)
  (probs): TorchParameter(
    shape: (784, 64, 256)
    (0): TorchTensorParameter(output-shape: (784, 64, 256))
    (1): TorchSoftmaxParameter(
      input-shapes: [(784, 64, 256)]
      output-shape: (784, 64, 256)
    )
  )
)
torch.Size([784, 64, 256])

The first layer is an input categorical layer, which models the 784 gray-scale pixels using a folded tensor of shape (784, 64, 1, 256).

In [3]:

Copied!

print(circuit.nodes[1])
print(circuit.nodes[1].weight().shape)
print(circuit.nodes[1])
print(circuit.nodes[1].weight().shape)

TorchSumLayer(
  folds: 1568  arity: 1  input-units: 64  output-units: 64
  input-shape: (1568, 1, -1, 64)
  output-shape: (1568, -1, 64)
  (weight): TorchParameter(
    shape: (1568, 64, 64)
    (0): TorchTensorParameter(output-shape: (1568, 64, 64))
  )
)
torch.Size([1568, 64, 64])

The second layer is a TorchDenseLayer whose shape is (1568, 64, 64).

Converting PCs to Quadrature PCs (QPCs)¶

As we mentioned earlier, all we need to do is replacing every folded tensor parameter with a neural net whose output returns an equally-sized tensor. Therefore, instead of training tensors, we will be training neural nets that output tensors.

We can do such replacement in one line by using the API pc2qpc. Here QPC stands for Quadrature PC, and it is simply a discretization of the underlying PIC which is parameterized by neural nets.

In [4]:

Copied!

from cirkit.backend.torch.parameters.pic import pc2qpc
pc2qpc(circuit, integration_method="trapezoidal", net_dim=256)
from cirkit.backend.torch.parameters.pic import pc2qpc
pc2qpc(circuit, integration_method="trapezoidal", net_dim=256)

Let's now inspect again the first layer of circuit.

In [5]:

Copied!

print(circuit.nodes[0])
print(circuit.nodes[0].probs().shape)
print(circuit.nodes[0])
print(circuit.nodes[0].probs().shape)

TorchCategoricalLayer(
  folds: 784  variables: 1  output-units: 64
  input-shape: (784, 1, -1, 1)
  output-shape: (784, -1, 64)
  (probs): PICInputNet(
    (reparam): TorchSoftmaxParameter(
      input-shapes: [(784, 64, 256)]
      output-shape: (784, 64, 256)
    )
    (net): Sequential(
      (0): FourierLayer(1, 256, sigma=1.0)
      (1): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
      (2): Tanh()
      (3): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
    )
  )
)
torch.Size([784, 64, 256])

We see that the first layer is now parameterized by a neural net (instance of the class PICInputNet), and that its evaluation delivers an equally-shaped tensor as before, i.e. (784, 64, 1, 256).

Let's check now the second layer.

In [6]:

Copied!

print(circuit.nodes[1])
print(circuit.nodes[1].weight().shape)
print(torch.all(torch.isclose(circuit.nodes[1].weight().sum(-1), torch.tensor(1.0))))
print(circuit.nodes[1])
print(circuit.nodes[1].weight().shape)
print(torch.all(torch.isclose(circuit.nodes[1].weight().sum(-1), torch.tensor(1.0))))

TorchSumLayer(
  folds: 1568  arity: 1  input-units: 64  output-units: 64
  input-shape: (1568, 1, -1, 64)
  output-shape: (1568, -1, 64)
  (weight): PICInnerNet(
    (net): Sequential(
      (0): FourierLayer(2, 256, sigma=1.0)
      (1): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
      (2): Tanh()
      (3): Conv1d(256, 256, kernel_size=(1,), stride=(1,))
      (4): Tanh()
      (5): Conv1d(401408, 1568, kernel_size=(1,), stride=(1,))
      (6): Softplus(beta=1.0, threshold=20.0)
    )
  )
)
torch.Size([1568, 64, 64])
tensor(True)

Similarly, the second layer is parameterized by a neural net (instance of the class PICInnerNet), and its evaluation delivers an equally-shaped tensor as before, i.e. (1568, 64, 64). Finally, note that output tensors from neural nets associated to inner layers are already normalized, in the sense that they sum up to 1 over the last dimension, therefore leading to a PC with normalization constant equal to 1.

That is it, we are done! 🎉 We can now even forget about PICs, and just train as in the learning a circuit notebook as we do next. However, if you want to learn more about PICs (and understand the input arguments of pc2qpc) please check the original publication.

Learning a Probabilistic (Integral) Circuit¶

In [7]:

Copied!





from torch import optim
from torch.utils.data import DataLoader
from torchvision import transforms, datasets

# Load the MNIST data set and data loaders
transform = transforms.Compose([
    transforms.ToTensor(),
    # Flatten the images and set pixel values in the [0-255] range
    transforms.Lambda(lambda x: (255 * x.view(-1)).long())
])
data_train = datasets.MNIST('datasets', train=True, download=True, transform=transform)
data_test = datasets.MNIST('datasets', train=False, download=True, transform=transform)

# Instantiate the training and testing data loaders
train_dataloader = DataLoader(data_train, shuffle=True, batch_size=256)
test_dataloader = DataLoader(data_test, shuffle=False, batch_size=256)

# Initialize a torch optimizer of your choice,
#  e.g., Adam, by passing the parameters of the circuit
optimizer = optim.Adam(circuit.parameters(), lr=0.005)
from torch import optim
from torch.utils.data import DataLoader
from torchvision import transforms, datasets

# Load the MNIST data set and data loaders
transform = transforms.Compose([
    transforms.ToTensor(),
    # Flatten the images and set pixel values in the [0-255] range
    transforms.Lambda(lambda x: (255 * x.view(-1)).long())
])
data_train = datasets.MNIST('datasets', train=True, download=True, transform=transform)
data_test = datasets.MNIST('datasets', train=False, download=True, transform=transform)

# Instantiate the training and testing data loaders
train_dataloader = DataLoader(data_train, shuffle=True, batch_size=256)
test_dataloader = DataLoader(data_test, shuffle=False, batch_size=256)

# Initialize a torch optimizer of your choice,
#  e.g., Adam, by passing the parameters of the circuit
optimizer = optim.Adam(circuit.parameters(), lr=0.005)

In [8]:

Copied!





num_epochs = 10
step_idx = 0
running_loss = 0.0
running_samples = 0

# Move the circuit to chosen device
circuit = circuit.to(device)

for epoch_idx in range(num_epochs):
    for i, (batch, _) in enumerate(train_dataloader):
        # The circuit expects an input of shape (batch_dim, num_channels, num_variables),
        # so we unsqueeze a dimension for the channel.
        batch = batch.to(device)

        # Compute the log-likelihoods of the batch, by evaluating the circuit
        log_likelihoods = circuit(batch)

        # We take the negated average log-likelihood as loss
        loss = -torch.mean(log_likelihoods)
        loss.backward()
        # Update the parameters of the circuits, as any other model in PyTorch
        optimizer.step()
        optimizer.zero_grad()
        running_loss += loss.detach() * len(batch)
        running_samples += len(batch)
        step_idx += 1
        if step_idx % 200 == 0:
            average_nll = running_loss / running_samples
            print(f"Step {step_idx}: Average NLL: {average_nll:.3f}")
            running_loss = 0.0
            running_samples = 0
num_epochs = 10
step_idx = 0
running_loss = 0.0
running_samples = 0

# Move the circuit to chosen device
circuit = circuit.to(device)

for epoch_idx in range(num_epochs):
    for i, (batch, _) in enumerate(train_dataloader):
        # The circuit expects an input of shape (batch_dim, num_channels, num_variables),
        # so we unsqueeze a dimension for the channel.
        batch = batch.to(device)

        # Compute the log-likelihoods of the batch, by evaluating the circuit
        log_likelihoods = circuit(batch)

        # We take the negated average log-likelihood as loss
        loss = -torch.mean(log_likelihoods)
        loss.backward()
        # Update the parameters of the circuits, as any other model in PyTorch
        optimizer.step()
        optimizer.zero_grad()
        running_loss += loss.detach() * len(batch)
        running_samples += len(batch)
        step_idx += 1
        if step_idx % 200 == 0:
            average_nll = running_loss / running_samples
            print(f"Step {step_idx}: Average NLL: {average_nll:.3f}")
            running_loss = 0.0
            running_samples = 0

Step 200: Average NLL: 798.034
Step 400: Average NLL: 699.926
Step 600: Average NLL: 684.053
Step 800: Average NLL: 677.998
Step 1000: Average NLL: 671.159
Step 1200: Average NLL: 661.711
Step 1400: Average NLL: 658.074
Step 1600: Average NLL: 655.282
Step 1800: Average NLL: 653.680
Step 2000: Average NLL: 651.430
Step 2200: Average NLL: 650.717

We evaluate our probabilistic circuit on the test data by computing the average log-likelihood and bits per dimension.

In [10]:

Copied!





with torch.no_grad():
    test_lls = 0.0

    for batch, _ in test_dataloader:
        # The circuit expects an input of shape (batch_dim, num_channels, num_variables),
        # so we unsqueeze a dimension for the channel.
        batch = batch.to(device)

        # Compute the log-likelihoods of the batch
        log_likelihoods = circuit(batch)

        # Accumulate the log-likelihoods
        test_lls += log_likelihoods.sum().item()

    # Compute average test log-likelihood and bits per dimension
    average_ll = test_lls / len(data_test)
    bpd = -average_ll / (28 * 28 * np.log(2.0))
    print(f"Average test LL: {average_ll:.3f}")
    print(f"Bits per dimension: {bpd:.3f}")
with torch.no_grad():
    test_lls = 0.0

    for batch, _ in test_dataloader:
        # The circuit expects an input of shape (batch_dim, num_channels, num_variables),
        # so we unsqueeze a dimension for the channel.
        batch = batch.to(device)

        # Compute the log-likelihoods of the batch
        log_likelihoods = circuit(batch)

        # Accumulate the log-likelihoods
        test_lls += log_likelihoods.sum().item()

    # Compute average test log-likelihood and bits per dimension
    average_ll = test_lls / len(data_test)
    bpd = -average_ll / (28 * 28 * np.log(2.0))
    print(f"Average test LL: {average_ll:.3f}")
    print(f"Bits per dimension: {bpd:.3f}")

Average test LL: -645.984
Bits per dimension: 1.189