Training

This page is the shared training recipe every model on the ladder uses — the gradient framework, the surrogate, the gradient-stabilisation flag, the loss and optimiser, the readout options, the firing-rate regulariser, and the tasks the networks are exercised against.

Backpropagation through time

Backpropagation Through Time (BPTT) is how gradients are computed in any network whose output at time $t$ depends on state from earlier timesteps. The idea is to unroll the recurrent computation into a deep feedforward graph and run standard backpropagation through it.

Consider a recurrent system with hidden state $h^t$ that evolves according to:

h^t = f(h^{t-1}, x^t; \theta), \quad y^t = g(h^t; \theta)

where $x^t$ is the input, $y^t$ is the output, and $\theta$ are the parameters (shared across time). Iterating for $T$ steps produces a chain $h^0 \to h^1 \to \cdots \to h^T$. For gradient computation, we treat this chain as a deep feedforward network of depth $T$ in which the weights of layer $t$ are tied to those of layer $t-1$.

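The gradient of the loss with respect to a parameter $\theta_k$ is a sum over all timesteps at which $\theta_k$ influenced the computation, where $\partial h^t / \partial \theta_k$ denotes the immediate partial derivative with $h^{t-1}$ held fixed: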

\frac{\partial \mathcal{L}}{\partial \theta_k} = \sum_{t=1}^T \frac{\partial \mathcal{L}}{\partial h^t} \cdot \frac{\partial h^t}{\partial \theta_k}

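The factor $\partial \mathcal{L} / \partial h^t$ is itself computed by a backward recursion through all later timesteps. That recursion contains a product of Jacobians $\partial h^{t+1}/\partial h^t$ across every intervening timestep; if the Jacobian norms are consistently greater than 1, this product explodes exponentially in $T$, and if consistently less than 1, it vanishes.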
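As a minimal sketch of the sum and the Jacobian product, consider a scalar toy recurrence $h^t = \theta h^{t-1} + x^t$ with loss $\mathcal{L} = h^T$. This toy system and the function names are illustrative assumptions, not part of pinglab. Here the Jacobian is simply $\theta$, so the backward signal after $T$ steps scales as $\theta^T$.

```python
def bptt_scalar(theta, xs, h0=0.0):
    """Return (h^T, dL/dtheta) for h^t = theta * h^{t-1} + x^t and L = h^T."""
    # Forward pass: keep every hidden state, the backward pass needs them.
    hs = [h0]
    for x in xs:
        hs.append(theta * hs[-1] + x)

    # Backward pass: start from dL/dh^T = 1 and walk back in time.
    grad_h = 1.0       # dL/dh^t for the current t
    grad_theta = 0.0
    for t in range(len(xs), 0, -1):
        grad_theta += grad_h * hs[t - 1]  # immediate partial dh^t/dtheta = h^{t-1}
        grad_h *= theta                   # one more Jacobian factor dh^t/dh^{t-1}
    return hs[-1], grad_theta


def finite_difference(theta, xs, eps=1e-6):
    """Central-difference check of the BPTT gradient."""
    up, _ = bptt_scalar(theta + eps, xs)
    down, _ = bptt_scalar(theta - eps, xs)
    return (up - down) / (2 * eps)


xs = [1.0] * 20
_, grad = bptt_scalar(0.9, xs)
print(grad, finite_difference(0.9, xs))  # the two values should agree closely

# Jacobian product over T = 2000 steps for a slightly expanding
# and a slightly contracting recurrence.
print(1.1 ** 2000, 0.9 ** 2000)
```

The last line shows why long unrolls are fragile: a per-step factor only 10% away from 1 changes the backward signal by more than 80 orders of magnitude over 2000 steps.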

SNNs are a natural fit for BPTT: each timestep of the simulation is one step of the recursion, and the hidden state includes membrane potentials, synaptic conductances, and refractory counters. Simulating 200 ms at $\Delta t = 0.1$ ms gives $T = 2000$ unrolled steps. Biophysical state variables carry physical units (mV, μS), so the Jacobians can be wildly scaled: voltage updates involve tiny factors like $\Delta t / C_m$ while surrogate gradients through spikes are $O(1)$. This scale mismatch is the origin of the gradient-stabilisation flag below.

Surrogate gradients

The spike function $S = \mathbf{1}[U \geq \theta]$, where $\theta$ here denotes the firing threshold rather than the parameter vector, has zero gradient almost everywhere, so backward passes use a surrogate.

Pinglab implements two surrogates:

Fast-sigmoid (fast_sigmoid_spike) — used by every model except CubaPingNet. Forward is the hard step, backward is

\frac{\partial \tilde S}{\partial U} = \frac{k}{(1 + k\,|U - \theta|)^2}

This matches snntorch’s FastSigmoid surrogate, so equal-$k$ comparisons against the snntorch reference are pure update-rule comparisons, not surrogate comparisons.

Arctan (arctan_spike) — used inside CubaPingNet for both the E and I spike emissions. Forward is the hard step, backward is

\frac{\partial \tilde S}{\partial U} = \frac{k}{1 + (k\pi\,(U - \theta))^2}

At the same $k$ this is smaller than the fast-sigmoid surrogate away from threshold (by a factor approaching $\pi^2$ for large $|U - \theta|$), which keeps the I-population’s spike gradient from overwhelming the E-population gradient when both are far from threshold.

Both surrogates take their slope from the module-level constant SURROGATE_SLOPE = 5.0, overridable per-run with --surrogate-slope.
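The two backward formulas can be compared numerically. The sketch below only evaluates the derivative expressions given above in plain Python; it is not the pinglab implementation of fast_sigmoid_spike or arctan_spike, and the threshold value of 1.0 is an assumption chosen for illustration.

```python
import math

SURROGATE_SLOPE = 5.0  # default slope k quoted in the text


def fast_sigmoid_grad(u, theta=1.0, k=SURROGATE_SLOPE):
    """Fast-sigmoid surrogate derivative: k / (1 + k|U - theta|)^2."""
    return k / (1.0 + k * abs(u - theta)) ** 2


def arctan_grad(u, theta=1.0, k=SURROGATE_SLOPE):
    """Arctan surrogate derivative: k / (1 + (k*pi*(U - theta))^2)."""
    return k / (1.0 + (k * math.pi * (u - theta)) ** 2)


# Both peak at k on threshold; away from it the arctan form is smaller.
for u in (1.0, 1.2, 3.0):
    print(u, fast_sigmoid_grad(u), arctan_grad(u))
```

Both surrogates equal $k$ exactly at threshold, so changing the slope rescales the peak gradient as well as narrowing the window around threshold in which gradient flows.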

Gradient stabilisation: `--v-grad-dampen`

The biophysical models (COBA, PING) face a gradient-scale problem. Naive backprop through a 2000-step trial produces NaN within a few batches. Pinglab handles this with the costate-control framework of Burghi, Pugliese Carratelli & Rule 2024 (“Costate control for nonlinear system identification with applications to excitable systems”), implemented as the single CLI flag --v-grad-dampen.

Why the Jacobian blows up in excitable systems

The forward dynamics of a conductance-based neuron, in continuous time, are

\dot x = f_\theta(t, x, u), \qquad x = (V, g_e, g_i, \ldots),

with $u(t)$ the input and $\theta$ the parameters. The Jacobian $\nabla_x f_\theta$ has block structure

\nabla_x f_\theta = \begin{pmatrix} \partial \dot V / \partial V & \partial \dot V / \partial g \\ \partial \dot g / \partial V & \partial \dot g / \partial g \end{pmatrix}.

The diagonal blocks decay (membrane and conductance both have leak terms). The off-diagonal blocks are the cross-coupling and carry the destabilising spectrum — large transient eigenvalues during firing events, which is the very feature that lets the neuron spike fast. Chained across $T = 2000$ steps the product blows up by many orders of magnitude.

The costate equation

BPTT computes parameter gradients by running an adjoint equation backward in time. Define the costate $\lambda(t) \in \mathbb{R}^d$:

\lambda(T) = \nabla_{x(T)} \mathcal{L}, \qquad \dot \lambda(t) = -\nabla_x f_\theta(t, x, u)^\top \lambda(t).

This is the continuous-time counterpart of what PyTorch’s autograd computes on the discretised simulation. The costate equation is linear in $\lambda$ with time-varying coefficient $\nabla_x f_\theta$, so its solution norm tracks the integrated exponent of the Jacobian and blows up whenever $\nabla_x f_\theta$ has large transient eigenvalues.

Costate control — proportional high-gain

Burghi et al. modify the costate equation (and only the costate equation, not the forward pass) by adding a controller $K_\theta(t, x)$:

\dot \lambda = -[\nabla_x f_\theta + K_\theta(t, x)]^\top \lambda.

Pinglab implements the simplest controller they describe: proportional high-gain. Pick a scalar $\gamma > 0$ and scale the backward voltage gradient by $1/\gamma$ at every timestep:

\frac{\partial \mathcal{L}}{\partial V_i^t} \leftarrow \frac{1}{\gamma}\,\frac{\partial \mathcal{L}}{\partial V_i^t}.

In code this is one line — _scale_grad(dv, 1.0 / v_grad_dampen) inside the LIF step. Typical values are $\gamma \approx 80$ for the unitless standard SNN and $\gamma \approx 1000$ for COBA/PING; the latter is what the canonical recipes in nb025 and downstream use.
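To see what the $1/\gamma$ scaling does to the backward recursion, here is a toy sketch in plain Python. It does not reproduce _scale_grad or any pinglab model: the 2×2 Jacobian for a $(V, g)$ pair is a hypothetical constant matrix with leaky diagonal entries and strong cross-coupling, and the value $\gamma = 20$ is chosen only to suit this toy matrix.

```python
import math

# Hypothetical per-step Jacobian for state (V, g). Illustrative numbers only:
# diagonal entries below 1 (leak), large off-diagonal coupling.
J = [[0.95, 2.0],
     [0.50, 0.9]]


def backward_norm(steps, gamma=1.0):
    """Norm of the backward signal after `steps` applications of lam <- J^T lam,
    with the voltage component divided by gamma at every step."""
    lam_v, lam_g = 1.0, 1.0  # dL/dx at the final timestep
    for _ in range(steps):
        new_v = J[0][0] * lam_v + J[1][0] * lam_g  # (J^T lam)_V
        new_g = J[0][1] * lam_v + J[1][1] * lam_g  # (J^T lam)_g
        lam_v, lam_g = new_v / gamma, new_g        # dampen only the V pathway
    return math.hypot(lam_v, lam_g)


print(backward_norm(200))              # undamped: grows by dozens of orders of magnitude
print(backward_norm(200, gamma=20.0))  # damped: stays bounded
```

The forward pass never appears in this sketch, which mirrors the point of the method: only the backward signal is altered.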

The forward dynamics are unchanged. The forward state $x(t)$ is identical with or without the controller; only the gradient flow is modified. Burghi et al.’s Proposition 1 establishes that controlled costate descent has the same critical points as true gradient descent provided $K_\theta$ is chosen so the modified costate is uniformly bounded — Adam converges to the same place, it just takes a different route. This is why proportional high-gain is not just gradient clipping in disguise: clipping changes the vector field; $K^{\text{prop}}$ preserves the fixed points.

The cost of the uniform scaling is that the diagonal voltage gradient — the leak/refractory pathway, which we’d rather keep — is attenuated by the same factor as the destabilising off-diagonal block we wanted to suppress. Empirically, large $\gamma$ can silently cap achievable accuracy. The fix is to use the smallest $\gamma$ that still prevents NaN; on long-trial tasks this is open future work.

The training loop

The network output is passed to cross-entropy loss:

\mathcal{L}_\text{CE} = -\frac{1}{B}\sum_{b=1}^{B} \log \frac{\exp(\hat y_{b,\,c_b})}{\sum_k \exp(\hat y_{b,k})}

where $B$ is the batch size, $\hat y_b$ is the logit vector for sample $b$, and $c_b$ is the true class. Chance-level loss on a 10-class problem is $\ln 10 \approx 2.30$.
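The chance-level figure can be checked directly. The sketch below is a plain-Python evaluation of the cross-entropy formula for illustration (the function name is ours, not a pinglab API); with uniform logits every class gets probability $1/10$, so the loss is $\ln 10$ regardless of the targets.

```python
import math


def cross_entropy(logits, targets):
    """Mean cross-entropy over a batch of logit rows and integer class labels."""
    total = 0.0
    for row, c in zip(logits, targets):
        m = max(row)  # subtract the max before exponentiating, for stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[c]  # equals -log softmax(row)[c]
    return total / len(logits)


uniform_logits = [[0.0] * 10 for _ in range(4)]
print(cross_entropy(uniform_logits, [3, 1, 4, 1]))  # ln 10, about 2.30
```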

The optimiser is Adam. Gradients are clipped to a maximum norm of 1 (GRAD_CLIP = 1.0) before each step. The model state is checkpointed whenever test accuracy reaches a new best, so the saved weights.pth is the best-epoch state, not the final-epoch state.
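For readers unfamiliar with norm clipping, the following plain-Python sketch shows the usual global-norm variant, the same rule PyTorch's torch.nn.utils.clip_grad_norm_ applies. We assume here that the clip is taken over the joint norm of all parameter gradients; the text above does not say whether pinglab clips globally or per tensor.

```python
import math

GRAD_CLIP = 1.0  # value quoted in the text


def clip_grad_norm(grads, max_norm=GRAD_CLIP):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total + 1e-6))  # never scale gradients up
    return [[g * scale for g in vec] for vec in grads], total


clipped, norm_before = clip_grad_norm([[3.0, 4.0]])
print(norm_before, clipped)  # norm 5.0 is scaled down to roughly 1.0
```

Gradients whose norm is already below the limit pass through unchanged, so clipping only acts on outlier batches.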

Readout

The linear readout is configurable via --readout. Four modes are available:

spike-count — accumulate last-hidden spikes over the trial and project linearly. $\hat y = (\sum_t s^{\text{hid}}_t)\, W_\text{out} + b_\text{out}$ . Equivalent to spike-rate up to a constant.
mem-mean — pass spikes through a final LIF without resetting and average its membrane potential across the trial. This is the default for the COBA/PING recipes and is what nb025 and the streaming entries use.
li — leaky integrator: a non-spiking LIF whose final-step membrane potential is the logit.
rate — softmax over per-trial spike rates.

The readout choice is one of the few knobs that changes where in the network the gradient enters. mem-mean lets gradient flow through the output LIF’s membrane state at every timestep; spike-count only sees the aggregate.

Firing-rate regularisation

Many recipes add a penalty on excessive or insufficient hidden firing via --fr-reg-upper-theta, --fr-reg-upper-strength, and the matching lower pair. The penalty is

\mathcal{L}_\text{fr} = s_u \cdot \mathrm{ReLU}(\bar r - \theta_u) + s_l \cdot \mathrm{ReLU}(\theta_l - \bar r)

where $\bar r$ is the per-layer mean firing rate (per-neuron or population, set by --fr-reg-mode). It’s the mechanism behind the $\theta_u$ sweeps in nb025 and the rate-floor framing in ar009 / ar010.
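A minimal sketch of the penalty, under stated assumptions: the mode names "population" and "per-neuron" are placeholders for whatever values --fr-reg-mode accepts, and averaging the per-neuron penalties is our guess at the reduction. Only the ReLU formula itself comes from the text.

```python
def fr_penalty(rates, theta_u, s_u, theta_l, s_l, mode="population"):
    """Two-sided firing-rate penalty on one layer's rates.

    mode="population": penalise the layer-mean rate.
    mode="per-neuron": penalise each neuron's rate, then average (assumed).
    """
    def one(r):
        upper = max(r - theta_u, 0.0)  # ReLU(r - theta_u)
        lower = max(theta_l - r, 0.0)  # ReLU(theta_l - r)
        return s_u * upper + s_l * lower

    if mode == "population":
        return one(sum(rates) / len(rates))
    return sum(one(r) for r in rates) / len(rates)


rates = [2.0, 30.0]
print(fr_penalty(rates, 20.0, 1.0, 5.0, 1.0, mode="population"))  # 0.0
print(fr_penalty(rates, 20.0, 1.0, 5.0, 1.0, mode="per-neuron"))  # 6.5
```

The example shows why the mode matters: a layer with one silent and one over-active neuron has an acceptable mean rate of 16, so the population penalty is zero while the per-neuron penalty is not.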

Weight init

Feedforward weights are sampled from a fan-in-normalised half-normal (Dale’s law) or normal (signed):

W \sim \mathcal{N}(\mu, \sigma^2), \qquad W \leftarrow W / N_{\text{pre}}

with optional sparsity $s \in [0, 1)$: a fraction $s$ of entries are zeroed and surviving entries are rescaled by $1/(1-s)$ so the total expected synaptic input per post-neuron is preserved.
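The recipe can be sketched in plain Python as follows. Two details are assumptions on our part: each entry is zeroed independently with probability $s$ (rather than zeroing an exact fraction), and the half-normal is obtained as the absolute value of a normal draw. The function name and signature are hypothetical.

```python
import random


def init_weights(n_pre, n_post, mu=0.0, sigma=1.0, sparsity=0.0,
                 dale=True, seed=0):
    """Return an n_pre x n_post weight matrix as a list of rows."""
    rng = random.Random(seed)
    keep = 1.0 - sparsity
    weights = []
    for _ in range(n_pre):
        row = []
        for _ in range(n_post):
            w = rng.gauss(mu, sigma)
            if dale:
                w = abs(w)              # half-normal (for mu = 0): non-negative
            w /= n_pre                  # fan-in normalisation
            if rng.random() < sparsity:
                w = 0.0                 # pruned connection
            else:
                w /= keep               # rescale survivors by 1 / (1 - s)
            row.append(w)
        weights.append(row)
    return weights


W = init_weights(200, 50, sparsity=0.5)
mean_input = sum(w for row in W for w in row) / 50
print(mean_input)  # close to sqrt(2/pi), the dense half-normal expectation
```

With $\sigma = 1$ the expected summed input per post-neuron is $\sqrt{2/\pi} \approx 0.80$ for a dense half-normal matrix, and the $1/(1-s)$ rescaling keeps it there in expectation after pruning.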

Dale’s law under Adam

When Dale’s law is enforced, weight matrices are clamped to $W \geq 0$ in the forward pass and explicitly projected back into the non-negative cone by project_dales() after every optimiser step. The forward clamp keeps the simulation honest; the post-step projection keeps the trainable parameters themselves from drifting arbitrarily far into the negative orthant under Adam’s momentum.

Tasks and inputs

The models are exercised against three classes of input.

Synthetic conductance

Direct conductance injection into layer-1 E neurons, used for baseline oscillation studies where we want PING dynamics in isolation from any encoding stage. Drive is generated Börgers-style as a step function with per-neuron heterogeneity $X_i$ plus an Ornstein–Uhlenbeck noise process:

g_i^{\text{ext}}(t) = T_E(t)(1 + \sigma_e X_i) + \eta_i(t)

where $T_E$ switches between async and PING-regime values during a stimulus window, $X_i \sim \mathcal{N}(0, 1)$ is per-neuron heterogeneity, and $\eta$ is a discrete Ornstein–Uhlenbeck process with $\tau_\eta = 3$ ms. Drive is calibrated at $\Delta t_{\text{cal}} = 0.1$ ms and rescaled at runtime so that the steady-state AMPA conductance is invariant across $\Delta t$.
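A rough sketch of such a drive generator is given below. The text does not specify the OU discretisation or noise amplitude, so we assume the exact one-step update $\eta \leftarrow \eta\, e^{-\Delta t/\tau_\eta} + \sigma_\eta \sqrt{1 - e^{-2\Delta t/\tau_\eta}}\,\xi$ with a hypothetical stationary standard deviation noise_sigma. The runtime rescaling across $\Delta t$ is not modelled, and all names are illustrative.

```python
import math
import random


def conductance_drive(n_neurons, t_ms, dt, te_base, te_stim, stim_window,
                      sigma_e, noise_sigma, tau_eta=3.0, seed=0):
    """Return (drive, X): drive[k][i] is g_i^ext at step k, X the heterogeneity."""
    rng = random.Random(seed)
    X = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]  # fixed per neuron

    decay = math.exp(-dt / tau_eta)                      # OU decay per step
    kick = noise_sigma * math.sqrt(1.0 - decay ** 2)     # keeps the variance stationary
    eta = [0.0] * n_neurons

    steps = int(round(t_ms / dt))
    on = int(round(stim_window[0] / dt))                 # stimulus window in steps
    off = int(round(stim_window[1] / dt))

    drive = []
    for k in range(steps):
        te = te_stim if on <= k < off else te_base       # step function T_E(t)
        eta = [decay * e + kick * rng.gauss(0.0, 1.0) for e in eta]
        drive.append([te * (1.0 + sigma_e * x) + e for x, e in zip(X, eta)])
    return drive, X


drive, X = conductance_drive(3, 10.0, 0.5, 1.0, 2.0, (2.0, 6.0), 0.1, 0.05)
print(len(drive), drive[0], drive[4])
```

Setting noise_sigma to zero recovers the pure heterogeneous step, which is a convenient way to check the stimulus window before adding noise.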

Synthetic spikes

Poisson spike trains over $N_{\text{in}}$ input neurons with a rate that steps between a baseline and a stimulus level during a fixed window. Used when we want a spiking input with no spatial structure.

Image and audio datasets

Supported datasets are scikit-digits ($8 \times 8$, 10 classes), MNIST ($28 \times 28 = 784$, 10 classes), and SHD (Spiking Heidelberg Digits, 700 channels, 20 classes). MNIST pixels become Poisson spike trains: for pixel $i$ with normalised intensity $x_i \in [0, 1]$, the corresponding input neuron fires a Bernoulli spike at each step with probability

p_i(t) = x_i \cdot r_{\max} \cdot \Delta t / 1000, \qquad r_{\max} = 25 \text{ Hz}

so the per-neuron rate is $x_i \cdot r_{\max}$ Hz, independent of $\Delta t$ (with $\Delta t$ in ms, the factor of 1000 converts it to seconds). Default trial length is $T_\text{ms} = 200$ ms. Encoding is stochastic, so the network sees a different spike realisation of the same image every epoch, a form of data augmentation intrinsic to the rate code. At evaluation time, a seeded torch.Generator is threaded through the encoder so train-time eval and standalone infer on the same weights produce identical spike trains.
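The encoding rule can be sketched in a few lines. This is a plain-Python stand-in, not the pinglab encoder: random.Random plays the role of the seeded torch.Generator, and the function names are hypothetical.

```python
import random

R_MAX_HZ = 25.0  # r_max from the text


def encode(pixels, t_ms=200.0, dt=1.0, r_max=R_MAX_HZ, seed=None):
    """Bernoulli rate coding: returns spikes[step][neuron] with entries 0 or 1."""
    rng = random.Random(seed)  # seed=None gives a fresh realisation each call
    steps = int(round(t_ms / dt))
    probs = [x * r_max * dt / 1000.0 for x in pixels]  # per-step spike probability
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(steps)]


def mean_rate_hz(spikes, t_ms):
    """Average firing rate per neuron over the trial, in Hz."""
    n_neurons = len(spikes[0])
    return sum(map(sum, spikes)) / (n_neurons * t_ms / 1000.0)


pixels = [0.5] * 1000
print(mean_rate_hz(encode(pixels, dt=1.0, seed=1), 200.0))  # near 12.5 Hz
print(mean_rate_hz(encode(pixels, dt=0.5, seed=1), 200.0))  # near 12.5 Hz as well
```

Passing the same seed twice yields identical spike trains, which is the property the seeded evaluation path relies on, while leaving the seed unset gives a new realisation per call, as in training.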

SHD ships as spike trains directly, so no encoder is needed; the trial is the recording.