Training
This page is the shared training recipe every model on the ladder uses — the gradient framework, the surrogate, the gradient-stabilisation flag, the loss and optimiser, the readout options, the firing-rate regulariser, and the tasks the networks are exercised against.
Backpropagation through time
Backpropagation Through Time (BPTT) is how gradients are computed in any network whose output at time $t$ depends on state from earlier timesteps. The idea is to unroll the recurrent computation into a deep feedforward graph and run standard backpropagation through it.
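Consider a recurrent system with hidden state $h_t$ that evolves according to:
$$h_t = f(h_{t-1}, x_t; \theta), \qquad y_t = g(h_t)$$
where $x_t$ is the input, $y_t$ is the output, and $\theta$ are the parameters (shared across time). Iterating for $T$ steps produces a chain $h_0 \to h_1 \to \dots \to h_T$. For gradient computation, we treat this chain as a deep feedforward network of depth $T$ where layer $t$’s weights are tied to layer $t-1$’s.
The gradient of the loss $\mathcal{L}$ with respect to a parameter $\theta$ is a sum over all timesteps at which $\theta$ influenced the computation:
$$\frac{\partial \mathcal{L}}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial h_t}\,\frac{\partial f(h_{t-1}, x_t; \theta)}{\partial \theta}, \qquad \frac{\partial \mathcal{L}}{\partial h_t} = \frac{\partial \mathcal{L}}{\partial y_t}\frac{\partial y_t}{\partial h_t} + \frac{\partial \mathcal{L}}{\partial h_{t+1}}\frac{\partial h_{t+1}}{\partial h_t}$$
The recursion contains a product of Jacobians $\partial h_{t+1} / \partial h_t$ across every timestep; if the Jacobian norms are consistently greater than 1, this product explodes exponentially in $T$, and if consistently less than 1, it vanishes.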
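The explode-or-vanish behaviour can be seen with a scalar stand-in for the Jacobian. This is a minimal sketch, assuming every per-step Jacobian has the same norm; the function name `jacobian_product_norm` is hypothetical, and the 2000-step depth is borrowed from the trial length quoted later on this page.

```python
def jacobian_product_norm(jacobian_norm: float, steps: int) -> float:
    """Norm of a product of `steps` identical scalar Jacobians."""
    return jacobian_norm ** steps


# A 1% deviation from unit norm is enough to vanish or explode over 2000 steps.
for norm in (0.99, 1.0, 1.01):
    print(f"norm={norm}: product after 2000 steps = {jacobian_product_norm(norm, 2000):.3e}")
```

Real Jacobians are matrices whose norms vary from step to step, so this only illustrates the exponential trend, not the exact magnitude.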
SNNs are a natural fit for BPTT: each timestep of the simulation is one step of the recursion, and the hidden state includes membrane potentials, synaptic conductances, and refractory counters. Simulating 200 ms at a timestep of $\Delta t$ ms gives $T = 200 / \Delta t$ unrolled steps. Biophysical state variables carry physical units (mV, μS), so the Jacobians can be wildly scaled: voltage updates involve tiny factors proportional to $\Delta t$, while surrogate gradients through spikes are $O(1)$. This scale mismatch is the origin of the gradient-stabilisation flag below.
Surrogate gradients
The spike function has zero gradient almost everywhere, so backward passes use a surrogate.
Pinglab implements two surrogates:
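- Fast-sigmoid (fast_sigmoid_spike): used by every model except CubaPingNet. Forward is the hard step, backward is
$$\frac{\partial S}{\partial v} = \frac{1}{\left(1 + k\,|v - v_{\text{th}}|\right)^2}$$
where $k$ is the surrogate slope and $v_{\text{th}}$ the threshold. This matches snntorch’s FastSigmoid surrogate, so like-for-like comparisons against the snntorch reference are pure update-rule comparisons, not surrogate comparisons.
- Arctan (arctan_spike): used inside CubaPingNet for both the E and I spike emissions. Forward is the hard step, backward is the derivative of an arctangent in $k\,(v - v_{\text{th}})$, which up to constant factors is
$$\frac{\partial S}{\partial v} \propto \frac{1}{1 + \left(k\,(v - v_{\text{th}})\right)^2}$$
This decays faster than fast-sigmoid around threshold, which keeps the I-population’s spike gradient from overwhelming the E-population gradient when both are far from threshold.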
Both surrogates take their slope from the module-level constant SURROGATE_SLOPE = 5.0, overridable per-run with --surrogate-slope.
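A surrogate of this kind is usually written as a custom autograd function. The following is a minimal sketch of the fast-sigmoid case, not Pinglab's actual code: the class `FastSigmoidSpike`, the wrapper `fast_sigmoid_spike_sketch`, and the threshold value `v_thresh=1.0` are assumptions made for illustration. Only the slope value 5.0 comes from the text.

```python
import torch

SURROGATE_SLOPE = 5.0  # slope value quoted in the text


class FastSigmoidSpike(torch.autograd.Function):
    """Hard step in the forward pass, fast-sigmoid derivative in the backward pass."""

    @staticmethod
    def forward(ctx, v_minus_thresh, slope):
        ctx.save_for_backward(v_minus_thresh)
        ctx.slope = slope
        return (v_minus_thresh > 0).to(v_minus_thresh.dtype)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        surrogate = 1.0 / (1.0 + ctx.slope * x.abs()) ** 2
        return grad_output * surrogate, None  # no gradient for the slope argument


def fast_sigmoid_spike_sketch(v, v_thresh=1.0, slope=SURROGATE_SLOPE):
    # v_thresh is a hypothetical threshold, chosen only for this example.
    return FastSigmoidSpike.apply(v - v_thresh, slope)


v = torch.tensor([0.5, 1.0, 1.5], requires_grad=True)
spikes = fast_sigmoid_spike_sketch(v)
spikes.sum().backward()
print(spikes, v.grad)
```

The gradient peaks at 1 exactly on threshold and falls off symmetrically on either side, so neurons far from threshold contribute little to the backward pass.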
Gradient stabilisation: --v-grad-dampen
The biophysical models (COBA, PING) face a gradient-scale problem. Naive backprop through a 2000-step trial produces NaN within a few batches. Pinglab handles this with the costate-control framework of Burghi, Pugliese Carratelli & Rule 2024 (“Costate control for nonlinear system identification with applications to excitable systems”), implemented as the single CLI flag --v-grad-dampen.
Why the Jacobian blows up in excitable systems
The forward dynamics of a conductance-based neuron, in continuous time, are
$$\dot{x} = f(x, u; \theta), \qquad x = (v, g)$$
with $v$ the membrane voltage, $g$ the synaptic conductances, $u$ the input and $\theta$ the parameters. The Jacobian $A(t) = \partial f / \partial x$ has block structure
$$A(t) = \begin{pmatrix} A_{vv} & A_{vg} \\ A_{gv} & A_{gg} \end{pmatrix}$$
The diagonal blocks decay (membrane and conductance both have leak terms). The off-diagonal blocks are the cross-coupling and carry the destabilising spectrum: large transient eigenvalues during firing events, which is the very feature that lets the neuron spike fast. Chained across $T$ steps the product blows up by many orders of magnitude.
The costate equation
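BPTT computes parameter gradients by running an adjoint equation backward in time. Define the costate $\lambda(t)$ as the sensitivity of the loss to the state $x(t)$. For a running loss $\ell$ it obeys
$$\dot{\lambda} = -A(t)^\top \lambda - \frac{\partial \ell}{\partial x}, \qquad \lambda(T) = 0, \qquad \frac{\partial \mathcal{L}}{\partial \theta} = \int_0^T \lambda^\top \frac{\partial f}{\partial \theta}\, dt$$
with $A(t) = \partial f / \partial x$ the Jacobian of the forward dynamics.
This is exactly what PyTorch’s autograd computes, in discretised form; the costate equation is linear in $\lambda$ with time-varying coefficient $A(t)^\top$, so its solution norm tracks the integrated exponent of the Jacobian and blows up whenever $A(t)$ has large transient eigenvalues.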
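The link between the continuous costate and the discrete BPTT recursion can be written out explicitly. This is a sketch assuming a forward-Euler discretisation with step $\Delta t$ and a per-step loss $\ell_t$; the symbols $x_t$, $\lambda_t$, $J_t$ and $A_t$ are notation introduced here for illustration.

```latex
% Forward Euler step of the dynamics \dot{x} = f(x, u; \theta)
x_{t+1} = x_t + \Delta t \, f(x_t, u_t; \theta)

% One-step Jacobian, with A_t = \partial f / \partial x evaluated at step t
J_t = \frac{\partial x_{t+1}}{\partial x_t} = I + \Delta t \, A_t

% Discrete costate recursion, run backward from \lambda_T = \partial \ell_T / \partial x_T
\lambda_t = J_t^\top \lambda_{t+1} + \frac{\partial \ell_t}{\partial x_t}

% Unrolled form: the product of Jacobians that backpropagation accumulates
\lambda_t = \sum_{s=t}^{T} \left( J_{s-1} \cdots J_t \right)^\top \frac{\partial \ell_s}{\partial x_s}
```

The unrolled form makes the failure mode visible: each term carries a product of up to $T - t$ one-step Jacobians, so transiently large eigenvalues of $A_t$ compound across the trial.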
Costate control — proportional high-gain
Burghi et al. modify the costate equation (and only the costate equation, not the forward pass) by adding a controller term to its right-hand side.
Pinglab implements the simplest controller they describe: proportional high-gain. Pick a scalar $\gamma$ and scale the backward voltage gradient $\lambda_v$ by $1/\gamma$ at every timestep:
$$\lambda_v \leftarrow \frac{\lambda_v}{\gamma}$$
In code this is one line, _scale_grad(dv, 1.0 / v_grad_dampen) inside the LIF step, with $\gamma$ set by --v-grad-dampen. Typical values of $\gamma$ differ between the unitless standard SNN and COBA/PING; the latter is what the canonical recipes in nb025 and downstream use.
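A gradient-scaling helper of this kind is typically an identity in the forward pass with a rescaled backward pass. The sketch below shows one way to write it; the names `_ScaleGrad` and `scale_grad_sketch` are hypothetical, the dampening value 10.0 is illustrative rather than a recommended setting, and Pinglab's own `_scale_grad` may be implemented differently.

```python
import torch


class _ScaleGrad(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by `scale`."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.scale, None  # no gradient for the scale argument


def scale_grad_sketch(x, scale):
    return _ScaleGrad.apply(x, scale)


v_grad_dampen = 10.0  # illustrative value only
dv = torch.tensor([0.2, -0.1], requires_grad=True)
out = scale_grad_sketch(dv, 1.0 / v_grad_dampen)

# Forward values are untouched; only the backward signal is divided by v_grad_dampen.
(out * torch.tensor([3.0, 4.0])).sum().backward()
print(out, dv.grad)
```

Because the forward value is returned unchanged, the simulated voltages are the same with or without the helper; only the gradient reaching earlier timesteps is attenuated.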
The forward dynamics are unchanged. The forward state is identical with or without the controller; only the gradient flow is modified. Burghi et al.’s Proposition 1 establishes that controlled costate descent has the same critical points as true gradient descent provided $\gamma$ is chosen so the modified costate is uniformly bounded: Adam converges to the same place, it just takes a different route. This is why proportional high-gain is not just gradient clipping in disguise: clipping changes the vector field; the $1/\gamma$ scaling preserves the fixed points.
The cost of the uniform scaling is that the diagonal voltage gradient (the leak/refractory pathway, which we’d rather keep) is attenuated by the same factor as the destabilising off-diagonal block we wanted to suppress. Empirically, large $\gamma$ can silently cap achievable accuracy. The fix is to use the smallest $\gamma$ that still prevents NaN; finding that value on long-trial tasks is open future work.
The training loop
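The network output is passed to cross-entropy loss:
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(z_{i, c_i})}{\sum_{j} \exp(z_{i, j})}$$
where $B$ is the batch size, $z_i$ is the logit vector for sample $i$, and $c_i$ is the true class. Chance-level loss on a 10-class problem is $\ln 10 \approx 2.30$.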
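The chance-level figure can be checked directly. This is a small standard-library sketch of batch-averaged cross-entropy; the function name `cross_entropy` and the example targets are made up for illustration.

```python
import math


def cross_entropy(logits, targets):
    """Mean cross-entropy over a batch of logit vectors and integer class labels."""
    total = 0.0
    for z, c in zip(logits, targets):
        m = max(z)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        total += log_norm - z[c]
    return total / len(logits)


# Uniform logits over 10 classes: the network carries no information about the label.
uniform_logits = [[0.0] * 10 for _ in range(4)]
chance = cross_entropy(uniform_logits, [0, 3, 7, 9])
print(chance, math.log(10))
```

A training loss that sits near this value means the network has not yet learned anything beyond guessing.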
The optimiser is Adam. Gradients are clipped to unit norm (GRAD_CLIP = 1.0) before each step. The best model state (by test accuracy) is saved at each epoch; the saved weights.pth is the best-epoch state, not the final-epoch state.
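The clip-then-step order and the best-epoch bookkeeping can be sketched as follows. Everything except GRAD_CLIP = 1.0, Adam, cross-entropy and the weights.pth filename is a placeholder: the linear model stands in for the spiking network, and the random data, learning rate and epoch count are assumptions for illustration.

```python
import copy

import torch
from torch import nn

GRAD_CLIP = 1.0  # value quoted in the text

torch.manual_seed(0)
model = nn.Linear(8, 10)  # placeholder for the spiking network
optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)  # lr is an assumption
x_train, y_train = torch.randn(64, 8), torch.randint(0, 10, (64,))
x_test, y_test = torch.randn(32, 8), torch.randint(0, 10, (32,))

best_acc, best_state = -1.0, None
for epoch in range(5):
    optimiser.zero_grad()
    loss = nn.functional.cross_entropy(model(x_train), y_train)
    loss.backward()
    # Clip the global gradient norm before the optimiser step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
    optimiser.step()

    with torch.no_grad():
        acc = (model(x_test).argmax(dim=1) == y_test).float().mean().item()
    if acc > best_acc:
        # Keep a copy of the best-epoch weights, not the final-epoch ones.
        best_acc = acc
        best_state = copy.deepcopy(model.state_dict())

# torch.save(best_state, "weights.pth")  # left commented out to avoid writing a file
```

The deep copy matters: `state_dict()` returns references to the live parameters, so without it the "best" state would silently track later epochs.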
Readout
The linear readout is configurable via --readout. Four modes are available:
- spike-count: accumulate last-hidden spikes over the trial and project linearly, $z = W_{\text{out}} \sum_{t=1}^{T} s_t$. Equivalent to spike-rate up to a constant.
- mem-mean: pass spikes through a final LIF without resetting and average its membrane potential across the trial. This is the default for the COBA/PING recipes and is what nb025 and the streaming entries use.
- li: leaky integrator, a non-spiking LIF whose final-step membrane potential is the logit.
- rate: softmax over per-trial spike rates.
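Three of the modes above can be sketched with a single leaky integrator. This is an illustration under stated assumptions, not Pinglab's readout code: the function `readout_sketch`, the decay factor `beta`, the absence of a bias term, and the choice to apply the linear projection before the output membrane are all assumptions.

```python
import torch


def readout_sketch(spikes, weight, beta=0.9):
    """spikes: (T, batch, hidden) tensor of 0/1 values; weight: (classes, hidden)."""
    n_steps = spikes.shape[0]

    # spike-count: sum spikes over the trial, then project linearly.
    spike_count = spikes.sum(dim=0) @ weight.T

    # Non-resetting leaky membrane driven by the projected spikes.
    mem = torch.zeros(spikes.shape[1], weight.shape[0])
    mem_sum = torch.zeros_like(mem)
    for t in range(n_steps):
        mem = beta * mem + spikes[t] @ weight.T
        mem_sum = mem_sum + mem

    return {
        "spike-count": spike_count,
        "mem-mean": mem_sum / n_steps,  # membrane averaged across the trial
        "li": mem,                      # final-step membrane only
    }


torch.manual_seed(0)
spikes = (torch.rand(50, 4, 16) < 0.1).float()
weight = torch.randn(10, 16)
logits = readout_sketch(spikes, weight)
print({name: tuple(value.shape) for name, value in logits.items()})
```

In this sketch the membrane-based modes receive a contribution at every timestep, while spike-count collapses the trial to one vector before the projection.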
The readout choice is one of the few knobs that changes where in the network the gradient enters. mem-mean lets gradient flow through the output LIF’s membrane state at every timestep; spike-count only sees the aggregate.
Firing-rate regularisation
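Many recipes add a penalty on excessive or insufficient hidden firing via --fr-reg-upper-theta, --fr-reg-upper-strength, and the matching lower pair. The penalty is a pair of one-sided terms of the form
$$\mathcal{L}_{\text{fr}} = s_{\text{upper}}\,\phi\!\left(r - \theta_{\text{upper}}\right) + s_{\text{lower}}\,\phi\!\left(\theta_{\text{lower}} - r\right)$$
where $r$ is the per-layer mean firing rate (per-neuron or population, set by --fr-reg-mode), $\theta$ and $s$ are the thresholds and strengths set by the flags, and $\phi$ is zero for negative arguments and increasing for positive ones. It’s the mechanism behind the sweeps in nb025 and the rate-floor framing in ar009 / ar010.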
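A two-sided rate penalty of this kind can be sketched as below. The squared-hinge form is an assumption chosen for illustration, as are the function name `firing_rate_penalty` and the example thresholds; the actual exponent and averaging used by Pinglab are not stated in the text.

```python
import torch


def firing_rate_penalty(rates, upper_theta, upper_strength, lower_theta, lower_strength):
    """Squared-hinge penalty on rates above upper_theta or below lower_theta (assumed form)."""
    over = torch.relu(rates - upper_theta)    # non-zero only for rates that are too high
    under = torch.relu(lower_theta - rates)   # non-zero only for rates that are too low
    return upper_strength * (over ** 2).mean() + lower_strength * (under ** 2).mean()


rates = torch.tensor([2.0, 10.0, 40.0])  # illustrative rates in Hz
penalty = firing_rate_penalty(
    rates, upper_theta=30.0, upper_strength=1.0, lower_theta=5.0, lower_strength=1.0
)
print(penalty)
```

Rates inside the band contribute nothing, so the regulariser only acts on layers that drift silent or saturate.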
Weight init
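Feedforward weights are sampled from a fan-in-normalised half-normal (Dale’s law) or normal (signed):
$$W_{ij} = |\xi_{ij}| \ \text{(Dale’s law)} \quad \text{or} \quad W_{ij} = \xi_{ij} \ \text{(signed)}, \qquad \xi_{ij} \sim \mathcal{N}(0, \sigma^2)$$
with $\sigma$ normalised by the fan-in, and with optional sparsity $p$: a fraction $p$ of entries are zeroed and surviving entries are rescaled by $1/(1-p)$ so the total expected synaptic input per post-neuron is preserved.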
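The sparsity rescaling can be checked numerically. This is a sketch under assumptions: the function `init_weights_sketch`, the `scale` argument, and the choice of standard deviation `scale / sqrt(n_in)` are illustrative, and sparsity is taken to mean the fraction of entries set to zero.

```python
import torch


def init_weights_sketch(n_in, n_out, scale=1.0, dale=True, sparsity=0.0, generator=None):
    """Half-normal (Dale) or normal (signed) weights with optional sparsification."""
    # Assumed normalisation: standard deviation shrinks with the square root of fan-in.
    w = torch.randn(n_out, n_in, generator=generator) * scale / n_in ** 0.5
    if dale:
        w = w.abs()  # half-normal: all weights non-negative
    if sparsity > 0.0:
        keep = torch.rand(n_out, n_in, generator=generator) >= sparsity
        # Rescale survivors so the expected summed input per post-neuron is unchanged.
        w = w * keep / (1.0 - sparsity)
    return w


gen = torch.Generator().manual_seed(0)
dense = init_weights_sketch(1000, 200, generator=gen)
sparse = init_weights_sketch(1000, 200, sparsity=0.8, generator=gen)
print(dense.sum(dim=1).mean(), sparse.sum(dim=1).mean())
```

The two printed row-sum means should agree closely even though 80% of the sparse matrix is zero.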
Dale’s law under Adam
When Dale’s law is enforced, weight matrices are clamped to $[0, \infty)$ in the forward pass and explicitly projected back into the non-negative cone by project_dales() after every optimiser step. The forward clamp keeps the simulation honest; the post-step projection keeps the trainable parameters themselves from drifting arbitrarily far into the negative orthant under Adam’s momentum.
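The clamp-then-project pattern can be sketched as follows. The helper `project_dales_sketch` is a stand-in written for this example, not the real project_dales(); the toy loss, learning rate and tensor shapes are assumptions chosen so that Adam pushes the weights negative.

```python
import torch


def project_dales_sketch(params):
    """Project parameters back onto the non-negative cone, in place."""
    with torch.no_grad():
        for p in params:
            p.clamp_(min=0.0)


torch.manual_seed(0)
w = torch.nn.Parameter(torch.rand(4, 6) * 0.01)
optimiser = torch.optim.Adam([w], lr=0.1)
x = torch.rand(8, 6)

for _ in range(3):
    optimiser.zero_grad()
    out = x @ w.clamp(min=0.0).T  # forward clamp: the simulation only sees w >= 0
    out.sum().backward()          # this toy loss drives every weight downward
    optimiser.step()              # Adam can overshoot below zero here
    project_dales_sketch([w])     # post-step projection undoes the overshoot

print(w.min())
```

Without the projection the stored parameters would keep sinking below zero while the forward clamp hid the drift, and it could take many later updates before a weight became effective again.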
Tasks and inputs
The models are exercised against three classes of input.
Synthetic conductance
Direct conductance injection into layer-1 E neurons, used for baseline oscillation studies where we want PING dynamics in isolation from any encoding stage. Drive is generated Börgers-style from three terms: a step function $\bar{g}(t)$ that switches between async and PING-regime values during a stimulus window, a per-neuron heterogeneity term $\delta_i$, and a discrete Ornstein–Uhlenbeck noise process $\eta_i(t)$ with time constant $\tau_{\text{OU}}$ given in ms. Drive is calibrated at a reference $\Delta t$ and rescaled at runtime so that the steady-state AMPA conductance is invariant across $\Delta t$.
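A drive of this shape can be sketched as below. This assumes the three terms are simply added, which the text does not confirm; the function `synthetic_drive_sketch` and every numeric value in the example call are hypothetical. The Ornstein–Uhlenbeck update uses the exact discretisation, so its stationary variance does not depend on the timestep.

```python
import math

import torch


def synthetic_drive_sketch(n_neurons, n_steps, dt_ms, g_base, g_stim, stim_window_ms,
                           het_sd, ou_sd, tau_ou_ms, generator=None):
    """Step drive + per-neuron offset + OU noise, returned as (n_steps, n_neurons)."""
    t_ms = torch.arange(n_steps) * dt_ms
    in_window = (t_ms >= stim_window_ms[0]) & (t_ms < stim_window_ms[1])
    step = torch.where(in_window, torch.tensor(g_stim), torch.tensor(g_base))

    het = het_sd * torch.randn(n_neurons, generator=generator)  # fixed per neuron

    # Exact discretisation of an OU process with stationary standard deviation ou_sd.
    decay = math.exp(-dt_ms / tau_ou_ms)
    noise_scale = ou_sd * math.sqrt(1.0 - decay ** 2)

    eta = torch.zeros(n_neurons)
    drive = torch.empty(n_steps, n_neurons)
    for t in range(n_steps):
        eta = decay * eta + noise_scale * torch.randn(n_neurons, generator=generator)
        drive[t] = step[t] + het + eta  # additive combination is an assumption
    return drive


gen = torch.Generator().manual_seed(0)
drive = synthetic_drive_sketch(50, 400, 0.5, 1.0, 3.0, (50.0, 150.0), 0.1, 0.1, 5.0, gen)
print(drive.shape)
```

Because the decay and noise scale are both derived from `dt_ms / tau_ou_ms`, halving the timestep changes the number of samples but not the statistics of the noise.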
Synthetic spikes
Poisson spike trains over input neurons with a rate that steps between a baseline and a stimulus level during a fixed window. Used when we want a spiking input with no spatial structure.
Image and audio datasets
Supported datasets are scikit-digits ($8 \times 8$ pixels, 10 classes), MNIST ($28 \times 28$ pixels, 10 classes), and SHD (Spiking Heidelberg Digits, 700 channels, 20 classes). MNIST pixels become Poisson spike trains: for a pixel with normalised intensity $x_i \in [0, 1]$, input neuron $i$ fires a Bernoulli spike at each step with probability
$$p_i = x_i\, r_{\max}\, \Delta t$$
so the per-neuron rate is $x_i\, r_{\max}$ Hz, independent of $\Delta t$ (with $\Delta t$ expressed in seconds and $r_{\max}$ the maximum rate). The default trial length $T$ is given in ms. Encoding is stochastic, so the network sees a different spike realisation of the same image every epoch, a form of data augmentation intrinsic to the rate code. At evaluation time, a seeded torch.Generator is threaded through the encoder so train-time eval and standalone infer on the same weights produce identical spike trains.
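The encoder and its seeded evaluation path can be sketched like this. The function `poisson_encode_sketch`, the parameter `max_rate_hz`, and the values 100 Hz, 0.1 ms and 2000 steps are assumptions for illustration; only the use of a seeded torch.Generator comes from the text.

```python
import torch


def poisson_encode_sketch(pixels, n_steps, dt_ms, max_rate_hz, generator=None):
    """Bernoulli spikes per step with probability intensity * rate * dt."""
    p = pixels.clamp(0.0, 1.0) * max_rate_hz * dt_ms / 1000.0  # dt converted to seconds
    rand = torch.rand((n_steps, *pixels.shape), generator=generator)
    return (rand < p).float()


pixels = torch.tensor([0.0, 0.5, 1.0])

# Two generators with the same seed reproduce the same spike trains exactly.
gen_a = torch.Generator().manual_seed(123)
gen_b = torch.Generator().manual_seed(123)
spikes_a = poisson_encode_sketch(pixels, 2000, 0.1, 100.0, gen_a)
spikes_b = poisson_encode_sketch(pixels, 2000, 0.1, 100.0, gen_b)
print(torch.equal(spikes_a, spikes_b))
```

During training the generator argument can be left as None so each epoch draws a fresh realisation, while evaluation passes a seeded generator to make results repeatable.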
SHD ships as spike trains directly, so no encoder is needed; the trial is the recording.