Exploring Stochastic Modeling in Visual Place Recognition

Introduction

You know how VAEs have this really neat trick of handling uncertainty? Well, after diving deep into VAEs in my previous blog post, I had this "wait a minute" moment. What if we could bring that same magic to visual place recognition (VPR)? For those who aren't familiar with VPR, it's basically teaching robots to recognize places they've been before - kind of like how you'd recognize your favorite coffee shop, but for machines.

Now, I should probably mention upfront that this is one of those "failed experiment" stories. But hey, sometimes the attempts that don't work out teach us the most interesting lessons, right? So buckle up - I'm going to share my journey of trying (and ultimately failing) to make this work, and all the fascinating insights I gained along the way.

1. Motivation

Here's the thing about current VPR systems - they're a bit like that friend who's super confident about everything but then gets completely lost when things change slightly. They face some pretty real challenges:

Looking at these challenges, I had this thought: "Hey, isn't uncertainty kind of the common thread here?" I mean, VAEs handle uncertainty beautifully by saying "we're not 100% sure about this, but here's our best guess and how confident we are about it." So I wondered - what if we could bring that same mindset to place recognition?

2. The Stochastic VPR Framework

2.1 Traditional Approach: Contrastive Learning

Before diving into my stochastic approach, let's look at how VPR typically works. The traditional approach uses contrastive learning, where we train a network to map images to point embeddings in a way that makes similar places close together and different places far apart.

Usually, we have a backbone network \(f_\theta\) (written \(f\) below for short) that maps an image \(x\) to a feature vector:

\[f_\theta: X \rightarrow \mathbb{R}^d\]

And we train it with a contrastive loss such as InfoNCE:

\[L_{cont} = -\log \frac{\exp(f(x_i)^T f(x_j)/\tau)}{\sum_{k \neq i} \exp(f(x_i)^T f(x_k)/\tau)}\]

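where \((x_i, x_j)\) are images of the same place, the sum in the denominator runs over the other images \(x_k\) in the batch (including the positive \(x_j\)), and \(\tau\) is a temperature parameter.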
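To make the loss concrete, here is a minimal NumPy sketch of an in-batch InfoNCE. It assumes L2-normalised embeddings and that row `i` of `positives` shows the same place as row `i` of `anchors`, so the other rows act as negatives. The function name `info_nce` and the default `tau=0.1` are illustrative choices, not the exact training code.

```python
import numpy as np

def info_nce(anchors, positives, tau=0.1):
    """Mean InfoNCE over a batch; positives[i] is the positive for anchors[i]."""
    # Assumption: embeddings are L2-normalised, so the dot product is cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / tau                               # (B, B): anchor i vs. every candidate k
    logits = logits - logits.max(axis=1, keepdims=True)    # subtract row max for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))              # positives sit on the diagonal
```

The loss is small when each anchor scores highest against its own positive, and grows when some other row in the batch looks more similar.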

2.2 The Stochastic Approach

Here's where I had what I thought was a clever idea: What if each image of a place isn't just a point in feature space, but a sample from some underlying probability distribution? Think about it - a place under different conditions (lighting, weather, viewpoint) could be seen as different samples from a place-specific distribution.

This approach felt promising for a couple of reasons. First, it seemed like a natural way to handle uncertainty. Instead of forcing every image to sit rigidly at a single point in feature space, I thought, "What if each place could be fluid, represented as a probability cloud that shifts slightly with lighting, weather, or viewpoint?" It's like saying, "I'm pretty sure this is the same place—but here's how confident I am."

Second, the sampling process itself felt like a hidden bonus, kind of like getting free data augmentation without the extra work. Each time we sample from the distribution, it's as if we're seeing the place from a slightly different perspective, making the model naturally more robust to real-world variations.

2.3 Mathematical Formulation

For an input image \(x_i\), our backbone network now outputs distribution parameters:

\[f_\theta(x_i) = (\mu_i, \Sigma_i)\]

This defines a latent distribution for each image:

\[q_i = \mathcal{N}(\mu_i, \Sigma_i)\]

We generate features through the reparameterization trick:

\[\varepsilon \sim \mathcal{N}(0, I)\]

\[z_i = \mu_i + \Sigma_i^{1/2} \varepsilon\]

Final descriptor after pooling:

\[d_i = g(z_i)\]
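As a rough sketch of the sampling step, the snippet below assumes a diagonal covariance parameterised by a log-variance vector (a common VAE-style choice, not something fixed by the formulation above). The `pool` argument is a hypothetical stand-in for the pooling head \(g\).

```python
import numpy as np

def sample_descriptor(mu, log_var, rng, pool=None):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), then apply the pooling head g."""
    sigma = np.exp(0.5 * log_var)          # diagonal Sigma^{1/2}, assumed parameterisation
    eps = rng.standard_normal(mu.shape)    # noise is sampled outside the network's parameters
    z = mu + sigma * eps                   # differentiable w.r.t. mu and sigma
    return z if pool is None else pool(z)  # pool is a placeholder for g
```

Because the randomness lives entirely in `eps`, gradients can flow back through `mu` and `sigma`, which is the whole point of the trick.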

The training objective combines two components:

1. Distribution Alignment Loss (for positive pairs):

\[L_{KL} = D_{KL}(q_i \,\|\, q_j) + D_{KL}(q_j \,\|\, q_i)\]

2. Contrastive Learning Loss:

\[L_{cont} = -\log \frac{\exp(d_i^T d_j/\tau)}{\sum_{k \neq i} \exp(d_i^T d_k/\tau)}\]

Total Loss:

\[L_{total} = L_{cont} + \lambda L_{KL}\]

Training Objective:

\[\theta^* = \arg\min_\theta \mathbb{E}_{x_i, x_j \in P}[L_{total}]\]

where \(P\) is the set of positive pairs (images of the same place).
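Here is one way the combined objective could look in code for a single positive pair. This is a sketch under the assumption that \(\Sigma_i\) is diagonal, so the KL term has a closed form; the helper names and the default `lam=0.1` are made up for illustration.

```python
import numpy as np

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """D_KL(N(mu_p, diag(var_p)) || N(mu_q, diag(var_q))), summed over dimensions."""
    return 0.5 * float(np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    ))

def symmetric_kl(mu_i, var_i, mu_j, var_j):
    """L_KL = D_KL(q_i || q_j) + D_KL(q_j || q_i) for a positive pair."""
    return kl_diag_gauss(mu_i, var_i, mu_j, var_j) + kl_diag_gauss(mu_j, var_j, mu_i, var_i)

def total_loss(l_cont, mu_i, var_i, mu_j, var_j, lam=0.1):
    """L_total = L_cont + lambda * L_KL; l_cont is computed elsewhere on the descriptors."""
    return l_cont + lam * symmetric_kl(mu_i, var_i, mu_j, var_j)
```

Note that the symmetric KL is zero whenever the two distributions match, no matter how wide they both are.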

Note that while I used InfoNCE-style contrastive loss here, the framework is flexible - we could swap in other contrastive losses like triplet loss or circle loss. The key innovation is in the stochastic modeling of the features, not the specific choice of contrastive loss.

3. Experimental Setup

3.1 Implementation Details

The model was implemented with the following architecture:

The probabilistic InfoNCE loss:

\[\mathcal{L} = -\mathbb{E}\left[\log \frac{\exp(\text{sim}(I, I^+)/\tau)}{\sum_{I^-} \exp(\text{sim}(I, I^-)/\tau)}\right]\]

4. Initial Attempts and Observations

4.1 The Runaway Variance Problem

In my first training attempts, I noticed something interesting - the variance kept increasing without bound. This was puzzling at first, but after diving deeper into the mathematics, it made sense: unlike VAEs, which have a reconstruction loss to constrain the variance, our contrastive learning setup wasn't providing enough gradient signal to keep the variance in check.

4.2 Why VAEs Work: A Natural Variance Control

In VAEs, the reconstruction loss has an explicit dependence on variance:

\[-\log \mathcal{N}(x; \hat{x}, \sigma^2 I) = \frac{1}{2\sigma^2}\|x-\hat{x}\|^2 + \frac{D}{2}\log\sigma^2 + \text{const}\]

This provides natural bounds:
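A short worked step makes this concrete, assuming a single shared variance \(\sigma^2\) across all \(D\) dimensions as in the expression above. Writing \(r = \|x-\hat{x}\|^2\) and setting the derivative with respect to \(\sigma^2\) to zero:

\[\frac{\partial}{\partial \sigma^2}\left[\frac{r}{2\sigma^2} + \frac{D}{2}\log\sigma^2\right] = -\frac{r}{2\sigma^4} + \frac{D}{2\sigma^2} = 0 \quad\Rightarrow\quad \sigma^2 = \frac{r}{D}\]

For \(r > 0\), the first term blows up as \(\sigma^2 \rightarrow 0\) and the log term grows without limit as \(\sigma^2 \rightarrow \infty\), so the loss is minimised at a finite variance equal to the mean squared residual.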

4.3 Single-Sample Contrastive: The Problem

Consider a positive pair \((x_i, x_j)\). In the standard contrastive setup:

1. Sample one point from each distribution:

\[z_i = \mu_i + \sigma_i \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, I)\]

\[z_j = \mu_j + \sigma_j \epsilon_j, \quad \epsilon_j \sim \mathcal{N}(0, I)\]

2. The contrastive loss sees:

\[\|z_i - z_j\|^2 = \| (\mu_i + \sigma_i \epsilon_i) - (\mu_j + \sigma_j \epsilon_j) \|^2\]

3. The gradient w.r.t \(\sigma_i\) for a single dimension \(d\) is:

\[\frac{\partial \|z_i - z_j\|^2}{\partial \sigma_{i,d}} = 2(z_i - z_j)_d \epsilon_{i,d}\]

This is highly problematic because:

4.4 Multi-Sample: How It Helps

Instead of one sample, we draw \(K\) samples per image:

\[z_i^{(k)} = \mu_i + \sigma_i \epsilon_i^{(k)}, \quad k = 1...K\]

The loss becomes an average over all sample pairs:

\[L_{multi} = \frac{1}{K^2} \sum_{k=1}^K \sum_{l=1}^K \|z_i^{(k)} - z_j^{(l)}\|^2\]

Taking expectation:

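\[\mathbb{E}[L_{multi}] = \|\mu_i - \mu_j\|^2 + \|\sigma_i\|^2 + \|\sigma_j\|^2\]

where \(\|\sigma_i\|^2 = \sum_d \sigma_{i,d}^2\) is the sum of the per-dimension variances.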

What's fascinating is how this multi-sample approach quietly reshapes the learning process. The expectation itself is the same one a single sample is aiming at, but with one sample the variance term is buried under noise, like an unruly guest. Averaging over \(K^2\) sample pairs lets each batch's loss track that expectation closely, so variance pulls up a chair at the table as an explicit, consistent term. No more hiding in the background.

Plus, drawing multiple samples per image smooths out the noisy chaos of gradient updates. Imagine trying to understand a song by hearing just one note—that's single-sample training. But when you hear the whole melody, with harmonies layered in? That's what multi-sample training feels like. It gives the model a clearer, more consistent sense of how variance behaves, turning random noise into a well-orchestrated tune.
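A small Monte Carlo check can illustrate both points. The sketch below uses toy values and the squared-distance surrogate from this section (not the full InfoNCE loss); the function names are hypothetical. It estimates the average loss and the spread of the gradient with respect to \(\sigma_i\) for a given \(K\).

```python
import numpy as np

def multi_sample_loss_and_grad(mu_i, sigma_i, mu_j, sigma_j, K, rng):
    """Return L_multi and its gradient w.r.t. sigma_i for one draw of K samples per image."""
    eps_i = rng.standard_normal((K, mu_i.size))
    eps_j = rng.standard_normal((K, mu_j.size))
    z_i = mu_i + sigma_i * eps_i                       # (K, D)
    z_j = mu_j + sigma_j * eps_j                       # (K, D)
    diff = z_i[:, None, :] - z_j[None, :, :]           # (K, K, D): every sample pair (k, l)
    loss = float(np.mean(np.sum(diff ** 2, axis=-1)))  # 1/K^2 * sum over pairs
    # Per pair, d||z_i - z_j||^2 / d sigma_{i,d} = 2 * diff_d * eps_i_d; average over pairs.
    grad = np.mean(2.0 * diff * eps_i[:, None, :], axis=(0, 1))
    return loss, grad

def estimate_stats(K, trials=4000, seed=0):
    """Mean loss and average per-dimension std of the sigma_i gradient over many draws."""
    rng = np.random.default_rng(seed)
    mu_i, mu_j = np.zeros(4), np.ones(4)               # toy values: ||mu_i - mu_j||^2 = 4
    sigma_i = sigma_j = np.full(4, 0.5)                # sum of sigma^2 is 1 for each image
    losses, grads = [], []
    for _ in range(trials):
        loss, grad = multi_sample_loss_and_grad(mu_i, sigma_i, mu_j, sigma_j, K, rng)
        losses.append(loss)
        grads.append(grad)
    return float(np.mean(losses)), float(np.std(grads, axis=0).mean())
```

With these toy values the mean loss should sit near 6 (that is, 4 + 1 + 1) for both `K=1` and `K=8`, while the gradient spread shrinks noticeably as `K` grows. That is the melody-versus-single-note effect in numbers.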

5. The Performance Trade-off

5.1 The Batch Size Dilemma

While the multi-sample approach solved our variance stability issues, it introduced a new problem. The final model performed about 2% worse than a traditional deterministic approach. Why? The answer lies in how contrastive learning really works:

5.2 The Implicit Trade-off

We essentially traded batch diversity for sample diversity. While our noised samples provided some additional positive pairs, they weren't as valuable as the hard negatives we lost by reducing the batch size. The multiple samples per image were essentially giving us "easy" positive pairs, while what we really needed were those challenging negative examples that come with larger batch sizes.
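The trade-off is easy to see with some back-of-the-envelope arithmetic. The sketch below assumes a fixed budget of descriptors that fit in memory per step and that every other image in the batch shows a different place; the function name and the budget of 256 are purely illustrative, not the numbers from my runs.

```python
def negatives_per_anchor(descriptor_budget, samples_per_image):
    """Other images each anchor can be contrasted against under a fixed descriptor budget."""
    if samples_per_image < 1 or descriptor_budget < samples_per_image:
        raise ValueError("budget must cover at least one image")
    batch_images = descriptor_budget // samples_per_image  # images that fit in one step
    return batch_images - 1                                # everything except the anchor itself

# Hypothetical budget of 256 descriptors per step.
single = negatives_per_anchor(256, samples_per_image=1)   # 255 other images
multi = negatives_per_anchor(256, samples_per_image=8)    # 31 other images
```

Going from one sample to eight cuts the pool of candidate negatives by roughly a factor of eight, and the hardest negatives are exactly the ones most likely to get squeezed out.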

6. Key Insights from the Exploration

Stepping back from this experiment, I realized that the value wasn't just in the results, but in what the process revealed about the intersection of stochastic modeling and visual place recognition (VPR). This journey surfaced a few pivotal insights that reshaped my understanding of the problem space:

First, there's an inherent tension between richness and efficiency. Stochastic models are great at capturing nuanced variability, but they come at a steep computational cost. The more we tried to model uncertainty, the more we had to wrestle with memory constraints, ultimately forcing trade-offs that chipped away at performance.

Then came the revelation about what really matters in VPR. I went in thinking that uncertainty modeling would be the star of the show, but it turns out that distinguishing places that look frustratingly similar—those hard negatives—is the real MVP. It's less about embracing uncertainty and more about sharpening the system's ability to say, “No, this isn't the same place.”

Lastly, I underestimated the subtle yet powerful influence of batch dynamics. In contrastive learning, batch size isn't just a technical parameter—it fundamentally shapes how well the model learns. The need to reduce batch size to accommodate stochastic sampling inadvertently weakened the model's capacity to find meaningful negative examples, which turned out to be critical for success.

Conclusion

This wasn't the outcome I had hoped for, but it was exactly the outcome I needed. The experiment with stochastic modeling in VPR didn't yield a groundbreaking new method, but it did unravel some important truths about the problem itself. Sometimes, exploring an idea to its limits helps clarify not just what works, but why certain things don't.

What I've learned is that VPR isn't just about handling uncertainty—it's about mastering distinctiveness. The ability to confidently differentiate between places, even under challenging conditions, trumps the elegance of probabilistic models when they come with too much computational baggage. Stochastic modeling may have its place, but not in the way I initially imagined.

And that's the beauty of failed experiments. They don't just close doors—they light up the paths you might've overlooked. While I won't be pursuing this direction further, I leave with a deeper appreciation for the practical nuances of VPR and a clearer sense of where real innovation might lie.

Citation

Please cite this work as:

@misc{stochastic-modeling-vpr,
  author    = {Tianrun Hu},
  title     = {Exploring Stochastic Modeling in Visual Place Recognition},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://h-tr.github.io/blog/posts/stochastic-modeling-vpr.html}
}