ICML 2026

Structure Abstraction and Generalization
in a Hippocampal-Entorhinal
Inspired World Model

Tianqiu Zhang*¹, Muyang Lyu*¹, Xiao Liu², Si Wu¹

¹Peking University ²HHMI Janelia


TL;DR: A hippocampal-entorhinal inspired world model separates content-rich episodic states from reusable transition structures, enabling prediction and zero-shot structural transfer across objects, scenes, and simulated environments.

Abstract

Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal-entorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.

HPC-MEC Circuit

From Cognitive Map to World Model

The hippocampal-entorhinal circuit suggests a useful division of labor: HPC binds content-rich episodic scenes, while MEC maintains compact relational structure and supports path integration. Our model turns this biological correspondence into a self-supervised visual world model.

Functional correspondence between the hippocampal-entorhinal circuit and the proposed world model.

Model Overview

Observation-Only Videos Become Reusable Transition Structures

The full pipeline starts from raw visual sequences, separates content and structure, infers latent transitions, then rolls the structured state forward to synthesize future frames.

1. Encode observations: frames are mapped to visual embeddings with a pretrained multi-scale VQ-VAE.
2. Separate HPC and MEC: HPC preserves content-rich episodic states, while MEC compresses relational structure.
3. Infer latent transitions: the inverse model extracts low-dimensional dynamics from consecutive MEC states.
4. Predict through integration: latent transitions drive CANN-inspired path integration to generate future observations.
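The four steps above can be sketched end to end as a toy pipeline. This is a minimal illustration only: the random linear maps below are hypothetical stand-ins for the trained VQ-VAE encoder, HPC/MEC encoders, inverse model, and forward model, and the dimensions are arbitrary.

```python
import numpy as np

# Toy stand-ins for the four pipeline stages (NOT the trained model):
# random linear maps replace the VQ-VAE, HPC/MEC encoders, inverse model,
# and CANN-inspired forward model; dimensions are illustrative.
rng = np.random.default_rng(0)
D_OBS, D_HPC, D_MEC, D_Z = 64, 32, 8, 2

W_enc = rng.standard_normal((D_OBS, D_HPC)) / np.sqrt(D_OBS)  # observation -> HPC
W_mec = rng.standard_normal((D_HPC, D_MEC)) / np.sqrt(D_HPC)  # HPC -> MEC
W_inv = rng.standard_normal((D_MEC, D_Z)) / np.sqrt(D_MEC)    # MEC difference -> transition
W_fwd = rng.standard_normal((D_Z, D_MEC)) / np.sqrt(D_Z)      # transition -> MEC shift

def encode(o):
    # steps 1-2: visual embedding, then content-rich HPC and compact MEC states
    p = np.tanh(o @ W_enc)
    g = np.tanh(p @ W_mec)
    return p, g

def infer_transition(g_t, g_next):
    # step 3: inverse model reads the difference of consecutive MEC states
    return (g_next - g_t) @ W_inv

def path_integrate(g_t, z):
    # step 4: transition-driven update of the MEC state
    return g_t + z @ W_fwd

frames = rng.standard_normal((2, D_OBS))  # a toy two-frame "video"
(_, g0), (_, g1) = encode(frames[0]), encode(frames[1])
z = infer_transition(g0, g1)
g_pred = path_integrate(g1, z)            # predicted next MEC state
print(g_pred.shape)  # (8,)
```

In the real model each map is a learned network and the MEC update follows CANN dynamics, but the data flow is the same.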

Detailed model architecture for the hierarchical HPC-MEC world model.
The model first extracts visual embeddings, separates content-rich HPC states from compact MEC structure states, and predicts future observations through latent transition-driven dynamics.

Method

Hierarchical World Model Inspired by HPC-MEC Coupling

Visual Inference

Video frames are encoded into observation embeddings, then lifted into hippocampal embeddings that preserve content-rich episodic details before being compressed into medial-entorhinal embeddings.

Path Integration

A CANN-inspired MEC dynamics module updates abstract states with velocity-like latent transitions, producing the next MEC embedding as a structured phase shift.

Inverse Dynamics

An inverse model distills consecutive MEC embeddings into low-dimensional latent transitions, encouraging transitions to capture content-free dynamics rather than appearance.

Overview of the HPC-MEC coupling model and velocity-like path integration mechanism.

Visual inference flow

\[ \mathbf{o}_{1:T} \rightarrow \mathbf{s}^{\mathrm{inf}}_{1:T} \rightarrow \mathbf{p}^{\mathrm{inf}}_{1:T} \rightarrow \mathbf{g}^{\mathrm{inf}}_{1:T} \]

Observations are first embedded visually, then encoded into HPC states and compressed into MEC states.

Latent transition inference

\[ \mathbf{z}_{t}=f_{\mathrm{inverse}}( \mathbf{g}^{\mathrm{inf}}_{t+1} \ominus \mathbf{g}^{\mathrm{inf}}_{t}) \]

The inverse model reads the difference between consecutive MEC embeddings and extracts content-free transition structure.
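To see why a difference of structure states can be content-free, assume a toy ring-shaped MEC code \((\cos\theta, \sin\theta)\): the recovered transition depends only on the phase change, not on which object produced it. This is a hand-built sketch under that assumption, not the paper's learned inverse model.

```python
import numpy as np

# Toy illustration: on a ring-shaped MEC code, the inverse model recovers the
# content-free rotation angle from two consecutive states, regardless of the
# starting phase (i.e., regardless of which "object" is rotating).

def mec_state(theta):
    return np.array([np.cos(theta), np.sin(theta)])

def f_inverse(g_t, g_next):
    # phase difference between consecutive ring states
    return np.arctan2(g_next[1], g_next[0]) - np.arctan2(g_t[1], g_t[0])

dtheta = 0.3
for theta0 in (0.0, 1.2):  # two "objects" at different phases on the ring
    z = f_inverse(mec_state(theta0), mec_state(theta0 + dtheta))
    print(round(z, 6))  # 0.3 in both cases
```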

Path integration in MEC

\[ \mathbf{g}^{\mathrm{gen}}_{t+1}= \mathbf{g}^{\mathrm{gen}}_{t}\oplus f_{\mathrm{forward}}(\mathbf{z}_{t}, \mathbf{g}^{\mathrm{gen}}_{t}) \]

The transition acts like a velocity input, shifting the MEC state on a CANN-inspired manifold.

Visual feedback correction

\[ \mathbf{g}^{\mathrm{gen}}_{t+1}= \mathbf{g}^{\mathrm{inf}}_{t}\oplus f_{\mathrm{forward}}(\mathbf{z}_{t}, \mathbf{g}^{\mathrm{inf}}_{t}) \]

When observations are available, inferred MEC states correct accumulated path-integration error.
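The difference between the two update rules can be sketched on the same toy ring manifold (again a hypothetical stand-in for the CANN-inspired MEC dynamics): an open-loop rollout accumulates a small per-step transition error, while restarting integration from the inferred MEC state whenever an observation is available keeps the error bounded to a single step.

```python
import numpy as np

# Toy ring manifold: rollout vs. feedback-corrected path integration.
# The inverse model is assumed to slightly overestimate each transition.

def forward(g, z):
    # exact rotation of the ring state by angle z
    c, s = np.cos(z), np.sin(z)
    return np.array([c * g[0] - s * g[1], s * g[0] + c * g[1]])

z_true, z_noisy = 0.2, 0.21
true_theta, g_rollout = 0.0, np.array([1.0, 0.0])
for t in range(10):
    true_theta += z_true
    g_rollout = forward(g_rollout, z_noisy)   # open-loop: errors accumulate
g_true = np.array([np.cos(true_theta), np.sin(true_theta)])
err_open = np.linalg.norm(g_rollout - g_true)

# closed-loop: integrate one step from the inferred (observation-corrected) state
err_closed = np.linalg.norm(forward(g_true, z_noisy) - forward(g_true, z_true))
print(err_open > err_closed)  # True: feedback bounds the accumulated error
```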

Manifold Analysis

HPC Keeps Content, MEC Reveals Shared Structure

The manifold analysis probes whether the hierarchy actually separates appearance from transition structure. Rotation datasets make this visible because objects can share the same transformation while differing in shape, texture, and periodicity.

Periodic and in-class sharing

MEC embeddings form cleaner shared rotation trajectories across objects, while HPC embeddings preserve more object identity. This matches the intended division: HPC binds scene-specific episodic content; MEC abstracts relational dynamics that can be reused.

  • Periodic objects produce structured circular trajectories in latent space.
  • MEC embeddings overlap more strongly within object classes.
  • Transition probes decode dynamics best from latent transitions and MEC differences.
UMAP and classification analysis of HPC and MEC embeddings.
MEC embeddings form clearer shared rotation structures, while HPC embeddings retain more object-specific content.

Embedding and transition probes

| Probe | HPC state | MEC state | HPC transition | MEC transition | Latent transition |
|---|---|---|---|---|---|
| Transformation decoding accuracy | \(0.3330 \pm 0.0163\) | \(0.3486 \pm 0.0156\) | \(0.8386 \pm 0.0263\) | \(0.8868 \pm 0.0212\) | \(0.9064 \pm 0.0145\) |
| Robotic sequence cosine similarity | \(0.024 \pm 0.061\) | \(0.146 \pm 0.063\) | \(0.114 \pm 0.056\) | \(0.152 \pm 0.057\) | \(0.235 \pm 0.021\) |
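A decoding probe of this kind can be illustrated with a hypothetical nearest-centroid classifier on toy transition vectors (the paper's actual probe and features may differ; the clusters here are synthetic and well separated, so the toy accuracy is trivially high, unlike the reported values):

```python
import numpy as np

# Hypothetical nearest-centroid probe: decode the transformation class
# from transition vectors. Two synthetic, well-separated classes stand in
# for e.g. "rotate" vs. "scale" transitions.
rng = np.random.default_rng(1)
z_rotate = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(50, 2))
z_scale = rng.normal(loc=[0.0, 1.0], scale=0.1, size=(50, 2))
X = np.vstack([z_rotate, z_scale])
y = np.array([0] * 50 + [1] * 50)

centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
accuracy = (pred == y).mean()
print(accuracy)  # 1.0 on this well-separated toy data
```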

Results

Abstraction and Generalization

Once transition structure is separated from visual content, the model can use it in two ways: predict future observations and transfer dynamics into new visual contexts.

Abstraction

Latent transitions support predictive world modeling

For one-step prediction, the model extracts a transition from the input video and generates the next frame with matching dynamics. In autoregressive rollouts, it composes latent transitions over time; visual feedback can then correct accumulated path-integration error.

One-step and autoregressive prediction results across SSv2 and COIL-100.
The model supports one-step and autoregressive prediction, and visual feedback reduces accumulated path integration error.

Generalization

Reusable structures transfer across objects and scenes

Latent transitions extracted from one sequence can be applied to another context. The generated frames preserve target-scene content while following source-sequence dynamics, demonstrating zero-shot structural reuse across human-object videos and simulated object transformations.

Structural generalization results transferring latent transitions across contexts.
Latent transitions extracted from one context can be reused in different objects and scenes, demonstrating structural generalization.
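On the toy ring code used above, zero-shot structural transfer reduces to extracting the transition from a source pair and applying it to an unrelated target state: the target keeps its own content (here, its starting phase) while following the source dynamics. This is an assumed didactic model, not the trained system.

```python
import numpy as np

# Toy zero-shot transfer: a transition extracted from a source sequence is
# applied to a different target's MEC state on the ring manifold.

def mec(theta):
    return np.array([np.cos(theta), np.sin(theta)])

def f_inverse(g_t, g_next):
    # recover the transition (rotation angle) between consecutive states
    return np.arctan2(g_next[1], g_next[0]) - np.arctan2(g_t[1], g_t[0])

def f_forward(g, z):
    # apply the transition as a rotation of the MEC state
    c, s = np.cos(z), np.sin(z)
    return np.array([c * g[0] - s * g[1], s * g[0] + c * g[1]])

z = f_inverse(mec(0.0), mec(0.4))        # transition from the source sequence
g_target_next = f_forward(mec(1.0), z)   # applied to an unrelated target state
print(np.allclose(g_target_next, mec(1.4)))  # True: source dynamics, target content
```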

Ablation setup

We evaluate OOD structure reuse by extracting latent transitions from a source sequence and applying them to a different target context. A successful model should preserve target content while following source dynamics. The ablations isolate the main design choices:

  • Unified latent space: removes the MEC layer and performs transition inference directly in the content-rich HPC space, testing whether hierarchical separation is necessary.
  • Without CANN: replaces CANN-inspired path integration with a standard MLP state-to-state transition, testing whether structured MEC dynamics matter.
  • VQ-VAE ablation: removes the HPC-MEC hierarchy and learns transitions directly on pretrained visual embeddings, testing whether the structure comes merely from the visual encoder.

Quantitative comparison on OOD structure reuse

| Model | R one-step ↑ | R autoreg. ↑ | SSIM one-step ↑ | SSIM autoreg. ↑ | LPIPS one-step ↓ | LPIPS autoreg. ↓ |
|---|---|---|---|---|---|---|
| Our model w/ unified latent space | \(2.054 \pm 0.521\) | \(1.542 \pm 0.246\) | \(0.901 \pm 0.007\) | \(0.886 \pm 0.008\) | \(0.126 \pm 0.008\) | \(0.179 \pm 0.008\) |
| Our model w/o CANN | \(2.403 \pm 0.553\) | \(1.859 \pm 0.396\) | \(0.894 \pm 0.022\) | \(0.888 \pm 0.009\) | \(0.149 \pm 0.009\) | \(0.177 \pm 0.010\) |
| VQ-VAE ablation | \(2.035 \pm 0.229\) | \(1.796 \pm 0.173\) | \(0.892 \pm 0.009\) | \(0.883 \pm 0.009\) | \(0.158 \pm 0.009\) | \(0.177 \pm 0.009\) |
| Our model | \(3.201 \pm 0.435\) | \(2.482 \pm 0.460\) | \(0.902 \pm 0.010\) | \(0.891 \pm 0.009\) | \(0.120 \pm 0.008\) | \(0.156 \pm 0.008\) |

BibTeX

@inproceedings{zhang2026structure,
  title = {Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model},
  author = {Zhang, Tianqiu and Lyu, Muyang and Liu, Xiao and Wu, Si},
  booktitle = {Proceedings of the International Conference on Machine Learning},
  year = {2026}
}