What Deep Learning Theory Teaches Us About AI Memory
How rate reduction and lossy compression principles from Berkeley’s new textbook could reshape how we build persistent memory for LLMs
The Memory Problem No One Talks About
Every AI coding assistant you use today has the same dirty secret: it forgets everything the moment your session ends. That brilliant debugging session where Claude figured out your codebase architecture? Gone. The context about your team’s coding conventions that took 20 messages to establish? Evaporated.
We’re building MemoryGraph to solve this problem—a graph-based memory system that gives LLMs persistent, queryable memory across sessions. But as we dove deeper into the architecture, we kept hitting the same fundamental question:
What does it actually mean to “remember” something well?
It’s not enough to just store text and retrieve it. Human memory doesn’t work that way. We compress experiences into schemas, organize knowledge hierarchically, and somehow retrieve exactly what’s relevant from decades of accumulated experience in milliseconds.
Then we discovered a new textbook that changed how we think about this problem entirely.
“Learning Deep Representations of Data Distributions”
In August 2025, Yi Ma’s lab at Berkeley released Learning Deep Representations of Data Distributions—an open-source textbook that presents a unified mathematical framework for understanding deep learning through the lens of compression.
Their central thesis is deceptively simple:
“We compress to learn, and we learn to compress.”
The book argues that intelligence—whether biological or artificial—fundamentally involves discovering low-dimensional structure in high-dimensional data and transforming that data into compact, structured representations.
This isn’t just philosophy. They provide rigorous mathematics showing that popular neural network architectures (ResNets, Transformers, CNNs) can be derived as iterative optimization steps that maximize something called rate reduction—a measure of how well representations compress data while preserving important distinctions.
Reading this, we realized: this framework maps directly onto the memory storage problem.
Rate Reduction: A New Way to Think About Memory Quality
The book introduces a principle called Maximal Coding Rate Reduction (MCR²). Here’s the intuition:
Imagine you have a collection of memories from different categories—bug fixes, architectural decisions, API documentation, team preferences. A good memory representation should do two things simultaneously:
1. Maximize expansion between categories: Memories about bug fixes should live in a completely different “region” of your representation space than memories about team preferences. You want these categories to be as distinguishable as possible.
2. Maximize compression within categories: All your bug fix memories should cluster tightly together. They share common structure—problem, cause, solution—and your representation should capture that.
Mathematically, this is expressed as:
ΔR = R(all memories) - Σ R(memories in each category)
Where R is the “coding rate”—essentially, how many bits you’d need to encode the data. You want to maximize ΔR: the total coding rate should be high (diverse memories), but the sum of within-category rates should be low (similar memories cluster).
This gives us a concrete metric for memory quality that goes beyond simple retrieval accuracy.
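To make this less abstract, here is a minimal sketch of how ΔR could be estimated for a batch of memory embeddings, using the log-det coding rate from the MCR² papers. The embeddings, labels, and the ε precision parameter below are illustrative placeholders, not MemoryGraph's implementation:

import numpy as np

def coding_rate(Z: np.ndarray, eps: float = 0.5) -> float:
    """Approximate number of bits needed to encode the rows of Z up to
    precision eps (the log-det rate used in the MCR^2 papers)."""
    n, d = Z.shape
    gram = Z.T @ Z                                   # d x d scatter matrix
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)[1]

def rate_reduction(Z: np.ndarray, labels: np.ndarray, eps: float = 0.5) -> float:
    """Delta R: rate of the whole memory set minus the size-weighted rates
    of each category on its own."""
    within = sum(
        (np.sum(labels == c) / len(Z)) * coding_rate(Z[labels == c], eps)
        for c in np.unique(labels)
    )
    return coding_rate(Z, eps) - within

# Toy example: 200 unit-norm "memory embeddings" in two categories
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 32))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
labels = np.repeat([0, 1], 100)
print(rate_reduction(Z, labels))   # higher = better separated, tighter clusters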
How This Applies to LLM Memory Systems
The Problem with Flat Embeddings
Most vector databases treat every memory the same way: convert the text to a 384- or 768-dimensional embedding, store it, and retrieve by cosine similarity.
But this ignores the structure we know exists in the data. A memory about a “person” is fundamentally different from a memory about a “code pattern.” Treating them identically wastes representational capacity and makes retrieval harder.
The Berkeley framework suggests a different approach: type-specific subspaces.
Structured Embedding Spaces
Instead of one flat embedding space, imagine memories organized into learned subspaces:
┌──────────────────────────────────────────────────────────┐
│                Embedding Space (384-dim)                  │
│                                                           │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐  │
│  │   Person     │   │   Project    │   │   Solution   │  │
│  │   Subspace   │   │   Subspace   │   │   Subspace   │  │
│  │   (64-dim)   │   │  (128-dim)   │   │   (96-dim)   │  │
│  │              │   │              │   │              │  │
│  │  • Alice     │   │  • ProjectX  │   │  • Fix#123   │  │
│  │  • Bob       │   │  • MemGraph  │   │  • Fix#456   │  │
│  │  • Carol     │   │  • API-v2    │   │  • Pattern#7 │  │
│  └──────────────┘   └──────────────┘   └──────────────┘  │
│                                                           │
│     ↑ Orthogonal subspaces (maximally separated)          │
└──────────────────────────────────────────────────────────┘
Each memory type gets projected into its own subspace. These subspaces are learned to be orthogonal—maximizing separation between types. Within each subspace, similar memories cluster together—maximizing compression.
This is rate reduction in action: expand between categories, compress within them.
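One way to make "maximally separated" measurable: treat each type's projection as a basis for its subspace and check the principal angles between pairs of subspaces. A rough sketch, with randomly initialized matrices standing in for whatever projections would actually be learned:

import numpy as np

def subspace_overlap(P_a: np.ndarray, P_b: np.ndarray) -> float:
    """Cosine of the smallest principal angle between the subspaces spanned
    by the columns of P_a and P_b (0 = orthogonal, 1 = fully overlapping)."""
    Qa, _ = np.linalg.qr(P_a)   # orthonormal basis for type A's subspace
    Qb, _ = np.linalg.qr(P_b)   # orthonormal basis for type B's subspace
    return float(np.linalg.norm(Qa.T @ Qb, ord=2))

# Hypothetical projections for two entity types (random here, learned in practice)
rng = np.random.default_rng(1)
P_person = rng.normal(size=(384, 64))
P_project = rng.normal(size=(384, 128))
print(subspace_overlap(P_person, P_project))   # should approach 0 after training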
The Graph as a Compression Mechanism
Here’s where things get interesting for MemoryGraph specifically.
The Berkeley book shows that neural network layers can be understood as iterative compression steps. Each layer transforms representations to be more compact and more structured.
We realized: a knowledge graph already does this.
Consider how MemoryGraph stores information:
Raw Input: "Alice fixed the authentication bug in the login
service yesterday by adding proper token validation"
Graph Representation:
(Alice:Person) --[AUTHORED]--> (Fix#892:Solution)
(Fix#892:Solution) --[RESOLVES]--> (AuthBug:Error)
(AuthBug:Error) --[AFFECTS]--> (LoginService:Project)
(Fix#892:Solution) --[INVOLVES]--> (TokenValidation:CodePattern)
This graph representation is a lossy compression of the original text. We’ve extracted the essential structure—who, what, where, how—and discarded the rest. The entities are compressed representations (cluster centers), and the relationships define how to navigate between them.
In the language of the Berkeley book:
- Entities = compressed representations of many observations (low-dimensional subspace centers)
- Relations = transformation operators between subspaces
- Observations = high-dimensional raw data that gets compressed into entity updates
The graph structure itself encodes the low-dimensional manifold that rate reduction seeks to discover.
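As a toy illustration of "entities = compressed representations of many observations" (not MemoryGraph's actual internals), an entity can be modeled as a running mean of the embeddings of the observations that mention it: a cluster center that each new observation nudges slightly.

from dataclasses import dataclass, field
import numpy as np

@dataclass
class Entity:
    name: str
    entity_type: str
    count: int = 0
    center: np.ndarray = field(default_factory=lambda: np.zeros(384))

    def absorb(self, observation_embedding: np.ndarray) -> None:
        """Fold a new observation into the entity: update the running mean
        (the 'cluster center') rather than keeping every raw sentence."""
        self.count += 1
        self.center += (observation_embedding - self.center) / self.count

# Every new mention of Alice updates one 384-dim vector, however many
# raw observations reference her.
alice = Entity(name="Alice", entity_type="person")
alice.absorb(np.random.default_rng(2).normal(size=384))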
Practical Implementation Ideas
Based on this theoretical framework, we’re exploring several enhancements to MemoryGraph:
1. Type-Aware Embeddings
Instead of treating all content identically, project embeddings through type-specific learned transformations:
import numpy as np
from sentence_transformers import SentenceTransformer

class LinearProjection:
    """Stand-in for a learned linear map into a type-specific subspace
    (randomly initialized here; in practice these weights would be trained)."""
    def __init__(self, in_dim: int, out_dim: int):
        self.weight = np.random.randn(out_dim, in_dim) / np.sqrt(in_dim)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return self.weight @ x

class TypeAwareEmbedding:
    def __init__(self):
        self.base_encoder = SentenceTransformer("all-MiniLM-L6-v2")
        # Learned projections for each entity type;
        # dimensions chosen based on type complexity
        self.projections = {
            "person": LinearProjection(384, 64),
            "project": LinearProjection(384, 128),
            "solution": LinearProjection(384, 96),
            "error": LinearProjection(384, 48),
            "code_pattern": LinearProjection(384, 64),
        }

    def embed(self, content: str, entity_type: str) -> np.ndarray:
        base = self.base_encoder.encode(content)
        projection = self.projections.get(entity_type)
        return projection(base) if projection else base
The dimensionality of each subspace reflects the inherent complexity of that type. People are relatively simple to characterize; projects have more nuance.
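Usage would look something like this, with the type names and dimensions assumed above:

embedder = TypeAwareEmbedding()
alice_vec = embedder.embed("Alice, backend engineer on the auth team", "person")      # 64-dim
fix_vec = embedder.embed("Fix #892: added token validation to login service", "solution")  # 96-dim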
2. Progressive Memory Consolidation
The book describes how deep networks progressively compress representations layer by layer. We can apply this to memory over time:
Layer 0: Raw Observations (high-dimensional, ephemeral)
    "User mentioned they prefer tabs over spaces"
    "User asked about Python formatting"
    "User corrected a spacing issue in code"
        │
        ▼  Compression (after session)
Layer 1: Working Memory (mid-dimensional, session-persistent)
    "User has strong code formatting preferences"
        │
        ▼  Compression (after multiple sessions)
Layer 2: Consolidated Knowledge (low-dimensional, long-term)
    Entity property: coding_style = "strict_formatting"
        │
        ▼  Compression (over time)
Layer 3: Core Identity (minimal, permanent)
    Entity: User with trait "detail_oriented"
This mirrors how human memory consolidation works—episodic memories compress into semantic knowledge over time.
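Here is a deliberately simplified sketch of the first two consolidation steps. The summarize callable is a hypothetical stand-in for an LLM prompt, and the keyword rule in the second step is only a placeholder for whatever policy actually decides when a pattern of observations becomes an entity property:

from typing import Callable, Dict, List

def consolidate_session(observations: List[str],
                        summarize: Callable[[str], str]) -> str:
    """Layer 0 -> Layer 1: compress a session's raw observations into one
    working-memory statement. `summarize` is a hypothetical callable,
    e.g. a thin wrapper around an LLM prompt."""
    prompt = ("Compress these observations into one durable statement "
              "about the user:\n" + "\n".join(f"- {o}" for o in observations))
    return summarize(prompt)

def consolidate_to_property(working_memories: List[str]) -> Dict[str, str]:
    """Layer 1 -> Layer 2: reduce repeated working memories to a compact
    entity property. The keyword rule below is a placeholder for a learned
    or LLM-driven consolidation policy."""
    text = " ".join(working_memories).lower()
    if "formatting" in text or "style" in text:
        return {"coding_style": "strict_formatting"}
    return {}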
3. Expansion-Compression Retrieval
Current semantic search optimizes only for similarity (compression). The rate reduction framework suggests we should also optimize for distinctiveness (expansion):
from typing import List

import numpy as np

# `embed` and `cosine_similarity` are the same helpers assumed elsewhere in
# this post (e.g. TypeAwareEmbedding.embed plus a standard cosine similarity);
# `Memory` is the stored memory record with a `.content` field.

def retrieval_score(query: str, memory: Memory,
                    other_retrieved: List[Memory],
                    alpha: float = 0.7) -> float:
    # Compression term: how relevant is this memory to the query?
    similarity = cosine_similarity(embed(query), embed(memory.content))

    # Expansion term: how distinct is it from the other retrieved results?
    if other_retrieved:
        distinctiveness = 1.0 - float(np.mean([
            cosine_similarity(embed(memory.content), embed(other.content))
            for other in other_retrieved
        ]))
    else:
        distinctiveness = 1.0

    # Balance both objectives (alpha tunes relevance vs. diversity)
    return alpha * similarity + (1 - alpha) * distinctiveness
This prevents retrieval from returning redundant results—a common failure mode of pure similarity search.
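In practice the score would be applied greedily, re-ranking candidates one at a time against what has already been selected (essentially the maximal marginal relevance pattern). A sketch that reuses retrieval_score, List, and Memory from above:

def select_diverse(query: str, candidates: List[Memory],
                   k: int = 5, alpha: float = 0.7) -> List[Memory]:
    """Greedy expansion-compression selection: each pick must be relevant
    to the query AND distinct from the memories already chosen."""
    selected: List[Memory] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda m: retrieval_score(query, m, selected, alpha))
        selected.append(best)
        remaining.remove(best)
    return selected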
4. Graph-Guided Manifold Navigation
The graph structure defines a manifold that memories lie on. Retrieval can exploit this:
async def graph_aware_search(query: str, depth: int = 2):
    # 1. Find entry points via embedding similarity
    seeds = await semantic_search(query, limit=5)

    # 2. Expand along graph edges (follow the manifold), breadth-first,
    #    visiting each entity at most once
    expanded = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        next_frontier = set()
        for entity in frontier:
            # Relations define valid transitions on the manifold
            neighbors = await get_related_entities(entity)
            next_frontier.update(n for n in neighbors if n not in expanded)
        expanded.update(next_frontier)
        frontier = next_frontier

    # 3. Re-rank by combined graph + semantic relevance
    return rank_by_rate_reduction_score(expanded, query)
The graph provides a strong inductive bias about which memories are likely relevant together.
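The rank_by_rate_reduction_score step is left abstract above. One plausible reading (a sketch, not MemoryGraph's implementation) is to pick results greedily, rewarding both similarity to the query and the marginal coding-rate gain each candidate adds to the growing result set. It reuses coding_rate from the earlier sketch and the same assumed embed/cosine_similarity helpers; the 0.1 weight is arbitrary:

import numpy as np

def rank_by_rate_reduction_score(candidates, query: str, top_k: int = 10):
    """Greedy re-ranking: prefer entities that are relevant to the query AND
    raise the coding rate (information content) of the result set so far."""
    q = embed(query)
    ranked, selected_vecs = [], []
    pool = list(candidates)
    while pool and len(ranked) < top_k:
        def gain(entity):
            v = embed(entity.content)
            Z = np.stack(selected_vecs + [v])
            # relevance to the query + marginal expansion of the result set
            return cosine_similarity(q, v) + 0.1 * coding_rate(Z)
        best = max(pool, key=gain)
        ranked.append(best)
        selected_vecs.append(embed(best.content))
        pool.remove(best)
    return ranked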
What’s Next
We’re actively researching how to integrate these ideas into MemoryGraph. Some open questions we’re exploring:
Learning the projections: How do we train type-specific embeddings without massive labeled datasets? Self-supervised approaches using the graph structure itself look promising.
Quantifying compression quality: Can we use rate reduction as an actual metric during development? This would give us a principled way to evaluate memory architectures beyond just retrieval benchmarks.
Adaptive consolidation: When should observations be compressed into entity updates? Too aggressive and we lose detail; too conservative and memory bloats.
Cross-type relationships: How do we handle embeddings when a query spans multiple types? “Find solutions that Alice worked on for projects using Python” crosses Person, Solution, and Project subspaces.
The Bigger Picture
The Berkeley book’s framework suggests something profound: compression isn’t just a storage optimization—it’s the fundamental operation of learning itself.
Every time you explain a complex system by its key components, you’re doing rate reduction. Every time you recognize a pattern across multiple experiences, you’re finding low-dimensional structure. Every time you organize knowledge hierarchically, you’re building a manifold.
For AI memory systems, this means we shouldn’t think of memory as a retrieval problem with storage attached. Memory is a compression problem with retrieval as a side effect.
Get the compression right—find the true low-dimensional structure in the experiences—and retrieval becomes almost trivial. The memories that matter will naturally cluster together, and the right memory for a given context will be the one that reduces uncertainty the most.
That’s what we’re building toward with MemoryGraph. Not just a database that stores what AI assistants experience, but a system that truly learns from those experiences—compressing them into structured knowledge that makes every future interaction more intelligent.
MemoryGraph is an open-source graph-based memory system for AI assistants. Check out the project at github.com/gregorydickson/memory-graph or try the cloud platform at memorygraph.dev.
The Berkeley textbook “Learning Deep Representations of Data Distributions” is freely available at ma-lab-berkeley.github.io/deep-representation-learning-book.
References
- Buchanan, S., Pai, D., Wang, P., & Ma, Y. (2025). Learning Deep Representations of Data Distributions. Online textbook.
- Chan, K. H. R., Yu, Y., You, C., Qi, H., Wright, J., & Ma, Y. (2022). ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction. Journal of Machine Learning Research.
- Yu, Y., Buchanan, S., Pai, D., et al. (2024). White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is? Journal of Machine Learning Research.
Gregory Dickson is a Senior AI Developer & Solutions Architect specializing in AI/ML development and cloud architecture. He’s the creator of MemoryGraph, an open-source MCP memory server using graph-based relationship tracking.