All You Need To Know About Memory In LLMs

Large Language Models (LLMs) like GPT-4 are remarkably good at generating human-like text and recalling information absorbed during training. However, the model itself is stateless: it keeps nothing from one request to the next, and what looks like memory comes from techniques layered around it. Here's a breakdown of how memory works in LLMs and the systems built around them.

1. Baked into the Model — “Parametric Memory”

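This refers to the knowledge encoded in the weights during training and accessed during inference. Strictly speaking, only the weights are parametric; the other items below are transient state that the model holds while it runs or trains.

Model Weights: Store statistical patterns learned from the training data; they are fixed at inference time.
Activations: Temporary intermediate representations computed during a forward pass and discarded afterwards.
Key-Value Cache: Speeds up decoding by caching the attention keys/values of tokens already processed, so they are not recomputed at every step.
Optimizer States: Only relevant during training (for example, Adam's moment estimates); they are not needed for inference.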

None of these components is personalized: by themselves they don't retain anything from a user's interactions once a request is finished.
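To get a feel for how much space each component takes, the back-of-the-envelope estimate below may help. It is only a sketch under stated assumptions: the function names are made up for illustration, the KV-cache formula assumes plain multi-head attention (no grouped-query or multi-query variants), and the shape used (12 layers, 12 heads of size 64, roughly 124M parameters) is a GPT-2-small-like configuration.

```python
def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory to hold the weights (2 bytes per parameter for fp16/bf16)."""
    return n_params * bytes_per_param


def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Keys and values (the factor 2) for every layer, head, and cached position."""
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_value


def adam_state_bytes(n_params: int, bytes_per_value: int = 4) -> int:
    """Adam keeps two extra fp32 values per parameter (first and second moments)."""
    return 2 * n_params * bytes_per_value


# Illustrative GPT-2-small-like shape; the parameter count is approximate.
n_params = 124_000_000
print(weight_bytes(n_params) / 1e6, "MB of weights")
print(kv_cache_bytes(12, 12, 64, seq_len=1024) / 1e6, "MB of KV cache at 1,024 tokens")
print(adam_state_bytes(n_params) / 1e6, "MB of Adam state (training only)")
```

The KV cache grows linearly with sequence length and batch size, which is why long contexts can cost more memory than the weights themselves on larger models.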

2. Application-Layer Memory

Applications simulate memory around the LLM to create continuity across turns and sessions.

Short-Term Memory

→ Context Window (e.g., 8K–128K tokens)

Everything in the prompt: instructions, chat history, and the current input. Anything that does not fit in the window is invisible to the model.
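One common way to keep a conversation inside the window is to always keep the instructions and the current input, then add as many of the most recent turns as still fit. The sketch below illustrates that idea only; `count_tokens` is a crude word-count stand-in (a real application would use the model's tokenizer), and `fit_history` is a hypothetical helper name.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_history(system: str, history: list[str], user_input: str, budget: int) -> list[str]:
    """Keep the system prompt and current input, then as many recent turns as fit."""
    remaining = budget - count_tokens(system) - count_tokens(user_input)
    kept: list[str] = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > remaining:
            break  # stop at the first turn that no longer fits
        kept.append(turn)
        remaining -= cost
    return [system, *reversed(kept), user_input]


history = [
    "user: my name is Ada",
    "assistant: nice to meet you Ada",
    "user: I like chess",
    "assistant: noted",
]
prompt = fit_history("system: be brief", history, "user: what do I like?", budget=16)
print("\n".join(prompt))
```

With a budget of 16 "tokens", the two oldest turns are dropped, so the model would no longer see the user's name. That loss is what long-term memory techniques try to compensate for.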

Long-Term Memory

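→ Retrieval-Augmented Generation (RAG): Embeds stored text and fetches the most relevant pieces, typically from a vector database, so they can be added to the prompt.

→ Lightweight Personalization: Saved facts and preferences about a user that are added to later prompts, like ChatGPT's user memory feature.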
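Lightweight personalization can be as simple as a per-user list of facts that is rendered into the prompt. The class below is a hypothetical illustration of that pattern, not a description of how ChatGPT's feature is implemented; `UserMemory` and its methods are invented names.

```python
import json


class UserMemory:
    """Hypothetical per-user fact store that can be rendered into a prompt."""

    def __init__(self) -> None:
        self.facts: dict[str, list[str]] = {}

    def remember(self, user_id: str, fact: str) -> None:
        """Store a fact for the user, skipping exact duplicates."""
        facts = self.facts.setdefault(user_id, [])
        if fact not in facts:
            facts.append(fact)

    def as_prompt_block(self, user_id: str) -> str:
        """Render the stored facts as text to prepend to the system prompt."""
        facts = self.facts.get(user_id, [])
        if not facts:
            return ""
        return "Known about this user:\n" + "\n".join(f"- {fact}" for fact in facts)

    def dump(self) -> str:
        """Serialize all facts to JSON so they can outlive the process."""
        return json.dumps(self.facts)


memory = UserMemory()
memory.remember("u1", "prefers concise answers")
memory.remember("u1", "works mostly in Python")
memory.remember("u1", "prefers concise answers")  # duplicate, ignored
print(memory.as_prompt_block("u1"))
```

The model never "remembers" anything here; the application re-sends the stored facts on every request.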

3. Memory in Agentic Systems

For autonomous LLM agents, structured memory modules are essential:

1️⃣ Episodic Memory

Stores past interactions, observations, and agent actions (often in a vector DB).

2️⃣ Semantic Memory

Holds external knowledge (docs, wikis, domain data) retrieved via RAG.

3️⃣ Procedural Memory

Defines how the agent acts: its tools, prompts, actions, and system instructions.

4️⃣ Short-Term (Working) Memory

Active context: retrieved memory, recent exchanges, and task state.
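The four memory types above can be pictured as one small data structure, with working memory assembled from the other three at each step. This is a minimal sketch under simplifying assumptions: `AgentMemory` is an invented name, plain lists and dicts stand in for the vector stores a real agent framework would use, and the sample data is made up.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    episodic: list[str] = field(default_factory=list)          # past interactions and actions
    semantic: dict[str, str] = field(default_factory=dict)     # external knowledge, keyed by topic
    procedural: dict[str, str] = field(default_factory=dict)   # system prompt, tool descriptions

    def working_memory(self, task: str, topic: str, recent: int = 2) -> str:
        """Assemble the active context for one step from the other three stores."""
        parts = [
            self.procedural.get("system", ""),
            f"Knowledge: {self.semantic.get(topic, 'none')}",
            "Recent events: " + "; ".join(self.episodic[-recent:]),
            f"Task: {task}",
        ]
        return "\n".join(part for part in parts if part)


memory = AgentMemory(procedural={"system": "You are a support agent."})
memory.semantic["refunds"] = "Refunds are issued within 14 days."
memory.episodic += ["user asked about order 17", "agent looked up order 17"]
context = memory.working_memory(task="answer refund question", topic="refunds")
print(context)
```

In a fuller system, the dictionary lookup for semantic memory and the "last N events" slice for episodic memory would both be replaced by similarity search.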

You Should Know:

Practical Implementation of LLM Memory

1. Using Vector Databases for Long-Term Memory (RAG)

Example: Storing and retrieving data with FAISS (Facebook AI Similarity Search)
import faiss
import numpy as np

# Create embeddings. Random vectors stand in here; in practice they come
# from an embedding model (e.g., OpenAI's API).
embeddings = np.random.rand(100, 1536).astype('float32')  # Mock embeddings

# Build FAISS index (exact L2 search over 1536-dimensional vectors)
index = faiss.IndexFlatL2(1536)
index.add(embeddings)

# Query similar embeddings
query_embedding = np.random.rand(1, 1536).astype('float32')
k = 3  # Retrieve top 3 matches
distances, indices = index.search(query_embedding, k)
print(f"Retrieved indices: {indices}")
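The FAISS search above returns row indices, not text, so a RAG pipeline also needs to map those indices back to the stored passages. The sketch below shows that last step with a brute-force NumPy search instead of FAISS, so it runs without extra dependencies. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the documents are invented.

```python
import numpy as np

documents = [
    "the user prefers dark mode",
    "the invoice was paid in march",
    "the user lives in lisbon",
]

# Toy embedding: word counts over the corpus vocabulary (stand-in for a real model).
vocab = sorted({word for doc in documents for word in doc.split()})


def embed(text: str) -> np.ndarray:
    words = text.split()
    return np.array([words.count(w) for w in vocab], dtype="float32")


# One row per document, in the same order as the `documents` list.
doc_vectors = np.stack([embed(doc) for doc in documents])


def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents with the smallest squared L2 distance to the query."""
    distances = ((doc_vectors - embed(query)) ** 2).sum(axis=1)
    nearest = np.argsort(distances)[:k]
    return [documents[i] for i in nearest]


print(retrieve("which city the user lives in"))
```

Keeping the text list and the vector rows in the same order is the simplest way to do the mapping; production systems usually store an ID alongside each vector instead.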

2. Managing Context Window in LLM Prompts

Example: Truncating text to fit within a token limit (using tiktoken)
# Install first: pip install tiktoken
import tiktoken

def truncate_text(text, max_tokens=4000, model="gpt-4"):
    """Return text cut to at most max_tokens tokens, keeping the beginning."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) > max_tokens:
        truncated = tokens[:max_tokens]
        return enc.decode(truncated)
    return text

3. Using Key-Value Cache for Faster Inference

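Example: Enabling the KV cache in Hugging Face Transformers (use_cache=True is normally already the default for generation)
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
# Unpack the tokenizer output so input_ids and attention_mask are passed as keyword arguments
outputs = model.generate(**inputs, use_cache=True, max_length=50)
# generate() returns a batch of sequences; decode the first one
print(tokenizer.decode(outputs[0]))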
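What the `use_cache` flag saves can be seen in a toy single-head attention layer. The sketch below is an illustration in NumPy, not the Transformers implementation: the random projection matrices and token vectors are made up, and the two decoding loops differ only in whether keys and values for earlier positions are recomputed or reused.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding size
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))  # five token embeddings, one row per position


def attend(q, K, V):
    """Scaled dot-product attention for one query over all available positions."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V


def decode_without_cache(x):
    """Recompute keys and values for the whole prefix at every step."""
    outputs = []
    for t in range(len(x)):
        prefix = x[: t + 1]
        K, V = prefix @ W_k, prefix @ W_v
        outputs.append(attend(x[t] @ W_q, K, V))
    return np.stack(outputs)


def decode_with_cache(x):
    """Project each token once and append its key and value to the cache."""
    K_cache, V_cache, outputs = [], [], []
    for t in range(len(x)):
        K_cache.append(x[t] @ W_k)
        V_cache.append(x[t] @ W_v)
        outputs.append(attend(x[t] @ W_q, np.stack(K_cache), np.stack(V_cache)))
    return np.stack(outputs)


same = np.allclose(decode_without_cache(tokens), decode_with_cache(tokens))
print("Outputs match:", same)
```

Both loops produce the same outputs; the cached version does one key/value projection per step instead of one per prefix position, trading memory for compute.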

4. Linux Commands for Managing LLM Workflows

# Monitor GPU usage (for LLM inference)
nvidia-smi

# Follow the logs of an LLM service run under systemd (my_llm_service is a placeholder unit name)
journalctl -u my_llm_service -f

# Check memory usage
free -h

5. Windows Commands for AI Workloads (PowerShell)

# Check total CPU utilization
Get-Counter '\Processor(_Total)\% Processor Time'

# Create and activate a Python virtual environment (for LLM development)
python -m venv llm-env
.\llm-env\Scripts\activate

What Undercode Says

LLMs don't have true memory; they rely on engineering around the model to simulate it. For persistent memory, use RAG backed by a vector database and structured memory modules in agent frameworks. Keep inference fast and prompts within limits with KV caching and context management.


References:

Reported By: Shivanivirdi Llms – Hackers Feeds