Large Language Models (LLMs) like GPT-4 are remarkably good at generating human-like text and recalling information learned during training. However, they lack true memory: nothing persists between calls unless the surrounding system supplies it, so memory is simulated through various techniques. Here’s a breakdown of how memory works in LLMs and the systems built around them.
1. Baked into the Model — “Parametric Memory”
This refers to the knowledge encoded in the model during training and accessed during inference, together with the transient state the model holds while it runs.
- Model Weights: Store statistical patterns from training data.
- Activations: Temporary representations during inference.
- Key-Value Cache: Speeds up decoding by caching attention keys/values.
- Optimizer States: Only relevant during training for efficient weight updates.
These components are not personalized—they don’t retain user interactions.
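A toy illustration of the point above, assuming a one-layer NumPy "model" (the names `W` and `forward` and the shapes are hypothetical, not taken from any real LLM): inference reads the weights but never writes them, while activations are recomputed on every call and then discarded.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Parametric memory": weights fixed after training (random here, purely illustrative).
W = rng.standard_normal((4, 3))
W_before = W.copy()

def forward(x: np.ndarray) -> np.ndarray:
    """Run one inference step; the result is a temporary activation."""
    return np.tanh(x @ W)

# Two unrelated "users" send different inputs.
act_user_a = forward(rng.standard_normal((1, 4)))
act_user_b = forward(rng.standard_normal((1, 4)))

# Inference only reads W, so nothing about either user is retained in the model.
weights_unchanged = np.array_equal(W, W_before)
print(weights_unchanged)
```

Any continuity between the two calls would have to come from outside the model, which is what the following sections cover.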
2. Application-Layer Memory
Memory is simulated around the LLM to create continuity.
Short-Term Memory
→ Context Window (e.g., 8K–128K tokens)
Everything in the prompt: instructions, chat history, current input.
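As a rough sketch of how an application might fit these pieces into a fixed window: the helper below keeps the instructions and the current input, then adds chat history from newest to oldest until a token budget is spent. `count_tokens` and `build_prompt` are hypothetical names, and counting whitespace-separated words is only a stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def build_prompt(system: str, history: list[str], user_input: str, max_tokens: int) -> str:
    """Keep the system instructions and current input; drop the oldest turns first."""
    budget = max_tokens - count_tokens(system) - count_tokens(user_input)
    kept: list[str] = []
    # Walk the history from newest to oldest until the budget is spent.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return "\n".join([system, *kept, user_input])

history = [
    "User: My name is Ada.",
    "Assistant: Nice to meet you, Ada.",
    "User: I like chess.",
    "Assistant: Chess is a great game.",
]
prompt = build_prompt(
    "You are a helpful assistant.", history, "User: What is my name?", max_tokens=20
)
print(prompt)
```

With this small budget the two oldest turns are dropped, so the user's name never reaches the model. That is the practical meaning of "short-term": anything outside the window is simply not seen.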
Long-Term Memory
→ Retrieval-Augmented Generation (RAG): Uses vector databases to fetch relevant past data.
→ Lightweight Personalization: Like ChatGPT’s user memory feature.
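One way lightweight personalization could be approximated is a small per-user fact store that is reloaded at the start of each session and rendered into the prompt. This is a sketch under stated assumptions, not a description of how ChatGPT's feature works: `UserMemory`, its methods, and the JSON file layout are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

class UserMemory:
    """Hypothetical per-user fact store persisted as a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.facts: dict[str, list[str]] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def remember(self, user_id: str, fact: str) -> None:
        """Append a fact for this user and write the store back to disk."""
        self.facts.setdefault(user_id, []).append(fact)
        self.path.write_text(json.dumps(self.facts))

    def as_system_note(self, user_id: str) -> str:
        """Render stored facts as text to prepend to the prompt."""
        facts = self.facts.get(user_id, [])
        return "Known about this user: " + "; ".join(facts) if facts else ""

store_path = Path(tempfile.mkdtemp()) / "user_memory.json"
UserMemory(store_path).remember("u1", "prefers concise answers")

# A new session reloads the file, so the fact survives across conversations.
note = UserMemory(store_path).as_system_note("u1")
print(note)
```

The model itself is unchanged; the "memory" is just text that the application chooses to put back into the context window.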
3. Memory in Agentic Systems
For autonomous LLM agents, structured memory modules are essential:
1️⃣ Episodic Memory
Stores past interactions, observations, and agent actions (often in a vector DB).
2️⃣ Semantic Memory
Holds external knowledge (docs, wikis, domain data) retrieved via RAG.
3️⃣ Procedural Memory
Defines agent functions—tools, prompts, actions, and system instructions.
4️⃣ Short-Term (Working) Memory
Active context: retrieved memory, recent exchanges, and task state.
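The four modules above can be sketched as a single container whose working memory is rebuilt for each step. Everything here is an illustrative assumption: the `AgentMemory` class, its field names, and the keyword match that stands in for vector retrieval.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Hypothetical container for the four memory types listed above."""
    episodic: list[str] = field(default_factory=list)         # past interactions and actions
    semantic: dict[str, str] = field(default_factory=dict)    # external knowledge by topic
    procedural: dict[str, str] = field(default_factory=dict)  # system instructions, tool descriptions
    working: list[str] = field(default_factory=list)          # context for the current step

    def load_working(self, query: str, recent: int = 2) -> list[str]:
        """Rebuild working memory: instructions, matching knowledge, recent episodes, task."""
        words = set(query.lower().split())
        # Naive keyword match stands in for vector retrieval.
        knowledge = [text for topic, text in self.semantic.items() if topic in words]
        self.working = [
            self.procedural.get("system", ""),
            *knowledge,
            *self.episodic[-recent:],
            f"Task: {query}",
        ]
        return self.working

memory = AgentMemory(
    semantic={"refunds": "Refunds are processed within 5 days."},
    procedural={"system": "You are a support agent."},
)
memory.episodic.append("User asked about shipping; agent called track_order.")
context = memory.load_working("how do refunds work")
print(context)
```

In a real agent the episodic and semantic stores would more likely sit behind a vector database, but the flow is the same: working memory is assembled from the other three before each model call.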
You Should Know:
Practical Implementation of LLM Memory
1. Using Vector Databases for Long-Term Memory (RAG)
Example: Storing and retrieving data with FAISS (Facebook AI Similarity Search)
import faiss
import numpy as np

# Create embeddings (e.g., using OpenAI's API)
embeddings = np.random.rand(100, 1536).astype('float32')  # Mock embeddings

# Build FAISS index
index = faiss.IndexFlatL2(1536)
index.add(embeddings)

# Query similar embeddings
query_embedding = np.random.rand(1, 1536).astype('float32')
k = 3  # Retrieve top 3 matches
distances, indices = index.search(query_embedding, k)
print(f"Retrieved indices: {indices}")
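A search like the one above returns row numbers, not text, so a RAG pipeline also needs a lookup from index to the stored chunk. The sketch below shows that last step. To stay dependency-free it uses a brute-force NumPy function, `l2_search`, as a stand-in for a flat L2 index; the function, the sample texts, and the 2-D toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def l2_search(stored: np.ndarray, query: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
    """Brute-force nearest neighbours by squared L2 distance (stand-in for a flat index)."""
    dists = ((stored - query) ** 2).sum(axis=1)
    order = np.argsort(dists)[:k]
    return dists[order], order

# Hypothetical memory: each text chunk is stored alongside its embedding row.
texts = ["User prefers Python.", "User works in finance.", "User is based in Berlin."]
embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], dtype="float32")  # toy 2-D vectors

query_embedding = np.array([0.9, 0.1], dtype="float32")  # pretend embedding of the question
distances, indices = l2_search(embeddings, query_embedding, k=2)

# The search returns row numbers only; look the text up separately.
retrieved = [texts[i] for i in indices]
prompt = (
    "Relevant memory:\n" + "\n".join(retrieved)
    + "\n\nQuestion: Which language should the example use?"
)
print(retrieved)
```

The retrieved chunks are then placed in the prompt, which is how long-term memory ends up back in the context window.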
2. Managing Context Window in LLM Prompts
Example: Truncating text to fit within a token limit (using tiktoken)
# pip install tiktoken
import tiktoken

def truncate_text(text, max_tokens=4000, model="gpt-4"):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) > max_tokens:
        truncated = tokens[:max_tokens]
        return enc.decode(truncated)
    return text
3. Using Key-Value Cache for Faster Inference
Example: Enabling KV caching in Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
# use_cache=True (typically the default) reuses attention keys/values from earlier steps
outputs = model.generate(**inputs, use_cache=True, max_length=50)
print(tokenizer.decode(outputs[0]))
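To make the idea behind the cache concrete, here is a minimal single-head attention sketch in NumPy. It is not how Transformers implements caching internally; the random matrices `W_q`, `W_k`, `W_v`, the toy dimensions, and the `attend` helper are assumptions chosen for illustration. It compares recomputing keys and values for the whole prefix at every step against projecting only the newest token and appending to a cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Random projection matrices stand in for trained attention weights.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

tokens = rng.standard_normal((5, d))  # toy embeddings for a 5-token sequence

# Without a cache: re-project keys and values for the whole prefix at every step.
no_cache = [
    attend(tokens[t] @ W_q, tokens[: t + 1] @ W_k, tokens[: t + 1] @ W_v)
    for t in range(len(tokens))
]

# With a cache: project only the newest token and append to the stored keys/values.
k_cache, v_cache, with_cache = [], [], []
for x in tokens:
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    with_cache.append(attend(x @ W_q, np.array(k_cache), np.array(v_cache)))

same = np.allclose(no_cache, with_cache)
print(same)
```

Both paths should give the same attention outputs up to floating-point tolerance; the cached path just avoids redoing the key and value projections for tokens it has already seen, which is where the decoding speedup comes from.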
4. Linux Commands for Managing LLM Workflows
# Monitor GPU usage (for LLM inference)
nvidia-smi
# Stream logs from an LLM API
journalctl -u my_llm_service -f
# Check memory usage
free -h
5. Windows Commands for AI Workloads
# Check system resources
Get-Counter '\Processor(_Total)\% Processor Time'
# Manage Python environments (for LLM development)
python -m venv llm-env
.\llm-env\Scripts\activate
What Undercode Say
LLMs don’t have true memory—they rely on clever engineering to simulate it. For persistent memory, use RAG, vector databases, and structured agent frameworks. Optimize performance with KV caching and context management.
References:
Reported By: Shivanivirdi Llms – Hackers Feeds