An Open-Source Framework for Emergent Multi-Agent Social Simulations
"This serves as a complex mirror to real-world systems of power, belief, and emergent social behavior."
The architecture leverages LangGraph for cycle management and ChromaDB for persistent storage. It runs locally on consumer-grade hardware using Gemma models, chosen for their high intelligence-to-parameter ratio.
Interventions are managed asynchronously through a Human-in-the-Loop (HITL) interface, enabling real-time state injections that influence the simulation's trajectory via a Telegram-based command protocol.
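The Telegram wiring itself is beyond this excerpt, but the core HITL pattern — operator commands queued asynchronously and applied between simulation steps — can be sketched with stdlib asyncio alone. The `{"op": "set", ...}` command schema below is purely illustrative, not the framework's actual protocol:

```python
import asyncio

async def simulation_loop(state: dict, commands: asyncio.Queue, ticks: int) -> dict:
    """Advance the world `ticks` steps, applying queued interventions between steps."""
    for _ in range(ticks):
        # Drain pending HITL commands before the next step
        # (e.g. a hypothetical /set command relayed from Telegram).
        while not commands.empty():
            cmd = commands.get_nowait()
            if cmd["op"] == "set":
                state[cmd["key"]] = cmd["value"]
        state["tick"] = state.get("tick", 0) + 1
        await asyncio.sleep(0)  # yield so a listener task could enqueue more commands

    return state

async def main() -> dict:
    commands: asyncio.Queue = asyncio.Queue()
    # An operator injects a world-state change mid-run.
    await commands.put({"op": "set", "key": "weather", "value": "storm"})
    return await simulation_loop({}, commands, ticks=3)

print(asyncio.run(main()))  # {'weather': 'storm', 'tick': 3}
```

Because interventions are only applied at step boundaries, the simulation state never mutates mid-inference, which keeps injections deterministic relative to the tick count.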
Persistence and Long-Context Management
To support indefinitely long runs, the framework implements hierarchical recursive summarization: older events are compressed into summaries, and those summaries are themselves summarized as history accumulates. This keeps the "World Engine" historically aware without exceeding context limits, while LangGraph checkpointers provide durability across system reboots.
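A minimal sketch of such a summarization hierarchy, with a stub standing in for the actual LLM summarization call (the fanout and formatting here are illustrative, not the framework's real parameters):

```python
def summarize(chunks: list[str]) -> str:
    # Stand-in for an LLM summarization call.
    return "SUMMARY(" + "; ".join(chunks) + ")"

class HierarchicalMemory:
    """Keeps at most `fanout` items per level; overflow is summarized upward."""

    def __init__(self, fanout: int = 4):
        self.fanout = fanout
        self.levels: list[list[str]] = [[]]  # levels[0] holds raw events

    def add(self, event: str) -> None:
        self.levels[0].append(event)
        level = 0
        # Cascade: a full level is compressed into one summary one level up,
        # which may in turn overflow that level, and so on recursively.
        while len(self.levels[level]) > self.fanout:
            chunk = self.levels[level][: self.fanout]
            del self.levels[level][: self.fanout]
            if level + 1 == len(self.levels):
                self.levels.append([])
            self.levels[level + 1].append(summarize(chunk))
            level += 1

    def context(self) -> list[str]:
        # Most-compressed (oldest) history first, recent raw events last.
        out: list[str] = []
        for level in reversed(self.levels):
            out.extend(level)
        return out
```

The prompt assembled from `context()` grows logarithmically with history rather than linearly, which is what keeps the loop within a fixed context budget.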
Orchestration: LangGraph and the Router Pattern
The backbone of this infinite loop is LangGraph, which facilitates a stateful, cyclic graph architecture. It allows nodes to represent distinct agents—such as the Logic Engine and Narrative agents—while edges define the flow of information through a persistent Observe-Think-Act loop.
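A dependency-free sketch of that Observe-Think-Act cycle: nodes are functions over shared state, fixed edges chain them, and a router function plays the role of LangGraph's conditional edges. The node bodies are placeholders, not the framework's real agents:

```python
def observe(state: dict) -> dict:
    state["percept"] = f"world at tick {state['tick']}"
    return state

def think(state: dict) -> dict:
    # In the real system this is an LLM call; here, a placeholder plan.
    state["plan"] = f"respond to {state['percept']}"
    return state

def act(state: dict) -> dict:
    state["tick"] += 1
    return state

def route(state: dict) -> str:
    # Conditional edge: loop back to observe until the tick budget is spent.
    return "end" if state["tick"] >= state["max_ticks"] else "observe"

NODES = {"observe": observe, "think": think, "act": act}
FIXED_EDGES = {"observe": "think", "think": "act"}  # "act" falls through to route()

def run(state: dict) -> dict:
    node = "observe"
    while node != "end":
        state = NODES[node](state)
        node = FIXED_EDGES.get(node) or route(state)
    return state

final = run({"tick": 0, "max_ticks": 3})
print(final["tick"])  # 3
```

In the actual LangGraph build, `NODES` and `FIXED_EDGES` correspond to `add_node`/`add_edge` calls on a `StateGraph`, and checkpointing the state dict between steps is what the checkpointer provides for free.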
To work within the strict VRAM limits of consumer cards, the framework uses llama.cpp's router mode. This implementation applies a Least Recently Used (LRU) eviction policy, dynamically swapping models between GPU VRAM and system RAM.
The Quantization Standard
By reducing model weights from 16-bit to 4-bit precision, we cut the VRAM footprint roughly fourfold: a 9B model that would normally require 18GB fits into just 5.8GB of VRAM, maintaining high reasoning capability on a local workstation.
This precision reduction preserves the vast majority of the model's emergent behavior while leaving critical headroom for the KV cache, which stores the context of the current conversation. For extended simulations, we additionally quantize the KV cache to 4-bit, roughly halving the memory cost of long contexts and enabling deeper historical awareness within the hardware's physical limits.
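The savings are straightforward to estimate: each cached token stores keys and values for every layer, so cache size scales linearly with bit width. The architecture numbers below are hypothetical round figures for a 9B-class model, not official Gemma specifications:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits: int) -> int:
    # K and V each hold n_layers * n_kv_heads * head_dim values per cached token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits // 8

# Hypothetical 9B-class configuration: 42 layers, 8 KV heads, head_dim 256, 8K context.
fp16 = kv_cache_bytes(42, 8, 256, 8192, 16)
q4 = kv_cache_bytes(42, 8, 256, 8192, 4)
print(fp16 / 2**30, q4 / 2**30)  # 2.625 0.65625 -> the 4-bit cache is 4x smaller
```

Note the linear dependence on `seq_len`: with the cache quantized, the same VRAM budget holds a proportionally longer simulation history.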
VRAM Reference by Model and Quantization
These figures are planning ranges for local use. Real usage will be slightly higher once the runtime, context window, and KV cache are active.
| Model | Q2 | Q4 (Default) | Q5 | FP16 | Practical Application |
| --- | --- | --- | --- | --- | --- |
| E2B | ~0.8 GB | ~1.5 GB | ~1.8 GB | ~4 GB | Minimal agents; CPU-heavy setups. |
| E4B | ~2 GB | ~3.5 GB | ~4.2 GB | ~8 GB | Standard starting point for local agents. |
| 26B A4B | ~9 GB | ~14 GB | ~17 GB | ~52 GB | High-tier social complexity; requires a 24 GB GPU. |
| 31B | ~12 GB | ~20 GB | ~24 GB | ~62 GB | Maximum local reasoning; 24 GB VRAM required. |
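For models not in the table, a back-of-envelope weights-only estimate is parameters × bits ÷ 8, plus some runtime overhead. Real quantization formats mix bit widths per tensor, so measured figures come out somewhat higher than this idealized calculation; the 10% overhead factor below is an assumption, not a measured constant:

```python
def model_vram_gb(params_b: float, bits: float, overhead: float = 1.1) -> float:
    """Weights-only VRAM estimate: parameters * bits / 8, plus ~10% runtime overhead.
    Excludes the context window and KV cache, per the table's caveat."""
    return params_b * 1e9 * bits / 8 * overhead / 1e9

print(model_vram_gb(9, 4))   # about 4.95 GB of weights for a 9B model at 4-bit
print(model_vram_gb(9, 16))  # about 19.8 GB at FP16
```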
Join the Experiment.
This is an invitation to build your own scenarios—whether they be utopian social experiments or harrowing studies of civilizational collapse.
https://soundcloud.com/mihai-vancea-909027674/building-persistent-ai

