The folks at MIT are pulling all-nighters dreaming of an age where LLMs decide, on the fly and for the long haul, how to answer us squishy meat-bags even better.

Here’s the vibe:
The model fires off a bunch of self-edit trial patches, slaps them onto itself, checks if key metrics went up, and then tweaks its “edit-generation policy” using the ReST-EM (Reinforced Self-Training + Expectation-Maximization) loop. Once the winning edits are crowned, they actually change the weights via QLoRA.
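To make the moving parts concrete, here's a minimal sketch of that outer loop. This is a paraphrase, not the SEAL codebase: the helpers (`generate_self_edits`, `apply_edit_with_qlora`, `evaluate`, `finetune_on`) are hypothetical stand-ins you'd wire up to your own model and tasks.

```python
from typing import Callable, List, Tuple

def rest_em_round(
    model,                             # current base LLM / edit-generation policy
    tasks: List[dict],                 # few-shot tasks (e.g. ARC items)
    generate_self_edits: Callable,     # (model, task, n) -> list of candidate edits
    apply_edit_with_qlora: Callable,   # (model, edit) -> temporarily adapted model
    evaluate: Callable,                # (model, task) -> score on a held-out query
    finetune_on: Callable,             # (model, winners) -> updated policy
    n_candidates: int = 8,
):
    """One ReST-EM round: sample self-edits, keep the ones that help, train on them."""
    winners: List[Tuple[dict, str]] = []
    for task in tasks:
        baseline = evaluate(model, task)
        # E-step: sample candidate self-edits and keep only those whose
        # trial QLoRA update actually moves the held-out metric up.
        for edit in generate_self_edits(model, task, n_candidates):
            adapted = apply_edit_with_qlora(model, edit)
            if evaluate(adapted, task) > baseline:
                winners.append((task, edit))
    # M-step: fine-tune the edit-generation policy on the winning edits,
    # so the model gets better at proposing edits that stick.
    return finetune_on(model, winners)
```

The binary "did the metric go up" check is the reward; ReST-EM just filters on it and does supervised fine-tuning on the survivors, which keeps the whole thing much tamer than full-blown RL.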
We already do something similar in Membria: every night we sweep new knowledge from the cache. But we haven’t gone full “AI self-reflection” yet, and that’s honestly wildly cool. Think about it: LLMs (big or tiny) normally freeze in time; this would let them bake new facts and user patterns directly into their weights, a straight shot to hyper-personalization.
Numbers to drool over: on the few-shot ARC benchmark, self-edits trained with ReST-EM hit 72.5% accuracy, versus 0% for plain in-context learning and 20% for self-edits from the untrained model.
Caveat: spam too many self-edits and you risk catastrophic forgetting, a.k.a. AI dementia – old skills evaporate.
For Membria, the pieces are almost there:
- We already re-infer and rank fresh knowledge.
- Plug in a ReST-EM control loop to govern “self-critique” – teaching the model which edits are worth turning into data.
- RLHF still handles the human thumbs-up/-down, while a mini-SFT step rolls the finalists into the weights with QLoRA (rough sketch of that nightly pipeline below).
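A purely speculative sketch of how the nightly sweep could chain those three pieces. None of these names (`nightly_sweep`, `trial_update`, `heldout_score`, `human_score`, `qlora_sft`) are real Membria APIs; they're placeholders for the cache ranking, the ReST-EM gate, the RLHF filter, and the final QLoRA mini-SFT.

```python
def nightly_sweep(model, cache_entries, trial_update, heldout_score,
                  human_score, qlora_sft, min_human_score=0):
    """Consolidate one night's worth of cached knowledge into the weights."""
    baseline = heldout_score(model)

    # ReST-EM-style gate: keep a cache entry only if a trial QLoRA update
    # on it improves the held-out metric over the current model.
    survivors = [e for e in cache_entries
                 if heldout_score(trial_update(model, e)) > baseline]

    # RLHF still gets a veto: drop anything humans thumbed down.
    survivors = [e for e in survivors if human_score(e) >= min_human_score]

    # Mini-SFT: bake the finalists into the weights via QLoRA.
    return qlora_sft(model, survivors)
```

Keeping the human signal as a veto *after* the metric gate means a thumbs-down can always stop an edit from ever reaching the weights, no matter how good it looks on benchmarks.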
Project page and happy tinkering here: https://jyopari.github.io/posts/seal