Story 1

Bruh, shut the f up.

Have you noticed that reasoning models sometimes ramble for paragraphs before answering a simple question? Researchers have now formally named (and solved) this problem.

  • A March 2026 paper, SmartThinker: Progressive Chain-of-Thought…, demonstrates that “overthinking” is a real, measurable phenomenon.

  • SmartThinker dynamically calibrates how long a model’s reasoning chain should be per problem—compressing chain length by up to 52.5% on average

  • Surprisingly, it also improved accuracy by up to 16.6%. The key insight: more tokens ≠ more accuracy. For simpler tasks, excessive reasoning actually degrades model output.

How to apply this (in your career):

  • Running your own model? Distilled QwQ or DeepSeek-R1 both support fine-tuning — you can apply SmartThinker-style length control directly

  • Paying per token? Log your output token counts. A reasoning model producing 3× the tokens on simple tasks is burning money talking to itself
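
If you're paying per token, the audit can be a few lines. Here's a hypothetical sketch (the `TokenAudit` class and its thresholds are made up for illustration; feed it whatever usage numbers your provider's API returns):

```python
# Hypothetical sketch: track output-token counts per request and flag
# requests where a reasoning model is burning tokens talking to itself.
from statistics import median

class TokenAudit:
    def __init__(self):
        self.records = []  # (prompt_tokens, output_tokens) pairs

    def log(self, prompt_tokens, output_tokens):
        self.records.append((prompt_tokens, output_tokens))

    def flag_ramblers(self, ratio=3.0):
        """Return requests whose output is more than `ratio` times the
        median output length -- candidates for a shorter reasoning budget."""
        med = median(out for _, out in self.records)
        return [(p, o) for p, o in self.records if o > ratio * med]

audit = TokenAudit()
audit.log(prompt_tokens=12, output_tokens=40)    # simple question, short answer
audit.log(prompt_tokens=15, output_tokens=55)
audit.log(prompt_tokens=10, output_tokens=900)   # simple question, 900 tokens of musing
print(audit.flag_ramblers())  # -> [(10, 900)]
```

The 3× ratio mirrors the rule of thumb above; tune it to your own traffic.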

Story 2

LoRA’s Powerful Journey: From Bullied to Superhero.

When you fine-tune a large AI model, you're adjusting billions of numbers that control how the model thinks, which costs a lot of computing power.

LoRA thought: can we change how the model thinks without having to touch billions of parameters?

LoRA’s Solution:

  • Don't touch the original model at all

  • Instead, insert two small (adapter) matrices directly into the model's layers

  • Train only those — nothing else
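
The adapter trick above is tiny in code. A bare-numpy sketch (illustrative shapes, not any particular model):

```python
import numpy as np

d, k, r = 512, 512, 8                    # layer dims and a small rank r
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen base weight: never touched
A = rng.standard_normal((r, k)) * 0.01   # small trainable adapter matrix
B = np.zeros((d, r))                     # starts at zero, so training begins at W

def forward(x):
    # base path + low-rank adapter path; only A and B ever get gradients
    return x @ W.T + x @ (B @ A).T

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.1%}")  # -> trainable fraction: 3.1%
```

Two matrices of shape (512, 8) and (8, 512) stand in for a 512×512 update: about 3% of the parameters, which is where the compute savings come from.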

She was popular. She was efficient. But there were some haters.

Some people complained that LoRA can't change the base model's behavior enough.

Well, that is odd. LoRA has been plenty successful before, so I'm not sure why they're complaining. But sure, okay.

People experiencing this issue would often have to increase the rank, a number that controls how large the “adapter matrices” are. If you don't know what rank or adapter matrices mean, don't worry, but know this:

⬆️ adapter matrices/rank ⬆️ potential to deviate from the base model.

But that would mean buying more compute power. So, frustrated, they might have gone down to the store to buy another GPU, cursing out LoRA the whole way there.

But then enters a new research paper: Why LoRA Resists Label Noise (arXiv:2602.00084).

Here's what they found.

The people complaining about LoRA… it was actually a skill issue. Well, a dataset issue, probably.

We all know the phrase - Garbage in = Garbage Out.

But with LoRA garbage in does not always mean garbage out. Here's why:

  • Rank means LoRA can only update the model in a limited number of directions (that number is the rank)

    • Good data is coherent — hundreds of examples pushing toward the same answer, stacking signals, naturally dominating those directions

    • Bad labels are incoherent — pointing in conflicting directions, fighting each other

  • If you are using a garbage dataset, the bad labels are fighting each other — their signals cancel out and can't compete with the coherent data that's all pushing in the same direction. The limited directions get claimed by the strongest signal. Noise is weak by definition.

So maybe the people complaining about not seeing enough change in behavior actually had bad data. LoRA filters garbage: the coherent signals win, and the incoherent ones never get in.

The new mental model after this paper:

  • How much compute do I have? ✓

  • (new) How much do I trust my data?

    • Clean, curated, hand-verified data? Crank rank up (32–64)

    • Messy, scraped, real-world data? Drop rank low (4–8)

    • Not sure? Start at rank 8 and watch your validation loss
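
With Hugging Face's peft library, rank is one line in the config. A sketch (the target module names are typical for Llama-style models and vary by model family, so check your own model's layer names):

```python
from peft import LoraConfig

# Messy scraped data -> start low; clean curated data -> try r=32-64.
config = LoraConfig(
    r=8,                                  # the rank knob from the list above
    lora_alpha=16,                        # scaling; a common default is 2x the rank
    target_modules=["q_proj", "v_proj"],  # which layers get adapter matrices
    lora_dropout=0.05,
)
# Then wrap your base model: model = get_peft_model(base_model, config)
```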

So the next time someone tells you LoRA can't change base model behavior enough, ask them what rank they're running. Ask them what their data looks like… and stop bullying poor LoRA, because being able to filter out garbage data is kind of a superpower.

How to get started:

  1. Swap full fine-tuning for LoRA if you haven't already — this is the free win. Most teams using noisy real-world data are still doing full fine-tuning without realizing they're actively making the noise problem worse.

  2. Set your initial rank low (try rank 8). Run training and log your validation loss per epoch.

  3. Look for the "bend" — an early plateau followed by a slow climb. That climb is when noise memorization starts. Stop there.

  4. If accuracy is lower than you need, incrementally raise the rank and repeat. You're trading noise-robustness for capacity, so move slowly.

  5. Once you've found a rank that works, that number is now a data quality benchmark for your pipeline — if a new dataset needs a lower rank to stay stable, your labels got worse.
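
Step 3's "bend" can be spotted mechanically. A hypothetical helper (the function name and patience threshold are my own, not from the paper): feed in your logged per-epoch validation losses, get back the epoch to stop at.

```python
def find_bend(val_losses, patience=2):
    """Return the epoch index where validation loss stops improving and then
    climbs for `patience` consecutive epochs -- a rough signal that noise
    memorization has started. Returns the last epoch if no climb is found."""
    best, best_epoch, climbs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, climbs = loss, epoch, 0
        else:
            climbs += 1
            if climbs >= patience:
                return best_epoch
    return len(val_losses) - 1

# Early plateau around epoch 4, then a slow climb: stop at the plateau
losses = [1.9, 1.2, 0.9, 0.85, 0.84, 0.88, 0.93, 1.01]
print(find_bend(losses))  # -> 4
```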

Story 3

How To: Go from being delulu to thriving in 30 minutes or less

RFT (Reinforcement Fine-Tuning) is a technique that rewards a model when its reasoning leads to correct answers. Confidence goes up. Hard problems get easier. Then the model wanders into territory it doesn't actually know, and instead of saying "I'm not sure," it reasons its way to an answer anyway.

The first intervention, reverse KL regularization, did not allow the model to change its behavior from training at all. It stopped the hallucinations but killed the growth. She became scared to develop any new reasoning patterns at all. Not delulu anymore, but not thriving either. Just stuck.

CARE-RFT (Confidence-Anchored Regularized Reinforcement Finetuning, arXiv:2602.00085) applies a smarter leash.

Before — Reverse KL:

  • Before training — base model exists, behaviours established

  • During training — RFT rewards good reasoning and the reverse KL penalty simultaneously pushes back against any deviation from the original model. Both forces active, every update, no exceptions

  • Result — the model learns a little, but the leash is so uniform it's scared to explore anything genuinely new

After — CARE-RFT:

  • Before training — base model exists, behaviours established

  • During training — RFT rewards good reasoning and the SRKL (Skew Reverse KL) penalty is active on every update, but now it's reading the model's track record. Directions that keep getting rewarded → penalty stays bounded, model keeps developing. Directions that keep getting it wrong → penalty grows unbounded, model gets pulled back hard

  • Result — the model builds genuine reasoning capability where it's earning it, and stays anchored where it isn't
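
In symbols, the difference between the two leashes might look like this. This is a sketch assuming the standard skew-divergence form; the paper's Section 3 has the exact weighting and the confidence-anchoring term:

```latex
% Plain reverse KL: the same pull everywhere, on every update
\mathcal{L}_{\text{RKL}} = \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)

% Skewed version: mix a fraction alpha of the current policy into the anchor,
% which keeps the penalty bounded in directions the policy is moving toward
\mathcal{L}_{\text{SRKL}} = \mathrm{KL}\!\left(\pi_\theta \,\|\, (1-\alpha)\,\pi_\theta + \alpha\,\pi_{\text{ref}}\right)
```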

How to apply this:

1. Get comfortable with a base RFT setup first. The easiest on-ramp is DeepLearning.AI's short course "Reinforcement Fine-Tuning LLMs with GRPO" — free, hands-on, covers GRPO end-to-end which is the main algorithm CARE-RFT builds on.

2. Use the TRL library. Hugging Face's TRL (Transformer Reinforcement Learning) library has GRPO support built in. Install it with pip install trl and you can run a basic GRPO training loop in ~50 lines of Python. That's your starting point before swapping in the SRKL regularizer.

3. Implement the SRKL swap. The CARE-RFT modification is in the KL penalty term of the training loss. Once you have a working GRPO loop, you replace the standard reverse KL term with the skew version from the paper's equations (Section 3). It's a targeted change, not a full rewrite.

4. Pick a small verifiable task. CARE-RFT shines on tasks where you can automatically check if an answer is correct — math problems, code execution, structured outputs. If you're testing it, start with something like GSM8K (grade school math word problems) or a domain-specific Q&A set where you have ground-truth answers.
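
For step 3, the swap itself is small. A toy numpy comparison of the two penalties on made-up next-token distributions (this assumes the common skew-divergence definition; match the exact equation to the paper's Section 3 before using it in a real loss):

```python
import numpy as np

def reverse_kl(p, q):
    # KL(p || q): the standard leash pulling the policy p back toward q
    return float(np.sum(p * np.log(p / q)))

def skew_reverse_kl(p, q, alpha=0.5):
    # KL(p || (1-alpha)*p + alpha*q): mixing p into the anchor bounds the
    # penalty, so well-rewarded deviations get punished less harshly
    m = (1 - alpha) * p + alpha * q
    return float(np.sum(p * np.log(p / m)))

policy = np.array([0.70, 0.20, 0.10])   # behavior after some RFT updates
ref = np.array([0.25, 0.50, 0.25])      # the frozen base model

print(reverse_kl(policy, ref), skew_reverse_kl(policy, ref))
```

For the same deviation, the skewed penalty is always smaller than the plain one, which is the "bounded where it's earning it" behavior the bullets above describe.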

We r cooked

Shadow Agents

Not as cool as it sounds, sorry for the clickbait. Two years ago, the threat model for AI was simple: prompt injection, data leakage from chatbots, maybe a poisoned training set. Annoying but containable. That era is over.

The current problem is “Shadow Agents”: autonomous AI systems—agents that browse the web, write and execute code, send emails, call APIs, and take multi-step actions—that are being deployed inside organizations without formal security review. The same way shadow IT (employees using Dropbox before IT approved it) created compliance nightmares in the 2010s, shadow agents are now doing the same thing—except instead of just storing files, they’re taking actions on your behalf with your credentials. AI-related security incidents rose 56.4% from 2023 to 2024. In 2026, enterprises are in what researchers are calling the “post-hype integration phase”: LLMs are baked into core workflows, and the attack surface has exploded accordingly.

Hope Core 🌱

Less water consumption?!

Corintis has developed a bio-inspired chip cooling system that embeds microfluidic channels directly into the chip itself. Think of it like capillaries in the human body: instead of cooling the chip from the outside, coolant flows through microscopic pathways inside the chip, removing heat exactly where it’s generated.

In a September 2025 research partnership with Microsoft, Corintis demonstrated that their system removes heat up to 3× more efficiently than conventional cold plates. The ecological upside: their architecture supports higher coolant temperatures, meaning data centers can use warmer reclaimed water rather than chilled fresh water—directly reducing fresh water consumption at scale.

Well, that's a wrap for this week, folks. Stay tuned for next week's latest research in AI and tech!

XOXO,

GG
