<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://yashsawant22.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://yashsawant22.github.io/" rel="alternate" type="text/html" /><updated>2026-05-09T20:21:27+00:00</updated><id>https://yashsawant22.github.io/feed.xml</id><title type="html">yash sawant</title><subtitle>writing about what I learn — training internals, post-training, building things from scratch</subtitle><author><name>Yash Sawant</name></author><entry><title type="html">Adaptive LoRA Rank Allocation Works for SFT. It Fails Under GRPO. Here’s Why.</title><link href="https://yashsawant22.github.io/2026/05/09/adaptive-lora-sft-vs-grpo.html" rel="alternate" type="text/html" title="Adaptive LoRA Rank Allocation Works for SFT. It Fails Under GRPO. Here’s Why." /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://yashsawant22.github.io/2026/05/09/adaptive-lora-sft-vs-grpo</id><content type="html" xml:base="https://yashsawant22.github.io/2026/05/09/adaptive-lora-sft-vs-grpo.html"><![CDATA[<p>If you’ve fine-tuned a large language model in the last two years, you’ve probably used LoRA. The trick is so good and so cheap that it feels like a cheat code: instead of updating the model’s billions of parameters, you train a tiny pair of low-rank matrices alongside each weight matrix you care about. The model behaves as if you’d fine-tuned the whole thing, but you’ve touched maybe one or two percent of the parameter count.</p>

<p>The natural follow-up question, once you’ve used LoRA enough, is: <em>why do all layers get the same rank?</em></p>

<p>Empirically, transformer layers don’t contribute equally to whatever you’re fine-tuning on. Some are doing heavy lifting; some are barely moving. So if you have a fixed parameter budget, why not concentrate it where it matters?</p>

<p>This question kicked off an entire subfield of “adaptive rank allocation” methods over the past two years — and the punchline of every paper has been the same. Profile the gradients, allocate rank where the gradients are largest, save parameters, get the same accuracy. AdaLoRA. GoRA. IGU-LoRA. ILA. Aletheia. Five different angles on the same recipe, all under supervised fine-tuning, all reporting the same kind of win.</p>

<p>So I tried the recipe under GRPO — the RL algorithm DeepSeek popularized for training reasoning models.</p>

<p>It didn’t transfer. Adaptive allocation made the model <strong>worse</strong> than uniform.</p>

<p>This post is about why. The short version is that the gradient structure of GRPO is qualitatively different from SFT — flatter, more spread out, and more entangled with rank itself — and a recipe that depends on a peaked, sparse importance map quietly stops working when the importance map isn’t peaked or sparse anymore. The longer version contains a surprise with uncomfortable implications for the entire profile-then-reallocate paradigm.</p>

<hr />

<h2 id="the-adaptive-rank-story-under-sft">The adaptive-rank story under SFT</h2>

<p>Quick recap of the supervised case, because the rest of this post hinges on the contrast.</p>

<p>The original LoRA paper (<a href="https://arxiv.org/abs/2106.09685">Hu et al., 2022</a>) set rank uniformly across layers. Every adapter got, say, rank 8 or 16, regardless of what layer it was attached to. This was a reasonable default — you’re already saving 99% of parameters, so optimizing the last bit isn’t a priority.</p>

<p><strong>AdaLoRA</strong> (Zhang et al., 2023) was the first major paper to push back on this. The argument: not all weight updates matter equally during fine-tuning, so under a fixed parameter budget, you should pour rank into the high-importance directions and prune it away from the low-importance ones. AdaLoRA does this dynamically during training, parameterizing the LoRA decomposition as an SVD and pruning the small singular values as it learns which ones matter.</p>

<p>The key empirical finding behind AdaLoRA, and behind everything that followed, is that the importance distribution is <em>peaked</em>. Some layers and modules carry a lot of signal; others carry almost none. If you visualize the per-layer gradient magnitude during SFT, you see hot spots — usually concentrated in middle-to-late attention layers — and cold zones where the gradient is essentially noise.</p>

<p><strong>ILA</strong> (Shi et al., 2024) made this concrete in a clean way: under SFT, roughly <strong>30% of layers carry over 80% of the gradient signal</strong>. The rest are essentially passengers. Reduce their rank, redistribute that capacity into the hot layers, and you get the same downstream accuracy with a smaller parameter count. This is the free lunch that adaptive rank allocation has been compounding for two years.</p>

<p><strong>GoRA</strong> (He et al., 2025) extended the idea to dynamic during-training profiling rather than static profiling. <strong>IGU-LoRA</strong> (Cui et al., 2026) added a per-layer initialization variance correction so that adapters with different ranks start with comparable activation magnitudes. <strong>Aletheia</strong> (Saket, 2026) brought in a more rigorous information-theoretic objective on top of the same skeleton.</p>

<p>Different angles on the same recipe. Distilled, the recipe is (steps 1–2 are sketched in code just after the list):</p>

<ol>
  <li>Profile the per-layer gradient magnitude during a uniform-rank run.</li>
  <li>Allocate rank proportional to gradient importance, under a fixed total budget.</li>
  <li>Train.</li>
  <li>Win on the parameter-accuracy frontier.</li>
</ol>
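
<p>Here is a minimal sketch of steps 1–2: per-layer ranks proportional to profiled gradient norms, under a fixed budget. It’s the skeleton of the recipe, not any one paper’s allocator; each method differs in smoothing, clipping, and scheduling.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def allocate_ranks(grad_norms, total_budget, r_min=4):
    """Per-layer ranks proportional to profiled gradient norms,
    under a fixed total budget. A skeleton of the recipe only."""
    importance = np.asarray(grad_norms) / np.sum(grad_norms)
    ranks = np.maximum(r_min, np.round(importance * total_budget)).astype(int)
    # integer rounding drifts off the budget; nudge the extremes back
    while ranks.sum() &gt; total_budget:
        ranks[np.argmax(ranks)] -= 1
    while ranks.sum() &lt; total_budget:
        ranks[np.argmin(ranks)] += 1
    return ranks

# e.g. 28 per-layer ranks under the budget used in this post (28 x 32 = 896)
ranks = allocate_ranks(np.random.rand(28) + 0.5, total_budget=896)
assert ranks.sum() == 896
</code></pre></div></div>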

<p>This works because the gradient signal under SFT is sparse. Take that sparsity away and the recipe loses its purchase.</p>

<hr />

<h2 id="what-changes-under-rl">What changes under RL</h2>

<p>GRPO — Group Relative Policy Optimization, from the <a href="https://arxiv.org/abs/2402.03300">DeepSeekMath paper</a> — looks superficially like another fine-tuning algorithm. You take a base model, compute per-token gradients, update the weights. The gradient just comes from a different objective: instead of cross-entropy against ground truth, you compute a reward over generated completions and shape the gradient using group-relative advantages.</p>
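
<p>The “group-relative” part is small enough to show inline. A sketch of the advantage computation following the DeepSeekMath formulation (the epsilon in the denominator is my addition for numerical safety; reward functions and sampling are elided):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def grpo_advantages(rewards):
    """rewards: (K,) scalar rewards for K completions of one prompt.
    Each completion's advantage is its reward standardized within the group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-4)

# e.g. four completions of one prompt, two correct and two not:
adv = grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0]))
# positive advantage for the correct pair, negative for the rest; this scalar
# scales every token's log-prob gradient for that completion in the policy loss
</code></pre></div></div>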

<p>So the natural question is: does the SFT recipe transfer? Profile the gradients during a GRPO run, redistribute rank where the gradients are large, train, save parameters?</p>

<p>The setup of the experiment was deliberately boring. Qwen 2.5 1.5B Instruct. GSM8K. GRPO with two rewards (format compliance and correctness). LoRA on all seven projection modules per layer (q/k/v/o/up/down/gate) — 28 layers × 7 modules = 196 adapters in total. Total rank budget held fixed at 896 across configurations (= 28 × 32, the uniform baseline) so every configuration had the same parameter count. The only thing varying was <em>where</em> that parameter mass was concentrated.</p>

<p>The strategies tested:</p>

<ul>
  <li><strong>Uniform</strong> — every layer rank 32. Baseline.</li>
  <li><strong>Proportional</strong> — gradient-aware. Hot layers get more rank, cold layers get less. Same total budget.</li>
  <li><strong>Reduced 70%</strong> — gradient-aware at 70% of the parameters. (How much budget can the recipe save?)</li>
  <li><strong>Random</strong> — same total budget, allocation drawn at random. Control.</li>
</ul>

<p>I expected the standard SFT result: proportional ≥ uniform &gt; random. What I got was:</p>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>Rank Range</th>
      <th>Params</th>
      <th>GSM8K Acc</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Base model (no LoRA)</td>
      <td>—</td>
      <td>0</td>
      <td>66.0%</td>
    </tr>
    <tr>
      <td>Reduced 70%</td>
      <td>r=12–31</td>
      <td>25.8M</td>
      <td>65.0%</td>
    </tr>
    <tr>
      <td>Random</td>
      <td>r=16–48</td>
      <td>36.9M</td>
      <td>67.5%</td>
    </tr>
    <tr>
      <td>Proportional (gradient-aware)</td>
      <td>r=20–40</td>
      <td>36.9M</td>
      <td>70.0%</td>
    </tr>
    <tr>
      <td><strong>Uniform (r=32)</strong></td>
      <td>r=32–32</td>
      <td><strong>36.9M</strong></td>
      <td><strong>74.5%</strong></td>
    </tr>
  </tbody>
</table>

<p>Uniform won. By 4.5 points over the gradient-aware proportional allocation that “should” have been the optimum.</p>

<p>The interesting twist: gradient-aware <em>did</em> beat random by 2.5 points. The gradient signal was real — proportional knew which layers were hotter than random did. It just wasn’t enough to overcome the cost of redistributing rank.</p>

<hr />

<h2 id="why-grpo-has-a-flatter-gradient-landscape">Why: GRPO has a flatter gradient landscape</h2>

<p>The first thing that came out of profiling was that the gradient distribution under GRPO looks nothing like the SFT story.</p>

<p><img src="/assets/img/grpo-heatmap.png" alt="Per-layer gradient magnitude across 1000 GRPO steps" /></p>

<p>Under SFT (per ILA’s findings), the top 30% of layers carry over 80% of the gradient signal. Under GRPO in this run, the top 30% carry only <strong>~36%</strong>. The hottest layer (15) carries 4.68% of total gradient. The coldest (26) carries 2.15%. <strong>Max-to-min ratio: 2.17x.</strong></p>

<p>For comparison, SFT runs on similar architectures often show ratios of 10x or more. GRPO is essentially flat.</p>

<p>The flatness isn’t an artifact of averaging over noisy phases either. The early-vs-late training correlation of the importance map is <strong>0.962</strong> — it stabilizes within the first 100 steps and stays put. The structure is real, it’s just shallow.</p>
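
<p>All three statistics fall out of simple summaries of the logged norms. A sketch, assuming <code class="language-plaintext highlighter-rouge">norms</code> is a (steps × layers) array of per-layer gradient L2 norms; the variable name and shape are my framing, not the actual analysis code:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def concentration_stats(norms):
    """norms: (steps, layers) array of per-layer gradient L2 norms."""
    share = norms.mean(axis=0)
    share = share / share.sum()              # each layer's share of total gradient
    top30 = np.sort(share)[::-1][: int(0.3 * len(share))].sum()
    spread = share.max() / share.min()       # the max-to-min ratio quoted above
    half = norms.shape[0] // 2               # early-vs-late stability check
    early = norms[:half].mean(axis=0)
    late = norms[half:].mean(axis=0)
    stability = np.corrcoef(early, late)[0, 1]
    return top30, spread, stability
</code></pre></div></div>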

<p>This already gives you a clean story for the negative result. Adaptive rank allocation banks on the existence of idle layers whose capacity you can safely steal. Under SFT there are. Under GRPO there aren’t. Drop a layer’s rank from 32 to 20 and you’ve broken something the model needs.</p>

<p>But there was a second thing that came out of the profiling that I didn’t expect.</p>

<hr />

<h2 id="the-amplification-effect">The amplification effect</h2>

<p>I’d profiled gradients during <em>every</em> run, not just the uniform baseline, mostly to sanity-check that the rank allocation was applied correctly. When I plotted the gradient distribution for the proportional run alongside uniform:</p>

<p><img src="/assets/img/grpo-amplification.png" alt="Gradient amplification under non-uniform allocation" /></p>

<p>The spread <em>widens</em>. Uniform: 2.17x max/min. Proportional (same total budget): 3.00x. Reduced: 3.57x.</p>

<p>Hot layers given more rank absorbed <em>more</em> gradient. Cold layers given less rank went quieter. The allocation was creating a positive feedback loop — rank was concentrating gradient onto whichever layers had been given the rank.</p>

<p>That made me suspicious. Was the gradient just following itself? Was the proportional configuration simply amplifying its own profiling?</p>

<p>I ran the random allocation as a control. Layer 1 is normally one of the <em>coldest</em> layers — 3.09% of gradient under uniform. The random allocation gave it rank 48. Its gradient share jumped to <strong>4.21%</strong>, making it the second hottest layer in the network. Layer 15, normally the hottest, got rank 24 in the random allocation, and its gradient share dropped from 4.68% to 3.98%.</p>

<p>The correlation between allocated rank and gradient shift is <strong>0.972 for random, 0.946 for proportional.</strong></p>

<p>Read that line again. Even when rank is allocated <em>with no relation to gradient importance whatsoever</em>, the gradient follows. <strong>Rank determines gradient importance, not the other way around.</strong></p>

<p>This is the part of the result I didn’t expect, and it has uncomfortable implications for the entire profile-then-reallocate paradigm. The “important” layers your profiler identifies aren’t intrinsically important. They’re whatever layers happen to have rank to express themselves through. Move the rank, and the importance map moves with it.</p>

<p>This doesn’t necessarily falsify what the SFT papers found — under SFT, the importance map appears to be more anchored in the data and less in the rank distribution. But the moment you move to RL, where the gradient is shaped by sparse, noisy reward signals rather than dense supervision, the relationship inverts. Rank becomes a <em>cause</em> of importance, not a <em>consequence</em>.</p>

<hr />

<h2 id="a-practical-aside-pefts-silent-wildcard">A practical aside: PEFT’s silent wildcard</h2>

<p>Halfway through this work, my proportional run had 34.6M params instead of the expected 36.9M, and no errors were thrown. Worth flagging, because if you’re doing non-uniform LoRA in PEFT you will hit this.</p>

<p>I had set up <code class="language-plaintext highlighter-rouge">rank_pattern</code> like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rank_pattern</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"model.layers.15.*.q_proj"</span><span class="p">:</span> <span class="mi">40</span><span class="p">,</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>PEFT’s <code class="language-plaintext highlighter-rouge">rank_pattern</code> does not treat <code class="language-plaintext highlighter-rouge">*</code> as a glob. It’s a regex match against the full module path, and <code class="language-plaintext highlighter-rouge">*</code> in regex means “zero or more of the previous character.” So <code class="language-plaintext highlighter-rouge">model.layers.15.*.q_proj</code> matched essentially nothing in the actual module tree, every module silently fell back to the default rank, and the only signal that anything was wrong was a slightly-off parameter count.</p>

<p>Fix: use exact paths.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rank_pattern</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"model.layers.15.self_attn.q_proj"</span><span class="p">:</span> <span class="mi">40</span><span class="p">,</span>
    <span class="s">"model.layers.15.mlp.up_proj"</span><span class="p">:</span> <span class="mi">40</span><span class="p">,</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Verify trainable parameter count after applying the config.</p>
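
<p>Concretely, a check that would have caught this immediately (assuming <code class="language-plaintext highlighter-rouge">base_model</code> and the corrected <code class="language-plaintext highlighter-rouge">rank_pattern</code> from above; the expected total comes from your own allocation table):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,  # default rank; rank_pattern overrides it per module
    rank_pattern=rank_pattern,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "up_proj", "down_proj", "gate_proj"],
)
model = get_peft_model(base_model, config)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")   # compare against your expected total

# a rank-40 adapter's lora_A weight has 40 rows, so shapes expose the real ranks
for name, param in model.named_parameters():
    if "layers.15" in name and "lora_A" in name:
        print(name, tuple(param.shape))
</code></pre></div></div>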

<hr />

<h2 id="what-id-take-from-this">What I’d take from this</h2>

<p>Don’t naively port SFT-era rank allocation to RL training. SFT and RL appear to have qualitatively different gradient structures — peaked under SFT, flat under GRPO — and the techniques that exploit one don’t transfer. This is true even when the algorithm “looks like” fine-tuning at the gradient-step level.</p>

<p>The amplification finding has a stronger implication: <strong>static profiling can’t be the answer for RL.</strong> Whatever you measure during a uniform-rank run will shift the moment you reallocate. If adaptive rank is going to work under RL — and it might — it probably has to be <em>dynamic</em>, adjusting continuously during training (the AdaLoRA way), rather than committing to a fixed allocation upfront based on a profiling run.</p>

<p>The good news is that under GRPO, uniform is fine. It’s also the cheapest thing to do. If you’re allocating effort across an LLM training stack, you can move rank-allocation work down the priority list for RL workloads and spend the cycles somewhere they’ll actually pay off.</p>

<p>Negative results don’t always make great paper material. But this one was clarifying for me about what “transfer” actually means between fine-tuning regimes — and about how much of what we think we know about which layers matter is downstream of capacity allocation rather than upstream of it.</p>

<hr />

<p>The full paper is on arXiv (cs.CL, awaiting announcement). Code: <a href="https://github.com/yashsawant22/adaptive-lora-rank-grpo">github.com/yashsawant22/adaptive-lora-rank-grpo</a>. Submitted to the <a href="https://grigoris.ece.wisc.edu/workshops/colorai-icml-2026/">CoLoRAI workshop at ICML 2026</a>.</p>]]></content><author><name>Yash Sawant</name></author><summary type="html"><![CDATA[If you’ve fine-tuned a large language model in the last two years, you’ve probably used LoRA. The trick is so good and so cheap that it feels like a cheat code: instead of updating the model’s billions of parameters, you train a tiny pair of low-rank matrices alongside each weight matrix you care about. The model behaves as if you’d fine-tuned the whole thing, but you’ve touched maybe one or two percent of the parameter count.]]></summary></entry><entry><title type="html">Personalizing LLMs for High-Stakes Decisions: Four Lessons</title><link href="https://yashsawant22.github.io/2026/05/09/personalization-high-stakes-decisions.html" rel="alternate" type="text/html" title="Personalizing LLMs for High-Stakes Decisions: Four Lessons" /><published>2026-05-09T00:00:00+00:00</published><updated>2026-05-09T00:00:00+00:00</updated><id>https://yashsawant22.github.io/2026/05/09/personalization-high-stakes-decisions</id><content type="html" xml:base="https://yashsawant22.github.io/2026/05/09/personalization-high-stakes-decisions.html"><![CDATA[<p>Most LLM personalization research assumes the easy case: writing style, tone, topical preferences. What happens when personalization has to survive real consequences — when “getting the user what they want” and “getting the user what they need” actively diverge?</p>

<p>I spent the last several months building a personalized investment assistant for my own portfolio as a research testbed. The goal wasn’t a product — it was to see which parts of the standard personalization stack (RAG over user history, preference modeling, instruction tuning) actually hold up when the downstream decision costs money and the user’s stated preferences contradict their behavior.</p>

<p>Four things broke in ways I didn’t expect. This post is about those four things, because I think they generalize to any domain where LLM personalization meets consequential decisions — healthcare, legal, career coaching, long-horizon planning.</p>

<hr />

<h2 id="background-a-thesis-centric-architecture">Background: A Thesis-Centric Architecture</h2>

<p>Before the lessons, a quick mental model of the system, because the design choice matters for what follows.</p>

<p>Most LLM-over-finance setups start with price data and retrieve relevant news or filings. I inverted it: the primary unit of memory is <em>the user’s stated reasoning for holding a position</em> — a structured “thesis” — and every downstream component (reports, alerts, evaluation) scores new evidence against that thesis, not against the market in the abstract.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DATA SOURCES
  Brokerage (Robinhood) · Market Data (yfinance)
  Earnings Calendar · Your Interactions (CLI/Web/Chat)
          │
          ▼
CORE ENGINE
  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
  │ Living      │  │ Conviction  │  │ Behavioral  │
  │ Thesis      │  │ Scoring     │  │ Memory      │
  │ Manager     │  │ Engine      │  │ Extractor   │
  │             │  │             │  │             │
  │ Per-holding │  │ CONFIRMED   │  │ preferences │
  │ hypotheses  │  │ UNCHANGED   │  │ beliefs     │
  │ with break  │  │ WEAKENED    │  │ patterns    │
  │ conditions  │  │ BROKEN      │  │ rules       │
  └──────┬──────┘  └──────┬──────┘  │ risk toler. │
         │                │         └──────┬──────┘
         └───────┬────────┴────────────────┘
                 ▼
ANALYSIS LAYER
  Drift Detection · Pattern Matching · Position Grading
          │
          ▼
OUTPUT LAYER
  Daily Reports · Alerts · AI Chat · Web Dashboard
</code></pre></div></div>

<p>A thesis is not free-form text. It’s a structured object with fields the LLM must evaluate against:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">Thesis</span><span class="p">:</span>
    <span class="n">ticker</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">conviction_statement</span><span class="p">:</span> <span class="nb">str</span>       <span class="c1"># "AI infra capex is in a multi-year buildout"
</span>    <span class="n">validation_triggers</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>  <span class="c1"># signals that would strengthen the thesis
</span>    <span class="n">break_conditions</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>     <span class="c1"># signals that would falsify it
</span>    <span class="n">catalysts</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span>            <span class="c1"># upcoming events that resolve uncertainty
</span>    <span class="n">time_horizon</span><span class="p">:</span> <span class="nb">str</span>               <span class="c1"># "6-18 months"
</span>    <span class="n">conviction_score</span><span class="p">:</span> <span class="nb">float</span>         <span class="c1"># 0-10, updated over time
</span></code></pre></div></div>

<p>When new evidence arrives (an earnings print, a macro data point, a news headline), the scoring prompt forces the model to map the evidence onto <code class="language-plaintext highlighter-rouge">validation_triggers</code> and <code class="language-plaintext highlighter-rouge">break_conditions</code> explicitly — not to produce a free-form “bullish / bearish” verdict. The structured format is what makes the four problems below tractable at all.</p>

<hr />

<h2 id="four-things-that-broke">Four Things That Broke</h2>

<h3 id="1-the-users-profile-contradicts-itself--and-the-contradiction-is-the-signal">1. The user’s “profile” contradicts itself — and the contradiction is the signal</h3>

<p>Most personalization systems model the user as a stable preference vector. You extract preferences from history, you RAG over them, you condition generation on them. Done.</p>

<p>In a high-stakes domain, the user’s stated rules and their behavior routinely disagree — and flattening that disagreement into a single “preference” throws away the most useful signal in the data.</p>

<p>Concretely: I have a stated rule of “never average down into a position whose thesis is weakening.” My trade history shows I do exactly that during high-volatility stretches. A naive preference model either encodes the rule (and gives advice I’ll ignore) or encodes the behavior (and endorses the mistake). Neither is useful.</p>

<p>The fix was to stop collapsing these into one profile. Behavioral memory is split into five typed stores with different decay rates:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">MEMORY_TYPES</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"preference"</span><span class="p">:</span>    <span class="p">{</span><span class="s">"decay_days"</span><span class="p">:</span> <span class="mi">180</span><span class="p">},</span>  <span class="c1"># "prefers dividend payers"
</span>    <span class="s">"belief"</span><span class="p">:</span>        <span class="p">{</span><span class="s">"decay_days"</span><span class="p">:</span> <span class="mi">30</span><span class="p">},</span>   <span class="c1"># "thinks rates will cut in Q3"
</span>    <span class="s">"pattern"</span><span class="p">:</span>       <span class="p">{</span><span class="s">"decay_days"</span><span class="p">:</span> <span class="mi">90</span><span class="p">},</span>   <span class="c1"># "buys dips during VIX spikes"
</span>    <span class="s">"rule"</span><span class="p">:</span>          <span class="p">{</span><span class="s">"decay_days"</span><span class="p">:</span> <span class="mi">365</span><span class="p">},</span>  <span class="c1"># "never averages into broken theses"
</span>    <span class="s">"risk_tolerance"</span><span class="p">:{</span><span class="s">"decay_days"</span><span class="p">:</span> <span class="mi">365</span><span class="p">},</span>  <span class="c1"># stable over long horizons
</span><span class="p">}</span>
</code></pre></div></div>

<p>At inference time, the system retrieves from all five and <strong>explicitly surfaces conflicts</strong> — “your stated rule R contradicts observed pattern P” — rather than resolving them silently. The contradiction is treated as first-class output, not an inconsistency to be smoothed away.</p>

<p>This one design change was the single biggest behavioral improvement. A flagged contradiction at the moment of decision is worth more than ten correct retrievals.</p>
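
<p>A sketch of what that retrieval step can look like, reusing the <code class="language-plaintext highlighter-rouge">MEMORY_TYPES</code> table above. The field names (<code class="language-plaintext highlighter-rouge">created_at</code>, <code class="language-plaintext highlighter-rouge">tags</code>, <code class="language-plaintext highlighter-rouge">contradicts</code>) are illustrative, and reading <code class="language-plaintext highlighter-rouge">decay_days</code> as a half-life is my interpretation:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math, time

def memory_weight(memory):
    """Decay a memory's retrieval weight by its type's half-life."""
    age_days = (time.time() - memory["created_at"]) / 86400
    half_life = MEMORY_TYPES[memory["type"]]["decay_days"]
    return math.exp(-math.log(2) * age_days / half_life)

def retrieve_with_conflicts(memories, topic, k=5):
    relevant = sorted(
        (m for m in memories if topic in m["tags"]),
        key=memory_weight, reverse=True,
    )
    rules    = [m for m in relevant if m["type"] == "rule"]
    patterns = [m for m in relevant if m["type"] == "pattern"]
    # contradictions are returned alongside the memories, never resolved silently
    conflicts = [(r, p) for r in rules for p in patterns
                 if p.get("contradicts") == r["id"]]
    return relevant[:k], conflicts
</code></pre></div></div>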

<h3 id="2-retrieval-isnt-enough--you-need-evaluative-consistency-across-time">2. Retrieval isn’t enough — you need evaluative consistency across time</h3>

<p>This is the subtlest of the four. Vanilla RAG over a user’s notes gives you <em>access</em> to their reasoning. It doesn’t give you <em>consistency</em> in how that reasoning is applied to new evidence.</p>

<p>Suppose the user bought a semiconductor stock six weeks ago on the thesis “AI infrastructure capex is in a multi-year buildout.” Earnings come in light. A stateless LLM, prompted to interpret the earnings, will produce a locally plausible “earnings miss means bearish” read. A RAG-augmented LLM will retrieve the original thesis note and produce something slightly more nuanced — but still anchored on whatever framing is most salient in the retrieved context.</p>

<p>The question the system <em>should</em> be answering is narrower: <strong>does this specific evidence map to a <code class="language-plaintext highlighter-rouge">break_condition</code> the user wrote down at the time of purchase, or not?</strong> If yes, downgrade conviction. If no, hold. That’s a classification problem, not a generation problem — and the structured thesis format above is what makes it one.</p>

<p>The scoring prompt looks roughly like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Given this thesis:
  conviction_statement: {conviction_statement}
  break_conditions:    {break_conditions}
  validation_triggers: {validation_triggers}

And this new evidence:
  {evidence}

For each break_condition and validation_trigger, answer:
  - Does the evidence directly address it? (yes/no)
  - If yes, does it trigger/validate/weaken it? (one-word label)
  - Cite the exact phrase from the evidence that supports your answer.

Return a JSON object with per-condition verdicts. Do NOT produce an
overall bullish/bearish judgment.
</code></pre></div></div>

<p>The key move is forbidding the model from producing a free-form verdict. Forcing it to commit to per-condition verdicts eliminates most of the drift you get when the same thesis is re-interpreted week after week.</p>
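
<p>On the consuming side, the JSON makes conviction updates mechanical. A sketch of the consumer; the verdict vocabulary and the asymmetric weights are illustrative choices, not the system’s actual values:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import json

ALLOWED = {"triggers", "validates", "weakens", "not_addressed"}

def apply_verdicts(llm_response, thesis):
    """Turn per-condition verdicts into a conviction delta.
    Raises instead of guessing if the model drifted off-format."""
    verdicts = json.loads(llm_response)
    delta = 0.0
    for cond in thesis.break_conditions:
        verdict = verdicts[cond]["verdict"]
        assert verdict in ALLOWED, f"unexpected verdict: {verdict}"
        if verdict == "triggers":
            delta -= 1.0                # a break condition fired: downgrade
    for trig in thesis.validation_triggers:
        if verdicts[trig]["verdict"] == "validates":
            delta += 0.5                # validation is weaker evidence than breakage
    return delta                        # applied to conviction_score, clamped to [0, 10]
</code></pre></div></div>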

<h3 id="3-the-objective-you-actually-want-is-the-opposite-of-the-objective-personalization-usually-optimizes">3. The objective you actually want is the opposite of the objective personalization usually optimizes</h3>

<p>In most personalization work, the north star is preference-matching: the system that best predicts what the user wants is the best system. In high-stakes domains, that objective inverts. A system that reliably tells you what you want to hear is a system that reliably confirms your biases.</p>

<p>This isn’t a hypothetical. Sanz-Cruzado et al. (<em>Personalized Financial Advisors and LLM Personas</em>, 2025) found that users consistently preferred LLM financial advisors with more extroverted, confident personas — even when those advisors gave objectively worse advice. <strong>User satisfaction and advice quality were negatively correlated.</strong> Optimizing for the former actively hurts the latter.</p>

<p>The architectural response is to stop treating user satisfaction as a proxy for quality. Concretely, the report generator has two categories of content it is <em>required</em> to emit whenever relevant, regardless of whether the user wants to see it:</p>

<ol>
  <li><strong>Drift observations</strong> — cases where recent actions contradict stated theses or rules.</li>
  <li><strong>Pattern counterexamples</strong> — cases where the user’s stated reasoning for a current decision matches a historical pattern that previously led to a mistake.</li>
</ol>

<p>These aren’t generated by asking the model “is there anything the user should hear?” — that invites the sycophancy the paper above documents. They’re generated by separate deterministic checks (comparing recent trades against thesis conviction scores, matching current justifications against a pattern store) and <em>injected</em> into the report as mandatory sections. The LLM writes them up, but it doesn’t decide whether they appear.</p>
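
<p>The drift check itself is small. A sketch; the field names and the fourteen-day window are illustrative:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def drift_observations(trades, theses, window_days=14):
    """Deterministic drift check: recent buys into weakening theses.
    The LLM writes these up, but this function decides whether they appear."""
    recent = [t for t in trades if t.age_days &lt;= window_days]
    observations = []
    for trade in recent:
        thesis = theses.get(trade.ticker)
        if thesis is None or trade.side != "buy":
            continue
        if thesis.conviction_score &lt; thesis.conviction_at_entry:
            observations.append(
                f"Added to {trade.ticker} while thesis conviction fell "
                f"from {thesis.conviction_at_entry:.1f} to {thesis.conviction_score:.1f}."
            )
    return observations
</code></pre></div></div>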

<p>The uncomfortable implication: the best version of this kind of system is one the user sometimes actively dislikes.</p>

<h3 id="4-you-cannot-evaluate-the-system-by-whether-the-users-outcomes-improved">4. You cannot evaluate the system by whether the user’s outcomes improved</h3>

<p>This is the hardest one, and it’s a general problem for any personalization work in a consequential domain: the outcome signal is too noisy, too delayed, and too confounded to use as a training or evaluation target.</p>

<p>In investing specifically, a position held on a sound thesis can lose money because of an unrelated macro shock. A reckless impulse trade can make money because the market went up that week. If you grade the system on P&amp;L, you will eventually train it — or train the user it’s advising — to gamble, because gambling and investing look identical on any individual trade and only diverge over hundreds of decisions.</p>

<p>The alternative is to grade <strong>process</strong>, not outcomes. When a position closes, the evaluator answers three questions independent of the return:</p>

<ol>
  <li><strong>Directionality</strong> — was the thesis’s causal claim about the world directionally supported by what actually happened? (Not: did the stock go up.)</li>
  <li><strong>Timing</strong> — did the entry and exit correspond to the catalysts the thesis specified, or to unrelated noise?</li>
  <li><strong>Sizing consistency</strong> — was position size proportional to the conviction score at entry?</li>
</ol>

<p>A losing trade on a thesis that was directionally correct, entered on the right catalyst, and sized appropriately scores higher than a winning trade that violated all three. Over time, this grading signal is what the behavioral memory layer learns from — not the P&amp;L.</p>
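
<p>As a sketch, the three questions reduce to three independent booleans feeding one process score. The field names beyond the <code class="language-plaintext highlighter-rouge">Thesis</code> dataclass and the sizing tolerance are assumptions:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MAX_POSITION_PCT = 0.10   # illustrative sizing cap, not the system's actual value

def grade_process(closed_position, thesis):
    """Score a closed position on process; the return never enters the formula."""
    # 1. directionality: did the causal claim about the world hold up?
    directional = closed_position.thesis_outcome == "supported"
    # 2. timing: did entry/exit line up with a catalyst the thesis named?
    on_catalyst = closed_position.exit_trigger in thesis.catalysts
    # 3. sizing: was size proportional to conviction at entry?
    expected = (thesis.conviction_score / 10.0) * MAX_POSITION_PCT
    sized_right = abs(closed_position.size_pct - expected) &lt;= 0.25 * expected
    return (directional + on_catalyst + sized_right) / 3.0
</code></pre></div></div>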

<p>This is an instance of a general pattern that shows up any time you want to evaluate an LLM system operating in a domain with noisy ground truth: <strong>decompose the decision into intermediate artifacts that you can evaluate directly, and grade those, rather than waiting for the outcome.</strong> The same move works in clinical decision support (grade whether the differential was complete, not whether the patient recovered), legal research (grade whether the cited precedents are on point, not whether the case was won), and long-horizon planning (grade the checkpoint milestones, not the final goal).</p>

<hr />

<h2 id="why-this-generalizes-beyond-finance">Why This Generalizes Beyond Finance</h2>

<p>The four problems above aren’t really about investing. They show up whenever LLM personalization meets a domain with three properties:</p>

<ul>
  <li><strong>Consequential decisions</strong> — there’s a real cost to being wrong, so “what the user prefers” and “what actually helps the user” can diverge.</li>
  <li><strong>Temporally extended commitments</strong> — the user’s state at decision time matters, and the system has to carry structured reasoning across weeks or months, not just across a conversation.</li>
  <li><strong>Noisy outcome signals</strong> — you can’t close the loop with a simple reward, so you have to evaluate the process, not the result.</li>
</ul>

<p>Healthcare decision support, legal research assistants, long-horizon career and education coaching, and any agent that manages an ongoing plan all have these properties. Finance was useful as a testbed precisely because P&amp;L is measurable enough to <em>seem</em> like ground truth while being noisy enough that using it as one quietly destroys the system.</p>

<p>The pattern I’d offer if you’re building something in this space:</p>

<ol>
  <li><strong>Represent the user as multiple typed stores with different decay rates</strong>, and surface contradictions between them as output rather than hiding them.</li>
  <li><strong>Force structured evaluation against user-authored break conditions</strong>, not free-form generation over retrieved context.</li>
  <li><strong>Separate “content the user will like” from “content the user needs to see”</strong>, and generate the second deterministically.</li>
  <li><strong>Grade intermediate artifacts, not outcomes</strong>, when outcomes are noisy or delayed.</li>
</ol>

<p>None of these are new ideas on their own. What I underestimated going in was how aggressively the default LLM personalization stack — RAG over a flat profile, preference-conditioned generation, implicit reward modeling — resists each of them. Getting the four to work together required treating personalization less as a retrieval problem and more as a structured reasoning problem with the user’s own prior commitments as the spec.</p>

<p>The easy version of LLM personalization is mostly solved. The version that survives contact with consequential decisions is, as far as I can tell, still wide open.</p>

<hr />

<p>The full position paper is on arXiv: <a href="https://arxiv.org/abs/2604.04300"><em>High-Stakes Personalization: Rethinking LLM Customization for Individual Investor Decision-Making</em></a> — submitted to the <a href="https://customnlp4u-2026.github.io/">CustomNLP4U workshop at ACL 2026</a>.</p>]]></content><author><name>Yash Sawant</name></author><summary type="html"><![CDATA[Most LLM personalization research assumes the easy case: writing style, tone, topical preferences. What happens when personalization has to survive real consequences — when “getting the user what they want” and “getting the user what they need” actively diverge?]]></summary></entry><entry><title type="html">Gradient-Based LoRA Rank Allocation Fails in GRPO</title><link href="https://yashsawant22.github.io/2026/05/08/lora-grpo-rank-allocation.html" rel="alternate" type="text/html" title="Gradient-Based LoRA Rank Allocation Fails in GRPO" /><published>2026-05-08T00:00:00+00:00</published><updated>2026-05-08T00:00:00+00:00</updated><id>https://yashsawant22.github.io/2026/05/08/lora-grpo-rank-allocation</id><content type="html" xml:base="https://yashsawant22.github.io/2026/05/08/lora-grpo-rank-allocation.html"><![CDATA[<p>Adaptive rank allocation for LoRA — giving more capacity to layers that “matter” and less to layers that don’t — is one of those ideas that keeps getting validated. AdaLoRA, GoRA, IGU-LoRA, Aletheia, ILA — every recent paper says the same thing: profile the gradients, allocate rank where the gradients are large, save parameters, get the same accuracy.</p>

<p>So I did the obvious thing: ran the same recipe under GRPO instead of supervised fine-tuning, expecting a clean transfer.</p>

<p>It didn’t work. Adaptive allocation made the model <strong>worse</strong> than uniform.</p>

<p>This post is what I found, why I think it happens, and a debugging story along the way.</p>

<h2 id="the-setup">The setup</h2>

<ul>
  <li><strong>Model:</strong> Qwen 2.5 1.5B Instruct</li>
  <li><strong>Dataset:</strong> GSM8K (grade-school math word problems)</li>
  <li><strong>Method:</strong> GRPO (the algorithm from DeepSeekMath / R1) with multi-reward (format compliance + answer correctness)</li>
  <li><strong>LoRA target:</strong> all seven projection modules per layer (q/k/v/o/up/down/gate). 28 layers × 7 = 196 adapters total</li>
  <li><strong>Training:</strong> 1000 steps, batch=4, K=4 generations per prompt</li>
  <li><strong>Profiling:</strong> logged the per-layer gradient L2 norm at every step. 392,000 data points across the run</li>
</ul>

<p>For each rank-allocation strategy, I kept the <strong>total rank budget fixed at 896</strong> (= 28 × 32, the uniform baseline). So same parameter count, just distributed differently.</p>
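
<p>The profiling is a few lines hooked into the training loop. A sketch, relying on PEFT’s convention of including <code class="language-plaintext highlighter-rouge">lora_</code> in adapter parameter names:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import defaultdict

grad_history = defaultdict(list)

def log_lora_grad_norms(model, step):
    """Call between loss.backward() and optimizer.step(). One L2 norm per
    LoRA matrix per step: 196 adapters x 2 matrices x 1000 steps = 392,000 points."""
    for name, param in model.named_parameters():
        if "lora_" in name and param.grad is not None:
            grad_history[name].append((step, param.grad.norm().item()))
</code></pre></div></div>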

<h2 id="what-happened">What happened</h2>

<table>
  <thead>
    <tr>
      <th>Strategy</th>
      <th>Range</th>
      <th>Params</th>
      <th>GSM8K Acc</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Base model (no LoRA)</td>
      <td>—</td>
      <td>0</td>
      <td>66.0%</td>
    </tr>
    <tr>
      <td>Reduced 70%</td>
      <td>r=12–31</td>
      <td>25.8M</td>
      <td>65.0%</td>
    </tr>
    <tr>
      <td>Random</td>
      <td>r=16–48</td>
      <td>36.9M</td>
      <td>67.5%</td>
    </tr>
    <tr>
      <td>Proportional (gradient-aware)</td>
      <td>r=20–40</td>
      <td>36.9M</td>
      <td>70.0%</td>
    </tr>
    <tr>
      <td><strong>Uniform (r=32)</strong></td>
      <td>r=32–32</td>
      <td><strong>36.9M</strong></td>
      <td><strong>74.5%</strong></td>
    </tr>
  </tbody>
</table>

<p>Uniform won. By 4.5 points over the gradient-aware proportional allocation that “should” have been the optimum.</p>

<p>The interesting thing: gradient-aware <em>did</em> beat random. The signal is real — proportional knew which layers were hotter than random did. But it still lost to uniform. Knowing the signal didn’t help.</p>

<h2 id="whats-going-on-inside-grpo">What’s going on inside GRPO</h2>

<p>First weird thing: the gradient landscape under GRPO is far flatter than what people report under SFT.</p>

<p><img src="/assets/img/grpo-heatmap.png" alt="Per-layer gradient magnitude during GRPO training" /></p>

<p>Hottest layer (15) carries 4.68% of total gradient. Coldest layer (26) carries 2.15%. <strong>Max-to-min ratio: 2.17x.</strong></p>

<p>Compare that to ILA’s findings under SFT, where the top 30% of layers carry &gt;80% of the signal. Under GRPO, the top 30% carry only ~36%. The signal is <em>spread out</em> — every layer is doing work.</p>

<p>Second weird thing: the importance map is rock-stable across training. Early-vs-late training correlation is <strong>0.962</strong>. So the flatness isn’t an averaging artifact over noisy phases. It’s a structural property of how GRPO distributes its learning signal.</p>

<p>So one explanation for the negative result writes itself: under SFT there are genuinely idle layers whose capacity you can safely redistribute. Under GRPO there aren’t. Reduce a layer’s rank from 32 to 20 and you’ve broken something the model was relying on.</p>

<p>But there’s something stranger going on.</p>

<h2 id="the-amplification-effect">The amplification effect</h2>

<p>I profiled gradients during <em>every</em> run, not just the uniform baseline. Here’s what the proportional run looked like vs uniform:</p>

<p><img src="/assets/img/grpo-amplification.png" alt="Gradient amplification under non-uniform allocation" /></p>

<p>The spread <em>widens</em>. Uniform: 2.17x max/min. Proportional (same total budget): 3.00x. Reduced: 3.57x.</p>

<p>Hot layers given more rank absorb <em>more</em> gradient. Cold layers given less rank go quieter. The allocation creates a positive feedback loop — the rich get richer, the poor get poorer.</p>

<p>I ran the random allocation as a control to check whether this was just gradient-following gradient. Layer 1 is normally one of the <em>coldest</em> layers (3.09%). Random gave it rank 48. Its gradient share jumped to 4.21% — making it the second hottest layer in the network. Layer 15, normally the hottest, got rank 24 in the random allocation and dropped from 4.68% to 3.98%.</p>

<p>The correlation between allocated rank and gradient shift is <strong>0.972 for random</strong>, <strong>0.946 for proportional</strong>.</p>

<p>Read that again. Even when you allocate rank <em>with no relation to gradient importance whatsoever</em>, the gradient follows. <strong>Rank determines gradient importance, not the other way around.</strong></p>

<p>This is the part I didn’t expect, and it has uncomfortable implications for the entire profile-then-reallocate paradigm. The “important” layers your profiling identifies aren’t intrinsically important — they’re whatever layers happen to have rank to express themselves through. Move the rank, and the importance map moves with it.</p>

<h2 id="a-debugging-note-pefts-silent-wildcard">A debugging note: PEFT’s silent wildcard</h2>

<p>About halfway through this work, my proportional run had 34.6M params instead of the expected 36.9M. Something was wrong but no errors were thrown.</p>

<p>I had set up <code class="language-plaintext highlighter-rouge">rank_pattern</code> like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rank_pattern</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"model.layers.15.*.q_proj"</span><span class="p">:</span> <span class="mi">40</span><span class="p">,</span>  <span class="c1"># nope
</span>    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Turns out PEFT’s <code class="language-plaintext highlighter-rouge">rank_pattern</code> doesn’t treat <code class="language-plaintext highlighter-rouge">*</code> as a glob. It’s a regex match on the full module path, and <code class="language-plaintext highlighter-rouge">*</code> means “zero or more of the previous character.” So <code class="language-plaintext highlighter-rouge">model.layers.15.*.q_proj</code> matched essentially nothing in the actual module tree, and every module silently fell back to the default rank.</p>

<p>Fix: use exact paths.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rank_pattern</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"model.layers.15.self_attn.q_proj"</span><span class="p">:</span> <span class="mi">40</span><span class="p">,</span>
    <span class="s">"model.layers.15.mlp.up_proj"</span><span class="p">:</span> <span class="mi">40</span><span class="p">,</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Confirmed by inspecting actual LoRA matrix shapes after init. If you’re doing non-uniform LoRA with PEFT and your param count is suspicious, this is probably why.</p>
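
<p>That inspection is the definitive check, since a rank-r adapter’s <code class="language-plaintext highlighter-rouge">lora_A</code> weight has r rows. A minimal version, assuming PEFT’s default adapter name:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

effective_ranks = {
    name: param.shape[0]
    for name, param in model.named_parameters()
    if name.endswith("lora_A.default.weight")
}
print(Counter(effective_ranks.values()))
# a healthy non-uniform run shows the intended spread of ranks;
# a single all-32 bucket means every rank_pattern entry silently missed
</code></pre></div></div>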

<h2 id="what-id-take-from-this">What I’d take from this</h2>

<p>Don’t naively port SFT-era rank allocation to RL training. The two regimes have qualitatively different gradient structures — flat under GRPO, peaked under SFT — and the techniques that exploit one don’t transfer.</p>

<p>The amplification effect also means <strong>static profiling can’t be the answer for RL</strong>. Whatever you measure during a uniform-rank run is going to shift the moment you reallocate. If adaptive rank is going to work under RL, it probably has to be dynamic — adjusting continuously during training, AdaLoRA-style, rather than committing to a fixed allocation upfront.</p>

<p>The good news in the negative result: uniform is fine. It’s also the cheapest thing to do.</p>

<hr />

<p>The full paper is going up on arXiv shortly (submitted under cs.CL). Code is at <a href="https://github.com/yashsawant22/adaptive-lora-rank-grpo">github.com/yashsawant22/adaptive-lora-rank-grpo</a>. This work was submitted to the <a href="https://grigoris.ece.wisc.edu/workshops/colorai-icml-2026/">CoLoRAI workshop at ICML 2026</a>.</p>]]></content><author><name>Yash Sawant</name></author><summary type="html"><![CDATA[Adaptive rank allocation for LoRA — giving more capacity to layers that “matter” and less to layers that don’t — is one of those ideas that keeps getting validated. AdaLoRA, GoRA, IGU-LoRA, Aletheia, ILA — every recent paper says the same thing: profile the gradients, allocate rank where the gradients are large, save parameters, get the same accuracy.]]></summary></entry></feed>