João Amaro Lagedo
software engineer

Teaching a 3B model to write about music

I built a small macOS app called Ficino. It watches Apple Music, and every time the track changes it pulls the editorial notes, credits, charts and sample data from MusicKit and Genius, hands all of that to Apple's on-device foundation model, and slides a little floating panel out from the corner of the screen with a paragraph about the song.

No API calls. No subscription. No token meter. Everything happens on the laptop.

That last sentence sounds like a victory lap. It is not.

What you get for free

macOS 26 ships a foundation model built into the OS: roughly three billion parameters, quantized to two bits with quantization-aware training, and a 4,096-token context window, exposed to third-party apps through a Swift framework called FoundationModels.

You open a LanguageModelSession, you call respond(to:), you get a string back. The model never leaves the device.
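The whole call path fits in a few lines. A minimal sketch, guarded so it only compiles where the framework exists; the call names follow Apple's documentation, but treat the exact shape of the response (the .content accessor) as an assumption:

```swift
#if canImport(FoundationModels)
import FoundationModels

// Open a session, send a prompt, read a string back.
func commentary(for prompt: String) async throws -> String {
    let session = LanguageModelSession()            // the on-device 3B model
    let response = try await session.respond(to: prompt)
    return response.content                         // a plain String
}
#endif

// The framework is macOS 26-only, so the sketch above compiles out
// elsewhere; the prompt itself is just a string either way.
let prompt = "Write one sentence about \"Aja\" by Steely Dan."
```

There is no network client to configure and no key to store, which is the whole appeal.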

Apple is very precise about what this model is for. The official line is: "summarization, entity extraction, text understanding, short dialog, creative content, classification." And then, in the same breath: "not designed for world knowledge, not designed for advanced reasoning."

That second sentence is the whole shape of the problem, because music writing is world knowledge.

The rephrasing trick

So I don't ask the 3B who Steely Dan is. I tell it. MusicKit and Genius do the knowing; the model does the talking.

The prompt looks roughly like:

Now playing: "Aja" by Steely Dan. Known facts: Billboard #3, 1977. Grammy for Best Engineered Recording. Producer: Gary Katz. Session musicians: Wayne Shorter, Steve Gadd. Using only the facts above, write a short commentary.

It is retrieval-augmented generation with all of the retrieval done by Apple's APIs and a 3B model standing in as the stylist.
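In code, the retrieval half reduces to string assembly. A sketch of that step, assuming a struct of my own naming; the real app fills these fields from MusicKit and Genius responses:

```swift
// Hypothetical container for the facts the APIs hand back.
struct TrackFacts {
    let title: String
    let artist: String
    let facts: [String]   // chart positions, awards, credits, and so on
}

// Fold the facts into the prompt shape shown above.
func buildPrompt(_ t: TrackFacts) -> String {
    """
    Now playing: "\(t.title)" by \(t.artist). \
    Known facts: \(t.facts.joined(separator: ". ")). \
    Using only the facts above, write a short commentary.
    """
}
```

The model never has to know anything; it only has to not contradict the string it was just handed.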

This works, in the sense that the model says correct things. It fails, in every other sense.

Preambles. It opens every response with Here is a short description of the song or Sure! Here are three sentences about.

Hallucinations. Handed Kenshi Yonezu's IRIS OUT, half of a double A-side whose other half is JANE DOE, it confidently announced a vocalist named Jane Doe.

Misattributions. Handed Don Toliver's Body from the album OCTANE, it wrote a paragraph about a song called OCTANE, because OCTANE was the most prominent word on the page.

Boilerplate. It parrots editorial copy verbatim — every record is a highly anticipated sophomore album.

These are not Apple's bugs to fix. They are what a two-bit quantized three-billion-parameter model does when you ask it to behave in a domain it has not been trained on.

I tried prompts. I wrote fourteen versions, then seventeen, then eighteen. My best pure-prompt score on an LLM-as-judge rubric across 81 tracks was 13.0 out of 15, with eight preambles, seven hallucinations and two misattributions still sitting in the failure log.

I had reached the point where prompt engineering was a closed surface.

Why the adapter

LoRA — low-rank adaptation — is the cheap way to nudge a large pretrained model toward a specific task without retraining the whole thing. You train a small set of additive matrices that ride alongside the original weights.
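The arithmetic behind "small" is easy to sketch. For one adapted d×k weight matrix, LoRA trains only a rank-r pair: B (d×r) and A (r×k), whose product is added to the frozen weights. The layer size and rank below are illustrative numbers of my own choosing, not Apple's actual configuration:

```swift
// Illustrative layer size and rank, not Apple's real values.
let d = 2048, k = 2048, r = 32

let fullParams = d * k          // retraining the whole matrix
let loraParams = r * (d + k)    // just the two low-rank factors B and A
let fraction = Double(loraParams) / Double(fullParams)   // 1/32 of the layer
```

At rank 32 you are training about three percent of each adapted layer, which is why the whole artifact fits in a ~160 MB package.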

Apple's Adapter Training Toolkit produces a ~160 MB .fmadapter package that the OS loads at runtime against the same base model.
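Loading it back is a few lines of the same framework. The type and initializer names below follow Apple's adapter documentation, but verify them against your SDK seed; the file path is illustrative:

```swift
import Foundation

#if canImport(FoundationModels)
import FoundationModels

// Point a session at the trained adapter instead of the bare base model.
func adaptedSession(loading adapterURL: URL) throws -> LanguageModelSession {
    let adapter = try SystemLanguageModel.Adapter(fileURL: adapterURL)
    let model = SystemLanguageModel(adapter: adapter)
    return LanguageModelSession(model: model)
}
#endif

// The toolkit's output is a bundle with this extension; the name is mine.
let adapterURL = URL(fileURLWithPath: "music.fmadapter")
```

Everything downstream of the session is unchanged: same respond(to:), same on-device execution, same prompt.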

Going in, I was clear about what LoRA could not do. It cannot inject knowledge. It cannot make the 3B know that Body is a track and OCTANE is the album it sits on.

But I had a hypothesis: the preambles, the hallucinations and the misattributions are, underneath, the same failure. The model has been trained to be helpful, and helpful means filling gaps with plausible-sounding text.

If I could show it three thousand examples of the opposite — of taking the facts you were given and stopping the moment you ran out — maybe the pattern would carry. The hallucinations would go away not because the model learned the facts, but because it learned to stop reaching for them.

So I generated three thousand synthetic examples with Claude Haiku against the same MusicKit and Genius context the app produces at runtime, filtered them for faithfulness, and trained an adapter on a rented H100.

Two hours. Under ten dollars.
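The faithfulness filter can be sketched in a few lines. This is a hypothetical reconstruction of that step, not the actual pipeline: reject any example whose output opens with a preamble or quotes a title that never appears in the supplied facts. The phrase list and the quoting heuristic are mine:

```swift
import Foundation

// Preamble openers worth rejecting outright; an illustrative list.
let preambles = ["Here is", "Here are", "Sure!"]

func passesFilter(output: String, facts: String) -> Bool {
    if preambles.contains(where: { output.hasPrefix($0) }) { return false }
    // Split on double quotes: odd-indexed segments are the quoted spans.
    let segments = output.split(separator: "\"", omittingEmptySubsequences: false)
    let quoted = segments.enumerated()
        .filter { $0.offset % 2 == 1 }
        .map { String($0.element) }
    // Every quoted title must come verbatim from the facts.
    return quoted.allSatisfy { facts.contains($0) }
}
```

A check this crude would miss plenty, but catching the two loudest failure modes in the training set is most of the point.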

What came back

The same 81 tracks, the same judge, the same rubric: 13.9 out of 15.

The half-point jump over the best prompt is not the interesting number. The interesting number is the failure log: zero preambles, zero hallucinations, zero misattributions across every response. The worst output the adapter produced scored higher than the average output from the best prompt.

For IRIS OUT, the model now correctly identifies Utada Hikaru as the featured vocalist on JANE DOE. Body is Body. The model just starts the sentence — no Here is.

A bigger model would have known the music. The 3B cannot, and no amount of fine-tuning will change that. But once I stopped asking it for knowledge and started asking it for voice, there was nothing left for it to fail at.

That is the part I keep thinking about. The line between what an adapter can fix and what it cannot is not size. It is whether the task you handed the model is the same task it is actually being asked to do.