It's a shame, because large language models are mostly moving to 4 bit weights for inference, and a bunch of papers have shown promising techniques for training in 4 bit too...
Remember that switching from 16 bit to 4 bit lets you fit 4x as many weights in the same memory, load 4x as many weights from RAM per second, and use ~1/16 of the silicon area for the calculations (a multiplier's area scales with roughly the square of the number of bits). That smaller silicon area will let you do more per $ too...
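To make the memory half of that concrete, here's a minimal numpy sketch (mine, not from any particular library) that packs two 4-bit weight codes into each byte - a quarter of the storage of 16-bit weights:

    import numpy as np

    # Hypothetical example: 8 weights already quantized to unsigned 4-bit codes (0..15).
    codes = np.array([3, 15, 0, 7, 9, 1, 12, 5], dtype=np.uint8)

    # Pack two codes per byte: even indices in the low nibble, odd in the high nibble.
    # 8 weights take 4 bytes here, vs 16 bytes at 16 bits per weight.
    packed = (codes[0::2] & 0x0F) | ((codes[1::2] & 0x0F) << 4)

    # Unpack for the actual math (or do it on the fly inside the kernel).
    unpacked = np.empty_like(codes)
    unpacked[0::2] = packed & 0x0F
    unpacked[1::2] = packed >> 4

    assert np.array_equal(codes, unpacked)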
brucethemoose2 774 days ago [-]
There is some overhead from the quantization, and right now the operations themselves are sometimes done at higher precision than the weights in RAM.
And widespread hardware 4 bit will take some time. If the HW makers started designing 4 bit silicon in 2022, then we are still years away.
isoprophlex 774 days ago [-]
What?! Can you also train with quantization? Incredible! I'd have thought the gradients were way too ugly for any convergence with 4 bits.
Any particularly good papers you can recommend me on the topic?
woadwarrior01 774 days ago [-]
Here's a recent paper on training transformers with 4 bit integer weights: https://arxiv.org/abs/2306.11987
Their best performing 4-bit number format uses 1 sign bit, 3 exponent bits, and no mantissa bits!
I.e. all weights, activations and gradients become powers of two, which means all multiplications become simple bit shifts. That really changes the maths and the silicon design.
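To see why, here's a toy sketch (my own illustration, not the paper's code) of such a 1-sign / 3-exponent / 0-mantissa format: every weight is ±2^e, so multiplying an integer activation by a weight is just a shift plus an optional negation.

    # Toy 4-bit format: 1 sign bit, 3 exponent bits, no mantissa.
    # The exponent bias of 3 is made up for illustration, so values run from 2**-3 to 2**4.
    BIAS = 3

    def decode(code4: int) -> float:
        sign = -1.0 if (code4 >> 3) & 1 else 1.0
        return sign * 2.0 ** ((code4 & 0b111) - BIAS)

    def mul_by_weight(activation: int, code4: int) -> int:
        """Multiply an integer activation by a power-of-two weight using only a shift."""
        exp = (code4 & 0b111) - BIAS
        out = activation << exp if exp >= 0 else activation >> -exp
        return -out if (code4 >> 3) & 1 else out

    x = 40  # divisible by 8, so the negative exponents stay exact
    for code in range(16):
        assert mul_by_weight(x, code) == int(x * decode(code))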
Dylan16807 774 days ago [-]
Does it really make much of a difference?
You're usually feeding a ton of multiplies into an accumulator. You can handle one or two mantissa bits as the same bit shifting except that it outputs two or three numbers to accumulate. And accumulators are very easy to scale.
Also in the extreme I've seen powers of 4 get used.
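A sketch of the one-or-two-mantissa-bits point above (my own illustration): with one mantissa bit the weight is ±(1.m)·2^e, so the product is at most two shifted copies of the activation heading into the accumulator.

    def mul_one_mantissa_bit(activation: int, sign: int, exp: int, m: int) -> int:
        """Multiply by sign * (1 + m/2) * 2**exp using only shifts and adds."""
        def shift(x: int, e: int) -> int:
            return x << e if e >= 0 else x >> -e

        out = shift(activation, exp)           # the 1.0 * 2**exp term
        if m:
            out += shift(activation, exp - 1)  # the extra 0.5 * 2**exp term
        return -out if sign else out

    # 48 * (1.5 * 2**2) == 48 * 6 == 288, computed as (48 << 2) + (48 << 1)
    assert mul_one_mantissa_bit(48, sign=0, exp=2, m=1) == 288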
londons_explore 774 days ago [-]
At just 4 bits, there are only 16 possible numbers. It becomes lookup table territory - and there is no need for the numbers on your number line to be linearly or exponentially spaced - you can assign them arbitrarily. For example, you could have a number system consisting of: (+-) 0.5, 1, 2, 3, 5, 10, 1000, 1000000 - getting some nice accuracy in the middle of the number line where you expect most values to lie, plus some extreme values so convergence doesn't take forever if some big activation/gradient needs to be propagated.
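A minimal sketch of that kind of lookup-table quantizer, using exactly the made-up 16-value table above (nothing standardized):

    import numpy as np

    # Hypothetical codebook: +/- {0.5, 1, 2, 3, 5, 10, 1000, 1000000} = 16 values.
    LUT = np.array([0.5, 1, 2, 3, 5, 10, 1000, 1000000], dtype=np.float32)
    LUT = np.concatenate([-LUT[::-1], LUT])

    def quantize(x: np.ndarray) -> np.ndarray:
        """Each value maps to the index (0..15) of the nearest codebook entry."""
        return np.abs(x[:, None] - LUT[None, :]).argmin(axis=1).astype(np.uint8)

    def dequantize(codes: np.ndarray) -> np.ndarray:
        """A 4-bit code is just an index into the table."""
        return LUT[codes]

    x = np.array([0.3, -2.4, 7.0, 41000.0], dtype=np.float32)
    print(dequantize(quantize(x)))  # roughly [0.5, -2.0, 5.0, 1000.0]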
kamilner 774 days ago [-]
The more recent 4 bit quantizations are almost along these lines. Q4_1 in ggml, for example, takes a block of 32 weights, gives each block a scaling factor 'd', and takes the minimum of the weights 'm' to be the quantized '0', so the final weight from a quantized weight 'q' is q * d + m. Taking a relatively small block size makes it more likely that those are all within a reasonable quantization range. Notably, d and m can be stored with more accuracy without sacrificing too much space, since the overhead is divided by 32. Q4_K goes a bit further: it takes 'superblocks' of 8 blocks and applies another scaling factor 'd_s' and minimum 'm_s' to those, so the final weight is (q * d + m) * d_s + m_s, and the additional factors are stored as 6 bits instead of 4.
In practice this seems to get very good results while being cheap to implement and relatively space efficient; Q4_K, for example, works out to 4.5 bits per weight instead of 4. The PR adding it has more details: https://github.com/ggerganov/llama.cpp/pull/1684
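For anyone who wants the idea in code, here's a rough sketch of the Q4_1-style scheme described above (simplified; the real ggml block layout and rounding differ in the details):

    import numpy as np

    BLOCK = 32  # Q4_1 block size

    def q4_1_quantize(w: np.ndarray):
        """Quantize a block of 32 fp32 weights to 4-bit codes q plus per-block (d, m),
        so each weight is reconstructed as q * d + m."""
        assert w.shape == (BLOCK,)
        m = w.min()
        d = (w.max() - m) / 15.0
        if d == 0:
            d = 1.0  # flat block: any scale works
        q = np.clip(np.round((w - m) / d), 0, 15).astype(np.uint8)
        return q, np.float32(d), np.float32(m)

    def q4_1_dequantize(q, d, m):
        return q.astype(np.float32) * d + m

    w = np.random.default_rng(0).normal(size=BLOCK).astype(np.float32)
    q, d, m = q4_1_quantize(w)
    print(np.abs(w - q4_1_dequantize(q, d, m)).max())  # worst-case error is about d/2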
londons_explore 774 days ago [-]
Very efficient for storage and memory bandwidth, but such a scheme is a headache for high throughput hardware implementations (at least compared to regular 4 bit math, which can be packed really really densely)
vmirnv 773 days ago [-]
Also I would highly recommend Q5_K_M for both 7B and 13B models. It has the best balance between quality and model size, and is almost indistinguishable from the original f16: https://www.reddit.com/r/LocalLLaMA/comments/142q5k5/updated...
I dimly remember reading that the mathematical compute-per-density optimum is around 3.x bits in a "brain-like structure", but I don't remember any details or the precise context. Does this ring a bell with anyone?
maximilianburke 774 days ago [-]
Is it possible we will eventually see 1-bit weights in use?
brucethemoose2 774 days ago [-]
There are already papers on it, and there is 2-bit quant in llama.cpp.
But it seems to be past the point of diminishing returns, where you might as well use a model with fewer parameters... For now.
There was another scheme in a paper where the "sparse" majority of the model was highly quantized, while the "dense" part was left in FP16, with good results.
touisteur 773 days ago [-]
For some time I played with Brevitas and Xilinx's FINN, and you could quantize like crazy. I haven't looked at where they are now, since transformers took over the AI world.
dlewis1788 774 days ago [-]
Confirmed: Apple M1 lacks bfloat16 support completely.
M1:
hw.optional.arm.FEAT_BF16: 0
vs
M2:
hw.optional.arm.FEAT_BF16: 1
londons_explore 774 days ago [-]
Luckily BF16 is just a truncated FP32. That means the hardware can do BF16; you just don't get any performance benefit compared to FP32 (and depending on the hardware design, you might also have to space the data 4 bytes apart rather than 2), so you lose the memory bandwidth and RAM usage benefits too.
sillysaurusx 774 days ago [-]
At that point it’d be better to do everything in fp32. The hardware can’t do bf16 in the way you’re saying; the conversions would consume all your time.
BooneJS 774 days ago [-]
Compute in F32, but then round and pack a pair of BF16 into 4 bytes.
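A sketch of that round-and-pack step (round-to-nearest-even here; as discussed below, some implementations just truncate):

    import numpy as np

    def to_bf16_bits(x: float) -> int:
        """Round an fp32 value to bf16 and return its 16 raw bits."""
        bits = int(np.array(x, dtype=np.float32).view(np.uint32))
        bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the dropped half
        return (bits >> 16) & 0xFFFF

    def pack_bf16_pair(a: float, b: float) -> int:
        """Pack two bf16 values into one 32-bit word (a in the low half, b in the high)."""
        return to_bf16_bits(a) | (to_bf16_bits(b) << 16)

    print(hex(pack_bf16_pair(3.14159, -2.71828)))  # 0xc02e4049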
brrrrrm 773 days ago [-]
The conversions are just a mask and shift? Super cheap
stephencanon 773 days ago [-]
You still get a perf benefit from half the memory traffic and keeping twice as much data in caches, since you can do the expansion to f32 when loading into registers.
pklausler 774 days ago [-]
Conversions from IEEE-32 to BF16 don't round?
londons_explore 774 days ago [-]
I don't believe the standard defines it. I believe implementations truncate (i.e. round towards zero).
Remember BF16 was invented specifically to be backwards compatible with existing silicon - and pulling 2 bytes out of 4 is a far cheaper operation than any rounding.
kelnos 774 days ago [-]
Just to elaborate, as I was confused about this and had to look it up: BF16 is indeed designed to just be a truncated F32. You can grab the top 16 bits of an F32 value and it'll still "make sense": the sign bit is in the same place in both (unsurprisingly), and the exponent fields of BF16 and F32 are both 8 bits. For the mantissa, you end up grabbing the top 7 bits of the F32's 23-bit mantissa, so it all works out, and this will "round" the value toward zero.
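A quick sketch showing exactly that (mine): keep the top 16 bits of the fp32 encoding and you get a valid bf16, with the dropped mantissa bits simply truncated toward zero.

    import struct

    def f32_bits(x: float) -> int:
        """Raw IEEE-754 single-precision bits of x."""
        return struct.unpack("<I", struct.pack("<f", x))[0]

    def bf16_truncate(x: float) -> float:
        """Zero the low 16 bits, keeping sign, the 8 exponent bits and the top 7 mantissa bits."""
        return struct.unpack("<f", struct.pack("<I", f32_bits(x) & 0xFFFF0000))[0]

    x = 1.001
    print(f"{f32_bits(x):032b}")   # 1 sign bit | 8 exponent bits | 23 mantissa bits
    print(bf16_truncate(x))        # 1.0  (truncated toward zero)
    print(bf16_truncate(-x))       # -1.0 (also toward zero)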
pclmulqdq 773 days ago [-]
There's no standardized definition of BF16.
dlewis1788 774 days ago [-]
Somehow missed this from WWDC23, but it looks like Sonoma will add support for bfloat16 with Metal, and there's an active PR to add support with the PyTorch MPS back-end (PR #99272). Since M2 added bfloat16 support at the hardware level, I'm assuming this will only be supported on M2 Macs.
That maxed out Mac Studio M2 w/ 192GB of memory now looks more appealing...
This matfp instruction computes an outer product and is a kernel for matrix multiplication.
dlewis1788 774 days ago [-]
I didn't even know about Apple's AMX instructions until I clicked on your link. Very interesting - thanks!
my123 774 days ago [-]
bf16 in Metal on macOS 14 is supported on all Macs. Emulated in software transparently.
LoganDark 774 days ago [-]
Yeah, Metal is pretty great because it runs the same on all Macs. Apple is really really good at this.
victor106 773 days ago [-]
I think the trillion dollar question is: can Apple ever make Macs / GPUs that compete with NVIDIA?
hospitalJail 774 days ago [-]
Maybe someone can help me understand why people are investing into this.
Inhousing typically means falling behind in technology but having lower operating costs. That makes the company win, not the users.
If you hinge your career on Apple, they might make your technology obsolete on a dime.
It's not the fastest, it's not the best, it's not the cheapest, and it's not some combination either.
> 'compute per watt'
With AI? Local LLM models are already near useless. There will be a time to cut down on power, but from what I've read, there is currently ~no value even with a 4090 and 512 RAM.
I suggest avoiding Windows/M$, I am annoyed with Linux bugs, and Google cannot be trusted. But all of that could be said about Apple as well.
I just don't see a future with Apple hardware; it gives me some serious Nintendo vibes, where they are going to be some quirky niche that is just enough for marketers to sell. Compute per watt seems like a Wiimote that no one asked for but is suddenly claimed to be ultra important.
Maybe someone can change my view. I don't see who buys this when they are educated on the possible options.
brucethemoose2 774 days ago [-]
> Maybe someone can help me understand why people are investing into this.
Buying a Mac for running LLMs is kinda like buying a Mac for gaming. It's theoretically interesting, but I don't think that's a serious driver of Mac sales.
But:
- Finetuned local LLMs are good for specific niches, like roleplaying, text games, and helper bots for your own pile of data. And they are getting better at other niches like code completion for specific languages, or summarization.
- Remember that a huge selling point for Macs is iPhone/iPad development. The market for AI App Store apps is not small. This is also a reason to believe there will be some stability in the ML support.
MuffinFlavored 774 days ago [-]
> - Finetuned local LLMs are good for specific niches, like roleplaying, text games, and helper bots for your own pile of data.
I can't see how they don't hallucinate / aren't leagues away from GPT-3.5, let alone GPT-4, in quality of output. Am I mistaken?
brucethemoose2 774 days ago [-]
They are better than GPT-3.5 (which I am generally not impressed with), but not as good as GPT-4.
Again, the specialized variants perform very well in their niches.
astrange 772 days ago [-]
Hallucinations are exactly what you want in a gaming model. That's another way of saying "creativity".
gmerc 773 days ago [-]
You seem to assume hallucinations are a fatal flaw. Give it a document to summarize and see how often it hallucinates: very little. Human performance.
Now how often does a human make random shit up about general knowledge questions?
Me1000 774 days ago [-]
There are a lot of ML applications outside of LLMs. Why would a developer invest in it? Because there are hundreds of millions of iOS devices out there where computer vision, text recognition, etc would be useful features.
Geee 774 days ago [-]
Desktop computers are heat-limited. We could have much faster computers if we found a way to cool them down. Thus, compute per watt is the ultimate metric to optimize for. If your cooling capacity is 500W, then obviously you'll want to fit as much compute in that as possible.
Mobile devices are energy-limited. You'll want to do as much compute as possible on a limited battery.
IOT_Apprentice 774 days ago [-]
My question to you is: what are you currently using as an alternative for the CPU/SoC in your personal & work environments?
Intel? AMD Ryzen?
Apple has taken their ARM approach and scaled it to all their platforms.
Amazon is now on what, Gen 2 or 3, of their Graviton platform in AWS.
And what OS are you using if you don’t trust Microsoft, Linux or Apple?
brucethemoose2 774 days ago [-]
CPU arch isn't even that critical here, as Apple is talking about Metal.
minimaxir 774 days ago [-]
I'm still confused by the proliferation of bf16. Although it certainly doesn't hurt compared to fp16, in my testing even with A100 GPUs optimized for it, both training speed and inference quality are the same between bf16 and fp16.
redox99 774 days ago [-]
Sometimes during training, networks that would converge in fp32 will explode to Infs or NaNs in fp16, because of its limited range. bf16 generally speaking fixes that.
It's also true that fp16 is often manageable with enough batch/layer norm and gradient clipping.
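A quick way to see the range difference in PyTorch (standard format limits, nothing model-specific):

    import torch

    print(torch.finfo(torch.float16).max)   # 65504.0
    print(torch.finfo(torch.bfloat16).max)  # ~3.39e38, the same exponent range as fp32

    g = torch.tensor(70000.0)               # e.g. a large gradient or activation
    print(g.to(torch.float16))              # inf -- the kind of blow-up loss scaling works around
    print(g.to(torch.bfloat16))             # 70144, coarse but finite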
voz_ 774 days ago [-]
Yea, I spent a few months comparing the two, and empirically I had a lot more issues with various normalized entropy problems (explosion, not converging, converging slower) with fp16 than with bf16.
The transfer pipeline I wrote for fp32->fp16 also took a lot more work than the one for fp32->bf16.
dlewis1788 774 days ago [-]
My understanding is that for certain types of networks BF16 will train better than FP16, given the additional protection against exploding gradients and losses that BF16's extended range provides - at the cost of some precision.
YetAnotherNick 774 days ago [-]
bf16 is generally easier to train a neural network with than fp16, since there's no need for loss scaling. And most models train and run inference the same with fp32 and bf16.
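For reference, "scaling" here means the usual fp16 loss-scaling dance. A minimal PyTorch sketch of the difference (assumes a CUDA device; the toy model and data are just placeholders):

    import torch

    device = "cuda"
    model = torch.nn.Linear(16, 1).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x, y = torch.randn(8, 16, device=device), torch.randn(8, 1, device=device)

    # fp16: scale the loss so small gradients don't flush to zero, and let the
    # scaler skip steps where inf/nan gradients show up.
    scaler = torch.cuda.amp.GradScaler()
    with torch.autocast(device, dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
    opt.zero_grad()

    # bf16: same exponent range as fp32, so a plain backward/step works.
    with torch.autocast(device, dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()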
bravura 774 days ago [-]
Despite the other answers, I will tell you the grim truth: Your mileage might vary.
It's an empirical question and depends upon the nature of your problem and data. You should try all three of fp32, fp16, and bf16 as part of your model selection / hyperparameter tuning.
For example, in audio generative models (where typical output is 16-bit), I've sometimes found that fp16 and bf16 just don't produce output as good as fp32 weights do.
gok 774 days ago [-]
Fp16 makes it easy to accidentally overflow, especially around summation operations.
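For example (a small numpy sketch): values that individually fit in fp16 can still overflow an fp16 accumulator, which is why kernels usually accumulate in fp32 even for half-precision inputs.

    import numpy as np

    vals = np.full(100, 1000.0, dtype=np.float16)  # every element fits comfortably in fp16

    acc = np.float16(0.0)
    for v in vals:                         # naive fp16 accumulator
        acc = np.float16(acc + v)          # numpy may warn about overflow here
    print(acc)                             # inf -- the running sum blows past 65504

    print(vals.astype(np.float32).sum())   # 100000.0 -- accumulate in fp32 instead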
bobbylarrybobby 774 days ago [-]
(Not an ML guy.) bf16 and fp16 should be comparable if the weights are of the same magnitude, but what happens in a network where the weights are poorly regularized?
dlewis1788 774 days ago [-]
Someone commented below that with enough batchnorm/layernorm/etc. and/or gradient clipping you can manage it, but BF16 just makes life easier if you can live without some precision.