It is a little bit of a weird article. It seems like they alternate between pointing out that the new Epyc knocks the socks off the Altra Max, and noting that the Max is a 2 year old chip and that we don't know enough about its successor to compare it to the Epyc.
It is of course always possible that Ampere has completely shit the bed and the AmpereOne won’t be an improvement, but I’ll give them the benefit of the doubt I guess.
awill 782 days ago [-]
It's not like there's that much magic to it. Ampere is just using off-the-shelf Arm-designed cores. Altra was an Arm Neoverse N1.
Presumably Altra-Next will just use Neoverse N2 (or N3).
rgbrenner 782 days ago [-]
No, they’re using their own designs for AmpereOne, and all chips going forward.
awill 782 days ago [-]
Interesting. Then they'd better be much faster than an off-the-shelf Arm core.
That takes some serious investment. Qualcomm tried it, and eventually gave up. Isn't Apple the only company outdoing Arm?
StillBored 782 days ago [-]
Right, it's easy to name a half dozen companies that designed their own cores that turned out to be worse or no better than the Arm offerings.
The real hubris at these companies is thinking they can build a team that can create a better chip in one generation. Apple is on what, the 10th public generation of their own cores, from a team they acquihired that was already producing CPUs? How many generations/respins did it take before they replaced the Arm IP with their own designs, and then how many generations was it before they were faster? Not only that, but Arm seems to have gotten serious a few years ago and their IPC is within striking distance of the best AMD/Intel products. They are no longer doing obviously stupid things, so it seems odd that a company like Ampere doesn't have another respun Altra with an N2/V2 sitting on the sidelines as a fallback for when their own design fails.
The specs there sound seriously impressive to me, and like they might be getting ready to leave AMD and Intel behind in terms of IPC (for their highest performing chips).
ARM's clock speeds are much lower, so single core performance will probably be worse. But I'd guess server clock speeds may be similar.
_a_a_a_ 782 days ago [-]
Having an N-wide dispatcher means nothing unless the software can use it, and server clock speeds tend to be lower than desktops.
Disclaimer: I don't know what I'm talking about
celrod 782 days ago [-]
Given that the M1 and M2 perform similarly to AMD and Intel CPUs with far higher clock speeds, it seems most software can use wider dispatch.
Note that dispatch doesn't mean vector width, which is harder for software to take advantage of.
It means how many uops the pipeline can handle/clock cycle.
_a_a_a_ 782 days ago [-]
A basic block is usually taken to be about 6 instructions. I suppose if you take one or two speculated branches as well then you might easily get to your possible 10 dispatches. Perhaps.
As for higher clock speeds, there's a whole lot more that matters such as pipelining instructions, cache sizes, and any number of other things. Clock speed by itself isn't particularly revealing.
celrod 781 days ago [-]
FWIW, IIRC my Skylake-X CPU normally has around 2 instructions per clock when I run perf.
It has a pipeline width of 4 uops internally.
So it's falling far short of a typical basic block/clock cycle.
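(A minimal way to reproduce that kind of measurement, assuming a Linux box with perf installed; the file name and the loop are just an arbitrary stand-in workload, purely illustrative.)

    /* ipc_demo.c -- toy workload for reading IPC out of perf.
       Build:    cc -O2 ipc_demo.c -o ipc_demo
       Measure:  perf stat -e instructions,cycles ./ipc_demo
       With both events counted, perf typically reports the ratio as "insn per cycle". */
    #include <stdio.h>

    int main(void) {
        volatile unsigned long long acc = 0;  /* volatile so the loop isn't optimized away */
        for (unsigned long long i = 0; i < 100000000ULL; ++i)
            acc += i * 3 + 1;
        printf("%llu\n", acc);
        return 0;
    }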
I would also not expect the X4 to utilize that full width in any real workload, but it only needs a fraction of its full width to get more IPC than the X64 competition.
But I'd expect it is a reasonably balanced chip (why waste silicon?), and thus the wide pipeline is an indicator of the chip itself being wide with immense out of order capability.
Branch prediction rates tend to be extremely high. The X4's frontend also has 10 pipeline stages, which means ideally it'd be correctly predicting all branches at least 10 cycles into the future, so that on clock cycle `N-10`, the frontend can get started on the correct instructions that'll be needed on clock cycle `N`.
The difference between 1 basic block/cycle and >1 basic block/cycle is really small; it already needs a long history of successful predictions to get 1.
But of course, each mispredict is extremely costly.
As for bringing up clock speeds and the M1, my point there was that the M1 has already left Intel and AMD behind in terms of IPC; it achieves similar performance despite much lower clock speeds.
My original comment said that the ARM Cortex X4 looks like it is starting to leave Intel and AMD behind in terms of IPC, and I used the width as an indicator.
You responded saying that the software has to actually allow for this. Yet the M1 example shows that existing software does in fact allow for significantly more out of order execution than Intel and AMD CPUs achieve.
So you could argue that, unlike the M1, the Cortex X4 will not be able to realize such an advantage.
While plausible, if it does fail to do so, we at least won't be able to blame the software, because the M1 is able to do so despite the software.
It'd have to be some deficiency of the X4 relative to the M1 -- such as cache sizes, memory bandwidth...
Hopefully it does turn out to be a great chip! But that remains to be seen.
_a_a_a_ 781 days ago [-]
I can't imagine getting every instruction in a basic block started at every clock. There are almost certainly dependencies within the block. I talked about basic blocks because that would mean that if you want to kick off more instructions than are in the block, you'd have to speculate about the branch taken. And I am sure you can start executing instructions down a speculated jump, only I don't know how far. I also don't know if you can speculate past a write to RAM. I'd like to know.
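(An illustrative toy C sketch of the dependency point; the function names are made up. Both loops do the same summation, but the first is one long dependency chain through the floating-point adder, while the second keeps four independent chains in flight, so an out-of-order core can typically run it several times faster. Build with plain -O2, without -ffast-math, so the compiler keeps the serial chain.)

    #include <stddef.h>

    /* each add waits on the previous result: limited by FP add latency, not core width */
    double serial_sum(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    /* four independent accumulators: the out-of-order core can overlap the adds */
    double parallel_sum(const double *a, size_t n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];     s1 += a[i + 1];
            s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; ++i) s0 += a[i];  /* leftover elements */
        return s0 + s1 + s2 + s3;
    }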
> Yet the M1 example shows that existing software does in fact allow for significantly more out of order execution than Intel and AMD CPUs achieve
My point is, shortening the pipeline needed to execute an instruction would also get higher performance. Perhaps they invented a better cache, perhaps larger, perhaps more associative, perhaps…? There's more than one way of increasing performance besides IPC and clock.
celrod 781 days ago [-]
I'd say the ways to increase performance are
1. decrease the number of instructions needed (has to be done in software, but also dependent on ISA, e.g. using AVX512 can help a lot here, so long as you don't end up executing more scalar epilogue iterations; see the sketch after this list).
2. increase IPC (obviously software can help a lot here)
3. increase clocks (not much software can do here; wider instructions are generally worth it, so if choosing between "1." and "3." in software, it's generally better to favor "1.", especially on more recent CPUs that don't have downclocking problems).
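(A rough, illustrative sketch of point 1; it assumes AVX-512F and compilation with -mavx512f, and the function names are made up. The scalar loop retires roughly one add instruction per element, while the vector loop handles 16 floats per add, leaving only the scalar epilogue mentioned above.)

    #include <immintrin.h>
    #include <stddef.h>

    float sum_scalar(const float *a, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; ++i)
            s += a[i];                       /* ~1 add instruction per element */
        return s;
    }

    float sum_avx512(const float *a, size_t n) {
        __m512 acc = _mm512_setzero_ps();
        size_t i = 0;
        for (; i + 16 <= n; i += 16)         /* 16 elements per add instruction */
            acc = _mm512_add_ps(acc, _mm512_loadu_ps(a + i));
        float lanes[16], s = 0.0f;
        _mm512_storeu_ps(lanes, acc);        /* horizontal reduction of the 16 lanes */
        for (int k = 0; k < 16; ++k) s += lanes[k];
        for (; i < n; ++i) s += a[i];        /* the scalar epilogue for the remainder */
        return s;
    }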
Design of the CPU can also influence all three of these.
Things like better cache, better branch predictors and shorter pipelines, will all help IPC.
> My point is, shortening the pipeline needed to execute an instruction would also get higher performance.
This wouldn't increase throughput if 100% of branches are predicted correctly -- except for the extra cycles before instructions start executing.
It'd decrease branch mispredict penalties though, which is a big deal and would help in practice. The Cortex X4 did shave off a frontend pipeline stage relative to the Cortex X3 (11 -> 10). This is better than Intel Alder Lake. One contributor is probably that it is easier to decode ARM instructions in parallel without needing multiple pipeline stages (e.g., one to find out where variable width instructions end before the instruction byte stream can be sent to decoders [not an issue if instructions are already in the uop cache]).
> There's more than one way of increasing performance besides IPC and clock.
I assume by "IPC" here you mean dispatch width?
Things like a better cache for fewer misses, better prefetching, better branch prediction, larger reorder buffers so that it can speculate further ahead before stalling, all help IPC.
Zen1 CPUs (6 uops) were already wider than Intel Skylake (4 uops) and Ice/Tiger lake (5 uops), matching Alder Lake (6 uops).
But they were obviously far behind in IPC (and Zen1 in particular also decoded AVX2 instructions into 2 uops).
Zen1 has SMT, which was part of the reason to go wide early on: the frontend wasn't good enough to feed that width with a single thread, but using two threads could mitigate that. Early on, Zen1 (and the Zen family) generally did better in multithreaded than single threaded benchmarks thanks to that approach.
The ARM Cortex X4 doesn't have SMT, so it's taking a different approach to performance.
A single number isn't going to be representative of performance across benchmarks or all the tasks you're interested in.
Unfortunately, I think it'll be more than a year before we can see the Cortex X4 (as it's aiming at TSMC N3E), but I'm definitely looking forward to deep dives into its performance (and also that of Intel's Meteor Lake, Zen5, etc).
tracker1 781 days ago [-]
Even then... in what ways are they actually faster? It seems that a lot of the custom (non-Arm) processing units are what give them the boost in a few cases, and compared to Nvidia and AMD they are still a bit behind there. There are a lot of ways to approach this, and in Apple's case, where they're really winning is power utilization in terms of performance/watt.
wmf 782 days ago [-]
Ampere is on their fourth generation custom core but none of the earlier X-Gene cores were any good so I don't know if they're learning from their mistakes.
KingOfCoders 782 days ago [-]
Apple does it mostly with the best fab tech (3nm next [0]), one generation ahead of everyone else, and with lots of cache (expensive, but they have the margins by being expensive and vertically integrated), more than everyone else.
Apple doesn't do it with ARM design in this sense.
[0] https://www.macrumors.com/2023/02/22/apple-secures-tsmc-3nm-...
The current A15 is still on 5nm but outperforms the latest Snapdragon that's on 4nm - the fastest Android Arm chips were matching iPhone chips from 2 generations back, last time I checked Geekbench.
p_l 782 days ago [-]
The part where Apple can (or is courageous enough to) plop in large caches and other otherwise "trivial but expensive" performance enhancements is the critical point. The vertical integration also means that they get to set the parameters of the planned device, instead of making the compromises for market share that Qualcomm or Samsung end up making.
KingOfCoders 782 days ago [-]
Difficult to get information; it seems like the A15 has 2x 3.2 GHz cores whereas the S8G2 has 1x 3.2 GHz core. Searched a little bit on Google but could not come up with detailed - only fragmented - information on L1/L2/L3 caches for the A15 and S8G2.
But I thought we were talking Server/Desktop chips, not phone chips (I have no clue about them, my MI11Ultra seems fast enough for everything, I'm not compiling on it).
And the M2 has massive caches (L1/L2/L3): https://en.wikipedia.org/wiki/Apple_M2
How much performance you get from caches is shown by the X3D models, which - in gaming performance - are faster than more expensive Intel/AMD chips.
        M2 Max    Ryzen 7900
Cores   12        12
L1      320 KB    64 KB
L2      32 MB     12 MB
Last    48 MB     64 MB (32 MB/chiplet)
senttoschool 782 days ago [-]
Nuvia should outdo ARM - if the lawsuit preventing Qualcomm from using Nuvia cores fails.
I do expect Ampere to succeed in outperforming stock ARM core designs. Otherwise, there'd be no reason to have embarked on a very risky and expensive journey to have their own custom cores.
jacquesm 782 days ago [-]
That's circular reasoning. They may simply not succeed because their risky and expensive journey didn't work out. They can't retro-actively redo their reasoning.
awill 782 days ago [-]
Exactly.
They're doing it because they THINK they can outdo Arm.
Arm isn't standing still, and they would have had to start years ago.
bee_rider 782 days ago [-]
Maybe they started with N2 as a baseline and improved from there. If their 3 year old chips were performance/power competitive with the new AMD ones, then even if they only tweaked N2 a little bit they would seem to have a likely winner?
411111111111111 782 days ago [-]
Weren't the tensor chips in the pixel phones based on arm too?
And in the Datacenter you've also got google and Amazon making their own arm based chips, not sure if MS does the same.
However, Apple's M1 chip had the biggest impact on consumers, I think. The Tensor chips aren't particularly better, and we just don't have any direct interaction with server hardware.
rgbrenner 782 days ago [-]
Google Cloud, Azure, and Oracle use Ampere Altra (ARM Neoverse)
AWS used Neoverse for Graviton.
Google Cloud is designing their own ARM chips: they're working on one design from Marvell (likely Neoverse), and their own custom design. These are in the design phase, and very unlikely to be produced this year.
Re: Google Tensor chips on phones: Those used ARM Cortex CPU and ARM Mali GPU, and a Google designed Tensor processing unit for machine learning. The google TPU replaces chips like Qualcomm Hexagon. The TPU does not use an ARM instruction set.
cjbprime 782 days ago [-]
I think I read that AmpereOne isn't Neoverse.
qwertox 782 days ago [-]
Off Topic: I wish we wouldn't use expressions like "shit the bed" here on HN.
BiteCode_dev 782 days ago [-]
Off Topic: I wish we wouldn't complain about the use of expressions like "shit the bed" here on HN.
solarkraft 782 days ago [-]
Off Topic: I like the use of normal, everyday language on HN.
dijit 782 days ago [-]
Do you have a reason to dislike it? Is it because it's a colloquialism/ambiguity, or because you consider “shit” to be a naughty word?
_a_a_a_ 782 days ago [-]
There's a third option. To some people the word shit is just a word when used as a word and no more. To others it can invoke strong and vivid visual imagery, and by that, provoke a strong discomfort and distaste (as it happens I'm somewhere in the middle). So it could be down simply to that in which case it's just consideration for other people's feelings.
I too prefer colloquialisms and straightforward speaking, and I swear plenty, but if it makes someone uncomfortable, that could be considered bad manners.
boredumb 782 days ago [-]
Not inclusive language for people who are incontinent
tw1984 782 days ago [-]
It is quite amazing given that just 5 years ago Intel PR mouthpieces were still trying to convince people that they don't need anything more than 4 cores on desktops.
rewmie 782 days ago [-]
Of course Intel would prefer to sell 32 expensive server CPUs instead of just one expensive server CPU.
Competition is good. We are lucky to have AMD.
dogma1138 782 days ago [-]
So would AMD; density optimized cores like Zen4C would likely not be a priority without e-cores and products like Sierra Forest being on the roadmap. The question now is just how far this density optimization path will go. AMD right now minimized the core size by stripping out cache and lowering the frequency target considerably. Intel went with an approach that also includes a full hybrid design where you get smaller cores with fewer features, hence why the e-cores are still far smaller than Zen4C.
senttoschool 782 days ago [-]
To be fair, most people don't need more than 4 cores on desktops.
Zen4c and Ampere chips are for data centers.
bee_rider 782 days ago [-]
These are server/workstation chips anyway.
10 years ago Intel was selling Xeon Phis, 60 little cores on a PCIe card.
Coincidentally the CEO/founder of Ampere worked at Intel at the time.
torginus 782 days ago [-]
I'm not an Intel mouthpiece, and I still don't have a clue what regular power users (including me) use more than 4 cores for.
Speaking from experience, I don't run that many parallel workloads (I don't know how parallel my compiles are), but when I do, given a specific TDP for a desktop processor, and the asymptotically diminishing returns due to lower clock speeds, more memory/cache contention and more synchronization, the benefits of going for more than 4-6 cores are negligible.
paulmd 782 days ago [-]
3930K came out over 10 years ago, my man.
5820K was almost 9 years ago.
8700K was almost 6 years ago now.
If $375 was too much for a hexacore, that was your choice for many, many years; for a long time now the market simply preferred the quad cores.
giuliomagnifico 784 days ago [-]
But what could the power consumption of this new CPU be? Because the Ampere 128-core is quite surprising in power consumption:
> In terms of power consumption, that is perhaps where the HPE ProLiant RL300 shines. Here is a screenshot of iLO 6 with the server idle. One can see a 136W average. That is fairly good for a 128-core server (~64-core EPYC or Xeon equivalent.)
https://www.servethehome.com/hpe-proliant-rl300-gen11-review...
The article answers this: Bergamo is 360W and Ampere One is 350W. I suspect the 256T Bergamo is faster than 192T Ampere One.
nine_k 782 days ago [-]
256 hyperthreads may not be exactly comparable to 192 dedicated cores, but it depends on the type of load, and needs measurement.
The AMD part's power budget is about 2.8W per core (360W / 128), and Ampere's about 1.8W (350W / 192). None of these cores look like very high speed. Their performance will also likely depend heavily on efficiency of caches and access to RAM, not just the cores' internals.
geerlingguy 782 days ago [-]
For chips like these, they are complex enough that even benchmarking them to figure out which one is faster is a chore.
It depends on how the memory channels work, what kind of optimizations you use in your benchmarking tools, how power delivery is handled on the motherboard, and that's before you consider the way the CCDs or cross-core communication happens (which is unique on each of the major architectures (Xeon/EPYC/Ampere)).
Anandtech had a great article comparing previous-generation server chips and found that memory access, L2/L3 access, and other cross-core communication varied dramatically on different chips, and even when BIOS was configured certain ways!
rewmie 782 days ago [-]
> For chips like these, they are complex enough that even benchmarking them to figure out which one is faster is a chore.
If two chips are hard to tell apart in terms of performance, that just means that they are practically equivalent. This means other factors come into play, such as ISA and cost.
nine_k 782 days ago [-]
They are not hard to tell apart, AFAICT; they are hard to reduce to a single number for easy comparison.
Your point stands though.
StillBored 781 days ago [-]
There are a pile of BIOS settings which are BIOS settings precisely because there isn't a right answer for every workload. DIMM channel ganging, DIMM bank interleave, socket interleave, NUMA chiplet vs socket descriptions, NUMA weights, and so on frequently have 5%-10% or more of an impact on a given workload. Change the workload and those numbers all change. It's like playing with compiler flags: this or that flag raises or lowers the perf of this or that part of a given benchmark. If there was a concrete answer, they wouldn't be flags.
In a lot of cases when machines are within 20% of each other, it's quite possible the real difference between the two results is how many hours a perf engineer has spent tuning the workload.
In a way this is why the simple microbenchmarks are as important as the system level ones. The simpler ones are easier to tune and give a speed of light for a given operation.
blueboo 782 days ago [-]
To that end, I wonder if there are benchmarks or standard methods that also measure “deep” utilisation — not just keeping cores “busy” but actually saturating their processing capacity.
LoganDark 783 days ago [-]
That's barely more than my single 12400F with only 6 cores (when performing work). Wow.
mattgrice 782 days ago [-]
12400F max turbo power is 117 watts.
LoganDark 782 days ago [-]
Maybe if you don't overclock it to 5.3GHz. I can easily pull 130.
Incipient 782 days ago [-]
You're comparing a cheap consumer-grade 6-core overclocked well beyond its design spec, operating at full load, to a high-end datacenter chip designed for performance and efficiency, measured under idle conditions.
Your statement is correct, but I'm not sure the comparison really means much?
ilyt 782 days ago [-]
>high end datacenter chip designed for performance and efficiency under idle conditions.
Nothing datacenter from AMD is efficient under idle conditions.
thejosh 782 days ago [-]
If your lights don't dim when overclocking, are you truly overclocking?
marcofatica 783 days ago [-]
What does the equivalent EPYC consume?
Synaesthesia 783 days ago [-]
>Single-socket submissions show that AMD's 128-core, 256-thread Epyc 9754 scores right around 922 in the benchmark, about 2.58x higher than Ampere's top-specced Altra, the M128-30, which comes in at 356.
>While a clear win for AMD's Bergamo, it doesn't take into consideration other elements like power consumption. AMD's part is rated for 360W and can be configured up to 400W, while Ampere's has a TDP of just 182W. So yes, it may be 2.5x times faster, but it potentially uses 2-2.2x more power.
adgjlsfhk1 783 days ago [-]
that's still a major win for AMD. 2x performance per core means you need half as many servers (which is way cheaper) and all your latency bound operations happen twice as quickly. (also I'm pretty sure you can underclock the AMD system to the point where it's the same power and 50% more performance)
geerlingguy 783 days ago [-]
At a certain density, though, power budgets come into consideration.
For 2 or 4 sockets per RU, you could be dealing with hundreds of kW per rack, and the heat that entails.
For certain deployments that could make sense. For others, a lower power, but still high core count solution could be better.
pclmulqdq 783 days ago [-]
I would assume that undervolting and underclocking will likely drop that power consumption quite considerably.
rewmie 782 days ago [-]
> For others, a lower power, but still high core count solution could be better.
I think OP's point is that you'd need over twice the number of the lower density cores to get the same performance, thus by going that route you'd end up needing more power to get the same computational resources.
To put it simply, with ARM you'd need a 4U to get almost the same compute as a 2U of AMD.
wmf 783 days ago [-]
You can always solve power density by not fully populating a rack or underclocking or both. If Epyc is better it's better; it doesn't depend on density.
geerlingguy 782 days ago [-]
It's definitely better for certain workloads.
jiggawatts 783 days ago [-]
Let’s round that to 200 watts difference. At 20c per kWh that’s an opex difference of about $1 per day.
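(For reference, the arithmetic behind that figure: 0.2 kW x 24 h x $0.20/kWh ≈ $0.96, i.e. roughly $1 per day.)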
Public cloud providers will sell you that amount of compute for about $50-$100 per day.
The lower end of that is the Ampere processor and the high end would be the AMD 128-core processor.
In other words, they can make $49 more by using the AMD CPU per day.
userbinator 782 days ago [-]
Idle power consumption is more important for desktop and mobile, and IMHO 136W idle seems quite high. That article says it gets close to 400W under full load.
throwaway2990 782 days ago [-]
Power consumption is important in mobile regardless of idle or full load. That’s why MacBooks are better than non MacBooks. Even under load they outperform traditional laptops for performance per watt
fsh 782 days ago [-]
Not really. An M2 Pro draws something like 40W under all-core load, which is comparable to AMD Ryzen mobile CPUs with similar performance. The long battery life of the MBP is almost entirely due to the crazy low idle consumption of the entire system (not only the CPU).
throwaway2990 782 days ago [-]
And yet, I can't take any Windows laptop I've ever owned to work without a charger and work all day and come home. But I can do that with the M2 MacBook. Cos under full load, the performance and battery life exceed those of Windows laptops.
fsh 782 days ago [-]
That's because the MacBook idles well, and most of the time is spent idling. Some modern Windows machines like the T14s Gen2 AMD come close, but others don't.
formerly_proven 782 days ago [-]
You can do that with a 12th/13th gen too, though it depends on what “work” is. I’m not sure there is any laptop that would last a day using the average enterprise setup (two IPSes, three “endpoint protectors”, two AVs, heavy WMI scripts constantly running and MS Teams always open), simply because the thing won’t get under 40% CPU during the day. The advantage of a MacBook here would simply be that 90% of the mentioned software doesn’t exist for the platform.
fsh 782 days ago [-]
I have heard the joke that MobileIron (an enterprise software for iOS devices) had its name because it turns your iPhone into a mobile iron.
ilyt 782 days ago [-]
Go ahead and actually draw long term CPU load and clocks. You will discover that you're nowhere near full load most of the time.
beebeepka 782 days ago [-]
Maybe you haven't used sufficiently new AMD laptops. Because my Ryzen 5800H and 6800H laptops (16 and 14 inch) can easily last a day despite using TSMC 7nm instead of the much more efficient 5nm process that Apple has been enjoying.
Macbooks are fantastic but there is no magic. Manufacturing is extremely important
throwaway2990 782 days ago [-]
I have a 5800H laptop, and its 3-4 hours pales next to the 8-10 hours I'm getting on the MacBook for the same workload.
beebeepka 782 days ago [-]
No arguments there. The M1/2 last longer. However, this is largely due to the TSMC 5nm node. The Ryzen 7800U will close the gap for sure.
RcouF1uZ4gsC 782 days ago [-]
It may also spell trouble for horizontal scaling. A 128-core computer with a few terabytes of RAM could handle loads that would otherwise need dozens of computers. There are huge advantages in terms of ease of management and programming.
geysersam 782 days ago [-]
Not sure about the "ease" of writing programs that scale efficiently to 128 cores.
If we're just running several processes there's not much difference from just running 8 computers each with 16 cores.
eldenring 782 days ago [-]
This is absolutely not true; having shared memory or even just being able to communicate over local OS pipes is massively simpler than introducing a network.
ilyt 782 days ago [-]
And orders of magnitude faster
imtringued 782 days ago [-]
8 servers use up one third of a rack.
You need interprocess communication that works across machines.
You need to orchestrate the deployment of software.
Communication within a machine is much faster.
ilyt 782 days ago [-]
Sure if you never programmed anything complex that needs actual communication and not just submitting job results at the end
qingcharles 782 days ago [-]
I've been moving some of my web assets over from x86 boxes to Altras. It's .NET so a lot of it moves without trouble. I'm having to move databases though as there is no mainline SQL Server for ARM. These Epycs would potentially end that migration.
throwaway2990 782 days ago [-]
Just stop using sql server. PostgreSQL runs on arm and is superior to sql server.
albertopv 782 days ago [-]
Pgsql is great in many ways, but not really superior to SQL Server; it really depends on what you need, e.g. column encryption, data compression etc. We had indexing issues on Postgres when changing the underlying OS because Postgres uses a lot of the OS's C libraries, and that is just wrong for a (multiplatform) RDBMS; it makes it less predictable.
throwaway2990 782 days ago [-]
4 years running PostgreSQL on ARM in production with 0 issues. (2TB database so kinda small, a lot of indexes tho, a lot of which are on jsonb columns.) I can't think of any feature in SQL Server I would ever need that PostgreSQL doesn't have. But SQL Server's lack of JSON support is a deal breaker for me.
albertopv 782 days ago [-]
We had issues with postgres on x64 when changing underlying OS, maybe you didn't have to.
Can you encrypt columns or compress data with vanilla postgresql? Why not?
Postgresql is REALLY great, PostGis is awesome, it's not perfect and other RDBMS may have other useful features.
greggyb 782 days ago [-]
SQL Server has had JSON column types (and supporting functions) since 2016.
I don't know you or your workload, so I can't comment on its suitability for you. I would hazard a guess, though, that your last proper due diligence knowledge is seven years out of date, or more.
throwaway2990 782 days ago [-]
No. Sql server has json functions. It does not have a json type. It does not support indexing of json.
Edit: it’s kinda ironic you think my knowledge of sql server is outdated when you don’t understand the features supported in sql server to begin with.
greggyb 782 days ago [-]
You index JSON columns in SQL Server by creating an index on a computed column. I am very open to the strict argument that this is a workaround, if you'd like to make that argument more verbosely. Nevertheless it is very effective, adds minimal IO overhead, and is dispatched with one extra line of code (more if you prefer newline-heavy formatting). From a practical perspective this has never been a sticking point for any of the numerous clients I have seen using the feature, and the performance is as good as any other btree index.
You are correct that there is no formal SQL type in SQL Server for JSON. And I am sorry for implying otherwise. Type safety requires a constraint on the column intended to hold JSON.
There is no equivalent to Postgres's GIN indices which would allow indexing on an array in the JSON column. Such a requirement would need a normalized table holding the array's values in SQL Server. Whether this is a limitation or a lack of support for JSON, full stop, seems to me a matter open to debate.
I have seen many successful projects (and participated in quite a few of those) that utilize JSON in SQL Server databases. I will amend my former statement, though, because it obviously lacked nuance: SQL Server's JSON functionality has covered all use cases that I have had and personally seen, but my experience is obviously much lesser than some others', so you can take this experience with as much salt as you like (:
tracker1 781 days ago [-]
SQL Server operations for JSON are parsed from text content on demand. In other databases, the JSON content is broken apart and can be queried/indexed on that parsed/stored content. The closest you can get with SQL Server is a computed column from the JSON that you can then index against; it's very messy by comparison. If you use a lot of JSON queries in SQL Server it can get excessively sluggish as well compared to PostgreSQL or even MySQL/MariaDB.
It's definitely a shortcoming of MS-SQL, which iirc is being addressed to an extent in the next release.
There are plenty of points where MS-SQL Server is very nice. JSON support is far from one of them. Even if the JSON is still a step above XML support in SQL Server.
speedgoose 782 days ago [-]
I don't know your requirements, but wouldn't using software containers with a reproducible operating system (like Nix), or even a simple image like Debian PostgreSQL, fix the predictability problem?
dogma1138 782 days ago [-]
No, at least not in a heavily regulated industry, because then other obligations, e.g. patching, become a much bigger issue.
The big benefit of MSSQL and Oracle SQL is that your database server is almost completely detached from the operating system; the likelihood of a system update changing how your database works is nil. With Postgres and other databases it's not a given. Ironically, on Windows you get quite a bit of that back since Postgres on Windows comes with most of its own libraries; however, that's also part of the problem, where you can easily have material differences between running Postgres on Windows and Linux.
speedgoose 782 days ago [-]
I don’t know about your industry but patching and software containers aren’t incompatible from my point of view.
I am guessing that you don’t want to validate PostgreSQL on all operating systems, but you could always stick to one using software containers.
dogma1138 782 days ago [-]
Containers don't solve anything; it's not as if the base image doesn't need to get patched.
Postgres is far more dependent on OS libraries which makes it far less predictable.
speedgoose 781 days ago [-]
Alright, but if you need to patch a dependency of the database does it matter whether the change is in a shared library or not?
Of course you could skip patching the dependencies and ship unsafe software the oracle db way.
dogma1138 781 days ago [-]
Yes, MSSQL for example is self-contained; OS or container image updates aren't going to impact it, or at least it's extremely unlikely that they will.
Postgres is dependent on a lot of OS level C libraries that can materially change how things work.
This means that there will have to be more testing with Postgres and there will be higher uncertainty between different deployments.
All of these can be mitigated and for many organizations the benefits of Postgres might outweigh these downsides but they do exist.
albertopv 776 days ago [-]
Things are not carved in stone; we had to change the OS and something quite unforeseeable for an RDBMS happened. Imho a DBMS should be a sort of self-sufficient OS in itself.
matwood 782 days ago [-]
PGSQL is great, but often when someone says MSSQL they mean the entire ecosystem. SSRS, SSIS, SSAS fall under the MSSQL umbrella and make it much more than a database. MSSQL really embraces the S of RDBMS.
MSSQL also has great tooling - another part of the ecosystem.
If I was starting today, pgsql would be my choice for licensing costs alone. But, if I already had a system built on MSSQL, it would be hard to make a business case to move (I have tried).
qingcharles 782 days ago [-]
This is exactly what I am aiming to do. I am DB agnostic as my requirements are very simple. I was using MSSQL only because it ships by default with Visual Studio and I've been using it extensively for the last 25 years.
sacnoradhq 782 days ago [-]
Sorry, but Zen 4c parts (like the 9754) don't have the massive L3 cache or the fast clocking of Zen 4 (like the 9684X). They're chopped-down Zen 4s. 32 more cores is only good for hyperparallel HPC where the working set is smaller. The giant L3 cache and faster clock of Zen 4 is going to perform better in almost all other cases.
And Altra's are old PCIe gen 4 and DDR4, while AMDs are PCIe 5.0 and DDR5.
dragontamer 782 days ago [-]
Aren't web frontends just basically HTTPS / AES-GCM cores that shuffle data between RAM and PCIe / networking?
I'm thinking of Cloudflare or Twitter. There's not much compute going on at this level (all the compute would be handled by the SQL servers, application servers, or other backend equipment).
What's needed for frontend / proxy code is AES for handling the HTTPS / TLS connection, and lots of threads to handle all the different connections.
ilyt 782 days ago [-]
Vast majority of them will be used to just host a bunch of small VMs for customers. The "almost all other cases" are the minority of uses here
qwertox 782 days ago [-]
Aren't the AmpereOne CPUs already on the market? They were announced a bit over a year ago, offer PCIe 5.0 and DDR5, and were officially presented last month.
ksec 782 days ago [-]
You do realise these Zen4c chips are aimed at hyperscalers for cloud vCPU workloads?
user6723 782 days ago [-]
Even if less performant, a variant of this CPU guaranteed 100% immune from spectre-like problems is highly desirable.
krastanov 782 days ago [-]
I assume "spectre-like" means "leaks info due to side effects of speculative execution". Arm (and any high performance chip really) relies on speculative execution for performance, so I would be surprised if they are not susceptible to spectre-like attacks in principle. What am I missing?
adrian_b 782 days ago [-]
Nothing.
At least for now, the fastest ARM CPUs are not immune to Spectre, so the recent ARM cores have introduced a set of similar workarounds to the recent Intel and AMD CPUs, in order to serialize the execution and flush internal state around context changes.
Searching for "Speculative Processor Vulnerability" on the Arm developer site will find many resources describing possible attacks and the mitigations that must be implemented for various Arm cores.
ilyt 782 days ago [-]
But this isn't one. Every modern design that's used in any capacity turned out to be vulnerable one way or another.
tracker1 781 days ago [-]
Price:Performance:Cost-to-Run are afaik the three metrics that will determine if something is the right solution. Barring custom accelerators or purpose-built hardware, general compute is roughly as approachable for ARM or x86 in terms of the software applications that run on said hardware.
If the price is less and the cost to run is less, then the bottom line max performance may not be the leading factor in making a decision. It really depends on a few factors, and this article does a poor job even spelling that much out.