Tuesday, January 5th 2021

AMD Applies for CPU Design Patent Featuring Core-Integrated FPGA Elements

AMD has applied for a United States patent that describes a CPU design with FPGA (Field-Programmable Gate Array) elements integrated into its core. Titled "Method and Apparatus for Efficient Programmable Instructions in Computer Systems", the application describes FPGA elements built into the core itself, where they share CPU resources such as the registers used by the floating-point and integer execution units. The application surfaces in the wake of AMD's announced plans to acquire Xilinx, and takes the pairing of FPGA and CPU to a whole other level. FPGAs, as the name implies, are hardware constructions that can be reconfigured according to predetermined tables (which can also be updated) to execute desired, specific functions.

Intel has already shipped a CPU + FPGA combo in a single package: the company's Xeon 6138P, for example, includes an Arria 10 GX 1150 FPGA on-package, offering 1,150,000 logic elements. However, this is simply a CPU and an FPGA on the same substrate, not a native, core-integrated FPGA design. Intel's product has severe performance and latency penalties because data for complex operations has to be moved out of the CPU, processed in the FPGA, and the results then returned to the CPU. AMD's design does away with that round trip, and should thus allow for much higher performance.
Some of the more interesting claims in the patent application are listed below:
  • Processor includes one or more programmable execution units (PEUs) which can be programmed to execute different types of customized instructions
  • When a processor loads a program, it also loads a bitfile associated with the program which programs the PEU to execute the customized instruction
  • Decode and dispatch unit of the CPU automatically dispatches the specialized instructions to the proper PEUs
  • PEU shares registers with the FP and Int EUs
  • PEU can accelerate Int or FP workloads as well if speedup is desired
  • PEU can be virtualized while still using system security features
  • Each PEU can be programmed differently from other PEUs in the system
  • PEUs can operate on data formats that are not typical FP32/FP64 (e.g. Bfloat16, FP16, Sparse FP16, whatever else they want to come up with) to accelerate machine learning, without needing to wait for new silicon to be made to process those data types.
  • PEUs can be reprogrammed on-the-fly (during runtime)
  • PEUs can be tuned to maximize performance based on the workload
  • PEUs can massively increase IPC by doing more complex work in a single cycle
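
The load-and-dispatch flow in the claims above can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the mechanism described in the patent application: the class ToyCore, the opcode names ADD and CVT_BF16, and the idea of representing a "bitfile" as a Python dictionary of functions are all hypothetical stand-ins. It shows a shared register file, a fixed-function path, and a programmable unit that gains a Bfloat16 conversion instruction when a program is loaded.

```python
import struct
from typing import Callable, Dict, List


def fp32_to_bf16_bits(x: float) -> int:
    """Return the bfloat16 bit pattern of x by keeping the top 16 bits
    of its IEEE-754 float32 encoding (simple truncation, no rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


class ToyCore:
    """Hypothetical core model: one fixed-function unit plus one PEU.

    Both units read and write the same register list, mirroring the
    claim that the PEU shares registers with the FP and Int units.
    """

    def __init__(self) -> None:
        self.regs: List = [0] * 8                 # shared register file
        self.peu: Dict[str, Callable] = {}        # opcode -> programmed logic

    def load_program(self, bitfile: Dict[str, Callable]) -> None:
        # Assumption: the "bitfile" is modelled as a table of callables.
        # Loading a program reprograms the PEU with that table.
        self.peu = dict(bitfile)

    def execute(self, opcode: str, dst: int, src: int) -> None:
        if opcode == "ADD":
            # Fixed-function path, always available.
            self.regs[dst] = self.regs[dst] + self.regs[src]
        elif opcode in self.peu:
            # Customized instruction: dispatched to the PEU.
            self.regs[dst] = self.peu[opcode](self.regs[src])
        else:
            raise ValueError(f"unsupported instruction: {opcode}")


core = ToyCore()
core.load_program({"CVT_BF16": fp32_to_bf16_bits})
core.regs[1] = 1.0
core.execute("CVT_BF16", 0, 1)
print(hex(core.regs[0]))  # 0x3f80, the bfloat16 encoding of 1.0
```

In this model, loading a different table swaps the customized instructions without touching the fixed ADD path, which is the software analogue of reprogramming a PEU at runtime.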
As it stands, this sort of design would allow, in theory, for an updatable CPU that might never need to be replaced to gain new instruction support: since an FPGA is programmable hardware logic, a simple firmware update could let the CPU reconfigure its FPGA array to process new, exotic instructions as they are released. Another argument for this integration is that some of the fixed-function silicon found in today's CPUs to support legacy x86 instructions could be left off the die, with those instructions handled by the FPGA elements instead, which would still act as an on-board hardware accelerator for when (and if) they are required.

This would also allow AMD to trim the CPU of the "dark silicon" that is currently present: essentially, highly specialized hardware acceleration blocks that sit idle, wasting die space, when not in use. The bottom line is this: CPUs with less die space reserved for highly specialized operations, and thus more die area available for other resources (such as more cores), plus integrated, per-core FPGA elements that reconfigure themselves on the fly according to processing needs. And if no exotic operations are required (such as AI inferencing and acceleration, AVX, video hardware acceleration, or other workloads), then the FPGA elements can simply be reconfigured to "turbo" the CPU's own floating-point and integer units, increasing available resources. An interesting patent application, for sure.
Sources: Free Patents Online, Reddit user @ Marakeshmode, Hot Hardware

22 Comments on AMD Applies for CPU Design Patent Featuring Core-Integrated FPGA Elements

#1
Bytales
Cant imagine why hasnt anyone thought of this before. I dreamed of on the fly reprogramable hardware units inside chips since like forever, and it is only know i see one of the great CPU makers on the planet going this route.
This is the future, Full CPU with on the fly reprogramable units. They "morph" in the shape that is the most efficient for the calculation that needs to be done. Its as if you have a billion cpus into one. This is the future for sure.
Imagine having a 512 or 1024 core on 0.5 nanometer cpu like this, would blast curent high end cpus like the 5950x into oblivion like they were nothing. In all kind of workloads.
Posted on Reply
#2
maxitaxi96
Bytales
Cant imagine why hasnt anyone thought of this before. I dreamed of on the fly reprogramable hardware units inside chips since like forever, and it is only know i see one of the great CPU makers on the planet going this route.
This is the future, Full CPU with on the fly reprogramable units. They "morph" in the shape that is the most efficient for the calculation that needs to be done. Its as if you have a billion cpus into one. This is the future for sure.
Imagine having a 512 or 1024 core on 0.5 nanometer cpu like this, would blast curent high end cpus like the 5950x into oblivion like they were nothing. In all kind of workloads.
Do FPGAs not have a performance penalty? I always thought it was a trade-off (highly specialized & faster <--> highly generalized & slower)
Posted on Reply
#3
john_
Noob question.


Can those "FPGA parts" emulate old stuff? Like 32bit code? I mean, that could probably help AMD clean up all the old stuff in their designs, that is there for compatibility perposes and help streamline their future cores?
Posted on Reply
#4
Raevenlord
News Editor
john_
Noob question.


Can those "FPGA parts" emulate old stuff? Like 32bit code? I mean, that could probably help AMD clean up all the old stuff in their designs, that is there for compatibility perposes and help streamline their future cores?
Yes. That's part of what dark silicon in the article refers to.
Posted on Reply
#5
john_
Raevenlord
Yes. That's part of what dark silicon in the article refers to.
Thanks. I should have read ALL your article, not the first half :)
Posted on Reply
#6
OSdevr
maxitaxi96
Do FPGAs not have a performance penalty? I always thought it was a trade-off (highly specialized & faster <--> highly generalized & slower)
They do yes. An FPGA implementation will always be slower and take more die space than dedicated hardware. However, if there is more than one accelerator present and only one is used at a time then an FPGA implementation can emulate both while taking up less space. Further an FPGA solution is almost always faster than performing the function in software on a traditional CPU.

In summary FPGAs are much more flexible than dedicated hardware while generally faster than performing a function in software. They're sort of a middle ground.
Posted on Reply
#7
_Flare
maybe it can handle AVX512 and Intels new AMX stuff, that would be great.
software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html
software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions/intrinsics-for-intel-advanced-matrix-extensions-amx-tile-instructions.html
software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions/intrinsic-for-intel-advanced-matrix-extensions-amx-bf16-instructions.html
software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions/intrinsics-for-intel-advanced-matrix-extensions-amx-int8-instructions.html
Posted on Reply
#8
Raevenlord
News Editor
OSdevr
They do yes. An FPGA implementation will always be slower and take more die space than dedicated hardware. However, if there is more than one accelerator present and only one is used at a time then an FPGA implementation can emulate both while taking up less space. Further an FPGA solution is almost always faster than performing the function in software on a traditional CPU.

In summary FPGAs are much more flexible than dedicated hardware while generally faster than performing a function in software. They're sort of a middle ground.
This. Of course, one also has to take into consideration the relation between the amount of die space reserved for the FPGA (more die space means more units means more performance for any given task) and also the AMD intention of having these take advantage of already-existing core resources.

Also, perhaps we could actually see improved performance on tasks performed by fixed-function hardware. I suppose in theory, if one can shave 3x 20mm2 (pulling this out of my proverbial, as an example) fixed function hardware for three specific tasks, and replace those with 60 mm2 FPGA, perhaps those 60 mm2 FPGA resources will be faster at executing one of those tasks than their previous 20mm2 fixed-function hardware?
Posted on Reply
#9
TechLurker
This could be another way AMD keeps up with ARM and RISC V, by making their CPUs flexible enough to use software/hardware intended for ARM/RISC devices while still retaining exclusive (1 of 3? companies, IIRC) x86 legacy support. As opposed to shifting entirely over to ARM (though I can still see AMD using a K12 successor in entering the ARM ecosystem proper; moreso given their RDNA joint effort with Samsung to integrate ARM and RDNA for mobile).
Posted on Reply
#10
Wirko
The idea is exciting but doubts remain.
* Manufacturing tech. For example, the process for making CPUs is significantly different from those for DRAM and NAND - two of those can't be combined on the same die efficiently. So, is the process that produces the best CPU logic also the best for FPGA logic?
* Can a CPU really make full advantage of flexible execution units if all other parts remain fixed, like decode logic, out-of-order logic, etc.? For 32-bit emulation, I believe that programmable decoders would be the key, not programmable EUs.
* Context switching might require reprogramming the FPGA logic every time, or often, depending on the load. How much time does it take? Not very good if it's many microseconds.
* The poor guy that writes highly optimized C/C++ code, will he (or she) have to become a VHDL expert too? (or will that be left up to the other poor guy, the on that maintains the optimizing compiler?)

If all this ever sees the light of day, I imagine AMD will make various FPGA functions available as downloads, for a fee of course. Maybe they won't let just anyone develop or sell new ones.
Posted on Reply
#11
zlobby
I smell a huge field for mischievous endeavors.
Posted on Reply
#12
Aquinus
Resident Wat-man
maxitaxi96
Do FPGAs not have a performance penalty? I always thought it was a trade-off (highly specialized & faster <--> highly generalized & slower)
The advantage to FPGA is to be able to make changes after the device has already been fabricated. The cost really comes down to how the FPGA is programmed and the implementation itself. Being able to make changes after the device has been built is definitely an advantage, particularly if you consider some of these security flaws we've been seeing.
Posted on Reply
#13
DeathtoGnomes
other workloads, then the FPGA elements can just be reconfigured to "turbo" the CPU's own floating point and integer units, increasing available resources.
FPGA: I'm bored
CPU: you could help me
FPGA: eh?
CPU: Lazy bum!
FPGA: Ok fine here, Haz some Red Bull, it'll make you run faster.
CPU: I dont have legs!
FPGA: better I stick a Falcon 9 up your rear?
CPU: Zoom! ZOOM!
Posted on Reply
#14
Mussels
Moderprator
Can someone simplify this for me? in way over my head on CPU designs

is this what it seems like, with AMD making their CPU's hardware functions reprogammable so they can just change the damn architecture and feature set on a whim?
Posted on Reply
#15
InVasMani
So many implications from this. I've long been a fan of FPGA's flexibility prospects. I kind of felt this was coming down the pike for a long long while with CPU instruction sets and chiplets. Bravo to AMD on this if it proves to be efficient and functionally sound. What I really like with this is they have the prospect of utilizing FPGA tech to significantly accelerate instruction sets for example compress/decompression/encryption/decryption algorithms them swap them around in and out as required quickly on the fly maximizing the die space available as opposed to have all fixed into hardware occupying more overall combined space. This is ideal for any instruction sets that won't be utilized a certain % of the time. Instead of a compromised instruction set as well that's less brute force they can hopefully have several brute force ones that they can quickly interchange and utilize to bump up the overall efficiency. Perhaps you need some profiles to load them around dynamically and it takes a moment or two, but once configured is a pronounced speed up that's still a great compromise and if it happens behind the scenes and quickly enough in the first place that's excellent. AI accelerated FPGA's if you will and this is what will lead to better on the fly cognitive neuron like chips.
Mussels
Can someone simplify this for me? in way over my head on CPU designs

is this what it seems like, with AMD making their CPU's hardware functions reprogammable so they can just change the damn architecture and feature set on a whim?
Think of it a bit like a sound DSP chip that can reconfigure itself on the fly to the sound environment to maximize the sound effect and realism of the sound in relation to the 3D surrounding. In essence they could reconfiguration optimizing re-calibrating AI assisted instruction responsive compute algorithms on a chiplet. Depending on how much FPGA tech is on the chiplet and how many instructions were removed along with how quickly they can be reconfigured there is a lot of potential upside in form of die space that could be better allocated. Not simply that either this gives them a great idea of where they might shift things from the direction dedicated instructions they retain on the CPU chiplet's and more legacy ones they can trim or adjust better and supplement with FPGA tech to yield better IPC as a whole within the die space and heat tolerance constraints
maxitaxi96
Do FPGAs not have a performance penalty? I always thought it was a trade-off (highly specialized & faster <--> highly generalized & slower)
I believe that's a bit of a misconception with FPGA's. They aren't as optimal as a ASIC at a given task and designed for it, but they aren't one trick ponies confined to that task indefinitely either and that is a key difference. What AMD's aiming to do here is remove some "fixed instruction sets" of less vital importance thru clever use of FPGA's and interchanging instruction sets in quick fashion "ideally" and re-purposing or transforming the removed instructions set die space however is optimus roll out AMD autobots.
Raevenlord
This. Of course, one also has to take into consideration the relation between the amount of die space reserved for the FPGA (more die space means more units means more performance for any given task) and also the AMD intention of having these take advantage of already-existing core resources.

Also, perhaps we could actually see improved performance on tasks performed by fixed-function hardware. I suppose in theory, if one can shave 3x 20mm2 (pulling this out of my proverbial, as an example) fixed function hardware for three specific tasks, and replace those with 60 mm2 FPGA, perhaps those 60 mm2 FPGA resources will be faster at executing one of those tasks than their previous 20mm2 fixed-function hardware?
Quite a bit like polyphony and timbrality for music and sequencing. Really with a FPGA AMD could adjust many aspects in many ways at any point thru transforming the FPGA programming. Don't need compression/decompression/encryption/decryption at the moment or only a select amount of it dynamically junk and reconfigure it. Don't need certain instruction sets for the task goodbye. Basically this is a bit like precision boost all over again on a whole other level of refinement and IPC efficiency uplift in a round about sense in theory if done efficiently and well. I can see it taking a bit of generation refinement, but much like other tech it should see nice improvements as it's better perfected.
Wirko
The idea is exciting but doubts remain.
* Manufacturing tech. For example, the process for making CPUs is significantly different from those for DRAM and NAND - two of those can't be combined on the same die efficiently. So, is the process that produces the best CPU logic also the best for FPGA logic?
* Can a CPU really make full advantage of flexible execution units if all other parts remain fixed, like decode logic, out-of-order logic, etc.? For 32-bit emulation, I believe that programmable decoders would be the key, not programmable EUs.
* Context switching might require reprogramming the FPGA logic every time, or often, depending on the load. How much time does it take? Not very good if it's many microseconds.
* The poor guy that writes highly optimized C/C++ code, will he (or she) have to become a VHDL expert too? (or will that be left up to the other poor guy, the on that maintains the optimizing compiler?)

If all this ever sees the light of day, I imagine AMD will make various FPGA functions available as downloads, for a fee of course. Maybe they won't let just anyone develop or sell new ones.
The way I see it is AMD could utilize the chiplet's cleverly. Example 4 chiplet design. The first chiplet pure multi-core CPU design, second pure FPGA design, third pure APU/GPU design. As for that fourth chiplet perhaps it's 1/4 of each and infinity fabric between the other three that it controls. Also that fourth chip could effectively be seen as one large monolithic chip in essence. Now think about that prospect suddenly those yields and die defects and laser cutting off some of the bad portions to salvage what they can in a chip isn't as big a issue in the overall chip design.
Posted on Reply
#16
Punkenjoy
Mussels
Can someone simplify this for me? in way over my head on CPU designs

is this what it seems like, with AMD making their CPU's hardware functions reprogammable so they can just change the damn architecture and feature set on a whim?
From what i read in the article, they will only use it for some stuff but the main part of the CPU will remain a traditional CPU. This might help to increase performance in the future but better performance come mostly from the increase in transitors, and not a lot by better optimise one.

Of these 3 part of a CPU, they want to be able to use these programable transitors to replace 2 of them.
The legacy support (to support old x86 apps that use instruction that modern apps no longer use.).
Accelerators like AI, Image processing, video decoding/encoding, Encryption/Decryption, etc.

The core of the CPU will remain normal static transitors.

But the goal would be to use the space saved as they won't require as much space and transitors for many legacy instruction/code or accelerators for something that might provide better performance. Like more core, larger cores, more cache, etc...
Posted on Reply
#17
Xajel
AMD -when asked about AVX512- said they're more interested in a better silicon usage that can do multiple things rather than wasting die space in a specialised workload that only few can take benefit from. The same goes for their RayTracing on GPU's, they said they're leaning toward making a more general purpose cores than can do RT calculations faster rather than having a dedicated silicon only for RT.

I guess this is how AMD is seeing things like AVX512, AI and other stuff, just put an FPGA there. Developers can just program it and do their magic. But I don't know how much it can do over specialised ASICS like how Intel is doing with AVX512, and how the FPGA will work with multiple applications each trying to do its own thing. The patent I saw was like each x86 core has a small FPGA beside it and both share resources (like how x86 core has integer and FP units, now we will have an FPGA unit as well). So each core can have their own FPGA and each core can program it's FPGA to do specific task (or combine more than one core with their FPGA to have more FPGA power).

Maybe in the future, a single x86 core can have multiple FPGA execution units like how they do with integer and FP units. And maybe AMD can differentiate Server and consumer Zen Core dies by how many FPGA units per core/die as I don't think consumers will need that much in the near future.
Posted on Reply
#18
InVasMani
Programming thing and doing magic is pretty great take reshade for example. Would you look at the god rays on that grim dawn! Fake it til you make it!


Far as the FPGA matter is concerned I feel AMD full well intends to expand FPGA power over time with refinement and also use it to help redesign maximize what's ideal in terms of fixed function instruction sets to keep in place and which could be shifted away from fixed function instructions per core to FPGA silicone die space instead of lesser importance instruction set algorithms and other chip aspects. It could lead to something like a chiplet with 8 cores within it and each potentially could have it's own unique fixed instruction set the saved space on the rest used to FPGA space that's programmable. It could lead to a chip where the first core has some FPGA parts and all instruction sets you'd want and the next core drops a instruction down the line per additional core and replaces that instruction set space for additional FPGA space. That would enable a fair degree of programmable micro adjustments a lot like precision boost with voltages. How they work out which ways Windows Task Scheduler handles it is another matter, but it'll work itself out over time I'm sure.

To touch on what I said a few months back "I think bigLITTLE is something to think about and perhaps some FPGA tech being applied to designs. I wonder if perhaps the MB chipset will be turned into a FPGA or incorporate some of that tech same with CPU/GPU just re-route some new designs and/or re-configure them a bit depending on need they are wonderfully flexible in a great way perfect no, but they'll certainly improve and be even more useful. Unused USB/PCI-E/M.2 slots cool I'll be reusing that for X or Y. I think eventually it could get to that point perhaps hopefully and if it can be and efficiently that cool as hell." That's something I feel is another aspect of FPGA's being integrated and fused with CPU's. The CPU's these days have fixed hardware to handle a lot of stuff even things like direct CPU based PCIE connections. What happens with fixed function hardware is if that hardware isn't being utilized fully it's effectively wasted die space is it not!!?

Now with FPGA's handling some of those things and depending on how much extra space is required to do so you can actually bypass that downfall to fixed function hardware design not utilizing something repurpose the die space for something you're trying to do like sorcery. The traditional chipset could be eliminated in the future entirely replaced by a FPGA potentially or at least more of a twin socket CPU and no chipset with a infinity cache and infinity fabric connection between them both. They could even behave more like the human brain across a motherboard left one handle memory channels on the left side and other on the right side along with peripherals like PCIE lanes with shorter traces by making the PCIE distance between the CPU socket more symmetric. That could be part of the issue with mGPU as well the PCIE traces differ a fair amount because of the slot location nearer and further away in relation to the CPU. That's certainly a area that could be improved in practice.

Another part to touch on few months back I mentioned and feel is quite true. Eventually we need even more fixed function ASIC functions integrated into chip dies or FPGA's because the low hanging fruit on node shrinks is eroding. Where FPGA's come into the equation is die space is limited, but their programmability isn't though the extent it certainly is. That said you can't infinity put new ASIC fixed instructions into a chip die with the laws of physics diminishing the prospects of node shrinks at some stage or another. We either need a cost effective quantum computer break-thru of some sort or FPGA's cleverly being purposed and a delicate balance of critical fixed function instruction sets.
"I still really feel FPGA's could be the best all around solution outside of combining a variety of ASIC's to really specifically maximize and prioritize a handful of the individuals use cases. Eventually this will be one of the few low hanging fruits left to leverage so it has to happen eventually for both Intel and AMD not to mention Nvidia on the GPU side this is how it is going to be moving forward one way or another w/o a break thru on the manufacturing side or quantum computers really taking a foothold."

From like 3 years ago...the tides have turned the future is now!
"I hate to say it because I wish AMD luck in the future as they are a great company, but I strongly feel that Intel's FPGA is a enormous sleeping giant. FPGA's in general have so much potential to me as they can be configured appropriately to specific needs. I'm not sure why we don't have a CPU FPGA swamped with like 8 FPGA's around it that interconnect with it. You'd have tons of surface area for cooling and enormous amounts of reconfigurable power at hand especially if you had that with something like very potent APU at it's center with lots of AI machine learning capability to adapt to a users usage and needs."
Posted on Reply
#19
Mouth of Sauron
Xajel
The same goes for their RayTracing on GPU's, they said they're leaning toward making a more general purpose cores than can do RT calculations faster rather than having a dedicated silicon only for RT.
NVIDIA RT cores are quite near to the dark silicon. RT math is boring, repetitive and highly non-interesting. But it's different from rasterization-math, not that much, but cores are optimized either for rasterization or for *partial* RT (and are likely either idling or doing something very inefficiently when no RT is needed).

In short, I don't like NVIDIA RT solution at all - it's wasteful (and consumers pays for it), it's incomplete and doomed to become obsolete in time.

If part of cores are capable to do a quick change from rasterization to (any) RT or vice-versa, it's a very flexible and elegant solution.

Goes to various other stuff, say anti-aliasing - some games clearly don't need it at all (though they don't usually need high computing power too, but lets think small stuff, like APUs). And a number of others. Not needed - used for something else. Especially in APUs and low-end.

Also - reasonably future-proof. As bloody TFOPs might actually get some meaning... Stuff just work, until there are fundamental changes or compute power just becomes too small.

Will be interesting what will NVIDIA do, because if AMD get Xilinx and Intel already has Altera and it's 85% of the FPGA-world, and since NVIDIA is 'coexisting peacefully' with both...

Perhaps it's worth mentioning that Intel-guys aren't probably just sitting on their collective ass, they are using interposers, have announced big.little design (could it be that it's FPGA-related? this big.little stuff puzzles me greatly since announced), they could be more ready than we think they are...

One last thing - probably just a feeeeeeling, but whole stuff smells like DC spirit, anyhow something we won't see in quite a while (except those who works with datacenters and web-servers and whathaveyou)
Posted on Reply
#20
pjl321
What are all the 'accelerators' on Intel's CPUs at the moment? Maybe apart of the iGPU. Are they ASICs?
The stuff that allows performance jumps like this:
Posted on Reply
#21
Patriot
FPGAs are good at inference, so think bfloat 16 and smaller, int 4-8.

Having a fpga chiplet to handle inference would allow normal calculations at the same time as specialized.
pjl321
What are all the 'accelerators' on Intel's CPUs at the moment? Maybe apart of the iGPU. Are they ASICs?
The stuff that allows performance jumps like this:
Intel has Bfloat 16 support as one of their cpu extensions, VNNI
software.intel.com/content/www/us/en/develop/articles/introduction-to-intel-deep-learning-boost-on-second-generation-intel-xeon-scalable.html
Posted on Reply
#22
r9
On Intel meeting: "How come we didn't come up with this ?" ... complete silence ...
Posted on Reply