With the intent to address the needs of high-performance applications demanding ever higher data rates, GDDR5X targets data rates of 10 to 14 Gb/s, roughly a 2x increase over GDDR5. To allow a smooth transition from GDDR5, GDDR5X uses the same proven pseudo open drain (POD) signaling as GDDR5. If you already have a GTX 1080 or 1080 Ti with GDDR5X, there is not a whole lot to talk about, since much of what made GDDR6 faster was already available through GDDR5X. The exception is going from the GTX 1070 (GDDR5) to the RTX 2070 (GDDR6): you're looking at roughly a 40% increase in memory bandwidth over the GTX 1070.
Video RAM: What’s the difference between the types available today?
All graphics cards need both a GPU and VRAM to function properly. While the GPU (Graphics Processing Unit) does the actual processing of data to output images on your monitor, the data it is processing and providing is stored and accessed from the chips of VRAM (Video Random Access Memory) surrounding it.
Outputting high-resolution graphics at a quick rate requires both a beefy GPU and a large quantity of high-bandwidth VRAM working in tandem. For most of the past decade, VRAM design was fairly stagnant, and focused on using more power to achieve greater VRAM clock speeds.
But the power consumption of that process was beginning to impinge on the power needed by newer GPU designs. In addition to possibly bottlenecking GPU improvements, the standard sort of VRAM (which is known as GDDR5) was also determining (and growing) the form factor (i.e. the actual size) of graphics cards.
Chips of GDDR5 VRAM have to be attached directly to the card in a single layer, which means that adding more VRAM involves spreading out horizontally on the graphics card. And moving beyond a tight circle of VRAM around the GPU means increasing the travel distance for the transfer process as well.
With these concerns in mind, new forms of VRAM began to be developed, such as HBM and GDDR5X, which have finally surfaced in the past couple of years. These are explained below as straightforwardly as possible.
HBM vs. GDDR5:
If you want the differences between these two varieties of VRAM summed up in two simple sentences, here they are:
GDDR5 (SGRAM Double Data Rate Type 5) has been the industry standard form of VRAM for the better part of a decade, and is capable of achieving high clock speeds at the expense of space on the card and power consumption.
HBM (High Bandwidth Memory) is a new kind of VRAM that uses less power, can be stacked to increase memory while taking up less space on the card, and has a wide bus to allow for higher bandwidth at a lower clock speed.
Here is a per-package (one stack of HBM vs. one chip of GDDR5) comparison:[i]
| | HBM (1 stack) | GDDR5 (1 chip) |
|---|---|---|
| Higher Bandwidth | ✓ (~100 GB/s) | – (~28 GB/s) |
| Smaller Form Factor | ✓ (Stackable, Integrated) | – (Single-layer) |
| Higher Clock Speed | – (~1 Gb/s) | ✓ (~7 Gb/s) |
| Lower Voltage | ✓ (~1.3 V) | – (~1.5 V) |
| Widely Available | – (New, Needs Redesigned Cards) | ✓ (Old, Cards Designed Alongside) |
| Less Expensive | – (New, Needs Redesigned Cards) | ✓ (Old, Cards Designed Alongside) |
Again, don’t be fooled by the ✓ that GDDR5 received there for having a higher clock speed; HBM, with its wide bus, still boasts a higher overall bandwidth per Watt (according to AMD, over three times as much bandwidth per Watt). The lower clock speed is related to how HBM attains its energy savings.
A diagram of HBM’s stacked design, by ScotXW
The idea here is that GDDR5, with its narrow channel, keeps being pushed to higher and higher clock speeds in order to achieve the performance that is currently expected out of VRAM. This is very costly from a power perspective. HBM, on the other hand, moves at a lower rate across a wide bus.
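To make the narrow-and-fast versus wide-and-slow trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. The 1024-bit stack width and per-pin rates are rough public figures for first-generation HBM and GDDR5 rather than numbers taken from any specific card, so treat the output as ballpark only:

```python
# Approximate per-package bandwidth: bus width (bits) x per-pin data rate (Gb/s) / 8.
# This ignores prefetch, clocking details and real-world efficiency.

def package_bandwidth_gbs(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Rough bandwidth of one memory package in GB/s."""
    return bus_width_bits * pin_rate_gbps / 8

# One GDDR5 chip: narrow 32-bit interface running fast, ~7 Gb/s per pin.
gddr5_chip = package_bandwidth_gbs(32, 7.0)      # ~28 GB/s
# One HBM stack: very wide 1024-bit interface running slow, ~1 Gb/s per pin.
hbm_stack = package_bandwidth_gbs(1024, 1.0)     # ~128 GB/s (AMD's marketing cites 100+ GB/s)

print(f"GDDR5 chip: ~{gddr5_chip:.0f} GB/s, HBM stack: ~{hbm_stack:.0f} GB/s")
```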
With the huge gains in GPU processing power and the increasing consumer appetite for high-resolution gaming (a higher resolution means more visible detail, which means more data, which requires VRAM that is both higher capacity and higher speed), it seemed inevitable that most cards, starting at the top-end and moving down, would be re-designed to feature a version of HBM (such as the already-developed HBM2, or otherwise) in the future. But then, last year, yet another new standard of VRAM came about which called that into question.
GDDR5 vs. GDDR5X:
You may have seen some news in the past year or so regarding a form of VRAM called GDDR5X, and wondered exactly what this might be. For starters, here’s a simple-sentence-summary like the one offered for HBM and GDDR5 above:
GDDR5X (SGRAM Double Data Rate Type 5X) is a new version of GDDR5, which has the same low- and high-speed modes at which GDDR5 operates, but also an additional third tier of even higher speed with reportedly twice the data rate of high-speed GDDR5.
Here is a per-package (one chip of GDDR5X vs. one chip of GDDR5) comparison:[ii]
| | GDDR5X (1 chip) | GDDR5 (1 chip) |
|---|---|---|
| Higher Bandwidth | ✓ (~56 GB/s) | – (~28 GB/s) |
| Smaller Form Factor | Tie (Single-layer) | Tie (Single-layer) |
| Higher Clock Speed | ✓ (~14 Gb/s) | – (~7 Gb/s) |
| Lower Voltage | ✓ (~1.35 V) | – (~1.5 V) |
| Widely Available | – (New, Only in High-end Cards) | ✓ (Old, Cards Designed Alongside) |
| Less Expensive | – (New, Only in High-end Cards) | ✓ (Old, Cards Designed Alongside) |
So, you might be wondering, if a chip of GDDR5X is still operating at just around 60% of the overall bandwidth of a stack of HBM while not even quite making the same power savings or space savings, then why is it a big deal? Isn’t it still just immediately made obsolete by HBM? Well, the answer is no, for two reasons.
The first thing to notice is that it’s not a perfect comparison. After all, one chip is just one chip, whereas a stack has the advantage of holding multiple chip-equivalents. Just because they take up the same real estate on the card, that doesn’t mean they are the same amount of memory. So, in theory, a GDDR5X array with the same VRAM capacity as some HBM VRAM array would come much closer in overall graphics card VRAM bandwidth (perhaps just over 10% slower than the HBM system, as estimated by Anandtech).
And yes, that’s still lower, but there are further advantages to GDDR5X when you consider the development side of things. HBM being an entirely new form of VRAM means that chip developers will need to redesign their products with new memory controllers. GDDR5X has enough similarities to GDDR5 to make it a much easier and less expensive proposition to implement it. For this reason, even if HBM, HBM2, and other HBM-like solutions win out in the long run, GDDR5X is likely to see a wider roll-out than HBM in the short run (and possibly at a lower cost to the consumer).
Which Graphics Cards Use Which VRAM:
Now that you’ve heard about these exciting new developments in VRAM design, you might be wondering what sort of VRAM lies within your card, or else where you can get your hands on some of this new technology.
Well, for the time being, most of the cards that are available, from the low-end through the mid-range and into the lower high-end (currently including every card from our Minimum tier to our Exceptional tier builds) still feature GDDR5 VRAM. Popular cards in this year’s builds, from the RX 480 to the GTX 1060 to the GTX 1070, all feature this fairly standard variety of high-clock-speed, relatively-space-inefficient, relatively-energy-inefficient VRAM.
NVIDIA’s highest tier of cards, including the GTX 1080 and the Titan X, currently feature GDDR5X. It seems likely (but not guaranteed) that NVIDIA will continue to make use of GDDR5 and GDDR5X in the near future, simply because that is their current trend and the design implementation is less costly.
AMD, meanwhile, has rolled out HBM in some of their high-end cards, including the R9 Fury X and the Pro Duo. Don’t be surprised if you see smaller form factor cards sporting HBM from AMD in the future. Perhaps using HBM and related innovations will be the avenue through which AMD finally breaks free of their reputation for making cards with comparable performance, but worse thermals and power consumption, compared to NVIDIA.
What about GDDR6?
Micron has been teasing yet another new memory technology for over a year now: GDDR6. Their current plan is to have GDDR6 on the market in or before 2018 (though their earlier estimates were closer to 2020). And, while info on it is scarce, they are now claiming that it will provide 16 Gb/s per pin (meaning somewhere in the neighborhood of 64 GB/s of overall bandwidth per chip, compared to 56 GB/s per chip of GDDR5X and 100 GB/s per stack of HBM).
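Those per-chip numbers follow directly from the per-pin data rate and the 32 data pins of a typical GDDR5/GDDR5X package (the 32-bit chip width is an assumption on my part, not something Micron's announcement spells out). A quick sanity check of the arithmetic:

```python
# Per-chip bandwidth = per-pin data rate (Gb/s) x 32 data pins / 8 bits per byte.
def chip_bandwidth_gbs(pin_rate_gbps: float, data_pins: int = 32) -> float:
    return pin_rate_gbps * data_pins / 8

print(chip_bandwidth_gbs(7.0))    # GDDR5  at  7 Gb/s -> 28 GB/s per chip
print(chip_bandwidth_gbs(14.0))   # GDDR5X at 14 Gb/s -> 56 GB/s per chip
print(chip_bandwidth_gbs(16.0))   # GDDR6  at 16 Gb/s -> 64 GB/s per chip
```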
Is GDDR6 likely to start showing its face in high-end cards over a year from now? Yes, it is.
It’s a GDDR solution, which means—like GDDR5X—it will be less costly for manufacturers to implement than HBM.
Does that mean you should shelve your planned build until it shows up? Absolutely not.
Three reasons: (1) at the claimed speed of GDDR6, it still has a significantly lower overall bandwidth and likely lower power savings than HBM, let alone HBM2; (2) at the claimed speed of GDDR6, it is less than 15% faster than GDDR5X, which is unlikely to be noticeable to the user; and (3) there is no guarantee that this new standard will be released by Micron on schedule, nor that it will live up to its claimed figures (ancient wisdom you should always heed: benchmarks before buying).
Conclusion:
So, would I say you should pick your card based on its VRAM type? In the current market situation, I would say probably not. Frankly, there just aren’t enough cards out there with HBM or GDDR5X to put together proper apples-to-apples benchmark comparisons. But this information definitely helps to illustrate something that we here at Logical Increments are all about: a well-balanced build is crucial.
Consider: a high amount of VRAM (and VRAM that performs at a high level) is going to be most important in set-ups that run at a high resolution. And if you’re already balancing your build well—by following our guides, for instance—then you are not likely to end up in a situation where you buy a 4K monitor (such as the grandiose Dell Ultrasharp 4K 31.5” LCD Monitor) and pair it with a low-end graphics card (like the respectable yet modest RX 460).
And for those of us who are mid-range builders, don’t despair. As with any new technology in the computer world, what is currently rare and expensive will likely become both commonplace and affordable in the future.
Notes:
[i]This chart features numbers from AMD’s press release infographic concerning HBM and specified by JEDEC’s standard document concerning HBM.
[ii]This chart features numbers specified by JEDEC’s standard document concerning GDDR5X.
Daniel Podgorski
is a contributing writer for Logical Increments. He is also the researcher, writer, and web developer behind The Gemsbok, where you can find articles on books, games, movies, and philosophy.
The Pascal GP104 GPU
The GP104 is based on the DX12-compatible architecture called Pascal. Much like past designs, it uses pre-modelled SM clusters that hold 128 shader processors per cluster. Pascal GPUs are composed of different configurations of Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs), and memory controllers. Each SM is paired with a PolyMorph Engine that handles vertex fetch, tessellation, viewport transformation, vertex attribute setup, and perspective correction. The GP104 PolyMorph Engine also includes a new Simultaneous Multi-Projection unit.
There are 20 active SM clusters in a fully enabled Pascal GP104 GPU. The GeForce GTX 1070, however, is not fully enabled: one of the four GPCs (each holding 5 SM clusters) has been disabled.
- The GeForce GTX 1070 (GP104-200) has 15 x 128 shader processors, making a total of 1,920 shader processors.
- The GeForce GTX 1080 (GP104-400) has 20 x 128 shader processors, making a total of 2,560 shader processors.
Each SM however has a cluster of 64 shader / stream / CUDA processors doubled up; don't let that confuse you, it is 128 shader units per SM. Each GPC ships with a dedicated raster engine and five SMs. Each SM contains 128 CUDA cores, 256 KB of register file capacity, a 96 KB shared memory unit, 48 KB of total L1 cache storage, and eight texture units. The reference (Founders Edition) 1070 card will be released with a core clock frequency of 1.5 GHz and a Boost frequency that can run up to 1.68 GHz (and even higher depending on load and thermals). As far as the memory specs of the GP104 GPU are concerned, these boards feature a 256-bit memory bus connected to 8 GB of GDDR5 (1070) / GDDR5X (1080) video buffer memory, AKA VRAM, AKA framebuffer, AKA graphics memory. The GeForce GTX 1000 series is DirectX 12 ready; in our testing we'll address some async compute tests as well, as Pascal now has Enhanced Async Compute. The latest revision of DX12 is a Windows 10-only feature, yet it will bring in significant optimizations. For your reference, here's a quick overview of some past generation high-end GeForce cards.
| GeForce GTX | 780 | 780 Ti | 970 | 980 | Titan | Titan X | 1070 | 1080 |
|---|---|---|---|---|---|---|---|---|
| Stream (Shader) Processors | 2,304 | 2,880 | 1,664 | 2,048 | 2,688 | 3,072 | 1,920 | 2,560 |
| Core Clock (MHz) | 863 | 875 | 1,050 | 1,126 | 836 | 1,002 | 1,506 | 1,607 |
| Boost Clock (MHz) | 900 | 928 | 1,178 | 1,216 | 876 | 1,076 | 1,683 | 1,733 |
| Memory Clock (effective MHz) | 6,000 | 7,000 | 7,000 | 7,000 | 6,000 | 7,000 | 8,000 | 10,000 |
| Memory Amount (MB) | 3,072 | 3,072 | 4,096 | 4,096 | 6,144 | 12,288 | 8,192 | 8,192 |
| Memory Interface | 384-bit | 384-bit | 256-bit | 256-bit | 384-bit | 384-bit | 256-bit | 256-bit |
| Memory Type | GDDR5 | GDDR5 | GDDR5 | GDDR5 | GDDR5 | GDDR5 | GDDR5 | GDDR5X |
With 8 GB graphics memory available for one GPU, both the GTX 1070 and 1080 are very attractive for both modern and future games no matter what resolution you game at.
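For reference, the board-level bandwidth figures that show up later in the specification chart follow from the effective memory data rate and the 256-bit bus. Here is a minimal sketch (assuming 32-bit-wide memory chips, and ignoring compression and real-world efficiency):

```python
# Board-level bandwidth = bus width (bits) x effective data rate (Gb/s per pin) / 8.
# With 32-bit chips, a 256-bit bus also implies eight memory packages on the board.
BUS_WIDTH_BITS = 256
CHIP_WIDTH_BITS = 32

print(BUS_WIDTH_BITS // CHIP_WIDTH_BITS)   # 8 chips of 1 GB each for the 8 GB cards
print(BUS_WIDTH_BITS * 8.0 / 8)            # GTX 1070, GDDR5  at  8 Gbps effective -> 256 GB/s
print(BUS_WIDTH_BITS * 10.0 / 8)           # GTX 1080, GDDR5X at 10 Gbps effective -> 320 GB/s
```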
Improved Color Compression
You will have noticed the GDDR5X memory on the 1080; it increases bandwidth. So what about the 1070 with GDDR5 then? The 1070 uses GDDR5 memory, but it's the good stuff: 8,000 MHz (effective), 8 GB, and all on a 256-bit wide memory bus. Well, you can never have too much bandwidth, so Nvidia applied some more tricks, color compression being one of them. The GPU’s compression pipeline has a number of different algorithms that intelligently determine the most efficient way to compress the data. One of the most important algorithms is delta color compression. With delta color compression, the GPU calculates the differences between pixels in a block and stores the block as a set of reference pixels plus the delta values from the reference. If the deltas are small, then only a few bits per pixel are needed. If the packed-together result of reference values plus delta values is less than half the uncompressed storage size, then delta color compression succeeds and the data is stored at half size (2:1 compression). Pascal GPUs include a significantly enhanced delta color compression capability:
- 2:1 compression has been enhanced to be effective more often
- A new 4:1 delta color compression mode has been added to cover cases where the per pixel deltas are very small and are possible to pack into ¼ of the original storage
- A new 8:1 delta color compression mode combines 4:1 constant color compression of 2x2 pixel blocks with 2:1 compression of the deltas between those blocks
With that additional memory bandwidth combined with new advancements in color compression, Nvidia can claim even more effective bandwidth, as Pascal cards now use 4th-generation delta color compression together with enhanced caching techniques. Up to Maxwell the GPU could handle 2:1 color compression ratios; newly added are 4:1 and 8:1 delta color compression. So on one hand the raw memory bandwidth increases 1.4x (for the GeForce GTX 1080 with GDDR5X), and then there's a compression benefit of 1.2x for the GeForce GTX 1070, which is a nice step up for this generation technology-wise. Overall there is an increase of roughly 1.6x - 1.7x in effective memory bandwidth thanks to the faster memory and new color compression technologies. The effectiveness of delta color compression depends on which pixel ordering is chosen for the delta color calculation; in practice the GPU is able to significantly reduce the number of bytes that have to be fetched from memory per frame.
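To illustrate the principle (this is a toy model, not NVIDIA's actual, undocumented hardware algorithm), here is a small Python sketch of 2:1 delta color compression: store one reference value plus small per-pixel deltas, and accept the compressed form only if it fits in half of the raw storage.

```python
# Toy model of 2:1 delta color compression on one block of eight 8-bit colour values.
# Purely illustrative; the real hardware compression pipeline is proprietary.

def try_delta_compress(block, delta_bits=3):
    """Return (compressed, payload); payload is (reference, deltas) when compression succeeds."""
    reference = block[0]
    deltas = [value - reference for value in block]
    limit = 2 ** (delta_bits - 1)                  # signed range each delta must fit into
    if all(-limit <= d < limit for d in deltas):
        # 8-bit reference + eight 3-bit deltas = 32 bits, half of the 64-bit raw block.
        return True, (reference, deltas)
    return False, None                             # deltas too large: store the block raw

smooth_block = [118, 119, 118, 120, 121, 119, 118, 120]   # neighbouring pixels are similar
noisy_block  = [12, 240, 7, 199, 64, 255, 3, 180]         # high-frequency content

print(try_delta_compress(smooth_block))   # (True, (118, [0, 1, 0, 2, 3, 1, 0, 2]))
print(try_delta_compress(noisy_block))    # (False, None)
```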
Pascal Graphics Architecture
Let's place the more important data on the GPU into a chart to get an idea and better overview of changes in terms of architecture like shaders, ROPs and where we are at frequencies wise:
| GeForce | GTX 1080 | GTX 1070 | GTX Titan X | GTX 980 Ti | GTX 980 |
|---|---|---|---|---|---|
| GPU | GP104 | GP104 | GM200 | GM200 | GM204 |
| Architecture | Pascal | Pascal | Maxwell | Maxwell | Maxwell |
| Transistor Count | 7.2 Billion | 7.2 Billion | 8 Billion | 8 Billion | 5.2 Billion |
| Fabrication Node | TSMC 16 nm FF | TSMC 16 nm FF | TSMC 28 nm | TSMC 28 nm | TSMC 28 nm |
| CUDA Cores | 2,560 | 1,920 | 3,072 | 2,816 | 2,048 |
| SMMs / SMXs | 20 | 15 | 24 | 22 | 16 |
| ROPs | 64 | 64 | 96 | 96 | 64 |
| GPU Core Clock | 1,607 MHz | 1,506 MHz | 1,002 MHz | 1,002 MHz | 1,127 MHz |
| GPU Boost Clock | 1,733 MHz | 1,683 MHz | 1,076 MHz | 1,076 MHz | 1,216 MHz |
| Memory Clock | 1,250 MHz | 2,000 MHz | 1,753 MHz | 1,753 MHz | 1,753 MHz |
| Memory Size | 8 GB | 8 GB | 12 GB | 6 GB | 4 GB |
| Memory Bus | 256-bit | 256-bit | 384-bit | 384-bit | 256-bit |
| Memory Bandwidth | 320 GB/s | 256 GB/s | 337 GB/s | 337 GB/s | 224 GB/s |
| FP Performance | 9 TFLOPS | 6.5 TFLOPS | 7.0 TFLOPS | 6.4 TFLOPS | 4.61 TFLOPS |
| Thermal Threshold | 97 °C | 97 °C | 91 °C | 91 °C | 95 °C |
| TDP | 180 W | 150 W | 250 W | 250 W | 165 W |
| Launch MSRP | $599/$699 | $379/$449 | $999 | $699 | $549 |
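The FP performance row is essentially the product of CUDA core count, boost clock and two floating-point operations per core per clock (a fused multiply-add); a quick sanity check of the table values:

```python
# Peak FP32 throughput ~= CUDA cores x boost clock (GHz) x 2 ops per clock (FMA), in TFLOPS.
def peak_fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    return cuda_cores * boost_clock_ghz * 2 / 1000

print(round(peak_fp32_tflops(2560, 1.733), 1))   # GTX 1080 -> ~8.9 TFLOPS (listed as 9)
print(round(peak_fp32_tflops(1920, 1.683), 1))   # GTX 1070 -> ~6.5 TFLOPS
```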
So we talked about the core clocks, specifications and memory partitions. However, to better understand a graphics processor you need to break it down into smaller pieces. Let's first look at the raw data that most of you can understand and grasp; this bit is about the architecture. NVIDIA’s “Pascal” GPU architecture implements a number of architectural enhancements designed to extract even more performance and power efficiency per watt consumed. The GP104 block diagram visualizes this architecture; Nvidia started developing Pascal around 2013/2014 already. The GPU holds four GPCs, each with five SM (streaming multi-processor) clusters, for 20 in total. You'll spot eight 32-bit memory interfaces, bringing in a 256-bit path to the GDDR5 or GDDR5X graphics memory. Tied to each 32-bit memory controller are eight ROP units and 256 KB of L2 cache. The GP104 chips used in both the GTX 1070 and GTX 1080 ship with a total of 64 ROPs and 2,048 KB of L2 cache.
A fully enabled GP104 GPU will have (GTX 1080):
- 2,560 CUDA/Shader/Stream processors
- There are 128 CUDA cores (shader processors) per cluster (SM)
- 7.2 Billion Transistors (FinFET at 16 nm)
- 160 Texture units
- 64 ROP units
- 2 MB L2 cache
- 256-bit GDDR5X
A partially disabled GP104 GPU will have (GTX 1070):
- 1,920 CUDA/Shader/Stream processors
- There are 128 CUDA cores (shader processors) per cluster (SM)
- 7.2 Billion Transistors (FinFET at 16 nm)
- 120 Texture units
- 64 ROP units
- 2 MB L2 cache
- 256-bit GDDR5
What about double-precision? It's dumbed down so as not to interfere with Quadro sales: double-precision instruction throughput is 1/32 the rate of single-precision instruction throughput. An important thing to focus on is the SM (block of shader processors) cluster, which has 128 shader processors. One SM holds 128 single-precision shader cores, double-precision units, special function units (SFUs), and load/store units. So based on a full 20 SM (2,560 shader processor) chip, the design looks fairly familiar. In the pipeline we run into the ROP (Raster Operation) engine, and the GP104 has 64 of these engines for features like pixel blending and AA. The GPU has 64 KB of L1 cache for each SM plus a special 48 KB texture unit memory that can be utilized as a read-only cache. The GPU’s texture units are a valuable resource for compute programs that need to sample or filter image data. As for texture throughput, each SM contains 8 texture filtering units.
- GeForce GTX 960 has 8 SMX x 8 Texture units = 64
- GeForce GTX 970 has 13 SMX x 8 Texture units = 104
- GeForce GTX 980 has 16 SMX x 8 Texture units = 128
- GeForce GTX Titan X has 24 SMX x 8 Texture units = 192
- GeForce GTX 1070 has 15 SMX x 8 Texture units = 120
- GeForce GTX 1080 has 20 SMX x 8 Texture units = 160
So there's a total of up to 20 SMX x 8 TU = 160 texture filtering units available on the silicon itself (if all SMXes are enabled for the SKU).
Asynchronous Compute
Modern gaming workloads are increasingly complex, with multiple independent, or “asynchronous,” workloads that ultimately work together to contribute to the final rendered image. Some examples of asynchronous compute workloads include:
- GPU-based physics and audio processing
- Postprocessing of rendered frames
- Asynchronous timewarp, a technique used in VR to regenerate a final frame based on head position just before display scanout, interrupting the rendering of the next frame to do so
These asynchronous workloads create two new scenarios for the GPU architecture to consider.

The first scenario involves overlapping workloads. Certain types of workloads do not fill the GPU completely by themselves; in these cases there is a performance opportunity to run two workloads at the same time, sharing the GPU and running more efficiently, for example a PhysX workload running concurrently with graphics rendering. For overlapping workloads, Pascal introduces support for “dynamic load balancing.” In Maxwell-generation GPUs, overlapping workloads were implemented with static partitioning of the GPU into a subset that runs graphics and a subset that runs compute. This is efficient provided that the balance of work between the two loads roughly matches the partitioning ratio. However, if the compute workload takes longer than the graphics workload, and both need to complete before new work can be done, then the portion of the GPU configured to run graphics will go idle. This can reduce performance by more than any benefit gained from running the workloads overlapped. Hardware dynamic load balancing addresses this issue by allowing either workload to fill the rest of the machine if idle resources are available.
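A toy timing model helps show why this matters: with static partitioning the frame time is set by whichever partition finishes last, while dynamic load balancing lets idle resources pick up the remaining work. The workload sizes below are invented for illustration and do not model real hardware scheduling:

```python
# Toy model: one frame consists of graphics work and compute work (arbitrary units),
# executed on a GPU that can process 100 units per millisecond in total.
GPU_RATE = 100
GRAPHICS_WORK = 600
COMPUTE_WORK = 500

def static_partition_ms(graphics_fraction: float) -> float:
    """Fixed split for the whole frame; the slower partition determines frame time."""
    gfx_time = GRAPHICS_WORK / (GPU_RATE * graphics_fraction)
    cmp_time = COMPUTE_WORK / (GPU_RATE * (1 - graphics_fraction))
    return max(gfx_time, cmp_time)

def dynamic_balancing_ms() -> float:
    """Whichever workload finishes first frees its resources for the other."""
    return (GRAPHICS_WORK + COMPUTE_WORK) / GPU_RATE

print(static_partition_ms(0.7))   # 70/30 split -> ~16.7 ms, graphics units sit idle at the end
print(dynamic_balancing_ms())     # -> 11.0 ms, nothing idles
```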
Time-critical workloads are the second important asynchronous compute scenario. For example, an asynchronous timewarp operation must complete before scanout starts or a frame will be dropped. In this scenario, the GPU needs to support very fast, low-latency preemption to move the less critical workload off the GPU so that the more critical workload can run as soon as possible. A single rendering command from a game engine can potentially contain hundreds of draw calls, with each draw call containing hundreds of triangles, and each triangle containing hundreds of pixels that have to be shaded and rendered. A traditional GPU implementation that implements preemption at a high level in the graphics pipeline would have to complete all of this work before switching tasks, resulting in a potentially very long delay. To address this issue, Pascal is the first GPU architecture to implement Pixel Level Preemption. The graphics units of Pascal have been enhanced to keep track of their intermediate progress on rendering work, so that when preemption is requested, they can stop where they are, save off context information about where to start up again later, and preempt quickly. The following example describes a preemption request being executed.
In the command pushbuffer, three draw calls have been executed, one is in process, and two are waiting. The current draw call has six triangles: three have been processed, one is being rasterized, and two are waiting. The triangle being rasterized is about halfway through. When a preemption request is received, the rasterizer, triangle shading and command pushbuffer processor will all stop and save off their current position. Pixels that have already been rasterized will finish pixel shading, and then the GPU is ready to take on the new high-priority workload. The entire process of switching to a new workload can complete in less than 100 microseconds (μs) after the pixel shading work is finished.

Pascal also has enhanced preemption support for compute workloads. Thread Level Preemption for compute operates similarly to Pixel Level Preemption for graphics. Compute workloads are composed of multiple grids of thread blocks, each grid containing many threads. When a preemption request is received, the threads that are currently running on the SMs are completed. Other units save their current position to be ready to pick up where they left off later, and then the GPU is ready to switch tasks. The entire process of switching tasks can complete in less than 100 μs after the currently running threads finish. For gaming workloads, the combination of pixel-level graphics preemption and thread-level compute preemption gives Pascal the ability to switch workloads extremely quickly with minimal preemption overhead.