oZone3D.Net Tutorials - 3D Pipeline Of SM3/DX9 GPUs - ATI Radeon R580 X1900

3D Pipeline Of SM3/DX9 GPUs

By Jerome 'JeGX' Guinot - jegx[NO-SPAM-THANKS]ozone3d.net

Initial draft: November 24, 2006


4 - Fragment Processing Unit

And now let's examine the rasterisation module. This module is made up of 2 principal parts:

the Fragment Processing Unit
the Raster Operation Unit

The Fragment Processing Unit is by far the best-known component of the 3D pipeline. This is to be expected, since most current graphics effects depend on its capabilities. The following names are equivalent:

Fragment Processing Unit
Pixel Shader Engine
Pixel Shading Unit
Pixel Shader Processor
Fragment Shading Unit
Fragment Shader Core
Pixel Shader Core

The Fragment Processing Unit also works according to the SIMD principle and therefore has, like the Vertex Processing Unit, a certain number of pipelines, called Fragment Pipelines. The following names are equivalent:

Fragment Pipeline
Pixel Pipeline
Fragment Engine
Pixel Engine
Pixel Processor
Pixel Pipe

Each of these Fragment Pipelines executes a program called a Pixel Shader, also known as a Fragment Program.

The pixel pipelines are grouped in blocks of 4 called quads: a quad can thus process 4 fragments at the same time. A quad is the smallest element processed by the Pixel Shading Unit.

Before continuing, let us clarify the concepts of fragment and pixel. A pixel being processed in a Fragment Pipeline is called a fragment. On leaving the Pixel Engine, and thus on entering the Raster Operation Unit, this fragment becomes a pixel candidate. On leaving the Raster Operation Unit, the pixel candidate becomes the final pixel, which is written to the framebuffer and can then be seen on the screen.

The newer graphics effects (soft shadows, HDR, depth of field) make heavy use of the Fragment Processing Unit. Hence, the more quads the Fragment Processing Unit has, the more powerful it is. As an example, here are the values for some current graphics controllers:

nVidia Geforce 7950 GX2: 2 x 6 quads = 48 pixel pipelines
nVidia Geforce 7900 GTX: 6 quads = 24 pixel pipelines
ATI Radeon X1950 XTX: 12 quads = 48 pixel pipelines
ATI Radeon X1900 XTX: 12 quads = 48 pixel pipelines
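
The figures above follow from simple arithmetic, since each quad groups 4 pixel pipelines. Here is a minimal Python sketch of that calculation; the function name and the dictionary layout are illustrative choices only and do not come from any vendor tool:

```python
PIPELINES_PER_QUAD = 4  # a quad groups 4 pixel pipelines


def pixel_pipelines(quads_per_gpu, gpu_count=1):
    """Total pixel pipelines of a card with `quads_per_gpu` quads on each GPU."""
    return gpu_count * quads_per_gpu * PIPELINES_PER_QUAD


# Quad counts taken from the list above (the 7950 GX2 carries two GPUs).
cards = {
    "Geforce 7950 GX2": pixel_pipelines(6, gpu_count=2),
    "Geforce 7900 GTX": pixel_pipelines(6),
    "Radeon X1950 XTX": pixel_pipelines(12),
    "Radeon X1900 XTX": pixel_pipelines(12),
}

for name, count in cards.items():
    print(f"{name}: {count} pixel pipelines")
```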

To help clear up a common confusion: the R580 GPU (Radeon X1900) seems more powerful than the G71 GPU (Geforce 7900) because it has twice as many pixel pipelines. In fact, this is relative, because an R580 pixel pipeline is shallower (fewer stages: a simple pipeline) than a G71 pixel pipeline (more stages: a complex pipeline). The more stages a pipeline has, the more pixels it can have in flight at once. In brief, while a pixel pipeline of the R580 processes one pixel, a pixel pipeline of the G71 processes 2.

But then why such an architectural difference? And what is the advantage of one type of pipeline over the other?

These two types of pixel pipeline probably exist today for historical reasons. GPU manufacturing is extremely expensive, and chip makers try to reuse as much as possible of what already exists. From the very start, the pixel pipelines of nVidia cards were more complex than those of ATI cards, which explains the current situation.

The advantage, or rather the difference in behaviour, of one type compared to the other lies primarily in the handling of dynamic branching, which is an important, not to say fundamental, feature of Shader Model 3.0.

Dynamic branching is the somewhat intimidating name given to conditional tests, such as the if statement that most programming languages support. The major problem with these branches is that the result of a branch (or conditional jump) forces a flush of all the pixel pipelines if the new codepath (or sequence of instructions) differs from the codepath already in use. It is precisely here that a complex pipeline is penalized: flushing a complex pipeline takes more time than flushing a simple one. On the other hand, as long as the running codepath does not change, the complex pipeline is more powerful than the simple one. Conclusion: everything depends on the type of pixel shader and on the presence, or absence, of dynamic branching.
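
To make this trade-off concrete, here is a deliberately simplified cost model in Python. It is only a sketch: the pipeline depths (4 and 16 cycles lost per flush) are made-up values, and the only figures taken from the text are the relative throughputs (1 pixel for the simple pipeline, 2 for the complex one). Real hardware is far more subtle.

```python
import math


def total_cycles(pixels, flushes, throughput, flush_penalty):
    """Toy model: steady-state processing time plus a fixed cost per flush."""
    return math.ceil(pixels / throughput) + flushes * flush_penalty


# Hypothetical parameters, chosen only to illustrate the argument.
SIMPLE = {"throughput": 1, "flush_penalty": 4}     # shallow pipeline
COMPLEX = {"throughput": 2, "flush_penalty": 16}   # deep pipeline

for flushes in (0, 50):
    simple = total_cycles(1000, flushes, **SIMPLE)
    deep = total_cycles(1000, flushes, **COMPLEX)
    print(f"{flushes} flushes: simple={simple} cycles, complex={deep} cycles")
```

In this model the complex pipeline wins when the codepath never changes, and loses once flushes become frequent enough, which is the conclusion drawn above.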

A term often encountered is Execution Thread, or simply Thread. A thread is the group of fragments that are processed at the same time and follow the same codepath; its size is the number of fragments in the group. This point is important because, when a dynamic branch occurs, some of the fragments of a thread may not be concerned by the new codepath. In this case, these fragments still follow the codepath, but its instructions are simply not executed for them. It is therefore easy to see that the more fragments there are per thread, the more inertia there is in the handling of dynamic branches! ATI's R500+ GPUs, with their threads of 16 pixels, handle dynamic branches very efficiently. nVidia GPUs work with much larger threads, about 1024 pixels, a number which is rather penalizing. This large number of fragments per thread on nVidia GPUs is a consequence of their complex pixel pipelines.
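
The effect of thread size can be illustrated with a small Python sketch. It assumes, for simplicity, that fragments are laid out in a single row and cut into consecutive threads, whereas a real GPU groups them in 2D blocks; the scene (4096 fragments with two 8-fragment shadow edges) is invented for the example. A thread pays for the expensive codepath as soon as one of its fragments needs it.

```python
def fragments_paying_for_branch(needs_branch, thread_size):
    """Count fragments that go through the expensive codepath.

    A whole thread follows the codepath as soon as any of its fragments
    needs it, even if the instructions are masked out for the others.
    """
    paying = 0
    for start in range(0, len(needs_branch), thread_size):
        thread = needs_branch[start:start + thread_size]
        if any(thread):
            paying += len(thread)
    return paying


# Hypothetical scene: 4096 fragments, two narrow shadow edges of 8 fragments.
needs_branch = [False] * 4096
for edge_start in (1000, 3000):
    for i in range(edge_start, edge_start + 8):
        needs_branch[i] = True

for size in (16, 1024):
    count = fragments_paying_for_branch(needs_branch, size)
    print(f"thread of {size} fragments: {count} fragments follow the codepath")
```

Only 16 fragments really need the branch here, yet the number that follow the codepath grows with the thread size.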

As an example of dynamic branching, please refer to the following OpenGL benchmark: Soft Shadows Benchmark.

Soft Shadows Benchmark - GLSL Soft Shadows Benchmark

The code below is extracted from the pixel shader ps_7x7_bluring_kernel_v3b_tex.glsl used by the benchmark (in the data/ directory):

void main (void)
{
	//...
	//... See the original code
	//...

	
	// Dynamic Branching
	if( (shadowColor-1.0) * shadowColor * lambertTerm != 0.0 )
	{
	
		//...
		//... See the original code
		//...
	}
		
	//...
	//... See the original code
	//...

	gl_FragColor = final_color;			
}

Soft shadows are very demanding for the Pixel Shading Unit. The principle of soft shadows is rather simple: the scene is rendered from the light's point of view into a depth map (also called a shadow map), then, in the pixel shader, the shadow map is filtered in order to soften the edges of the shadows. Filtering requires a large number of reads from the shadow map. These reads consume many GPU cycles and, without dynamic branching, the filtering is performed over the whole shadow map. With dynamic branching, the shader can detect the edge of the shadow in the shadow map and perform the filtering on the edge only. In theory, this should dramatically increase performance.

Here are some figures to give an idea:
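
The saving in shadow map reads can be estimated with a short Python sketch. Several things here are assumptions and not taken from the benchmark: the 4-tap coarse probe that produces shadowColor, the scene composition, and the helper names. The 7x7 kernel size is only suggested by the shader's file name, and the test itself mirrors the condition of the if in the extract above. The model also ignores thread granularity.

```python
FULL_KERNEL_TAPS = 7 * 7  # suggested by the name ps_7x7_bluring_kernel
COARSE_TAPS = 4           # assumption: cheap probe used to find shadow edges


def needs_filtering(shadow_color, lambert_term):
    """Same test as the shader's `if`: non-zero only in the penumbra
    (shadow_color strictly between 0 and 1) of a lit surface."""
    return (shadow_color - 1.0) * shadow_color * lambert_term != 0.0


def shadow_map_reads(fragments, dynamic_branching):
    """Total shadow map reads for a list of (shadow_color, lambert_term)."""
    reads = 0
    for shadow_color, lambert_term in fragments:
        reads += COARSE_TAPS
        if not dynamic_branching or needs_filtering(shadow_color, lambert_term):
            reads += FULL_KERNEL_TAPS
    return reads


# Hypothetical scene: 600 fully lit, 300 fully shadowed, 100 on a shadow edge.
fragments = [(1.0, 0.8)] * 600 + [(0.0, 0.8)] * 300 + [(0.5, 0.8)] * 100

print("without dynamic branching:", shadow_map_reads(fragments, False))
print("with dynamic branching:   ", shadow_map_reads(fragments, True))
```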

Geforce 7950 GX2 / Forceware 91.47:
Without Dynamic Branching: 2305 o3Marks
With Dynamic Branching: 2125 o3Marks

Radeon X1950XT / Catalyst 6.9:
Without Dynamic Branching: 1805 o3Marks
With Dynamic Branching: 3634 o3Marks

Radeon X1600XT / Catalyst 6.9:
Without Dynamic Branching: 450 o3Marks
With Dynamic Branching: 1183 o3Marks

We can observe that performance at least doubles on the Radeon X1600XT and X1950XTX (about 2.6 times and 2 times respectively) and remains roughly the same, in fact slightly lower, on the GeForce 7950 GX2. These results confirm one aspect of the preceding theory: Geforce GPUs, with their complex pixel pipelines and threads of many fragments, are less efficient at dynamic branching than Radeon GPUs, with their short pipelines and light threads. Moreover, ATI has specifically optimized the processing of dynamic branches with its Ultra Threading technology. For more details on this technology, please consult the references in the footnotes. The Radeon X1950XTX is a real killer when it comes to dynamic branching!

The following demo, taken from the nVidia SDK, shows soft shadows in action with an algorithm similar to that of the preceding benchmark, but with shaders written in assembly. Its results corroborate the benchmark scores:
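
The gain brought by dynamic branching can be computed directly from the scores listed above. The following Python sketch does nothing more than divide one score by the other; the card labels are copied from the results and the helper name is an illustrative choice:

```python
# (score without dynamic branching, score with dynamic branching), in o3Marks
scores = {
    "Geforce 7950 GX2": (2305, 2125),
    "Radeon X1950XT": (1805, 3634),
    "Radeon X1600XT": (450, 1183),
}


def speedup(without_db, with_db):
    """Ratio above 1 means dynamic branching helps, below 1 means it hurts."""
    return with_db / without_db


for card, (without_db, with_db) in scores.items():
    print(f"{card}: x{speedup(without_db, with_db):.2f}")
```

This gives roughly x0.92 for the Geforce 7950 GX2, x2.01 for the Radeon X1950XT and x2.63 for the Radeon X1600XT.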

nVidia Soft Shadows Demo - With Dynamic Branching: 30 fps

nVidia Soft Shadows Demo - Without Dynamic Branching: 30 fps

The odd thing is that this demo was meant to let nVidia showcase the benefit of dynamic branching...

Another feature of Shader Model 3.0 is the ability to access the texture units from the vertex processor. This very practical feature makes it possible to deform the geometry of a 3D object according to the contents of a texture. Reading the contents of a texture from the vertex shader is called Vertex Texture Fetching, or VTF. For more details, please refer to the following tutorial: Vertex Displacement Mapping. But there is a hitch. Currently, only nVidia GPUs (Geforce 6 and 7) are able to perform VTF. For reasons of hardware design, the ATI Radeon GPUs cannot perform VTF, whether in Direct3D or in OpenGL. It is hoped that they eventually will!

With the next generation of GPUs (SM4/DX10), which will be able to dynamically create and destroy vertices inside the vertex processing unit, Vertex Displacement Mapping will become really interesting.

For a summary of the innovations brought by SM 3.0, please see the following page: Shader Model 3.0 - No Limits. As far as the pixel processor is concerned, the latest GPU models from ATI and nVidia are compatible with SM 3.0.
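
What a vertex shader does with VTF can be mimicked on the CPU. The Python sketch below is an assumption-laden illustration, not the code of the Vertex Displacement Mapping tutorial: the height map is a tiny list of lists, the lookup is nearest-neighbour, and the names fetch_texel and displace are hypothetical. Each vertex is pushed along its normal by the height read from the texture.

```python
def fetch_texel(texture, u, v):
    """Nearest-neighbour read of a height at normalized coordinates (u, v)."""
    height = len(texture)
    width = len(texture[0])
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return texture[y][x]


def displace(position, normal, uv, texture, scale):
    """Move a vertex along its normal by the height fetched from the texture."""
    h = fetch_texel(texture, uv[0], uv[1])
    return tuple(p + n * h * scale for p, n in zip(position, normal))


# Hypothetical 2x2 height map and a single vertex of a flat, upward-facing mesh.
height_map = [[0.0, 0.5],
              [1.0, 0.25]]
vertex = displace((0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.1, 0.9), height_map, 2.0)
print(vertex)
```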
