The authors of GPUs: a closer look [Fatahalian and Houston, 2008] provides the reader a high-level overview of the inner workings of a graphics processing unit (GPU) outlined as a computing pipeline. This pipeline is an interleaving of program execution on programmable and fixed-function cores designed to execute Single Instruction Multiple Data (SIMD) com- mands in a massively parallel order. The application-programmable stages in the pipeline consist of vertex processing (VP), primitive processing (PP) and fragment processing (FP). Fixed-function stages of the pipeline consist of the vertex generation (VG), primitive genera- tion (PG), fragment generation and pixel operations (PO). The authors also provided points of challenges in GPU designs as of year 2008 and some opportunities of advancements in strategies for more efficient program execution.
Notes on opportunities and challenges
As discussed by the authors, computations in the PO stage is not easily subjected to parallel program execution since due to dependencies between each stream of fragment data which define the state of a single pixel in the screen. This is particularly true for instances in which the image to render involves a lot of overlapping objects which then require independent non-parallelize-able computation of pixel values for each overlapping sample position, i.e. 3D animation. Even more challenging are instances which involve higher frame rate and higher dynamism in the ordering of objects in the scene, e.g. 120fps gaming with high dynamism in virtual camera positions. Scenes with fewer pixel value changes per frame may employ selective refresh of affected pixels only to reduce the computations per frame.
Fixed-function stages implement tasks that are not easily parallelize-able using opti- mized hardware-defined logic. Fixed GPU functions mentioned by the authors were texture filtering, fragment generation, output pixel compositing, data (de)compression and surface visibility determination. Since the hardware logic for these functions are highly specialized and at the same time compute and data access efficient (also in the scale of gigaflops), it is interesting to investigate whether these functions can be mapped to other solutions to non-graphics generation related problems. The article did not provide insights about this curiosity though. If such non-graphics generation related problems can be characterized, then it may be possible to provide an application programming interface for providing input to and accessing output of the fixed-function computing components of a GPU. The function of these components are fixed but the meaning we give to their input and output are purpose specific. An example of such repurposing of a fixed-function computation is that of comput- ing match scores between a pattern and same length substrings of a much longer text us- ing convolution [Fischer and Paterson, 1974, Fredriksson and Grabowski, 2009, Baba, 2010, Alba et al., 2012]. Efficient hardware logic which implements computation of convolution of two arrays of numerical values for the multiplication operation is then also an efficient hardware logic for computing approximate matches of a pattern on a text. The problem of multiplying two numerical values and that of computing for match scores between a pattern and a text thus share a characteristic property used for repurposing the computation of convolution.
Different applications will require shader functions of varying computing and storage re- quirements. This will require efficient strategies for management of resources which must also exhibit flexibility in response to rapid variation in user input. e.g. panning rapidly and frequently in a high frame rate 3D computer game with many objects to render. Of interest for investigation in this challenge area are algorithms for scheduling and alloca- tion of computing tasks to processors. Some results on this interest use software-defined management mechanisms in contrast to hardware-defined logic in current commercial GPUs [Tzeng et al., 2010, Membarth et al., 2012, Wu et al., 2015]. Software-defined approach to task distribution and scheduling introduces flexibility in optimization in the program level which is not available in fixed hardware-based task management logic. A task-donation scheme load-balancing approach implemented using a parallel task queue is proposed in [Tzeng et al., 2010]. Investigated on different workloads, the authors have shown that the scheme provides higher efficiency in load distribution compared to other load-balancing schemes implemented using monolithic task queue. A task-scheduling framework for GPU which considers multiple applications where high priority tasks are allowed to interrupt pre- viously queued lower priority tasks is proposed in [Membarth et al., 2012]. The efficiency of a GPU using the proposed framework is analyzed with respect to applications in med- ical imaging. The software-based approach proposed in [Wu et al., 2015] for optimizing load-balancing in GPU tackles the limitation in the current hardware-defined load-balancing mechanism of GPUs by working on (1) the assignment of data-sharing tasks to threads be- longing to same multiprocessor and (2) the control on the number of threads to activate on a single multiprocessor.
As pointed out in the paper, high predictability of processing of stream data records allows for aggregate data prefetching (batch loading of input data from globally shared memory) for minimized data access count. An example of input data which can be batch-loaded prior to computation are those which can be subjected to parallel reduction, i.e. data applied with single associative arithmetic operation such as addition and subtraction. On the other hand, shader functions’ access of buffer and texture data is not suitable for parallelism though since storage of such data requires dynamic memory addresses. An example of such data are intermediate computation results which will serve as input to succeeding computations. Intermediate results can be stored in cache (if succeeding operations will access it frequently) internal to blocks or in registers if it will only be used once. As mentioned by the authors, access to such storage during computation will require optimized onboard storage resource management.
Shader function invocations on related data streams may use a shared instruction stream to parallelize the execution process. This results to an SIMD type of parallel computation. In addition to what the authors have pointed out, another level of parallelism can be applied to the branching statements of user-programmable stages in the pipeline. Conditional (shared) instructions on a set of related data streams can be serially assigned to each arithmetic logic unit (ALU) of core [Goddeke et al., 2011]. The parallelism is applied to internal-to-branch shared instruction for the related data streams.
Thoughts about the paper
The authors were able to provide a high-level introduction to the computation on a GPU and at the same time introduce areas of challenges in the current design and implementation of the device. Of particular interest from this paper is the opportunities for investigating software-based approaches to introducing optimizations in various aspects of the device’s operation such as load balancing, task scheduling and task assignment which are restricted on current hardware-based resource management.
[Alba et al., 2012] Alba, A., Rodriguez-Kessler, M., Arce-Santana, E. R., and Mendez, M. O. (2012). Approximate string matching using phase correlation. Conference proceedings : …Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference, 2012:6309–12.
[Baba, 2010] Baba, K. (2010). String Matching with Mismatches by Real-valued FFT*. In Computational Science and Its Applications – ICCSA 2010, Part IV, pages 273–283.
[Fatahalian and Houston, 2008] Fatahalian, K. and Houston, M. (2008). GPUs: a closer look. Queue, 6(2):18.
[Fischer and Paterson, 1974] Fischer, M. J. and Paterson, M. S. (1974). String Matching and other Products. Technical report, Massachusetts Institute of Technology, Cambridge.
[Fredriksson and Grabowski, 2009] Fredriksson, K. and Grabowski, S. (2009). Fast con- volutions and their applications in approximate string matching. In 20th International Workshop, IWOCA 2009, pages 254–265.
[Goddeke et al., 2011] Goddeke, D., Kurzak, J., and Weiß, J.-P. (2011). Scientific Comput- ing on GPUs : GPU Architecture Overview.
[Membarth et al., 2012] Membarth, R., Lupp, J. H., Hannig, F., Teich, J., K ̈orner, M., and Eckert, W. (2012). Dynamic task-scheduling and resource management for GPU accelera- tors in medical imaging. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7179 LNCS:147–159.
[Tzeng et al., 2010] Tzeng, S., Patney, A., and Owens, J. D. J. (2010). Task management for irregular-parallel workloads on the GPU. Proceedings of the Conference on High Performance Graphics, pages 29–37.
[Wu et al., 2015] Wu, B., Chen, G., Li, D., Shen, X., and Vetter, J. (2015). Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transforma- tions. In Proceedings of the 29th ACM on International Conference on Supercomputing -ICS ’15, pages 119–130, New York, New York, USA. ACM Press.