Pulling off an all-nighter

I was able to finalize my tickets for my CS 239 requirements this sem. Tonight I will work on my thesis first, then move on to drafting code for my parallel programming project in CS 239. This will be a loooong night. I need to set aside some lower-priority stuff in the meantime, like gaming on weekends, Facebook, YouTube, and Twitter, but I will keep posting most of my updates here at least until the semester ends.


Tickets and more tickets

Now working on my Scrum board for my parallel programming class. I am trying out the free version of JetBrains' YouTrack for task management. So far, I would say Jira's task management is more intuitive and its user interface is much more user-friendly.


Hell weeks ahead

I have a project paper and presentation on parallel programming (using CUDA) due two weeks from now, and I haven't started the project yet. On top of that, I still need to prove the correctness of two algorithms for my thesis, due this semester. The coming days are going to be a hell of a ride: sleepless nights and lots of cups of coffee. Oh, dear God, help me. I need to brush up my personal Scrum board ASAP. Fight!


Testing our Scrum

It’s been a while since my last post. I’ve been busy at the office making up for lost time in managing my team. I am now overseeing a new three-person team on a project to test a web application. Since this is a pretty small team, I decided to use Scrum to monitor our tasks, and so far things are going well. We are using Jira’s Scrum board to process our tickets, and based on their feedback during standup meetings, the team members feel more engaged and active in accomplishing their tasks. I attribute this to the fact that the status of each sprint and the backlog is transparent to every member of the team. The project will run until December this year, and I am looking forward to a continuous increase in the team’s velocity.


A synthesis of the paper entitled “The future of supercomputing” by M. Snir, 2014

Introduction

In his talk outlined in [Snir, 2014], Marc Snir presented a high-level picture of the current state of supercomputing technologies, the challenges threatening continued advancement of CMOS-based components, and possible tracks for coping with those challenges. Among the top challenges in CMOS-based manufacturing of integrated circuits (ICs) are the limits on increasing the clock speed of systems built from current hardware, the lack of proven commercial-grade technologies that could readily replace current computing technologies, the limits of manufacturing processes for producing ever-smaller IC components that do not necessarily promise a considerable increase in performance, and the growing energy demands of proposed advancements in CMOS-based technologies. Possible future tracks mentioned by the author, which could pave the way for a new Moore's-law-like trajectory, include new models of computing that are massively task-parallel in nature, esoteric computing device technologies that currently have non-commodity status, and further advancements in the software domain, i.e., algorithms for optimization, data structures, software architectures, and task management strategies.

Investigating new computing models for long-term solutions

Quantum computing

A good balance between research and development on long-term solutions and short-term patches is key to securing continuous advancement in solving hard computational problems and a healthy economy that relies on computing devices. The investigation of new computing models is one long-term approach to overcoming the current limitations of Turing-model-based computing devices. One unconventional model under active investigation is quantum computing, and interest in it can be attributed to two motivations. The first is a response to the steady decrease in the dimensions of IC components. Current manufacturing technologies can now produce ICs with billions of transistors at the scale of a few nanometers [Courtland, 2013, Cavin et al., 2012]. As transistor dimensions decrease, quantum mechanical effects such as interference between nanoscale components become more manifest. These effects can compromise the stability of transistors during information processing, resulting in variable outputs, and they are also taken into account when proposing new designs for nanoscale computing components [Smith et al., 2016]. Quantum computing aims to exploit some of these effects, i.e., quantum interference, linear superposition of states, and entanglement of quantum states, to gain speedups in solving computational problems [Nielsen and Chuang, 2010].

The second motivation comes from the need to simulate quantum mechanical systems. In [Feynman, 1982], physicist Richard Feynman posed the question of how to simulate quantum mechanical systems and proposed using a computing device that itself exhibits quantum mechanical effects; in [Feynman, 1985] he presented a proposed scheme for a quantum mechanical simulator. Simulating quantum mechanical systems on von Neumann-model computers requires an exponential amount of time due to the exponential increase in computational paths with respect to the number of quantum bits. Computation on a quantum computer is parallel in nature because a quantum mechanical system can be in a linear superposition of all of its possible states. Quantum operators, on the other hand, are linear, which implies that an operation applied to a system in a superposition state equates to operating on each of the system's individual states. Early fundamental quantum algorithms that provided sub-linear to exponential speedups relied heavily on operations on quantum superposition states [Deutsch and Jozsa, 1992, Coppersmith, 1994, Shor, 1994, Grover, 1996]. The massive parallelism of the quantum computing model makes its simulation (at least computation-wise) suitable for implementation as general-purpose computation on graphics processing units (GPGPU). Data-wise, efficient simulation of a quantum computer requires massive, low-latency memory for storing an exponential amount of data, so as not to cancel out the speedup in computation.

One caveat of quantum computing is the susceptibility of quantum mechanical systems to the destructive effect of decoherence [Chuang et al., 1995, Zurek, 2002]. A quantum system in a superposition state tends to decohere into a definite classical state in a very short amount of time, and the decoherence time further decreases as the number of quantum bits in the system increases. This limits the scaling and operation time of a quantum computer. Different technological implementations of a quantum computer also differ in their decoherence times [Hughes et al., 2004].
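To make the statevector picture above concrete, here is a minimal NumPy sketch (my own illustration, not from the paper): an n-qubit state is a vector of 2^n complex amplitudes, which is exactly why classical simulation needs exponential memory, and because gates are linear operators, applying a gate to a superposition amounts to applying it to every basis state at once.

```python
import numpy as np

# An n-qubit state is a vector of 2**n complex amplitudes -- the reason
# classical simulation needs memory exponential in the qubit count.
n = 3
state = np.zeros(2**n, dtype=complex)
state[0] = 1.0  # start in the basis state |000>

# Single-qubit Hadamard gate; all quantum gates are linear operators.
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

def apply_single_qubit_gate(gate, qubit, state, n):
    """Apply a 2x2 gate to one qubit by expanding it to the full
    2**n x 2**n operator with Kronecker products (fine for tiny n)."""
    op = np.array([[1.0]], dtype=complex)
    for q in range(n):
        op = np.kron(op, gate if q == qubit else np.eye(2))
    return op @ state

# Hadamard on every qubit puts the register into an equal superposition
# of all 2**n basis states; by linearity, any later gate then acts on
# all of those states "at once" -- the parallelism described above.
for q in range(n):
    state = apply_single_qubit_gate(H, q, state, n)

print(np.round(state.real, 3))  # eight amplitudes, each 1/sqrt(8) ~ 0.354
```

The per-gate matrix here is itself of size 2^n by 2^n, which makes the memory pressure of classical simulation, and hence the appeal of GPGPU implementations, immediately visible.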

Biological computing

Another path currently being investigated as a possible future alternative computing technology is biological computing. This model focuses on computing with biological structures such as unicellular organisms, DNA, and ATP. Wet-lab investigations aim to discover whether these structures are truly capable of performing information processing with high energy efficiency and of storing large amounts of information. An E. coli cell has been compared to a CMOS-based information processor of the same scale (1 micrometer) in terms of energy efficiency and data storage capacity [Cavin et al., 2012]. The E. coli cell uses its DNA as long-term memory (data storage), certain proteins and RNA molecules as short-term memory and logic units, receptor proteins for sensing both outside and inside the cell, and ribosomes for outputting computation results and communicating with other cells in the population. The analysis identified that the logic units of an E. coli cell number at least a million. Accesses to its memory (DNA) can be executed in parallel, and the retrieved data can likewise be processed in parallel; an E. coli cell thus supports massively parallel computation. It would be interesting to simulate the computation of an E. coli cell as a computer, using GPGPU for the parallel computation and the CPU for control flow. Simulating the full operation of an E. coli cell would be an even more complex task, since its biological operating system also supports inter-cellular communication simultaneous with its internal operation, and data received from the population serve as parameters to that internal operation. Among the challenges that come to mind in implementing a full-scale computer using a population of unicellular organisms such as E. coli are the control of each cell's computation and the control of intercellular communication for cooperative computation. It would be interesting to investigate whether a master-slave architecture can be applied to such a population-based computer, wherein a single cell acts as master, facilitating the computation and distributing tasks to slave cells; the slave cells would mainly act as parallel computing devices that can interact with other cells in the population. A toy sketch of this idea follows below.
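Purely as a toy illustration of that master-slave speculation (nothing from the cited papers; the cell metaphor and all names here are my own, with OS processes standing in for cells):

```python
from concurrent.futures import ProcessPoolExecutor

def slave_cell(chunk):
    """A 'slave cell': independently processes its share of the data
    (here, a stand-in computation: summing squares)."""
    return sum(x * x for x in chunk)

def master_cell(data, n_slaves=4):
    """The 'master cell': partitions the task, distributes the chunks,
    and merges the partial results."""
    size = (len(data) + n_slaves - 1) // n_slaves
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=n_slaves) as pool:
        return sum(pool.map(slave_cell, chunks))

if __name__ == "__main__":
    print(master_cell(list(range(1_000))))  # 332833500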

One recent study in biocomputing is that of D. Nicolau et al. in [Nicolau et al., 2016]. Their group synthesized a book-sized biological supercomputer that utilizes a network of lithographically constructed channels to encode combinatorial problems. Computation in this biocomputer is carried out by independent biological agents that explore the synthesized network in a massively parallel fashion. In the study, they used the following criteria for identifying candidate biological agents: (1) can be produced in large quantities at low cost, (2) self-propelled, (3) operates independently when exploring the network, (4) has small dimensions, for use in high-density networks (large problem instances), and (5) moves rapidly, for fast computation. They investigated the effectiveness and efficiency of the biocomputer in solving a small instance of the Subset Sum problem (SSP), which is NP-complete. The results show that the SSP can be solved on their biocomputer in polynomial time. The results are interesting in that the technologies for synthesizing the network are currently available, the independent agents can be produced cost-effectively, and the proposed physical implementation of a biocomputer can solve an NP-complete problem with massive parallelism. The proposed biological supercomputer may be deemed an alternative to the quantum computer, which is currently plagued by challenges regarding its scalability and stability in carrying out computation.
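For a sense of the search space involved, here is a small sketch (my own, on a made-up instance, not the paper's network encoding) that enumerates the 2^n subsets an exhaustive SSP solver must consider; the biocomputer's agents explore these choices physically in parallel:

```python
from itertools import combinations

def subset_sum(values, target):
    """Exhaustively check all 2**n subsets -- the search space that the
    device's molecular agents traverse physically in parallel."""
    hits = []
    for r in range(len(values) + 1):
        for subset in combinations(values, r):
            if sum(subset) == target:
                hits.append(subset)
    return hits

# A made-up instance of the small scale such proof-of-concept devices handle:
print(subset_sum([2, 5, 9], 11))  # [(2, 9)]
```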

Thoughts about the paper

The paper [Snir, 2014] provides a good outline of current computing device technologies and the challenges that hamper their advancement in terms of computing, energy, and cost efficiency. An interesting point in the paper is the question of whether we really need a continuous increase in supercomputing performance. The answer may be influenced by several factors: ever-recurring computational challenges that hit the performance wall of current technologies, the continuing search for polynomial-time solutions to exponentially hard computational problems, and the ever-striving economy of computing devices. It is of utmost interest to me to witness a shift from current computing models to new ones with unexpected physical implementations, much like the shift from mechanical computers to CMOS-based computers.

References

[Cavin et al., 2012] Cavin, R. K., Lugli, P., and Zhirnov, V. V. (2012). Science and Engineering Beyond Moore’s Law. Proceedings of the IEEE, 100(Special Centennial Issue):1720–1749.

[Chuang et al., 1995] Chuang, I. L., Laflamme, R., Shor, P. W., and Zurek, W. H. (1995). Quantum Computers, Factoring, and Decoherence. Science, 270(5242):1633–1635.

[Coppersmith, 1994] Coppersmith, D. (1994). An Approximate Fourier Transform Useful in Quantum Factoring. Technical report, IBM Research Division.

[Courtland, 2013] Courtland, R. (2013). The Status of Moore’s Law: It’s Complicated.

[Deutsch and Jozsa, 1992] Deutsch, D. and Jozsa, R. (1992). Rapid Solution of Problems by Quantum Computation. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 439(1907):553–558.

[Feynman, 1982] Feynman, R. (1982). Simulating Physics with Computers. International Journal of Theoretical Physics, 21(6/7):467–488.

[Feynman, 1985] Feynman, R. (1985). Quantum Mechanical Computers. Foundations of Physics, 16(6):507–531.

[Grover, 1996] Grover, L. K. (1996). A fast quantum mechanical algorithm for database search. In Proceedings, STOC 1996, pages 212–219.

[Hughes et al., 2004] Hughes, R., Doolen, G., Awschalom, D., Caves, C., Chapman, M., Clark, R., Cory, D., DiVincenzo, D., Ekert, A., Hammel, P. C., Kwiat, P., Lloyd, S., Milburn, G., Orlando, T., Steel, D., Vazirani, U., Whaley, K. B., and Wineland, D. (2004). A quantum information science and technology roadmap.

[Nicolau et al., 2016] Nicolau, D. V., Lard, M., Korten, T., van Delft, F. C. M. J. M., Persson, M., Bengtsson, E., Månsson, A., Diez, S., Linke, H., and Nicolau, D. V. (2016). Parallel computation with molecular-motor-propelled agents in nanofabricated networks. Proceedings of the National Academy of Sciences of the United States of America, 113(10):2591–2596.

[Nielsen and Chuang, 2010] Nielsen, M. A. and Chuang, I. L. (2010). Quantum Computation and Quantum Information. Cambridge University Press, Cambridge, 1st edition.

[Shor, 1994] Shor, P. W. (1994). Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 124–134.

[Smith et al., 2016] Smith, L. W., Al-Taie, H., Lesage, A. A. J., Thomas, K. J., Sfigakis, F., See, P., Griffiths, J. P., Farrer, I., Jones, G. A. C., Ritchie, D. A., Kelly, M. J., and Smith, C. G. (2016). Effect of Split Gate Size on the Electrostatic Potential and 0.7 Anomaly within Quantum Wires on a Modulation-Doped GaAs/AlGaAs Heterostructure. Physical Review Applied, 5(4):044015.

[Snir, 2014] Snir, M. (2014). The Future of Supercomputing. In Proceedings of the 28th ACM international conference on Supercomputing – ICS ’14, volume 25, pages 261–262, New York, New York, USA. ACM Press.

[Zurek, 2002] Zurek, W. H. (2002). Decoherence and the Transition from Quantum to Classical Revisited. Los Alamos Science, (27):1–24.


A synthesis of the paper entitled “GPUs: a closer look” by K. Fatahalian and M. Houston, 2008

Introduction

The authors of GPUs: a closer look [Fatahalian and Houston, 2008] provide the reader with a high-level overview of the inner workings of a graphics processing unit (GPU), outlined as a computing pipeline. The pipeline interleaves program execution on programmable and fixed-function cores designed to execute Single Instruction Multiple Data (SIMD) commands in a massively parallel fashion. The application-programmable stages of the pipeline are vertex processing (VP), primitive processing (PP), and fragment processing (FP). The fixed-function stages are vertex generation (VG), primitive generation (PG), fragment generation (FG), and pixel operations (PO). The authors also point out challenges in GPU design as of 2008 and some opportunities for advancing strategies for more efficient program execution.
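As a rough mental model of that interleaving (my own sketch with placeholder stage bodies, not code from the paper), the pipeline is a fixed chain in which only some slots accept user-supplied shader callbacks:

```python
# Toy model of the pipeline in [Fatahalian and Houston, 2008]:
# fixed-function stages are built-ins here, programmable stages are
# user-supplied callbacks. All stage bodies are trivial stand-ins,
# not real rasterization.

def vertex_generation(scene):        # VG, fixed-function (stand-in)
    return scene["vertices"]

def primitive_generation(vertices):  # PG, fixed-function (stand-in)
    return [tuple(vertices[i:i + 3]) for i in range(0, len(vertices) - 2, 3)]

def fragment_generation(primitives): # FG / rasterization (stand-in)
    return [{"prim": p, "pos": i} for i, p in enumerate(primitives)]

def pixel_operations(fragments):     # PO, fixed-function (stand-in)
    return {f["pos"]: f["color"] for f in fragments}

def run_pipeline(scene, vertex_shader, primitive_shader, fragment_shader):
    verts = [vertex_shader(v) for v in vertex_generation(scene)]        # VP
    prims = [primitive_shader(p) for p in primitive_generation(verts)]  # PP
    frags = [fragment_shader(f) for f in fragment_generation(prims)]    # FP
    return pixel_operations(frags)

frame = run_pipeline(
    {"vertices": [(0, 0), (1, 0), (0, 1)]},
    vertex_shader=lambda v: (2 * v[0], 2 * v[1]),           # scale
    primitive_shader=lambda p: p,                           # pass-through
    fragment_shader=lambda f: {**f, "color": (255, 0, 0)},  # flat red
)
print(frame)  # {0: (255, 0, 0)}
```

On real hardware each list comprehension above corresponds to thousands of shader invocations running in SIMD lockstep, while the fixed stages run on specialized logic.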

Notes on opportunities and challenges

As discussed by the authors, computation in the PO stage is not easily parallelized due to the dependencies between the streams of fragment data that define the state of a single pixel on the screen. This is particularly true for cases where the image to render involves many overlapping objects, which then require independent, non-parallelizable computation of pixel values for each overlapping sample position, e.g., 3D animation. Even more challenging are cases involving higher frame rates and more dynamic ordering of objects in the scene, e.g., 120 fps gaming with rapidly changing virtual camera positions. Scenes with fewer pixel-value changes per frame may employ selective refresh of only the affected pixels to reduce the computation per frame.
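The dependency can be seen in the standard Porter-Duff "over" operator: compositing is not commutative, so fragments covering the same pixel must be blended in order. A minimal sketch (my own, assuming straight, non-premultiplied alpha):

```python
def over(src, dst):
    """Porter-Duff 'over': composite src on top of dst.
    Colors are (r, g, b, a) tuples with straight alpha in [0, 1]."""
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    a = sa + da * (1 - sa)
    if a == 0:
        return (0.0, 0.0, 0.0, 0.0)
    blend = lambda s, d: (s * sa + d * da * (1 - sa)) / a
    return (blend(sr, dr), blend(sg, dg), blend(sb, db), a)

red  = (1.0, 0.0, 0.0, 0.5)
blue = (0.0, 0.0, 1.0, 0.5)

# Order-dependent: the two results differ, which is why fragments that
# cover the same pixel carry a serial dependency in the PO stage.
print(over(red, blue))   # red on top:  (0.667, 0.0, 0.333, 0.75)
print(over(blue, red))   # blue on top: (0.333, 0.0, 0.667, 0.75)
```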

Fixed-function stages implement tasks that are not easily parallelizable, using optimized hardware-defined logic. The fixed GPU functions mentioned by the authors are texture filtering, fragment generation, output pixel compositing, data (de)compression, and surface visibility determination. Since the hardware logic for these functions is highly specialized and at the same time compute- and data-access-efficient (also on the scale of gigaflops), it is interesting to investigate whether these functions can be mapped onto solutions to problems unrelated to graphics generation. The article does not offer insights into this question, though. If such problems can be characterized, then it may be possible to provide an application programming interface for feeding input to, and reading output from, the fixed-function computing components of a GPU. The function of these components is fixed, but the meaning we give to their input and output is purpose-specific. An example of such repurposing of a fixed-function computation is the computation of match scores between a pattern and the same-length substrings of a much longer text using convolution [Fischer and Paterson, 1974, Fredriksson and Grabowski, 2009, Baba, 2010, Alba et al., 2012]. Efficient hardware logic that computes the convolution of two numerical arrays under multiplication is then also efficient hardware logic for computing approximate matches of a pattern within a text. The problem of multiplying two numerical values and that of computing match scores between a pattern and a text thus share a characteristic property that enables repurposing the convolution computation.
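A brief sketch of the repurposing idea (my own illustration of the Fischer-Paterson approach, in NumPy rather than on GPU hardware): per symbol, correlating indicator vectors of the text and pattern counts the matching positions at every alignment, and correlation is just convolution with the reversed pattern, so any fast convolution engine computes match scores.

```python
import numpy as np

def match_scores(text, pattern, alphabet):
    """Number of matching positions between `pattern` and every
    same-length substring of `text`, via FFT-based correlation
    (the Fischer-Paterson idea cited above)."""
    n, m = len(text), len(pattern)
    size = 1 << (n + m).bit_length()   # FFT length >= n + m - 1
    scores = np.zeros(n - m + 1)
    for a in alphabet:
        t = np.array([1.0 if c == a else 0.0 for c in text])
        p = np.array([1.0 if c == a else 0.0 for c in pattern])
        # Correlation = convolution with the reversed pattern.
        conv = np.fft.irfft(np.fft.rfft(t, size) * np.fft.rfft(p[::-1], size), size)
        scores += conv[m - 1 : n]      # full-overlap alignments only
    return np.rint(scores).astype(int)

print(match_scores("abracadabra", "abra", "abcdr"))
# [4 0 1 1 1 1 0 4]: exact matches (score 4) at offsets 0 and 7
```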

Different applications require shader functions with varying compute and storage requirements. This calls for efficient resource-management strategies that are also flexible in response to rapid variation in user input, e.g., panning rapidly and frequently in a high-frame-rate 3D computer game with many objects to render. Of interest in this challenge area are algorithms for scheduling and allocating computing tasks to processors. Some results in this direction use software-defined management mechanisms in contrast to the hardware-defined logic in current commercial GPUs [Tzeng et al., 2010, Membarth et al., 2012, Wu et al., 2015]. A software-defined approach to task distribution and scheduling introduces flexibility for program-level optimization that is not available with fixed hardware-based task management logic. A task-donation load-balancing scheme implemented using parallel task queues is proposed in [Tzeng et al., 2010]; evaluated on different workloads, it provides higher efficiency in load distribution than other load-balancing schemes implemented with a monolithic task queue. A task-scheduling framework for GPUs that considers multiple applications, where high-priority tasks are allowed to interrupt previously queued lower-priority tasks, is proposed in [Membarth et al., 2012]; the efficiency of a GPU using the framework is analyzed with respect to applications in medical imaging. The software-based approach proposed in [Wu et al., 2015] for optimizing load balancing on GPUs tackles the limitations of current hardware-defined load-balancing mechanisms by working on (1) the assignment of data-sharing tasks to threads belonging to the same multiprocessor and (2) control over the number of threads to activate on a single multiprocessor.
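To illustrate the flavor of such software-defined schemes, here is a toy CPU-side sketch loosely inspired by [Tzeng et al., 2010] (a strong simplification: idle workers pull from a victim's deque under one global lock, whereas the paper's scheme runs on the GPU with donation out of busy queues):

```python
import random
import threading
from collections import deque

class Worker(threading.Thread):
    """Each worker owns its own deque; an idle worker takes work from
    the tail of a non-empty victim's deque instead of contending on a
    single monolithic queue."""

    def __init__(self, workers, results, lock):
        super().__init__()
        self.workers, self.results, self.lock = workers, results, lock
        self.tasks = deque()

    def run(self):
        while True:
            task = self._next_task()
            if task is None:        # nothing left anywhere: terminate
                return
            self.results.append(task * task)  # stand-in "processing"

    def _next_task(self):
        with self.lock:
            if self.tasks:
                return self.tasks.popleft()                # own work: head
            victims = [w for w in self.workers if w.tasks]
            if victims:
                return random.choice(victims).tasks.pop()  # donated: tail
        return None

lock, results, workers = threading.Lock(), [], []
for _ in range(4):
    workers.append(Worker(workers, results, lock))
workers[0].tasks.extend(range(100))  # deliberately unbalanced load
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(results))  # 100 -- the load was spread across all four workers
```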

As pointed out in the paper, the high predictability of stream data record processing allows for aggregate data prefetching (batch loading of input data from globally shared memory) to minimize the number of data accesses. An example of input data that can be batch-loaded prior to computation is data subjected to parallel reduction, i.e., data combined with a single associative arithmetic operation such as addition or multiplication. On the other hand, shader functions' accesses to buffer and texture data are not as amenable to such parallelism, since storing such data requires dynamic memory addressing. An example of such data is intermediate computation results that serve as input to succeeding computations. Intermediate results can be stored in a cache internal to a block (if succeeding operations will access them frequently) or in registers if they will be used only once. As the authors mention, access to such storage during computation requires optimized management of on-board storage resources.
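A minimal sketch of the parallel-reduction pattern mentioned above (my own illustration, assuming a non-empty input and addition as the operation, whose identity element justifies the zero padding):

```python
import numpy as np

def tree_reduce(data):
    """Tree-shaped sum of a batch-loaded array. Each halving step's
    pairwise additions are independent, so n values collapse in
    O(log n) parallel steps; assumes a non-empty input."""
    x = np.asarray(data, dtype=float)
    while x.size > 1:
        if x.size % 2:               # pad odd lengths with 0, the
            x = np.append(x, 0.0)    # identity element of addition
        x = x[0::2] + x[1::2]        # one fully parallelizable step
    return x[0]

print(tree_reduce(range(1, 9)))  # 36.0 after 3 halving steps
```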

Shader function invocations on related data streams may share an instruction stream to parallelize execution, resulting in SIMD-style parallel computation. In addition to what the authors point out, another level of parallelism can be applied to the branching statements of user-programmable stages in the pipeline. Conditional (shared) instructions on a set of related data streams can be serially assigned to each arithmetic logic unit (ALU) of a core [Goddeke et al., 2011]; the parallelism applies to the shared instructions within each branch across the related data streams.
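In NumPy terms (my own illustration of the masking idea, not code from [Goddeke et al., 2011]), a shared instruction stream handles a data-dependent branch by evaluating both paths for all lanes and selecting per lane with a mask:

```python
import numpy as np

# Eight "lanes" sharing one instruction stream (SIMD). A data-dependent
# branch is handled by evaluating both paths and masking per lane, so a
# divergent branch costs roughly the time of both paths combined.
x = np.array([1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0])

mask = x > 0                      # per-lane predicate
taken     = np.sqrt(np.abs(x))    # "if" path, executed for all lanes
not_taken = x * x                 # "else" path, executed for all lanes
result = np.where(mask, taken, not_taken)  # per-lane select
print(result)
```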

Thoughts about the paper

The authors provide a high-level introduction to computation on a GPU while also identifying areas of challenge in the current design and implementation of the device. Of particular interest in this paper are the opportunities for investigating software-based approaches to optimizing various aspects of the device's operation, such as load balancing, task scheduling, and task assignment, which are restricted under current hardware-based resource management.

References

[Alba et al., 2012] Alba, A., Rodriguez-Kessler, M., Arce-Santana, E. R., and Mendez, M. O. (2012). Approximate string matching using phase correlation. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2012:6309–6312.

[Baba, 2010] Baba, K. (2010). String Matching with Mismatches by Real-valued FFT. In Computational Science and Its Applications – ICCSA 2010, Part IV, pages 273–283.

[Fatahalian and Houston, 2008] Fatahalian, K. and Houston, M. (2008). GPUs: a closer look. Queue, 6(2):18.

[Fischer and Paterson, 1974] Fischer, M. J. and Paterson, M. S. (1974). String Matching and other Products. Technical report, Massachusetts Institute of Technology, Cambridge.

[Fredriksson and Grabowski, 2009] Fredriksson, K. and Grabowski, S. (2009). Fast convolutions and their applications in approximate string matching. In 20th International Workshop, IWOCA 2009, pages 254–265.

[Goddeke et al., 2011] Goddeke, D., Kurzak, J., and Weiß, J.-P. (2011). Scientific Computing on GPUs: GPU Architecture Overview.

[Membarth et al., 2012] Membarth, R., Lupp, J. H., Hannig, F., Teich, J., Körner, M., and Eckert, W. (2012). Dynamic task-scheduling and resource management for GPU accelerators in medical imaging. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7179 LNCS:147–159.

[Tzeng et al., 2010] Tzeng, S., Patney, A., and Owens, J. D. J. (2010). Task management for irregular-parallel workloads on the GPU. Proceedings of the Conference on High Performance Graphics, pages 29–37.

[Wu et al., 2015] Wu, B., Chen, G., Li, D., Shen, X., and Vetter, J. (2015). Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations. In Proceedings of the 29th ACM on International Conference on Supercomputing – ICS ’15, pages 119–130, New York, New York, USA. ACM Press.


Some exciting new developments

Ever since I came to work at my first (and still current) workplace, our team has been focused on developing information systems and, at times, enterprise resource planning (ERP) software. This involves having a working knowledge of the current technologies, methods, frameworks, and programming languages we can use in developing software. Basically, we are more on the development side of R&D (note that our institute is a research and development institute). When I decided to pursue graduate studies, I became aware of the possibilities of doing research in my workplace. Graduate studies have broadened my perspective on what I can contribute to our institute when it comes to research. It was not that easy to simply propose doing research in my division, though. My workmates are more inclined to stick with the vanilla projects of developing software, probably because their current skill sets limit them from exploring research work. I can personally attest that doing research is not trivial; it takes a different skill set and mindset, and those can only be acquired through focused and guided practice.

Just this year there was a major shift in the leadership of our institute. The new director comes from the country's premier university and is deeply involved in engineering research. The current leadership is now pushing each division of the institute to incorporate research work into its projects. This implies some changes to the organizational structure of each division to accommodate the new director's instructions. In our division, one possibility is to create a team that specifically handles research work while another team handles development; the output of the research team would then serve as input to the development team. The good thing is that several of us in the division are already taking graduate studies and being exposed to research work. As I mentioned in one of our division meetings, graduate studies will prepare our division for what the leadership currently demands of us. Nevertheless, I am excited about the opportunities that will come now that we, as an institution, will be exerting more effort in research and not just development work.
