Pulling off an all-nighter

I was able to finalize my tickets for my CS 239 requirements this sem. Tonight I will work on my thesis first, then move on to drafting the code for my parallel programming project in CS 239. This will be a loooong night. I need to set aside some lower-priority stuff for the meantime, like gaming on weekends, Facebook, YouTube and Twitter, but I will be blogging most of my updates here at least until the semester ends.

Posted in Graduate students life

Tickets and more tickets

Now working on my Scrum board for my parallel programming class. I am trying out the free version of JetBrains' YouTrack for task management. So far, I would say Jira's task management is more intuitive to use and its user interface is more user-friendly.

Posted in Graduate students life, Random Stuff

Hell weeks ahead

I have a project paper and presentation on parallel programming (using CUDA) due two weeks from now, and I haven't started my project yet. Add to that, I still need to prove the correctness of two algorithms for my thesis due this semester. The following days are going to be a hell of a ride of sleepless nights and lots of cups of coffee. Oh, dear God, help me. I need to brush up my personal Scrum board ASAP. Fight!

Posted in Computer Science, GPU computing, Graduate students life, Random Stuff

Testing our Scrum

It's been a while since my last post. I've been busy at the office making up for some lost time in managing my team. I am now overseeing a new team of three people on a new project for testing a web application. Since this is a pretty small team, I decided to use Scrum to monitor our tasks, and so far things are going well. We are using Jira's Scrum board to process our tickets, and based on the team's feedback during standup meetings, they feel more engaged and active in accomplishing their tasks. I attribute it to the fact that the status of each sprint and the backlog is transparent to all members of the team. The project will run until December this year, and I am looking forward to a continuous increase in the team's velocity.

Posted in Work

A synthesis of the paper entitled “The future of supercomputing” by M. Snir, 2014

Introduction

Marc Snir presented a high-level picture of the current state of supercomputing technologies, the challenges threatening continued advancement of CMOS-based components, and possible tracks for coping with such challenges in his talk outlined in [Snir, 2014]. Among the top challenges in CMOS-based manufacturing of integrated circuits (ICs) are the limits on increasing the clock speed of systems built with current hardware, the lack of proven commercial-grade technologies which could readily replace current computing technologies, the limits of manufacturing technologies for producing IC components of ever-decreasing dimensions which do not necessarily promise a considerable increase in performance, and the higher energy demands of proposed advancements in CMOS-based technologies. Possible future tracks mentioned by the author which could pave the way for a new Moore's-law-like observation are new models of computing which are massively task-parallel in nature, esoteric computing device technologies which currently have non-commodity status, and further advancements in the software domain, i.e., optimization algorithms, data structures, software architectures, and task management strategies.

Investigating new computing models as long-term solutions

Quantum computing

A good balance between research and development on long-term solutions and short-term patches is key to securing continuous advancement in solving hard computational problems and a continuously healthy economy that relies on computing devices. The investigation of new computing models is one long-term approach to overcoming the current limitations of Turing-model-based computing devices. One unconventional computing model actively investigated is the quantum computing model. The investigation of this model can be attributed to two motivations. The first is a response to the steady decrease in the dimensions of IC components. Current manufacturing technologies can now produce ICs with billions of transistors at the scale of a few nanometers [Courtland, 2013, Cavin et al., 2012]. As transistor dimensions decrease, quantum mechanical effects such as interference between nanoscale components become more manifest. These effects could affect the stability of transistors during information processing, which could result in variable outputs. Such effects are also investigated when proposing new designs for computing components at the nanoscale [Smith et al., 2016]. Quantum computing aims to utilize some of these effects, i.e., quantum interference, linear superposition of states, and entanglement of quantum states, to gain speedup in solving computational problems [Nielsen and Chuang, 2010]. The second motivation comes from the need to simulate quantum mechanical systems. In [Feynman, 1982], physicist Richard Feynman posed the question of how to simulate quantum mechanical systems, and he proposed using a computing device which itself exhibits quantum mechanical effects. Simulation of quantum mechanical systems on von Neumann-model computers requires an exponential amount of time due to the exponential increase in computational paths with respect to the number of quantum bits.
In [Feynman, 1985], Feynman presented a proposed schema of a quantum mechanical simulator. Computation on a quantum computer is parallel in nature due to the possibility of a quantum mechanical system being in a linear superposition of all of its possible states. Quantum operators, on the other hand, are linear, which implies that an operation applied to a quantum mechanical system in a superposition state equates to operating on each of the system's individual states. Early fundamental quantum algorithms which provided sub-linear to exponential speedup relied heavily on operations on quantum superposition states [Deutsch and Jozsa, 1992, Coppersmith, 1994, Shor, 1994, Grover, 1996]. The massive parallelism in the quantum computing model makes its simulation (at least computation-wise) suitable for implementation as general-purpose computation on graphics processing units (GPGPU). Data-wise, efficient simulation of computation on a quantum computer will require massive memory for storing an exponential amount of data with low latency, so as not to cancel out the speedup in computation. One caveat in quantum computing is the susceptibility of quantum mechanical systems to the destructive effect of decoherence [Chuang et al., 1995, Zurek, 2002]. A quantum system in a superposition state tends to decohere into a definite classical state in a very short amount of time. The decoherence time of a system further decreases as the number of quantum bits in the system increases. This limits the scaling and operation time of a quantum computer. Different technological implementations of a quantum computer also differ in their decoherence times [Hughes et al., 2004].
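To make the memory argument concrete, here is a minimal pure-Python sketch of a classical statevector simulation (all names are illustrative, not from any of the cited papers): applying a Hadamard gate to every qubit of an n-qubit register starting in |00...0> produces an equal superposition over all 2^n basis states, so the simulator must track 2^n amplitudes before any algorithm even runs.

```python
from math import sqrt

def apply_h(state, q):
    """Apply a Hadamard gate to qubit q of a statevector of length 2^n."""
    h = 1.0 / sqrt(2.0)
    out = state[:]
    step = 1 << q                      # stride between paired amplitudes
    for i in range(len(state)):
        if i & step == 0:              # visit each (|..0..>, |..1..>) pair once
            a, b = state[i], state[i | step]
            out[i] = h * (a + b)
            out[i | step] = h * (a - b)
    return out

n = 10
state = [0.0] * (1 << n)               # 2^n amplitudes -- exponential in n
state[0] = 1.0                         # start in |00...0>
for q in range(n):                     # H on every qubit: equal superposition
    state = apply_h(state, q)

print(len(state))                      # 1024 amplitudes for just 10 qubits
print(state[0])                        # each amplitude is about 1/sqrt(1024)
```

Doubling the register to 20 qubits already needs a million amplitudes, which is why low-latency massive memory dominates the cost of such simulations.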

Biological computing

Another path currently being investigated as a possible future alternative computing device technology is biological computing. This model focuses on computing using biological structures such as unicellular organisms, DNA, and ATP. Wet-lab investigations aim to discover whether these structures are truly capable of performing information processing with high energy efficiency and of storing large amounts of information. An E. coli cell has been compared to a CMOS-based information processor of the same scale (1 micrometer) in terms of energy efficiency and data storage capacity [Cavin et al., 2012]. The E. coli cell utilizes its DNA as long-term memory (data storage); some proteins and RNA molecules as short-term memory and logic units; receptor proteins for sensing both outside and inside the cell; and ribosomes for outputting computation results and communicating with other cells in the population. The analysis identified that the logic units of an E. coli cell number at least a million. Accesses to its memory (DNA) can also be executed in parallel, and the retrieved data can likewise be processed in parallel. An E. coli cell thus supports massively parallel computation. It would be interesting to simulate the computation of an E. coli cell as a computer using GPGPU for the parallel computation and the CPU for the control flow. It would be a more complex task to simulate the full operation of an E. coli cell, as its biological operating system also supports intercellular communication simultaneous with its internal operation. Data received from the population also serve as parameters to its internal operation. Some challenges that come to mind in implementing a full-scale computer using a population of unicellular organisms such as E. coli are the control of each cell for computation and the control of intercellular communication for cooperative computation.
It would be interesting to investigate whether a master-slave architecture can be applied to such a unicellular-organism-population-based computer, wherein a single cell acts as a master facilitating the computation and distributing tasks to its slave cells. Slave cells would mainly act as parallel computing devices which could interact with other cells in the population.
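The master-slave idea can be caricatured in ordinary parallel code. The sketch below is a toy analogy only: the names `master_cell` and `slave_cell` are invented here, and nothing in it models real cell biochemistry. A master partitions a problem, hands the pieces to slaves that compute independently in parallel, and merges the results.

```python
from concurrent.futures import ThreadPoolExecutor

def slave_cell(chunk):
    """Each slave works on its share of the data independently."""
    return sum(x * x for x in chunk)

def master_cell(data, n_slaves=4):
    """The master only partitions work and merges the slaves' results."""
    size = (len(data) + n_slaves - 1) // n_slaves
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        return sum(pool.map(slave_cell, chunks))

print(master_cell(list(range(100))))   # 328350, same as the serial sum of squares
```

In the biological setting, the hard part glossed over here is exactly the challenge named above: reliably addressing individual cells and controlling their intercellular communication.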

One recent study in biocomputing is that of D. Nicolau et al. in [Nicolau et al., 2016]. Their group was able to synthesize a book-sized biological supercomputer which utilizes a network of lithographically constructed channels for encoding combinatorial problems. Computation in this biocomputer is carried out by independent biological agents which explore the synthesized network in a massively parallel fashion. In the study, they used the following criteria for identifying candidate biological agents: (1) can be produced in large quantities at low cost, (2) self-propelled, (3) operates independently when exploring the network, (4) has small dimensions for use in high-density networks (large problem instances), and (5) moves rapidly for fast computation. They investigated the effectiveness and efficiency of the biocomputer in solving a small instance of the Subset Sum problem (SSP), which is NP-complete. The results of the study show that the SSP can be solved using their biocomputer in polynomial time. The results are interesting in that the technology for synthesizing the network is currently available, the independent agents can be produced cost-effectively, and the proposed physical implementation of a biocomputer can solve an NP-complete problem with massive parallelism. The proposed biological supercomputer may be deemed an alternative to the quantum computer, which is currently plagued with challenges regarding its scalability and stability in carrying out computation.
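To illustrate the combinatorics the agents explore, here is a hedged sketch (a hypothetical helper of my own, not code or data from the study): in the nanofabricated network, each agent's path corresponds to one subset and its exit position to that subset's sum, so a serial machine must enumerate all 2^n subsets which the agents traverse in parallel.

```python
from itertools import combinations

def subset_sums(values):
    """Every achievable subset sum -- the 'exit positions' agents can reach."""
    sums = set()
    for r in range(len(values) + 1):          # 2^n subsets in total
        for combo in combinations(values, r):
            sums.add(sum(combo))
    return sums

# a small example instance over the set {2, 5, 9}
print(sorted(subset_sums([2, 5, 9])))  # [0, 2, 5, 7, 9, 11, 14, 16]
```

The serial enumeration above is exponential in the number of values; the appeal of the device is that the agents cover those exponentially many paths concurrently.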

Thoughts about the paper

The paper [Snir, 2014] provides a good outline of current computing device technologies and the challenges which hamper their advancement in terms of computing, energy, and cost efficiency. An interesting point in the paper is the question of whether we really need a continuous increase in supercomputing performance. The answer may be influenced by several factors, such as ever-recurring computational challenges which hit the performance wall of current technologies, the continuing search for polynomial-time solutions to exponentially hard computational problems, and the ever-striving economy of computing devices. It is of utmost interest to me to witness a shift from current computing models to new computing models with unexpected physical implementations, just like the shift from the mechanical computer to the CMOS-based computer.

References

[Cavin et al., 2012] Cavin, R. K., Lugli, P., and Zhirnov, V. V. (2012). Science and Engineering Beyond Moore's Law. Proceedings of the IEEE, 100(Special Centennial Issue):1720–1749.

[Chuang et al., 1995] Chuang, I. L., Laflamme, R., Shor, P. W., and Zurek, W. H. (1995). Quantum Computers, Factoring, and Decoherence. Science, 270(5242):1633–1635.

[Coppersmith, 1994] Coppersmith, D. (1994). An Approximate Fourier Transform Useful in Quantum Factoring. Technical report, IBM Research Division.

[Courtland, 2013] Courtland, R. (2013). The Status of Moore’s Law: It’s Complicated.

[Deutsch and Jozsa, 1992] Deutsch, D. and Jozsa, R. (1992). Rapid Solution of Problems by Quantum Computation. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 439(1907):553–558.

[Feynman, 1982] Feynman, R. (1982). Simulating Physics with Computers. International Journal of Theoretical Physics, 21(6/7):467–488.

[Feynman, 1985] Feynman, R. (1985). Quantum Mechanical Computers. Foundations of Physics, 16(6):507–531.

[Grover, 1996] Grover, L. K. (1996). A fast quantum mechanical algorithm for database search. In Proceedings, STOC 1996, pages 212–219.

[Hughes et al., 2004] Hughes, R., Doolen, G., Awschalom, D., Caves, C., Chapman, M., Clark, R., Cory, D., DiVincenzo, D., Ekert, A., Hammel, P. C., Kwiat, P., Lloyd, S., Milburn, G., Orlando, T., Steel, D., Vazirani, U., Whaley, K. B., and Wineland, D. (2004). A quantum information science and technology roadmap.

[Nicolau et al., 2016] Nicolau, D. V., Lard, M., Korten, T., van Delft, F. C. M. J. M., Persson, M., Bengtsson, E., Månsson, A., Diez, S., Linke, H., and Nicolau, D. V. (2016). Parallel computation with molecular-motor-propelled agents in nanofabricated networks. Proceedings of the National Academy of Sciences of the United States of America, 113(10):2591–2596.

[Nielsen and Chuang, 2010] Nielsen, M. A. and Chuang, I. L. (2010). Quantum Computation and Quantum Information. Cambridge University Press, Cambridge, 1st edition.

[Shor, 1994] Shor, P. W. (1994). Polynomial-Time Algorithms for Prime Factorization and Discrete Logarithms on a Quantum Computer. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 124–134.

[Smith et al., 2016] Smith, L. W., Al-Taie, H., Lesage, A. A. J., Thomas, K. J., Sfigakis, F., See, P., Griffiths, J. P., Farrer, I., Jones, G. A. C., Ritchie, D. A., Kelly, M. J., and Smith, C. G. (2016). Effect of Split Gate Size on the Electrostatic Potential and 0.7 Anomaly within Quantum Wires on a Modulation-Doped GaAs/AlGaAs Heterostructure. Physical Review Applied, 5(4):044015.

[Snir, 2014] Snir, M. (2014). The Future of Supercomputing. In Proceedings of the 28th ACM international conference on Supercomputing – ICS ’14, volume 25, pages 261–262, New York, New York, USA. ACM Press.

[Zurek, 2002] Zurek, W. H. (2002). Decoherence and the Transition from Quantum to Classical Revisited. Los Alamos Science, (27):1–24.

Posted in Computer Science, GPU computing, Graduate students life

A synthesis of the paper entitled “GPUs: a closer look” by K. Fatahalian and M. Houston, 2008

Introduction

The authors of GPUs: a closer look [Fatahalian and Houston, 2008] provide the reader with a high-level overview of the inner workings of a graphics processing unit (GPU), outlined as a computing pipeline. This pipeline interleaves program execution on programmable and fixed-function cores designed to execute Single Instruction Multiple Data (SIMD) commands in a massively parallel fashion. The application-programmable stages in the pipeline are vertex processing (VP), primitive processing (PP), and fragment processing (FP). The fixed-function stages are vertex generation (VG), primitive generation (PG), fragment generation, and pixel operations (PO). The authors also point out challenges in GPU design as of 2008 and some opportunities for advancing strategies for more efficient program execution.

Notes on opportunities and challenges

As discussed by the authors, computation in the PO stage is not easily parallelized due to dependencies between the streams of fragment data which define the state of a single pixel on the screen. This is particularly true for instances in which the image to render involves many overlapping objects, which then require independent, non-parallelizable computation of pixel values for each overlapping sample position, e.g., in 3D animation. Even more challenging are instances involving higher frame rates and more dynamic ordering of objects in the scene, e.g., 120 fps gaming with highly dynamic virtual camera positions. Scenes with fewer pixel-value changes per frame may employ selective refresh of only the affected pixels to reduce the computation per frame.
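The order dependency can be seen in the standard "over" compositing operator, which is not commutative. The toy function below (premultiplied RGBA, illustrative only, not the paper's code) shows that blending the same two fragments in opposite orders yields different pixels, which is why overlapping fragments for one pixel must be processed in submission order.

```python
def over(src, dst):
    """Composite a premultiplied RGBA fragment src over fragment dst."""
    sr, sg, sb, sa = src
    dr, dg, db, da = dst
    k = 1.0 - sa                       # how much of dst shows through src
    return (sr + dr * k, sg + dg * k, sb + db * k, sa + da * k)

red_half  = (0.5, 0.0, 0.0, 0.5)       # premultiplied 50%-opaque red
blue_half = (0.0, 0.0, 0.5, 0.5)       # premultiplied 50%-opaque blue

print(over(red_half, blue_half))       # red drawn last: reddish pixel
print(over(blue_half, red_half))       # blue drawn last: a different pixel
```

Because the two orderings disagree, the PO stage cannot freely reorder fragment streams the way an associative reduction could.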

Fixed-function stages implement tasks that are not easily parallelizable, using optimized hardware-defined logic. The fixed GPU functions mentioned by the authors were texture filtering, fragment generation, output pixel compositing, data (de)compression, and surface visibility determination. Since the hardware logic for these functions is highly specialized and at the same time compute- and data-access-efficient (also on the scale of gigaflops), it is interesting to investigate whether these functions can be mapped onto solutions to problems unrelated to graphics generation. The article did not provide insights on this question, though. If such non-graphics problems can be characterized, then it may be possible to provide an application programming interface for supplying input to and accessing output of the fixed-function computing components of a GPU. The function of these components is fixed, but the meaning we give to their input and output is purpose-specific. An example of such repurposing of a fixed-function computation is computing match scores between a pattern and same-length substrings of a much longer text using convolution [Fischer and Paterson, 1974, Fredriksson and Grabowski, 2009, Baba, 2010, Alba et al., 2012]. Efficient hardware logic which computes the convolution of two arrays of numerical values under multiplication is then also efficient hardware logic for computing approximate matches of a pattern in a text. The problem of multiplying two numerical sequences and that of computing match scores between a pattern and a text thus share a characteristic property which is used in repurposing the computation of convolution.
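As a hedged illustration of the repurposing idea (a naive sketch after Fischer and Paterson, not the cited papers' actual FFT implementations), the per-symbol indicator trick below phrases match counting as a convolution-style sum of products; a fast hardware convolution unit could, in principle, replace the inner loops.

```python
def match_scores(text, pattern):
    """Count character matches of pattern against every alignment in text."""
    n, m = len(text), len(pattern)
    scores = [0] * (n - m + 1)
    for c in set(pattern):
        t = [1 if ch == c else 0 for ch in text]      # indicator vector for c in text
        p = [1 if ch == c else 0 for ch in pattern]   # indicator vector for c in pattern
        for i in range(n - m + 1):                    # correlation of t with p:
            scores[i] += sum(t[i + j] * p[j] for j in range(m))
    return scores

print(match_scores("abracadabra", "abra"))  # [4, 0, 1, 1, 1, 1, 0, 4]
```

A score of m (here 4) marks an exact occurrence; lower scores give the number of matching positions, which is exactly the "approximate match" signal a convolution engine would produce.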

Different applications require shader functions with varying computing and storage requirements. This calls for efficient resource-management strategies which must also exhibit flexibility in response to rapid variation in user input, e.g., panning rapidly and frequently in a high-frame-rate 3D computer game with many objects to render. Of interest for investigation in this challenge area are algorithms for scheduling and allocating computing tasks to processors. Some results in this area use software-defined management mechanisms in contrast to the hardware-defined logic in current commercial GPUs [Tzeng et al., 2010, Membarth et al., 2012, Wu et al., 2015]. A software-defined approach to task distribution and scheduling introduces flexibility for optimization at the program level which is not available with fixed hardware-based task management logic. A task-donation load-balancing scheme implemented using a parallel task queue is proposed in [Tzeng et al., 2010]. Evaluating it on different workloads, the authors showed that the scheme distributes load more efficiently than other load-balancing schemes implemented with a monolithic task queue. A task-scheduling framework for GPUs which considers multiple applications, where high-priority tasks are allowed to interrupt previously queued lower-priority tasks, is proposed in [Membarth et al., 2012]. The efficiency of a GPU using the proposed framework is analyzed with respect to applications in medical imaging. The software-based approach proposed in [Wu et al., 2015] for optimizing load balancing on GPUs tackles the limitations of the current hardware-defined load-balancing mechanisms by working on (1) the assignment of data-sharing tasks to threads belonging to the same multiprocessor and (2) control over the number of threads to activate on a single multiprocessor.
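A toy, thread-based sketch of the task-donation idea follows; it is simplified far beyond [Tzeng et al., 2010] (the coarse lock and all function names are mine, and a real GPU scheme moves work between SIMD blocks in device memory). Each worker owns a deque of tasks and, when idle, takes work from a busy peer instead of contending on one monolithic queue.

```python
import threading
from collections import deque

def run_workers(tasks, n_workers=4):
    """Run callables to completion with per-worker deques plus idle-time stealing."""
    queues = [deque() for _ in range(n_workers)]
    for i, t in enumerate(tasks):                 # deliberately unbalanced start:
        queues[i % 2].append(t)                   # only workers 0 and 1 get tasks
    results, lock = [], threading.Lock()

    def worker(wid):
        while True:
            task = None
            with lock:                            # coarse lock keeps the toy simple
                if queues[wid]:
                    task = queues[wid].popleft()  # local work first
                else:
                    for q in queues:              # idle: take from a busy peer
                        if q:
                            task = q.pop()
                            break
            if task is None:                      # every queue empty: done
                return
            value = task()
            with lock:
                results.append(value)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

out = run_workers([lambda i=i: i * i for i in range(20)])
print(sorted(out))                                # all 20 tasks completed
```

Even in this caricature, workers 2 and 3 only make progress because of stealing, which is the point of distributing the queue instead of keeping it monolithic.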

As pointed out in the paper, the high predictability of stream data record processing allows for aggregate data prefetching (batch loading of input data from globally shared memory) to minimize the number of data accesses. An example of input data which can be batch-loaded prior to computation is data subjected to parallel reduction, i.e., data combined with a single associative arithmetic operation such as addition. Shader functions' accesses to buffer and texture data, on the other hand, are not as amenable, since such data require dynamically computed memory addresses. An example of such data is intermediate computation results which serve as input to succeeding computations. Intermediate results can be stored in a cache internal to a block (if succeeding operations will access them frequently) or in registers if they will only be used once. As the authors mention, access to such storage during computation requires optimized management of on-board storage resources.
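The parallel-reduction pattern mentioned above can be sketched serially: each pass combines adjacent pairs with one associative operation, halving the working set, so with enough parallel lanes the whole reduction takes O(log n) steps over batch-loaded data. The function below is an illustrative sketch of the schedule, not GPU code.

```python
def tree_reduce(data, op):
    """Reduce data with associative op, pass by pass, as a parallel tree would."""
    work = list(data)
    while len(work) > 1:
        nxt = [op(work[i], work[i + 1]) for i in range(0, len(work) - 1, 2)]
        if len(work) % 2:              # a leftover odd element carries over as-is
            nxt.append(work[-1])
        work = nxt                     # each pass halves the working set
    return work[0]

print(tree_reduce(range(1, 101), lambda a, b: a + b))   # 5050
print(tree_reduce([3, 1, 4, 1, 5], max))                # 5
```

On a GPU each list comprehension pass would be one kernel step with every pair handled by its own thread, which is why the input predictability (and up-front batch load) matters.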

Shader function invocations on related data streams may use a shared instruction stream to parallelize execution. This results in a SIMD type of parallel computation. In addition to what the authors pointed out, another level of parallelism can be applied to the branching statements of user-programmable stages in the pipeline. Conditional (shared) instructions on a set of related data streams can be serially assigned to each arithmetic logic unit (ALU) of a core [Goddeke et al., 2011]. The parallelism is applied to the internal-to-branch shared instructions for the related data streams.
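The branch handling can be mimicked as follows (an illustrative sketch of predicated SIMD execution, not actual GPU microarchitecture): all lanes follow one shared instruction stream through both sides of a conditional, and a per-lane mask selects which result each lane keeps, so lanes on the untaken side are effectively idle during that side's instructions.

```python
def simd_branch(data, cond, then_op, else_op):
    """Execute an if/else over SIMD-style lanes sharing one instruction stream."""
    mask = [cond(x) for x in data]            # per-lane predicate
    then_vals = [then_op(x) for x in data]    # every lane runs the 'then' side...
    else_vals = [else_op(x) for x in data]    # ...and then the 'else' side
    # the mask decides which computed value each lane commits
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

# even lanes double their input, odd lanes negate it
print(simd_branch([1, 2, 3, 4], lambda x: x % 2 == 0,
                  lambda x: 2 * x, lambda x: -x))   # [-1, 4, -3, 8]
```

The cost is visible in the sketch: both branch bodies are evaluated for every lane, which is why divergent branches reduce SIMD efficiency.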

Thoughts about the paper

The authors were able to provide a high-level introduction to computation on a GPU and at the same time introduce areas of challenge in the current design and implementation of the device. Of particular interest from this paper are the opportunities for investigating software-based approaches to introducing optimizations in various aspects of the device's operation, such as load balancing, task scheduling, and task assignment, which are restricted under current hardware-based resource management.

References

[Alba et al., 2012] Alba, A., Rodriguez-Kessler, M., Arce-Santana, E. R., and Mendez, M. O. (2012). Approximate string matching using phase correlation. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2012:6309–6312.

[Baba, 2010] Baba, K. (2010). String Matching with Mismatches by Real-valued FFT. In Computational Science and Its Applications – ICCSA 2010, Part IV, pages 273–283.

[Fatahalian and Houston, 2008] Fatahalian, K. and Houston, M. (2008). GPUs: a closer look. Queue, 6(2):18.

[Fischer and Paterson, 1974] Fischer, M. J. and Paterson, M. S. (1974). String Matching and other Products. Technical report, Massachusetts Institute of Technology, Cambridge.

[Fredriksson and Grabowski, 2009] Fredriksson, K. and Grabowski, S. (2009). Fast convolutions and their applications in approximate string matching. In 20th International Workshop, IWOCA 2009, pages 254–265.

[Goddeke et al., 2011] Goddeke, D., Kurzak, J., and Weiß, J.-P. (2011). Scientific Computing on GPUs: GPU Architecture Overview.

[Membarth et al., 2012] Membarth, R., Lupp, J. H., Hannig, F., Teich, J., Körner, M., and Eckert, W. (2012). Dynamic task-scheduling and resource management for GPU accelerators in medical imaging. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7179 LNCS:147–159.

[Tzeng et al., 2010] Tzeng, S., Patney, A., and Owens, J. D. J. (2010). Task management for irregular-parallel workloads on the GPU. Proceedings of the Conference on High Performance Graphics, pages 29–37.

[Wu et al., 2015] Wu, B., Chen, G., Li, D., Shen, X., and Vetter, J. (2015). Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations. In Proceedings of the 29th ACM on International Conference on Supercomputing – ICS '15, pages 119–130, New York, New York, USA. ACM Press.

Posted in Computer Science, GPU computing, Graduate students life

Some exciting new developments

Ever since I came to work at my first (and still current) workplace, our team has been more into developing information systems and sometimes enterprise resource planning (ERP) software. This involves having working knowledge of the current technologies, methods, frameworks, and programming languages which we can use in developing software. Basically, we are more on the development side of R&D (note that our institute is a research and development institute). When I decided to take up graduate studies, I became aware of the possibilities of doing research in my workplace. Graduate studies have broadened my perspective on what I can contribute to our institute when it comes to research. It was not that easy to just propose doing research in my division, though. My workmates are more inclined to stick with the vanilla projects of developing software, probably because their current skill set may be limiting them from exploring research work. I can personally tell that it is not trivial to do research work. It takes a different skill set and mindset, and those can only be acquired through focused and guided practice.

Just this year, there has been a major shift in the leadership of our institute. The new director comes from the premier university of the country, and he is very much involved in doing research in the engineering field. The current leadership is now pushing each division of the institute to incorporate research work into its projects. This implies some changes to the organizational structure of each division to accommodate the instructions of the new director. In our division, one possibility is to create a team which will specifically handle the research work while the other team handles the development work. The output of the research team should serve as input to the development team. The good thing is that several of us in our division are already taking graduate studies and being exposed to research work. As I have mentioned before in one of our division meetings, taking graduate studies will prepare our division for what is currently being demanded of us by the leadership. Nevertheless, I am very excited about what opportunities will come now that we will be exerting more effort as an institution on doing research and not just development work.

Posted in Graduate students life, Work

Home is whenever I’m with you.

At last! I was able to go home to visit my parents for four days. Yesterday, Monday, was National Heroes Day, so I booked my return flight for today, Tuesday. I was able to treat them to a movie, which we very seldom do as a family. We were also able to spend some time at a local beach, nothing fancy, just your plain beach with some kubos. Mama just brought some bread, chips, and soft drinks. The water was not that clear either. Nevertheless, we had fun. My paper review requirement for my course this semester, due tomorrow, is taking a lot of my time though. I could have spent some more time with them at one of our local parks or visited some of our relatives. Hopefully, when I come home some other time, I will have a more requirements-free schedule.

Today and tomorrow will most likely be rough days for me. I need to write two paper reviews today (The Future of Supercomputing by M. Snir, and GPUs: a closer look by K. Fatahalian et al.) for submission tomorrow. Add to that, I need to write some exercise CUDA C code in preparation for our class programming exercise. The faster I finish my paper reviews, the more time I will have for coding. Hopefully, I will have finished my review of the first paper by the time our plane arrives in Manila. I am writing this blog post mid-flight.

Posted in Family, Life, Random Stuff

Master's thesis and parallel computing

I am taking a course on Parallel Computing this semester under Dr. F. Cabarle. The course requires programming on a graphics processing unit (GPU) using NVIDIA's CUDA C. It has been a while since I wrote a program in C, since I have been coding in Java ever since I started working at DOST-ASTI. I need to refresh my C programming skills this sem, at least the minimum concepts necessary for writing CUDA applications.

We will also be required to submit reviews of all our required readings for the class. The last time I attended a class with a paper review requirement was our Advanced Computer Networks class under Dr. C. Festin. You can browse through my blog and find some of my short reviews of the assigned readings for that networks class.

Lastly, the course will also require a project wherein the concepts of parallel computing will be applied. I have not yet decided on my topic for the project, but I might work on implementing one of the algorithms in my master's thesis using DOST-ASTI's GPU cluster.

Since I still need to simulate the algorithms in my thesis and do some revisions of my manuscript, I need to budget my time well if I want to do well in this course and at the same time finish my master's requirements this semester.

Posted in Graduate students life

Proof-of-Concept Requiring Service Bus Programming

Currently, our team is working on a prospective project which requires service bus programming. I am new to the concept of using a service bus for interoperability, but I think it is a good way of decoupling heterogeneous systems which need to communicate. Some key terms I am currently encountering are MuleESB, MarkLogic NoSQL Database Server, REST, SOAP, web services, and PKI. We are fortunate to be provided with all the resources we need for the proof-of-concept application we are building.

Hopefully, I will be able to post some updates on service bus programming, including issues I encounter and the solutions I find for them, as well as the new things I learn during application development. I am also currently producing results for my MS thesis so that I can defend my proposal and thesis this month.

That is all for now I guess.

Best Regards,
Jeff

Posted in Service Bus Programming