Supercomputers are computers that can perform calculations much faster than the computers we use every day; they can also store and process much more data. They are needed to solve complex mathematical and practical problems: physical problems, the modeling of complicated systems such as the structures of materials and chemical substances, and simulations in which many factors must be considered.
The practical benefits of supercomputers’ work include forecasting the weather, analyzing economic situations, and testing things that people will use: to see how safe an airplane is, for example, complicated simulations are run on supercomputers. Although most people do not encounter supercomputers in their everyday life, it is important to understand that they are crucial not only for science but also for making possible things that everyone uses, such as weather forecasts or the production of fuel. To understand the design of supercomputers, their types should be described, their hardware and software should be explored, and the way their performance is measured should be considered.
One of the main criteria by which supercomputers are divided into groups is the way their memory is organized. The two major arrangements are shared and distributed memory. Distributed memory means that each processor has its own memory; shared memory means that there is one memory space used by all the processors. In a distributed-memory machine, each element typically couples a processor with a memory unit, and these elements need to be interconnected so that programs on different processors can interact.
One of the methods to establish this interconnection is a point-to-point circuit that functions as a communications channel with two permanent endpoints similar to a tin can telephone. Another method of establishing interconnections is using hardware acting as a network switch (Hwang, Dongarra, & Fox, 2013). There is also the combined distributed shared memory type, in which each node (i.e. each element of a network) can access and retrieve data from a shared memory but also possesses its own limited memory that is not shared with other nodes.
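The point-to-point arrangement described above can be sketched in a few lines. This is only a toy illustration: real distributed-memory machines run separate processes on separate nodes and exchange data with a message-passing library such as MPI. Here each "node" is a Python thread that keeps its data in local variables only, and a pair of queues stands in for the fixed two-endpoint channel (the "tin can telephone").

```python
# Illustrative sketch of distributed memory with a point-to-point channel.
# Each node keeps its data private (local variables) and can only share it
# by sending a message over the channel; no memory is accessed directly.
import threading
import queue

def node_a(to_b, from_b, log):
    local = [1, 2, 3, 4]       # node A's private memory
    to_b.put(local)            # send a message: "here is my data"
    log.append(from_b.get())   # wait for node B's reply

def node_b(to_b, from_b):
    received = to_b.get()      # node B receives its own copy of the data
    from_b.put(sum(received))  # reply with the computed result

to_b, from_b, log = queue.Queue(), queue.Queue(), []
a = threading.Thread(target=node_a, args=(to_b, from_b, log))
b = threading.Thread(target=node_b, args=(to_b, from_b))
a.start(); b.start()
a.join(); b.join()
print(log)   # prints [10]
```

The key property illustrated is that node B never touches node A's memory; all interaction is an explicit message over the fixed channel, which is exactly what the interconnect of a distributed-memory machine has to provide.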
The advantage of shared memory is that it provides a unified space from which all the necessary data can be retrieved; however, different processors usually need access to certain kinds of data that they retrieve regularly while rarely or never accessing other parts of the memory space. Therefore, the advantage of distributed memory is higher efficiency: the initial design considers which data each processor will need, and redundant interconnections are excluded. Also, from the point of view of hardware, machines that use distributed memory are easier to design and build.
On the other hand, the advantage of shared-memory supercomputers is that they facilitate parallel computing, i.e. a method of solving complicated problems that is based on the assumption that a complicated problem can be broken into simpler ones, each of which is then solved concurrently instead of consecutively, i.e. at the same time instead of one by one. Parallel computing is enabled when the processors that solve different subproblems all have access to the one memory in which the original problem’s data is contained.
Based on the above, another way of dividing supercomputers into groups is by their approach to problem-solving: some supercomputers solve problems serially, performing the calculations for each one at a time, while others run a larger number of operations at once, solving subproblems in a parallel manner. Which approach is used depends on the memory type of the supercomputer.
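The parallel decomposition described above can be shown with a minimal sketch: a large problem (summing a long list) is broken into independent subproblems (summing chunks), which are handed to a pool of workers and then combined. On a real supercomputer the pieces run on separate nodes; here a Python thread pool merely illustrates the split-solve-combine pattern, not true hardware parallelism.

```python
# Sketch of parallel problem-solving: break the problem into chunks,
# solve the chunks concurrently, then combine the partial results.
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(numbers, n_workers=4):
    chunk = max(1, len(numbers) // n_workers)
    # each piece is an independent subproblem
    pieces = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = pool.map(sum, pieces)   # subproblems solved concurrently
    return sum(partial)                   # combine the partial results

print(parallel_sum(list(range(1, 101))))  # prints 5050
```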
What is referred to as hardware in computer science is computers’ physical elements, i.e. machines and their parts, and one of the main considerations in the history of creating supercomputers was how to develop different components of hardware and how to arrange them (Patterson & Hennessy, 2017). The creators of supercomputers focused on developing central processing units (CPUs) that process data much faster than CPUs in personal computers—the speed of processing is one of the main things that differentiate supercomputers from regular ones.
However, it was understood at some point that even if CPUs are fast, a comparable amount of time is needed for supplying data to them (or even more, because there is more data to process), so the parts of the hardware responsible for this supply needed to be improved, too. For this, additional simple computers were connected into the network whose sole purpose was to fetch data from memory and send results back so that the CPU could focus on processing only.
For a long time, the general direction in the development of supercomputers was to make individual processors ever faster rather than to combine many of them into larger structures. The so-called father of supercomputing, Seymour Cray, the inventor of the fastest computers of his time, famously defended this approach: “If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?” (Machanick, 2015, p. 113). Eventually, however, the direction changed: the massively parallel computers of the new generation are networks of connected but independently working processing units whose calculation capacities are combined for more efficient processing of data.
One of the main hardware-related issues for supercomputers is heat management. Supercomputers and the facilities that maintain them consume many times more electricity than households do. The use of energy inevitably produces heat, which is a threat to the hardware. Moreover, heat is produced unevenly: some parts of a supercomputer become hotter during operation than others and can damage their own inner structures or nearby pieces of hardware. Therefore, complicated cooling systems need to be designed.
The first consideration is that components of supercomputers should be made, if possible, of materials that are not likely to become extremely hot because of the energy that supercomputers consume, i.e. the initial heat management begins at the stage of designing hardware and selecting materials for it. Further, a cooling system needs to be applied. In early supercomputers, liquid-based cooling was used; however, modern supercomputers rely more on air conditioning and combined systems.
Those aspects of supercomputers that are associated with software, i.e. computer programs, have changed within recent decades. The main consideration in the work of supercomputers was initially their speed: a supercomputer was a machine that could perform calculations much faster than regular computers. This is why the general trend in the early development of supercomputers was to create operating systems and other software individually for each type of supercomputer so that the software matched the intentions expressed in the architecture.
However, with technological development, it became evident that more unified approaches should be used, such as adopting ready-made software. Moreover, much of the software used in supercomputers is open-source, i.e. more flexible and capable of being adjusted to particular programming needs (Hwang et al., 2013), unlike the software used in mass-produced regular computers, because the needs of supercomputers are more specific, diverse, and advanced.
A specific software-related characteristic of supercomputers is that different operating systems can be used for different parts. Parallel computing requires a certain type of distribution of functions: while some nodes perform calculations, other nodes provide additional services, such as data supply (see Types). Therefore, since the functions are different, the use of different computer programs on such elements as compute nodes and server nodes is justified.
Also, different kinds of existing supercomputers have dramatically different architectures, which is why, although ready-made operating systems are used in most of them, extensive adjustment is needed for each kind of supercomputer. According to Machanick (2015), a particular software issue for supercomputers is that, in massively parallel computing, programs should not only process data but also plan the extensive exchange of data between different nodes.
The parallel architecture also requires open software solutions that coordinate the data exchange. One way of approaching this coordination is to connect several regular computers into networks that handle the connections in a distributed-memory supercomputer system. For shared-memory systems, application programming interfaces (APIs) can be run on computer clusters and separate nodes to ensure that data is properly retrieved from the central memory unit and supplied to the appropriate units in which it is processed. The goal is to prevent CPUs from wasting time waiting for data and to let them perform calculations at their full capacity.
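The coordination goal just described (keep the compute units busy by supplying data ahead of time) can be sketched with a producer-consumer pattern. This is an illustrative toy, not a real supercomputer API: a "server" thread keeps feeding a bounded buffer so that the "compute" thread does not stall waiting for memory, with a Python queue standing in for the interconnect.

```python
# Sketch of coordinated data supply: one thread plays a server node
# that prefetches data into a bounded buffer; another plays a compute
# node that consumes it. The names and the "calculation" are illustrative.
import queue
import threading

def data_server(q, items):
    for item in items:
        q.put(item)        # supply data ahead of the consumer
    q.put(None)            # sentinel: no more data

def compute_node(q, results):
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * item)   # the "calculation"

buffer = queue.Queue(maxsize=8)       # bounded prefetch buffer
results = []
supplier = threading.Thread(target=data_server, args=(buffer, range(5)))
worker = threading.Thread(target=compute_node, args=(buffer, results))
supplier.start(); worker.start()
supplier.join(); worker.join()
print(results)   # prints [0, 1, 4, 9, 16]
```

The bounded buffer is the design point: it lets the supplier run ahead of the consumer (so the compute side rarely waits) without letting it race arbitrarily far and exhaust memory.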
Since massively parallel supercomputers are complicated systems with many interconnected and coordinated elements, the risk of faults in them is higher than in regular computers. When something goes wrong in the work of a supercomputer, it may be hard to find the reason because so many components are involved in processing a task. Therefore, debugging, i.e. finding and correcting errors, is challenging for parallel structures, and the software that supercomputers use should be built in a way that allows monitoring every step of a process so that, once an error occurs, it can be detected and traced to the element in which it occurred. For this, there is a separate area of software testing for supercomputers that can be referred to as high-performance computing testing.
The performance of supercomputers is measured differently from that of regular computers. For regular computers, performance (particularly the speed of processors) is measured in instructions per second (IPS); for supercomputers, the unit of measurement is floating-point operations per second (FLOPS). Such operations refer to calculations that involve numbers far larger or smaller than those we usually use or even those regular computers use; they appear, for example, in complicated probability calculations such as weather forecasting. However, how many FLOPS a supercomputer achieves does not fully show how powerful it is.
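A back-of-the-envelope sketch shows what the FLOPS metric counts. A naive n-by-n matrix multiplication performs about 2·n³ floating-point operations (one multiply and one add per inner step); dividing that count by the elapsed time gives an operations-per-second rate. Interpreted Python is, of course, orders of magnitude slower than the optimized kernels real benchmarks such as LINPACK use; the point is only the unit itself.

```python
# Rough FLOPS estimate from a naive matrix multiplication.
# A pure-Python rate will be tiny; the calculation only
# illustrates what "floating-point operations per second" counts.
import time

def naive_matmul_flops(n=100):
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    c = [[0.0] * n for _ in range(n)]
    start = time.perf_counter()
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]    # 1 multiply + 1 add
            c[i][j] = s
    elapsed = time.perf_counter() - start
    return 2 * n ** 3 / elapsed           # operations per second

print(f"{naive_matmul_flops():.2e} FLOPS")
```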
Another consideration is the distinction between a supercomputer’s capability and its capacity. Capability is understood as the ability to solve one complicated problem using all the computing resources of a supercomputer; the time needed for this is considered, too. Capacity, in contrast, is how much computing a supercomputer can do overall, i.e. how many problems of a certain complexity it can solve within a given period (Patterson & Hennessy, 2017).
It means that, despite having high capacity, some supercomputers may lack capability, i.e. be unable to solve one very complicated problem despite being able to solve several regular ones (regular for supercomputers, but possibly still too complicated for ordinary computers). Also, the performance of a supercomputer cannot be measured by summing up the performances of all of its parts: when working together, the different units, nodes, clusters, and components responsible for memory, processing, and interconnection produce better results than the sum of their results would be if they worked separately.
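The capacity/capability distinction can be made concrete with hypothetical numbers (all figures below are made up for illustration). Two machines do the same total amount of work per hour, so their capacity is equal, yet only one of them can fit a single large job, so only it has the required capability.

```python
# Hypothetical illustration of capacity vs. capability.
# Both machines perform the same total work per hour (equal capacity),
# but only machine B can run the single big job (higher capability).
machines = {
    "A": {"jobs_per_hour": 100, "max_job_size": 10},   # many small jobs
    "B": {"jobs_per_hour": 10,  "max_job_size": 100},  # few big jobs
}

def capacity(m):
    # total units of work per hour
    return m["jobs_per_hour"] * m["max_job_size"]

big_job = 80  # size of one complicated problem
print(capacity(machines["A"]) == capacity(machines["B"]))          # True
print([n for n, m in machines.items() if m["max_job_size"] >= big_job])
# only machine B has the capability for the big job
```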
Supercomputers’ types, hardware, software, and performance measurement have been explored. There are different types of supercomputers depending on how they store memory: in each processor or in a centralized memory unit. A major aspect of supercomputers’ work is massively parallel computing, i.e. breaking complicated problems into smaller subproblems and calculating solutions for each subproblem separately and at the same time.
The hardware used in supercomputers is designed to make this possible, and the architecture features a network with many nodes responsible for different functions. Software is developed so that the fast processing of data is not hindered by data supply. Performance is measured through the number of floating-point operations a supercomputer can perform within a second and through the complexity of the problems it can solve. Although there are many different types of supercomputers, and they are designed very complexly, these general principles apply to all of them.
Hwang, K., Dongarra, J., & Fox, G. C. (2013). Distributed and cloud computing: From parallel processing to the internet of things. Waltham, MA: Elsevier.
Machanick, P. (2015). How general-purpose can a GPU be? South African Computer Journal, 57(1), 113-117.
Patterson, D. A., & Hennessy, J. L. (2017). Computer organization and design: The hardware/software interface. Cambridge, MA: Elsevier.