MULTI-AGENT PARALLEL IMPLEMENTATION OF PHOTOMASK SIMULATION IN PHOTOLITHOGRAPHY

A framework for parallelizing aerial image simulation in photolithography is proposed. The initial data for the simulation, which represent the photomask, are treated as a data stream processed by a multi-agent computing system. Parallel image processing is based on a graph model of a parallel algorithm. The algorithm is constructed from individual computing operations in a special visual editor. The visual representation is then converted into XML, which is interpreted by the multi-agent system based on MPI. The system performs run-time dynamic optimization of calculations using a virtual associative network algorithm. The proposed framework makes it possible to design and analyze parallel algorithms and to adapt them to the architecture of the computing cluster.


INTRODUCTION
The photolithography process is a major step in transferring an integrated circuit (IC) layout pattern onto the surface of a semiconductor wafer. Simulation of the lithography process is very complicated. It can be roughly simulated in two steps: 1) mask shapes are projected into a photoresist as an aerial image; 2) the distribution of the radiation intensity absorbed in the photoresist is calculated and patterned based on the aerial image intensity.
The image projection is simulated by the Hopkins equation [1], which is a four-dimensional integral. This calculation is too slow to simulate across the full chip, and parallel algorithms can be used to speed up the simulation [2,3]. A number of known photolithography simulation software systems are currently implemented on clusters of multiprocessor personal computers and workstations [4][5][6][7] as well as on supercomputers [9]. Hardware-accelerated computational lithography tools have also been built [8]. Their widespread use is hindered by the need to solve additional tasks of planning and optimizing the structure of a parallel application during the design process.
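For orientation, one common textbook form of the Hopkins imaging equation is sketched below. The symbols are assumptions introduced here and are not taken from the paper: $\tilde{M}$ is the mask spectrum, $P$ the pupil function, $J$ the effective source distribution, and $TCC$ the transmission cross coefficient. Sign and normalization conventions vary between references.

```latex
I(x,y) = \iiiint TCC(f',g';f'',g'')\,
         \tilde{M}(f',g')\,\tilde{M}^{*}(f'',g'')\,
         e^{-i 2\pi \left[(f'-f'')x + (g'-g'')y\right]}
         \,df'\,dg'\,df''\,dg''

TCC(f',g';f'',g'') = \iint J(f,g)\,
         P(f+f',\,g+g')\,P^{*}(f+f'',\,g+g'')\,df\,dg
```

The outer integral runs over two pairs of spatial frequencies, which is the four-dimensional integration that makes full-chip evaluation expensive.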
In many cases, parallel applications are developed on the basis of existing sequential algorithms and their composition. Computational operations that form parts of a parallel algorithm are often universal and can be applied to various information processing problems. Implementing the operations in portable form makes component design possible, in which the program is constructed from large blocks.
In addition, executing a parallel program requires modern methods for planning and optimizing the load of computational nodes. In many cases, the nodes of parallel systems have heterogeneous characteristics, both spatial and temporal. For parallel programs to be implemented effectively, such systems should provide tools for organizing and monitoring parallel computing and for its dynamic reconfiguration.
There are a number of machine vision systems that use parallel and distributed processing [10][11][12][13]. Different technologies, such as CORBA [13] or a multi-agent approach [10], are used as the architectural core of these systems.
We propose to use, for designing and organizing parallel computations, an integrated set of tools (a framework) that includes a visual editor, a compiler, an optimization system and a parallel computing support system based on MPI [14]. With a framework for the design, analysis and planning of parallel applications and built-in mechanisms for implementing parallelism, an IC designer can significantly accelerate the processing flow of photomask and layout images in the operational analysis of ICs. MPI makes our system widely applicable to various parallel computers. The main features of the proposed framework are the following:
1) a visual representation of a parallel application based on a graph model of the computation. A concept of computational grains is used. The grains are independent modules developed in different programming languages (C++, Java, MPI) that have a specific interface for integration into a parallel algorithm;
2) support for portability, implemented as libraries. The computing grains are added to the system algorithms at run time;
3) static and dynamic optimization of the parallel applications using a virtual associative network (VAN) algorithm. This algorithm is a kind of hybrid genetic algorithm (GA) and provides a quick search for a solution close to optimal. It has two modifications: the first is used in the analysis phase of the design (static optimization), the second at the stage of program execution (dynamic optimization);
4) assignment of the operations of the parallel application to the computational nodes (processors) of a computer system, taking into account the information obtained during the optimization.
International Journal of Computing, ISSN 1727-6209, www.computingonline.net, computing@computingonline.net
The paper describes a parallel implementation of the simulation of image formation in the photoresist on a wafer during photolithography. The initial data for processing are images of the original topology of VLSI photomasks. The simulation results make it possible to perform subsequent automatic mask inspection and to determine the significance of photolithography defects. Topology defects are significant if they lead to the formation of defects on the wafer that have to be corrected at the production stage.
The task of aerial image simulation is to calculate the light intensity distribution on the wafer surface and to obtain a latent image, taking into account the characteristics of the optical system and the lighting conditions. Section 2 presents an algorithm that simulates the aerial image on the photoresist. In Section 3 we consider the problem of parallel processing and describe a graph model representation of the parallel algorithm based on the concept of computational grains. Section 4 describes the basic elements of the computational platform and their interaction in the development of parallel applications. Section 5 presents an example implementation of the simulation algorithm and some experimental results of the optimization process.

ALGORITHM FOR SIMULATING AERIAL IMAGE ON THE PHOTORESIST
The algorithm for simulating the image on the photoresist surface is composed of the following steps [15,16]:
- calculation of a pupillary function;
- calculation of a vector amplitude of the object;
- calculation of a transfer matrix of the projection lens;
- calculation of the two-dimensional distribution of intensity in a given position of the plane;
- calculation of the two-dimensional distribution of intensity in different positions of the plane;
- calculation of the image intensity in semi-coherent light;
- calculation of the volume distribution of intensity.
The influence of the vector properties of light is taken into account by so-called vector factors (multipliers) applied to the pupil function. Using these factors makes it possible to describe the influence of the vector nature of electromagnetic waves on the image of thin periodic structures whose dimensions are within the resolution of optical systems, while significantly reducing the computation time of the aerial image.
To describe the effect of the anterior aperture of the optical system, it is necessary to use a factor that accounts for the diffraction of a plane linearly polarized wave at the input of the optical system. In this case, we consider the effect of the optical system in redistributing the energy in the spectrum regardless of the direction of polarization and the direction of wave propagation. Another factor takes into account the influence of the entrance aperture at the entrance of the optical system. Based on these data, we can simulate the influence of the numerical entrance aperture on the distribution of any Fourier component exactly as a vector field (the vector nature of electromagnetic waves is considered at the inlet and outlet of the optical system). As a result, it is possible to calculate the vector field of the image in both coherent and partially coherent light.
Using the factors makes it possible to describe the influence of a linearly polarized wave on the image quality. For non-polarized or partially polarized light, it is necessary to use both the electric and the magnetic vectors, which is implemented in the proposed algorithm.

Fig. 1 - An algorithm for constructing the aerial image
The main features of the algorithm are the following.
1) The vector nature of the light field is taken into account by formulating the electric and magnetic amplitude vectors as functions of the three Cartesian space coordinates and of two coordinates in the pupil of the optical system. This formulation correctly accounts for the aberrations of the optical system and for the influence of a high numerical aperture on image formation without significantly complicating the mathematical apparatus. In contrast to the currently used models, the proposed model conforms strictly to the physical nature of the processes and is much simpler, which is favorable for constructing fast algorithms.
2) Using partial coherence theory, the image intensity is calculated in the most economical way, based on a system of eigenfunctions. The eigenfunctions are simulated by Zernike polynomials and are used to describe the mutual intensity of different pixels of the image. This technique significantly reduces the number of integrals over the light source, whose calculation remains the most time-consuming step in the simulation after applying the features mentioned above.
These principles complement each other in developing the most efficient algorithm for calculating the intensity distribution of the aerial image; individually they do not provide the benefit that they achieve together. The implementation of the algorithm as a single-process application has shown that, if the data volume is large, the processing time is unacceptably long (over 1 min per frame). Since all layout images are processed by a common program that may be called by a defect detection system at the same time, it is expedient to use a parallel computer system for the simultaneous processing of the input data stream.
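The eigenfunction idea in item 2) can be illustrated with the standard coherent-mode expansion of the mutual intensity. This is a generic sketch, not the authors' exact formulation: the symbols $J$, $\lambda_k$, $\varphi_k$ and the truncation order $K$ are assumptions introduced here.

```latex
J(\mathbf{r}_1,\mathbf{r}_2) = \sum_{k} \lambda_k\,
      \varphi_k(\mathbf{r}_1)\,\varphi_k^{*}(\mathbf{r}_2),
\qquad
I(\mathbf{r}) = J(\mathbf{r},\mathbf{r})
      \approx \sum_{k=1}^{K} \lambda_k \left|\varphi_k(\mathbf{r})\right|^{2}
```

Under this reading, keeping only the $K$ modes with the largest $\lambda_k$ replaces the integration over the source by a short sum of coherent terms, which is where the saving in computation would come from.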

A PARALLEL APPLICATION MODEL AND A COMPUTATIONAL GRAIN CONCEPT
The basic principles of creating a graph-oriented representation of a parallel program are defined in [17]. The scenarios for data processing are represented in the form of a directed acyclic graph (DAG). The DAG is represented as a tuple in which $V$ is the set of graph nodes, which represents the decomposition of a parallel dataflow processing program into separate operations; $E$ is the set of graph edges, which represents the precedence relation between operations in the scenario and determines the data transfer between these nodes; and each edge $e_{i,j} \in E$ carries the communication volume transferred between two data processing operations. We assume that operations connected by an edge use an identical data format for the predecessor output and the successor input. For all the scenarios, the individual edges have an equal cost. The development of a dataflow processing application includes the following three stages: 1) creation of the part of the DAG scenario that describes the logical structure of the application; 2) assignment and editing of the operation parameters of the DAG scenario for each data type; 3) mapping of the DAG scenario to the cluster architecture.
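As a minimal sketch of the graph model described above, the following Python fragment stores a scenario as a node list and an edge dictionary and derives an execution order that respects the precedence relation (Kahn's algorithm). The operation names and the helper `topological_order` are hypothetical and serve only as an illustration.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return the operations in an order that respects every precedence edge.

    nodes: list of operation names (the set V).
    edges: dict mapping (i, j) -> communication volume of edge e_ij (the set E).
    """
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for (src, dst) in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Operations without predecessors can start immediately.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle, so it is not a DAG")
    return order


# Hypothetical four-operation scenario; all edges share one cost, as in the text.
V = ["pupil", "amplitude", "transfer", "intensity"]
E = {("pupil", "transfer"): 1, ("amplitude", "transfer"): 1,
     ("transfer", "intensity"): 1}
print(topological_order(V, E))
```

Operations that become ready at the same time (here `pupil` and `amplitude`) have no precedence relation between them and are the candidates for parallel execution.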
Each computational operation in a DAG scenario is implemented as a separate unit called a grain. The grain uses a specific interface for integration into the framework and for data exchange. The design of the grain makes it possible to adapt existing processing algorithms rapidly to a parallel application. These algorithms are transformed into objects that are capable of forming their own calling context on the basis of the received parameters. Each operation interprets its parameter string in the way convenient for it and converts the parameters to a variable name or to a constant value. The order and rules of parameter transformation are determined by the operation specification.
All operations work with a specific data storage mechanism that is incorporated into the framework architecture. The storage implements a shared memory abstraction for source data and processing results. One variant of shared memory is implemented on a shared file system, which is common to many cluster architectures. The storage interface provides operations for writing and reading data. There is also an intermediate storage mechanism in the local memory of each processor, where the results of that processor's operations are stored. It reduces the time spent on reading variables in the case of repeated access.
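The two storage levels described above can be sketched as follows. This is an illustrative model only: a dictionary stands in for the shared file system, the class names `SharedStorage` and `CachedStorage` are hypothetical, and invalidation of stale local copies is deliberately left out.

```python
class SharedStorage:
    """Stand-in for the shared file system: a plain dict keyed by object name."""

    def __init__(self):
        self._objects = {}
        self.reads = 0  # counts accesses to the (slow) shared level

    def write(self, name, value):
        self._objects[name] = value

    def read(self, name):
        self.reads += 1
        return self._objects[name]


class CachedStorage:
    """Per-processor view that keeps local copies to avoid repeated shared reads."""

    def __init__(self, shared):
        self._shared = shared
        self._local = {}

    def write(self, name, value):
        self._local[name] = value        # keep the result locally
        self._shared.write(name, value)  # and publish it to the shared level

    def read(self, name):
        if name not in self._local:
            self._local[name] = self._shared.read(name)
        return self._local[name]


shared = SharedStorage()
shared.write("mask", [0, 1, 1, 0])
node = CachedStorage(shared)
node.read("mask")
node.read("mask")
print(shared.reads)  # the second read is served from local memory
```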
The parameters of an operation are read from the storage. Each parameter is identified by its object name, represented as a string. Each parameter value is placed into the corresponding internal grain variable, so that all parameters together form the calling context. The operation is then executed and the processing results are placed in the storage. From this moment these values are accessible to other grains in the parallel application.
The grains are collected in specific libraries that are dynamically linked into the parallel application. A grain is loaded from the library when needed and is identified by the operation name. Implementing specific grain libraries for different classes of processing algorithms makes it possible to expand the application area of the proposed framework.
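The calling-context mechanism described above might look roughly like the sketch below. The class `Grain`, its `resolve` rule (a token that parses as a number is a constant, anything else is an object name) and the space-separated parameter string are assumptions made for illustration; the paper does not specify the actual parameter syntax.

```python
class Grain:
    """Hypothetical grain: wraps a plain function and builds its calling context."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    @staticmethod
    def resolve(token, storage):
        """Treat a token as a numeric constant if it parses, else as an object name."""
        try:
            return float(token)
        except ValueError:
            return storage[token]

    def run(self, param_string, result_name, storage):
        # Build the calling context from the parameter string.
        context = [self.resolve(tok, storage) for tok in param_string.split()]
        # Execute the operation and publish the result for other grains.
        storage[result_name] = self.func(*context)
        return storage[result_name]


storage = {"image": [1.0, 2.0, 3.0]}
scale = Grain("scale", lambda data, k: [k * x for x in data])
scale.run("image 2", "scaled", storage)
print(storage["scaled"])
```

Wrapping an existing sequential routine this way leaves the routine itself untouched, which matches the stated goal of adapting existing algorithms quickly.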
An example of a parallel program graph is presented in Fig. 2, where each operation is denoted as Ci and has a cost vector. The cost of information transfer between adjacent operations is equal for all data types. Some operations are strictly bound to a specific processor, while others can be placed on any processor in the cluster. If the operation cost for some type of data is equal to zero, then this operation must be skipped for that type of data. The matrix of restrictions is used in the optimization procedures and prevents an erroneous allocation of the specified operations to certain processors. The restrictions arise from the heterogeneous cluster structure and the different requirements of the operations.
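A small sketch of how cost vectors and the matrix of restrictions could be represented is given below. The concrete numbers, the dictionary layout and the helper names are hypothetical; only the two rules (zero cost means skip, a restricted placement is invalid) come from the text.

```python
# cost[op][t]: cost of operation op for data type t (0 means "skip for this type").
cost = {
    "C1": [4, 0],
    "C2": [2, 3],
}
# allowed[op][p]: True if operation op may be placed on processor p.
allowed = {
    "C1": [True, False, False],  # tied to processor 0
    "C2": [True, True, True],    # may run anywhere
}


def operations_for_type(data_type):
    """Operations that actually run for this data type (non-zero cost)."""
    return [op for op, vec in cost.items() if vec[data_type] != 0]


def is_valid_allocation(allocation):
    """Check an {operation: processor} mapping against the restriction matrix."""
    return all(allowed[op][proc] for op, proc in allocation.items())


print(operations_for_type(1))
print(is_valid_allocation({"C1": 1, "C2": 0}))
```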
The main task of the multi-agent system is to plan and optimize the execution of a parallel program while ensuring reliable computations and guaranteed processing. There are many DAG scheduling algorithms that use various optimization techniques and heuristics. The techniques include priority-based list scheduling, for example the HLF (Highest Level First), LP (Longest Path) and CP (Critical Path) algorithms [18][19][20]. Another technique is clustering, to which algorithms such as DSC (Dominating Sequence Clustering) [21,22] belong. However, static scheduling algorithms are mostly constructed for special graph topologies or use special constraints, such as zero communication time between nodes or an unbounded number of processors. Because of the stochastic nature of the input information, the static scheduling approach cannot provide effective optimization in many cases of parallel processing.
Other promising search techniques use evolutionary optimization. These techniques are based on algorithms such as tabu search [23], simulated annealing and genetic algorithms. GA combined with the VAN algorithm is the most powerful technique among them. The VAN algorithm is based on the concept of associations between particular operations and dedicated processors [24]. Each operation O and processor P are associated by means of a virtual link of strength $\omega_{O,P}$. An associative memory structure consisting of these associations is constructed for the optimization. This memory learns from the experience accumulated in the solution search process. The VAN algorithm is based on the GA representation of solutions in the form of a population of chromosomes. Each chromosome represents one variant of the program graph decomposition.
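To make the chromosome representation concrete, the sketch below encodes a decomposition as a list in which position i holds the processor assigned to operation i, and scores it by the load of the busiest processor. This fitness function is a simplifying assumption made here (it ignores precedence and communication) and is not the evaluation used by the VAN algorithm itself.

```python
def schedule_cost(chromosome, op_cost, n_processors):
    """Estimate a schedule's cost as the load of the busiest processor.

    chromosome[i] is the processor assigned to operation i.
    Precedence and communication are ignored; this is only a rough proxy.
    """
    load = [0.0] * n_processors
    for op, proc in enumerate(chromosome):
        load[proc] += op_cost[op]
    return max(load)


op_cost = [4, 2, 3, 1]   # hypothetical operation costs
balanced = [0, 1, 1, 0]  # loads: P0 = 5, P1 = 5
skewed = [0, 0, 0, 1]    # loads: P0 = 9, P1 = 1
print(schedule_cost(balanced, op_cost, 2), schedule_cost(skewed, op_cost, 2))
```

A genetic search would prefer `balanced` over `skewed`; in a VAN-style scheme such outcomes would additionally strengthen the links between the operations and the processors that produced the better schedule.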

THE FRAMEWORK ARCHITECTURE AND AGENT BEHAVIOR
The architecture of the agent framework for parallel processing is presented in Fig. 3.
The framework architecture is based on MPI and allows fast communication between agents by means of an internal MPI virtual machine. The basic multi-agent structure is presented in Fig. 4.
The input for the multi-agent system is a parallel program graph and a set of data objects of different types. The parallel graph is represented as an XML file, which makes it possible to specify all the characteristics of the individual operations for all types of data objects. Each agent parses the whole graph and builds internal structures that are used to execute the specified operations with the correct parameters for each object type. These operations are represented by descriptors, which are used by a scheduler to control the precedence relations and the overall process.
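The paper does not give the XML schema, so the following sketch uses an invented one (`graph`, `operation`, `cost`, `edge` elements and their attributes are all assumptions) purely to show how an agent might turn such a file into operation descriptors and precedence edges with Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical graph description; element and attribute names are invented.
GRAPH_XML = """
<graph>
  <operation name="C1" grain="pupil_function">
    <cost type="small">4</cost>
    <cost type="large">9</cost>
  </operation>
  <operation name="C2" grain="intensity">
    <cost type="small">2</cost>
    <cost type="large">0</cost>
  </operation>
  <edge from="C1" to="C2"/>
</graph>
"""


def parse_graph(xml_text):
    """Build operation descriptors and a precedence list from the XML text."""
    root = ET.fromstring(xml_text)
    operations = {}
    for op in root.findall("operation"):
        operations[op.get("name")] = {
            "grain": op.get("grain"),
            # One cost entry per data object type.
            "cost": {c.get("type"): float(c.text) for c in op.findall("cost")},
        }
    edges = [(e.get("from"), e.get("to")) for e in root.findall("edge")]
    return operations, edges


operations, edges = parse_graph(GRAPH_XML)
print(operations["C1"]["cost"], edges)
```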
A data object is also represented by a descriptor. The descriptor contains an identifier, type attributes and some additional information, for example the name of the data file that contains the information for this object. When an operation requires additional data for processing, the descriptor must be extended appropriately for the specific application.
Since the descriptors are transferred between processors of the parallel application, the application code must contain serialization mechanisms. These mechanisms are implemented for interaction with the MPI messaging facilities. The code is included in a message transfer interface that is extensible and allows the use of alternative message transport systems. Data objects are stored in a shared data storage that is implemented as a descriptor storage. Each descriptor is linked with a universal container for storing different data objects. The storage interface allows each agent in the system to interact with the global storage. This interface provides facilities for object search, on-demand loading of remote objects and deletion of unused objects from the storage. Each agent has a local copy of the storage and uses it as a write-through cache.
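A descriptor with a serialization round trip could be sketched as follows. The field names and the choice of JSON as the wire format are assumptions for illustration; the paper only states that serialization is implemented for use with MPI messaging, not how.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DataDescriptor:
    """Hypothetical data-object descriptor: identifier, type and extra attributes."""
    object_id: int
    object_type: str
    extra: dict = field(default_factory=dict)  # e.g. the name of a data file

    def to_bytes(self):
        """Encode the descriptor as a byte string suitable for a message payload."""
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_bytes(cls, payload):
        """Rebuild a descriptor from a received payload."""
        return cls(**json.loads(payload.decode("utf-8")))


sent = DataDescriptor(7, "mask", {"file": "mask_007.dat"})
received = DataDescriptor.from_bytes(sent.to_bytes())
print(received == sent)
```

Keeping the encoding behind two methods is one way to leave the transport replaceable, in the spirit of the extensible message transfer interface mentioned above.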
The purpose of the scheduler interface is to control processing and scheduling. It contains a special component, called the scheduler, which decides which processing operation must be placed next in the descriptor queue. All descriptors of the operations for the processed objects are stored in the descriptor storage, which contains three sets of descriptors: the ready, working and finished pools. The scheduler chooses the next operation to be processed from the ready pool and sends its descriptor to an appropriate processor agent. After processing, this descriptor is placed in the finished pool and the information about the next stage of processing is updated. The process is repeated as long as the ready pool is not empty.
To ensure reliable computations, there is an intermediate working pool of descriptors. This pool is used to mark the descriptors that are currently being executed by agents. When an agent fails, the corresponding descriptor remains in this pool for a long period of time. The scheduler periodically checks the state of the descriptors and moves such waiting descriptors back to the ready pool. These descriptors can then be allocated to a different working agent.
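The three-pool bookkeeping and the recovery of stalled descriptors can be sketched as below. The class name, the fixed `timeout` threshold and the explicit `now` argument (used here so the behaviour can be checked without waiting) are assumptions; the paper does not state how long a descriptor may stay in the working pool.

```python
import time


class Scheduler:
    """Three-pool bookkeeping: ready -> working -> finished, with a timeout."""

    def __init__(self, descriptors, timeout):
        self.ready = list(descriptors)
        self.working = {}   # descriptor -> time it was handed out
        self.finished = []
        self.timeout = timeout

    def next_operation(self, now=None):
        """Hand out the next ready descriptor, or None if the pool is empty."""
        if not self.ready:
            return None
        desc = self.ready.pop(0)
        self.working[desc] = time.monotonic() if now is None else now
        return desc

    def complete(self, desc):
        """Move a descriptor returned by an agent to the finished pool."""
        if self.working.pop(desc, None) is not None:
            self.finished.append(desc)

    def recover_stalled(self, now=None):
        """Move descriptors that stayed in the working pool too long back to ready."""
        now = time.monotonic() if now is None else now
        stalled = [d for d, t in self.working.items() if now - t > self.timeout]
        for desc in stalled:
            del self.working[desc]
            self.ready.append(desc)
        return stalled


s = Scheduler(["a", "b"], timeout=5)
s.next_operation(now=0)
s.next_operation(now=1)
s.complete("a")                   # the agent holding "b" never reports back
print(s.recover_stalled(now=7))   # "b" is returned to the ready pool
```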
The agents are dynamically linked to the library of image processing operations. Each processor executes the operations that are specified by the descriptors. The processor receives a descriptor from the coordinator, determines the next operation and executes it using the descriptor data. After the data processing is complete, the descriptor is returned to the scheduler. The processor works until a stop instruction is received.
Besides coordinating the process, the runtime agents check the system state and characteristics. These characteristics are collected and used for runtime optimization. The optimization is based on measuring the data processing speed. When the input data change the system pattern significantly, the system must adapt to this situation. The adaptation consists in reconfiguring the operation subsets of all processor agents. The system tries to adapt to the changed conditions and to achieve a high processing speed.
The agents use two different policies to choose the next operation from the descriptor pool. The first consists in choosing operations on the basis of the agent's preferences. These preferences are formed during operation by means of the VAN algorithm. Each agent has a vector of weights, which determines the probability of selecting operation O for agent A. When the agent performs some operation and its performance characteristics improve, the corresponding operation weight is corrected using a learning coefficient α. The weights of the remaining operations are corrected as well, by an amount that depends on N, the overall number of agents. The agent can choose from a subset of operations, taking the first ready operation. The second policy is a greedy one, which consists in choosing the first ready operation from the ready pool. This policy is introduced to eliminate the situation in which some descriptors are not chosen for a long time. The greedy agents execute these operations and can later choose this operation type as preferable. Each agent can switch between the two scheduling policies randomly.
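The two selection policies can be illustrated with the sketch below. The exact selection probability and weight-update formulas are not available in this text, so the rules used here are assumptions: selection is proportional to the weights, a rewarded weight moves toward 1 by the learning coefficient `alpha`, and the other weights shrink so that the sum stays 1. In particular, this assumed rule does not use the number of agents N that the original correction refers to. The parameter `greedy_prob` is likewise hypothetical.

```python
import random


class Agent:
    """Agent that mixes a preference policy with a greedy policy (assumed rules)."""

    def __init__(self, operations, alpha=0.1, greedy_prob=0.2, rng=None):
        self.weights = {op: 1.0 / len(operations) for op in operations}
        self.alpha = alpha
        self.greedy_prob = greedy_prob
        self.rng = rng or random.Random()

    def choose(self, ready_pool):
        """Pick an operation from the ready pool (a list of operation names)."""
        if not ready_pool:
            return None
        if self.rng.random() < self.greedy_prob:
            return ready_pool[0]  # greedy: first ready operation
        # Preference policy: sample in proportion to the agent's weights.
        weights = [self.weights.get(op, 0.0) for op in ready_pool]
        if sum(weights) == 0:
            return ready_pool[0]
        return self.rng.choices(ready_pool, weights=weights, k=1)[0]

    def reward(self, op):
        """Assumed update: shift weight toward op and keep the weights summing to 1."""
        for name in self.weights:
            if name == op:
                self.weights[name] += self.alpha * (1.0 - self.weights[name])
            else:
                self.weights[name] *= (1.0 - self.alpha)


agent = Agent(["C1", "C2", "C3"], alpha=0.5, greedy_prob=0.0, rng=random.Random(0))
agent.reward("C1")
print(agent.weights)
```

With `greedy_prob` set to 1.0 the agent always takes the first ready descriptor, which is the behaviour the greedy policy uses to prevent descriptors from waiting indefinitely.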

EXAMPLE APPLICATIONS AND EXPERIMENTAL RESULTS
For the simulation we use a 20×20 µm topological structure (Fig. 5) with a puncture defect, where black indicates the exposed part on the positive photoresist UV 210 (Shipley). The results of the simulation are shown in Fig. 6. The X-axis is the distance from the center point of the object, and the Y-axis is the value of the light intensity.
Experiments were conducted to measure the simulation performance as a function of the size of the source data and the number of processors used. The simulation results were checked by comparison with the results obtained by the leading industry microlithography simulator SIGMA C (Photronics, Inc.) on VLSI layouts with defects, according to the SEMI-P22-0699 standard, that had been detected by EM-6329 [25].
The first group of experiments showed the dependence of the calculation time on the number of processors allocated to run the application (Table 1). The overall speedup is about 3 times for the case of 4 processors. This result is due to the fact that the parallel application has a nonlinear structure with operations of different duration, so its parallelism is also non-linear. Nevertheless, the resulting acceleration can substantially speed up the processing of the photomasks.
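As a rough reading of the reported figure, the parallel efficiency and, under the additional assumption that Amdahl's law applies (an interpretation introduced here, not an analysis from the paper), the implied serial fraction $f$ can be worked out from a speedup $S = 3$ on $p = 4$ processors:

```latex
E = \frac{S}{p} = \frac{3}{4} = 0.75,
\qquad
S = \frac{1}{f + \dfrac{1-f}{p}}
\;\Rightarrow\;
f + \frac{1-f}{4} = \frac{1}{3}
\;\Rightarrow\;
f = \frac{1}{9} \approx 0.11
```

In other words, a speedup of 3 on 4 processors would be consistent with roughly one ninth of the work being effectively sequential, if the unequal operation durations are lumped into a single serial term.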
The second group of experiments shows the dependence of the performance of the parallel application on the size of the input data (Table 2; the number of objects equals 100 and the number of processors is 4). Judging by the results, the algorithm shows almost linear scalability. When the number of points of the pupil increases by 2 times in the X and Y coordinates, the amount of data to be processed increases by 4 times. In this case, the processing time increases by 3.16 times for 100 points, and by 3.42 times for 200 points (relative to the time for 100 points). This demonstrates the scalability of the algorithm, since the processing time depends almost linearly on the incoming data volume.
The results of the experiments with the scenarios for calculating the characteristics of aerial images are shown in Table 3 (the processing time of the object flow in seconds). The optimum time is achieved with 2 processors. This is most likely due to the large volume of communications between the operations. In this case, a high degree of locality is required for quick access to the data.
The scenario was tested on a sequence of 20 images. The resulting execution times are presented in Table 4. The results indicate a high degree of parallelism in the scenario, which makes it possible to achieve a substantial speedup of processing even for relatively short flows of simulated images.
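The pupil-grid scaling figures quoted above (a fourfold data increase with time ratios of 3.16 and 3.42) can be turned into an empirical scaling exponent k in time ∝ data^k. The helper below is an illustrative calculation added here, not part of the reported experiments; an exponent of 1 would mean strictly linear growth.

```python
import math


def scaling_exponent(data_ratio, time_ratio):
    """Exponent k in time ~ data**k implied by one pair of ratios."""
    return math.log(time_ratio) / math.log(data_ratio)


for label, time_ratio in [("first case", 3.16), ("second case", 3.42)]:
    print(label, round(scaling_exponent(4, time_ratio), 2))
```

Both exponents come out slightly below 1, so the processing time grows a little more slowly than the data volume in these two measurements.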
The experimental data flows had an irregular structure and were generated randomly. The results of the static optimization for deterministic flows are presented as a comparison of the stream processing times for the schedules obtained by the classical GA and by the VAN algorithm. The results are given for different numbers of CPUs involved in the processing (Fig. 7).

Fig. 7 - Performance improvement for the static optimization
The optimal static schedules obtained in the first series of experiments were used in the second series of experiments with stochastic flows. These schedules were compared with the schedules implemented by dynamically reconfigurable applications (Fig. 8). The results show the change in processing time for the static (S) and dynamic (D) VAN algorithm.

Fig. 8 - Performance improvement for the dynamic optimization
The results of the first series of experiments indicate that the VAN algorithm finds the best schedules, and that the performance of the algorithm increases as the search space grows. The VAN algorithm also has a lower computational complexity and finds solutions faster than the classical GA.
The results of the second series of experiments show that, through the operation of the dynamic optimization system, the virtual network algorithm significantly improves performance for a stochastic processing flow of images.

CONCLUSION
The implementation of parallel applications in specialized component architectures allows users to significantly accelerate the process of creating, analyzing and optimizing programs. An architecture for data stream processing based on multi-agent systems can be easily adapted to many applications involving parallel processing. The component approach defines a flexible framework for a parallel application that allows the computational process to be adapted to the operation. The design tools can be easily extended with new operations, algorithms and data types in order to implement applications that process information flows from different subject areas.