
CHAPTER 5a Interprocess Communication (IPC)


A running program -- a process -- may well be functional by itself. But it's when processes communicate with each other that we take full advantage of the power of today's computing systems. As Sun Microsystems famously put it, "The network is the computer." Modern computing applications fetch data from remote sources, distribute tasks between clients and servers, and present results to users sitting in front of networked, remote displays.

The D2D component of the WFO-Advanced system is a "modern computing application" in the foregoing regard. Our software reads data from remote sources (the data ingest processes), distributes tasks between clients and servers (the communications router, the notification server, or the IGC_Process, for example), and presents results to users on workstation and X terminal displays.

Interprocess communication (IPC) of this sort is such a fundamental part of the WFO-Advanced D2D software that we have developed an object-oriented library to make the sending and receiving of data relatively simple. In fact, 60% of the applications in the D2D software link with our IPC library.

The IPC library developed for the D2D has three major components:

Recognizing that various IPC transport mechanisms may exist simultaneously, and that some are better suited for certain communications tasks than others, the D2D IPC library includes support for multiple, concurrent IPC mechanisms. The library hides the underlying IPC transport from an application programmer and automatically chooses the best one for each message to be sent.

Currently, the software's decision of which mechanism to use is made quite easy: our IPC library has never supported more than one transport mechanism at a time. Since the birth of D2D, we have used three different transport mechanisms, one after another. Each successive mechanism has satisfied the requirements better than its predecessor, so we haven't found a need to support two or more mechanisms concurrently. This may change as more reliable and higher-performing mechanisms emerge.

The rest of this chapter:

5a.1 IPC Requirements

This section describes the major requirements for IPC that surfaced in earlier incarnations and prototypes of D2D. Our first two implementations did not sufficiently satisfy all of these requirements, and this was the main impetus for evolving to our current implementation.

5a.1.1 Asynchronous Sends (Fire & Forget)

Clients of the IPC library should not be involved in the intricacies of queueing and resending data if the receiving process is not ready to accept and process messages. The sending process may have some critical processing to perform and cannot wait for the message to be received. The transport mechanism must hold on to messages until they can be sent without blocking the execution of the sending process. The mechanism should also be able to detect when a receiving process is not responding because of a normal termination, a crash, or a hang (infinite loop, for example). In that case, the transport mechanism should abort the sending of all the queued messages.

This requirement is crucial. In fact, the D2D data ingest system relies on this feature -- 14 of the 24 processes that comprise the data ingest system will not function without it.

5a.1.2 Synchronous Sends

In some rare cases, a client may not want to resume processing until it is certain that the IPC message has arrived at its destination and is processed. The transport mechanism should not queue any additional messages until the initial message arrives.

5a.1.3 Message Priority

With our IPC library, a client can specify two levels of priority for a message: high and normal. If a receiving process is responsive, then priority is not much of an issue because a message will be received almost immediately after it is sent if a fast transport mechanism is used. However, if the process is not responsive, high priority messages should be queued, sent, and received before messages with normal priority.
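The ordering rule can be illustrated with a minimal Python sketch. The class and method names (`PriorityMailbox`, `enqueue`, `dequeue`) are hypothetical and are not taken from the D2D IPC library; the sketch only shows that queued high priority messages drain before normal ones.

```python
from collections import deque

HIGH, NORMAL = "high", "normal"  # the two priority levels named in the text


class PriorityMailbox:
    """Hypothetical per-destination queue: high-priority messages drain first."""

    def __init__(self):
        self._queues = {HIGH: deque(), NORMAL: deque()}

    def enqueue(self, message, priority=NORMAL):
        self._queues[priority].append(message)

    def dequeue(self):
        """Return the oldest high-priority message, else the oldest normal one."""
        for level in (HIGH, NORMAL):
            if self._queues[level]:
                return self._queues[level].popleft()
        return None  # nothing pending


# Messages pile up while the receiver is unresponsive, then drain in order.
box = PriorityMailbox()
box.enqueue("product update")
box.enqueue("radar alert", priority=HIGH)
box.enqueue("pane swap")
drained = [box.dequeue() for _ in range(3)]
```

Within one priority level the sketch keeps arrival order, which is an assumption; the text does not say how messages of equal priority are ordered.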

Currently, the vast majority of our IPC messages are specified to be at normal priority. In fact, the only type of message that is currently of a high priority is for relaying radar alert data.

5a.1.4 Message Selection

A client may want to select which messages to receive and process, and defer the processing of messages that may arrive before a selected message. Our IPC library allows a client to select messages from a particular sending process.

For example, a text depictable in an IGC_Process may need data from a TextDB process running on some remote host. The IGC sends an IPC message to the TextDB requesting the data. The IGC does not want to do any more processing until the reply from the TextDB arrives, so it chooses to receive messages only from the TextDB process.
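A minimal Python sketch of this kind of selection follows. The `SelectiveReceiver` class and the process names used as sender labels are illustrative assumptions, not the library's interface; the point is only that messages from other senders are set aside, not lost, while the client waits for one particular sender.

```python
from collections import deque


class SelectiveReceiver:
    """Hypothetical receiver that waits for one sender and defers the rest."""

    def __init__(self):
        self._deferred = deque()  # messages set aside while selecting

    def receive_from(self, wanted_sender, incoming):
        """Consume (sender, payload) pairs until one from wanted_sender arrives."""
        for sender, payload in incoming:
            if sender == wanted_sender:
                return payload
            self._deferred.append((sender, payload))
        return None  # the stream ended before the wanted message arrived

    def drain_deferred(self):
        """Hand back deferred messages in their original arrival order."""
        while self._deferred:
            yield self._deferred.popleft()


rx = SelectiveReceiver()
arrivals = iter([
    ("notificationServer", "new data"),
    ("UI", "swap pane"),
    ("TextDB", "text product"),
])
reply = rx.receive_from("TextDB", arrivals)  # blocks out everything else
later = list(rx.drain_deferred())            # processed once the reply is in
```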

5a.1.5 Friendly to Event Multiplexors

Event multiplexors wait for input, output, or exceptional conditions on a number of data sources at the same time. These are essential to efficient operation of D2D software, particularly the display software which needs to multiplex I/O with the X window display, the keypad, external applications, timers, and the IPC library.

In a UNIX environment, most event multiplexors are implemented with a select() call which uses file descriptors to identify the devices to multiplex. Therefore, the IPC transport mechanism must provide file descriptors for the devices that receive IPC messages.
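As a rough illustration of why file descriptors matter, the Python sketch below waits on two descriptors at once with `select()`. The two socket pairs merely stand in for independent event sources; the labels "ipc" and "x" are assumptions made for the example.

```python
import select
import socket

# Two socket pairs stand in for independent event sources (say, IPC and X).
ipc_rx, ipc_tx = socket.socketpair()
x_rx, x_tx = socket.socketpair()
sources = {ipc_rx: "ipc", x_rx: "x"}

ipc_tx.sendall(b"ipc message")  # only the "ipc" source has data pending

# Sleep (for up to one second) until at least one descriptor is readable.
readable, _, _ = select.select(list(sources), [], [], 1.0)
received = {sources[sock]: sock.recv(64) for sock in readable}

for sock in (ipc_rx, ipc_tx, x_rx, x_tx):
    sock.close()
```

The process consumes no CPU while it is blocked inside `select()`, which is the contrast with the polling loop described next.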

The alternative to using an event multiplexor is to poll each of the input sources separately in a loop, which wastes CPU time. A single D2D graphics workstation runs 24 processes (more as users start extensions and applications), and each of them would have to poll for input, so the waste would be considerable.

5a.1.6 Unlimited Message Size

The client should never have to worry about whether an IPC message is too big for a transport device. If it is, the transport mechanism should fragment the message and then reassemble the fragments on the receiving end.
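One way to picture fragmentation and reassembly is the Python sketch below. The `(index, total, chunk)` fragment layout and the function names are assumptions chosen for illustration; the text does not specify how the library labels its fragments.

```python
def fragment(payload: bytes, max_size: int):
    """Split payload into (index, total, chunk) fragments of at most max_size bytes."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    chunks = [payload[i:i + max_size] for i in range(0, len(payload), max_size)]
    if not chunks:
        chunks = [b""]  # an empty message still travels as one fragment
    total = len(chunks)
    return [(index, total, chunk) for index, chunk in enumerate(chunks)]


def reassemble(fragments):
    """Rebuild the payload; fragments may arrive in any order."""
    ordered = sorted(fragments, key=lambda frag: frag[0])
    total = ordered[0][1]
    if len(ordered) != total:
        raise ValueError("missing fragments")
    return b"".join(chunk for _, _, chunk in ordered)


message = bytes(range(256)) * 300        # 76,800 bytes, larger than 64k
pieces = fragment(message, 32 * 1024)    # transport limit assumed to be 32 KB
rebuilt = reassemble(list(reversed(pieces)))
```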

The number of bytes for most IPC messages passed between D2D processes is rather small. However, some extensions such as warnGen and the alertAreaEditor can potentially pass large messages (> 64k) to the IGC process.

5a.1.7 Reliable and Maintainable

The IPC mechanism should always be accessible and require no special permissions to use except to provide security or authentication. It should take as few process slots as possible to run, preferably none. The only reasons a message cannot be sent may be that the receiving process is down or hung, the operating system is down, the machine on which it runs is down, or the network is down.

If one or more auxiliary processes are used to implement the IPC mechanism, then these processes should be able to restart without having to restart the client processes. In other words, these processes should be able to save and restore their state. The IPC mechanism should be able to detect an unresponsive IPC server process, and restart that process if necessary.

As far as being maintainable, the mechanism should not be a black box. It should either have a well-documented theory of operation or provide source code to its implementation.

The mechanism should require little system administrator assistance. It should use familiar forms of addressing.

5a.1.8 Performance

Initially, the needs of the IPC library were modest; today, we are sending messages all the time. In just the Denver WFO, literally hundreds of processes are involved, passing messages with a variety of sizes to each other. Here are some examples where fast IPC performance is of vital importance.

Users expect that green times displayed on the user interface product menus or the display of a product on an IGC will be updated as soon as new data for that product arrives. The expediency of auto-notification depends heavily on IPC. The decoders notify the notification server, which in turn notifies the IGCs, the UI, and the volume browser. The notification server sends and receives thousands of messages per hour; some of these messages can be quite large.

Remember, the UI and the display of data is divided into separate processes, although the user may not be aware of this. If the user selects a product for display, or a pane to swap with the large pane, he/she expects the results almost immediately. An average user response time of more than two seconds would register serious complaints among our users. Since the IGC may spend a good portion of the two seconds reading and displaying data, rapidly sending the message from the UI to the IGC and back to the UI is crucial.

Like the UI and IGC, extension interactivity is partitioned between the IGC and an extension process. One of the most time-critical tasks for a meteorologist is the generation of warnings, which is managed by the warnGen extension. Potentially very large IPC messages are sent between the IGC and warnGen processes. If the messages take too long to send and receive, the user will be bogged down with sluggish responsiveness.

While it would be nice to state this requirement as "throughput of at least 30 kilobytes per second," or some other figure, this requirement depends on the user's perceptions of system responsiveness, and the speed of non-IPC processing.

5a.2 History of the IPC Library

The transport and addressing mechanisms of the IPC library have evolved quite dramatically during D2D's short development history. However, the message encapsulation component of the library has not changed nearly as much. This section will focus on the three transport mechanisms and the reasons for switching mechanisms. Addressing mechanisms are dependent on the transport mechanism, and have been changed only to accommodate a new transport mechanism.

Granted, approaching UNIX IPC can be a bit difficult because of all the different mechanisms it provides. Since UNIX had a colorful evolution, the IPC features it provides are equally colorful. There are a number of IPC mechanisms directly provided by the UNIX operating system, each with its own features and blemishes. There are also a number of packages, commercial and free, that layer themselves atop the UNIX IPC mechanisms. These "middleware" packages make using UNIX IPC easier or provide additional features not directly provided by UNIX itself. Since the beginning of our development, we have used three distinct mechanism implementations: one utilizes a middleware package, while the other two have been built on top of native UNIX mechanisms.

5a.2.1 DEC Message Queue (DMQ)

The first IPC implementation was built utilizing a commercial middleware package called MessageQ, DECmessageQ, or DMQ, made by Digital Equipment Corporation (DEC).

5a.2.1.1 Design Overview


Figure 5a.1 DMQ Transport Design

To send a message, a client process makes a call to the DMQ application programmers interface (API), specifying the address, the message attributes, and the message data. If the send is synchronous, the API routine will return when the message has been dequeued by the destination process. If the send is asynchronous, the routine will return immediately. The message is placed on the DMQ bus and sent to one of the DMQ queuing processes. The queuing processes examine the address and determine whether the message is for a process on the local host or some remote host. If it is for a local process, it is added to the back of the queue dedicated to the destination process. For a remote process, the message is sent via the bus to a queuing process on the remote host. That queuing process then inserts the message into the proper queue. The destination process does not have to be running.

To receive a message, a client process makes a call to the DMQ API, passing a time-out value, in deciseconds. The routine will not return until a message has been received or the time-out has been exceeded. Optionally, the client can pass an address, which tells the routine to return a message only from a process with that address. The DMQ routine makes a request via the bus to a queuing process for a message. If there is a message on the queue for the receiving process, it is sent back via the bus.

5a.2.1.2 Strengths

Because a queue is maintained in another process, a client process can send to a target and not wait for the destination process to receive the message. The DMQ queueing process is even smart enough to write the message to disk if the destination process is not running, and then restore and send the message when the process does come back up. Optionally, a sending process can block its execution until the message has arrived at the receiving process. Therefore, this transport mechanism successfully satisfies the requirement of both synchronous and asynchronous sends.

DMQ supports queueing processes with multiple queues, which makes it possible to separately receive high- and normal-priority messages, satisfying another important IPC requirement.

Receiving specific messages distinguished by address is another strong feature of the DMQ transport mechanism. The client process can ask the DMQ queuing process for a message with a particular address, leaving all the other messages still queued.

5a.2.1.3 Weaknesses

In Figure 5a.1, the main components of the DMQ architecture are depicted as black boxes since DEC has supplied minimal information about the internals of these pieces. For example, the DMQ bus might be a TCP socket, a UNIX Domain socket, or something entirely different. Likewise, all we know about the queueing processes is that there are at least three running simultaneously, and sometimes up to six processes. Little is known about what these processes do, or the internal details of the queues; this makes this transport mechanism extremely difficult for D2D developers to maintain.

Another major weakness is that the DMQ transport is definitely not event multiplexor friendly. A UNIX select() call cannot be used to determine when data has arrived on the DMQ bus, since DEC does not supply a file descriptor for the bus. This has tremendous performance ramifications on the D2D display processes, which receive events from multiple sources: IPC, stdin and stdout (application interface), X, timer events, etc. These display processes are forced to resort to polling, which consumes a lot of unnecessary CPU time.

Unfortunately, the DMQ bus has a limitation of 32 kilobytes. The software we wrote around DMQ does not support automatically fragmenting and reassembling messages, so the responsibility falls to the users of the IPC library. This is not desirable since IPC clients have enough to worry about.

Besides requiring that system administrators install and maintain the product, DMQ layers an additional level of addressing on top of that already provided by every UNIX workstation. This makes debugging of IPC problems more difficult because of the additional lookup or memorization needed to map from DMQ host number to host name or Internet Protocol (IP) address. Users who initiate processes using IPC are also burdened with setting an environment variable to the DMQ host number for the host of the process.

Using separate processes to queue messages allows this transport mechanism to satisfy our important requirement of asynchronous sends. However, this strategy adversely affects performance and reliability. The dynamic memory for the binary byte stream has to be allocated in the sending process, receiving process, and the intermediate queuing process. The extra context switch to the queuing process increases transmission time, especially for large messages. For reliability reasons, a queuing process is not optimal since it can crash. Even worse, DMQ queueing processes do not save their state upon a crash, which means that all client processes will need to be restarted. This is problematic since the workstation takes considerable time to initialize.

Finally, this middleware product costs money. But UNIX provides several IPC mechanisms, and we already pay for UNIX. Why shell out more money?

5a.2.2 Server Based Socket IPC

Our next IPC mechanism, delivered to the fxa-3.0 (AWIPS Build 3) release of D2D, was designed to address many of the problems discussed in Section 5a.2.1.3. It was our first attempt at using the native UNIX IPC mechanisms, such as UNIX Domain (local) sockets, and RPC.

5a.2.2.1 Design Overview

In the UNIX parlance, sockets are an endpoint of communication. After a process creates a socket, it can then connect its socket to another process's socket and exchange data. The other process may run on the same host or across the world on the Internet, depending on the type of socket used. Sockets are easy to use; the same read and write calls that work with files also work with sockets. Because they're ubiquitous, well-documented, and relatively easy to use, sockets are an attractive IPC option.

UNIX Domain sockets, also known as local sockets, enable processes on a single UNIX system to communicate. These kinds of sockets are identified by creating nodes in the file system, usually under the /tmp directory. The X Window System uses UNIX Domain sockets: programs running on the same host as the X server can connect to the X server on display 0 by using the socket /var/spool/sockets/X11/0. This kind of socket features an easy addressing scheme (nodes in the filesystem) but can transfer data between processes on a single system only.
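The filesystem-based addressing can be seen in a short Python sketch. The temporary directory and the `ipc.<pid>` file name are assumptions made for this example, not paths used by D2D.

```python
import os
import socket
import tempfile

# The socket's address is a node in the filesystem; the directory and the
# "ipc.<pid>" naming are assumptions made for this example.
directory = tempfile.mkdtemp()
path = os.path.join(directory, f"ipc.{os.getpid()}")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)                 # creates the filesystem node
server.listen(1)
node_exists = os.path.exists(path)

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)              # same-host peers address the socket by pathname
connection, _ = server.accept()

client.sendall(b"hello")
greeting = connection.recv(16)    # ordinary stream I/O from here on

for sock in (client, connection, server):
    sock.close()
os.unlink(path)                   # the node outlives the socket unless removed
os.rmdir(directory)
```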

Sun's Remote Procedure Call system, or RPC, is an interface layered above sockets and is free with HP-UX. The client sends a message by making what appears to be a procedure call. The parameters passed into the procedure are the message data. The return value of the procedure is an optional reply message from the destination process.

In order to implement asynchronous sends (fire and forget), an IPC server daemon runs on every host where our client processes need IPC. The server maintains a queue for every running client process on the server's host, which is indexed by the process address (host IP address and process ID). Each queue maintains a local socket that is connected to a client process. RPC calls are used for communication from a client process to the server. Local sockets are used for sending queued messages from the server to the client. This design is very similar to how DMQ supports asynchronous sends except for some notable improvements:


Figure 5a.2 Server Based IPC Transport Design

Here is the basic algorithm for this transport mechanism. Figure 5a.2 illustrates this algorithm.

  1. The IPC server (process name: rpc.ipcd) is invoked and initialized. Its initialization includes registering with the RPC system, and restoring its queues from the crash recovery file, if there is a file. Stored for each queue is the IPC address of the queue's process, the pathname of the process's socket, and any messages that still need to be sent. When the queue object is recreated, the server creates a socket and tries to connect to the process's socket. If this fails, the queue is removed. This scenario is very probable, especially if an IPC server daemon is down for a fairly long time.
  2. After initialization, the IPC server enters its event multiplexor. The UNIX select() call is used to determine when the RPC sockets are ready for reading, and when the local sockets for each queue are ready for writing.
  3. An IPC client process (an executable linking with the IPC library) initializes by creating a local (UNIX Domain) socket, and then binds that socket with some file whose name includes the process ID. It then registers with the server via RPC, passing its IPC address (host IP address, and process ID), and the pathname of the local socket. The RPC will return a status indicating whether the server was able to create a queue for this process and connect to the local socket. If the client cannot register with the server, then this client process will not be able to perform any IPC during the life of the process.
  4. When an IPC client wants to send a message, a message object is created with the binary stream and some message attributes. An RPC message is then sent to the IPC server that is running on the host of the destination process.
  5. An IPC server receives the RPC message sent in step 4. It then tries to queue the message using the IPC address to identify the queue object to use. If it can't find the queue, or the queue is full, then a status is delivered to the sender through the return value of the RPC. Each queue object has two queues, one for each possible priority. The priority attribute of the message is used to decide which queue to use.
  6. If the queueing of the message is successful, the server will be notified via the select() call that there is now a message to send. The selected message will be from the head of the high priority queue. If that queue is empty, then the message will be from the normal priority queue. Using the UNIX write() call, the message's binary stream is placed on the local socket for the destination process.
  7. The event multiplexor for the destination process will detect that the local socket connected to the server is now ready for reading. It will read the binary data from the socket using the UNIX read() call, and then pass the message to the dispatcher. If the client is waiting for a message from a particular target, the dispatcher will queue the message until the desired message arrives. Once a message is ready to be dispatched, it is sent to the appropriate receiver object.
  8. When a client process exits normally or abnormally, an RPC message is sent to the IPC server running on the same host as the client. This message tells the server to delete the queue for that client, and any pending messages in the queue.
  9. When the IPC server exits normally or abnormally, it closes the local socket for every queue. It then saves the state of every queue into a crash recovery file.
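Steps 1 and 9 amount to saving and restoring the queue table. The Python sketch below shows one way that could look. The JSON file format, the `"host:pid"` address strings, and the `can_connect` callback (which stands in for the attempt to reconnect to each client's local socket) are all assumptions; the format of the actual crash recovery file is not described here.

```python
import json
import os
import tempfile


def save_queues(queues, path):
    """Write every queue's state to the recovery file (step 9)."""
    with open(path, "w") as recovery_file:
        json.dump(queues, recovery_file)


def restore_queues(path, can_connect):
    """Reload queues (step 1), dropping any whose client socket no longer answers."""
    if not os.path.exists(path):
        return {}  # no recovery file: start with no queues
    with open(path) as recovery_file:
        saved = json.load(recovery_file)
    return {address: queue for address, queue in saved.items()
            if can_connect(queue["socket_path"])}


# Hypothetical queue table, keyed by "host IP:process ID".
queues = {
    "10.0.0.5:4021": {"socket_path": "/tmp/ipc.4021", "pending": ["notify A"]},
    "10.0.0.5:4388": {"socket_path": "/tmp/ipc.4388", "pending": []},
}
recovery_path = os.path.join(tempfile.mkdtemp(), "ipcd.recovery")
save_queues(queues, recovery_path)

# Pretend only process 4021 survived the outage.
restored = restore_queues(recovery_path, can_connect=lambda p: p.endswith("4021"))
```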

5a.2.2.2 Strengths

Because a queue is maintained in another process, a client process can send to a target and not wait for the destination process to receive the message. The only caveat is that the destination process has to be running, which is a minor blemish compared to DMQ. This transport mechanism successfully satisfies the requirement of asynchronous (fire and forget) sends. It is possible that the queuing process could be busy doing something else and not be responsive to the sending client, but that is unlikely since the server is not processing messages -- it is just forwarding the messages to other clients.

Local sockets and RPC are friendly to event multiplexors since a file (socket) descriptor can easily be obtained. And because this mechanism is event multiplexor friendly, a display process can truly sleep while waiting for an IPC, X, or timer event to arrive. No CPU-intensive polling is necessary.

Both UNIX domain sockets and RPC support a message size that is limited only by the amount of virtual memory to which a process has access. Although this mechanism has code to fragment and reassemble messages, it is rarely used.

The addressing for this mechanism is fairly universal. Every UNIX process has a process ID, and every UNIX host has an IP (Internet Protocol) address. This is a big improvement over DMQ.

The IPC server has separate queues for IPC messages with different priorities, satisfying an important IPC requirement.

5a.2.2.3 Weaknesses

As shown in Figure 5a.2, local sockets are used as only a one-way communication link from the server to the client. However, local sockets can support two-way communication. Thus, this transport design is using only half of the available bandwidth of the socket.

The requirement of synchronous sends is not completely satisfied with this mechanism. When the sending process returns from its synchronous send, it knows that the message has been received at the server process running on the host of the target process. However, the message has not arrived at the destination process yet; the destination process might be in the middle of some intensive processing and cannot read the local socket immediately.

Although not as mysterious as DMQ, RPC is somewhat of a black box. Unfortunately, it is difficult to tell how many internal sockets are used for its transport. That information is pertinent to the server daemon since UNIX has a limit on the number of open sockets and files a process can have. In addition, HP's implementation of RPC was flawed in HP-UX 10.10: occasionally, strange values were returned from RPC calls. The problem seems to have been fixed in HP-UX 10.20.

Using separate processes to queue messages allows this transport mechanism to satisfy our important requirement of asynchronous sends. However, this strategy adversely affects performance and reliability. The dynamic memory for the binary byte stream has to be allocated in the sending process, receiving process, and the intermediate queuing process. The extra context switch to the queuing process increases transmission time, especially for large messages. For reliability reasons, a queuing process is not optimal since it can crash. Fortunately, unlike DMQ, our server can restore its state after a crash, eliminating the need for clients to be restarted.

This design is fairly complex, and so is the resulting code, which hurts maintainability. The code to generate RPC calls is particularly complex. Initially, we planned to use the utility rpcgen to generate the RPC code. Unfortunately, it produced code of dubious quality.

The message selection requirement is not completely satisfied with this mechanism. The local socket connected to the server may contain many messages from different sending processes. There is no way to read from the middle of a socket and leave the rest of the socket data intact. Thus, undesired messages have to be read before the desired message can be read. This implementation defers the dispatching and processing of the undesired messages until the desired message arrives. This could be time-consuming and memory-intensive if the desired message takes a long time to arrive and a lot of unwanted messages arrive in the meantime. DMQ and Thread Based Socket IPC provide better solutions for satisfying selective reception of messages.

5a.2.3 Thread Based Socket IPC

Our latest and greatest IPC mechanism, delivered to the november97 (AWIPS Build 4) release of D2D, was designed to address the performance and reliability problems discussed in Section 5a.2.2.3. This time we opted for a different UNIX IPC mechanism, TCP sockets. Also, in order to support asynchronous sends, we needed a library that supplied multiple threads of execution. Our choice was HP's implementation of the DCE pthread library (not that we had much choice).

5a.2.3.1 Design Overview

Transmission Control Protocol, or TCP, sockets are another type of network-capable socket. TCP sockets establish a virtual connection between peer processes wishing to communicate. Most of the popular Internet services are implemented using TCP sockets: rlogin, telnet, http, ftp, X Windows, mail, print spooling, net news, kerberos, and more. TCP works by layering a reliable, flow-controlled, data stream protocol on top of the Internet Protocol (IP). TCP sockets are quite easy to use because they are interchangeable with the file I/O interface used to read and write data to disks and terminals. TCP sockets enable processes to communicate whether they're on a single host, on the local network, or on the Internet; the programmer doesn't have to change the program to support all these types of connections.

Instead of a queuing process, we opted for a multiple thread environment in order to support asynchronous sends. Maurice Bach in The Design of the UNIX Operating System defines a thread as the following:

Until now, most if not all applications developed here at FSL have used single-threaded processes, which have a single flow of control through program code. Processes linking with our IPC library will potentially be multi-threaded with multiple flows of control. A multi-threaded process can achieve significant performance gains through the use of concurrent thread execution. This means that two or more threads are in progress at the same time. For some HP hosts with multiple processors, such as the K series, two or more threads can be executed simultaneously.

This mechanism uses multiple threads of execution to implement asynchronous (fire and forget) sends. Here is a possible scenario without using multiple threads. Suppose the notification server process is trying to send a message to an IGC process, but the IGC is really busy doing some other processing, perhaps constructing some radar tables. The UNIX buffer between the two sockets has become full, so when the notification server tries to write to the socket, the UNIX write() call blocks execution until the IGC reads some bytes from its end of the buffer. Since execution is blocked, the notification server cannot process other notifications, which will delay the notifications sent to other workstations whose IGCs are responsive.

Now, consider the above scenario in a multi-threaded process. When the notification server detects the socket to the IGC is full, it creates a new thread of execution whose job is to keep writing to the socket until the entire message is sent. This thread will be blocked inside its write() call. Meanwhile, the main thread can proceed with the business of the notification server. When the IGC removes some bytes from the socket buffer, the socket writing thread wakes up, and receives the CPU so it can write to its socket. If the notification server keeps trying to send messages to the busy IGC, more socket writing threads may be created. Thus, the process could have a main thread and many socket writing threads. Keep in mind that thread creation does carry some performance overhead, so we want to initiate threads only when the socket buffer is full.
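The idea of starting a writer thread only when the socket buffer is full can be sketched in Python as follows. This is an illustration under assumptions, not the library's code: `send_async` is a hypothetical name, a local socket pair stands in for the TCP connection, and the sketch does not address how several writer threads on the same socket would keep their messages in order.

```python
import socket
import threading


def send_async(sock, data):
    """Try to write without blocking; hand any remainder to a writer thread.

    Returns the writer thread, or None if the whole message fit in the buffer.
    """
    sock.setblocking(False)
    try:
        sent = sock.send(data)
    except BlockingIOError:
        sent = 0  # the buffer was already full
    finally:
        sock.setblocking(True)
    if sent == len(data):
        return None  # fast path: no thread was needed
    # Slow path: this thread blocks in its write until the receiver drains.
    writer = threading.Thread(target=sock.sendall, args=(data[sent:],), daemon=True)
    writer.start()
    return writer


tx, rx = socket.socketpair()
message = b"x" * (4 * 1024 * 1024)   # far larger than the socket buffer
writer = send_async(tx, message)     # returns at once; the main thread is free

# Later, the "busy" receiver finally gets around to reading.
received = bytearray()
while len(received) < len(message):
    received += rx.recv(65536)
if writer is not None:
    writer.join()
tx.close()
rx.close()
```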

The real strength of this design is the elimination of the queueing (server) process. The socket writing threads (threads are often referred to as lightweight processes) take the place of the queueing process and are used only when really needed. The result is a dramatic performance increase over our previous two transport mechanisms. See Section 5a.2.5.

The following diagram depicts the general data flow for this mechanism. Each client process maintains a socket for every other process with which it wants to communicate. The bidirectional TCP socket can be used for both sending and receiving data on the same host or remote hosts. Unfortunately, the number of files (and/or sockets) a process can have open is limited (most of our hosts are configured to have 60, although this is a tunable kernel parameter at the expense of increased memory use). In an effort to be considerate of our clients' needs, the maximum number of file descriptors that the IPC library consumes is a third of the system max. In order to accomplish that goal, each socket object will record the time it was last accessed for either writing or reading. If a request for a new connection will exceed our 20 (or so) socket limit, the least recently used socket will be destroyed.
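The least-recently-used policy can be sketched in Python as shown below. The `ConnectionCache` class and its `connect` and `close` callbacks are hypothetical, and the sketch tracks recency with the ordering of an `OrderedDict` where the text describes recording access times; both approaches select the same socket for eviction.

```python
from collections import OrderedDict


class ConnectionCache:
    """Hypothetical LRU table of open connections, capped at max_open entries."""

    def __init__(self, max_open, connect, close):
        self._max_open = max_open
        self._connect = connect      # callable: address -> connection
        self._close = close          # callable: connection -> None
        self._open = OrderedDict()   # address -> connection, oldest first

    def get(self, address):
        if address in self._open:
            self._open.move_to_end(address)  # mark as most recently used
            return self._open[address]
        if len(self._open) >= self._max_open:
            _, stale = self._open.popitem(last=False)  # least recently used
            self._close(stale)
        connection = self._connect(address)
        self._open[address] = connection
        return connection


closed = []
cache = ConnectionCache(2, connect=lambda addr: f"conn:{addr}", close=closed.append)
cache.get("igc")
cache.get("textdb")
cache.get("igc")       # touching "igc" makes "textdb" the oldest entry
cache.get("notify")    # over the limit: "textdb" is closed to make room
```

A limit of 2 keeps the example short; the text describes a limit of about 20.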


Figure 5a.3 Thread Based IPC Transport Design

5a.2.3.2 Strengths

TCP sockets are friendly to event multiplexors since a file (socket) descriptor can easily be obtained. And because this mechanism is event multiplexor friendly, a display process can truly sleep while waiting for an IPC, X, or timer event to arrive. No CPU-intensive polling is necessary.

As mentioned already, not having a third party server process involved will increase performance and reliability.

The use of threads fully satisfies the asynchronous send requirement. And the real advantage of using threads is that they are used only when really needed; only when the destination process is not responsive. Performing a send without a thread constitutes a synchronous send, which satisfies another important requirement.

The default buffer size for a TCP socket is 32K (32,768) bytes, but can be configured to the maximum size of 256K (262,144) bytes. This mechanism uses the maximum buffer size, but that is a tunable parameter, specified in our config file (ipc.config). The IPC library does multiple socket writes and reads in order to send messages of unlimited size, satisfying another requirement of our IPC library.
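The idea of pushing an arbitrarily large message through a fixed-size socket buffer with repeated writes can be sketched as follows. This is a minimal illustration built on the POSIX write() call, not the library's own code; the helper name writeAll is hypothetical.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Hypothetical helper: keep calling write() until all `len` bytes are sent.
// Returns true on success, false on an unrecoverable error.
bool writeAll(int fd, const char* data, std::size_t len)
{
    std::size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, data + sent, len - sent);
        if (n < 0) {
            if (errno == EINTR) continue;     // interrupted by a signal: retry
            return false;                     // any other error is fatal here
        }
        sent += static_cast<std::size_t>(n);  // a short write just means "call again"
    }
    return true;
}
```

On a blocking socket, each write() call returns once some part of the message fits into the buffer, so the loop makes progress as the receiver drains its end.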

Our design allows for client processes to have two sockets between them, one for normal and one for high priority messages, although the extra socket will be created only as needed. Currently, very few clients send or receive high priority messages, although that could certainly change. Our event multiplexor has been enhanced so that it will flag high priority sockets as readable before normal priority sockets.

Since each target has its own socket, it is relatively easy to implement the requirement of selecting messages from a particular destination. The algorithm for doing this is explained in Section 5a.4.2.3.

Because we are using threads only when needed and eliminating an extra process context switch, the performance of this mechanism is quite impressive. In the words of Beavis and Butthead, "this system rocks, man!"

5a.2.3.3 Weaknesses

Like RPC, the HP DCE pthread library is somewhat of a black box, although not quite as mysterious as DMQ. We have encountered several HP bugs thus far, and there are probably more to uncover. One particularly amusing bug that HP has since fixed successfully was a math error when running an optimized multi-threaded executable on their latest and greatest(?) hardware. Another bug is not so amusing since HP refuses to fix it. The pthread library places wrappers around most system calls. The wrapper around the select() call does not work according to the specification in the man page. If select() returns due to an error, it is supposed to clear the bit masks which are returned to the caller, indicating which file descriptors are now ready. The pthread select() call does not do this, which causes some serious problems with the Tcl library, which uses select(). Fortunately for us, the Tcl consortium agreed to apply a patch which has just been released in version 8 of Tcl/Tk.

We know that doesn't sound promising for a critical library of an operational forecast system, but we are pretty optimistic about HP's future plans for operating systems that support the multi-threaded programming model. HP just announced plans to support kernel threads in HP-UX 10.30. Kernel threads are entities that are visible to the OS kernel, as opposed to user threads, which exist in user space and execute user code. Kernel threads promise to make signal handling in a multi-threaded environment more robust. We shall see about that. As we free our system from commercial off-the-shelf software (COTS), it will become more portable and will have more flexibility in pursuing superior thread libraries such as the one offered by Sun Microsystems.

5a.2.4 Requirement Comparisons

The following table compares how well the three mechanisms satisfy our requirements.

Table 5a.1 IPC Requirement Comparisons

                                DMQ   Server Socket  Thread Socket
------------------------------------------------------------------
Asynchronous Sends              Yes   Yes            Yes
Synchronous Sends               Yes   No+            Yes
Message Priority                Yes   Yes            Yes
Message Selection               Yes   No+            Yes
Friendly to Event Multiplexors  No    Yes            Yes
Unlimited Message Size          No    Yes            Yes
Reliable and Maintainable       No+   Yes-           Yes
Performs Well                   No+   Yes-           Yes
------------------------------------------------------------------

5a.2.5 Performance Comparisons

As development aids, we have implemented two test programs which are ideal for evaluating performance. The sending program sends a series of text string messages to a specified receiving process. The receiving program receives that text string, and then sends a reply back to the sending process. A timer is started when a message is sent, and is stopped when a reply for that message is received.

For the performance tests, we used an asynchronous, normal priority message containing the string, "This is a performance test". Each test sent the message 200 times; the time displayed in the following table is the average of those 200 times. We planned two tests for each of the three transport mechanisms: one for a local transmission, and one for a remote transmission (the sender and receiver running on different hosts within the development network). Unfortunately, we were not able to perform the remote test for DMQ due to address configuration problems, so five tests were run in all. The same hosts were used for all the tests, which were performed when the hosts were relatively idle.

Here are the results from those tests.

Table 5a.2 IPC Performance Comparisons

Test             Time it took in seconds to send a message   
Description      and receive a reply from that message
-----------------------------------------------------------
DMQ (local)      0.0596364                                   
DMQ (remote)     Not Available                               
Server (local)   0.0168488                                   
Server (remote)  0.0160134                                   
Thread (local)   0.0022547                                   
Thread (remote)  0.0104642                                   
-----------------------------------------------------------

5a.3 Thread Based Socket IPC Design

After several failed attempts at describing the Thread Based Socket IPC implementation, we decided to rely on the power of analogy for delivering the "big picture" view of this transport design followed by a description of the main objects and their interactions.

5a.3.1 Consider This

Suppose you are a software engineer working for the hippest 3D visualization animation software producer in the industry. Your company is desperately trying to escape a hostile takeover, and you're in charge of researching financial trends in the industry. You decide to enlist the help of some hot shot financial analyst, and unfortunately you find my number. I am an analyst with a shady Wall Street company who has been involved in covert financing of the Irish Republican Army.

Anyway, your desktop phone has multiple line capability (up to 10); each line is accessible by a button which flashes when an incoming call arrives on the line. So with my number in hand, you select a line on your phone; wait for the dial tone; dial; and voila, you are talking to my receptionist, 1800 miles away. You ask the receptionist for my extension. I have a similar phone and one of the buttons is flashing. I push the button and start listening to your company's woes.

What does this have to do with our design? With a little bit of imagination, it's not hard to see at all. The companies for which you and I work are analogous to computer processes. Obviously, communication with other companies is part of our business, but our companies have specific missions and methods for accomplishing those missions. Certainly the notificationServer and the IGC_Process are processes needing communication with other processes, but both have specific reasons for being beyond IPC. You and I are analogous to objects in a process, each having a specific task in the context of a larger mission. The multi-line phone manages connections and thus would be considered a connection manager. Our design has a connection manager except we call it a SocketConnection object. Each button on the phone manages a communication endpoint. That certainly fits the definition of a socket. Between my button and your button is an electronic telephone line (or maybe fiber-optic, but neither of us really cares about the details). Between every two sockets is a two-way socket buffer; the actual implementation of this buffer can be many things, all transparent to us.

What about the receptionist? I'm glad you asked. No actual data is transferred between him/her and you. You're just requesting a connection between me and you. The receptionist uses the same phone we are using, so s/he has access to a button/socket also. Except this button is used only for accepting and forwarding connections. Every company usually has some sort of receptionist, albeit some are automated, but usually there is only one per company. Our design has a receptionist also, except we call ours an AcceptSocket object. Like the receptionist, there is only one per process, and its sole function is to accept and forward connections.

Back to me and you. As I talk, you listen and vice versa. This is a polite conversation, even though I have a Type A personality. The words I use are converted to electronic signals; sent over the phone line; and then converted back to words. Our IPC messages start out in data formats that the client object understands, but are converted to a binary stream that garners host/architecture independence. Once the message is received, it is quantized back into words for the benefit of the client objects that process the incoming message. Once you hear my message, I may look up some figures on my PowerBook, and then reply. Our communication is two way, and we are exchanging data back and forth. The receptionist's button is analogous to a socket, but it has a special purpose, so we gave it a special name, AcceptSocket. My button and your button are also analogous to sockets. In our design, TCP sockets that are endpoints of two-way communication lines that transfer actual data messages are encapsulated in what we call DataSocket objects.

A SocketConnection object can have multiple DataSocket objects, each connected to a different process. However, like our phones, there is a limit to how many active connections a process can simultaneously maintain. What if I needed to call you, and all my lines were being used? One possible solution is to hang up on the person whose line has been idle the longest. Not a solution that will give you style points or good karma, but at least it will free up the line for that important phone call to me. Our design does the same thing. When a process has reached its limit of outside connections (usually about 20), the SocketConnection object looks for the least recently used DataSocket object and destructs it, which is analogous to hanging up on someone. But in the digital world, it is not quite as rude.

Ok, we admit it. Here's where the analogy gets kind of creepy. Suppose that I try to call you. I get through to the receptionist, but since you are using the last line for some other conversation, I get put on hold. Perhaps, you are talking to your boss about a raise, or maybe to your significant other about some new way of (never mind...). Meanwhile, I have this important message to give you, and even though the music to the Hawaii Five-O TV show is very refreshing, I have some important work to do, and can't do it effectively being on hold. So what do I do? I clone myself. My clone waits for your line to become available, and then gives you my message and kills himself. Very tragic, I know, but at least I was able to get some work done while my clone waited for you to get off the line. This sounds very similar to the problem of a process sending a message to another process that has a full socket buffer because the destination process is busy talking to its significant other, the meteorologist. Our solution is similar to the cloning idea although it doesn't have as many moral ramifications. We cut a thread whose job is to wait for the destination process to read some of the socket buffer. The thread then writes the message and exits once the message is complete. Cloning can be expensive (new technology), so we want to clone only when absolutely necessary. Likewise, thread creation carries some overhead with it, so we want to cut a thread only when the socket buffer is full.

If you can indulge this metaphor a little longer, we have one more point to deliver. Suppose you are on the phone with your significant other, and your boss knocks on your door. BUSTED! He interrupts you with some important information about a tech review that you couldn't care less about. You have a great idea. You use a clone that will receive your boss's information. Then the clone calls you up, and waits on hold until you get off the phone. The clone then gives you that very important information, but he doesn't kill himself because he may be useful later on. We handle asynchronous signals in very much the same manner. Suppose the IGC is busy rendering a depictable and a SIGUSR1 signal comes in. We have a signal thread that is always running which receives the signal. The thread then writes the signal to a pipe managed by a SignalPipe object. A pipe is very similar to a socket, and in fact is identical to a socket as far as the EventDispatcher is concerned. The IGC reads the pipe and processes the signal once the depictable rendering is complete.

5a.3.2 Object Interaction

Now that we have introduced the basic concepts, this section describes the major objects of the IPC library and how they interact.

5a.3.2.1 SocketConnection

SocketConnection objects manage a group of sockets of a particular priority. These objects are derived from the Connection class. This is to support a future possibility that our IPC library will have another transport mechanism in addition to TCP sockets.

Currently we support only two levels of message priorities, normal and high. At static initialization, a process creates a SocketConnection object for normal messages. A second SocketConnection object is created only if the process sends or receives high priority messages.

5a.3.2.2 AcceptSocket

The AcceptSocket encapsulates a TCP socket whose sole responsibility is to listen for and handle connection requests from other processes. Its parent is the normal priority SocketConnection object. There is only one AcceptSocket object active in a process. It handles connection requests by creating a new socket and passing it on to its parent.

5a.3.2.3 DataSocket

The DataSocket also encapsulates a TCP socket; however, its use differs from the AcceptSocket. DataSocket objects manage a data transmission endpoint between processes. These objects are capable of both receiving and sending data to/from a DataSocket object living in another process. Each is managed and owned by one of the two possible SocketConnection objects. Only 19 DataSocket objects can live in a process simultaneously. If the limit has been reached, and a new one is needed, then one of the two SocketConnection objects will destroy the least recently used DataSocket object.
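The least-recently-used policy can be sketched as follows. This is an illustration of the bookkeeping only, under the assumption that each connection is identified by a text address; the class LruSocketTable and its methods are hypothetical and are not part of the IPC library.

```cpp
#include <cstddef>
#include <ctime>
#include <map>
#include <string>

// Hypothetical stand-in for a table of open DataSocket objects, keyed by peer
// address, that remembers when each one was last read from or written to.
class LruSocketTable {
public:
    explicit LruSocketTable(std::size_t limit) : limit_(limit) {}

    // Record activity on `peer`, opening a slot for it if necessary.
    // Returns the peer evicted to make room, or an empty string if none.
    std::string touch(const std::string& peer, std::time_t now)
    {
        std::string evicted;
        bool isNew = lastUsed_.find(peer) == lastUsed_.end();
        if (isNew && !lastUsed_.empty() && lastUsed_.size() >= limit_) {
            auto oldest = lastUsed_.begin();
            for (auto it = lastUsed_.begin(); it != lastUsed_.end(); ++it) {
                if (it->second < oldest->second) oldest = it;
            }
            evicted = oldest->first;
            lastUsed_.erase(oldest);  // a real library would destroy the socket here
        }
        lastUsed_[peer] = now;
        return evicted;
    }

    std::size_t size() const { return lastUsed_.size(); }

private:
    std::size_t limit_;
    std::map<std::string, std::time_t> lastUsed_;
};
```

With a limit of about 20 entries, a linear scan for the oldest time stamp is cheap enough that no more elaborate data structure is needed.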

DataSocket objects can be constructed in two ways, depending on whether the object is used to initiate a connection to another process or was created to receive a connection request initiated by another process. Both constructors perform some common initialization tasks, which are executed by the private member, DataSocket::initialize(), that both constructors invoke. These tasks include initializing the data structures needed for thread management, and adjusting the socket buffer size to the value specified in the configuration file, localization/nationalData/ipc.config.

The DataSocket object also manages the threads that may be used to send to the connecting process. If a DataSocket object is destroyed, running threads are also terminated. The object also maintains mutual exclusion locks (mutexes) to ensure that only one thread at a time will be writing to a socket.

5a.4 Thread Based Socket IPC Implementation

In describing the implementation, we've focused on the major tasks of the IPC library and how each task is accomplished. In the interest of brevity, we have omitted some of the less important details which are documented with the code. We have identified the essential tasks of the IPC library as the following:

5a.4.1 Addressing

In order to connect to a TCP socket in a remote process, the local process must know the IP (internet protocol) address of the remote host, and the port number assigned to the socket inside the AcceptSocket object of the remote process. This information is encapsulated into an IPC_Target object and stored internally using a UNIX struct, sockaddr_in (defined in /usr/include/netinet/in.h), which is the data format passed to UNIX socket system calls. This class has methods to encode and decode to/from a binary stream. An IPC_Target object can be constructed with either a text string or a sockaddr_in. A text string representation of the address is useful for logging and for passing the parent address as a program argument to a child process. The string format depends on whether the process is anonymous or named.

5a.4.1.1 Anonymous Addresses

Anonymous IPC addresses are usually used by transient processes, such as a process awaiting a reply from a dæmon, or two processes (usually in a parent/child relationship) that need to converse, but don't need to identify themselves to the rest of the system or network. WarnGen, fxa, fxaWish, IGC_Process are all examples of anonymous processes.

Anonymous processes from the same executable will each have different IPC addresses since the port number will be generated by UNIX when the AcceptSocket object is created. This allows many instances of the same executable to be running simultaneously; each instance can be addressed independently.

The text string representation of an anonymous address has the following format: <host name>/<port number>/<process id>. Host name can either be in fully qualified domain format: vulture.fsl.noaa.gov or as a dotted IP address: 127.0.0.1. The process ID is not used in the internal address representation and is not needed to connect to a socket. However, it is useful for logging and debugging purposes since it is much easier to identify a process by its process ID than by its port number.
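To make the string format concrete, here is a minimal sketch of splitting such an address into its three fields. The struct AnonymousAddress and the function parseAnonymousAddress are hypothetical names for illustration; the IPC_Target class itself may parse the string differently.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical holder for the three fields of an anonymous address string.
struct AnonymousAddress {
    std::string host;
    int port;
    long pid;
};

// Parse "<host name>/<port number>/<process id>". Returns false if malformed.
bool parseAnonymousAddress(const std::string& text, AnonymousAddress& out)
{
    std::string::size_type first = text.find('/');
    if (first == std::string::npos || first == 0) return false;
    std::string::size_type second = text.find('/', first + 1);
    if (second == std::string::npos) return false;

    std::string portText = text.substr(first + 1, second - first - 1);
    std::string pidText = text.substr(second + 1);
    if (portText.empty() || pidText.empty()) return false;

    char* end = 0;
    long port = std::strtol(portText.c_str(), &end, 10);
    if (*end != '\0' || port <= 0 || port > 65535) return false;
    long pid = std::strtol(pidText.c_str(), &end, 10);
    if (*end != '\0' || pid <= 0) return false;

    out.host = text.substr(0, first);   // domain name or dotted IP, kept as text
    out.port = static_cast<int>(port);
    out.pid = pid;                      // informational only; not needed to connect
    return true;
}
```

For example, the string vulture.fsl.noaa.gov/4321/1234 (the port and process ID here are made-up values) would yield the host name, port 4321, and process ID 1234.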

5a.4.1.2 Named Addresses

Named addresses are usually used by processes that provide a well-known service. Such processes are usually daemons, awaiting a request, performing some action, and sometimes sending a response back. NotificationServer, CommsRouter, and RadarTextDecoder are all examples of processes that use named addresses.

Only one instance of an executable with a named address can be running on the network at one time. Named processes must run on the host specified in the system configuration file (ipc.config) and their AcceptSocket object will use the port number specified in the config entry for that process.

A config entry for a named process has the following format: <process name> <host name> <port number>. The process name has no restrictions but usually coincides with the executable name. The text string representation of an address for a named process is simply the process name specified in the config file. As with anonymous targets, host names can either be in fully qualified domain format or as a dotted IP address. The port number can be any positive number less than 32K, but all the named processes running on the same host must have unique port numbers.
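A config entry of this form could be read with a sketch like the one below. It assumes the three fields are separated by white space, which the format above suggests but does not state; the names NamedEntry and parseNamedEntry are hypothetical.

```cpp
#include <sstream>
#include <string>

// Hypothetical holder for one ipc.config entry of a named process.
struct NamedEntry {
    std::string process;
    std::string host;
    int port;
};

// Parse "<process name> <host name> <port number>". Returns false if the line
// is malformed or the port is outside the range 1..32767.
bool parseNamedEntry(const std::string& line, NamedEntry& out)
{
    std::istringstream in(line);
    NamedEntry entry;
    if (!(in >> entry.process >> entry.host >> entry.port)) return false;

    std::string extra;
    if (in >> extra) return false;                      // unexpected trailing field
    if (entry.port <= 0 || entry.port >= 32768) return false;

    out = entry;
    return true;
}
```

Checking that port numbers are unique per host would be a second pass over all parsed entries and is not shown here.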

5a.4.1.3 Assigning an Address to a Process

Both anonymous and named processes must be assigned an IPC address before any messages can be received or sent. A client of the IPC library can query the process address by calling the static method, Connection::myTarget().

How a process is assigned its address depends on whether it's an anonymous or a named process:

5a.4.2 Waiting for Events

Most D2D display and ingest processes initialize and then wait for events or UNIX signals to arrive. As the event arrives, it is processed by the object interested in the event. This continues until an event or a signal causes the process to terminate. Objects interested in events fall into two categories.

The EventDispatcher singleton object maintains three sets of client objects.

  1. All objects derived from DescriptorEventClient that are currently active in the process.
  2. All DescriptorEventClient objects that are part of the IPC library. This is a subset of the first set.
  3. All objects derived from TimerEventClient that are currently active in the process.

In order for the EventDispatcher to maintain these sets, client objects must register with the singleton during construction and cancel their registration during destruction.

A process can wait for events to arrive in four different ways.

5a.4.2.1 UNIX Select

No matter which way a process waits for events, the UNIX select() routine is used to wait for the devices to become ready until a time-out has been reached. This routine is particularly efficient since processes truly sleep until an event arrives.

The IPC library contains a global function, selectDescriptorEvents(), which is a wrapper around the select() call. It is passed a set of DescriptorEventClient objects and a pointer to a UNIX struct that represents time with a granularity of microseconds. A null pointer can be passed in, which indicates that the select() will never time out. This routine blocks its process until one of the devices is ready or the time-out value has been exceeded. It uses the objects to construct three sets of file descriptors indicating which devices are waiting for reading, writing, or an exception. It passes these sets and the pointer to the time struct to select(). If select() returns because one or more devices are ready, selectDescriptorEvents() determines which objects are ready by examining the descriptor sets set by select(). It then invokes the callback, DescriptorEventClient::handleEvent(), for each object that is ready. If two or more objects are ready at the same time, the objects with higher priority devices will be notified first. selectDescriptorEvents() returns an enumerated value reflecting the four possible results of calling select(): one or more devices are ready, the time-out expired, a signal arrived, or invalid arguments were passed in.

5a.4.2.2 Waiting for IPC from any Source

In order to wait for a single message arriving from any process, clients of the IPC library should call the static method, Connection::waitForMessage(), and pass in a time-out value. The time-out value is specified in deciseconds (a relic from our DMQ days; we can now support time-out granularity down to microseconds, but we decided not to change the interface since developers are used to it). The client can also use the enumerated values IPC_Types::NO_WAIT and IPC_Types::WAIT_FOREVER as the time-out value.

If the process wants to wait continually for IPC messages to arrive, this code fragment might be used in main() after all initialization is complete.

A caveat of using this approach is that only IPC devices are monitored. Objects that manage non-IPC devices will not be notified when their devices are ready. Also, TimerEventClient objects will not be notified when their timer has expired. If the process has these kinds of objects, then the approach described in Section 5a.4.2.4 should be used.

Here is how Connection::waitForMessage() implements this approach.

  1. Converts time-out value to the UNIX struct representing time in microseconds.
  2. Asks the EventDispatcher object for the set of DescriptorEventClient IPC objects.
  3. Calls selectDescriptorEvents(), passing in the data obtained in steps 1 and 2.
  4. If a signal arrived or if selectDescriptorEvents() times out, return IPC_Types::IPC_TRY_AGAIN_LATER. If invalid arguments were passed in, return IPC_Types::IPC_HOPELESS. Otherwise, return IPC_Types::IPC_SUCCESS.
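Steps 1 and 4 above can be sketched as two small functions. The IPC_Types struct below is a stand-in that reuses the enumerator names from the text so the example compiles on its own; SelectOutcome and both function names are hypothetical.

```cpp
#include <sys/time.h>

// Stand-in for the library's IPC_Types, limited to the codes named in the text.
struct IPC_Types {
    enum ErrorCodes { IPC_SUCCESS, IPC_TRY_AGAIN_LATER, IPC_HOPELESS };
};

// Hypothetical names for the outcomes of the select() wrapper.
enum SelectOutcome { DEVICES_READY, TIMED_OUT, SIGNAL_ARRIVED, INVALID_ARGUMENTS };

// Step 1: convert tenths of a second to a struct timeval (seconds plus
// microseconds). Assumes a non-negative value.
timeval decisecondsToTimeval(long deciseconds)
{
    timeval tv;
    tv.tv_sec = deciseconds / 10;
    tv.tv_usec = (deciseconds % 10) * 100000;
    return tv;
}

// Step 4: map the wrapper's outcome onto the code returned to the client.
IPC_Types::ErrorCodes toErrorCode(SelectOutcome outcome)
{
    switch (outcome) {
    case TIMED_OUT:
    case SIGNAL_ARRIVED:    return IPC_Types::IPC_TRY_AGAIN_LATER;
    case INVALID_ARGUMENTS: return IPC_Types::IPC_HOPELESS;
    case DEVICES_READY:     break;
    }
    return IPC_Types::IPC_SUCCESS;
}
```

A time-out of 25 deciseconds, for instance, becomes 2 seconds and 500,000 microseconds.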

5a.4.2.3 Waiting for IPC from a Particular Source

In order to wait for a message from a particular process, the client also calls Connection::waitForMessage(). In addition to the time-out value, the client passes a pointer to an IPC_Target object containing the address of the source process.

Here's how we implemented this approach.

  1. Two static variables are set. One contains the address of the source process, and the other is a flag indicating whether the desired message has arrived. Initially it is set to false.
  2. It performs the first three steps of the algorithm described in the previous section. However, step 3 is timed using a StopWatch object.
  3. Each DataSocket object that gets notified that its socket is ready for reading will check to see if the address of the process to which it is connected matches the address in the static variable set in step 1. If not, no data is read from the socket. If so, once the complete message is read, the flag from step 1 is set to true.
  4. Check the return status of selectDescriptorEvents() call. If it timed out, Connection::waitForMessage returns IPC_Types::IPC_TRY_AGAIN_LATER. If one of the sockets was able to read, check the flag to see if the socket connected to the source process received a message. If so, we are done, so return IPC_Types::IPC_SUCCESS. If not, subtract the time it took to execute selectDescriptorEvents() from the time-out. If there is still time remaining, repeat steps 2 - 4.

5a.4.2.4 Continually Waiting for All Events

Display processes that need to collectively wait for IPC, X, stream, keypad, and timer events must apply this approach. After process initialization, main() should call EventDispatcher::enterDispatchLoop(). This method will not return until an object handling an event or signal calls EventDispatcher::exitDispatchLoop(). Typically, process cleanup code is found after a return from EventDispatcher::enterDispatchLoop() since the process is about to terminate. The dispatch method operates on all the active DescriptorEventClient objects and TimerEventClient objects that have been registered during process initialization or in response to an event callback.

The EventDispatcher supports multiple interchangeable dispatch engines. Even though we currently have only one variety, we may have several in the near future, one for Tcl executables and one for non-Tcl/Tk executables. To specify which engine to use, a client passes one of these enumerated values into EventDispatcher::enterDispatchLoop(): EventDispatcher::GENERIC or EventDispatcher::TCL. If no engine type is specified, then the generic engine is assumed.

Note: Tcl/Tk interpreters built in our software tree do not use our event dispatcher at all but use the Tcl event notifier directly. Tcl/Tk interpreters are executables that are used to interpret Tcl scripts; Tcl/Tk executables are programs which have Tcl/Tk built into them, but otherwise are used for their own purposes instead of for script evaluation. These programs would use our EventDispatcher with the TclDispatchEngine, which is yet to be developed.

We will discuss only the generic dispatch engine since the Tcl engine has not been implemented yet. The implementation is very similar to what was described in Section 5a.4.2.2. The dispatch method will continue to loop until an internal flag inside the EventDispatcher singleton is set to false by calling EventDispatcher::exitDispatchLoop(). Inside the body of the loop, the method does the following:

  1. Invokes the callback for any registered timer objects that have expired by checking the current time against the timer's expiration time.
  2. While looking for expired timers, we also identify the timer object that is about to expire next. We subtract the current time from that timer's expiration time and then convert the difference into the UNIX struct that represents time in microseconds.
  3. selectDescriptorEvents() is then called, passing all the registered DescriptorEventClient objects (not just the ones dedicated to IPC) and a pointer to the time-out that was calculated in step 2. It is quite possible that there are no timer clients waiting to expire. In that case, a null pointer to the time struct is passed in, instructing the select() call to never time out.
  4. If selectDescriptorEvents() has timed out, then the callback for the timer object found in step 2 is invoked.
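The time-out calculation of steps 2 and 3 can be sketched as follows. The function nextTimeout and its helper are hypothetical, and the sketch assumes timer expirations are kept as absolute times in struct timeval form.

```cpp
#include <cstddef>
#include <sys/time.h>
#include <vector>

// Convert a timeval to a single count of microseconds.
long long toMicroseconds(const timeval& t)
{
    return static_cast<long long>(t.tv_sec) * 1000000LL + t.tv_usec;
}

// Find the timer that expires next and compute how long select() may sleep.
// Returns false if there are no timers, in which case the caller would pass a
// null pointer to select() so that it never times out.
bool nextTimeout(const std::vector<timeval>& expirations, const timeval& now,
                 timeval& timeout)
{
    if (expirations.empty()) return false;

    long long soonest = toMicroseconds(expirations[0]);
    for (std::size_t i = 1; i < expirations.size(); ++i) {
        long long t = toMicroseconds(expirations[i]);
        if (t < soonest) soonest = t;
    }

    long long wait = soonest - toMicroseconds(now);  // expiration minus current time
    if (wait < 0) wait = 0;                          // already due: do not sleep
    timeout.tv_sec = wait / 1000000LL;
    timeout.tv_usec = wait % 1000000LL;
    return true;
}
```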

5a.4.2.5 Using Tcl/Tk's Notifier for Event Waiting

Tcl/Tk interpreters and scripts use the Tcl/Tk event notifier for dispatching X events, file and device I/O, and timer and idle events. In order for those executables to use our IPC library, all the objects that manage sockets and pipes must register with this mechanism so that these will be notified when their devices are ready.

For most of our interpreters, IPC is considered a module and is initialized by the global function, IPC_Init(). This function creates a Tcl command that a script can invoke, which simply invokes either Connection::myTarget() or Connection::setMyTarget(), both of which were explained in Section 5a.4.1.3. More importantly, all the IPC DescriptorEventClient objects that were registered with the EventDispatcher at static initialization are now registered with Tcl. This is done by calling Tcl_CreateFileHandler(), passing the UNIX file descriptor, a mask describing the type of I/O to wait for, a callback, and a pointer to the DescriptorEventClient object. Unregistering with Tcl is done with Tcl_DeleteFileHandler(). When Tcl invokes the callback, it passes in the pointer to the DescriptorEventClient object. Thus, the callback can call the callback method of the object, DescriptorEventClient::handleEvent().

After IPC_Init() is executed, socket and pipe objects will register with Tcl when they also register with the EventDispatcher singleton. Likewise, when these objects are destructed, they cancel their registration with both the EventDispatcher and Tcl.

5a.4.3 Client Interface for Sending a Message

This section covers how clients of the IPC library specify and send messages to other processes. We'll also discuss the implementation details of how a message is packaged and prepared for sending. The nitty gritty details of selecting a socket and writing to it are deferred to Section 5a.4.4.

5a.4.3.1 Packaging the Data

Clients who want to send a message must first pack the data comprising the message into an ArgPkg object. ArgPkg is a set of template classes that provides for marshalling of objects or variables into a sendable byte stream, and back. There is a class for each possible quantity of data objects up to 11 objects. There is also an abstract base class from which each of these 12 classes is derived. For example, if a client wanted to send a message that contained a DataTime object, a text string, and a double precision floating point value, the client would instantiate an ArgPkg3 object using the constructor that accepts the three pieces of data. The declaration might look like this:

An important caveat is that each data type must have an associated serialize(), serialLength(), and quantize() routine.

To convert from data to binary, the following set of functions is used:

The data to write (arg), and the binary stream to write to (addr) are passed in. The routines return the pointer to the byte on the stream after the data just written.

To determine how many bytes a piece of data occupies on the byte stream, the following functions are used:

The object or variable is passed in (arg) and the number of bytes it will take to encode onto a binary stream is returned.

To convert from binary to data, the following set of functions is used:

A pointer to the data to fill in (arg), and the binary stream to read from (addr) are passed in. The caller is responsible for allocating memory for the data. Quantize methods do not allocate any memory. The routines return the pointer to the byte on the stream after the data just read.

Fortunately, the conversion functions for almost all of the common data types have been written. Occasionally, a client may want to send an unusual class of object across process boundaries and will have to write these conversion routines. Complex data types are converted by calling serialize, serialLength, or quantize on the individual data members that are to be converted. The author of the serialize/quantize functions for a particular class can choose which data members will be encoded/decoded to/from a binary stream. For atomic data types, we use the XDR package for our conversion needs. XDR was a good choice since it's available on a great number of UNIX and non-UNIX systems. In fact, any system that has NFS had better have XDR.
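As an illustration of the pattern, here is a minimal sketch of the three routines for a 32-bit integer and for a small class built from two of them. The signatures are assumptions inferred from the descriptions above, the GridPoint type is invented for the example, and hand-written big-endian encoding takes the place of the XDR calls so that the sketch is self-contained.

```cpp
#include <cstddef>
#include <stdint.h>

// Atomic type: write 4 bytes in big-endian order, return the byte after them.
unsigned char* serialize(uint32_t arg, unsigned char* addr)
{
    addr[0] = static_cast<unsigned char>(arg >> 24);
    addr[1] = static_cast<unsigned char>(arg >> 16);
    addr[2] = static_cast<unsigned char>(arg >> 8);
    addr[3] = static_cast<unsigned char>(arg);
    return addr + 4;
}

std::size_t serialLength(uint32_t) { return 4; }

// Read 4 bytes back into caller-owned storage, return the byte after them.
const unsigned char* quantize(uint32_t* arg, const unsigned char* addr)
{
    *arg = (static_cast<uint32_t>(addr[0]) << 24) |
           (static_cast<uint32_t>(addr[1]) << 16) |
           (static_cast<uint32_t>(addr[2]) << 8) |
           static_cast<uint32_t>(addr[3]);
    return addr + 4;
}

// Hypothetical complex type: converted by delegating to its members.
struct GridPoint {
    uint32_t row;
    uint32_t column;
};

unsigned char* serialize(const GridPoint& arg, unsigned char* addr)
{
    addr = serialize(arg.row, addr);
    return serialize(arg.column, addr);
}

std::size_t serialLength(const GridPoint& arg)
{
    return serialLength(arg.row) + serialLength(arg.column);
}

const unsigned char* quantize(GridPoint* arg, const unsigned char* addr)
{
    addr = quantize(&arg->row, addr);
    return quantize(&arg->column, addr);
}
```

A sender would allocate serialLength() bytes, call serialize() to fill them, and the receiver would call quantize() on the same bytes to rebuild the object, whatever the byte order of either host.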

With an ArgPkg object, data is converted to binary by simply calling its serialize() method. This method in turn calls the serialize() routine for each of the packaged data types. An ArgPkg object also supports a serialLength() method that is implemented in the same recursive fashion.
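The three kinds of routines and the recursive package can be illustrated with a minimal sketch. It is not the library's code: the namespace wire, the Pkg2 template, and the toBytes helper are hypothetical names, and the atomic type is encoded by hand in big-endian order rather than through XDR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

namespace wire {  // hypothetical namespace for the free conversion routines

// 32-bit unsigned integer: four bytes, most significant first.
inline std::size_t serialLength(const std::uint32_t&) { return 4; }

inline std::uint8_t* serialize(const std::uint32_t& arg, std::uint8_t* addr) {
    addr[0] = static_cast<std::uint8_t>(arg >> 24);
    addr[1] = static_cast<std::uint8_t>(arg >> 16);
    addr[2] = static_cast<std::uint8_t>(arg >> 8);
    addr[3] = static_cast<std::uint8_t>(arg);
    return addr + 4;  // first byte after the data just written
}

inline const std::uint8_t* quantize(std::uint32_t* arg, const std::uint8_t* addr) {
    *arg = (std::uint32_t(addr[0]) << 24) | (std::uint32_t(addr[1]) << 16) |
           (std::uint32_t(addr[2]) << 8) | std::uint32_t(addr[3]);
    return addr + 4;  // first byte after the data just read
}

// Text string: a 4-byte length followed by the characters.
inline std::size_t serialLength(const std::string& arg) { return 4 + arg.size(); }

inline std::uint8_t* serialize(const std::string& arg, std::uint8_t* addr) {
    addr = serialize(static_cast<std::uint32_t>(arg.size()), addr);
    std::memcpy(addr, arg.data(), arg.size());
    return addr + arg.size();
}

inline const std::uint8_t* quantize(std::string* arg, const std::uint8_t* addr) {
    std::uint32_t n = 0;
    addr = quantize(&n, addr);
    arg->assign(reinterpret_cast<const char*>(addr), n);
    return addr + n;
}

}  // namespace wire

// A two-member package in the spirit of ArgPkg2 (hypothetical): each method
// applies the matching routine to every member in order.
template <typename A, typename B>
struct Pkg2 {
    A a;
    B b;
    std::size_t serialLength() const {
        return wire::serialLength(a) + wire::serialLength(b);
    }
    std::uint8_t* serialize(std::uint8_t* addr) const {
        return wire::serialize(b, wire::serialize(a, addr));
    }
    const std::uint8_t* quantize(const std::uint8_t* addr) {
        return wire::quantize(&b, wire::quantize(&a, addr));
    }
};

// Encodes a package into a buffer sized by serialLength().
template <typename P>
std::vector<std::uint8_t> toBytes(const P& pkg) {
    std::vector<std::uint8_t> buf(pkg.serialLength());
    std::uint8_t* end = pkg.serialize(buf.data());
    assert(end == buf.data() + buf.size());  // the two routines must agree
    (void)end;
    return buf;
}
```

A sender would call toBytes() on a filled-in package; a receiver would default-construct the same package type and call quantize() on the received bytes.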

5a.4.3.2 Packaging the Metadata

Once the data is packaged, the next step is to instantiate a ParameterizedMsg object. This is constructed with an ArgPkg object and the following metadata that describes how the message should be sent and received:

The delivery type and priority default to IPC_Types::ASYNC and IPC_Types::NORMAL_PRIORITY, so these can be omitted by the client.

Quite often, a client develops a class that inherits from ParameterizedMsg. This is usually convenient, but not necessary unless the client wants to directly control how the message data is translated to binary. In this case, the client passes a null pointer for the arg object when constructing the message object. The derived class should then provide implementations for the virtual methods: structToByteStream() and byteStreamLength().

5a.4.3.3 Initiating the Send

Now that the data and the attributes are packaged, the client can initiate the sending of the message by calling one of the two send methods, passing either a pointer to a ChildProcess object or an IPC_Target object (described in Section 5a.4.1), indicating where the message should be sent. These methods return one of the IPC_Types::ErrorCodes enumeration values. If the delivery type is IPC_Types::SYNC, then these methods may take a while to return, especially if the destination process is not responsive or network traffic is heavy. These methods return immediately if the transmission is asynchronous. Clients should check the return values of these methods. If the methods return IPC_Types::IPC_TRY_AGAIN_LATER, then the destination is unreachable. If IPC_Types::IPC_HOPELESS is returned, then there is something wrong with the destination address that was passed in, or some of the metadata values are invalid.

As soon as one of the send methods returns, the ParameterizedMsg object is no longer needed unless the client wants to send the same message at a later time.

5a.4.3.4 Preparing for the Send

Here is a peek at the implementation of the two ParameterizedMsg::send methods.

Most of the real work is done by ParameterizedMsg::send (const IPC_Target *target). ParameterizedMsg::send (const ChildProcess *process) simply extracts the IPC_Target from the process object and then calls the other send method, which uses the following algorithm.

  1. Check the IPC_Target for validity. If it is null, we return immediately with the value of IPC_HOPELESS.
  2. Obtain a Connection object by passing the message priority to Connection::getConnection(). If this fails, then the message priority value is invalid so we return immediately with the value of IPC_HOPELESS. If the priority is normal, then the SocketConnection object created at static initialization is returned. If this is the first high priority message sent or received by the process, then a new SocketConnection object dedicated to high priority messages is constructed and returned.
  3. Construct the message header struct. This struct has the following fields: The message header struct also has serialize, quantize, and serialLength routines, since the header will be encoded and decoded to/from the binary byte stream.
  4. Dynamically allocate a byte stream (an array of 8 bit integers) big enough to hold the data and the header.
  5. Write the header to the front of the byte stream by calling the serialize method for the header struct. Write the message data to the byte stream behind the header by calling ParameterizedMsg::structToByteStream() which simply calls the serialize method for the arg object.
  6. Initiate the transmission by calling the send method for the Connection object obtained in Step 2, passing the priority, delivery type (sync or async), the destination address, and the byte stream. Return to the caller the IPC_Types::ErrorCodes value that is returned from Connection::send. Keep in mind that Connection::send is a pure virtual method. Currently, only SocketConnection is derived from Connection, although it is possible that we may someday add additional derivations of Connection objects.

The message is now completely packaged, converted and ready to transmit!

5a.4.4 Data Transmission

This section describes how a binary stream of bytes representing an IPC message is transmitted between two processes. Here we will assume that the connection between processes has already been made. Section 5a.4.6 will describe how connections are established. We are also assuming that the socket buffer between the processes is not full. If it is, then sending the message becomes much more complicated; this will be covered in Section 5a.4.7.

5a.4.4.1 Selecting a Socket

When a client initiates a send, SocketConnection::sendMsg() eventually gets called with the byte stream to send, the address to send to, and instructions on how to send (sync or async). A SocketConnection object maintains a dictionary of DataSocket objects indexed by the address (IPC_Target object) of the connecting process. This method uses the specified address to locate a DataSocket object in the dictionary. If one is found, then we need to confirm that the socket managed by the object is still connected. We do this without actually sending or receiving data by using the UNIX routine getpeername(). If that call fails, or we couldn't find an associated DataSocket object, then we try to establish a connection. If we cannot establish a connection, then it is safe to abort the send with a status of IPC_Types::IPC_TRY_AGAIN_LATER. If getpeername() is successful, then we have found a socket and we can begin writing to it.
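The connectivity check can be sketched in a few lines. The helper name hasPeer is hypothetical; only the getpeername() call itself comes from the description above.

```cpp
#include <cassert>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Returns true if fd is a socket that currently has a peer. No data is sent
// or received. The helper name is hypothetical.
bool hasPeer(int fd) {
    sockaddr_storage peer{};
    socklen_t len = sizeof(peer);
    return getpeername(fd, reinterpret_cast<sockaddr*>(&peer), &len) == 0;
}
```

A false result covers both a descriptor that is not a socket and a socket that is not connected, which matches the "try to establish a connection" branch in the text.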

5a.4.4.2 Writing to a Socket

After SocketConnection::sendMsg() has identified the socket object to use, the byte stream and a pointer to the data socket object are packaged into a PendingMsg object. PendingMsg is a fairly simple class of objects that supply a method for writing to a socket (PendingMsg::send()) and methods indicating whether the send is finished and whether it is successful.

SocketConnection::sendMsg() then invokes the DataSocket::writeMsg() method on the selected socket object, passing the delivery mode and the PendingMsg object. DataSocket::writeMsg() will return a status indicating whether the write was successful; that status value is then returned by SocketConnection::sendMsg().

The first task for DataSocket::writeMsg() is to record the current time. This will be useful in determining which is the least recently used socket.

Next, the socket object places the socket in non-blocking mode. This means that if we try to write to a full socket, the write() call will fail immediately with an error of EAGAIN instead of blocking. We use UNIX fcntl() calls to toggle between blocking and non-blocking mode.

As we mentioned earlier, the PendingMsg object will perform the write to the socket. The constructor of this object determines how many bytes are in the message stream by reading the first four bytes of the binary stream. PendingMsg::send() will call the UNIX write() routine, which does the socket writing. write() will return the number of bytes written to the socket. At this point, there are three possibilities:

Because of that third possibility, it may take several calls to PendingMsg::send() before the complete message is sent.

After the DataSocket object places the socket in non-blocking mode, the PendingMsg object is asked to try to write its message. The socket is then immediately placed back into blocking mode. If the write has completed either successfully or unsuccessfully, then we are done. Yippee! If the write did not complete, the receiving process must be busy with some other task which caused the socket buffer to fill up. With that being the case, the situation becomes much more complicated and is explained in detail in Section 5a.4.7.
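The toggle-write-toggle sequence might look like the following sketch. The names WriteResult and tryWriteOnce are hypothetical, and the sketch assumes that SIGPIPE is ignored or handled elsewhere, since a write to a socket whose peer has gone away would otherwise raise it.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

enum class WriteResult { Complete, Partial, Failed };

// Attempts one non-blocking write of buf[offset..len) and then restores the
// descriptor's original mode. offset is advanced by the number of bytes the
// socket accepted. All names here are hypothetical.
WriteResult tryWriteOnce(int fd, const char* buf, std::size_t len, std::size_t& offset) {
    assert(offset <= len);
    const int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) {
        return WriteResult::Failed;
    }
    const ssize_t n = write(fd, buf + offset, len - offset);
    const int err = errno;       // save before fcntl() can overwrite it
    fcntl(fd, F_SETFL, flags);   // back to the original (blocking) mode

    if (n >= 0) {
        offset += static_cast<std::size_t>(n);
        return offset == len ? WriteResult::Complete : WriteResult::Partial;
    }
    // A full buffer is not a failure: nothing was written, so try again later.
    if (err == EAGAIN || err == EWOULDBLOCK) return WriteResult::Partial;
    return WriteResult::Failed;
}
```

A caller that gets Partial keeps the same offset and calls again once the socket has room, which is why several calls may be needed before a message is complete.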

After the write completes, the PendingMsg object will be destructed, causing the binary stream to be deallocated (which can be a fair chunk of memory for large messages). DataSocket::writeMsg() will then return a status indicating whether the write has succeeded or not. If SocketConnection::sendMsg() gets a status back indicating the write has failed, we can assume that the connecting process no longer wants to engage in conversation. As a result, we destroy the DataSocket object, which closes our end of the connection, and then remove the socket object from the table of connections.

5a.4.4.3 Detecting Message Arrival

The sending process has just successfully placed a complete binary stream message on the socket buffer. The destination process must punctually remove and process this message in order to avoid slowing down the sending process. Let's assume that the destination process is now idle waiting for events to arrive. SelectDescriptorEvents() (mentioned in Section 5a.4.2.1) detects that there is data to read on one of the socket devices that it is monitoring. It tells the object managing the socket by calling the object's DataSocket::handleEvent() method.

5a.4.4.4 Reading the Socket

The primary task of the method DataSocket::handleEvent() is to remove binary data from a socket and to assemble that data into a message that can be received and processed by clients of the IPC library. It is possible that the message will be sent in fragments, requiring multiple calls to this method in order to read the entire message. The following algorithm implements this method.

  1. We record the current time. This will be useful in determining which is the least recently used socket.
  2. If the client is waiting for a message from a particular process, and this socket is not connected to that process, return immediately without reading any data. See Section 5a.4.2.3.
  3. The DataSocket object maintains a pointer to a buffer capable of containing a binary stream. If that pointer is null, we are about to read a new message. If it is not null, we are about to read a fragment of a message that we started reading in an earlier invocation of this method. If we are reading a new message, read the first four bytes from the socket using the UNIX read() function; these four bytes hold the length of the message in bytes. If the read returns 0 bytes, that is a sign that the connecting process has broken the connection. In response, we close our socket by asking the parent SocketConnection object to destroy this socket object. Otherwise, since the length is now known, we allocate a byte buffer big enough to hold the message and copy the length to the front of the buffer. Some data members are also initialized, recording the message byte length and how many bytes have been read thus far.
  4. We call UNIX read() again, attempting to read the rest of the message, passing a pointer to the unfilled portion of the read buffer, and the number of bytes still left to read. Again, if read() returns 0, we close our socket. Once read() returns, we update the number of bytes read thus far. If we haven't read the entire message yet, we return here and wait for DataSocket::handleEvent() to be called again.
  5. Once the entire message has been read, the binary byte buffer is passed to the static method, Dispatcher::route() which will pass the message on to the client. The Dispatcher singleton is responsible for deleting the byte buffer once the client is finished processing. The data members that keep track of the reading progress are initialized to null values.
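Steps 3 through 5 can be sketched as a small reassembly class. This is an illustration only: the class name MessageReader and its methods are hypothetical, and the sketch assumes that the four-byte length is in network byte order and counts the whole stream, prefix included. Neither assumption is confirmed by the text.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

// Reassembles one length-prefixed message across several read events.
class MessageReader {
public:
    enum class Status { Incomplete, Complete, Closed };

    // Call once each time the descriptor is reported readable. On a blocking
    // socket the second read() below can block if only the prefix has arrived.
    Status handleReadable(int fd) {
        if (buffer_.empty()) {  // a new message begins
            std::uint32_t netLen = 0;
            ssize_t n = read(fd, &netLen, sizeof(netLen));
            // 0 means the peer closed; a short prefix is treated the same here.
            if (n != static_cast<ssize_t>(sizeof(netLen))) return Status::Closed;
            std::uint32_t total = ntohl(netLen);
            if (total < sizeof(netLen)) return Status::Closed;  // corrupt length
            buffer_.resize(total);
            std::memcpy(buffer_.data(), &netLen, sizeof(netLen));
            bytesRead_ = sizeof(netLen);
        }
        assert(bytesRead_ <= buffer_.size());
        if (bytesRead_ < buffer_.size()) {
            ssize_t n = read(fd, buffer_.data() + bytesRead_,
                             buffer_.size() - bytesRead_);
            if (n <= 0) {
                reset();
                return Status::Closed;
            }
            bytesRead_ += static_cast<std::size_t>(n);
        }
        return bytesRead_ == buffer_.size() ? Status::Complete : Status::Incomplete;
    }

    // Hands the finished message to the caller and resets the progress state.
    std::vector<std::uint8_t> take() {
        std::vector<std::uint8_t> out;
        out.swap(buffer_);
        bytesRead_ = 0;
        return out;
    }

private:
    void reset() {
        buffer_.clear();
        bytesRead_ = 0;
    }
    std::vector<std::uint8_t> buffer_;  // empty means "no message in progress"
    std::size_t bytesRead_ = 0;
};
```

In this sketch the caller owns the completed buffer after take(), whereas the text has the Dispatcher delete the byte buffer once the client is finished.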

5a.4.5 Client Interface for Receiving a Message

The previous section described how the IPC library receives a binary byte stream that represents a message arriving from some other process. This section describes how the binary message is passed to client code, and how the byte stream can be converted into a meaningful data representation that the client can use.

5a.4.5.1 Logical Modules and Message Types

IPC messages for a particular purpose are grouped into a logical module. For example, logical modules might contain messages that pertain to data notifications, log stream, the file access controller, extensions, etc. A process can have as many logical modules as required. Each logical module has an identifying enumerated value associated with it, defined in IPC_Types::Modules. As mentioned earlier, the metadata passed with every message includes this logical module identifier.

In addition to the logical module, every message has a type associated with it. The type tells the client what the format of the data is, and what the message should be used for. For example, one type of message could be an instruction for the IGC to load a color table. This type of message contains one piece of data, a color table key. A logical module may contain one or many types of messages. Like the logical module, a type has an enumerated value associated with it that is passed with every message as part of the metadata.

5a.4.5.2 Use of Receiver Objects

For every logical module, the client must instantiate a receiver object which receives all the messages for that module. Ideally, this object should be created during initialization before a process begins waiting for events. This object must be derived from the abstract base class Receiver containing a single pure virtual function, receive(), for which the client must provide an implementation. When the IPC library calls this method, it passes the byte stream, the number of bytes in the stream, the address (IPC_Target object) of the sending process, and the message type.

5a.4.5.3 Delivering the Message to the Receiver

As mentioned earlier, a process can have multiple logical modules and thus multiple receiver objects. The IPC library manages the multiple receiver objects with a singleton Dispatcher object (not to be confused with the EventDispatcher). The Dispatcher maintains a dictionary of receiver objects indexed by the logical module identifier. Thus, receiver objects must register with the Dispatcher soon after they are created. Lots of our receiver objects are instantiated and registered at static initialization time. The advantage of a static declaration in the .C file is that a client process just has to link with the .o file, and the receiver is all ready to do its thing.

Once a DataSocket object receives an entire message, it passes the byte stream to the static method, Dispatcher::route(). This method will then extract the message header from the byte stream, and use the logical module ID in the header to locate an associated registered receiver object. The receive method for that object is then called, passing the message byte stream without the header, the size of the header-less byte stream, and some other metadata that was included in the header. Once the receive method returns, Dispatcher::route() deallocates the memory for the byte stream.
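The registration and routing scheme can be illustrated with a simplified sketch. The classes below are hypothetical stand-ins rather than the library's declarations: receive() omits the sender's IPC_Target, route() takes the module ID and message type as arguments instead of extracting them from a header, and the module and type values are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Abstract base class with a single pure virtual function, as in the text.
class Receiver {
public:
    virtual ~Receiver() = default;
    virtual void receive(const std::uint8_t* data, std::size_t length, int msgType) = 0;
};

class Dispatcher {
public:
    static Dispatcher& instance() {
        static Dispatcher d;  // singleton
        return d;
    }
    void registerReceiver(int moduleId, Receiver* r) {
        assert(r != nullptr);
        receivers_[moduleId] = r;
    }
    // Returns false if no receiver is registered for the module.
    bool route(int moduleId, int msgType, const std::uint8_t* data, std::size_t length) {
        auto it = receivers_.find(moduleId);
        if (it == receivers_.end()) return false;
        it->second->receive(data, length, msgType);
        return true;
    }

private:
    Dispatcher() = default;
    std::map<int, Receiver*> receivers_;  // indexed by logical module ID
};

// Example receiver for one logical module; all values are made up.
enum { COLOR_MODULE = 7 };
enum { LOAD_COLOR_TABLE = 1, CLEAR_COLOR_TABLE = 2 };

class ColorReceiver : public Receiver {
public:
    ColorReceiver() { Dispatcher::instance().registerReceiver(COLOR_MODULE, this); }
    void receive(const std::uint8_t* data, std::size_t length, int msgType) override {
        switch (msgType) {
        case LOAD_COLOR_TABLE:
            if (length >= 1) lastKey = data[0];  // one-byte key, for brevity
            break;
        case CLEAR_COLOR_TABLE:
            lastKey = -1;
            break;
        default:
            ++unknown;  // unrecognized message type
        }
    }
    int lastKey = -1;
    int unknown = 0;
};
```

Because the receiver registers itself in its constructor, declaring one ColorReceiver object at file scope would be enough to have it registered at static initialization time.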

5a.4.5.4 Interpreting the Message

A client must provide an implementation for the method that receives different types of messages for a particular logical module. Typically, a receive method will contain a switch statement, selecting on the various types of messages for that module. Each case in the switch should convert the binary stream to some meaningful data types and then pass that data to some object or module that can process the request and data. To convert from binary to arbitrary data types, an ArgPkg object is used. ArgPkg template classes were introduced in Section 5a.4.3.1 for converting multiple objects and variables into a binary stream. They can also be used to convert a binary stream into objects and variables. For example, if a client wanted to decode a message that contained a DataTime object, a text string, and a double precision floating point value, the client would instantiate an ArgPkg3 object using the constructor that accepts a binary byte stream. The individual pieces of data can then be obtained by calling the object's param1(), param2(), etc. methods. Here is a code fragment that illustrates this example:

5a.4.6 Establishing and Breaking Connections Between Processes

With our IPC system, a process can have up to twenty connections to other processes. Each connection is two way, allowing both the receiving and sending of messages. Connections are created as needed, and maintained until one of the processes decides to break a connection. Each process transmits data through these connections using a TCP socket, managed by a DataSocket object. This section describes how the IPC system manages connections between processes.

5a.4.6.1 Making Room for a New Connection

The IPC library has a self-imposed limit of allowing up to twenty connections. Since a process has a UNIX-imposed limit of sixty active I/O devices, we decided that the IPC library should at the most use a third of that valuable resource, leaving plenty of I/O devices for the client to use.

Whenever a process initiates or receives a connection request, we count the number of DataSocket objects that each SocketConnection object owns. Then we add two to the count since every process has at least one SignalPipe and AcceptSocket object also.

If our count is over twenty, we have to close down one of our existing connections. There are a number of ways we could choose the unlucky connection, all equally appealing. We opted for the least-recently-used approach. Each time a socket is written or read, we record the current time inside the DataSocket object. We destroy the DataSocket object with the oldest recorded time which will close the socket and allow us to have a new connection without exceeding our limit.
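The least-recently-used choice can be sketched as follows. The SocketInfo struct, the string keys, and both function names are hypothetical; the real library destroys a DataSocket object, whereas this sketch only erases a table entry. The count mirrors the wording above: the number of data sockets plus the two descriptors every process holds, compared against the limit.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <map>
#include <string>

struct SocketInfo {
    int fd;
    std::time_t lastUsed;  // updated on every read or write
};

// Returns the key of the least recently used entry; table must be non-empty.
std::string pickLeastRecentlyUsed(const std::map<std::string, SocketInfo>& table) {
    assert(!table.empty());
    auto oldest = table.begin();
    for (auto it = table.begin(); it != table.end(); ++it) {
        if (it->second.lastUsed < oldest->second.lastUsed) oldest = it;
    }
    return oldest->first;
}

// Evicts one entry if the descriptor count is over the limit. `reserved`
// stands for the descriptors the library always holds (two in the text).
// Returns true if an entry was removed.
bool makeRoom(std::map<std::string, SocketInfo>& table, std::size_t limit,
              std::size_t reserved) {
    if (table.empty() || table.size() + reserved <= limit) return false;
    table.erase(pickLeastRecentlyUsed(table));
    return true;
}
```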

5a.4.6.2 Initiating a New Connection Request

When a process wants to send a message to another process, the SocketConnection object associated with that message's priority tries to locate a DataSocket object that is connected to the destination process. If one can not be found, then the sending process will initiate a new connection request.

To initiate a new connection request, the SocketConnection object makes room for the new socket if necessary. It then constructs a DataSocket object by passing the address of the destination process. This DataSocket constructor creates a TCP socket and then calls the UNIX routine connect(), passing the address. The address of the process to connect to is actually the IP address of the process's host and the port number of the TCP socket managed by the connecting process's AcceptSocket object. As long as the accept socket has been created, and a listen request has been submitted, a connection can be made, even if the connecting process is suspended or busy processing. Generally, connect() will determine quickly whether the connection is feasible, although it is possible that connect() could block the initiating process if network traffic is heavy. For now, we decided to allow the blocking because during our testing, connect() always returned almost immediately. If it becomes a problem later on, we can install a time-out value on the connection. This can be done by placing the socket in non-blocking mode, and then issuing a connect request. If connect() fails with EINPROGRESS, invoke a select() call, waiting until the socket becomes writable or the time-out has been exceeded.
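The time-out technique described at the end of the paragraph above could be sketched as follows. This is not code from the library, which currently lets connect() block; the names connectWithTimeout and loopbackAddress are hypothetical. The sketch adds one step the text does not mention: after select() reports the socket writable, SO_ERROR is read to find out whether the attempt succeeded.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

// Connects fd to addr, giving up after timeoutSec seconds. Returns 0 on
// success or -1 with errno set.
int connectWithTimeout(int fd, const sockaddr* addr, socklen_t addrLen, int timeoutSec) {
    assert(fd >= 0 && fd < FD_SETSIZE);  // select() cannot watch larger values
    const int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) return -1;

    int rc = connect(fd, addr, addrLen);
    int err = (rc == 0) ? 0 : errno;
    if (rc == -1 && err == EINPROGRESS) {
        fd_set writable;
        FD_ZERO(&writable);
        FD_SET(fd, &writable);
        timeval tv{};
        tv.tv_sec = timeoutSec;
        int ready = select(fd + 1, nullptr, &writable, nullptr, &tv);
        if (ready == 1) {
            // Writable only means the attempt finished; SO_ERROR says how.
            socklen_t len = sizeof(err);
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1) err = errno;
        } else {
            err = (ready == 0) ? ETIMEDOUT : errno;
        }
    }
    fcntl(fd, F_SETFL, flags);  // restore the original mode
    if (err != 0) {
        errno = err;
        return -1;
    }
    return 0;
}

// Convenience for trying the sketch: 127.0.0.1 at the given port (host order).
sockaddr_in loopbackAddress(unsigned short port) {
    sockaddr_in a{};
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = htons(port);
    return a;
}
```

A caller would pass the destination's address and a time-out in seconds, then treat a -1 result the same way as a failed blocking connect().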

If connect() fails with ECONNREFUSED, we know that the connection cannot be made because the address was bad, the network or the connecting host is down, or the connecting process is not running. The DataSocket constructor then closes the socket and sets the file descriptor data member to -1.

After the DataSocket object is created, the SocketConnection object checks to see if the object contains a valid file descriptor. If so, the object is added to the dictionary of connected DataSocket objects maintained by the SocketConnection object. If not, the DataSocket object is destroyed, and a problem is logged with LogStream, indicating that the connection attempt has failed.

5a.4.6.3 Receiving a New Connection Request

The previous section discussed the chain of events inside a process that tries to connect to some other process. This section describes how that other process receives the request and completes the connection.

Every process capable of IPC has a single AcceptSocket object that manages a TCP socket whose sole purpose is to accept connection requests. When a request arrives, this socket becomes ready for reading. SelectDescriptorEvents() detects this and calls the AcceptSocket::handleEvent() method for this object. This method will tell the owner of this object, the normal priority SocketConnection object, to make room for a new socket if necessary. Next, the method calls the UNIX routine accept(). This creates a new socket that is connected to the socket managed by the DataSocket object living in the process that initiated the connection. Accept() returns the file descriptor of this socket, which is used to construct a DataSocket object. This constructor doesn't have much to do since a socket was already created. It just saves the file descriptor passed in, and then performs the initialization common to both constructors. The new DataSocket object is then given to the parent SocketConnection object.

At this point, the SocketConnection object managing normal priority sockets cannot add this new DataSocket object to its dictionary of connected sockets because we do not know the address of the process to which the new socket is connected. We also do not know the priority level of this socket. The accept() call can optionally return the address of the peer socket in the other process; however, this won't help us, since that address contains the port number of the socket managed by a DataSocket object. The address of the connecting process must contain the port number of the socket managed by the AcceptSocket object. Fortunately, the process address and the priority are sent in the header of every message, so the mystery surrounding this new socket will be resolved once its first message arrives. In the meantime, the SocketConnection object adds the socket object to a set of sockets that do not yet know who their connecting processes are. DataSocket::handleEvent(), the method that reads messages from a socket, will check to see if the connection address is known. If it isn't, the read method reads the header from the front of the byte stream, looking for the address of the sending process and the priority. With the address and priority known, this DataSocket object can be placed properly in the SocketConnection object that manages the priority of this socket. Also, this DataSocket object can be removed from the set of mystery sockets maintained by the normal priority SocketConnection object.

5a.4.6.4 Breaking a Connection

A connection is broken between processes when one of the processes terminates either normally or abnormally. Also, a process can break a connection if it needs to make space for a new connection. Usually, one process breaks the connection, so the other process needs to detect the break so it can clean up. We can detect a broken connection in several different scenarios:

In order to clean up a broken connection, one of the SocketConnection objects destructs the DataSocket object and removes it from its dictionary of known sockets or its set of unknown sockets. The DataSocket destructor closes the socket, freeing up the file descriptor for other incoming connections. Also, any threads waiting to write to this socket are terminated.

5a.4.7 Waiting to Write to a Full Socket Buffer

It's easy to imagine a scenario of how a socket buffer between two processes can become full. One process may be continuously writing to the buffer, while the other process is suspended, really busy processing, or initializing and is not able to read the buffer. If the socket buffer is in blocking mode, the UNIX write() routine will block until there is space on the buffer. Blocking is generally bad. Even though it is important to send the message, the sending process has more important things to do than to wait for a message to be sent. If the socket buffer size is small, then this blocking scenario is more likely to occur. Currently, we set our buffer size to be the maximum that TCP allows: 256K. We reduce the risk of filling the socket buffer, but at the cost of increased memory use. The socket buffer size can be easily configured; there is an entry for it in ipc.config. Even with a very large socket buffer, we have seen instances where a program fills the buffer. For example, the buffer between the notification server and an fxa process usually fills while the fxa process is initializing. Since a full buffer cannot be avoided, we offer two approaches for dealing with a full buffer.

The best approach to use is left up to the client of the IPC library. If the client needs to know that its message has been completely received and processed, then the synchronous wait should be used.

This section discusses how to detect if these approaches are needed, and the implementation of both approaches. However, before discussing the approaches, we discuss how our design utilizes mutual exclusion mechanisms.

5a.4.7.1 Detecting a Full Buffer

Most of the time, a message can be sent without blocking. Since both of the above approaches incur overhead and ramifications, they should be used only if necessary. Before sending a message, we apply two full buffer checks.

First, we count the number of active threads that the DataSocket object has invoked. If there is at least one, we know that the socket buffer is still full, since a thread will terminate as soon as it can do its write, and a thread is created only in response to a full buffer.

If there are no threads active, the buffer could still be full. For the second check, the socket buffer is placed in non-blocking mode and the PendingMsg object will attempt to write the entire message to the buffer. In non-blocking mode, the UNIX write() routine will return the number of bytes written, even if it wasn't able to write the requested number of bytes. If the entire message was not written, the socket buffer is full. After the write attempt, the socket buffer is returned to blocking mode. One hopes that the complete message will be written to the buffer, and neither of the following two approaches needs to be applied.

5a.4.7.2 Mutual Exclusion

For both approaches, this design makes use of the mutual exclusion mechanism (also called mutex locks) provided by the pthread library. Mutex locks are used to ensure that only one thread of execution has access to a data structure or an operation that is global to all threads. Thus, many threads may be waiting for a single thread to finish an operation like writing to a socket buffer. Once that thread is finished, it unlocks the mutex, and one of the waiting threads is then granted access. The choice of which thread goes next is left up to the pthread library. A first come, first served approach might make sense, but the library's choices seem to be random.

Each DataSocket object manages two mutex lock mechanisms. The first ensures that fragmented messages are not interleaved with other messages. Remember, sometimes it takes several write() calls to send a complete message. Without a locking mechanism, thread A might write 20 percent of its message causing the write() to return since the buffer is now full. Thread B is given access to the socket buffer and writes a different message. The DataSocket object in the receiving process has trouble reading the message from thread A, since it has no way of telling that a new message is interleaved.

The second mutex lock is used to protect the DataSocket object's thread table, a dictionary that keeps track of the running threads. If a lock were not used, it would be quite possible for the main thread of execution to be reading or writing to the table while a sending thread is also writing to the table.

5a.4.7.3 Synchronous Waiting

For a synchronous message send, we not only have to wait for available space on the socket buffer, but we have to wait our turn for access to the buffer along with any other active threads. This is done by trying to lock the DataSocket's mutex lock for sending. It is possible that the main thread of execution could be blocked, waiting for other threads to do their writing.

Once the main thread of execution obtains the mutex lock, it can proceed with writing the message. However, we don't want to block indefinitely! The number of seconds to wait for space to become available on the buffer is specified in the configuration file, ipc.config. Current configuration is set for 3 seconds. Before trying to write to the socket, the UNIX select() routine is used to determine if the socket is ready for writing. If select() returns 1, we know that there is space on the buffer and the write() call will not block. If select() returns 0, our patience has reached its limit. At this point, it is safe to assume that the receiving process is truly unresponsive, so we should break our connection with the receiving process.
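The bounded wait can be sketched with a single select() call. The function name waitUntilWritable is hypothetical, and the time-out is passed in as a parameter rather than read from ipc.config.

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>

// Waits up to timeoutSec seconds for space on the socket buffer of fd.
// Returns true if select() reports the socket writable, false on time-out
// or error.
bool waitUntilWritable(int fd, int timeoutSec) {
    assert(fd >= 0 && fd < FD_SETSIZE);  // select() cannot watch larger values
    fd_set writable;
    FD_ZERO(&writable);
    FD_SET(fd, &writable);
    timeval tv{};
    tv.tv_sec = timeoutSec;
    return select(fd + 1, nullptr, &writable, nullptr, &tv) == 1;
}
```

With the configured value from the text, a caller would use waitUntilWritable(fd, 3) and break the connection when it returns false.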

The synchronous approach may sometimes be used even though the client requested the asynchronous approach. Some processes link with middleware libraries that register handlers for asynchronous signals without using our signal catcher mechanism. If a sending thread is invoked, then those signals will not be delivered consistently to the library. This is the case with our D2D executables that link with the Freeway library (wfoApi, dialRadar, etc.). The best solution is to prevent these executables from ever creating threads. This can be done by calling the static method, Connection::preventThreadCreation().

5a.4.7.4 Asynchronous Waiting

With asynchronous waiting, a separate thread of execution waits for space on the socket buffer while the main thread of execution continues processing. As mentioned in Section 5a.4.4.2, a PendingMsg object encapsulates the byte stream for a message and coordinates the socket buffer writing. A PendingMsg object also has a pointer back to the DataSocket object managing the socket. The sending threads all execute the same routine, doThreadSend(), which takes a PendingMsg object as a parameter.

Here is the algorithm for doThreadSend()

  1. We try to lock the mutex lock that protects against interleaved message data. This thread will probably block until the lock is obtained.
  2. Keep invoking PendingMsg::send() on the message object that was passed in until PendingMsg::sendCompleted() returns true.
  3. Unlock the mutex so that some other thread can now write to the buffer.
  4. Tell the DataSocket object that this thread has just completed. The DataSocket object will then update its thread table, and deallocate the message byte stream. Also, if the send was not successful because the connecting process has broken the connection, a flag is set to true indicating that a thread send has failed. With every future message send, the flag is checked. If the flag is true, we destroy the DataSocket object, which will abort all pending sends.
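The first three steps can be illustrated with a self-contained sketch in which a string stands in for the socket buffer. Everything here is hypothetical (FakeSocket, Pending, doThreadSendSketch, sendBoth, and the three-byte chunk size); the point is only to show that the sending mutex keeps the fragments of two messages from interleaving. Step 4, the thread table bookkeeping, is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <string>

const std::size_t kChunk = 3;  // pretend each write accepts at most 3 bytes

// Stand-in for a DataSocket: a mutex for sending plus the bytes written.
struct FakeSocket {
    FakeSocket() { pthread_mutex_init(&sendLock, nullptr); }
    ~FakeSocket() { pthread_mutex_destroy(&sendLock); }
    pthread_mutex_t sendLock;
    std::string wire;  // everything "written to the socket" so far
};

// Stand-in for a PendingMsg: the message bytes and the progress so far.
struct Pending {
    FakeSocket* sock;
    std::string bytes;
    std::size_t sent;
    bool sendCompleted() const { return sent == bytes.size(); }
    void send() {  // one partial write
        assert(sent <= bytes.size());
        std::size_t n = bytes.size() - sent;
        if (n > kChunk) n = kChunk;
        sock->wire.append(bytes, sent, n);
        sent += n;
    }
};

extern "C" void* doThreadSendSketch(void* arg) {
    Pending* msg = static_cast<Pending*>(arg);
    pthread_mutex_lock(&msg->sock->sendLock);    // step 1: wait our turn
    while (!msg->sendCompleted()) msg->send();   // step 2: write every fragment
    pthread_mutex_unlock(&msg->sock->sendLock);  // step 3: let the next thread in
    return nullptr;
}

// Sends two messages from two threads and returns what landed on the wire,
// or an empty string if a thread could not be created.
std::string sendBoth(const std::string& a, const std::string& b) {
    FakeSocket sock;
    Pending pa{&sock, a, 0};
    Pending pb{&sock, b, 0};
    pthread_t ta, tb;
    if (pthread_create(&ta, nullptr, doThreadSendSketch, &pa) != 0) return "";
    if (pthread_create(&tb, nullptr, doThreadSendSketch, &pb) != 0) {
        pthread_join(ta, nullptr);
        return "";
    }
    pthread_join(ta, nullptr);
    pthread_join(tb, nullptr);
    return sock.wire;
}
```

Either message may reach the wire first, since the order in which the threads obtain the lock is not specified, but each message always arrives whole.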

This approach is far more efficient and sexier than the synchronous approach, but the client sending the message may not care about sexiness or efficiency.

5a.4.7.5 Thread Management

Every DataSocket object is responsible for managing the threads that are waiting to write to the object's socket. Each thread has an identifier assigned to it by the pthread library. As mentioned in the previous section, each thread is also passed a PendingMsg object, encapsulating the message to be sent by the thread. In order to keep tabs on the running threads, each DataSocket object maintains a dictionary of pointers to PendingMsg objects that have been passed to a thread. The dictionary is indexed by the identifier of the thread holding on to the associated PendingMsg object. As mentioned in Section 5a.4.7.2, access to this dictionary must be protected by a mutex lock since multiple threads may be accessing the dictionary simultaneously.

A thread is started by calling the pthread library routine pthread_create(), passing the thread routine doThreadSend() and a pointer to the PendingMsg object. The creation routine returns a thread identifier, which is used to make a new entry in the table.
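
A minimal sketch of such a table follows, assuming names that do not come from the library (ThreadTable, start(), finished(), sendRoutine()). Thread identifiers are compared with pthread_equal() because pthread_t is an opaque type. The sketch holds the table lock across pthread_create() so that a fast thread cannot try to remove its entry before the entry exists; whether the D2D code does the same is not stated above.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

class ThreadTable;

// Hypothetical stand-in: only the back pointer and a length are modeled.
struct PendingMsg {
    ThreadTable* owner;   // plays the role of the pointer back to the DataSocket
    std::size_t length;   // size of the message byte stream
};

class ThreadTable {
public:
    ThreadTable() { pthread_mutex_init(&lock_, nullptr); }
    ~ThreadTable() { pthread_mutex_destroy(&lock_); }

    // Start a thread and record it under its identifier.
    bool start(void* (*routine)(void*), PendingMsg* msg, pthread_t* idOut) {
        assert(msg != nullptr && idOut != nullptr);
        pthread_mutex_lock(&lock_);
        bool ok = pthread_create(idOut, nullptr, routine, msg) == 0;
        if (ok) entries_.push_back(std::make_pair(*idOut, msg));
        pthread_mutex_unlock(&lock_);
        return ok;
    }

    // Called by a thread when it has finished; removes its entry.
    void finished(pthread_t id) {
        pthread_mutex_lock(&lock_);
        for (std::size_t i = 0; i < entries_.size(); ++i) {
            if (pthread_equal(entries_[i].first, id)) {
                entries_.erase(entries_.begin() + i);
                break;
            }
        }
        pthread_mutex_unlock(&lock_);
    }

    std::size_t size() {
        pthread_mutex_lock(&lock_);
        std::size_t n = entries_.size();
        pthread_mutex_unlock(&lock_);
        return n;
    }

private:
    pthread_mutex_t lock_;
    std::vector<std::pair<pthread_t, PendingMsg*> > entries_;
};

// Simplified thread routine: the real one would write the message first.
extern "C" void* sendRoutine(void* arg) {
    PendingMsg* msg = static_cast<PendingMsg*>(arg);
    msg->owner->finished(pthread_self());
    return nullptr;
}

// Start two threads, wait for both, and report how many entries remain.
std::size_t demoThreadTable() {
    ThreadTable table;
    PendingMsg m1 = {&table, 100};
    PendingMsg m2 = {&table, 200};
    pthread_t t1, t2;
    table.start(sendRoutine, &m1, &t1);
    table.start(sendRoutine, &m2, &t2);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
    return table.size();
}
```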

We made a design decision to limit the number of threads running in the system, because an abundance of pending threads means that potentially a lot of message memory is allocated. Also, a plethora of running threads implies that the receiving process is not responding or is possibly hung. After mulling over several possibilities, we came up with the approach of counting the number of bytes for all the pending messages whenever we are about to create a new thread. If the total exceeds a threshold value defined in ipc.config, we break our end of the connection, which involves destructing the DataSocket object and terminating all the running threads for that DataSocket.
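
The byte-count check might look like the sketch below. The function name, the parameter name maxPendingBytes, and the choice to count only the messages already pending (not the new one) are assumptions; the actual key name in ipc.config is not given here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in: only the message length matters for this check.
struct PendingMsg {
    std::size_t length;
};

// Returns true when the bytes held by pending messages exceed the limit,
// meaning the connection should be broken instead of starting a new thread.
bool overPendingLimit(const std::vector<const PendingMsg*>& pending,
                      std::size_t maxPendingBytes) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < pending.size(); ++i) {
        assert(pending[i] != nullptr);
        total += pending[i]->length;
    }
    return total > maxPendingBytes;
}
```

Counting bytes instead of threads means that many small pending messages are tolerated, while a few very large ones can trip the limit.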

Under normal circumstances, a thread terminates under its own control, and then tells its DataSocket object that it has finished. However, whenever a DataSocket is destructed, it needs to tell all the running threads to terminate, ASAP. This is done by using the pthread library routine pthread_cancel(), passing the thread identifier. Unfortunately, pthread_cancel() is just a request, and the return of this routine does not guarantee that all running threads have been cancelled. As a result, the destructor has to wait for all the threads to terminate before it can continue execution. It does this by trying to lock the sending mutex lock. Once it obtains the lock, we can be sure that all the threads have terminated. When a thread receives a cancel request, it does not automatically give up any mutexes that it locked. Fortunately, the pthread library calls a callback routine whenever a thread is about to be cancelled. Our callback routine, doThreadCancel(), simply unlocks the sending mutex.
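
The cancellation mechanism can be illustrated with the POSIX cleanup-handler calls pthread_cleanup_push() and pthread_cleanup_pop(). This is a sketch under assumptions: the text above does not say which call registers doThreadCancel(), the names blockedSender() and cancelAndReclaim() are invented, pause() stands in for a write that is blocked on a full socket buffer, and the sketch waits with pthread_join() for determinism where the destructor described above waits by locking the sending mutex.

```cpp
#include <pthread.h>
#include <unistd.h>
#include <cassert>

static pthread_mutex_t sendMutex = PTHREAD_MUTEX_INITIALIZER;

// Cleanup handler: runs when the thread is cancelled and releases the
// mutex that the cancelled thread would otherwise hold forever.
extern "C" void doThreadCancel(void* arg) {
    assert(arg != nullptr);
    pthread_mutex_unlock(static_cast<pthread_mutex_t*>(arg));
}

// A sending thread that never finishes on its own.
extern "C" void* blockedSender(void*) {
    pthread_mutex_lock(&sendMutex);
    pthread_cleanup_push(doThreadCancel, &sendMutex);
    for (;;)
        pause();              // cancellation point; a blocked write stand-in
    pthread_cleanup_pop(1);   // never reached; pairs with the push above
    return nullptr;
}

// Cancel the thread, wait for it to terminate, then verify the mutex is free.
bool cancelAndReclaim() {
    pthread_t t;
    if (pthread_create(&t, nullptr, blockedSender, nullptr) != 0) return false;
    pthread_cancel(t);          // only a request, not a guarantee
    pthread_join(t, nullptr);   // returns once the thread has terminated
    if (pthread_mutex_trylock(&sendMutex) != 0) return false;
    pthread_mutex_unlock(&sendMutex);
    return true;
}
```

Without the cleanup handler, the final trylock would fail because the cancelled thread would still own the mutex.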

5a.4.8 Signal Handling

Keep in mind that a process using this IPC library is essentially a single-threaded process until at least one message-sending thread has been invoked. Since a new thread of execution will be created only as a result of sending a message to a process whose socket buffer is full, many processes will never become multi-threaded. However, as soon as a process becomes multi-threaded, the algorithm for delivering signals in a single-threaded process no longer works. Thus, our signal catcher has been enhanced to deliver signals in both single- and multiple-threaded environments.

This section describes how a client process can register for signals and also control their delivery. It then explains the difference between synchronous and asynchronous signals and how these kinds of signals can cause re-entrancy problems. Finally, this section discusses the implementation of delivering signals in both single- and multiple-threaded environments.

5a.4.8.1 Registering for Signals

A client process handles signals by instantiating an object derived from the SignalClient class, and then registering that object by calling the static method, SignalCatcher::registerSignalClient(). The SignalClient class contains five virtual methods, each representing a logical grouping of signals. A client process can handle signals from one or more categories by providing an implementation for one or more of the corresponding virtual functions. Notice that these virtual functions are not pure. If the derived class does not override the implementation, then the base class will provide reasonable behavior for handling that category of signals.

A client process does not have to provide a SignalClient object; one with default behavior is created and registered at static initialization. Should the client process register several SignalClient objects, signals will be passed to only the most recently registered client.
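
The registration pattern can be sketched as below. Everything except the names SignalClient, SignalCatcher, and registerSignalClient() is hypothetical: the category method names (onTerminate, onChild), the dispatch() helper, and the string return values, which exist only to make the behavior observable. Only two of the five categories are modeled.

```cpp
#include <csignal>
#include <cassert>
#include <string>

class SignalClient {
public:
    virtual ~SignalClient() {}
    // Non-pure virtuals: the base class supplies default behavior.
    // Both method names are invented for this sketch.
    virtual std::string onTerminate(int) { return "default-terminate"; }
    virtual std::string onChild(int) { return "default-child"; }
};

class SignalCatcher {
public:
    // Only the most recently registered client receives signals.
    static void registerSignalClient(SignalClient* client) {
        assert(client != nullptr);
        current() = client;
    }

    // Hypothetical helper: route a signal number to its category method.
    static std::string dispatch(int signo) {
        switch (signo) {
        case SIGINT:
        case SIGTERM:
            return current()->onTerminate(signo);
        case SIGCHLD:
            return current()->onChild(signo);
        default:
            return "unhandled";
        }
    }

private:
    // A default client exists before any registration takes place.
    static SignalClient*& current() {
        static SignalClient defaultClient;
        static SignalClient* client = &defaultClient;
        return client;
    }
};

// A client that overrides one category and inherits the other default.
class RestartingClient : public SignalClient {
public:
    std::string onChild(int) override { return "restart-child"; }
};
```

A process that registers a RestartingClient handles child signals itself and still gets the base class behavior for termination signals.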

Signals are categorized in the following way:

5a.4.8.2 Controlling Signal Delivery

Some of our D2D processes link with middleware packages that will not function properly if a handler is installed for the SIGCHLD signal. Tcl/Tk is an example of such a package. Thus, client processes can instruct the signal handling software to keep its grubby little paws off of SIGCHLD by calling the static method SignalCatcher::preventChildSignalHandling().

A few of our D2D processes such as the acqserver receive IPC messages from clients that do not link with our IPC library. As a result, these processes do not wait for events using the approaches described in Section 5a.4.2. Since our approach for handling asynchronous signals safely is contingent on the process using one of our event-waiting approaches, these processes must handle asynchronous signals at the risk of re-entering code that is not designed to be re-entered. If a client process calls the static method SignalCatcher::useUnsafeSignalDelivery(), then asynchronous signals will be handled as soon as they are delivered. Normally, their delivery is deferred until the client process is ready to handle them.

5a.4.8.3 Synchronous and Asynchronous Signals and Re-entrancy

Even in a single-threaded process, delivery of signals differs depending on whether the signal was generated synchronously or asynchronously because of the risk of re-entering non-re-entrant code.

Synchronous signals are the result of some error condition that occurs inside a process, and are delivered synchronously with respect to that error. For example, if a floating point calculation results in an overflow, a SIGFPE (floating point exception signal) is delivered to the process immediately following the instruction that resulted in the overflow. Our signal catching software handles the following synchronous signals: SIGILL, SIGTRAP, SIGABRT, SIGEMT, SIGFPE, SIGBUS, SIGSEGV, SIGPIPE, and SIGSYS.

Synchronous signals can be handled immediately without the danger of re-entering code since the signal was in response to an error in the code that it is interrupting.

Asynchronous signals are normally the result of an event that is external to the process, and are delivered whenever during the process execution such an event occurs. For example, when a user running a program types the interrupt character at the terminal (generally <ctrl-c>), a SIGINT (interrupt signal) is delivered to the process. Our signal catching software handles the following asynchronous signals: SIGHUP, SIGINT, SIGTERM, SIGALRM, SIGUSR1, SIGUSR2, SIGCHLD, SIGLOST, SIGPROF, and SIGQUIT.

If asynchronous signals are handled as soon as they arrive, it is quite possible to re-enter code that doesn't support re-entrancy, such as malloc(), which may cause memory errors or a process crash. For example, suppose a depictable is busy allocating memory for a radar table and a SIGCHLD signal is delivered to our process, signaling that one of our child processes has died and needs to be restarted. While restarting the child, the signal handler allocates some memory and the process crashes due to a segmentation violation. Why? malloc() accesses some shared global data structures, and the second malloc() call, made during the signal handling, was issued before the first malloc() call from the depictable had a chance to complete.

In order to avoid re-entrancy problems, our signal catching software delivers async signals using a strategy of deferred delivery. The vast majority of D2D processes alternate in a loop between waiting for events and handling new events. An event can be a timer expiring, an IPC message, a mouse button click, etc. As we saw with the previous example, async signals can be delivered while a process is busy handling an event. However, we defer the handling of that signal until the process has finished its event handling and is now waiting for a new event to arrive. The new event could be that deferred signal. We procrastinate on signal processing by using a pipe which is UNIX's implementation of a FIFO (first-in, first-out) queue. The details of using a pipe are covered in the next section.

5a.4.8.4 Single Thread Implementation

The SignalCatcher singleton object solicits signals from UNIX and passes them on to the registered SignalClient object by calling the appropriate method that matches the incoming signal.

Signal solicitation is done during static initialization time before the process executes its main() routine. The UNIX routine sigaction() is used to install one of two signal handling routines for each signal that we are handling. When one of those signals is sent to our processes, UNIX will call one of those routines, passing the signal number.

Also, during static initialization, the SignalCatcher will create a SignalPipe object that encapsulates a UNIX pipe (FIFO queue). During construction, the object creates the pipe and saves the two file descriptors for reading and writing to the pipe. It also registers with the EventDispatcher as a DescriptorEventClient so that the object will be notified when its pipe is ready for reading.

The signal handling routine for synchronous signals is called handleSignal(). This routine contains one large switch statement with cases for each supported signal. Each case will invoke the appropriate method for the currently registered SignalClient object. Thus, synchronous signals are not deferred; they are handled as soon as they arrive.

The signal handling routine for asynchronous signals is called receiveAsyncSignal(). If the client process instructed the SignalCatcher object not to defer signal delivery, then this routine will call handleSignal(), which will handle the signal immediately. However, most processes will not do this since they do not want to risk a crash because of re-entrancy. Instead, receiveAsyncSignal() will tell the SignalPipe object to write the signal number to its pipe. When the process has finished its processing of other events and is idle waiting for an event to arrive, the selectDescriptorEvents() routine (covered in Section 5a.4.2.1) will determine that the pipe managed by the SignalPipe object is now ready for reading, and will call SignalPipe::handleEvent(). This method will read all the data in the pipe into an array of integers. Each cell in the array will contain an async signal that has been delivered but not handled. The method then loops through the array, calling handleSignal() for every cell. Thus, the pipe is used to defer the delivery of async signals until a process is ready to handle them.
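
The deferred-delivery idea can be sketched with plain POSIX calls. This is an illustration, not the SignalPipe class itself: pipeFds, drainSignalPipe(), and demoDeferredDelivery() are invented names, select() is called directly instead of going through the EventDispatcher, and the read end is made non-blocking so that the drain loop stops when the pipe is empty.

```cpp
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <csignal>
#include <cstddef>
#include <vector>

static int pipeFds[2] = {-1, -1};   // [0] is the read end, [1] the write end

// Handler for async signals: only records the signal number in the pipe.
// write() is async-signal-safe, so nothing non-re-entrant is called here.
extern "C" void receiveAsyncSignal(int signo) {
    int savedErrno = errno;
    ssize_t r = write(pipeFds[1], &signo, sizeof signo);
    (void)r;
    errno = savedErrno;
}

// Read every deferred signal number currently in the pipe.
std::vector<int> drainSignalPipe() {
    assert(pipeFds[0] >= 0);
    std::vector<int> pending;
    int buf[32];
    ssize_t n;
    while ((n = read(pipeFds[0], buf, sizeof buf)) > 0) {
        std::size_t count = static_cast<std::size_t>(n) / sizeof(int);
        pending.insert(pending.end(), buf, buf + count);
    }
    return pending;
}

std::vector<int> demoDeferredDelivery() {
    if (pipe(pipeFds) != 0) return std::vector<int>();
    fcntl(pipeFds[0], F_SETFL, O_NONBLOCK);

    struct sigaction sa = {};
    sa.sa_handler = receiveAsyncSignal;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, nullptr);
    sigaction(SIGUSR2, &sa, nullptr);

    // Signals arrive while the process is "busy": they are only queued.
    raise(SIGUSR1);
    raise(SIGUSR2);

    // Later, the event loop goes idle and finds the pipe readable.
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(pipeFds[0], &readable);
    select(pipeFds[0] + 1, &readable, nullptr, nullptr, nullptr);
    return drainSignalPipe();   // a real loop would now call handleSignal()
}
```

Because the pipe is a FIFO, the deferred signals come out in the order in which they were delivered.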

5a.4.8.5 Multiple Thread Implementation

In a multiple thread environment, synchronous signals are delivered to the thread that generates the signal. All other threads continue as if nothing happened. With our design, a process can have a single main thread that does most of the process work, and one or more threads that are waiting to write to a full socket buffer. For the main thread, we use sigaction() to install handleSignal() as the handler for all synchronous signals, just as we do in the single-threaded implementation. For sending threads, we rely on the default UNIX handling for synchronous signals except for SIGPIPE. The default UNIX behavior for SIGPIPE is to exit the process. That is not a good action for us. If another process closes its end of the connection, a SIGPIPE is generated; that's a signal for our process to close our socket but not to exit the process. The first thing that every sending thread does is to install a handler for SIGPIPE. This handler doesn't do anything useful, but if the sending thread tells UNIX to ignore SIGPIPE, then the thread will not be able to detect when a connection is closed.
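
A do-nothing SIGPIPE handler can be sketched as follows, using a pipe whose read end has been closed in place of a socket whose peer has gone away. The names ignoreBrokenPipe() and writeToClosedPipe() are invented. Note one assumption that differs from the description above: under current POSIX rules a signal handler is installed for the whole process, so the sketch installs it once instead of once per sending thread.

```cpp
#include <unistd.h>
#include <cassert>
#include <cerrno>
#include <csignal>

// The handler does nothing; its only job is to replace the default
// action, which would terminate the process.
extern "C" void ignoreBrokenPipe(int) {}

// Write to a pipe with no reader and report the resulting errno value.
int writeToClosedPipe() {
    struct sigaction sa = {};
    sa.sa_handler = ignoreBrokenPipe;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPIPE, &sa, nullptr);

    int fds[2];
    if (pipe(fds) != 0) return -1;
    close(fds[0]);                        // the "peer" closes its end

    char byte = 'x';
    ssize_t n = write(fds[1], &byte, 1);  // raises SIGPIPE, then fails
    int err = (n < 0) ? errno : 0;
    close(fds[1]);
    return err;                           // EPIPE tells us to close our socket
}
```

The process survives the write, and the EPIPE error is the cue to close the socket instead of exiting.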

Handling asynchronous signals in a multi-threaded environment is fairly complicated since an async signal can be delivered to any one of the running threads. The decision of which thread gets the signal is dependent on the implementation of the DCE pthread library. Also, this library does not allow clients to install handlers for async signals with sigaction() once the process becomes multi-threaded.

After consulting with HP, we developed the following approach for soliciting and delivering async signals in a multiple threaded environment. Right before the first sending thread is invoked, the SignalCatcher object will start a thread whose sole purpose is to listen for async signals. This async signal thread will block all other threads from receiving the supported set of asynchronous signals using the UNIX routine, sigprocmask(). It then calls another UNIX routine, sigwait(), that will block the thread's execution until one of the async signals arrives and returns the signal number.

At this point, if the client process instructed the SignalCatcher object not to defer signal delivery, then the thread will call handleSignal() which will handle the signal immediately. This is very dangerous since the signal handling will occur in a different thread but might possibly simultaneously access some of the same data structures that the main thread is accessing. Fortunately, only a few D2D processes require the immediate delivery of async signals.

More likely, once the async signal thread receives a signal from sigwait(), it uses the SignalPipe object to write that signal number to the pipe. The signal is then read by the main thread of execution without interrupting any of the main thread's processing. The signal is then handled and processed by the main thread by calling handleSignal(). This approach not only addresses the re-entrancy problem, but also the issue of two threads accessing the same data structures simultaneously.
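
The dedicated signal thread can be sketched with current POSIX calls. This is an approximation of the approach described above, not the DCE threads code: it blocks the signal with pthread_sigmask() in the main thread before creating the listener, so that every thread inherits the mask, where the text above uses sigprocmask() from the signal thread itself. The names sigPipe, asyncSet, asyncSignalThread(), and demoSigwaitThread() are invented, and only SIGUSR1 is handled.

```cpp
#include <pthread.h>
#include <unistd.h>
#include <cassert>
#include <csignal>

static int sigPipe[2] = {-1, -1};
static sigset_t asyncSet;

// Listener thread: waits for one async signal and forwards its number
// through the pipe instead of handling it in this thread.
extern "C" void* asyncSignalThread(void*) {
    int signo = 0;
    if (sigwait(&asyncSet, &signo) == 0) {
        ssize_t r = write(sigPipe[1], &signo, sizeof signo);
        (void)r;
    }
    return nullptr;
}

int demoSigwaitThread() {
    if (pipe(sigPipe) != 0) return -1;

    // Block the signal everywhere; threads created later inherit the mask,
    // so only sigwait() in the listener can consume it.
    sigemptyset(&asyncSet);
    sigaddset(&asyncSet, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &asyncSet, nullptr);

    pthread_t listener;
    if (pthread_create(&listener, nullptr, asyncSignalThread, nullptr) != 0)
        return -1;

    kill(getpid(), SIGUSR1);   // an external event, simulated locally

    // The main thread picks the signal up from the pipe at its convenience.
    int signo = 0;
    ssize_t n = read(sigPipe[0], &signo, sizeof signo);
    assert(n == static_cast<ssize_t>(sizeof signo));
    pthread_join(listener, nullptr);
    return signo;              // a real main thread would call handleSignal()
}
```

The main thread is never interrupted: it learns about the signal only when it reads the pipe, which is what keeps the handling in a single thread.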

5a.5 Conclusions and Future Direction

Clearly, interprocess communication in the D2D and data components of WFO-Advanced (FX-Advanced) is a complex issue. Even though we now have quite likely the clearest IPC implementation so far, it is still a daunting prospect to completely understand everything that is involved. That is, of course, the entire reason for this document, which describes the IPC system in its entirety.

In this document, we have covered the requirements and history of IPC in WFO-Advanced, including descriptions of the two predecessors to the current thread-based implementation. We learned how well each product worked and what features made the previous systems unacceptable. We also examined the current thread-based implementation and discussed its performance gains, its simplicity, its complexity, and its weaknesses.

By using an analogy of a business telephone system, we discovered what components existed in the thread-based implementation by correlating them with real-world objects. The analogy served as a focal point for the discussion of the implementation itself.

The majority of this document exposed the thread-based IPC implementation in great detail, enabling even the passive observer to understand and maintain the system. The discussion included addressing, event dispatching, message receipt, message transmission and metadata, synchronous and asynchronous sends, the use of threads, signal handling, and most of the classes and objects involved.

The WFO-Advanced IPC library continues to be open-ended and still allows for multiple concurrent transports, with threaded IPC coexisting seamlessly with any future transports. Any future work on the IPC library might include the addition of a new transport, but this is doubtful as the thread-based implementation satisfies all of our requirements quite well and in a highly advanced fashion. Future development on the thread-based implementation may evolve as vendors' thread libraries and kernel thread support improve. And although our use of threads doesn't involve concurrency, multiprocessor versions of UNIX may have some interesting (if minor) impacts on our library's performance. We look forward to better thread support as threaded programming becomes more and more the norm for software engineering.

 


This document is maintained by Joe Wakefield. Last updated 2 Oct 97.