Monitoring an Operational Weather Forecast System Using the World Wide Web

Joseph S. Wakefield (1), Ronald J. Kahn (2), and Brian L. Moore (3)
NOAA Forecast Systems Laboratory
Boulder, Colorado

Thirteenth International Conference on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, 2 - 7 February 1997, Long Beach, CA.


1. INTRODUCTION

For several years, the Forecast Systems Laboratory (FSL) has been working with the National Weather Service (NWS), developing and demonstrating prototype weather forecast support systems.

The installation of WFO-Advanced workstations at the Denver NWS Forecast Office (WSFO) in May 1996 represents a significant milestone in the NWS modernization. Based on functional specifications for the Advanced Weather Interactive Processing System (AWIPS), WFO-Advanced (MacDonald and Wakefield, 1996) supports essentially all diagnostic and forecasting operations at the Denver WSFO.

Components of WFO-Advanced include data ingest and management, user interface, display, and text generation. Each of these components needs to be monitored to ensure that the system operates as planned, providing the required support to WSFO operations. Experience gained in this monitoring effort can also be applied to the operation of the AWIPS Network Control Facility (Thigpen, 1996), whose responsibilities include remote monitoring of operations at AWIPS sites.

In this paper, we describe three aspects of WFO-Advanced monitoring. A data monitor (covering the ultimate success of the data ingest and management system), designed for both forecaster and developer use, is described in Section 2. In Section 3, we discuss the process monitor and restart mechanism, intended primarily for forecaster use. System performance monitoring (addressing the general status of the computer system), of more interest to systems administrators and developers, is outlined in Section 4.

2. MONITORING OVERVIEW

The principal user interface for WFO-Advanced monitoring is a World Wide Web (WWW or Web) browser. As with the forecaster workstations, a design goal is to minimize window manipulation to make the monitoring job as simple as possible. During critical weather situations, we need to minimize any chances that a complex interface could interfere with a forecaster's timely issuance of warnings and forecasts.

2.1 Monitor Layout

Shown in Figure 1 is a schematic of the WFO-Advanced Monitor display. In line with our "one-stop shopping" design philosophy, areas are set aside for notes and breaking news on various datasets. The latter, including items such as planned data outages, is maintained at present by FSL staff. In the future, this responsibility will be transferred to a forecast office focal point.

Figure 1.

2.2 Data Monitor

In order for our operational-forecaster audience to make effective use of data status information, we must present an appropriate level of detail. We have chosen to provide a two-level display, showing a summary of several related datasets on the main page, as shown in Figure 2, with more detail available via hyperlink. By clicking on any of the links shown in the figure, the user sees information on individual datasets. For example, the Point Data page shows the status for METAR, RAOB, lightning, profiler, and local surface data and the FSL Data page includes experimental datasets transmitted from FSL.

Figure 2.

The monitor checks individual data directories, finding the time of the most recent data in each case. The summaries display the "weak link" state of each set.
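As an illustration of the directory check described above, the following is a minimal sketch, not the WFO-Advanced code itself. It assumes that each dataset lives in its own directory and that file modification time is an acceptable stand-in for data time; the function names are hypothetical.

```python
import os
import time


def newest_mtime(directory):
    """Return the modification time (seconds since the epoch) of the most
    recently modified regular file in `directory`, or None if it has none."""
    newest = None
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                mtime = entry.stat().st_mtime
                if newest is None or mtime > newest:
                    newest = mtime
    return newest


def data_age_seconds(directory, now=None):
    """Return the age in seconds of the newest file in `directory`,
    or None if the directory holds no files."""
    latest = newest_mtime(directory)
    if latest is None:
        return None
    if now is None:
        now = time.time()
    return now - latest
```

A monitor built this way would call data_age_seconds once per data directory and pass the result to whatever time-out test applies to that dataset.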

2.2.1 Selecting Data to Monitor

Our choice of what to monitor and how to summarize for forecasters is based on both data source and type. We receive satellite imagery, NCEP grids, graphics, surface and upper-air observations, and other text over the AWIPS Satellite Broadcast Network (SBN); WSR-88D data from the local radar; local observations from a Local Data Analysis and Dissemination server; and experimental data from FSL. At the top level, the data monitor reports on appropriate broad categories.

2.2.2 Data States

As suggested above, the state of each dataset is reported based on its timeliness. Three states are used, with data-dependent time-out criteria. Via hyperlink, users can view these criteria for each dataset, as shown in the sample here.

Figure 3.

The example dataset, profiler plot, should be available each hour. Thus, it shows a green check mark if the latest data were received within the past hour, a yellow triangle if they are between one and two hours old, and an X in a red circle if more than two hours have passed since the last data were received.
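The three-state rule, together with the "weak link" summary used on the main page, might be sketched as follows. The one-hour and two-hour limits come from the profiler example; the names State, classify, and summarize are hypothetical, and the treatment of ages that fall exactly on a limit is an assumption.

```python
from enum import IntEnum


class State(IntEnum):
    """Dataset states, ordered so that a larger value is worse."""
    OK = 0       # green check mark
    LATE = 1     # yellow triangle
    MISSING = 2  # X in a red circle


def classify(age_minutes, warn_after=60, fail_after=120):
    """Map the age of the newest data to a state, using per-dataset limits.
    An age exactly on a limit is given the better state (an assumption)."""
    if age_minutes <= warn_after:
        return State.OK
    if age_minutes <= fail_after:
        return State.LATE
    return State.MISSING


def summarize(states):
    """Weak-link summary: a group is only as good as its worst member.
    `states` must contain at least one state."""
    return max(states)
```

For example, a group whose members are 30, 90, and 45 minutes old would be summarized as LATE, because one member has passed its one-hour limit.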

2.2.3 Information Pages

Information about each dataset's source and storage is also provided. Not shown here is a "who to call" section, including the NCF for SBN problems and FSL staff for local data.

2.3 Additional Features

Because we store all grids from a specific model run in one file, noting that the current model run is "in" is not sufficient: it indicates only that at least one grid, of perhaps thousands, has arrived. To provide more information on the completeness of the data, an inventory is performed on each file, and the percentage of grids actually present is shown with the check mark, in increments of 5%.
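The completeness figure can be illustrated with a short calculation. This sketch assumes that the expected grid count for a model run is known in advance and that the displayed value is rounded down to the 5% step; the paper does not state the rounding direction, and the function name is hypothetical.

```python
def percent_complete(grids_present, grids_expected, step=5):
    """Return the share of expected grids present, as a whole percentage
    rounded down to a multiple of `step` and capped at 100."""
    if grids_expected <= 0:
        raise ValueError("grids_expected must be positive")
    percent = (100 * grids_present) // grids_expected
    percent = min(percent, 100)
    return (percent // step) * step
```

With 870 of 1000 expected grids in the file, the display would read 85%; a run with a single grid present would read 0%, which distinguishes it from a complete run.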

3. PROCESS MONITORING

Data can be late for one of two broad reasons. The first is a failure somewhere in the data delivery system, and the second is a problem once the data arrive at the WSFO. To minimize problems of the latter variety, the data ingest system has a number of automatic fault handling and restart mechanisms built in. At times, however, these too fail, so we monitor most of our ingest processes, and provide this information to forecasters.

3.1 Monitor Layout

The Ingest Process Monitor checks 25 processes in six classes. In the example shown below, at least one of the Point processes is indicated as down. The user can select a hyperlink which shows the specific process in question. Clicking on the red X brings up a menu to allow the failed process to be restarted.

Figure 4.
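The class-level roll-up used by the process monitor (a class is flagged as down when any of its processes is missing) might look like the sketch below. The class and process names are invented for illustration, and how the set of running processes is obtained is outside the scope of the example.

```python
def processes_down(classes, running):
    """For each class, list its processes that are not currently running.

    classes: dict mapping a class name to a list of process names
    running: set of names of processes currently running
    """
    return {
        name: [proc for proc in procs if proc not in running]
        for name, procs in classes.items()
    }


def class_is_up(down_list):
    """A class is up only if none of its processes is down."""
    return len(down_list) == 0


# Hypothetical configuration: two of the monitored classes.
CLASSES = {
    "Point": ["metar_decoder", "raob_decoder"],
    "Radar": ["radar_ingest"],
}
```

The per-class list of missing processes is what a detail page would show, and it is also the natural input to a restart menu.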

Our monitor uses scripts to restart the workstation ingest processes. These scripts provide an interface that lets the forecaster choose which process(es) to restart.

3.2 Restart Mechanism

When the user clicks on the X (Fig. 4, above) indicating that the Point ingest is down, a restart menu appears, as shown in Figure 5. The user selects the subsystem(s) to restart and presses "Go!"

Figure 5.

4. SYSTEM PERFORMANCE MONITORING

Monitoring system performance is just as important as monitoring incoming data in an operational Weather Service office. Regular monitoring of system resources can help identify malfunctioning systems or software, and can also provide guidance for system sizing.

The WFO-Advanced monitoring system has been used by systems administrators, software developers, and management for these purposes. The monitor was built with SAR (System Activity Reporter), a tool included with most UNIX operating systems (including Hewlett-Packard's HP-UX), together with a standard Web server, Perl, and the PBM-Plus and gnuplot packages, all readily available on the Internet.(4)

4.1 Performance Data Collection

SAR samples cumulative activity counters in the operating system and periodically writes them to a disk file. Resources of interest for WFO-Advanced that can be monitored with SAR are CPU utilization, queue lengths, system tables, and activity of buffers, swapping, and block devices. Shell scripts that were supplied with SAR are used with a minor modification for archiving purposes.
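Because the counters are cumulative, a rate for any interval is the difference between two successive samples divided by the elapsed time. The sketch below illustrates that general technique only; it is not SAR's own code, and the sample layout and function name are assumptions.

```python
def interval_rates(samples):
    """Convert cumulative counter samples into per-interval rates.

    samples: time-ordered list of (timestamp_seconds, cumulative_count)
    Returns a list of (interval_end_timestamp, events_per_second).
    Intervals with no elapsed time, or where the counter went backwards
    (for example after a reboot), are skipped.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:
            continue
        rates.append((t1, (c1 - c0) / elapsed))
    return rates
```

Three samples taken 900 seconds apart with counts of 100, 1000, and 1900 yield a rate of 1.0 event per second for each of the two intervals.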

Data collection occurs on each data server, applications processor, and display workstation in the WFO-Advanced configuration. Striking a balance between reporting detail and system loading, we collect performance data averaged over 15-minute periods for each node. Data are stored for each CPU and disk present on the system, consuming 300 to 500 kilobytes per day for each WFO-Advanced host.

The WWW server gathers these data from each host and saves them for use by the performance monitor page. The script that copies these data also removes old data from those hosts. Currently, the server stores data for the past 30 days for each of seven hosts, for a grand total of approximately 100 MB.
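The storage total follows from the per-host figures. As a rough check, assuming 1 MB is taken as 1000 KB and that every host falls within the quoted 300 to 500 KB per day (the function name is hypothetical):

```python
def archive_size_mb(kb_per_host_per_day, hosts=7, days=30):
    """Approximate archive size in MB for a given daily volume per host."""
    return kb_per_host_per_day * hosts * days / 1000


low = archive_size_mb(300)   # lower bound of the quoted range
high = archive_size_mb(500)  # upper bound of the quoted range
```

The two bounds work out to 63 MB and 105 MB, so a total of roughly 100 MB corresponds to hosts near the upper end of the range.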

Each host's data files are stored in an appropriately named subdirectory on the WWW server. Once a day, the directory structure is scanned and a new main page for the performance monitor is generated. Newly monitored hosts are automatically added to the list of hosts with available performance data.
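The daily scan that rebuilds the host list could be sketched as follows. This assumes one subdirectory per host directly under a single root directory; the function names and the HTML fragment are illustrative only, not the page actually generated by the monitor.

```python
import html
import os


def list_hosts(root):
    """Return the sorted names of the subdirectories of `root`,
    each of which is taken to represent one monitored host."""
    with os.scandir(root) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


def render_host_list(hosts):
    """Render the host names as a simple HTML list."""
    items = "\n".join(f"<li>{html.escape(host)}</li>" for host in hosts)
    return f"<ul>\n{items}\n</ul>"
```

Because the list is derived from the directory structure on every run, a new host appears on the page as soon as its subdirectory exists, with no configuration change.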

4.2 Performance Data Requests and Display

A main menu page (Figure 6) enables users to request data from the data files collected by this system, and can be used from any WWW browser. No special features of any specific browser were used in the construction of the page.

Figure 6.

From the main page, users select the data to display, choosing as many hosts, dates, and datasets as they wish. Clicking on Show Data submits the request to the server, where a script generates a new page containing the requested data for each host. The data page includes appropriate identifying information, plus a back link and the time when the page was generated.

Data can be displayed in either a tabular or graphical format. The former is the raw output from SAR plus a legend to explain the abbreviations used.

For graphical display, gnuplot is used to create Portable Bitmap (PBM) format files. As seen in the sample in Fig. 7, the abscissa is time of day (UTC), with the actual collected data as the ordinate. Where data have no natural scale, the range is selected automatically by gnuplot.

Figure 7.

PBM-Plus filters are used to convert the graphics from PBM to Graphics Interchange Format (GIF) for display. Since these GIF files will not be displayed until the script has exited, the script cannot remove them. They and other temporary files used for constructing the displays are removed periodically by a cron script.
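A periodic cleanup of this kind can be approximated by removing temporary files older than some limit. The sketch below is an assumption-laden illustration: the file suffixes, the age limit, and the function name are hypothetical, and the actual cron script is not described in detail here.

```python
import os
import time


def remove_stale_files(directory, max_age_seconds, now=None,
                       suffixes=(".gif", ".pbm")):
    """Delete files in `directory` with one of the given suffixes whose
    modification time is more than `max_age_seconds` in the past.
    Returns the sorted names of the files removed."""
    if now is None:
        now = time.time()
    with os.scandir(directory) as entries:
        candidates = [e for e in entries
                      if e.is_file() and e.name.endswith(suffixes)]
    removed = []
    for entry in candidates:
        if now - entry.stat().st_mtime > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Choosing an age limit comfortably longer than any browsing session keeps the cleanup from deleting an image before the browser has fetched it.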

4.3 Interpretation of Performance Data

The data from SAR presented by this monitor are wide-ranging and can be interpreted in different ways. They can be used to show the general health of a host and utilization of system resources. This is also a useful tool for systems administrators and developers to assist in troubleshooting performance-related problems. The WWW-based monitor simplifies performance data retrieval, archival, and display for all levels of users.

SAR breaks CPU use into four states: user (time spent running user programs, including numeric and other calculations), system (time spent by the system executing kernel code on behalf of user programs, such as input/output (I/O) requests), I/O wait (time during which a process is waiting on a read from or write to physical memory or disk), and idle.

UNIX uses buffers in main memory to help improve disk I/O performance. Data written to disk by a user program will first be written to this cache, then later to physical disk. These data can be re-read from cache until the space is needed by other data. Similarly, when the system reads from disk, it uses an input cache, and will usually read more data than requested. Buffer caches are monitored by SAR, reporting the number of cache reads and writes, the number of disk reads and writes, and the cache hit ratio.
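The cache hit ratio can be derived from the other counts. The sketch below takes the ratio to be the fraction of logical (cache) requests that did not require a physical disk transfer; this definition is an assumption that should be checked against the local SAR documentation, and the function name is hypothetical.

```python
def cache_hit_percent(logical_requests, physical_transfers):
    """Return the percentage of logical requests satisfied from the cache,
    or None if there were no logical requests in the interval.
    The result is clamped at zero in case physical transfers exceed
    logical requests."""
    if logical_requests <= 0:
        return None
    hits = logical_requests - physical_transfers
    return max(0.0, 100 * hits / logical_requests)
```

For instance, 200 cache reads accompanied by 10 disk reads gives a read hit ratio of 95%, meaning that only one request in twenty had to wait for the disk.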

Block device activity (file system I/O) is also monitored by SAR. Each file system is monitored independently for the portion of time it is busy servicing requests, the average number of outstanding I/O requests for that file system, transfers and bytes per second to and from the file system, the average time transfer requests waited idly on queue, and the average time to service transfer requests. If these statistics show an imbalance in file system activity, the system administrator might wish to relocate some datasets in order to balance the load.

Another monitored operation is tty device activity: traffic to and from modems and terminals. The numbers of input and output characters and the interrupt rates are monitored.

System calls occur when user programs request system services. Parameters that are monitored include reads, writes, forks, execs, and the number of characters transferred by system calls to block (random-access storage) devices.

System swapping and switching activity is one of the most important parameters that can be monitored on a UNIX system. A process is swapped in, or moved to primary memory, when it is ready to be run by the CPU. It remains in memory until the system needs the space for use by another process, at which point the contents of the space allocated to the process are moved to swap space on disk. SAR monitors the number of swap-ins and swap-outs per second and the amount of data transferred during these swaps. It also monitors the number of process context switches. The context of a process is generally defined as its state, including values of user variables and data structures and machine registers. A context switch occurs when the UNIX kernel decides to run (execute in the context of) another process (Bach, 1986, p. 29). Excessive swapping is an indication that the machine is memory-starved, and overall performance suffers.

UNIX run and swap queue lengths are monitored by SAR. The run queue contains the processes that are either running or waiting to run. The swap queue is the list of processes that are swapped out but ready to be run. In addition to the average lengths of these queues, SAR also reports the percentage of time these queues are occupied.

Information on several internal UNIX tables is reported by SAR. The current and maximum sizes of the process table, representing the number of processes that may be run on the system at any time, are displayed. When the process table is full, the system will not start any new processes until the ones that are currently running exit. The inode table is a cache of data structures that contain information about files in filesystems, and is monitored for its size. When the inode table fills, the system removes older entries and replaces them with new ones. The file table limits the number of files that can be open at any given time. When it fills, subsequent opens on files fail until files that are currently open are closed and their table entries released. SAR also reports the number of times each of these tables filled (Loukides, 1990, pp. 74-75).

Messages and semaphores are System V interprocess communications facilities, and the number of calls to these is monitored by SAR.

5. CONCLUDING REMARKS

FSL has made effective use of the World Wide Web in an intranet mode to monitor the state of data ingest (both processing and data availability) and the overall activity of the WFO-Advanced system. We will continue to refine the displays based on operational experience and feedback from forecasters.

6. REFERENCES

Bach, M. J., 1986: The Design of the UNIX Operating System. Prentice-Hall, Inc., 471 pp.

Loukides, M., 1990: System Performance Tuning. O'Reilly & Associates, Inc., 312 pp.

MacDonald, A. E., and J. S. Wakefield, 1996: WFO-Advanced: An AWIPS-like Prototype Forecaster Workstation. Preprints Twelfth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Atlanta, Amer. Meteor. Soc., 190-193.

Thigpen, R. K., 1996: The AWIPS Network Control Facility, An Introduction. Preprints Twelfth International Conference on Interactive Information and Processing Systems for Meteorology, Oceanography, and Hydrology, Atlanta, Amer. Meteor. Soc., 528-530.


Footnotes

(1) Corresponding author address: Joseph S. Wakefield, NOAA/ERL/FSL R/E/FS4, 325 Broadway, Boulder, CO 80303-3328.
(2) Joint collaboration with the Cooperative Institute for Research in the Atmosphere, Colorado State University, Fort Collins, CO 80523.
(3) Contract with Gonzales Consulting Services, Denver, CO 80203.
(4) Perl is by Larry Wall; PBM-Plus is copyright 1989 by Jef Poskanzer; and gnuplot was written by Thomas Williams and Colin Kelley, with enhancements by others.

This document is maintained by Joe Wakefield.
Last updated 3 Oct 97