D2D Operations Info

v fxa-3.01

Audience

This document is directed at FSL and WSFO staff who may be called upon to diagnose problems with the WFO-Advanced data ingest, internal communications, and display software at the Denver WSFO. The operational staff at the WSFO has higher-level monitoring and restart tools available that are not described here.

Support

During the first months of WFO-A operation, FSL is closely monitoring the Denver systems, with staff on call 24 hours per day. For data and workstation problems, Joe Wakefield (448-2456), Darien Davis (448-2458), Ron Kahn (448-2449), Susan Williams (448-2457), Carl Bullock (448-2464), or Frank Tower (one of the above numbers) will be on call. For system/hardware/network problems, Gregg Phillips can be paged at 448-2448. [Note for Boulder folks: use 361-0674 to get hold of Denver forecasters; 361-0673 for Bob and Greg.] FSL operators are on duty daily from 7 a.m. to 10:30 p.m. and can be reached at 497-6887.

Our goal is to make the Denver office operate in an AWIPS-like fashion as far as system support is concerned. Of course, we aren't using AWIPS, and recognize that this goal will be difficult to reach.

Environment

When you become fxa on dendata1, several scripts are automatically run to set a number of environment variables, etc. The .cshrc script sets this process in motion. Settings of interest include FXA_DATA (/data/fxa on dendata1, /data/fxab on dendata2), FXA_HOME (/usr/local/fxa), and TZ (GMT).

All D2D processes are found in ~fxa/bin, and data files (tables, menus, WarnGen templates, etc.) are in ~fxa/data. Most ingest logs are in $LOG_DIR/yymmdd, with a few in $LOG_DIR. Display logs are on disks local to the workstation in /logs/display/[:0.0|:1.0]/yymmdd/session-hhmmss (with the latter inexplicably in local time). (On workstations, LOG_DIR is defined as /logs.)

Since FXA_HOME/bin is in fxa's PATH, there's no need to include that when entering process commands, and that's reflected in the commands included in these instructions. All commands that you'll need to enter are shown in bold type. Except as noted, all will be run from the fxa account on dendata1 or dendata2. (In general, all references to dendata1 apply equally to dendata2.)

You can get to today's ingest log directory simply by typing logs, and up will get you to its parent, where some logs live. The naming convention for ingest logs is <processName><pid>dendata1<hhmmss>.
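
The directory layout described above can be written as a small helper. This is a minimal sketch in POSIX sh, not part of the fxa scripts: the function name ingest_log_dir is hypothetical, and it assumes that LOG_DIR is set as described and that the yymmdd directory names follow UTC (fxa runs with TZ set to GMT).

```shell
#!/bin/sh
# Hypothetical helper: print the ingest log directory for a given day.
# Usage: ingest_log_dir [yymmdd]   (defaults to today's date in UTC)
ingest_log_dir() {
    day=${1:-$(date -u +%y%m%d)}
    printf '%s/%s\n' "${LOG_DIR:?LOG_DIR is not set}" "$day"
}
```

For example, cd "$(ingest_log_dir)" would do roughly what the logs shortcut does, and ls "$(ingest_log_dir 970315)" would list the logs for an earlier day, if that directory has not been purged.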

Overview

The diagram below outlines the flow of messages through the WFO-A ingest system. Following sections discuss each interface in detail.

General data ingest

All ingest processes are started automatically when dendata1 is booted. Should it be necessary to restart, use stopIngest and startIngest.

Processes included in stop/startIngest, in the order they are started:

/usr/local/fxa/bin/acqserver 900
  /usr/local/fxa/bin/acqserver 900
  /usr/local/fxa/bin/acqserver 900
/usr/local/fxa/bin/CommsRouter COMMS_ROUTER
/usr/local/fxa/bin/CommsRouter GRID_ROUTER
/usr/local/fxa/bin/DataController COMMS_ROUTER TextCont.config
  /usr/local/fxa/bin/MetarDecoder
  /usr/local/fxa/bin/RaobDecoder
  /usr/local/fxa/bin/profilerDecoder
/usr/local/fxa/bin/DataController COMMS_ROUTER TextCont2.config
  /usr/local/fxa/bin/AlertDecoder
  /usr/local/fxa/bin/binLightningDecoder
  /usr/local/fxa/bin/CdotDecoder
  /usr/local/fxa/bin/shefEncoder
/usr/local/fxa/bin/DataController COMMS_ROUTER TextDB_Controller.config
  /usr/local/fxa/bin/CollDB_Decoder
  /usr/local/fxa/bin/StdDB_Decoder
  /usr/local/fxa/bin/RadarTextDecoder
/usr/local/fxa/bin/DataController GRID_ROUTER GribController.config
  /usr/local/fxa/bin/GribDecoder
/usr/local/fxa/bin/DataController COMMS_ROUTER SatelliteController.config
  /usr/local/fxa/bin/Satdecoder
/usr/local/fxa/bin/DataController COMMS_ROUTER RadarController.config
  /usr/local/fxa/bin/RadarStorage
/usr/local/fxa/bin/notificationServer
/usr/local/fxa/bin/RadarServer
/usr/local/fxa/bin/DialServer
/usr/local/fxa/bin/RadarMsgHandler
/usr/local/fxa/bin/watchFreeway
The stop/startIngest scripts handle the non-indented items in the list. Indented items are children spawned by the process listed immediately above.
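
As a rough illustration, the non-indented entries in the list above can be checked against a process listing. The sketch below is an assumption-laden example, not one of the fxa tools: the function name missing_ingest_procs is hypothetical, and it only tests that at least one instance of each top-level program is present (it does not count the separate CommsRouter or DataController instances).

```shell
#!/bin/sh
# Report which top-level ingest programs are absent from a process listing.
# Reads the listing (for example, the output of `ps -ef`) on standard input.
# Program names are taken from the start-up list; the function is hypothetical.
missing_ingest_procs() {
    listing=$(cat)
    for name in acqserver CommsRouter DataController notificationServer \
                RadarServer DialServer RadarMsgHandler watchFreeway; do
        printf '%s\n' "$listing" | grep -qF "/usr/local/fxa/bin/$name" ||
            printf '%s\n' "$name"
    done
}
```

A typical call would be ps -ef | missing_ingest_procs; no output means every top-level program was found at least once.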

Should you restart the ingest and still receive no SBN data, check on the acqserver processes (proc acq). There should be three: a parent and two children, one handling the TG data and the other NESDIS. The TG child should connect almost immediately, while the NESDIS child may take a few minutes. If there are not three, and you have full data coming in on the other db machine, the system is not connecting to the SBN CP. Restart the NRS CP (dencp1 or dencp2), per the NOAAPort section. (In a partial-failure situation you might see only text (and thus METAR) or only satellite data arriving, and maybe only two of the acqserver processes. In that case too, it will be necessary to restart the CP.)
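
The acqserver count check can be sketched as follows. This is only an illustration: the function name acq_state and its messages are hypothetical, and the count alone cannot tell a failed NESDIS connection from one that is still coming up, so allow a few minutes before acting on a count of two.

```shell
#!/bin/sh
# Classify SBN acquisition state from a process listing on standard input.
# Three acqservers (the parent plus the TG and NESDIS children) is healthy.
# Hypothetical helper; the wording of the messages is illustrative only.
acq_state() {
    n=$(grep -c 'acqserver' || true)
    case $n in
        3) echo "ok: parent and both children present" ;;
        0) echo "down: no acqserver running" ;;
        *) echo "partial: $n acqserver process(es); check the CP" ;;
    esac
}
```

It could be fed with ps -ef | acq_state on dendata1 or dendata2.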

Sometimes the GRIB decoder will hang on bad grids. (We've not seen this problem for a long time.) You'll see this by the GribDecoder process using lots of CPU time for extended periods, and a check on the log will show nothing happening. Issue kill -10 <pid> to force a crash. The signal handler will remove the bad grid and the controller will start a new decoder.

If you get a call that radar is not auto-updating, you'll probably need to restart the notificationServer. When you use stopNotificationServer to kill the server, it may take some time to write out its client list. Make sure you give it a chance (check the log to see if it's heard the signal 15) before using kill -9. Otherwise, when the server is restarted, the workstations won't receive green time and auto-update messages until they, in turn, are restarted. After it's stopped, use startNotificationServer to get it running again.
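
The advice to give the server time before resorting to kill -9 can be sketched as a polling loop. This is a minimal example under stated assumptions: the function name wait_for_exit and the 60-second default are hypothetical, and kill -0 is only a reliable test when run by the owner of the process (fxa here).

```shell
#!/bin/sh
# Wait up to a given number of seconds for a process to exit.
# $1: pid; $2: timeout in seconds (default 60, an arbitrary choice).
# Returns 0 once the pid is gone, 1 if it is still present at the timeout.
wait_for_exit() {
    pid=$1
    remaining=${2:-60}
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

After stopNotificationServer, something like wait_for_exit "$pid" 120 || echo "still running; check the log before using kill -9" keeps the escalation a deliberate, manual step.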

NOAAPort ingest

The bulk of our datasets are received over the Satellite Broadcast Network (SBN) via the NOAAPort Receive System (NRS) communications processors. Please note that dencp1 (but not dencp2) is monitored by the AWIPS Network Control Facility (NCF), which is also responsible for their maintenance. There is a switch box near dencp1 that must usually be in `Modem' position, so NCF operators can check on its operation.

If SBN data (satellite, METARs, text, grids) are not arriving, check the CP operation, to see if it's hung:

  1. rlogin to dencp1 or dencp2 as user root. (Note: if you need to log in at the console, you'll need to move the CP switch to the Monitor position.)
  2. Type acq_stats -k0 -k1 -i3 to run the acquisition monitor (use ctrl-C to exit). If one or both lines is not up to date, you'll need to restart. Other problem indicators are lots of buffers or distribution headers in use. (If both CPs have stopped at the same time, it's likely that there's an uplink problem at the NCF, or there could be a downlink problem. Check with NCF (301-713-1284) before restarting.)
  3. Stop the system with acq_ctl -A -S -f.
  4. Type ps -xaf to see what processes are running. Kill any /awips/bin/acq* that's running.
  5. Start with start_cpsbn_acq. A lot of text will scroll by as the software is downloaded from dendata1/dendata2. In many cases, you'll need to push Enter to get your prompt back.
  6. Monitor the system again with acq_stats -k0 -k1 -i3; you should see the TG line connect within a few seconds, though the NESDIS line may take several minutes.
  7. Log out (exit) (and switch back to Modem if at the console).
The child acqservers will go down when you stop the CP, then will come back as data are sent. proc acq should show 2 acqservers almost immediately.

If this doesn't work, check the Sync and Signal green lights on the demod. If these are out, you'll need to contact the NCF for information. (This is unlikely, as the NCF monitors that portion of the system.) If the signal looks good, but you can't connect, you may need to reboot the CP. Log in and enter /etc/reboot. Ingest processes start automatically. (If you can't log in, you can press the reset button that's just above the CP's power switch at lower right. The system will reboot itself. Using the reboot command is preferred.)

If necessary, either CP can be configured to send data to both data servers. Follow this procedure:

  1. Log onto the CP that is running.
  2. cd /awips/data
  3. vi acqparms.sbn
  4. Make sure that the last two lines of the file define HOST 0 and HOST 1. On dencp1, HOST 0 should be dendata1 and HOST 1 should be dendata2. On dencp2 HOST 0 should be dendata2 and HOST 1 should be dendata1. If they are not defined correctly in the file, make the appropriate changes at this time.
  5. Change the following two lines in the file from
    ENABLE GOES_WEST 0
    ENABLE NWSTG 0
    to:
    ENABLE GOES_WEST 0,1
    ENABLE NWSTG 0,1
  6. Exit the file, saving the changes, then run the following command:
    acq_setupshm_dist -p /awips/data/acqparms.sbn
  7. Log off of the CP.
To prevent problems, go to the other CP's host (dendata1 or dendata2) and sudo vi /awips/hprt/data/acqparms.sbn to disable ingest (just comment out the two ENABLE lines). Otherwise, when the failed CP does come back, it will immediately try to send data to its default data server. That will hose the acqserver there. Once the CP is back up and stable, follow the above procedure first to remove the backup service, then to re-enable normal ops.
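
The edit in step 5 of the procedure above can also be expressed with sed, which makes it easy to preview the change before touching the file. This is a sketch only: the function name enable_both_hosts is hypothetical, and the patterns assume the ENABLE lines look exactly as shown in step 5, apart from whitespace.

```shell
#!/bin/sh
# Rewrite the two ENABLE lines so that data go to both HOST 0 and HOST 1.
# $1: path to a copy of acqparms.sbn.
# The edited text goes to standard output; the input file is not modified.
enable_both_hosts() {
    sed -e 's/^\([[:space:]]*ENABLE[[:space:]]\{1,\}GOES_WEST[[:space:]]\{1,\}\)0[[:space:]]*$/\10,1/' \
        -e 's/^\([[:space:]]*ENABLE[[:space:]]\{1,\}NWSTG[[:space:]]\{1,\}\)0[[:space:]]*$/\10,1/' \
        "$1"
}
```

Comparing the output with the original (for instance with diff) shows exactly which lines would change. Lines already set to 0,1 are left alone, and the acq_setupshm_dist command in step 6 is still needed for the change to take effect.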

The diskless SBN CPs boot off of the data servers (cp1 from dendata1, cp2 from dendata2). One side effect of this is that ingest log files are available on the data server disks, in directory /awips/hprt/logs/Products/dencpn/acq_clntm_h0/mcProduct.log, where m is 0 for TG data and 1 for NESDIS. The system breaks these logs when they hit 1MB size (keeping a previous version called mcProduct.old), so there may not be a whole lot of history available (particularly for the TG side), but these can be useful in diagnosing missing data.
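
The log location can be built from the CP number and the channel. The sketch below is illustrative: the function name cp_product_log is hypothetical, and it assumes that n in dencpn is the CP number (1 or 2) and that only the m in acq_clntm varies, with mcProduct.log being a fixed file name.

```shell
#!/bin/sh
# Print the path of a CP product log on the data server.
# $1: CP number (1 or 2); $2: channel (0 = TG, 1 = NESDIS).
# Hypothetical helper; the path pattern is the one quoted in the text.
cp_product_log() {
    printf '/awips/hprt/logs/Products/dencp%s/acq_clnt%s_h0/mcProduct.log\n' "$1" "$2"
}
```

For example, tail "$(cp_product_log 1 0)" on dendata1 would show the most recent TG entries logged by dencp1.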

Radar ingest

A Simpact Freeway box, denfrwy1, handles the low-level comms from the RPG, making the data understandable by dendata1. A second box, denfrwy2, stands ready as a backup, but normally is not used.

syncComms is a script that runs wfoApi, which handles the transfer of data between the freeway and dendata1. Files are stored temporarily in $FXA_DATA/radar/raw and /text. Files in /raw are moved by RadarStorage to the appropriate product directories, e.g., /kftg/Z or V. The /text files are processed by the RadarTextDecoder process; output goes to the text database (e.g., WSRVWPFTG).

Radar ingest processes also include the RadarServer and the DataController/RadarStorage pair. The former communicates via the wfoApi process with the RPG over an X.25 link, while the latter are responsible for storing radar products as they are received.

  1. On dendata1, a cron job runs /usr/local/fxa/bin/x25_restart0 every minute, checking whether the ingest (syncComms & wfoApi) is up and starting it if necessary. If a clean shutdown of the WFO-A connection (at the UCP) is performed, this process will work properly, restarting the ingest once the connection is re-established. If, however, the RPG dies or the connection to WFO-A is pulled without doing a clean shutdown, it will be necessary to run x25_stop0.
  2. If the PUP and DARE are getting data, but WFO-Advanced is not, you can stop the radar ingest and let it restart itself. Execute
    /usr/local/fxa/bin/x25_stop0
    General Status Messages (GSMs) can be used to check on the status of the 88D. This can be checked either from the workstation `radar status' window, or by running DecodeGSM kftg > $FXA_DATA/logs/GSM.<97mmdd_hhmm> to get a list of recent GSMs (in the specified file). (This program decodes and removes files in $FXA_DATA/radar/kftg/GSM.)
  3. If Step 2 does not work, first make sure that the RadarServer is running. If it is, and you've restarted the radar ingest, then the freeway may need to be rebooted, using /usr/local/freeway/bin/icpReset0. If that procedure does not succeed, and tells you to reboot manually, follow these steps:
    1. rlogin to denfrwy1
      user: freeway
      password: password
    2. Select 1 (shutdown options)
    3. Select 2 (reboot server)
      You will see the message
      System Reboot in Progress...
      You must hit Enter on the keyboard to return to your dendata1 session.

    Wait for about 60 seconds and run /usr/local/freeway/bin/icpReset0. This will do the following:
    1. kill wfoApi and syncComms running on all ports on ICP0
    2. reset ICP0
    3. run the x25_manager to reconfigure buffers and circuits

    A series of messages will be written to your screen, including `Rebooting the Freeway'. The last message will be `Buffers initialized'.

If for some reason the x25_manager cannot configure buffers, it will try five times, printing out a message each time saying `Buffers not yet initialized, so retry'. If the x25_manager fails after five times, you can run the x25_manager yourself as follows:

cd /usr/local/freeway/bin
x25_manager < fw_init

If the following lines do not appear, you will need to repeat the above command until buffers and circuits are configured.

Please note that the RadarServer process must be running in order to send the RPS list and get data. The radar ingest (syncComms & wfoApi) will start but will not stay up if the RadarServer is down. RadarServer is started as part of startIngest.
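
The "repeat until configured" advice for x25_manager amounts to a bounded retry loop, sketched below. The function name retry and the RETRY_DELAY variable are hypothetical. The sketch also assumes that the command's exit status reflects success, which is not stated for x25_manager, so its output should still be checked by eye.

```shell
#!/bin/sh
# Run a command until it succeeds, up to a maximum number of attempts.
# $1: maximum attempts; remaining arguments: the command to run.
# RETRY_DELAY (seconds between attempts, default 5) is a hypothetical knob.
retry() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        "$@" && return 0
        echo "attempt $attempt of $max failed" >&2
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
    return 1
}
```

A possible use would be retry 5 sh -c 'cd /usr/local/freeway/bin && x25_manager < fw_init', with the caveat about exit status noted above.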

LDM ingest

While most datasets used by D2D are delivered by the SBN, MAPS grids and NOWrad data are relayed from FSL via Unidata's Local Data Manager (LDM). In addition, WSR-88D data are passed from the dendata1 ingest to dendata2 by LDM. LDM processes include

rpc.ldmd -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ld
  (5 identically-named children)
  pqact
  pqexpire
  feedhere -f FSL -v -p ^FSL\.CompressedNetCDF|FSL\.GRIB.H.X -
/usr/local/fxa/bin/ldmBridge -b ldmBridgeGrid -h dendata[1 or 2].fs
/usr/local/fxa/bin/ldmBridge -b ldmBridgeRadar -h dendata[1 or 2].f
LDM ingest is managed with stopLdm and startLdm. However, in many cases, only the Bridge processes have problems. Check on $FXA_DATA/Grid/FSL/Raw. If there are many grids there, use stopBridges.scr and startBridges.scr. The decoder will catch up.

If you can't get LDM working, here's something to check. With LDM shut down, enter rpcinfo -p. If you see one or more lines beginning 300029 at the bottom, type sudo rpcinfo -d 300029 5 to remove this open socket, then startLdm again.
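
The rpcinfo check can be sketched as a filter that turns any leftover 300029 registrations into the matching removal commands. The function name stale_ldm_registrations is hypothetical; the sketch assumes the usual rpcinfo -p column order (program, version, protocol, port) and only prints the commands rather than running them.

```shell
#!/bin/sh
# From `rpcinfo -p` output on standard input, print one `rpcinfo -d`
# command for each program/version pair registered under 300029.
# Hypothetical helper: it prints commands and does not execute them.
stale_ldm_registrations() {
    awk '$1 == 300029 { print "rpcinfo -d " $1 " " $2 }' | sort -u
}
```

With LDM shut down, rpcinfo -p | stale_ldm_registrations lists what would need to be run (with sudo) before startLdm.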

Text ingest and database

The text database system is also managed separately from the general startIngest and stopIngest. Text products are stored in an Informix database.

A number of informix processes are normally running on dendata1, all identified as `oninit' in a proc informix listing. Usually, only the parent process will consume significant amounts of CPU time.

If you find Informix down, it probably shut itself down due to an error. As both text and hydro use Informix databases, the problem is not necessarily a text one. To check on this, become informix and type logs, to get to the text database logs directory. Look at the end of online.log1. Lines like this:

  15:27:48 Assert Failed: WARNING! Incorrect BLOB stamps.
  15:27:48 Who:Session(8, fxa@fsldata1.fsl.noaa.gov, 1821, -1059350808) Thread(31,sqlexec, c0d98948, 1)
  15:27:48 Results: BLOBSpace textblobspace, BLOB addr: 0xa0be14, BLOB stamp 25317
indicate a corrupted text database. If you see errors other than textblobspace here, the failure is related to the hydro database. In this case, the text database is OK, and you need only stopTextDB, restore the hydro database with restoreHydro, and startTextDB (with appropriate becomes) to get going again.
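
The text-versus-hydro distinction can be sketched as a log filter. This is an illustration only: the function name classify_informix_failure is hypothetical, the match on "BLOBSpace textblobspace" is based on the sample lines above, and the format of hydro-related failures is not known here, so anything else is simply reported as hydro.

```shell
#!/bin/sh
# Classify Informix assertion failures in online.log1 (read from stdin).
# Prints "text" if a failure names textblobspace, "hydro" if failures
# appear but none name textblobspace, and "none" if there are no failures.
classify_informix_failure() {
    awk '
        /Assert Failed/           { failed = 1 }
        /BLOBSpace textblobspace/ { text = 1 }
        END {
            if (!failed)   print "none"
            else if (text) print "text"
            else           print "hydro"
        }'
}
```

Since only the end of the log matters, it would normally be fed a recent slice, for example tail -50 online.log1 | classify_informix_failure (the 50 is an arbitrary choice).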

The workstation uses 4 processes to communicate with the text database, to wit:

  /usr/local/fxa/bin/TextDB_Server -Write
  /usr/local/fxa/bin/TextDB_Server -Read
  /usr/local/fxa/bin/textdb
  /usr/local/fxa/bin/textdbRemote
The first two of these, along with the AFOS comms server and SHEF decoder, are started and stopped by the startTextDB and stopTextDB scripts. Another script, stopTextNotification, will stop the textNotificationServer (it's started, if necessary, by startTextDB). We prefer not to stop it, because doing so necessitates restarting all text workstations to get alarm/alert notices. The others, textdb and textdbRemote, run as needed to read/write the database. (The former communicates directly with the database, while the latter goes through the read/write server.)

Managing the text database requires care, because of the nature of the database software. In particular, it's not safe simply to kill the write server, as it may be in the middle of a transaction, and the text database could get corrupted. Thus, stopTextDB issues a KILLSERV command to the text database.

If stopTextDB/startTextDB does not clear up text storage/retrieval problems, there may be something wrong with Informix. In that case,

  1. Shut down the Read and Write servers with stopTextDB.
  2. become informix, then ./stopInformix and ./startInformix. The latter includes a status check on Informix. You should see something like this:
    INFORMIX-OnLine Version 7.12.UC1 -- On-Line -- Up 00:32:55 -- 14056 Kbytes
    The numbers don't matter much. What you're looking for is On-Line or Fast Recovery in the middle. If it says anything else try the commands again.
  3. Become fxa again and bring the Read and Write servers back up with startTextDB.
You may find database errors (things like `database update error: -346' or `database insert error -239') in the TextDB_Server logs. (Use finderr <nnn> to see information on these error codes.) This can probably be cleared up by issuing onmode -l (as informix), then become fxa and stop/startTextDB.
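
The status check in step 2 above can be sketched as a test on the banner line. The function name informix_is_up is hypothetical, and the sketch assumes the banner always has the " -- " separated layout shown in step 2, with the mode in the second field.

```shell
#!/bin/sh
# Check an Informix status banner supplied on standard input.
# Succeeds if the mode field (second " -- " separated field of the first
# banner-like line) is On-Line or Fast Recovery.
informix_is_up() {
    mode=$(awk -F' -- ' 'NF > 1 { print $2; exit }')
    case $mode in
        "On-Line"|"Fast Recovery") return 0 ;;
        *) return 1 ;;
    esac
}
```

Piping the banner into informix_is_up gives a simple yes/no answer; any other mode means the stop/start commands should be tried again, as described above.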

If the database is corrupted (usually as a result of a system crash), it will be necessary to restore it from a backup or another database.

Method 1: The Informix database (text and hydro) is backed up daily (at 0630Z on dendata1, 0830Z on dendata2). If it's not too long after archive time, the easiest thing to do is restore from archive:

  1. Shut down the Read and Write servers with stopTextDB.
  2. become informix and type logs. Look at the end of online.log1. You should see lines like this:
    02:30:12 Level 0 Archive started on rootdbs, textblobspace, ldadblobs, textdbs, textdbs2, textdbs3, textdbs4, textdbs5, ldad, wfodendbs
    02:39:16 Archive on rootdbs, textblobspace, ldadblobs, textdbs, textdbs2, textdbs3, textdbs4, textdbs5, ldad, wfodendbs Completed.
    
    (indicating a clean archive) before any lines that read
    15:27:48 Assert Failed: WARNING! Incorrect BLOB stamps.
    15:27:48 Who:Session(8, fxa@fsldata1.fsl.noaa.gov, 1821, -1059350808) Thread(31, sqlexec, c0d98948, 1)
    15:27:48 Results: BLOBSpace textblobspace, BLOB addr: 0xa0be14, BLOB stamp 25317
    
    (These latter are the indication of your corrupted database. Note: If you see errors other than textblobspace here, the failure is related to the hydro database. In this case, the text database is OK, and you need only stopTextDB, restoreHydro, and startTextDB (with appropriate becomes) to get going again.) If you don't have a clean archive, or if it's been many hours and you don't want to lose intervening data, you'll have to use one of the other methods. Skip past steps 3 & 4 for more fun!
  3. cd, then ./restore to restore the database. Press Enter when asked to mount tape 1, answer y to the continue restore? question, and n to the rest. This will take around 15 minutes, at the end of which you should see On-Line status echoed on your display.
  4. become fxa and startTextDB.
  5. If you have difficulty, check the Informix log ($FXA_DATA/logs/informix/online.log1). If you see mention of quiescent mode, become informix and stop/startInformix.
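
The clean-archive test in step 2 of Method 1 can be sketched as a check on line order in the log. The function name has_clean_archive is hypothetical. The sketch only looks at ordering, not at how old the archive is, and it stops at the first assertion failure it sees, so it should be given just the recent part of online.log1.

```shell
#!/bin/sh
# Decide whether a log excerpt (on stdin) shows a completed archive that
# precedes the first assertion failure.
# Exit status 0 means such an archive line was found; 1 means it was not.
has_clean_archive() {
    awk '
        /Archive on .* Completed\./ { ok = 1 }
        /Assert Failed/             { exit }
        END                         { exit (ok ? 0 : 1) }'
}
```

A failing result points toward Method 2 or Method 3; a passing one still leaves the judgment about how many hours of data would be lost.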
Method 2: If you don't have a clean archive, but do have a good database on another machine (e.g., dendata2, cmsdata1, fsldata1), you can make a backup, copy, and restore. (Note that you can't use the same archive method or an archive from another system, because the archive carries host information with it and is non-transferable.) Plan on spending an hour on this procedure.

  1. Start unload
    1. Go to a system that is not corrupted (except bluejay), become informix, and find a disk that has at least 50 MB free (usually /data/fxa-2).
    2. Enter touch <path>/fxatext.out. (If you get an error, become fxa and set appropriate protections on /data/fxa-2/.)
    3. Enter onunload -t <path>/fxatext.out fxatext.
    4. You will get a prompt of
      Please mount tape and press Return to continue ...
      Just press Return.
  2. Prepare corrupted system while onunload is running
    1. Make sure that the TextDB_Server -Read and -Write are shut down with the stopTextDB script as fxa.
    2. become informix and go into dbaccess by typing dbaccess
    3. Do the following commands:
      1. Type d to go to the Database Menu option.
      2. Type d again to select Drop. You should see fxatext, sysmaster, and wfoden in the list. If not, you'll need to rebuild the database from scratch, as outlined in `Database Restoration Instructions' (~fxa/doc/userGuides/DBrestoration).
      3. Choose fxatext@ONLINE with the arrow keys or type in fxatext at the prompt.
      4. Informix will give you ONE chance to verify that this is the database that you want to drop. If fxatext is chosen, hit y for yes; if any other database is chosen, hit n for no and start from step 2 again.
      5. Hit e until dbaccess exits.
    4. At the command line type onspaces -d textblobspace. You will get a prompt for verification of the blobspace to be dropped and, after pressing y, a statement that there will have to be a Level 0 archive before any of the space can be reused.
    5. cd ~/etc and vi onconfig. Find the TAPEDEV variable and comment out the currently active TAPEDEV. Uncomment the line that says: #TAPEDEV /dev/null
    6. Save and exit and enter ontape -s. This will run a Level 0 archive and send the archive to /dev/null.
    7. Re-enter the onconfig file and change the TAPEDEV variable back to the old value.
    8. Issue the command onmode -l (letter l) to move to the next logical log file and make it possible to connect to the instance of Informix.
    9. cd and grep blob spacesetup. You'll get
      onspaces -c -b textblobspace -g 2 -p /dev/informix-1 -o 550000 -s 200000
    10. Run that command to recreate the blobspace.
  3. Move fxatext.out file and restore database
    1. When onunload is finished running on the uncorrupted system, ftp the file to the corrupted system, wherever you find room for it.
    2. cd and run the command
      onload -t <path>/fxatext.out -d textdbs5 fxatext
      Note: be sure to use the full path name of the file, even if you're in the directory. Informix will prompt you to mount the tape and press return, so just press return. Then it will ask you if you want to relocate any of the blob spaces; answer n.
    3. If you get errors like this:
      ISAM error: illegal argument to ISAM function.
      Error building TBLspace.
      in step 2, do the onload again, but this time answer y, then enter textblobspace. This will take a bit longer, but should work.
  4. Run ./archive.sh to create a clean archive of the restored database. (Archive requires the pre-existence of the archive.tape file. If it's missing, you'll get an error. A simple touch will do it.)
  5. When the load is finished, restart the Read and Write servers and delete the fxatext.out files on both the good and bad machines.
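
The free-space requirement at the start of Method 2 (at least 50 MB for fxatext.out) can be checked before starting onunload. This is a minimal sketch: the function name has_free_mb is hypothetical, and it assumes a df that accepts the POSIX -P and -k options.

```shell
#!/bin/sh
# Succeed if the file system holding a path has at least N megabytes free.
# $1: path; $2: required space in MB (Method 2 asks for 50).
# Relies on the POSIX `df -Pk` layout, where column 4 is available KB.
has_free_mb() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 )) ]
}
```

For example, has_free_mb /data/fxa-2 50 || echo "not enough room for fxatext.out" could be run on both the good and the corrupted system, since the file ends up on both.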
Method 3: If you don't have a clean archive, and your only good database is on a machine (such as bluejay) with a different database structure, then you're going to have to extract the data and insert it into an empty database you build. Once again, plan to spend an hour on this procedure.
  1. Start unload on uncorrupted system
    1. become informix on uncorrupted system and type dbaccess to run dbaccess.
    2. Enter the following commands:
      1. Hit q for Query Language.
      2. Select fxatext@ONLINE with arrow keys or type fxatext.
      3. Hit c for choose and select `unloadtext'.
      4. If the system being unloaded does not indicate the correct path, hit u for Use Editor, then hit return to get to vi and change the directory to the appropriate corresponding name. Save and exit vi to get back to dbaccess.
      5. Hit r for Run to start the data unload.
      6. Once the rows are unloaded, hit e twice to exit.
  2. Prepare corrupted system
    1. Do all of the steps (1 - 10) from Method 2 section 2.
    2. Go back into dbaccess to recreate the database.
    3. Run the following commands:
      1. Hit d for Database.
      2. Hit c for Create.
      3. Type in fxatext at the prompt and hit return.
      4. Hit d to choose the dbspace to put the database.
      5. Type textdb at the prompt and hit return.
      6. Hit e to exit dbspace section.
      7. Hit c to choose Create Database.
      8. Hit e to exit Database section.
      9. Hit q to go to Query Language.
      10. Hit c for Choose.
      11. Select fxatext with the arrow keys or type fxatext at the prompt.
      12. Hit r to run SQL.
      13. Hit e until dbaccess exits.
  3. Move data files and reload database
    1. On corrupted system, go to /data/fxa-2 and ftp to non-corrupted system and get the stdTextProd.out, lrgTextProd.out and textInfo.out files.
    2. Go back into dbaccess and run the following commands:
      1. Hit q for Query Language
      2. Select fxatext@ONLINE or type fxatext at the prompt.
      3. Hit c for choose and select loadtext.sql.
      4. If the data files are not in /data/fxa-2 hit u for Use Editor and modify the directory names.
      5. Hit r to Run the SQL and reload the database.
      6. Hit e until dbaccess exits.
  4. Run ./archive.sh to create a clean archive of the restored database.
  5. Re-start the Read and Write servers.
Note: When the database is down while the data ingest is running, text messages will queue up inside the TextDB DataController process. Once the database is back up and accepting messages, this queue will be processed. It may take a long time to catch up, however. (To see what's being processed, look at the end of the CollDecoder or StdDecoder logs.) If it's necessary to empty the queue (due to excessive length), you must kill the TextDB DataController (use proc TextDB to get the pid) and restart it using DataController COMMS_ROUTER TextDB_Controller.config & (most easily done by using X to copy this line out of ~fxa/bin/startIngest).

Hydro decoder & database

A SHEF decoder runs as part of the hydrology package. /usr/local/hydro/wfo/shef/bin/shefdecode runs under fxa, and is started as part of startIngest. Data are stored in an Informix database, separate from the text database. Other hydro cron jobs are run to manage the database, to wit:

  # hydro scripts
  00 0 * * * /usr/local/hydro/wfo/bin/CleanBad.scr
  01 20,0,4,8,12,16 * * * /usr/local/hydro/wfo/bin/CleanWFO
  03 9 * * * /usr/local/hydro/wfo/bin/run_db_cleanup 
  03 11 * * * /usr/local/hydro/wfo/bin/run_db_tuneup
  15 * * * * /usr/local/hydro/wfo/bin/run_precip_accum
  2,7,12,17,22,27,32,37,42,47,52,57 * * * * /usr/bin/perl /usr/local/fxa/bin/renameHydroFiles.pl
  3,8,13,18,23,28,33,38,43,48,53,58 * * * * /usr/local/fxa/bin/moveProds.ksh
Hydro logs are found in /usr/local/hydro/wfo/data/logs, and are purged by a cron job.

AFOS product storage

Products created on the text workstations are stored in the local Informix database and are sent to AFOS for dissemination. Process afoscommsrv handles this connection and is started as part of the startTextDB suite. Should this process go down, a message like this will be seen when trying to `Save & Exit' a text product:
Service: afoscommsrv host dendata1 connect failed: connection refused error sending to AFOS
Use startAFOS to restart it. Logs are written to $FXA_DATA/logs/afoscommsrv.*.

Interprocess communication

Messages are passed between processes using UNIX sockets. On each workstation and server, /usr/local/fxa/bin/rpc.ipcd runs to manage IPC.

If IPC problems occur (lost connection messages in log files), first stop all data ingest, then stop IPC and restart all. Use these commands in this order: stopIngest, stopTextDB, stopTextNotification, stopBridges.scr, and stopIPC. After making sure that rpc.ipcd is stopped (kill if necessary), startIPC, startIngest, startTextDB, and startBridges.scr. After restarting IPC, all text stations must be restarted, and display stations when convenient in order to receive data notifications (green times and auto-update).

After a crash or inelegant shutdown, the rpc.ipcd lock file can get left behind in /var/run (on the workstation). If the workstation startup shows startIPC activated, and the IGCs never come up, check to see if rpc.ipcd is running. If not, go into /var/run and delete rpc.ipcd.lock, then try starting the workstation again.
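
The lock-file cleanup can be sketched so that the file is removed only when rpc.ipcd is not in the process listing. The function name clear_stale_lock is hypothetical; the listing is passed on standard input so that the caller decides which ps options suit the system.

```shell
#!/bin/sh
# Remove a lock file only when the named process is absent from a process
# listing supplied on standard input (for example, the output of `ps -e`).
# $1: lock file path; $2: process name. Hypothetical helper.
clear_stale_lock() {
    [ -e "$1" ] || { echo "no lock file at $1"; return 0; }
    if grep -qF "$2"; then
        echo "$2 is running; leaving $1 alone"
        return 1
    fi
    rm -f "$1" && echo "removed $1"
}
```

On a workstation this might be called as ps -e | clear_stale_lock /var/run/rpc.ipcd.lock rpc.ipcd, followed by another attempt to start the workstation.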

Cron

The lists of scripts and processes run on a by-time (cron) basis are stored in /var/spool/cron/crontabs/<username>. The fxa list is found in ~fxa/bin/ingest.crontab, shown here:

  0 0 * * * csh -c '${FXA_HOME}/bin/breakLogIngest >&! ${FXA_DATA}/logs/breakLogIngest.log'
  15,45 * * * * csh -c '${FXA_HOME}/bin/master.purge >&! ${FXA_DATA}/logs/master.purge.log'
  00,15,30,45 * * * * csh -c '(cd ${FXA_HOME}/xfer/nowrad; ./xferNowrad_v3.com ${FXA_HOME}/xfer/nowrad) >&! ${FXA_DATA}/logs/xfer_nowrad.log'
  5,20,35,50 * * * * csh -c '${FXA_HOME}/bin/ftpCdot >&! ${FXA_DATA}/logs/ftpCdot.log'
  5,20,35,50 * * * * csh -c '${FXA_HOME}/bin/ftpAlert >&! ${FXA_DATA}/logs/ftpAlert.log'
  10,40 * * * * csh -c '${FXA_HOME}/bin/checkDisk >>& ${FXA_DATA}/logs/checkDisk.log'
  0 0 * * * csh -c '${FXA_HOME}/bin/checkDisk.breaklog >&! ${FXA_DATA}/logs/checkDisk.breaklog.log'
  0 * * * * csh -c '${FXA_HOME}/bin/purgeAllRedbook >&! ${FXA_DATA}/logs/purgeAllRedbook.log'
  0 0 * * * /usr/local/ldm/bin/ldmadmin newlog
  59 * * * * csh -c '${FXA_HOME}/bin/startScour >&! ${FXA_DATA}/logs/startScour.log'
  * * * * * csh -c '${FXA_HOME}/bin/x25_restart0 >&! ${FXA_DATA}/logs/x25_restart0.log'
  0,15,30,45 * * * * csh -c '/usr/local/fxa/bin/checkWfoApi0 >>& ${FXA_DATA}/logs/checkWfoApi0.log'
  0 0 * * * csh -c '${FXA_HOME}/bin/breakLog checkWfoApi0.log 4'
  0,30 * * * * csh -c '${FXA_HOME}/bin/syncKFTGlists >&! ${FXA_DATA}/logs/syncKFTGlists.log'
  0 0 * * * csh -c '${FXA_HOME}/bin/breakLog watchFreeway.log'
  0,6,12,18,24,30,36,42,48,54 * * * * csh -c '/usr/local/fxa/bin/checkDialRadar 3 >>& ${FXA_DATA}/logs/checkDialRadar.log'
  0 0 * * * csh -c '${FXA_HOME}/bin/breakLog checkDialRadar.log 4'
  0,15,30,45 * * * * csh -c '${FXA_HOME}/bin/ldmBridgeRestart >&! ${FXA_DATA}/logs/ldmBridgeRestart.log'
  0 0 * * * csh -c '${FXA_HOME}/bin/breakAnnouncementFiles >&! ${FXA_DATA}/logs/breakAnnouncementFiles.log'
In order, these scripts break (close old and open new) ingest log files at 0Z each day, run a purger twice an hour, process NOWrad every 15 minutes, gather CDoT and ALERT data every 15 minutes, check disk space twice hourly, break the latter's log daily, purge Redbook graphics at the top of each hour, break the ldm log daily, run Scour (an LDM-based purger) hourly, run two radar ingest checks and a log-file breaker, synchronize the current RPS list between dendata1 and dendata2 every 30 minutes, break the watchFreeway log daily, check dial-out capability every 6 minutes and break its log daily, check that the LDM bridge processes are running, and break the announcement files (the message lines at the bottom of the workstation displays) daily. Note the names of the log files. (The hydro cron jobs noted earlier are also included in this file, though not shown here.)

Note that the radar ingest jobs (x25_restart0 and breakLog syncComms0.log) run only on dendata1. These lines are commented out of dendata2's ingest.crontab.

Data purging

There are three purgers, all run by cron, as noted in the previous section. The first, master.purge, runs twice an hour. It in turn runs ~fxa/bin/fxa-data.purge and ~fxa/bin/laps-data.purge. The second, startScour, runs hourly, and starts ~ldm/bin/scour, which reads ~ldm/etc/scour.conf for the list of directories to clear out. The third, purgeAllRedbook, manages Redbook graphics. Logs for these processes are in $FXA_DATA/logs/master.purge.log, ~ldm/logs/scour.log, and $FXA_DATA/logs/purgeAllRedbook.log, respectively. Each is overwritten each run.

Data and process monitoring

The data monitor comprises a series of perl scripts that run via cron on dendata1 and dendata2. These scripts build HTML pages that are then copied to dendata1, dendata2, and cardinal (www-sdd) Web server directories for display. Cron entries include

  0,10,20,30,40,50 * * * * csh -c '${FXA_HOME}/bin/http.pl -c ${FXA_HOME}/data/grid.cfg -o ${FXA_HOME}/bin/grid_data.html -h "Grid Data"'
  0,10,20,30,40,50 * * * * csh -c '${FXA_HOME}/bin/http.pl -c ${FXA_HOME}/data/graphic.cfg -o ${FXA_HOME}/bin/graphic_data.html -h "Redbook Graphics Products"'
  0,10,20,30,40,50 * * * * csh -c '${FXA_HOME}/bin/http.pl -c ${FXA_HOME}/data/text.cfg -o ${FXA_HOME}/bin/text_data.html -h "SBN Text Products"'
  0,10,20,30,40,50 * * * * csh -c '${FXA_HOME}/bin/http.pl -c ${FXA_HOME}/data/radar.cfg -o ${FXA_HOME}/bin/radar_data.html -h "Radar Data"'
  0,10,20,30,40,50 * * * * csh -c '${FXA_HOME}/bin/http.pl -c ${FXA_HOME}/data/point.cfg -o ${FXA_HOME}/bin/point_data.html -h "Point Data"'
  0,10,20,30,40,50 * * * * csh -c '${FXA_HOME}/bin/http.pl -c ${FXA_HOME}/data/fsl.cfg -o ${FXA_HOME}/bin/fsl_data.html -h "Local Data"'
  0,10,20,30,40,50 * * * * csh -c '${FXA_HOME}/bin/http.pl -c ${FXA_HOME}/data/sat.cfg -o ${FXA_HOME}/bin/sat_data.html -h "Satellite Data"'
  0,10,20,30,40,50 * * * * csh -c '${FXA_HOME}/bin/diskUsage.pl -o ${FXA_HOME}/bin/disk_usage.html'
  3,13,23,33,43,53 * * * * csh -c '${FXA_HOME}/bin/monitorSummary.pl'
The ingest process monitor runs via cron on dendata1 and dendata2:

  0,10,20,30,40,50 * * * * csh -c '${FXA_HOME}/bin/startProcMon.sh'
This script in turn starts ~fxa/bin/ingestProcMon.pl, which checks processes in ~fxa/data/processes.txt, and builds an HTML file (DD[1,2]_ingestProcMon.html) showing what's up and down. These are copied to dendata1 and dendata2 Web server directories (/opt/httpd/htdocs).
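
The up/down comparison that the process monitor reports can be pictured with the sketch below. This is not the ingestProcMon.pl code. The layout assumed for processes.txt (one process name per line, with `#' comments) is a guess made for illustration, and procA and procB are placeholder names.

```python
def read_expected(text):
    """Parse a processes.txt-style listing into a list of names.

    Assumed layout (hypothetical): one process name per line; blank
    lines and lines starting with '#' are ignored.
    """
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line)
    return names


def classify(expected, running):
    """Split the expected names into (up, down) lists, keeping their order."""
    running = set(running)
    up = [name for name in expected if name in running]
    down = [name for name in expected if name not in running]
    return up, down


if __name__ == "__main__":
    expected = read_expected("# ingest\nprocA\n\nprocB\n")
    up, down = classify(expected, ["procB", "other"])
    print("up:", up, "down:", down)
```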

The restart mechanism

Included at the bottom of the process monitor Web page (see above) is a link to bring up a restart menu, dendata1:/opt/httpd/cgi-bin/restart_DD[1,2]-setup.sh. This runs ~fxa/bin/restart-ingest.sh on the appropriate server, which in turn runs ~fxa/bin/restart-ingest-display.tcl. That finally runs ~fxa/bin/restart-ingest.tcl, which puts up a menu and takes action based on the user's selection. This tcl script calls the various *.tclProg shell scripts to stop and start processes. A write-up of this is found in ~fxa/doc/userGuides/IngestRestart.fm.

Text workstation

Procedures are stored in $FXA_DATA/scripts/<username>. Each procedure is in a file, and consists of a list of commands. The usernames are found in ~fxa/data/fxa-users.

Text `stuff' is stored in ~textdemo/textWSwork/dentextn:0. (Text WS 2, 4, and 6 are on dendata2.) Subdirectories include logs (no longer used; see next paragraph), saved (copies of all products that have been created on this station), and journals (in-progress editing, saved for crash recovery). Also here is textAlarmAlertProducts.txt, the list of (surprise!) alarm/alert products for this workstation.

Log files are in $FXA_DATA/logs/yymmdd/textWish<pid>dendata1hhmmss. Logs exist only for the text windows (not the parent textWS.tcl processes). Logs do not identify the host Xterm.
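
When sorting through a day's directory, the log name can be split back into its parts. The sketch below is illustrative only. It assumes the host name is a run of letters followed by a single digit (as in dendata1), which is what lets the pid, host, and hhmmss fields be separated unambiguously; the function name is made up.

```python
import re

# textWish<pid><host><hhmmss>, with the host assumed to be letters + one digit.
LOG_NAME = re.compile(
    r"^textWish(?P<pid>\d+)(?P<host>[A-Za-z]+\d)(?P<hhmmss>\d{6})$"
)


def parse_text_log(name):
    """Return pid, host and start time from a text-window log name.

    Returns None if the name does not follow the expected pattern.
    """
    match = LOG_NAME.match(name)
    if match is None:
        return None
    t = match.group("hhmmss")
    return {
        "pid": int(match.group("pid")),
        "host": match.group("host"),
        "time": f"{t[0:2]}:{t[2:4]}:{t[4:6]}",
    }


if __name__ == "__main__":
    # Made-up example name, not taken from a real log directory.
    print(parse_text_log("textWish1234dendata1093015"))
```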

If an Xterm gets mis-configured, the title window will come up, but the individual text windows will not. (You'll get a tcl error when you try to start one.) Press F12 on the keyboard for a second or two, then select Server. Press the Access Control button (middle button in second panel) `on' and click OK (upper right). Answer OK in the dialog box, wait for the reset, log in, and you should be ready to roll.

Local LAPS processing

LAPS (analysis) runs on denapp1, hourly by cron. Information here was provided by John Smart. Incomplete directory paths below are relative to the LAPS home directory, /usr/local/laps/laps (~laps/laps).

For WFO-Advanced, the crontab file is nest7grid/sched/lapscron_hp_wfo. This cron file activates four processes, as shown here:

  21 * * * * csh -c '(/usr/local/laps/laps/nest7grid/sched/schedule.com /usr/local/laps/laps/nest7grid)>& /usr/local/laps/laps/nest7grid/sched/schedule.err; /usr/local/laps/makeWFObigfile/xferLaps.com'
  12,42 * * * * /usr/bin/sh /usr/local/laps/laps/nest7grid/sched/sched_lga.com /usr/local/laps/laps/nest7grid 1> /usr/local/laps/laps/nest7grid/sched/sched_lga.log 2> /usr/local/laps/laps/nest7grid/sched/sched_lga.err
  05,11,20,26,35,42,50,56 * * * * /usr/bin/sh /usr/local/laps/laps/nest7grid/sched/run_lvd_driver_cron.sh /usr/local/laps/laps/nest7grid 1> /usr/local/laps/laps/nest7grid/log/lvd_cron_log 2> /usr/local/laps/laps/nest7grid/log/lvd_cron_err
  01,16,31,46 * * * * /usr/bin/sh /usr/local/laps/laps/nest7grid/sched/run_vrc_driver_cron.csh /usr/local/laps/laps/nest7grid 1> /usr/local/laps/laps/nest7grid/sched/vrc_cron.log 2> /usr/local/laps/laps/nest7grid/sched/vrc_cron.err
  1. The schedule.com script runs the analysis starting at 21 past the hour. This script runs a sequence of processes that ingest various datasets, run the analysis, purge analysis and intermediate files, etc. Log files (notably schedule.log, schedule.err, purger.log, and purger.err) are in nest7grid/sched. Immediately following completion of the analysis, cron starts xferLaps.com, which writes the LAPS grids into Grid/FSL/netCDF/LAPS_Grid/LAPS on both /data/fxa and /data/fxab (both are mounted on denapp1). This `bigfile' contains all LAPS grids generated for workstation display for that cycle. If no bigfile is generated, then there will be no workstation graphics available. A grid notification is also generated for each bigfile; each starts and stops IPC, so two rpc.ipcd logs for denapp1 appear each hour in /data/fxa/logs/<yyddhh>.
    Logfiles and errfiles for the individual processes (named *.log.<hhmm> and *.err.<hhmm>) are written to nest7grid/log. Analyses and intermediate ingest files are written to nest7grid/lapsprd/* in which the `*' refers to the appropriate product subdirectory. (Note that this currently is directed via soft link to dendata1:/data/fxa/lapsprd, whence it is redirected to /data/fxa-2/lapsprd.)
  2. The second crontab entry generates the model background (first guess) fields that are used in the analyses. (Though this runs twice an hour, in most cases it does nothing, because it needs new RUC grids before it can do any work.) The grids are named 97jjjhh00pp00.lga, where jjj is the Julian day (day of the year), hh is the analysis hour, and pp is the projection hour. (An appropriate file is used as first guess for each analysis.) These are written to nest7grid/lapsprd/lga/. While the success/failure of lga is written to the `log' subdirectory, the success of the crontab is written to nest7grid/sched/sched_lga.log and .err.
  3. The third crontab entry activates the satellite ingest process (called lvd) 8 times an hour (to accommodate rapid scan operations; again, in many cases it does nothing, requiring appropriate satellite files). This creates files in nest7grid/lapsprd/lvd/. Similar to the other processes, logs are written to log/lvd.log.hhmm and log/lvd.err.hhmm.
  4. The last entry runs the radar (NOWrad) ingest process for LAPS.
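
To find the background file that goes with a given analysis, the lga naming pattern from item 2 can be built from a date and a projection hour. This is a sketch under stated assumptions: the leading 97 is taken to be the two-digit year, and the two 00 fields are treated as fixed text. The function name is made up for illustration.

```python
from datetime import datetime


def lga_name(analysis_time, projection_hour):
    """Build an lga file name of the form yyjjjhh00pp00.lga.

    Assumptions: yy is the two-digit year (shown as 97 in the text)
    and both '00' fields are literal.
    """
    yy = analysis_time.year % 100
    jjj = analysis_time.timetuple().tm_yday  # day of the year, 1-366
    return f"{yy:02d}{jjj:03d}{analysis_time.hour:02d}00{projection_hour:02d}00.lga"


if __name__ == "__main__":
    # 11 Jul 1997 is day 192 of the year.
    print(lga_name(datetime(1997, 7, 11, 18), 3))
```

For the 18Z analysis on 11 Jul 97 with a 3-hour projection, this yields 9719218000300.lga.
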
The entire LAPS ingest/analysis generally completes in approximately 5 minutes. Run times longer than 15 minutes or shorter than 2 minutes may indicate a problem. Use tail ~laps/laps/nest7grid/sched/runtime.log to see what time the run completed.
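
The run-time check can be written out as follows. This sketch assumes the run time is counted from the :21 cron start of schedule.com, and it takes the completion time as an argument instead of reading runtime.log, whose layout is not described here. The function names are made up for illustration.

```python
from datetime import datetime, timedelta

START_MINUTE = 21  # schedule.com is started by cron at 21 past the hour


def elapsed_minutes(completed):
    """Minutes between the most recent :21 start and the completion time."""
    start = completed.replace(minute=START_MINUTE, second=0, microsecond=0)
    if start > completed:
        # Completion fell before :21, so the run began in the previous hour.
        start -= timedelta(hours=1)
    return (completed - start).total_seconds() / 60.0


def run_status(minutes):
    """Classify a run time against the 2- and 15-minute limits given above."""
    if minutes > 15:
        return "long"
    if minutes < 2:
        return "short"
    return "ok"


if __name__ == "__main__":
    done = datetime(1997, 7, 11, 18, 26, 30)
    print(elapsed_minutes(done), run_status(elapsed_minutes(done)))
```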

More information about LAPS run-time details is available in the LAPS README file, http://www.fsl.noaa.gov/frd/laps/tar/README_files/README_9611191600.html. (Now points to the most recent LAPS README.)

Some other stuff

Graphics procedures are stored in $FXA_DATA/procs/<username>. Each procedure is in its own directory, which contains an index file and the bundles. The index is a paired list of bundle file names and descriptive names. The usernames are found in ~fxa/data/fxa-users.

On occasion, the X server gets to be a memory hog or otherwise needs to be restarted. On screen :0's keyboard, press ctrl-Pause (ctrl-shift-Break - the key at the right end of the top row) to kill the X servers. (It's possible to do just :1, but there's not much point in that, because you have to log out and back in to get back in business, anyway.)

The printers are known as denprtr1 (the default printer on graphics stations), used for graphics, and denprtr2 (the default printer on data servers), used for text, as set in ~textdemo/start. Text printing uses ~fxa/bin/textPrint.tcl.

Data sources and storage

Data are stored on a set of three 2-GB disks. The three disks are configured as one logical volume, known to the data ingest software as $FXA_DATA.

Use bdf to check on disk space.

Click here for data storage information.

This page is maintained by Joe Wakefield.

Last modified: 11 Jul 97 [LAPS README link updated 15 Aug 03.]