HWRF  trunk@4391
HWRF Log Files

This page documents log file locations and provides guidance on how to read the log files. We discuss both the operational locations and the Rocoto-based (repository HWRF) log file locations.


Nomenclature

Most log files are inside the $WORKhwrf, $COMhwrf, or $log directories. This section describes the nomenclature so you'll understand those and other shell-like variable names in this page.

Here are the variable names you may need to refer to:

These are specific to the NCO system (operational HWRF on WCOSS):

These are specific to the non-NCO HWRF workflow:


Log Location Quick Reference

A quick reference for the current operational HWRF locations:

Default repository (non-operational) HWRF locations:

Log locations within those directories:


Log Files

The $jlogfile

The most useful tool for getting a quick glance at HWRF's status is the jlogfile. NCO configures the jlogfile location using the $jlogfile environment variable. For everyone else, it is in this location:

$CDSCRUB/$SUBEXPT/log/jlogfile

where $CDSCRUB is set in your system.conf file. The $SUBEXPT (sub-experiment) is user-defined, but defaults to the value of your $EXPT (also user defined). The jlogfile will contain log messages for all jobs run by that sub-experiment, for all storms and cycles. Only the highest-level messages are reported in the file.
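
As a quick illustration, here is a minimal Python sketch of tailing the jlogfile. The $CDSCRUB and $SUBEXPT values below are hypothetical placeholders; take the real values from your system.conf and experiment configuration.

import os

# Hypothetical values; substitute the ones from your own configuration.
CDSCRUB = '/path/to/scrub/area'
SUBEXPT = 'MYEXPT'

jlogfile = os.path.join(CDSCRUB, SUBEXPT, 'log', 'jlogfile')

# Print the last 20 high-level status messages, if the file exists yet.
if os.path.isfile(jlogfile):
    with open(jlogfile) as f:
        for line in f.readlines()[-20:]:
            print(line.rstrip())
else:
    print('No jlogfile yet at ' + jlogfile)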

Per-Job Log Files

Each HWRF batch job generates a large amount of output on its stdout and stderr streams. Depending on which system you're using, they may be in a single output file or split into two. Generally, the stdout stream contains the most detailed information, since its logging level is INFO while stderr is at level WARNING. However, either one may contain error messages from executed commands or from the operating system.

There are a few jobs in particular where the stdout and stderr streams have special meanings:

  1. coupled forecast — coupler logging is in stdout. Generally, we redirect this to a file due to its extreme size (>400 MB).
  2. products — the tracker's stderr is the products job's stderr. This means the tracker's messages about waiting for files to show up are in the products job's stderr.
  3. relocate, merge — the Fortran programs these jobs run write extensively to both stdout and stderr. For this reason, the stderr stream has INFO logging level to make it easier to follow.

Forecast Log Files

As of this writing, there are three coupled components:

The coupler and ocean share a log stream and both are redirected to:

This special extra log file exists due to the extreme size of the coupler log: >400 MB.

The WRF has many log files, one for stdout and one for stderr for each of its MPI ranks. These are all in the runwrf/ directory:

where "RANK" is the rank within the WRF communicator, zero-padded to four digits. The first line of every log file will tell the name of the machine on which that rank is running. The WRF master process is the only one that does extensive logging, and its main log file is here:

That file is updated multiple times per timestep with information about the WRF run.

However, if WRF fails, the problem could be in any rank. In that situation, it is critical to check the other rsl.* files, especially the rsl.error.* files.
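
When WRF does fail, a quick way to triage is to scan all of the rsl.error.* files for lines that look like failures. The following Python 3 sketch does that; the runwrf/ path and the keyword list are assumptions, so adjust them for your own run.

import glob, os

# Hypothetical path to the forecast working directory; substitute your own.
runwrf = '/path/to/WORKhwrf/runwrf'

# Keywords that often indicate a failed rank. This list is an assumption,
# not an exhaustive set of WRF failure messages.
keywords = ('FATAL', 'ERROR', 'Segmentation', 'BACKTRACE')

for path in sorted(glob.glob(os.path.join(runwrf, 'rsl.error.*'))):
    with open(path, errors='replace') as f:
        for lineno, line in enumerate(f, 1):
            if any(k in line for k in keywords):
                print('%s:%d: %s' % (path, lineno, line.rstrip()))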

Post-Processing and Regribbing

To understand post-processing logging, you have to know how the work of post-processing is divided. The work of the HWRF post-processing is split into a products job, and one or more post jobs. The post jobs run the Unified Post Processor (UPP), which converts WRF output to native E grid GRIB files. The products job regrids the E grid output to more standard grids and copies the resulting GRIB files to COM. The products job also runs the GFDL Vortex Tracker.

Depending on the system, either the post or the products job will copy the forecast job's native output files to COM, compressing any NetCDF files. Which job does this depends on how the NetCDF Operators were compiled. On NOAA Zeus, the post job copies the files, while on all other systems the products job copies them.

Post Job Logging

Most of the information you need from the post job is logged to its per-job stdout and stderr files. The stdout reports what is run and when, and lists stack trace information for any failed post-processing operations. The stderr stream may contain additional information from the post program's own stderr. The stdout of the post program itself is extremely long, so it is redirected to a file, which is deleted if the post succeeds. If the post fails, the failed post's stdout is found here:

You can search the post job's stdout or stderr file to find the directory in which any failed post run was executed.

When the post job copies native model output to COM, most error messages are found in the stderr stream of the post job.

Products Job Logging

The products job is split into multiple subprocesses launched by the MPI program "mpiserial." They fall into three categories:

The output from gribbers is enormous, so it is redirected to files:

The stdout stream is best for finding out what the gribber is doing at any given time, while the stderr is best for finding errors. Most of the programs run by the gribber report errors on stderr, including the cnvgrib, wgrib, and hwrf_egrid2latlon (copygb) programs. Also, if you forget to install one of those programs, error messages about the program's absence will appear in stderr.
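
Because a missing executable only shows up as an error in the gribber's stderr, it can save time to verify up front that the regribbing programs are on your PATH. Here is a small sketch using shutil.which (Python 3.3+); the program names are taken from the list above and may differ on your installation.

import shutil

# Programs the gribber is expected to run. hwrf_egrid2latlon is the copygb
# variant referred to above; exact names may vary by installation.
for prog in ('cnvgrib', 'wgrib', 'hwrf_egrid2latlon'):
    path = shutil.which(prog)
    print('%-20s %s' % (prog, path if path else 'NOT FOUND on PATH'))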

The tracker logs can be found in two places:

Most information from the tracker is in its stdout stream. The only stderr information of note is:

Copier Log Files

Logging from the hwrf_expt.wrfcopier, which copies native model output to COM, can be in one of two places. In NCEP operations, and on most other platforms, it is part of the products job, and can be found here:

On NOAA Zeus, the copier is run by the post, so its information is in the per-job log files for the post jobs.

In the products job, the stdout is a good way of tracking its progress, but stderr is better for finding error messages. Errors usually come from disk problems, or from failures of the ncks program. Typically, ncks errors will be cryptic messages from the NetCDF library or NetCDF Operators library, followed by a human-readable explanation that may span multiple lines. In any case, all errors should be readily visible as stack traces that include the hwrf.copywrf module.
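
One way to pull out just the copier's failures is to scan the products job's stderr for lines mentioning ncks or the hwrf.copywrf module, along with a little surrounding context. This is a sketch only; the stderr file path below is a hypothetical placeholder.

import collections

# Hypothetical path to the products job's stderr file; use your own.
stderr_file = '/path/to/products_job.stderr'

context = collections.deque(maxlen=3)   # keep a few preceding lines
with open(stderr_file) as f:
    for line in f:
        if 'copywrf' in line or 'ncks' in line:
            for prev in context:
                print('    ' + prev.rstrip())
            print('>>> ' + line.rstrip())
            context.clear()
        else:
            context.append(line)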

Init and Bdy Job Log Files

The init and bdy jobs in HWRF take parent model data and prepare it for use by the rest of the HWRF system. That involves running many programs, all of which have extensive logging. Most of the progress and error information can be derived from the per-job log files. However, when something fails, it may be necessary to delve deeper.

There are two types of initialization: the "gfsinit" and the "fgat" initializations. The "gfsinit" directory is still used even if the parent model is not the GFS (GDAS, FNL, etc.). These directories are:

Within each init directory, there are subdirectories for each component.

See the sections on post-processing or forecast for information about logging for the post, gribber, tracker, real, and WRF.

Errors from Metgrid or Ungrib indicate problems with input GRIB files. Either the files themselves are invalid, or the chosen Vtable is not appropriate for the inputs.

The geogrid and prep_hybrid programs should never fail unless the requested datasets are invalid (such as a bad or non-existent file) or an I/O error occurs.


Python-Generated HWRF Log Files

In this section, we give guidance on how to read the HWRF log files generated by the Python scripts. These log files follow a common structure. Each has job prologue and epilogue information, consisting of the hostfile, environment, and other diagnostics needed to debug jobs on failed compute nodes or to find other system information.

Eventually, after all the prologue, you will see something like this:

07/14 00:10:06.963 jlog (exhwrf_post.py:15) INFO: starting post
07/14 00:10:07.736 hwrf (launcher.py:152) INFO: Running cycle: 2015071318
07/14 00:10:07.736 hwrf (launcher.py:157) INFO: /lfs3/projects/hwrf-vd/hurrun/pytmp/H215_ensemble3/00/2015071318/11W/tmpvit: read vitals for current cycle
07/14 00:10:07.764 hwrf (launcher.py:161) INFO: Current cycle vitals: JTWC 11W NANGKA    20150713 1800 221N 1366E 350 042 0952 1002 0407 48 012 0278 0278 0250 0222 D
07/14 00:10:07.765 hwrf (launcher.py:164) INFO: /lfs3/projects/hwrf-vd/hurrun/pytmp/H215_ensemble3/00/2015071318/11W/oldvit: read vitals for prior cycle
07/14 00:10:07.785 hwrf (launcher.py:168) INFO: Prior cycle vitals: JTWC 11W NANGKA    20150713 1200 214N 1369E 355 042 0944 1002 0398 54 018 0278 0278 0250 0222 D
07/14 00:10:07.785 hwrf (hwrf_expt.py:127) INFO: Initializing hwrf_expt module...
07/14 00:10:14.387 hwrf (hwrf_expt.py:401) INFO: Done in hwrf_expt module.

Each line has a specific form: the date and time (with milliseconds), the log stream name, the source file and line number that produced the message, the logging level, and finally the message itself.

There are many log streams, and their names can help you find out where the logging is coming from, and hence why things are happening. For example:

07/14 00:10:14.532 hwrf.nonsatpost-f00h00m (fileop.py:435) INFO: hires_micro_lookup.dat: move from ...

(Abbreviated for readability.) Here, the stream name hwrf.nonsatpost-f00h00m means the log message is about the analysis time (f00h00m) non-satellite post job (nonsatpost).
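
Based on the samples above, each log line can be picked apart with a simple regular expression: date, time, stream name, source file and line number, level, and message. The pattern below is inferred from the sample lines on this page, not taken from the HWRF source, so treat it as a sketch.

import re

# Pattern inferred from the sample log lines shown above.
LOG_LINE = re.compile(
    r'^(?P<date>\d\d/\d\d) (?P<time>[\d:.]+) (?P<stream>\S+) '
    r'\((?P<file>[^:]+):(?P<lineno>\d+)\) (?P<level>[A-Z]+): (?P<message>.*)$')

sample = ('07/14 00:10:14.532 hwrf.nonsatpost-f00h00m '
          '(fileop.py:435) INFO: hires_micro_lookup.dat: move from ...')

m = LOG_LINE.match(sample)
if m:
    print('%s %s %s' % (m.group('stream'), m.group('level'),
                        m.group('message')))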

Python Logging Levels

The Python standard library logging module has multiple levels of logging, which HWRF uses for different purposes and sends to different places:

Level     stdout  stderr  jlogfile  Meaning
DEBUG     no      no      no        Debug messages usable only by developers.
INFO      yes     no      no        Regular status information.
WARNING   yes     yes     no        Information that may be useful in debugging failed jobs.
ERROR     yes     yes     yes       Errors that will degrade the forecast or disable components.
CRITICAL  yes     yes     yes       Failures that require operator intervention.

Note that higher levels of logging go to more streams. Log messages from all log streams at level ERROR or higher will go to the jlogfile. Other messages go to only the per-job output files.

Log messages sent to the special "jlog" stream also go to the jlogfile, even if they're at lower log levels. This is to allow each job to send start and completion messages without the messages looking like errors.
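
The routing in the table above, and the special handling of the "jlog" stream, can be illustrated with the plain Python logging module. This is a sketch of the idea only, not HWRF's actual configuration: one handler on the jlogfile accepts ERROR and above from every stream, while the "jlog" logger gets its own handler on the same file at INFO level, so start and completion messages reach the jlogfile without being errors.

import logging, sys

# jlogfile handler: ERROR and CRITICAL from all streams. The file name
# here is a local placeholder.
jlog_handler = logging.FileHandler('jlogfile')
jlog_handler.setLevel(logging.ERROR)

# Per-job output: detailed stdout (INFO and above), terse stderr
# (WARNING and above).
stdout_handler = logging.StreamHandler(sys.stdout)
stdout_handler.setLevel(logging.INFO)
stderr_handler = logging.StreamHandler(sys.stderr)
stderr_handler.setLevel(logging.WARNING)

root = logging.getLogger()
root.setLevel(logging.DEBUG)
for handler in (jlog_handler, stdout_handler, stderr_handler):
    root.addHandler(handler)

# The "jlog" stream gets an extra handler on the same file, at INFO level,
# so its status messages appear in the jlogfile too.
jlog_stream = logging.getLogger('jlog')
jlog_extra = logging.FileHandler('jlogfile')
jlog_extra.setLevel(logging.INFO)
jlog_stream.addHandler(jlog_extra)

logging.getLogger('hwrf').info('per-job output only')
logging.getLogger('hwrf').error('also goes to the jlogfile')
jlog_stream.info('starting post')   # reaches the jlogfile at INFO level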