HWRF  trunk@4391
HWRF Rocoto Workflow

The HWRF system can be run through the Rocoto workflow manager. The full end-to-end operational system, and many developmental variants, can be run through the run_hwrf.py script (see: run_hwrf_main), which is a wrapper around Rocoto's rocotorun command. The Rocoto workflow manager submits jobs, monitors running jobs, and tracks file and time dependencies. A suite of XML files in the rocoto/ subdirectory tells Rocoto about all of these dependencies and how to run the scripts (see Scripting Layer).

Note that the multi-storm HWRF has its own special page with details on running HWRF via Rocoto. Make sure you read that page too before attempting the multi-storm configuration.

Starting HWRF Using Rocoto

The rocotorun program is the main program for Rocoto, but it requires an XML file to run. The run_hwrf.py wrapper script (see run_hwrf_main) creates this XML file, finds rocotorun, and executes it. The run_hwrf.py program is executed like so:

me@cluster> cd rocoto
me@cluster> ./run_hwrf.py -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=myexpt

Here, myexpt should be the name of the directory in which HWRF was installed (the parent of the ush, rocoto, exec, and other directories). For example, if you install HWRF here:

/lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557

Then your config.EXPT should be trunk-r3557:

me@cluster> cd rocoto
me@cluster> ./run_hwrf.py -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557

If all goes according to plan, you should see an isaac.xml file appear, filled with Rocoto XML workflow information, and an isaac.db SQLite3 database file, filled with Rocoto's internal information about tasks and cycles. The first job in the workflow, scripts.exhwrf_launch, will be running for the first cycle.

After launching the first job, you must continue running run_hwrf.py every 5-20 minutes or so. Every time it runs, Rocoto checks dependencies and submits new jobs.

me@cluster> ./run_hwrf.py -f -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557

Note the added -f argument. It tells run_hwrf.py that the isaac.xml and isaac.db files already exist and that you approve of the script replacing them. If you omit -f, run_hwrf.py will ask whether you want to replace them.
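Because run_hwrf.py must be re-run every few minutes, it is common to automate the sweep with cron. A hypothetical crontab entry is sketched below, using the example installation path from this page; the ten-minute interval and the paths are assumptions, not requirements:

```crontab
# Hypothetical entry (install with `crontab -e`): re-run the workflow
# sweep every 10 minutes. The -f flag suppresses the prompt about
# replacing the existing isaac.xml and isaac.db files.
*/10 * * * * cd /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/rocoto && ./run_hwrf.py -f -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557
```

Remember to remove the entry once the experiment has finished, or run_hwrf.py will keep sweeping an already-complete workflow.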

Sanity Checks and Other Paranoia

Several things can go wrong if your installation is not quite right. Either the HWRF sanity checks or Rocoto will warn you about them: HWRF prints a message about failed sanity checks, with suggestions for fixing the problem, while Rocoto prints messages about timeouts, hangups, or XML syntax errors.

Alternative Configurations

In the run_hwrf.py command, you can provide configuration information after the config.EXPT= option. Additional arguments are configuration files to read or specific options to override. You can find some configuration files in the parm directory. For example, this would run the 3km version of HWRF:

me@cluster> ./run_hwrf.py -f -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557 ../parm/hwrf_3km.conf

You can run a GEFS ensemble at 3 km with 43 levels, but then you must also provide the list of ensemble members to run at the beginning of the run_hwrf.py arguments, plus some additional configuration files. Let's say we want to run all twenty-one members (the control and 20 perturbed members):

me@cluster> ./run_hwrf.py -f -w isaac.xml -d isaac.db 00-20 2012 09L HISTORY config.EXPT=trunk-r3557 ../parm/hwrf_3km.conf ../parm/hwrf_ensemble.conf

The changes we added to the command are:

- 00-20 — the list of ensemble members to run (the control member 00 and perturbed members 01 through 20)
- ../parm/hwrf_ensemble.conf — the configuration file that enables the ensemble

See HWRF Experiment Configuration for more detail about the available alternate configurations of HWRF.

Monitoring the Rocoto-Based HWRF System

There are a few commands in Rocoto 1.2 and later that let you monitor a running HWRF system and request re-submission of failed jobs.

Automatic Re-Submission

Rocoto will automatically resubmit failed jobs until a maximum-retry threshold is reached. This automatic re-submission is critical when running on NOAA R&D machines, since the hardware and software are not 100% reliable. HWRF's Rocoto XML workflow generally sets this limit to 3 for most jobs, and 7 for jobs that access the network. After that limit is reached, the user must intervene.

The rocotostat Command

The first is rocotostat, which prints a list of all jobs for the cycles Rocoto has started. For each job, rocotostat prints the job status, the number of times Rocoto resubmitted it, and other information:

me@cluster> rocotostat -w isaac.xml -d isaac.db -c ALL

The output will look like this:

 CYCLE TASK JOBID STATE EXIT STATUS TRIES DURATION
==================================================================================================================
201208201200 ensda_pre 43320461 SUCCEEDED 0 1 46.0
201208201200 launch_E99 43320156 SUCCEEDED 0 1 23.0
201208201200 input_E99 43320283 SUCCEEDED 0 1 214.0
201208201200 bdy_E99 43321045 SUCCEEDED 0 1 1906.0
201208201200 init_GFS_0_E99 43320462 SUCCEEDED 0 1 770.0
201208201200 init_GDAS1_3_E99 43320463 SUCCEEDED 0 1 1319.0
201208201200 init_GDAS1_6_E99 43320464 SUCCEEDED 0 1 1281.0
201208201200 init_GDAS1_9_E99 43320465 SUCCEEDED 0 1 1295.0
201208201200 ocean_init_E99 43320466 SUCCEEDED 0 1 883.0
201208201200 relocate_GFS_0_E99 43321046 SUCCEEDED 0 1 772.0
201208201200 relocate_GDAS1_3_E99 43321341 SUCCEEDED 0 1 771.0
201208201200 relocate_GDAS1_6_E99 43321342 SUCCEEDED 0 1 944.0
201208201200 relocate_GDAS1_9_E99 43321343 SUCCEEDED 0 1 1020.0
201208201200 bufrprep_E99 43320467 SUCCEEDED 0 1 16.0
201208201200 gsi_d02_E99 43321974 SUCCEEDED 0 1 2009.0
201208201200 gsi_d03_E99 43321975 SUCCEEDED 0 1 1059.0
201208201200 gsi_post_E99 43332079 SUCCEEDED 0 1 1702.0
201208201200 merge_E99 43332080 SUCCEEDED 0 1 525.0
201208201200 check_init_E99 43332318 SUCCEEDED 0 1 15.0
201208201200 coupled_forecast_E99 43332438 SUCCEEDED 0 1 10693.0
201208201200 uncoupled_forecast_E99 - - - - -
201208201200 unpost_E99 43335424 SUCCEEDED 0 1 20.0
201208201200 post_E99 43335547 DEAD 0 3 10112.0
201208201200 post_helper_E99 43335548 DEAD 0 3 10105.0
201208201200 products_E99 43336031 DEAD 0 3 10905.0
201208201200 output_E99 - - - - -

Some of the jobs are "DEAD," and two jobs were never even submitted (the jobs shown with hyphens). A job is dead once it has failed too many times; at that point, you must resubmit it yourself. There are two ways to convince Rocoto to resubmit something.
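Scanning this output by eye gets tedious for long workflows. A small filter can pull out only the dead tasks. The awk pattern below is a sketch: in practice you would pipe the real `rocotostat -w isaac.xml -d isaac.db -c ALL` output into it, but here a two-line sample stands in for that output, and the STATE column is assumed to be field 4 as in the listing above.

```shell
# Sketch: print the task name (column 2) of every DEAD task.
# A two-line sample stands in for real `rocotostat ... -c ALL` output.
sample='201208201200 post_E99 43335547 DEAD 0 3 10112.0
201208201200 merge_E99 43332080 SUCCEEDED 0 1 525.0'
printf '%s\n' "$sample" | awk '$4 == "DEAD" { print $2 }'
```

With the sample above, only post_E99 is printed; SUCCEEDED tasks are skipped.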

The rocotorewind Command

This is the preferred method of resubmitting failed jobs. The rocotorewind command tells Rocoto to "undo" the effects of a job and submit it again once its dependencies are met. This command tells Rocoto to rewind the three failed jobs:

me@cluster> rocotorewind -w isaac.xml -d isaac.db -c 201208201200 -t post_E99 -t post_helper_E99 -t products_E99
me@cluster> ./run_hwrf.py -f 00-20 -w isaac.xml -d isaac.db 2012 09L HISTORY ... other arguments ...

Later runs of run_hwrf.py will submit the jobs once their dependencies are met. In this case, the post and post_helper jobs will be resubmitted as soon as you run run_hwrf.py again; the products job will start once the post job is running or complete.
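When many tasks die at once, typing each -t argument by hand is error-prone. The sketch below assembles the rocotorewind argument list from rocotostat-style output. The sample lines standing in for a real rocotostat call are assumptions, and the final command is only echoed rather than executed, so it can be inspected first.

```shell
# Sketch: build "-t task" arguments for every DEAD task in one cycle.
# Sample lines stand in for real `rocotostat -c 201208201200` output;
# STATE is assumed to be column 4 as in the listing above.
cycle=201208201200
tasks=$(printf '%s\n' \
    '201208201200 post_E99 43335547 DEAD 0 3 10112.0' \
    '201208201200 post_helper_E99 43335548 DEAD 0 3 10105.0' \
    '201208201200 products_E99 43336031 DEAD 0 3 10905.0' |
    awk '$4 == "DEAD" { printf " -t %s", $2 }')
# Echo the command instead of running it, so it can be checked first.
echo rocotorewind -w isaac.xml -d isaac.db -c "$cycle"$tasks
```

Dropping the echo (or piping the line to sh) would run the rewind for real.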

Todo:
At present, it is not possible to use rocotorewind if the cycle is entirely complete (that is, when the special "completion" job is done). This is a limitation of Rocoto that we hope to fix in the next release.

The rocotoboot Command

The alternative is to run rocotoboot, which tells Rocoto to run a job immediately, even if the retry limit has been reached. The problem with rocotoboot is that it ignores dependencies, so you must verify on your own that the dependencies are met. We know the post job's dependencies are met, so we can submit it:

me@cluster> rocotoboot -w isaac.xml -d isaac.db -c 201208201200 -t post_E99
me@cluster> ./run_hwrf.py -f 00-20 -w isaac.xml -d isaac.db 2012 09L HISTORY ... other arguments ...
Warning
Beware that if you rocotoboot a job that is not DEAD and whose dependencies are met, Rocoto will submit the same job twice. It is better to use rocotorewind instead of rocotoboot: rocotorewind ensures jobs are submitted only once it is safe to do so, and only one copy of each job is submitted at a time.

The rocotocheck Command

The rocotocheck command reports the full definition and state of a job, including why Rocoto has or has not submitted it. For example, to check the uncoupled forecast job for cycle 201208201800:

me@cluster> rocotocheck -w *xml -d *db -c 201208201800 -t uncoupled_forecast_E99

The output will look something like this:

Task: uncoupled_forecast_E99
 account: hwrfv3
 command: /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/ush/rocoto_pre_job.sh /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/scripts/exhwrf_forecast.py
 cores: 464
 final: false
 jobname: hwrf_atm_forecast_09L_2012082018_E99
 maxtries: 3
 memory:
 metatasks: meta_fcst
 name: uncoupled_forecast_E99
 nodes: 30:ppn=12+8:ppn=11+2:ppn=8
 queue: batch
 seqnum: 2
 stderr: /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L/hwrf_atm_forecast.err
 stdout: /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L/hwrf_atm_forecast.out
 throttle: 9999999
 walltime: 4:59:00
 environment
  CONFhwrf ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/com/2012082018/09L/storm1.conf
  HOMEhwrf ==> /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557
  HWRF_FORCE_TMPDIR ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L/tmpdir
  PARAFLAG ==> YES
  PYTHONPATH ==> /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/ush
  TOTAL_TASKS ==> 464
  USHhwrf ==> /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/ush
  WHERE_AM_I ==> jet
  WORKhwrf ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L
  envir ==> prod
  jlogfile ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/log/jlogfile
/bin/sh -c grep -v 'RUN_COUPLED=YES' /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/com/2012082018/09L/storm1.ocean_status (Mon Aug 20 18:00:00 UTC 2012)
 dependencies
  AND is not satisfied
   check_init_E99 of cycle 201208201800 is SUCCEEDED
   OR is not satisfied
    'YES'!='YES' is false
    grep -v 'RUN_COUPLED=YES' /pan2/projec... returned false

Cycle: 201208201800
 Valid for this task: YES
 State: active
 Activated: Tue Jun 02 20:09:53 UTC 2015
 Completed: -
 Expired: -

Job: This task has not been submitted for this cycle

Task can not be submitted because:
 Dependencies are not satisfied

Note the last bit: "Task can not be submitted because: Dependencies are not satisfied." That tells you why the task is not submitted, but why are the dependencies not satisfied? Look at this part:

dependencies
 AND is not satisfied
  check_init_E99 of cycle 201208201800 is SUCCEEDED
  OR is not satisfied
   'YES'!='YES' is false
   grep -v 'RUN_COUPLED=YES' /pan2/projec... returned false

That is the Rocoto dependency decision tree, which tells you why the dependencies are not met. In this case, the upstream check_init_E99 job is done, so the OR condition is the problematic part. It contains two conditions, only one of which must be met. The first is a string comparison: "YES" must not equal "YES". Clearly that condition will never be met. The other condition is a real one:

grep -v 'RUN_COUPLED=YES' /pan2/projec... returned false

It appears that the ocean initialization decided to run coupled. Hence the uncoupled forecast was never run, and the coupled forecast was run instead. This is not an error, so we can be satisfied that Rocoto is working as requested.

More Information About Rocoto

The Rocoto project is hosted on GitHub:

https://github.com/christopherwharrop/rocoto

Documentation is available in the Rocoto wiki:

https://github.com/christopherwharrop/rocoto/wiki/Documentation