HWRF  trunk@4391
HWRF Rocoto Workflow

The HWRF system can be run through the Rocoto workflow manager. The full end-to-end operational system, and many developmental variants, can be run through the run_hwrf.py script (see: run_hwrf_main), which is a wrapper around Rocoto's rocotorun command. The Rocoto workflow manager submits jobs, monitors running jobs, and tracks file and time dependencies. A suite of XML files in the rocoto/ subdirectory tells Rocoto about all of these dependencies and how to run the scripts (see Scripting Layer).

Note that the multi-storm HWRF has its own special page with details on running HWRF via Rocoto. Make sure you read that page too before attempting the multi-storm configuration.

Starting HWRF Using Rocoto

The rocotorun program is the main program for Rocoto, but it requires an XML file to run. The run_hwrf.py wrapper script (see run_hwrf_main) creates this XML file, finds rocotorun, and executes it. The run_hwrf.py program is executed like so:

me@cluster> cd rocoto
me@cluster> ./run_hwrf.py -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=myexpt

Here, myexpt should be the name of the directory in which HWRF was installed (the parent of the ush, rocoto, exec, and other directories). For example, if you install HWRF here:

/lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557

Then your config.EXPT should be trunk-r3557:

me@cluster> cd rocoto
me@cluster> ./run_hwrf.py -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557

If all goes according to plan, you should see an isaac.xml file appear, filled with Rocoto XML workflow information, and an isaac.db SQLite3 database file, filled with Rocoto's internal information about tasks and cycles. The first job in the workflow, scripts.exhwrf_launch, will be running for the first cycle.

After launching the first job, you must continue running run_hwrf.py every 5-20 minutes or so. Every time it runs, Rocoto checks dependencies and submits new jobs.

me@cluster> ./run_hwrf.py -f -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557

Note the added -f argument. It tells run_hwrf.py that the isaac.xml and isaac.db files already exist and that you approve of the script replacing them. If you omit -f, run_hwrf.py will ask whether you want to replace them.
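Because run_hwrf.py must be re-run every few minutes, it is common to automate the sweep with cron. A hypothetical crontab entry is sketched below, using the example installation path from this page; the ten-minute interval and the paths are assumptions, not requirements:

```crontab
# Hypothetical entry (install with `crontab -e`): re-run the workflow
# sweep every 10 minutes. The -f flag suppresses the prompt about
# replacing the existing isaac.xml and isaac.db files.
*/10 * * * * cd /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/rocoto && ./run_hwrf.py -f -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557
```

Remember to remove the entry once the experiment has finished, or run_hwrf.py will keep sweeping an already-complete workflow.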

Sanity Checks and Other Paranoia

Several things can go wrong if your installation is not quite right. Either the HWRF sanity checks or Rocoto will warn you about them: HWRF prints a message about failed sanity checks, with suggestions for fixing the problem, while Rocoto prints messages about timeouts, hangups, or XML syntax errors.

Alternative Configurations

In the run_hwrf.py command, you can provide configuration information after the config.EXPT= option. Additional arguments are configuration files to read or specific options to override. You can find some configuration files in the parm directory. For example, this would run the 3km version of HWRF:

me@cluster> ./run_hwrf.py -f -w isaac.xml -d isaac.db 2012 09L HISTORY config.EXPT=trunk-r3557 ../parm/hwrf_3km.conf

You can run a GEFS ensemble at 3 km with 43 levels, but then you must also provide the list of ensemble members to run at the beginning of the run_hwrf.py arguments, plus some additional configuration files. Let's say we want to run all twenty-one members (the control and 20 perturbed members):

me@cluster> ./run_hwrf.py -f -w isaac.xml -d isaac.db 00-20 2012 09L HISTORY config.EXPT=trunk-r3557 ../parm/hwrf_3km.conf ../parm/hwrf_ensemble.conf

The changes we added to the command are:

- 00-20 — the list of ensemble members to run (the control member 00 and perturbed members 01 through 20)
- ../parm/hwrf_ensemble.conf — the configuration file that enables the ensemble

See HWRF Experiment Configuration for more detail about the available alternate configurations of HWRF.

Monitoring the Rocoto-Based HWRF System

There are a few commands in Rocoto 1.2 and later that let you monitor a running HWRF system and request re-submission of failed jobs.

Automatic Re-Submission

Rocoto will automatically resubmit failed jobs until a maximum-retry threshold is reached. This automatic re-submission is critical when running on NOAA R&D machines, since the hardware and software are not 100% reliable. HWRF's Rocoto XML workflow generally sets this limit to 3 for most jobs, and 7 for jobs that access the network. After that limit is reached, the user must intervene.

The rocotostat Command

The first is rocotostat, which prints a list of all jobs for the cycles Rocoto has started. For each job, rocotostat prints the job status, the number of times Rocoto resubmitted it, and other information:

me@cluster> rocotostat -w isaac.xml -d isaac.db -c ALL

The output will look like this:

 CYCLE TASK JOBID STATE EXIT STATUS TRIES DURATION
==================================================================================================================
201208201200 ensda_pre 43320461 SUCCEEDED 0 1 46.0
201208201200 launch_E99 43320156 SUCCEEDED 0 1 23.0
201208201200 input_E99 43320283 SUCCEEDED 0 1 214.0
201208201200 bdy_E99 43321045 SUCCEEDED 0 1 1906.0
201208201200 init_GFS_0_E99 43320462 SUCCEEDED 0 1 770.0
201208201200 init_GDAS1_3_E99 43320463 SUCCEEDED 0 1 1319.0
201208201200 init_GDAS1_6_E99 43320464 SUCCEEDED 0 1 1281.0
201208201200 init_GDAS1_9_E99 43320465 SUCCEEDED 0 1 1295.0
201208201200 ocean_init_E99 43320466 SUCCEEDED 0 1 883.0
201208201200 relocate_GFS_0_E99 43321046 SUCCEEDED 0 1 772.0
201208201200 relocate_GDAS1_3_E99 43321341 SUCCEEDED 0 1 771.0
201208201200 relocate_GDAS1_6_E99 43321342 SUCCEEDED 0 1 944.0
201208201200 relocate_GDAS1_9_E99 43321343 SUCCEEDED 0 1 1020.0
201208201200 bufrprep_E99 43320467 SUCCEEDED 0 1 16.0
201208201200 gsi_d02_E99 43321974 SUCCEEDED 0 1 2009.0
201208201200 gsi_d03_E99 43321975 SUCCEEDED 0 1 1059.0
201208201200 gsi_post_E99 43332079 SUCCEEDED 0 1 1702.0
201208201200 merge_E99 43332080 SUCCEEDED 0 1 525.0
201208201200 check_init_E99 43332318 SUCCEEDED 0 1 15.0
201208201200 coupled_forecast_E99 43332438 SUCCEEDED 0 1 10693.0
201208201200 uncoupled_forecast_E99 - - - - -
201208201200 unpost_E99 43335424 SUCCEEDED 0 1 20.0
201208201200 post_E99 43335547 DEAD 0 3 10112.0
201208201200 post_helper_E99 43335548 DEAD 0 3 10105.0
201208201200 products_E99 43336031 DEAD 0 3 10905.0
201208201200 output_E99 - - - - -

Some of the jobs are "DEAD," and two jobs were never even submitted (the jobs shown with hyphens). A job is dead once it has failed too many times; at that point, you must resubmit it yourself. There are two ways to convince Rocoto to resubmit something.
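Scanning this output by eye gets tedious for long workflows. A small filter can pull out only the dead tasks. The awk pattern below is a sketch: in practice you would pipe the real `rocotostat -w isaac.xml -d isaac.db -c ALL` output into it, but here a two-line sample stands in for that output, and the STATE column is assumed to be field 4 as in the listing above.

```shell
# Sketch: print the task name (column 2) of every DEAD task.
# A two-line sample stands in for real `rocotostat ... -c ALL` output.
sample='201208201200 post_E99 43335547 DEAD 0 3 10112.0
201208201200 merge_E99 43332080 SUCCEEDED 0 1 525.0'
printf '%s\n' "$sample" | awk '$4 == "DEAD" { print $2 }'
```

With the sample above, only post_E99 is printed; SUCCEEDED tasks are skipped.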

The rocotorewind Command

This is the preferred method of resubmitting failed jobs. The rocotorewind command tells Rocoto to "undo" the effects of a job and submit it again once its dependencies are met. This command tells Rocoto to rewind the three failed jobs:

me@cluster> rocotorewind -w isaac.xml -d isaac.db -c 201208201200 -t post_E99 -t post_helper_E99 -t products_E99
me@cluster> ./run_hwrf.py -f 00-20 -w isaac.xml -d isaac.db 2012 09L HISTORY ... other arguments ...

Later runs of run_hwrf.py will submit the jobs once their dependencies are met. In this case, the post and post_helper jobs will be resubmitted as soon as you run run_hwrf.py again; the products job will start once the post job is running or complete.
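When many tasks die at once, typing each -t argument by hand is error-prone. The sketch below assembles the rocotorewind argument list from rocotostat-style output. The sample lines standing in for a real rocotostat call are assumptions, and the final command is only echoed rather than executed, so it can be inspected first.

```shell
# Sketch: build "-t task" arguments for every DEAD task in one cycle.
# Sample lines stand in for real `rocotostat -c 201208201200` output;
# STATE is assumed to be column 4 as in the listing above.
cycle=201208201200
tasks=$(printf '%s\n' \
    '201208201200 post_E99 43335547 DEAD 0 3 10112.0' \
    '201208201200 post_helper_E99 43335548 DEAD 0 3 10105.0' \
    '201208201200 products_E99 43336031 DEAD 0 3 10905.0' |
    awk '$4 == "DEAD" { printf " -t %s", $2 }')
# Echo the command instead of running it, so it can be checked first.
echo rocotorewind -w isaac.xml -d isaac.db -c "$cycle"$tasks
```

Dropping the echo (or piping the line to sh) would run the rewind for real.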

Todo:
At present, it is not possible to use rocotorewind if the cycle is entirely complete (that is, when the special "completion" job is done). This is a limitation of Rocoto that we hope to fix in the next release.

The rocotoboot Command

The alternative is to run rocotoboot, which tells Rocoto to run a job immediately, even if the retry limit has been reached. The problem with rocotoboot is that it ignores dependencies, so you must verify on your own that the dependencies are met. We know the post job's dependencies are met, so we can submit it:

me@cluster> rocotoboot -w isaac.xml -d isaac.db -c 201208201200 -t post_E99
me@cluster> ./run_hwrf.py -f 00-20 -w isaac.xml -d isaac.db 2012 09L HISTORY ... other arguments ...
Warning
Beware that if you rocotoboot a job that is not DEAD and whose dependencies are met, Rocoto will submit the same job twice. It is better to use rocotorewind instead of rocotoboot: rocotorewind ensures jobs are submitted only once it is safe to do so, and only one copy of each job is submitted at a time.

The rocotocheck Command

The rocotocheck command reports the full definition and state of a job, including why Rocoto has or has not submitted it. For example, to check the uncoupled forecast job for cycle 201208201800:

me@cluster> rocotocheck -w *xml -d *db -c 201208201800 -t uncoupled_forecast_E99

The output will look something like this:

Task: uncoupled_forecast_E99
 account: hwrfv3
 command: /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/ush/rocoto_pre_job.sh /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/scripts/exhwrf_forecast.py
 cores: 464
 final: false
 jobname: hwrf_atm_forecast_09L_2012082018_E99
 maxtries: 3
 memory:
 metatasks: meta_fcst
 name: uncoupled_forecast_E99
 nodes: 30:ppn=12+8:ppn=11+2:ppn=8
 queue: batch
 seqnum: 2
 stderr: /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L/hwrf_atm_forecast.err
 stdout: /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L/hwrf_atm_forecast.out
 throttle: 9999999
 walltime: 4:59:00
 environment
  CONFhwrf ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/com/2012082018/09L/storm1.conf
  HOMEhwrf ==> /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557
  HWRF_FORCE_TMPDIR ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L/tmpdir
  PARAFLAG ==> YES
  PYTHONPATH ==> /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/ush
  TOTAL_TASKS ==> 464
  USHhwrf ==> /lfs3/projects/hwrfv3/Samuel.Trahan/trunk-r3557/ush
  WHERE_AM_I ==> jet
  WORKhwrf ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/2012082018/09L
  envir ==> prod
  jlogfile ==> /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/log/jlogfile
/bin/sh -c grep -v 'RUN_COUPLED=YES' /pan2/projects/hwrfv3/Samuel.Trahan/pytmp/trunk-r3557/com/2012082018/09L/storm1.ocean_status (Mon Aug 20 18:00:00 UTC 2012)
 dependencies
  AND is not satisfied
   check_init_E99 of cycle 201208201800 is SUCCEEDED
   OR is not satisfied
    'YES'!='YES' is false
    grep -v 'RUN_COUPLED=YES' /pan2/projec... returned false

Cycle: 201208201800
 Valid for this task: YES
 State: active
 Activated: Tue Jun 02 20:09:53 UTC 2015
 Completed: -
 Expired: -

Job: This task has not been submitted for this cycle

Task can not be submitted because:
 Dependencies are not satisfied

Note the last bit: "Task can not be submitted because: Dependencies are not satisfied." That tells you why the task is not submitted, but why are the dependencies not satisfied? Look at this part:

dependencies
 AND is not satisfied
  check_init_E99 of cycle 201208201800 is SUCCEEDED
  OR is not satisfied
   'YES'!='YES' is false
   grep -v 'RUN_COUPLED=YES' /pan2/projec... returned false

That is the Rocoto dependency decision tree, which tells you why the dependencies are not met. In this case, the upstream check_init_E99 job is done, so the OR condition is the problematic part. It contains two conditions, only one of which must be met. The first is a string comparison: "YES" must not equal "YES". Clearly that condition will never be met. The other condition is a real one:

grep -v 'RUN_COUPLED=YES' /pan2/projec... returned false

It appears that the ocean initialization decided to run coupled. Hence the uncoupled forecast was never run, and the coupled forecast was run instead. This is not an error, so we can be satisfied that Rocoto is working as requested.

More Information About Rocoto

The Rocoto project is hosted on GitHub:

https://github.com/christopherwharrop/rocoto

Documentation is available in the Rocoto wiki:

https://github.com/christopherwharrop/rocoto/wiki/Documentation