HWRF  trunk@4391
hwrf.input.InputSource Class Reference

Fetch data from multiple sources. More...

Detailed Description

Fetch data from multiple sources.

This class knows how to fetch data from remote clusters or from the local machine. The data locations are specified by several DataCatalog sections, each of which is given a priority, a valid set of dates and a file transfer mechanism. Data catalogs are tried in priority order. Files are obtained in multiple threads at once, and several file transfer mechanisms are understood: local copies (file), ftp, sftp and htar (HPSS tape archives).

However, only one DataCatalog is examined at a time. All threads work on that one DataCatalog until everything that can be obtained from it has been retrieved. Then the threads exit, and new ones are spawned to examine the next DataCatalog.

For example, suppose you are on the Jet supercomputer running a HISTORY (retrospective) simulation. You set up these configuration sections in your hwrf.conf config file:

[jet_sources_prod2014]
jet_hist_PROD2014%location = file:///
jet_hist_PROD2014%histprio=90
jet_hist_PROD2014%fcstprio=90

prod15_data_sp%location=htar://
prod15_data_sp%histprio=59
prod15_data_sp%dates=2015011218-2015123118

[jet_hist_PROD2014]
@inc=gfs2014_naming
inputroot2014=/lfs3/projects/hwrf-data/hwrf-input
gfs={inputroot2014}/HISTORY/GFS.{aYYYY}/{aYMDH}/
gfs_sfcanl = gfs.t{aHH}z.sfcanl

[prod15_data_sp]
inputroot=/NCEPPROD/2year/hpssprod/runhistory/rh{aYYYY}/{aYYYY}{aMM}/{aYMD}
gfs={inputroot}/
gfs_sfcanl = {gfs_tar}#./gfs.t{aHH}z.sfcanl

[hwrfdata]
inputroot=/pan2/projects/hwrfv3/John.Doe/hwrfdata
gfs={inputroot}/hwrf.{aYMDH}/
gfs_sfcanl = gfs.t{aHH}z.sfcanl
and this is the code:

# Search the data sources listed in [jet_sources_prod2014], delivering the
# files to the local paths named by the [hwrfdata] catalog:
src = InputSource(conf, "jet_sources_prod2014", "2015071806")
hwrfdata = DataCatalog(conf, "hwrfdata")
src.get([
    {"dataset": "gfs", "item": "gfs_sfcanl", "atime": "2015071800"},
    {"dataset": "gfs", "item": "gfs_sfcanl", "atime": "2015071806"},
    {"dataset": "gfs", "item": "gfs_sfcanl", "atime": "2015071812"}],
    hwrfdata, realtime=False)

In this example, the InputSource will look for three GFS surface analysis files. It will search two possible locations for them: the on-disk Jet "PROD2014" history location and the NCO production tape files. The disk location will be searched first because its history priority is 90, while the tape area has a priority of 59.

Three files will show up eventually, at the destinations given by the [hwrfdata] catalog:

/pan2/projects/hwrfv3/John.Doe/hwrfdata/hwrf.2015071800/gfs.t00z.sfcanl
/pan2/projects/hwrfv3/John.Doe/hwrfdata/hwrf.2015071806/gfs.t06z.sfcanl
/pan2/projects/hwrfv3/John.Doe/hwrfdata/hwrf.2015071812/gfs.t12z.sfcanl

Each file will come from either the Jet disk area:

/lfs3/projects/hwrf-data/hwrf-input/HISTORY/GFS.2015/{aYMDH}/gfs.t{aHH}z.sfcanl

or, failing that, the NCO production tape archive, where the file is the ./gfs.t{aHH}z.sfcanl member of the {gfs_tar} archive under:

/NCEPPROD/2year/hpssprod/runhistory/rh2015/201507/20150718/

Definition at line 301 of file input.py.

Public Member Functions

def __init__
 InputSource constructor. More...
 
def add
 Adds a DataCatalog to this InputSource. More...
 
def open_ftp
 Opens an FTP connection. More...
 
def rsync_check_access
 Checks to see if rsync can even access a remote server. More...
 
def fetch_file
 Internal implementation function that fetches one file. More...
 
def list_for
 Returns the list of DataCatalog objects for FORECAST or HISTORY mode. More...
 
def priotable (self, dclist)
 Generates a string containing a human-readable, prioritized list of data sources. More...
 
def get
 Transfers the specified set of data to the specified target. More...
 
def get_one (self, dataset, item, dest, logger=None, timeout=20, realtime=True, **kwargs)
 This is a simple wrapper around fetch_file that gets only one file. More...
 

Public Attributes

 conf
 The hwrf.config.HWRFConfig object used for configuration info.
 
 section
 The section in conf that contains the data catalog list and relevant info.
 
 anltime
 The default analysis time. More...
 
 forecast
 List of forecast mode DataCatalog objects. More...
 
 history
 List of history mode DataCatalog objects. More...
 
 locks
 Lock objects to restrict access to FTP servers to one thread at a time. More...
 
 htar
 A produtil.prog.ImmutableRunner that runs htar. More...
 
 hsi
 A produtil.prog.ImmutableRunner that runs hsi. More...
 
 valid
 Data source validity information. More...
 

Constructor & Destructor Documentation

def hwrf.input.InputSource.__init__ (   self,
  conf,
  section,
  anltime,
  htar = None,
  logger = None,
  hsi = None 
)

InputSource constructor.

Parameters
conf - the hwrf.config.HWRFConfig to use for configuration info
section - the section that specifies the list of data catalogs
anltime - the default analysis time
htar - the produtil.prog.Runner that runs htar
logger - a logging.Logger for log messages
hsi - the produtil.prog.Runner that runs hsi
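
A minimal construction sketch follows. It assumes a previously loaded hwrf.config.HWRFConfig object named conf that contains the [jet_sources_prod2014] section from the class description; the logger, htar and hsi arguments are optional, and how conf is created is outside this sketch.

import logging
from hwrf.input import InputSource

logger = logging.getLogger("hwrf.input")

# "conf" is assumed to be an existing hwrf.config.HWRFConfig object that
# already defines the [jet_sources_prod2014] section shown in the class
# description.  The analysis time is given as a string, matching that
# example; htar and hsi default to None when not supplied.
src = InputSource(conf, "jet_sources_prod2014", "2015071806", logger=logger)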

Definition at line 384 of file input.py.

Member Function Documentation

def hwrf.input.InputSource.add (   self,
  dc,
  location,
  fcstprio = None,
  histprio = None,
  dates = None 
)

Adds a DataCatalog to this InputSource.

Called automatically from the constructor to add a DataCatalog to this InputSource. The list of add() calls is generated from the config section specified in the constructor. You should never need to call this function unless you want to explicitly add more DataCatalog objects that are not listed in the config files.

The location parameter is a URL using the file, sftp, ftp or htar scheme, such as the file:/// and htar:// locations in the configuration example in the class description.

Warning
Do not add the same data source twice; doing so will cause problems.
Note
If fcstprio and histprio are both None, this call has no effect.
Parameters
dc - the DataCatalog object
location - the URL of the data source, including the username if needed
fcstprio - the priority for using this source in FORECAST (real-time) mode; if missing or None, the source will not be used in FORECAST mode
histprio - the priority for using this source in HISTORY (retrospective) mode; if missing or None, the source will not be used in HISTORY mode
dates - dates for which this source is valid; this is passed to the trange argument of in_date_range(t,trange)
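
As an illustration only, here is a hedged sketch of an explicit add() call registering one extra catalog with the src InputSource from the earlier examples. The section name extra_disk_data is hypothetical and would have to exist in your configuration files.

# Hypothetical extra catalog, constructed the same way as the hwrfdata
# catalog in the class description; the section name is made up.
extra = DataCatalog(conf, "extra_disk_data")

# Register it for HISTORY mode only: fcstprio is omitted, so the source is
# ignored in FORECAST mode.  The dates string restricts when it is valid.
src.add(extra, location="file:///",
        histprio=50, dates="2015010100-2015123118")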

Definition at line 511 of file input.py.

Referenced by produtil.fileop.FileWaiter.add().

def hwrf.input.InputSource.fetch_file (   self,
  streams,
  dc,
  dsurl,
  urlmore,
  dest,
  logger = None,
  timeout = 20,
  realtime = True 
)

Internal implementation function that fetches one file.

You should not call this directly; it is meant to be called by "get" and re-implemented in subclasses. This grabs one file, potentially from a remote location. The URL for the base directory of some dataset is in dsurl, while the specific file is in urlmore. The urlmore will be appended to the file part of dsurl via urljoin, and the resulting file will be transferred.

Parameters
streams - a list used to store opened streams
dc - the DataCatalog being obtained
dsurl - the URL of the DataCatalog
urlmore - additional parts of the URL such as the reference or HTTP Get
dest - the local disk destination
logger - the logging.Logger for log messages
timeout - the connection timeout in seconds
realtime - True for FORECAST mode, False for HISTORY mode.
Returns
True if successful, False if not

Definition at line 611 of file input.py.

Referenced by hwrf.input.InputSource.get_one(), hwrf.input.InputSource.list_for(), and hwrf.input.InputSource.rsync_check_access().

def hwrf.input.InputSource.get (   self,
  data,
  target_dc,
  realtime = False,
  logger = None,
  skip_existing = True 
)

Transfers the specified set of data to the specified target.

The "target_dc" is a DataCatalog that specifies the destination filenames. The "realtime" argument is True for FORECAST (real-time) mode runs, and False for HISTORY (retrospective) mode runs. The "data" argument should be an iterable (list, tuple, etc.) where each element is a dict-like object that describes one file to obtain. Each dict contains:

dataset - string name of the dataset (gfs, gdas1, gefs, enkf, etc.)
item - string name of the object (e.g. sf, sfcanl, bufr)
atime - optional: a datetime.datetime specifying the analysis time. The default is the atime from the InputSource's constructor.
ftime - optional: a datetime.datetime specifying the forecast time.
...others... - any other keyword arguments will be sent to the .location functions in any of this InputSource's DataCatalog objects.
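
A hedged sketch of a typical call, assuming the src InputSource and hwrfdata DataCatalog from the class description; it passes atime as a datetime.datetime rather than a string.

import datetime

atime = datetime.datetime(2015, 7, 18, 0)

# Each dict describes one file to fetch; atime defaults to the InputSource's
# own analysis time when omitted.
request = [
    {"dataset": "gfs", "item": "gfs_sfcanl", "atime": atime},
    {"dataset": "gfs", "item": "gfs_sfcanl",
     "atime": atime + datetime.timedelta(hours=6)},
]

# hwrfdata chooses the local destination filenames; realtime=False selects
# HISTORY (retrospective) mode, and existing files are left alone.
src.get(request, hwrfdata, realtime=False, skip_existing=True)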

Definition at line 944 of file input.py.

Referenced by hwrf.wrfbase.WRFDomains.__contains__(), hwrf.wrfbase.WRFDomains.__getitem__(), hwrf.wrfbase.WRFDomains.add(), hwrf.wrf.WRFSimulation.analysis_name(), produtil.datastore.UpstreamFile.check(), hwrf.regrib.GRIB1Product.getgrib1grbindex(), hwrf.regrib.GRIB1Product.getgrib1grid(), hwrf.regrib.GRIB1Product.getgrib1index(), hwrf.regrib.GRIB2Product.getgrib2grid(), hwrf.regrib.GRIB2Product.getgrib2index(), and hwrf.input.InputSource.priotable().

def hwrf.input.InputSource.get_one (   self,
  dataset,
  item,
  dest,
  logger = None,
  timeout = 20,
  realtime = True,
  **kwargs 
)

This is a simple wrapper around fetch_file that gets only one file.

It will fail if the file requires pulling an archive.

Parameters
dataset - the dataset to transfer
item - the desired item in the dataset
dest - the on-disk destination filename
logger - a logging.Logger for log messages
timeout - the connection timeout in seconds
realtime - True for FORECAST mode, False for HISTORY mode
kwargs - extra keyword arguments are passed to DataCatalog.locate()
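
A hedged sketch of a single-file fetch, reusing the src InputSource and logger from the earlier examples; the destination path below is hypothetical.

# Fetch one GFS surface analysis file straight to an explicit local path.
# The path is hypothetical; extra keyword arguments (none here) would be
# forwarded to DataCatalog.locate().
dest_path = "/path/to/workdir/gfs.t00z.sfcanl"
src.get_one("gfs", "gfs_sfcanl", dest_path,
            logger=logger, timeout=60, realtime=False)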

Definition at line 1068 of file input.py.

Referenced by hwrf.input.InputSource.get().

def hwrf.input.InputSource.list_for (   self,
  realtime = True 
)

Returns the list of DataCatalog objects for FORECAST or HISTORY mode.

Parameters
realtime - True for FORECAST mode, False for HISTORY
Returns
self.forecast or self.history
Postcondition
_sort() has been called, sorting self.forecast and self.history in order of priority

Definition at line 787 of file input.py.

Referenced by hwrf.input.InputSource.get(), and hwrf.input.InputSource.get_one().

def hwrf.input.InputSource.open_ftp (   self,
  netpart,
  logger = None,
  timeout = 20 
)

Opens an FTP connection.

Opens the specified ftp://user@host/... request subject to the specified timeout, logging to the specified logger (if present and not None).

Parameters
netpart - the netpart portion of the URL
logger - the logging.Logger for log messages
timeout - the connection timeout in seconds
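
A heavily hedged usage sketch (this method is mostly called internally by fetch_file). The host name is only a placeholder, and the assumption that the return value behaves like a connected ftplib.FTP session is mine, not the documentation's.

# netpart is the user@host portion of an ftp:// URL; the host below is a
# placeholder.  The returned object is assumed, not documented here, to be
# a connected FTP session.
ftp_session = src.open_ftp("anonymous@ftp.example.gov",
                           logger=logger, timeout=20)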

Definition at line 556 of file input.py.

Referenced by hwrf.input.InputSource.fetch_file().

def hwrf.input.InputSource.priotable (   self,
  dclist 
)

Generates a string containing a human-readable, prioritized list of data sources.

Parameters
dclist - the data source list from list_for()
Returns
A multi-line string containing the table.

Example:

Prioritized list of data sources:
PRIO- LOCATION = SOURCE @ DATES
100 - file:/// = DataCatalog(conf,'wcoss_fcst_PROD2014',2015080518) @ '1970010100-2038011818'
098 - file:/// = DataCatalog(conf,'wcoss_prepbufrnr_PROD2014',2015080518) @ '1970010100-2038011818'
097 - file:// = DataCatalog(conf,'zhan_gyre',2015080518) @ '2011060718-2011111200,2013051800-2013091018'
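
A short sketch that would log a table like the one above, assuming the src InputSource and logger from the earlier examples.

# list_for(realtime=False) returns the HISTORY-mode catalogs in priority
# order; priotable() renders them as the human-readable table shown above.
logger.info(src.priotable(src.list_for(realtime=False)))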

Definition at line 922 of file input.py.

Referenced by hwrf.input.InputSource.get().

def hwrf.input.InputSource.rsync_check_access (   self,
  netpart,
  logger = None,
  timeout = 20,
  dirpath = '/' 
)

Checks to see if rsync can even access a remote server.

Parameters
netpart - the netpart portion of the URL
logger - the logging.Logger for log messages
timeout - the connection timeout in seconds
Returns
True if the server is accessible and False otherwise
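
A hedged sketch of a pre-flight check, again reusing src and logger from the earlier examples; the user and host names are placeholders.

# Verify that rsync over ssh can reach the remote server before relying on
# an sftp:// data source; the server name below is a placeholder.
if not src.rsync_check_access("Some.Username@remote.example.gov",
                              logger=logger, timeout=20):
    logger.error("rsync cannot access the remote server")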

Definition at line 594 of file input.py.

Referenced by hwrf.input.InputSource.fetch_file(), and hwrf.input.InputSource.get().

Member Data Documentation

hwrf.input.InputSource.anltime

The default analysis time.

hwrf.input.InputSource.forecast

List of forecast mode DataCatalog objects.

Definition at line 402 of file input.py.

Referenced by hwrf.input.InputSource.list_for().

hwrf.input.InputSource.history

List of history mode DataCatalog objects.

Definition at line 404 of file input.py.

Referenced by hwrf.input.InputSource.list_for().

hwrf.input.InputSource.hsi

A produtil.prog.ImmutableRunner that runs hsi.

Definition at line 410 of file input.py.

Referenced by hwrf.input.InputSource.list_for().

hwrf.input.InputSource.htar

A produtil.prog.ImmutableRunner that runs htar.

Definition at line 409 of file input.py.

Referenced by hwrf.input.InputSource.list_for().

hwrf.input.InputSource.locks

Lock objects to restrict access to FTP servers to one thread at a time.

Definition at line 406 of file input.py.

Referenced by hwrf.input.InputSource.fetch_file().

hwrf.input.InputSource.valid

Data source validity information.

Definition at line 411 of file input.py.


The documentation for this class was generated from the following file:

input.py