This page last changed on May 28, 2008 by chuckp.
How to Fix FASTCopy Outgoing
Owned by: Philip Hart
The FASTCopy Outgoing monitoring page shows the transfer process states ("new", "batchdone", etc.; see diagram below). This is the first piece of information to use when diagnosing problems (e.g., when the GSSC wants to know why some product has not yet arrived).
How to Fix
A file stuck in:
- NEW (should not be there more than 1~2 minutes)
- BATCHDONE
- INITDONE
- LOCALDONE
- EXITDONE
|
|
Erroneous: SUCCESS ERROR (see Example: Audit Trail for LISOC_2008136200914.tar) |
Notes:
- Sender may, for example, be:
- L1 processing using FASTCopy.py to send the GSSC a data product.
- Mission Planning Tool (MPT), e.g., when the operator sends a planning product to the MOC, or to the GSSC.
Note: The names of the MPT submitters must be in the FCOPY_SUBMITTERS table for the submittal to work.
Note also: If the GSSC is down, the MPT can resend its products directly to the MOC.
- Someone running FC_send.sh as a test.
- A procedure is then invoked to tar the products; the resulting tarball is copied to /nfs/farm/g/glast/u23/ISOC-.../glastops/Outgoing, and entries are made in the Oracle FCOPY_OUTGOING and FCOPY_PKGINFO tables (the latter is a manifest of the contents).
- The cron process on glastlnx11 runs FASTCopyDB::Cron() to send NEW tarballs to flogicd@glastlnx11 via fcopy -batch. The daemon, which is a simple state machine, runs FASTCopyDB::Init(), fcopys the .tar to the destination, runs the remote post-transfer code, runs Local() to do local cleanup, then exits.
- Script installations and rpms are as in the incoming section.
|
1. File stuck in "NEW"
To Test: Check FASTCopy Outgoing. If the file has been stuck in the "NEW" state for more than 1~2 minutes, there is probably a problem with the cron process on glastlnx11.
To Fix: Check Nagios to ensure that glastlnx11 is okay. If it is not, call xHELP at (650) 926-4357. You will get the Help menu. Select "page the on-call Technical Coordinator" who will try to assist, or will contact a member of the Infrastructure Group as needed.
Tip: If glastlnx11 is okay, connect to glastlnx11, then check /var/log/flightops/cron.log for Python exceptions or other errors. |
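The log check in the tip above can be partly automated. The following is a minimal sketch, not an existing ISOC tool: the function name find_tracebacks is hypothetical, and it relies only on the fact that a Python traceback begins with the line "Traceback (most recent call last):". It does not detect other kinds of errors in the log.

```python
from typing import Iterable, List, Tuple

# Standard first line of every Python traceback.
TRACEBACK_MARKER = "Traceback (most recent call last):"


def find_tracebacks(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for each line that starts a Python traceback.

    `lines` can be any iterable of strings, e.g. an open file handle for
    /var/log/flightops/cron.log (hypothetical usage; this helper is only
    an illustration).
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if TRACEBACK_MARKER in line:
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it on glastlnx11, one would open the log with open("/var/log/flightops/cron.log", errors="replace") and pass the handle to find_tracebacks; the reported line numbers show where to start reading.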
2. File stuck in "BATCHDONE"
Note: There is probably a problem with the flogicd daemon on glastlnx11. This function is not serviced by the usual FOS daemons infrastructure described in the Confluence Daemons and daemon control pages.
To Test: See if flogicd is running on glastlnx11. [E.g., ssh to glastlnx11, then see if flogicd is running under root by doing ps aux | grep flogicd. Or do service --status-all | grep flogicd.]
To Fix: If necessary, run /sbin/service flogicd start. [This can be done as glastops or via sudo; the required level of permissions may mean calling an expert.] |
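The "ps aux | grep flogicd" test above can also be sketched as a small parser. This is an illustration only: the function name find_process_lines is hypothetical, the sample command line in the comment is invented, and the code assumes the usual 11-column "ps aux" layout with COMMAND as the last column.

```python
from typing import List


def find_process_lines(ps_output: str, name: str = "flogicd") -> List[str]:
    """Return the `ps aux` lines whose COMMAND column mentions `name`.

    Lines belonging to a `grep` for the same name are ignored, as is the
    header line. `ps_output` is the text printed by `ps aux`.
    """
    matches = []
    for line in ps_output.splitlines():
        # ps aux prints 11 columns; splitting at most 10 times keeps the
        # whole command line (which may contain spaces) in the last field.
        fields = line.split(None, 10)
        if len(fields) < 11 or fields[0] == "USER":
            continue
        command = fields[10]
        if name in command and "grep" not in command:
            matches.append(line)
    return matches


# Hypothetical example of a line that would match:
#   root 123 0.0 0.1 1000 500 ? Ss May15 0:01 /usr/sbin/flogicd
```

An empty result suggests that flogicd is not running. The text to parse could be obtained with subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout.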
Note: At this point, the local tarball still exists, so you can reset the FCOPY_OUTGOING table entry to NEW in the same way as described in How to Fix FASTCopy Incoming: Ingest Failure.
For example, the tarball initially lives in:
- /nfs/farm/g/glast/u23/ISOC/glastops/Outgoing/B33_2008054211250.tar
and gets moved by the FASTcopyDB::init() [maybe local()?] method to:
- /nfs/farm/g/glast/u23/ISOC/Archive/fcopy/2008/02/054.02.23.Sat/utc21/21.19.34/B33_2008054211250.tar
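The tarball names in this page appear to embed a timestamp of the form YYYYDDDHHMMSS (year, day of year, time). That is an inference from the examples, not a documented format: day 054 of 2008 is February 23, which matches the 054.02.23.Sat archive directory above. A minimal sketch for decoding it, with the hypothetical helper name package_timestamp:

```python
import re
from datetime import datetime

# Assumed naming pattern: <prefix>_<YYYYDDDHHMMSS>.tar
_NAME_RE = re.compile(r"_(\d{13})\.tar$")


def package_timestamp(filename: str) -> datetime:
    """Decode the timestamp assumed to be embedded in a tarball name.

    Accepts a bare file name or a full path. Raises ValueError if the
    name does not end in _<13 digits>.tar.
    """
    match = _NAME_RE.search(filename)
    if match is None:
        raise ValueError(f"no YYYYDDDHHMMSS timestamp in {filename!r}")
    # %j is the zero-padded day of the year (001-366).
    return datetime.strptime(match.group(1), "%Y%j%H%M%S")


print(package_timestamp("B33_2008054211250.tar"))  # 2008-02-23 21:12:50
```

Knowing the date makes it easier to locate a package under the Archive/fcopy tree. Note that the time directory in the archive path (21.19.34 in the example) differs from the time in the file name, so only the date part should be used for this.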
To Test: Launch FASTCopy Monitoring (FCWebView), select Outgoing and check the Status column.
- If there is a package stuck in "BATCHDONE", you need to reset the "submitted" flag to "new" so that the cron job will pick it up and resubmit the job.
- First, hover over the Package name of the failed package.
- From the status bar at the bottom of the page, copy the icdfile_pk number (e.g., icdfile_pk=63677).
- From an ISOC environment terminal, you can access the relevant Oracle instance via a wrapper, e.g.,
rlwrap sqlplus /@isocnightly
or:
... flight
Note: For others, see: $TNS_ADMIN/tnsnames.ora
Tip: It may be useful to inspect:
desc fcopy_icdfile
select * from fcopy_jobstate
To Fix:
- To change the table status, run:
update fcopy_icdfile set jobstate_fk = 1 where icdfile_pk in (1234, 234, 545)
- Or, to reingest an entire tarball:
... where icdfile_pk = 123456789
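When several packages need resetting, typing the key list by hand is error-prone. The sketch below builds the update statement shown above from a list of icdfile_pk values. The function name build_reset_statement is hypothetical. The value jobstate_fk = 1 is taken from the example above and is assumed to correspond to the "new" state; confirm it with select * from fcopy_jobstate before running anything.

```python
from typing import Iterable


def build_reset_statement(icdfile_pks: Iterable[int], jobstate_fk: int = 1) -> str:
    """Build the SQL that resets the given fcopy_icdfile rows.

    Keys are converted with int(), so anything that is not a whole
    number raises ValueError instead of ending up in the SQL text.
    Duplicates are dropped and the keys are sorted for readability.
    """
    keys = sorted({int(pk) for pk in icdfile_pks})
    if not keys:
        raise ValueError("at least one icdfile_pk is required")
    key_list = ", ".join(str(pk) for pk in keys)
    return (
        f"update fcopy_icdfile set jobstate_fk = {int(jobstate_fk)} "
        f"where icdfile_pk in ({key_list});"
    )


print(build_reset_statement([63677]))
# update fcopy_icdfile set jobstate_fk = 1 where icdfile_pk in (63677);
```

The printed statement can be pasted into the sqlplus session described above.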
|
3. File stuck in "INITDONE"
There is probably some network issue or problem with the fcopyd at the destination.
To Fix: Contact the destination or SLAC computing. |
4. File stuck in "LOCALDONE"
There is a problem with the remote post-transfer process.
Check the log in: /nfs/farm/g/glast/u23/ISOC-flight/glastOps/Outgoing/ |
5. File stuck in "EXITDONE"
Note: The file is not expected to be in EXITDONE. At this stage, flogicd will handle problems by closing the transfer record and starting a new one, which will be duly noted in the central log. |
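The five cases above amount to a lookup from the stuck state to the most likely culprit. The sketch below records that mapping; the names TRIAGE and triage are hypothetical, and the text of each entry only restates the guidance on this page.

```python
# State -> first thing to suspect, as described in sections 1-5 above.
TRIAGE = {
    "NEW": "cron process on glastlnx11: check Nagios, then /var/log/flightops/cron.log",
    "BATCHDONE": "flogicd daemon on glastlnx11: check that it is running, then reset the entry to NEW",
    "INITDONE": "network or fcopyd at the destination: contact the destination or SLAC computing",
    "LOCALDONE": "remote post-transfer process: check the log in the Outgoing directory",
    "EXITDONE": "not expected: flogicd closes the transfer record and starts a new one",
}


def triage(state: str) -> str:
    """Return the first thing to check for a package stuck in `state`."""
    key = state.strip().upper()
    if key not in TRIAGE:
        return f"Unknown state {state!r}: check the FASTCopy Outgoing page"
    return TRIAGE[key]


print(triage("batchdone"))
```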
Audit Trail for LISOC_2008136200914.tar
When you click on a link in FASTCopy Monitoring: Outgoing to a successfully "fcopyed" file (e.g., LISOC_2008136200914.tar), you will see a "SUCCESS ERROR" message as shown below. Ignore the "ERROR" part of the message; it is a quirk of FASTCopy's output and does not indicate a failure.
..... |
COMPLETED JOBS *******
Famil Paren Job Id Node State Fcopy User Group Pr
-------------------------------------------------------------------------------
00003 00000 00003 fcopy batch glyph.gsfc DONE -1 glasto users
Thu May 15 13:10:06 2008:*Issuing (synchronous) Init Command: flogic_init.sh 64195 glyph.gsfc.nasa.gov LISOC_2008136200914.tar
Thu May 15 13:10:10 2008: Init Command completed successfully
Thu May 15 13:10:10 2008:*FASTCopying...
Thu May 15 13:12:06 2008: FCOPY Batch job is successfully done!
Thu May 15 13:12:06 2008:*Issuing (asynchronous) Exit Command: flogic_exit.sh 64195 glyph.gsfc.nasa.gov LISOC_2008136200914.tar 1 SUCCESS ERROR |
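When scanning many audit trails, the "SUCCESS ERROR" convention can be applied mechanically. The sketch below pulls the arguments out of an "Exit Command:" line like the one above and treats the presence of SUCCESS as success, ignoring the trailing ERROR. The function names are hypothetical, and the meaning of the individual arguments to flogic_exit.sh is not documented here, so the code does not interpret them.

```python
from typing import List, Optional

EXIT_MARKER = "Exit Command:"


def exit_command_args(line: str) -> Optional[List[str]]:
    """Return the words following "Exit Command:", or None if absent.

    A stray "|" left over from the page's table layout is dropped.
    """
    _, found, rest = line.partition(EXIT_MARKER)
    if not found:
        return None
    return [token for token in rest.split() if token != "|"]


def reported_success(line: str) -> bool:
    """True if the exit command line carries the word SUCCESS.

    Per the note above, a trailing ERROR after SUCCESS is ignored.
    """
    args = exit_command_args(line)
    return args is not None and "SUCCESS" in args
```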
|