This page last changed on May 28, 2008 by chuckp.

How to Fix FASTCopy Outgoing

Owned by: Philip Hart

The FASTCopy Outgoing monitoring page shows the transfer process states ("new", "batchdone", etc.; see diagram below); this is the first piece of information to use when diagnosing problems (e.g., when the GSSC wants to know why some product has not yet arrived).

How to Fix
A file stuck in:
  1. NEW (should not be there more than 1~2 minutes)
  2. BATCHDONE
  3. INITDONE
  4. LOCALDONE
  5. EXITDONE
Erroneous "SUCCESS ERROR" message: see Example: Audit Trail for LISOC_2008136200914.tar
Notes:
  • Sender may, for example, be:
    • L1 processing using FASTCopy.py to send the GSSC a data product.
    • Mission Planning Tool (MPT), e.g., when the operator sends a planning product to the MOC, or to the GSSC.
      Note: The names of the MPT submitters must be in the FCOPY_SUBMITTERS table for the submittal
      to work.
      Note also: If the GSSC is down, the MPT can resend its products directly to the MOC.
    • Someone running FC_send.sh as a test.
  • A procedure is then invoked to tar the products, and the resulting tarball is copied to
    /nfs/farm/g/glast/u23/ISOC-.../glastops/Outgoing; entries are made in the Oracle FCOPY_OUTGOING
    and FCOPY_PKGINFO tables (the latter is a manifest of the contents).
  • The cron process on glastlnx11 runs FASTCopyDB::Cron() to send NEW tarballs to flogicd@glastlnx11 via fcopy -batch.
    The daemon, which is a simple state machine, runs FASTCopyDB::Init(), fcopys the .tar to the destination, runs
    the remote post-transfer code, runs Local() to do local cleanup, then exits.
  • Script installations and rpms are as in the incoming section.
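
As a quick reference, the likely culprit for each stuck state (as described in the numbered sections of this page) can be written down as a small lookup table. This is only a sketch: the state names come from the monitoring page, but the TRIAGE dictionary and the triage() function are illustrative names, not part of the FASTCopy tooling.

```python
# Triage table assembled from the per-state sections of this page.
# TRIAGE and triage() are hypothetical helper names, not FASTCopy code.
TRIAGE = {
    "NEW": ("cron process on glastlnx11",
            "check Nagios, then /var/log/flightops/cron.log"),
    "BATCHDONE": ("flogicd daemon on glastlnx11",
                  "check that flogicd is running and start it if necessary"),
    "INITDONE": ("network, or fcopyd at the destination",
                 "contact the destination or SLAC computing"),
    "LOCALDONE": ("remote post-transfer process",
                  "check the log in the Outgoing directory"),
    "EXITDONE": ("not expected; flogicd closes and restarts the transfer record",
                 "look in the central log"),
}


def triage(state):
    """Return a one-line hint for a transfer stuck in the given state."""
    key = state.strip().upper()
    if key not in TRIAGE:
        raise ValueError("unknown FASTCopy Outgoing state: %r" % state)
    suspect, action = TRIAGE[key]
    return "%s: suspect %s; %s" % (key, suspect, action)
```

For example, triage("batchdone") returns a hint pointing at the flogicd daemon.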

1. File stuck in "NEW"

To Test: Check FASTCopy Outgoing.  If the file has been stuck in the "NEW" state for more than 1~2 minutes, there's probably a problem with the cron process on glastlnx11.

To Fix: Check Nagios to ensure that glastlnx11 is okay.  If it is not, call xHELP at (650) 926-4357.  You will get the Help menu. Select "page the on-call Technical Coordinator" who will try to assist, or will contact a member of the Infrastructure Group as needed.

Tip: If glastlnx11 is okay, connect to glastlnx11, then check /var/log/flightops/cron.log for Python exceptions or other errors.
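
Scanning the cron log for Python exceptions can be sketched as below. This is a minimal example that assumes the log contains unmodified Python tracebacks (i.e., no per-line prefix added by the logger); find_tracebacks() is a hypothetical helper, not an existing ISOC tool.

```python
MARKER = "Traceback (most recent call last):"


def find_tracebacks(lines):
    """Return (start_line_number, exception_summary) for each traceback.

    Assumes standard traceback layout: a marker line, indented frame
    lines, then one unindented line naming the exception.
    """
    results = []
    start = None
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if MARKER in line:
            start = number
        elif start is not None and line and not line[0].isspace():
            # First unindented line after the frames is the summary.
            results.append((start, line))
            start = None
    if start is not None:
        results.append((start, "<traceback truncated at end of input>"))
    return results

# Possible usage on glastlnx11:
#   with open("/var/log/flightops/cron.log", errors="replace") as log:
#       for number, summary in find_tracebacks(log):
#           print(number, summary)
```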

2. File stuck in "BATCHDONE"

Note: There is probably a problem with the flogicd daemon on glastlnx11.  This function is not serviced by the usual FOS daemons infrastructure described in the Confluence Daemons and daemon control pages.

To Test: See if flogicd is running on glastlnx11.   [E.g., ssh to glastlnx11, then see if flogicd is running under root by doing ps aux | grep flogicd.  Or do service --status-all | grep flogicd.]
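
The same check can be sketched in Python by inspecting the text printed by ps aux. The function name flogicd_running() is illustrative, and the sketch assumes the usual 11-column ps aux layout (USER first, COMMAND last).

```python
def flogicd_running(ps_output, user="root"):
    """Return True if a flogicd process owned by `user` appears in ps aux output."""
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # COMMAND (11th column) may contain spaces
        if len(fields) < 11:
            continue
        owner, command = fields[0], fields[10]
        # Skip the grep process itself if the output came from "ps aux | grep flogicd".
        if owner == user and "flogicd" in command and not command.startswith("grep"):
            return True
    return False

# Possible usage on glastlnx11:
#   import subprocess
#   out = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
#   print(flogicd_running(out))
```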

To Fix: If necessary, run /sbin/service flogicd start.  [This can be done as glastops or via sudo; it requires a level of permissions that may require calling an expert.]
Note: At this point, the local tarball still exists, so you can reset the FCOPY_OUTGOING table entry to NEW in the same way as described in How to Fix FASTCopy Incoming: Ingest Failure.

For example, the tarball initially lives in:
  • /nfs/farm/g/glast/u23/ISOC/glastops/Outgoing/B33_2008054211250.tar
    and gets moved by the FASTCopyDB::Init() [or possibly Local()] method to:
  • /nfs/farm/g/glast/u23/ISOC/Archive/fcopy/2008/02/054.02.23.Sat/utc21/21.19.34/B33_2008054211250.tar
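
Judging from this one example, the tarball name appears to end in a UTC timestamp of the form YYYYDDDHHMMSS (year, day of year, time), and the archive directory appears to contain year/month/DDD.MM.DD.Weekday. The sketch below decodes both; the layout is inferred from the example paths above, not from the FASTCopyDB source, and the function names are hypothetical. The final utc21/21.19.34 components do not match the time in the name (21:12:50), so they are presumably the time of the move and are not reconstructed here.

```python
import re
from datetime import datetime, timedelta

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def tarball_timestamp(name):
    """Decode the YYYYDDDHHMMSS stamp at the end of a tarball name (assumed layout)."""
    match = re.search(r"_(\d{13})\.tar$", name)
    if match is None:
        raise ValueError("no 13-digit timestamp in %r" % name)
    stamp = match.group(1)
    year, day_of_year = int(stamp[:4]), int(stamp[4:7])
    hour, minute, second = int(stamp[7:9]), int(stamp[9:11]), int(stamp[11:13])
    return datetime(year, 1, 1) + timedelta(
        days=day_of_year - 1, hours=hour, minutes=minute, seconds=second)


def archive_day_dir(when):
    """Build the year/month/DDD.MM.DD.Weekday part of the archive path (assumed layout)."""
    return "%04d/%02d/%03d.%02d.%02d.%s" % (
        when.year, when.month, when.timetuple().tm_yday,
        when.month, when.day, WEEKDAYS[when.weekday()])
```

For B33_2008054211250.tar this gives 23 February 2008, 21:12:50, and the day directory 2008/02/054.02.23.Sat, which agrees with the archive path shown above.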


    To Test: Launch FASTCopy Monitoring (FCWebView), select Outgoing and check the Status column.

  • If there is a package stuck in "BATCHDONE", you need to reset the "submitted" flag to "new" so that the cron job
    will pick it up and resubmit the job.
    • First, hover over the Package name of the failed package.
    • From the status bar at the bottom of the page, copy the icdfile_pk number
      (e.g., icdfile_pk=63677).
  • From an ISOC environment terminal, you can access the relevant Oracle instance via a wrapper, e.g., 

           rlwrap sqlplus /@isocnightly
    or: 
           ... flight

    Note: For others, see: $TNS_ADMIN/tnsnames.ora

    Tip: It may be useful to inspect:
           desc fcopy_icdfile
           select * from fcopy_jobstate;

    To Fix:
  • To change the table status, run:

           update fcopy_icdfile set jobstate_fk = 1 where icdfile_pk in (1234, 234, 545);

  • Or, to reingest an entire tarball:

            ... where icdfile_pk = 123456789
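
When several packages are stuck, the update statement for changing the table status can be generated from the copied icdfile_pk numbers rather than typed by hand. The sketch below only builds the text to paste into sqlplus; reset_statement() is a hypothetical helper, and it assumes (as in the example above) that jobstate_fk = 1 is the "new" state, which should be confirmed against fcopy_jobstate first.

```python
def reset_statement(icdfile_pks, jobstate_fk=1):
    """Build the update statement that resets the given packages.

    Every key is forced through int() so that a mistyped or malicious
    value raises ValueError instead of ending up in the SQL text.
    """
    pks = [int(pk) for pk in icdfile_pks]
    if not pks:
        raise ValueError("at least one icdfile_pk is required")
    id_list = ", ".join(str(pk) for pk in pks)
    return ("update fcopy_icdfile set jobstate_fk = %d "
            "where icdfile_pk in (%s);" % (int(jobstate_fk), id_list))
```

For example, reset_statement([63677]) produces a statement that resets only the package identified in the status bar example.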

3. File stuck in "INITDONE"

There is probably some network issue or problem with the fcopyd at the destination. 
To Fix: Contact the destination or SLAC computing.

4. File stuck in "LOCALDONE"


There is a problem with the remote post-transfer process. 
Check the log in:  /nfs/farm/g/glast/u23/ISOC-flight/glastOps/Outgoing/ 

5. File stuck in "EXITDONE"

Note: The file is not expected to be in EXITDONE.  At this stage, flogicd will handle problems by closing the transfer record and starting a new one, which will be duly noted in the central log.

Audit Trail for LISOC_2008136200914.tar  

When you click on a link in FASTCopy Monitoring: Outgoing to a successfully "fcopyed" file (e.g., LISOC_2008136200914.tar), you will see a "SUCCESS ERROR" message as shown below.  Ignore the "ERROR" part of the message; it is a FASTCopy quirk and does not indicate a failure.

..... COMPLETED JOBS *******
Famil Paren Job Id Node State Fcopy User Group Pr
-------------------------------------------------------------------------------
00003 00000 00003 fcopy batch glyph.gsfc DONE -1 glasto users
Thu May 15 13:10:06 2008:*Issuing (synchronous) Init Command: flogic_init.sh 64195 glyph.gsfc.nasa.gov LISOC_2008136200914.tar
Thu May 15 13:10:10 2008: Init Command completed successfully
Thu May 15 13:10:10 2008:*FASTCopying...
Thu May 15 13:12:06 2008: FCOPY Batch job is successfully done!
Thu May 15 13:12:06 2008:*Issuing (asynchronous) Exit Command: flogic_exit.sh 64195 glyph.gsfc.nasa.gov LISOC_2008136200914.tar 1 SUCCESS ERROR
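
Because the trailing "SUCCESS ERROR" is misleading, one way to read an audit trail like the one above is to look for the "FCOPY Batch job is successfully done!" line instead. The sketch below assumes every event line starts with a 24-character timestamp followed by a colon, as in this example; the function names are illustrative.

```python
SUCCESS_MARKER = "FCOPY Batch job is successfully done!"


def split_audit_line(line):
    """Split 'Thu May 15 13:10:06 2008:*message' into (timestamp, message).

    Returns None for lines that do not follow that layout, such as the
    job table header.
    """
    if len(line) < 26 or line[24] != ":":
        return None
    return line[:24], line[25:].lstrip("* ").strip()


def transfer_succeeded(log_text):
    """Return True if the audit trail reports a completed batch job."""
    for line in log_text.splitlines():
        parts = split_audit_line(line)
        if parts is not None and SUCCESS_MARKER in parts[1]:
            return True
    return False
```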

Document generated by Confluence on Jan 21, 2010 11:37