Falling Dominos: Planning a Recovery Strategy


After working on disaster recovery plans for a number of years and performing system administration at the same time, I have a different perspective on the term recovery. When you think of recovery, you typically think of the big disasters in the categories of natural disasters (tornados, hurricanes, floods) and man-made disasters (bombs, toxic spills, fire, civil disturbances).

Although you certainly must plan for the big disasters, never underestimate the power of human or computer error. Such errors are extremely difficult to plan for, challenging to recover from, and likely to occur more frequently. For example, a user may delete a critical e-mail or a group of e-mails and then want them back. Or an employee may resign and decide to take databases off of the system or to remove a critical name and address book. Or an operator may accidentally delete a database while perusing a directory. My favorite is when the server lets you know that it’s “gone to sleep” in the middle of the day by freezing and sending you a calling card, via fatal errors in the Domino console. These are not big disasters to most, but they will nonetheless require a recovery.

Working with Application Failures

Since the opportunity for application failures is ever-present, I’ll discuss them first. Two examples of application (server) failures are the freeze, where the application continues in the Domino subsystem but does not respond to requests, and the fatal crash, where not only does the server not respond, but it also ends all jobs and tasks that were running in the Domino subsystem at the time. During a freeze, the Notes client on the PC desktop will present a message that there is no response from the server. If you look at the Domino subsystem by using the Work with Active Jobs (WRKACTJOB) command, you will see little to no CPU time being used by the Domino server job or tasks.

Overall, the CPU looks good, and there is plenty of memory available. Logging to the Domino console stops for all but maybe a few sporadic messages, and after 5 to 10 minutes still nothing is running. You are now faced with having to decide whether or not to officially end the server—now comes the fun. You should always use a controlled end first:



ENDDOMSVR SERVER(domino_server_name) OPTION(*CNTRLD)

If the controlled end doesn’t faze the server, the only course of action may be an immediate end:

ENDDOMSVR SERVER(domino_server_name) OPTION(*IMMED)

An immediate end should be used only as a last resort, because ending the server in this manner can cause corruption.

Now, on to another challenge: the fatal crash caused by an unknown application problem that causes the Domino server to fail. During a fatal crash, everything is wonderful one minute, and the phones are ringing the next. The Domino server is not responding. Your first instinct is to look at the Domino console, and there it is, the scarlet letter of the Domino world: “fault recovery in progress” messages. The Domino subsystem may be running, but all jobs and tasks in the subsystem are gone.

These two types of failure are not only annoying; they also require that you put effort into cleanup and preventive measures. You will also likely gain an opportunity to spend time with Lotus support to determine the cause of the failure and whether there is a recommended course of action to keep it from recurring.

Looking for Clues

When the server comes down hard, meaning that it fails and lists failure messages in the console, it will likely create a file with the extension .nsd in the /notes/data directory. Depending on the severity of the crash, this file will contain job information about the component that caused the failure, similar to what is found using the Work with Jobs (WRKJOB) command. You can find the file by using the WRKLNK '/notes/data/*.nsd' command; look for the file whose name is as close as possible to the date and time of the crash. To view the file, use the Edit File (EDTF) command, which I discussed for reviewing the notes.ini file in my last article (“Falling Dominos: Planning a Backup Strategy,” MC, April 2001). In this case, the command string will be EDTF '/notes/data/xxxxxxx.nsd', where xxxxxxx represents the name of the .nsd file. Some of the information may be cryptic, but the file may give you enough information to do some research on your own or to pass along to Lotus support.

If you frequently experience fatal errors with the server, you can be proactive in finding and reporting .nsd files by using a monitor program to look for *.nsd files and subsequently notify your administrator or operator of a potential error condition. Using the sample CL program in Figure 1, the FTP input file in Figure 2, and the job description in Figure 3, you can set up a monitor program to capture and send a notification whenever an .nsd file is found. This monitor program is especially helpful when used with a paging software application, because it allows notification of a server problem after hours and on weekends, when support staff is normally not available.

Precautionary Measures Following a Crash

When a failure occurs and the Domino server ends abnormally, it is time to be proactive. There are specific files open at the time of the crash that have a greater potential for becoming corrupt due to their high use rate. As a general rule, after a hard crash, you should rename these files before starting the server and let them rebuild as the server comes back up. Early on, I learned (the hard way) that restarting a server with corrupt files would eventually result in the server hitting that corrupt object again—causing subsequent server failures. The two files I had the most difficulty with were mail.box and log.nsf, so, as a general rule, these files are always renamed before restarting the server after the crash.



Once the server is started successfully, the old mail.box file is then checked for any undelivered mail, and that mail is cut out of the old mail.box file and pasted into the new one.
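
As a minimal sketch of the renaming step described above, the following CL commands could be run after the server has ended and before it is restarted. The new names mailold.box and logold.nsf and the server name DOMINO01 are assumptions chosen for illustration; substitute your own.

```cl
/* Run only while the Domino server is down.              */
/* mailold.box and logold.nsf are hypothetical names.     */
REN OBJ('/notes/data/mail.box') NEWOBJ('mailold.box')
REN OBJ('/notes/data/log.nsf') NEWOBJ('logold.nsf')

/* Restart the server and let it rebuild both files.      */
STRDOMSVR SERVER(DOMINO01)
```

Keeping the .box extension on the renamed file makes it easy to recognize later, when you check it for undelivered mail.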

Be prepared for the restart time on the server to be lengthy following a failure, depending on the size of the database files in the /notes/data directories. Once the option is taken to restart the server, do not cancel it. Let it run until the server is active or fails again. Since it has just failed, the Domino server will be consumed with consistency checks and other assorted tasks in addition to the normal startup process. Canceling the server as it is trying to start only compounds the failure, and you will still have a lengthy startup time during the next start.

Just because you are able to restart the server and have users continue working does not necessarily mean that you are done. As a next step, consider running Fixup and Updall to find and repair potentially corrupted databases and to update indexes. Fixup locates and fixes corrupt databases; Updall updates all changed views and full-text indexes for all databases. These functions can be run from the Domino console with the server up, or they can be run with the server down, provided that the appropriate environment variables are set. If you run them with the server up, do so while the server is under only a light load. While this is time-consuming on larger systems, it is well worth doing to prevent further crashes due to corrupted database files.

Because Fixup and Updall are an entire topic on their own, I suggest that before making them part of a process in your environment, you conduct a bit of research on the Lotus Web site (www.support.lotus.com/lshome.nsf). Once there, select Self-Service, Lotus Knowledge Base (under Problem Resolution), and Notes/Domino Knowledge Base. From there, you can search on either Fixup or Updall and gain valuable insight into these two functions.
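
If you run the two tasks with the server up, one way to start them is with the load command at the Domino console. The sketch below submits those console commands from an OS/400 command line by using the Submit Domino Command (SBMDOMCMD) command. The server name DOMINO01 is an assumption, and the sketch uses no task options; check the Lotus documentation for the Fixup and Updall options that suit your release before relying on it.

```cl
/* Same as typing "load fixup" at the Domino console.     */
SBMDOMCMD CMD('load fixup') SERVER(DOMINO01)

/* Submit after Fixup has finished; same as "load updall". */
SBMDOMCMD CMD('load updall') SERVER(DOMINO01)
```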

Now, envision a more challenging scenario: A user’s mail file (mailbox) has become corrupt. Or someone has accidentally deleted a database or mail files and you must recover them. Whenever I receive a request to restore individual application files or mail files, I prefer to have the administrator restore the files to an alternate directory, such as /notes/mailrst, then move or copy them from that location to an appropriate directory in /notes/data. This way, if you type in a wrong but valid name, you do not accidentally overwrite a good file. If you are restoring to find a specific piece of mail, again, rather than potentially overwriting the entire file, you should restore the mail file to an alternate directory, then open the file from that directory and look for the piece of mail. Remember not to leave mail files in the alternate directory to work with long-term.

Examples for Recovering Files

The examples in this article assume that the Domino directory is /notes/data. IBM recommends that files be restored while the Domino server is down and users are logged out of the files; however, in the following example, the user would need to be logged out of his mail file, but the Domino server could remain online:

1. Have the owner of the mail file log out of Notes.

2. Sign on as a user with *JOBCTL and *SAVSYS special authorities.

3. Use the RST command to restore the file from tape. Change the tape device name to your own, as in the following example:

RST DEV('qsys.lib/tap03.devd') OBJ(('/notes/data/mail/jwright.nsf' *INCLUDE
'/notes/mailrst/jwright.nsf'))

4. With the file restored, rename or delete the user’s mail file in the /notes/data/mail directory.



5. Copy or move the restored mail file into the /notes/data/mail directory.

6. Verify that the owner of the mail file is QNOTES.

This RST command example restores a specific mail file to a temporary working directory, and it can be modified to allow the restoration of multiple files simultaneously. A multiple restore would look similar to this:

RST DEV('qsys.lib/tap03.devd') OBJ(('/notes/data/mail/jwright.nsf' *INCLUDE
'/notes/mailrst/jwright.nsf') ('/notes/data/mail/hfriedman.nsf' *INCLUDE
'/notes/mailrst/hfriedman.nsf'))

After the file, or files, are restored, moved, or copied, always verify the authority on the file. Domino is very particular in that the QNOTES profile must own all of the files in the /notes/data directory. If QNOTES does not own everything in the /notes directory, it can cause significant problems with the Domino server.
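
Steps 4 through 6 of the mail file procedure might look like the following in CL. This is a sketch only: the name jwright.old is a hypothetical choice for the set-aside copy, and changing the owner assumes that your profile has enough authority to do so.

```cl
/* Step 4: set the damaged file aside (hypothetical name). */
REN OBJ('/notes/data/mail/jwright.nsf') NEWOBJ('jwright.old')

/* Step 5: move the restored copy into the mail directory. */
MOV OBJ('/notes/mailrst/jwright.nsf') TODIR('/notes/data/mail')

/* Step 6: make QNOTES the owner, then display the result. */
CHGOWN OBJ('/notes/data/mail/jwright.nsf') NEWOWN(QNOTES)
DSPAUT OBJ('/notes/data/mail/jwright.nsf')
```

Renaming rather than deleting the old file in step 4 leaves you a fallback until the user confirms that the restored mail file is usable.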

Individual databases are restored with the same command used in the previous example, specifying the name of the directory and files to be restored. The previous example involved specific mail files, for which you can ask the user to stay out long enough for you to move the mail file into /notes/data/mail. For other databases, this is not the case: you cannot restore a database that is in use, so have users log out of the database and shut down the server before the restore. A sample server restore procedure is as follows:

1. Have users log out of Notes.

2. Sign on with a profile that has *JOBCTL and *SAVSYS authority.

3. Stop the Domino server by using the ENDDOMSVR command.

4. Use the RST command to restore the file from tape. You can substitute the tape drive for a save file if necessary. Change the tape device name to your own as follows:

RST DEV('qsys.lib/tap03.devd') OBJ(('/notes/data/sample/*.nsf'))

5. Verify that the owner of the restored file is QNOTES.

6. Restart the Domino server.
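
Put together, steps 3 through 6 could be sketched as the command sequence below. The server name DOMINO01 is an assumption, and the CHGOWN command is included only as one way to correct ownership if the check in step 5 shows an owner other than QNOTES.

```cl
/* Step 3: end the server; confirm it is down before restoring. */
ENDDOMSVR SERVER(DOMINO01) OPTION(*CNTRLD)

/* Step 4: restore the databases from tape.                */
RST DEV('qsys.lib/tap03.devd') OBJ(('/notes/data/sample/*.nsf'))

/* Step 5: correct the owner of the restored files if needed. */
CHGOWN OBJ('/notes/data/sample/*.nsf') NEWOWN(QNOTES)

/* Step 6: restart the server.                             */
STRDOMSVR SERVER(DOMINO01)
```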

The examples I’ve presented so far for restoring mail and database files assume that you have a complete save available and do not need to restore incremental backups. Now take a look at an incremental recovery. Assume that on Wednesday the entire /notes/data directory must be restored and your backup strategy is such that you perform full saves of the directory on Sunday nights and incremental backups on all other nights. The following steps might be used to recover to Tuesday night’s backup:

1. Have users log out of Notes.

2. Sign on with a profile that has *JOBCTL and *SAVSYS authority.

3. Stop the Domino server by using the ENDDOMSVR command.

4. Locate Sunday, Monday, and Tuesday nights’ backup tapes, then load Sunday night’s tape.



5. Use the RST command to restore the /notes/data directory from tape. Change the tape device name to your own as follows:

RST DEV('qsys.lib/tap03.devd') OBJ(('/notes/data/*'))

6. Once /notes/data is restored, mount Monday night’s changed object save and reissue the command in Step 5. Repeat this step again for Tuesday night’s tape.

7. Verify that the owner of the restored file is QNOTES.

8. Restart the Domino server.
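
The order of the restores matters more than the commands themselves, so here is a sketch of steps 5 and 6 laid out in sequence. It assumes a single tape device and that each night's save is on its own tape; the comments mark where the operator changes tapes.

```cl
/* Sunday night's full save is mounted.                    */
RST DEV('qsys.lib/tap03.devd') OBJ(('/notes/data/*'))

/* Mount Monday night's changed-object save, then rerun.   */
RST DEV('qsys.lib/tap03.devd') OBJ(('/notes/data/*'))

/* Mount Tuesday night's changed-object save, then rerun.  */
RST DEV('qsys.lib/tap03.devd') OBJ(('/notes/data/*'))
```

Restoring oldest to newest means each later tape overlays the files that changed since the one before it, which leaves the directory as it stood at Tuesday night's backup.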

Much information has been covered here; however, I would be remiss if I did not cover directory synchronization after mentioning it in my April article. If you are using directory synchronization, there are special considerations for recovery. Both the AS/400 System Distribution Directory and the Domino Public Name and Address book should be recovered at the same time. The process for recovering the AS/400 System Distribution Directory and the Domino Name and Address book could be as follows:

1. Stop directory synchronization by issuing CALL QNOTESINT/QNNDIEND.

2. Restore the system distribution files as follows:

RSTOBJ OBJ(QAOK*) SAVLIB(QUSRSYS) OBJTYPE(*FILE) DEV(TAP01)

3. To restore a name and address book file to the Domino directory, use the following command:

RST DEV('qsys.lib/tap03.devd') OBJ(('/notes/data/names.nsf'))

4. Verify that QNOTES is the owner of /notes/data/names.nsf.

5. Start directory synchronization by issuing CALL QNOTESINT/QNNDISTJ.

Final Suggestions

I have touched on only a few possibilities and scenarios for rebuilding Domino. I hope that you will not have the occasion to use this information often or be required to take it any further. I leave you with a few final administrative suggestions:

1. Always keep a paper copy of your notes.ini file contents.

2. Always keep a paper copy of your Configure Domino Server (CFGDOMSVR) setup in the event that you have to completely remove and rebuild your Domino server.

3. Make sure that you have access to all of the server ID files and that you know the passwords. This can be critical during a full removal or rebuild of the server.

4. Verify your backups to ensure that you are backing up the correct files.

5. Be extremely careful with your security for the Domino server. The QNOTES profile is intended to be the owner of the Domino environment; if it is not, it causes significant problems within the Domino application.



6. Avoid mapping PC drives to the /notes directory or maintaining drive maps to /notes for long periods of time. If you lock a directory or file and the server needs it, problems can occur with the server.

7. Several Lotus support representatives have recommended to me that file moves and renames in the /notes directory be handled via the OS/400 commands MOV and REN. There are situations where moving or renaming files in the OS/400 IFS via Windows causes problems with the file being worked on.

For significant Domino recoveries, work with a Lotus or IBM representative. It will save you time and frustration.

REFERENCES AND RELATED MATERIALS

• “Falling Dominos: Planning a Backup Strategy,” Julie Wright, MC, April 2001
• Lotus Domino on iSeries home page: www.as400.ibm.com/domino
• Lotus Support home page: www.support.lotus.com/lshome.nsf

NSDMON:     PGM
            DCLF       FILE(DOMADM/LSOUTPUT)
START:      CLRPFM     FILE(DOMADM/QCLSRC) MBR(NSDOUT)
            CHGCURLIB  CURLIB(DOMADM)
            OVRDBF     FILE(INPUT) TOFILE(DOMADM/QCLSRC) MBR(NSDLST)
            OVRDBF     FILE(OUTPUT) TOFILE(DOMADM/QCLSRC) MBR(NSDOUT)
            FTP        RMTSYS('AS400SYS')
LOOP:       RCVF
            MONMSG     MSGID(CPF0864) EXEC(GOTO CMDLBL(DLY))
            SNDPGMMSG  MSG('NSD file ' *CAT &LSOUTPUT *TCAT ' has +
                         been detected in /notes/data. There may +
                         be a Domino problem. Contact Domino or +
                         AS400 Admin... ') TOUSR(*SYSOPR)
            SNDJRNE    JRN(QAUDJRN) TYPE('D1') ENTDTA(&LSOUTPUT) /* +
                         If using security auditing entry.*/
            MONMSG     MSGID(CPF0000)
            CHGCURDIR  DIR('/notes/data/')
            MONMSG     MSGID(CPF0000)
            MOV        OBJ(&LSOUTPUT) TODIR('/notes/nsdhold') /* +
                         You must create nsdhold directory */
            MONMSG     MSGID(CPF0000) EXEC(DO) /* Looking for +
                         duplicate files */
            /* If program grabs NSD file before Domino complete, +
               Domino will create 2nd with same name. */
            MOV        OBJ(&LSOUTPUT) TODIR('/NOTES')
            MONMSG     MSGID(CPF0000) EXEC(SNDPGMMSG MSG('There are +
                         multiple NSD* files in /notes/data with +
                         the same name. Please review and move +
                         them to /notes/nsdhold with new names. +
                         CHGCURDIR(''/notes/nsdhold'' then WRKLNK +
                         ''/notes/data/nsd*'' and use option 2.') +
                         TOUSR(*SYSOPR))
            ENDDO
            /* If NSD found, delay program while attempting restart +
               if Domino Server not set up for auto restart */
NSD:        CPYF       FROMFILE(DOMADM/LSOUTPUT) +
                         TOFILE(DOMADM/LSOUTPUT1) MBROPT(*REPLACE) +
                         FMTOPT(*NOCHK)
            MONMSG     MSGID(CPF0000)
            DLYJOB     DLY(120)
            STRDOMSVR  SERVER(DOMINO01) /* Can be added to +
                         automatically attempt restart if server +
                         is not set up for restart */
            DLYJOB     DLY(60)
            GOTO       CMDLBL(LOOP)
DLY:        CLRPFM     FILE(DOMADM/LSOUTPUT) /* FTP output log */
            DLYJOB     DLY(60)
            TFRCTL     PGM(DOMADM/NSDMON)
            GOTO       CMDLBL(START)
ENDPGM:     ENDPGM

Figure 1: Create a monitor for .nsd files in the Domino directory to notify the system administrator of failures with the Domino server.



Source file: DOMADM/QCLSRC    Member: NSDLST

notesadm password
cd /notes/data
lcd domadm
ls /notes/data/*.nsd (disk
ls /notes/data/*.nsd
quit

Figure 2: This is a sample input listing for displaying .nsd files to an output file using FTP.

Job Description Information                                      Page 1
5769SS1 V4R4M0 990521                       AS400SYS  02/22/01 15:50:10

Job description: NSDMON          Library: DOMADM

User profile . . . . . . . . . . . . . . . . . . : NOTESADMID
CL syntax check  . . . . . . . . . . . . . . . . : *NOCHK
Hold on job queue  . . . . . . . . . . . . . . . : *NO
End severity . . . . . . . . . . . . . . . . . . : 30
Job date . . . . . . . . . . . . . . . . . . . . : *SYSVAL
Job switches . . . . . . . . . . . . . . . . . . : 00000000
Inquiry message reply  . . . . . . . . . . . . . : *RQD
Job priority (on job queue)  . . . . . . . . . . : 5
Job queue  . . . . . . . . . . . . . . . . . . . : DOMINO01
  Library  . . . . . . . . . . . . . . . . . . . : QUSRNOTES
Output priority (on output queue)  . . . . . . . : 5
Printer device . . . . . . . . . . . . . . . . . : *USRPRF
Output queue . . . . . . . . . . . . . . . . . . : *USRPRF
  Library  . . . . . . . . . . . . . . . . . . . :
Message logging:
  Level  . . . . . . . . . . . . . . . . . . . . : 4
  Severity . . . . . . . . . . . . . . . . . . . : 10
  Text . . . . . . . . . . . . . . . . . . . . . : *NOLIST
Log CL program commands  . . . . . . . . . . . . : *NO
Accounting code  . . . . . . . . . . . . . . . . : *USRPRF
Print text . . . . . . . . . . . . . . . . . . . : *SYSVAL
Routing data . . . . . . . . . . . . . . . . . . : QCMDI
Request data . . . . . . . . . . . . . . . . . . : call domadm/nsdmon
Device recovery action . . . . . . . . . . . . . : *SYSVAL
Time slice end pool  . . . . . . . . . . . . . . : *SYSVAL
Job message queue maximum size . . . . . . . . . : 16
Job message queue full action  . . . . . . . . . : *PRTWRAP
Allow multiple threads . . . . . . . . . . . . . : *NO
Text . . . . . . . . . . . . . . . . . . . . . . : Sample Job Description for Domino NSD monitor Auto Start Job Entry

Initial library list:
  QNOTES
  QGPL
  QTEMP
  DOMADM

* * * * *  E N D  O F  L I S T I N G  * * * * *

Figure 3: To automate the monitoring process, create a job description for an autostart job entry in the Domino subsystem.
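
Two setup steps are implied by the figures but not shown: the program in Figure 1 expects the /notes/nsdhold directory to exist, and the job description in Figure 3 is meant to be used by an autostart job entry. A minimal sketch of both follows. The subsystem description QUSRNOTES/DOMINO01 is an assumption based on the job queue shown in Figure 3; use the name of your own Domino subsystem.

```cl
/* Holding directory that NSDMON moves .nsd files into.    */
MKDIR DIR('/notes/nsdhold')

/* Autostart job entry; the SBSD name is an assumption.    */
ADDAJE SBSD(QUSRNOTES/DOMINO01) JOB(NSDMON) JOBD(DOMADM/NSDMON)
```

An autostart job runs when its subsystem starts, so the monitor begins the next time the Domino subsystem is started.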


