II. SUGGESTIONS FOR DEBUGGING FOR SERIAL EXECUTION OF SCIENTIFIC PROGRAMS.
The following section is designed to provide suggestions for debugging programs written for scientific
applications. However, many of the suggestions will apply to debugging other types of applications. This
section deals only with debugging programs that are running serially and not in parallel.
A program can appear to be free of bugs with some sets of data because not every path through the program
is executed. When debugging, test your program with a variety of different data sets so that as many
paths as possible are exercised and checked for errors.
Now assume that your program compiles and produces an executable, but the program execution either
doesn't complete or it does complete but produces wrong answers. Carefully going through the following
steps will help you find many of the commonly occurring bugs. All compiler options mentioned in this
section are valid for Fortran/77, Fortran/90, C and C++ compilers unless indicated otherwise.
STEP 1. Using lint
If your program is written in C, you should use the lint utility that will help identify problems with your
code at the compile step. If your C program is in the file prog.c, then invoke lint with:
lint prog.c
The output from the above invocation will be directed to your screen.
There is a public domain version of lint for Fortran/77 called ftnchek that can be obtained from
the /pub directory at the anonymous ftp site ftp.dsm.fordham.edu at Fordham University.
STEP 2. Check for out-of-bounds array accesses
A common programming error is the use of array indices outside their declared limits. For help
finding these errors in your Fortran 90 program, compile as follows (currently there is no such support
for the C/C++ compilers):
f90 -g -DEBUG:subscript_check:trap_uninitialized:conform_check prog.f90
Then run the generated executable by itself or under dbx or cvd (under cvd, click on the RUN
button to start execution). The "subscript_check" option enables bounds checking. The
"trap_uninitialized" option causes the program to detect when the value of a variable is used
before it has been set. The "conform_check" option enables conformance checking of array operands
in array expressions. (See the DEBUG_group man page for more information on these options.)
(a) If you are running a C or C++ program, your program will stop at the first occurrence of an
array index going out of bounds. You can now examine the value of the index which caused the problem
using any of the methods described in section I part C. Compiling with the -g option causes the compiler
to generate symbolic debugging information so your program will execute under cvd. It also disables
optimization. Sometimes disabling optimization will cause the bug to disappear. If this happens, you
should still carefully go through each of these steps as best as you can.
(b) If you are using the Fortran/90 compiler, compile with the above options, start cvd on the
generated executable, and enter the following in the cvd command/message pane:
cvd> stop in __f90_bounds_check
Now click on the RUN button. Next click on VIEWS and select Call Stack and double click
on the function/subroutine immediately below __f90_bounds_check. This will cause the source code
for this function/subroutine to be displayed in the Main View Window and the line where cvd has stopped
will be highlighted. You can now find the value of the index which caused the out-of-bounds problem.
(c) If you are using the Fortran/77 compiler, compile with the above options, start cvd on the
generated executable, and enter the following in the cvd command/message pane:
cvd> stop in s_rnge
Now click on the RUN button. Next click on VIEWS and select Call Stack and double click on
the function/subroutine immediately below s_rnge. This will cause the source code for this function/subroutine
to be displayed in the Main View Window and the line where cvd has stopped will be highlighted. You can
now find the value of the index which caused the out-of-bounds problem.
Note:
For Fortran programs, bounds checking cannot be done in subprograms if arrays passed to
a subprogram are declared with extents of "1" or "*" instead of passing in their sizes and using this
information in their declarations. An example of how the declarations should be written to allow
for bounds checking is:
SUBROUTINE SUB(A,LDA,N, ...)
INTEGER LDA,N
REAL A(LDA,N)
STEP 3. Check for uninitialized variables being used in calculations
To find uninitialized REAL variables being used in floating point calculations, compile your program with
-g -DEBUG:trap_uninitialized=ON
This forces all uninitialized stack, automatic and dynamically allocated variables to be initialized
with 0xFFFA5A5A. When this value is used as an operand in a floating point calculation, it is treated
as a floating point NaN and causes a floating point trap. When it is used as a pointer or as
an address, a segmentation violation may occur. For example, if x and y are real variables and the program
is compiled as above,
x = y
will not be detected when y is uninitialized since no floating point calculations are being done.
However, the following will be detected:
x = y + 1.0
After compiling your program with the above options, enter
cvd <executable>
and then click the RUN button. To find out where your program has stopped, click on VIEWS
and select the Call Stack where you will see that many system routines have been called. Double click
on the highest routine in the call stack that is clearly not a system routine. This will bring up the source
code for this routine and the line where the first uninitialized variable (subject to the above-mentioned
conditions) was used. You can now examine the values of the variables which caused the problem using
any of the methods described in section I part C.
At the present time, it is not possible to use cvd to detect the use of uninitialized INTEGER variables.
STEP 4. Finding Divisions by Zero and Overflows
A. To find floating point divisions by zero and overflows, first enter
setenv TRAP_FPE ON
if you are using the csh or tcsh shell. For other shells, see their man pages.
Next compile your program with -g and link with -lfpe:
-g -lfpe
and then enter
cvd <executable>
In the cvd command/message pane enter
cvd> stop in __catch
Click on the RUN button; select Call Stack from VIEWS and then double click on the highest routine
that is not a system routine. The line where execution stopped will now be highlighted in the Source code
display area of the cvd Main View window. You may now use any of the methods in section I part C to
find variable values to discover why the divide by zero or overflow occurred. For more information on
handling floating point exceptions, see the man pages for handle_sigfpes.
B. To find integer divisions by zero, compile your program with
-g -DEBUG:div_check=1
and enter
cvd <executable>
Click the RUN button and the program will automatically stop at the first line where an integer divide
by zero occurred. You may now use any of the methods in section I part C to find variable values to discover
why the divide by zero occurred.
STEP 5. A core file is produced
Sometimes during program execution a core file is produced and the program does not complete execution.
This file is placed in your working directory with the file name of 'core'. You can find the place in your program
where the execution stopped and the core file was produced by entering
cvd <executable> core
where <executable> is the executable that you were running. The cvd Main View window will come up
and the source line where execution stopped may be highlighted in green. If it is not highlighted in green,
then select Call Stack under VIEWS and double-click on the highest routine that is not a system routine.
This will bring up the source code for this routine and the last line executed will be highlighted in green.
If the executable was formed by compiling with the -g option, then you can view values of program variables
when program execution stopped. You can find the assembly instruction where execution stopped by clicking
on VIEWS and selecting Disassembly View. Remember that the highlighted line is the last statement executed
before the core file was produced, so the bug in your program is not necessarily in this line of code.
For example, a program variable may have been initialized incorrectly, but the core was not produced until
the variable was used later in the program.
Some machines are configured not to produce a core file.
To find out if this is the case on the machine you are using, enter
limit
If the limit on coredumpsize is zero, no core file will be produced.
If the limit on coredumpsize is not large enough to hold the program's memory image,
the core file produced will not be usable. To change the configuration to allow useful core
files to be produced, enter
unlimit coredumpsize
STEP 6. Incorrect answers are being produced
Assume that the above steps have been taken and that all problems that can be detected by the above
have been corrected. This means that your program completes execution, but obtains incorrect answers.
What you do at this point will likely depend on special circumstances. The following is a list of some
commonly used debugging procedures that may or may not apply to your situation.
1. Try running your program on a very small problem size where you can easily obtain intermediate results.
Run your program under cvd on this small problem and compare with the known correct results.
2. If you know that a certain answer being calculated is not correct, set breakpoints in your program
so you can monitor the value of the answer at various points in your program.
3. You may want to set breakpoints on each call to a selected function/subroutine where you suspect
there may be problems (see section I part C).
4. Debugging COMMON blocks and EQUIVALENCE statements in Fortran.
Variables used in these statements must have exactly the same type and dimension everywhere they appear
and they must occur in exactly the same order. Normally ftnchek will find these errors in Fortran/77 programs.
However, for Fortran/77 programs it is best to use an include statement for each COMMON block. For Fortran/90
programs, it is best to use a module for each COMMON block. It is best not to use EQUIVALENCE statements.
5. Local data not saved. In Fortran, values of local variables are not guaranteed to be saved from one execution
of the subprogram to the next unless they are either initialized in their declarations or they are declared to have
the SAVE attribute. Some compilers/machines automatically give all local variables the SAVE attribute, so
moving a working program from this compiler/machine to a compiler/machine that does not do this may introduce
this kind of bug. You should give local variables the SAVE attribute if you would like their values saved.
Source: http://andrew.ait.iastate.edu/HPC