Use gprof to check your code for performance issues

After reading the article "Speed your code with the GNU profiler" on IBM developerWorks, I learned how to use gprof to identify the performance bottlenecks in my modules. Here, I would like to share how I found the bottleneck in my code.

Let us first look at the basic steps for using the GNU profiler.

To use gprof, the C/C++ code must be compiled and linked by gcc with the -pg option. Assume the source file to be compiled is gp-test.c.

gcc -pg -g2 -o gp-test{,.c} 

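-pg enables gprof instrumentation, -g2 emits debugging information at level 2 (the default level for -g), and -o specifies the name of the output binary. The curly braces are shell brace expansion, used here to save typing: gp-test{,.c} expands to gp-test gp-test.c.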

Next, run the binary; gmon.out will be generated in the current directory when the program exits.

./gp-test
With gmon.out, you can now extract the profiling information for your code by running gprof.

gprof gp-test gmon.out > result.txt 

I like to save the results to a text file, result.txt, for later comparison and analysis.

Let's look at a sample C program and try to find the choke point.


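#include <stdio.h>

int twoD[10000][10000]={0};

/* The loop bodies and function calls were missing from this listing;
   they are reconstructed here from the description that follows. */
int update_d1()
{
    int i,k=0;
    for (i=0;i<10000;i++)
        twoD[i][0]=k++;    /* one element per row */
    return k;
}

int update_d2()
{
    int i,k=0;
    for (i=0;i<10000;i++)
        twoD[0][i]=k++;    /* consecutive elements of row 0 */
    return k;
}

int main(int argc, char * argv[])
{
    if (argc!=2)
        return -1;
    if (*(argv[1])=='1')
        update_d1();
    else if (*(argv[1])=='2')
        update_d2();
    else
        printf("\nInvalid value %s\n",argv[1]);

    return 0;
}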

Both update_d1() and update_d2() access the 2D array with the same number of loop iterations. Given the 2D array twoD[row][column], update_d1() steps through the rows, while update_d2() steps through the columns. We discovered that the time each function takes to complete differs greatly. Let's compile and profile the program with gprof.

gcc -pg -g -o gp-test{,.c}
./gp-test 1
gprof gp-test gmon.out > t1
./gp-test 2
gprof gp-test gmon.out > t2

Observe the extracted results:

using update_d1() :
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
100.52      0.06     0.06        1    60.31    60.31  update_d1

using update_d2() :
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
  0.00      0.00     0.00        1     0.00     0.00  update_d2

update_d1() takes 0.06 seconds, while update_d2() takes less than 0.01 seconds. Why?

Look at the 2D array again: twoD[row][column]. The array is physically mapped to one large contiguous chunk of memory, not to separate rows and columns. The first block of memory holds row 0, column 0, followed by the rest of row 0; the first column of row 1 is actually located at the 10001st block.

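Imagine how update_d1() accesses the memory. To move from one row to the next, it has to leap over 10000 blocks, whereas update_d2() accesses 10000 consecutive blocks without leaping. Each leap lands on a different cache line (and, assuming 4-byte ints, 40000 bytes further on, usually on a different memory page), so nothing already fetched can be reused. That's the reason for the delay.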


