Fun to Program -- Debug: level 4

This was originally written and created around 2013 and may require to be updated. (2021)

Debug: level 4

Dynamically tracing program execution to understand the general behavior of the program is the major part of debugging.

There are several ways to trace programs interactively. The perf tool facilitate this without any modifications to the source.

perf

The linux-tools-* package matching your running kernel provides the perf command on your system.

The perf command provides Linux profiling with performance counters.

perf stat: obtain event counts (better time(1), sar(1), …)
perf record: record events for later reporting
perf report: break down events by process, function, etc. (better strace(1), ltrace(1), …)
perf annotate: annotate assembly or source code with event counts
perf top: see live event count (better top(1) with disassembler but need to be root)

Unlike other profiling tools, this does not require the recompilation of the source nor the source file. This is also very light weight. But this is a Linux specific tool.

See perf: Linux profiling with performance counters for more information.

Profiling prime

Let’s investigate why the compiled Vala prime program with Glib was very slow as described in Benchmark using the perf stat command.

Global performance counter stats for the various prime programs.

Performance counter stats	C	Vala (Gee)	Python	Vala (GLib)
seconds time elapsed	0.167983287	0.575971153	5.645002396	9.481985276
task-clock	167.373524	575.422353	5644.228565	9480.929621
context-switches	1	1	9	19
cpu-migrations	1	1	3	1
page-faults	785	643	3,380	1,042
cycles	514,939,216	1,706,530,307	16,964,435,086	27,931,279,895
stalled-cycles-frontend	177,126,811	804,510,638	3,982,808,734	23,873,332,537
stalled-cycles-backend	25,597,018	225,726,193	1,219,951,714	18,391,463,018
branches	57,299,268	270,933,180	7,548,747,415	3,431,684,225
branch-misses	431,965	534,024	48,326,241	616,496

You see the compiled Vala prime program with GLib has much more stalled-cycles. This seems to be the primary source of its slowness. (The execution speed of the interpreted Python prime program is more than double of that of the compiled Vala one. If the value of the stalled-cycles is not as high for the compiled Vala prime program, the execution speed of the interpreted Python one should be much slower.)

You can obtain interactive performance counter stats with the following method.

Method to obtain interactive performance counter stats.

$ perf record ./prime "$(echo 2^20 | bc)">/dev/null


$ perf report

This brings out interactive performance counter stats itemized for each function calls etc. The key parts of results are the following.

Itemized performance counter stats for prime from the C (list) source

# Overhead  Command      Shared Object                            Symbol
# ........  .......  .................  ................................
#
    93.41%    prime  prime              [.] main                        
     1.34%    prime  libc-2.17.so       [.] vfprintf                    
     0.90%    prime  libc-2.17.so       [.] _itoa_word                  
     0.89%    prime  libc-2.17.so       [.] _int_malloc                 
     0.75%    prime  libc-2.17.so       [.] free                        
     0.75%    prime  libc-2.17.so       [.] _IO_file_xsputn@@GLIBC_2.2.5

Itemized performance counter stats for the compiled Vala prime with Gee.

# Overhead  Command               Shared Object                                Symbol
# ........  .......  ..........................  ....................................
#
    77.30%    prime  prime                       [.] _vala_main                      
     5.60%    prime  libglib-2.0.so.0.3600.4     [.] g_datalist_get_flags            
     4.28%    prime  libgobject-2.0.so.0.3600.4  [.] g_object_unref                  
     3.84%    prime  libgee.so.2.0.0             [.] gee_array_list_real_get         
     2.27%    prime  libgee.so.2.0.0             [.] gee_abstract_list_get           
     2.22%    prime  libgobject-2.0.so.0.3600.4  [.] g_type_check_instance_is_a      
     1.45%    prime  libgobject-2.0.so.0.3600.4  [.] g_object_ref                    
     0.85%    prime  libgee.so.2.0.0             [.] gee_abstract_collection_get_size
     0.61%    prime  libc-2.17.so                [.] vfprintf                        
     0.26%    prime  libc-2.17.so                [.] _IO_file_xsputn@@GLIBC_2.2.5    
     0.22%    prime  prime                       [.] _g_object_ref0

Itemized performance counter stats for the interpreted Python prime.

# Overhead  Command      Shared Object                            Symbol
# ........  .......  .................  ................................
#
    39.25%    prime  python3.3          [.] 0x00000000001a2e1f          
    30.29%    prime  python3.3          [.] PyEval_EvalFrameEx          
     8.19%    prime  python3.3          [.] PyDict_GetItem              
     5.69%    prime  python3.3          [.] PyDict_SetItem              
     2.22%    prime  python3.3          [.] _PyLong_New                 
     2.22%    prime  python3.3          [.] PyTuple_New                 
     1.39%    prime  python3.3          [.] PyArg_UnpackTuple           
     1.11%    prime  python3.3          [.] PyObject_Free               
     0.97%    prime  libc-2.17.so       [.] __memset_sse2               
     0.75%    prime  libc-2.17.so       [.] __memcpy_ssse3_back         
     0.56%    prime  python3.3          [.] PyLong_FromLong             
     0.49%    prime  [kernel.kallsyms]  [k] radix_tree_lookup_element   
     0.43%    prime  libc-2.17.so       [.] _int_free                   
     0.42%    prime  [kernel.kallsyms]  [k] clear_page_c

Itemized performance counter stats for the compiled Vala prime with GLib.

# Overhead  Command            Shared Object                                    Symbol
# ........  .......  .......................  ........................................
#
    95.20%    prime  libglib-2.0.so.0.3600.4  [.] g_list_last                         
     4.17%    prime  prime                    [.] _vala_main                          
     0.21%    prime  [kernel.kallsyms]        [k] read_hpet                           
     0.04%    prime  libc-2.17.so             [.] vfprintf                            
     0.03%    prime  libglib-2.0.so.0.3600.4  [.] g_list_append                       
     0.03%    prime  libglib-2.0.so.0.3600.4  [.] g_private_get_impl                  
     0.02%    prime  libglib-2.0.so.0.3600.4  [.] g_slice_alloc                       
     0.02%    prime  libpthread-2.17.so       [.] pthread_getspecific                 
     0.01%    prime  libc-2.17.so             [.] _itoa_word                          
     0.01%    prime  libglib-2.0.so.0.3600.4  [.] g_private_get                       
     0.01%    prime  [kernel.kallsyms]        [k] do_timer

Now you see the g_list_last function in the libglib-2.0 library is the big overhead making the compiled Vala prime program with GLib to run very slow. You get more information from the interactive screen by pressing ENTER on the line of the g_list_last function and then on the Annotate g_list_last line.

Annotated g_list_last

g_list_last
       │
       │    Disassembly of section .text:
       │
       │    0000000000047480 <g_list_last>:
       │      test   %rdi,%rdi
       │      mov    %rdi,%rax
       │    ↓ jne    13
       │    ↓ jmp    1e
       │      nop
 13.28 │10:   mov    %rdx,%rax
       │13:   mov    0x8(%rax),%rdx
 86.72 │      test   %rdx,%rdx
       │    ↑ jne    10
       │      repz   retq
       │1e:   repz   retq

This looks like searching the end of list by scanning from the head of the list. This not only wastes many instructions but also runs through large memory space causing stalled-cycles for the cached memory system.

The assembly code “repz retq” did not make sense to me. As I searched the web for the answer, I found this to be for the code optimization of the branch prediction mechanism. For us to read the generated assembly code, this is practically the same as “retq”, though.

In the (C with the list) source, the code is compact, the pointer to the tail-end of the list is kept in the tail pointer variable, and the tail pointer is updated without scanning the whole list from the head.

Let’s output the Vala source with GLib into C source using the valac -C command and touch it up into a readable C source to inspect the situation.

A equivalent C source for the Vala source with GLib.

/* prime.c generated by valac 0.16.1, the Vala compiler
 * generated from prime.vala, and touched up :-) */

/* The same result as Vala code for slowness */

#include <glib.h>
#include <glib-object.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>


int main (int argc, char ** argv) {
	/* g_type_init (); */
	glong n_max = atol (argv[1]);
	GList* p = NULL;
	glong n;
	n = (glong) 1;
	while (TRUE) {
		n = n + 1;
		gboolean flag_prime;
		flag_prime = FALSE;
		if (!(n < n_max)) {
			break;
		}
		flag_prime = TRUE;
		GList* i_it = NULL;
		for (i_it = p; i_it != NULL; i_it = i_it->next) {
			glong i = GPOINTER_TO_INT (i_it->data);
			glong n_div_i;
			glong n_mod_i;
			n_div_i = n / i;
			n_mod_i = n - (n_div_i * i);
			if (n_mod_i == ((glong) 0)) {
				flag_prime = FALSE;
				break;
			}
			if (n_div_i < i) {
				break;
			}
		}
		if (flag_prime == TRUE) {
			p = g_list_append (p, GINT_TO_POINTER (n));
		}
	}
	GList* i_it = NULL;
	for (i_it = p; i_it != NULL; i_it = i_it->next) {
		glong i = 0L;
		i = GPOINTER_TO_INT (i_it->data);
		{
			fprintf (stdout, "%ld\n", i);
		}
	}
        if (p != NULL) {
            p = (g_list_free (p), NULL);
        }
	return EXIT_SUCCESS;
}

Here, the Vala does not generate a C source with the tail tracking for the list. The p = g_list_append (p, GINT_TO_POINTER (n)); code should cause many calls to g_list_last for every prime numbers found. No wonder g_list_last is called so many times.

The use of Gee solves this problem.

Top

Fun to Program – Debug: level 4

Date: 2013/08/13 (initial publish), 2024/06/07 (last update)

Source: en/fun2-00013.md

TOC

Debug: level 4

perf

Profiling prime