Previous Post | Top | Next Post |
TOC
This was originally written and created around 2013 and may require to be updated. (2021)
Debug: level 4
Dynamically tracing program execution to understand the general behavior of the program is the major part of debugging.
There are several ways to trace programs interactively. The perf
tool
facilitate this without any modifications to the source.
perf
The linux-tools-*
package matching your running kernel provides the perf
command on your system.
The perf
command provides Linux profiling with performance counters.
perf stat
: obtain event counts (bettertime
(1),sar
(1), …)perf record
: record events for later reportingperf report
: break down events by process, function, etc. (betterstrace
(1),ltrace
(1), …)perf annotate
: annotate assembly or source code with event countsperf top
: see live event count (bettertop
(1) with disassembler but need to beroot
)
Unlike other profiling tools, this does not require the recompilation of the source nor the source file. This is also very light weight. But this is a Linux specific tool.
See perf: Linux profiling with performance counters for more information.
Profiling prime
Let’s investigate why the compiled Vala prime
program with Glib was
very slow as described in Benchmark using the perf stat
command.
Global performance counter stats for the various prime
programs.
Performance counter stats | C | Vala (Gee) | Python | Vala (GLib) |
---|---|---|---|---|
seconds time elapsed | 0.167983287 | 0.575971153 | 5.645002396 | 9.481985276 |
task-clock | 167.373524 | 575.422353 | 5644.228565 | 9480.929621 |
context-switches | 1 | 1 | 9 | 19 |
cpu-migrations | 1 | 1 | 3 | 1 |
page-faults | 785 | 643 | 3,380 | 1,042 |
cycles | 514,939,216 | 1,706,530,307 | 16,964,435,086 | 27,931,279,895 |
stalled-cycles-frontend | 177,126,811 | 804,510,638 | 3,982,808,734 | 23,873,332,537 |
stalled-cycles-backend | 25,597,018 | 225,726,193 | 1,219,951,714 | 18,391,463,018 |
branches | 57,299,268 | 270,933,180 | 7,548,747,415 | 3,431,684,225 |
branch-misses | 431,965 | 534,024 | 48,326,241 | 616,496 |
You see the compiled Vala prime
program with GLib has much more
stalled-cycles. This seems to be the primary source of its slowness. (The
execution speed of the interpreted Python prime
program is more than double
of that of the compiled Vala one. If the value of the stalled-cycles is not as
high for the compiled Vala prime
program, the execution speed of the
interpreted Python one should be much slower.)
You can obtain interactive performance counter stats with the following method.
Method to obtain interactive performance counter stats.
$ perf record ./prime "$(echo 2^20 | bc)">/dev/null
$ perf report
This brings out interactive performance counter stats itemized for each function calls etc. The key parts of results are the following.
Itemized performance counter stats for prime
from the C (list) source
# Overhead Command Shared Object Symbol
# ........ ....... ................. ................................
#
93.41% prime prime [.] main
1.34% prime libc-2.17.so [.] vfprintf
0.90% prime libc-2.17.so [.] _itoa_word
0.89% prime libc-2.17.so [.] _int_malloc
0.75% prime libc-2.17.so [.] free
0.75% prime libc-2.17.so [.] _IO_file_xsputn@@GLIBC_2.2.5
Itemized performance counter stats for the compiled Vala prime
with Gee.
# Overhead Command Shared Object Symbol
# ........ ....... .......................... ....................................
#
77.30% prime prime [.] _vala_main
5.60% prime libglib-2.0.so.0.3600.4 [.] g_datalist_get_flags
4.28% prime libgobject-2.0.so.0.3600.4 [.] g_object_unref
3.84% prime libgee.so.2.0.0 [.] gee_array_list_real_get
2.27% prime libgee.so.2.0.0 [.] gee_abstract_list_get
2.22% prime libgobject-2.0.so.0.3600.4 [.] g_type_check_instance_is_a
1.45% prime libgobject-2.0.so.0.3600.4 [.] g_object_ref
0.85% prime libgee.so.2.0.0 [.] gee_abstract_collection_get_size
0.61% prime libc-2.17.so [.] vfprintf
0.26% prime libc-2.17.so [.] _IO_file_xsputn@@GLIBC_2.2.5
0.22% prime prime [.] _g_object_ref0
Itemized performance counter stats for the interpreted Python prime
.
# Overhead Command Shared Object Symbol
# ........ ....... ................. ................................
#
39.25% prime python3.3 [.] 0x00000000001a2e1f
30.29% prime python3.3 [.] PyEval_EvalFrameEx
8.19% prime python3.3 [.] PyDict_GetItem
5.69% prime python3.3 [.] PyDict_SetItem
2.22% prime python3.3 [.] _PyLong_New
2.22% prime python3.3 [.] PyTuple_New
1.39% prime python3.3 [.] PyArg_UnpackTuple
1.11% prime python3.3 [.] PyObject_Free
0.97% prime libc-2.17.so [.] __memset_sse2
0.75% prime libc-2.17.so [.] __memcpy_ssse3_back
0.56% prime python3.3 [.] PyLong_FromLong
0.49% prime [kernel.kallsyms] [k] radix_tree_lookup_element
0.43% prime libc-2.17.so [.] _int_free
0.42% prime [kernel.kallsyms] [k] clear_page_c
Itemized performance counter stats for the compiled Vala prime
with GLib.
# Overhead Command Shared Object Symbol
# ........ ....... ....................... ........................................
#
95.20% prime libglib-2.0.so.0.3600.4 [.] g_list_last
4.17% prime prime [.] _vala_main
0.21% prime [kernel.kallsyms] [k] read_hpet
0.04% prime libc-2.17.so [.] vfprintf
0.03% prime libglib-2.0.so.0.3600.4 [.] g_list_append
0.03% prime libglib-2.0.so.0.3600.4 [.] g_private_get_impl
0.02% prime libglib-2.0.so.0.3600.4 [.] g_slice_alloc
0.02% prime libpthread-2.17.so [.] pthread_getspecific
0.01% prime libc-2.17.so [.] _itoa_word
0.01% prime libglib-2.0.so.0.3600.4 [.] g_private_get
0.01% prime [kernel.kallsyms] [k] do_timer
Now you see the g_list_last
function in the libglib-2.0
library is the big
overhead making the compiled Vala prime
program with GLib to run very
slow. You get more information from the interactive screen by pressing ENTER
on the line of the g_list_last
function and then on the Annotate g_list_last
line.
Annotated g_list_last
g_list_last
│
│ Disassembly of section .text:
│
│ 0000000000047480 <g_list_last>:
│ test %rdi,%rdi
│ mov %rdi,%rax
│ ↓ jne 13
│ ↓ jmp 1e
│ nop
13.28 │10: mov %rdx,%rax
│13: mov 0x8(%rax),%rdx
86.72 │ test %rdx,%rdx
│ ↑ jne 10
│ repz retq
│1e: repz retq
This looks like searching the end of list by scanning from the head of the list. This not only wastes many instructions but also runs through large memory space causing stalled-cycles for the cached memory system.
The assembly code “repz retq
” did not make sense to me. As I searched the
web for the answer, I found this to be for the code optimization of the branch
prediction mechanism. For us to read the
generated assembly code, this is practically the same as “retq
”, though.
In the (C with the list) source,
the code is compact, the pointer to the tail-end of the list is kept in the
tail
pointer variable, and the tail
pointer is updated without scanning the
whole list from the head.
Let’s output the Vala source with GLib into C source using the valac -C
command and touch it up into a readable C source to inspect the situation.
A equivalent C source for the Vala source with GLib.
/* prime.c generated by valac 0.16.1, the Vala compiler
* generated from prime.vala, and touched up :-) */
/* The same result as Vala code for slowness */
#include <glib.h>
#include <glib-object.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
int main (int argc, char ** argv) {
/* g_type_init (); */
glong n_max = atol (argv[1]);
GList* p = NULL;
glong n;
n = (glong) 1;
while (TRUE) {
n = n + 1;
gboolean flag_prime;
flag_prime = FALSE;
if (!(n < n_max)) {
break;
}
flag_prime = TRUE;
GList* i_it = NULL;
for (i_it = p; i_it != NULL; i_it = i_it->next) {
glong i = GPOINTER_TO_INT (i_it->data);
glong n_div_i;
glong n_mod_i;
n_div_i = n / i;
n_mod_i = n - (n_div_i * i);
if (n_mod_i == ((glong) 0)) {
flag_prime = FALSE;
break;
}
if (n_div_i < i) {
break;
}
}
if (flag_prime == TRUE) {
p = g_list_append (p, GINT_TO_POINTER (n));
}
}
GList* i_it = NULL;
for (i_it = p; i_it != NULL; i_it = i_it->next) {
glong i = 0L;
i = GPOINTER_TO_INT (i_it->data);
{
fprintf (stdout, "%ld\n", i);
}
}
if (p != NULL) {
p = (g_list_free (p), NULL);
}
return EXIT_SUCCESS;
}
Here, the Vala does not generate a C source with the tail tracking for the
list. The p = g_list_append (p, GINT_TO_POINTER (n));
code should cause many
calls to g_list_last
for every prime numbers found. No wonder g_list_last
is called so many times.
The use of Gee solves this problem.
Previous Post | Top | Next Post |