Basic Performance Counting on Intel Architecture

Preamble

Performance monitoring is becoming an increasingly important part of low latency C++.  As the speed and flexibility of CPUs have increased, the ability of existing technologies to monitor code performance has been found wanting.

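Timing code with C library calls such as time() followed by gmtime() gives only one-second resolution, and higher-resolution alternatives such as gettimeofday() may involve a kernel call. Neither is accurate enough for fast-executing code.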

Valgrind simulates the processor, but not down to the level of the actual CPU, which may, say, reduce its clock speed because of local heating or some other hardware event. Running instrumented code can also be extremely slow.

Intel has gone down the VTune route, which under the hood uses the MSR instructions. These read the actual hardware counters maintained by the CPU.  As Intel is doing this itself, we can be fairly sure that this functionality will be around for the foreseeable future and can look to leverage it.

asm: rdtsc – ReaD Time Stamp Counter

This is the simplest of the instructions. It lets us accurately measure the time elapsed between two points in the code.

https://en.wikipedia.org/wiki/Time_Stamp_Counter

At the bottom of that page there is a link to some code, cycle.h, which shows how to execute the instruction.
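As a rough illustration of the idea (this is not the code from cycle.h), here is a minimal sketch of reading the counter with GCC/Clang inline assembly on x86. The names read_tsc and ticks_for are made up for this example. Note that rdtsc is not a serializing instruction, so for very short regions the out-of-order engine can blur the measurement; the sketch ignores that.

```cpp
#include <cstdint>

// Read the 64-bit time stamp counter. rdtsc returns the low 32 bits in EAX
// and the high 32 bits in EDX, so the two halves are combined by hand.
// x86 / x86-64 only, GCC/Clang inline-assembly syntax.
static inline std::uint64_t read_tsc()
{
    std::uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return static_cast<std::uint64_t>(lo) |
           (static_cast<std::uint64_t>(hi) << 32);
}

// Hypothetical helper: run `work` once and return the elapsed TSC ticks.
template <typename F>
std::uint64_t ticks_for(F&& work)
{
    const std::uint64_t start = read_tsc();
    work();
    const std::uint64_t stop = read_tsc();
    return stop - start;
}
```

The result is in TSC ticks rather than seconds. Converting to wall-clock time needs the TSC frequency, which this sketch does not try to determine.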

asm: rdmsr/wrmsr – ReaD/WRite Model Specific Register

It appears that rdmsr/wrmsr were initially aimed at memory bank switching and the like, but over time the instructions have come to allow access to specific hardware statistics.

https://en.wikipedia.org/wiki/Model-specific_register

asm: rdpmc – ReaD Performance Monitoring Counter

Using rdtsc gives an indication of how the code performs, but may not give specific details of bottlenecks caused by particular instructions or cache misses. rdpmc allows a finer-grained examination of the processor. Its execution time appears to be 24-40 cycles.

An example of how to read the performance counters:
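Before relying on rdpmc it can be worth checking which counters the processor actually has. CPUID leaf 0x0A reports the architectural performance monitoring version, the number and width of the general-purpose counters and, from version 2 onwards, the number and width of the fixed-function counters. The sketch below assumes GCC or Clang (for <cpuid.h>); the struct and function names are made up for this example.

```cpp
#include <cstdint>
#include <cpuid.h>   // GCC/Clang wrapper around the cpuid instruction

struct PerfmonInfo {
    unsigned version;         // architectural perfmon version, 0 = none
    unsigned gp_counters;     // general-purpose counters per logical CPU
    unsigned gp_width;        // bit width of the general-purpose counters
    unsigned fixed_counters;  // fixed-function counters (version >= 2)
    unsigned fixed_width;     // bit width of the fixed-function counters
};

// Decode the EAX and EDX values returned by CPUID leaf 0x0A.
PerfmonInfo decode_leaf_0a(std::uint32_t eax, std::uint32_t edx)
{
    PerfmonInfo info{};
    info.version     = eax & 0xFF;
    info.gp_counters = (eax >> 8) & 0xFF;
    info.gp_width    = (eax >> 16) & 0xFF;
    if (info.version >= 2) {            // EDX fields are valid from version 2
        info.fixed_counters = edx & 0x1F;
        info.fixed_width    = (edx >> 5) & 0xFF;
    }
    return info;
}

// Query the running CPU. All fields are zero if leaf 0x0A is unavailable.
PerfmonInfo query_perfmon()
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x0A, &eax, &ebx, &ecx, &edx))
        return PerfmonInfo{};
    return decode_leaf_0a(eax, edx);
}
```

A version of 0, or a fixed_counters value below 3, suggests that some of the fixed-function reads will not work on that machine.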

https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214

From the above page:

// rdpmc_instructions uses a "fixed-function" performance counter to return the count of retired instructions on
//       the current core in the low-order 48 bits of an unsigned 64-bit integer.
unsigned long rdpmc_instructions()
{
   unsigned a, d, c;

   c = (1<<30);
   __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// rdpmc_actual_cycles uses a "fixed-function" performance counter to return the count of actual CPU core cycles
//       executed by the current core.  Core cycles are not accumulated while the processor is in the "HALT" state,
//       which is used when the operating system has no task(s) to run on a processor core.
unsigned long rdpmc_actual_cycles()
{
   unsigned a, d, c;

   c = (1<<30)+1;
   __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}

// rdpmc_reference_cycles uses a "fixed-function" performance counter to return the count of "reference" (or "nominal")
//       CPU core cycles executed by the current core.  This counts at the same rate as the TSC, but does not count
//       when the core is in the "HALT" state.  If a timed section of code shows a larger change in TSC than in
//       rdpmc_reference_cycles, the processor probably spent some time in a HALT state.
unsigned long rdpmc_reference_cycles()
{
   unsigned a, d, c;

   c = (1<<30)+2;
   __asm__ volatile("rdpmc" : "=a" (a), "=d" (d) : "c" (c));

   return ((unsigned long)a) | (((unsigned long)d) << 32);
}
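To turn these raw readings into something useful we take the difference between a reading before and after the region of interest. Here is a minimal sketch of that arithmetic. It assumes the counters are 48 bits wide, as the comments above state, and all the names (counter_delta, RegionCounts, ipc, halted_ticks) are made up for this example. Bear in mind that rdpmc faults in user mode unless the operating system has set CR4.PCE and enabled the fixed-function counters, so the sketch keeps the arithmetic separate from the reads.

```cpp
#include <cstdint>

// Difference between two readings of a counter that is `width_bits` wide.
// Masking keeps the result correct across a single wrap-around.
// The default of 48 is an assumption taken from the comments above.
inline std::uint64_t counter_delta(std::uint64_t start, std::uint64_t stop,
                                   unsigned width_bits = 48)
{
    const std::uint64_t mask =
        (width_bits >= 64) ? ~0ULL : ((1ULL << width_bits) - 1);
    return (stop - start) & mask;
}

// Hypothetical container for the deltas measured over one timed region.
struct RegionCounts {
    std::uint64_t instructions;   // delta of rdpmc_instructions()
    std::uint64_t actual_cycles;  // delta of rdpmc_actual_cycles()
    std::uint64_t ref_cycles;     // delta of rdpmc_reference_cycles()
    std::uint64_t tsc;            // delta of the time stamp counter
};

// Instructions retired per actual core cycle (0 if no cycles were counted).
inline double ipc(const RegionCounts& c)
{
    return c.actual_cycles
               ? static_cast<double>(c.instructions) / c.actual_cycles
               : 0.0;
}

// TSC ticks not matched by reference cycles: a rough estimate of the
// time the core spent in the HALT state during the region.
inline std::uint64_t halted_ticks(const RegionCounts& c)
{
    return c.tsc > c.ref_cycles ? c.tsc - c.ref_cycles : 0;
}
```

In use, each field would be filled with something like counter_delta(before, after), where before and after are two calls to the matching rdpmc_ function either side of the code being measured.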

Inevitable full documentation

Section 18.2 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual:

https://www.intel.co.uk/content/www/uk/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html

 
