PROJECTS ESP8266

Precise clocking with ESP8266

When conducting my basic research with esp8266 module, I experimented with HC-SR04 ultrasonic ranging module. (Article to be written..) HC-SR04 itself is simple module, you trigger it and measure the time module keeps it echo signal up. By datasheet helpfully informing that 1 microsecond means 58 centimeters - no need to derive it myself from speed of sound - the calculation of measured distance would have been easy.

But, from where to get microsecond granularity?

Also, even I would get it exact from somewhere, with microsecond granularity I can get only 58 centimeter accuracy. 58 centimeters is half meter. HC-SR04 itself has 3mm accuracy that means the even one microsecond granularity is way too long for useful measurement.

So, simple answer of using system_get_time() at beginning and end of signal edges, calculating difference and trying to compensate call and gpio read overhead can be excluded.

The CCOUNT register

ESP8266 has special 32bit register named CCOUNT that constantly counts clock ticks. ESP8266 can run both at 80Mhz and 160Mhz, with 80Mhz being default. The 80Mhz means that there is 80 000 000 clock ticks in second, 80 000 clock ticks on millisecond and 80 clock ticks on microsecond. Everyone of those ticks increment's that CCOUNT by one.

That 80Mhz clock also ticks at same speed despite how much load there is on system and the 80Mhz clock is exactly 80 million ticks on second always. Once we now know that there is 80 clock ticks on microsecond, that's all we neet. The easy facts are :

But how to read the value of CCOUNT register. For that we need some assembler.

static inline int32_t asm_ccount(void) {

    int32_t r;

    asm volatile ("rsr %0, ccount" : "=r"(r));
    return r;
}

The code above defines function asm_ccount() that returns value of register. The code is marked as inline, so there is never function to be jumped in, but assembly operand is run on place without call overhead. Practically this is single register assignment taking one clock cycle to run.

From compiler point of view, the assignment int32_t myval = asm_ccount() is rsr "myval", ccount since compiler optimizes and remaps variable myval to be directly that register $0. Of course if statement is executed inside complicated function with many active variables, it is not quaranteed that this remains as register-to-register assignment. If compiler decides to assing CCOUNT to memory location, that is longer operation and calling asm_ccount() is no longer completed in one clock cycle. Of cource you can always hint compiler by prepending variable declaration as register int32_t myval. But even in this case compiler is not forced to honor this declaration.

The problem with CCOUNT is that being 32-bit counter, with 80Mhz clock rate it overflows every 53.6 second. You need to take account that during measurement CCOUNT's value might have overflown and wrapped back to zero, so end value being smaller that CCOUNT you collected at beginning.

Also take in account, that when clocking operation lasting over 53.6 seconds, CCOUNT overflows twice!

The timing testing code

To measure timings for couple essential operations, I wrote simple program to measure differences between start and end of operation. The main source file is attached at end of this page, also the source is downloadable from timing-tests.tar.gz with relevant Makefile.

Again, source follows template described at separate page and if you did set up toolchain, the timing program can be compiled and flashed with :

esp:~/Src/esp/examples/timing-tests$ make clean esp flash

The anatomy of esp8266 program is explained at Hello world example, read it there if there is some guestions regarging the basics of esp programming. Below is explanation only for functionality regarding timing.

Basic program operation

The operation is simple. Program boots up, starts timer and on every timer run enumerates all tests and collects results in function __local_event_exec(). Once all tests are run, program prints out how many clock cycles each test took compensating overhead generated from calling the test function itself.

To be slightly "educational", program does not run tests all in one run, but invokes each test through system_os_post() (it's essentiality explained at Hello world! application).

__measure

Time spent is measured very simply at __measure(). The essential part is that, btime and etime are defined as int32_t but their difference casted as uint32_t. This magically compensates if etime overflows returning difference of value despite which is larger and which smaller.

static uint32_t
__measure(void (*callback)()) {

    int32_t btime, etime;

    btime = asm_ccount();
    callback();
    etime = asm_ccount();
    return ((uint32_t)(etime - btime));
}

Assembly helpers

Besides of asm_ccount(), in beginning of file there is also defined asm_nop() that executes just single assembly operand that quaranteely takes only one clock cycle.

static inline void asm_nop() {
    asm volatile ("nop"); }

Global dummy buffers

These buffers are for testing memory copying and single assignments. __dummy_val is declared as volatile so that compiler does not optimize certain tests off by interpreting them as assignments resulting nothing. The volatile-keyword forces variable into memory location making compiler always assign value to that location.

static volatile uint32_t  __dummy_val;
static char               __dummy_buf1[1024];
static char               __dummy_buf2[1024];

Call overhead calibration

In __local_event_exec() we pick the duration that calling empty function __timing_empty() took. This is count of ticks that was spent to call and return from function. This count is reduced from every test result, so that what is left is the tick that were spent inside function.

    /* get calibration to compensate the call overhead from 
       (first) empty call */
    if (callbacks[callidx].callback == __timing_empty)
	calibration = callbacks[callidx].consumed_time;

    /* printout */
    os_printf("    %s: %u ticks\n",
	      callbacks[callidx].name,
	      callbacks[callidx].consumed_time - calibration);

The results of single test run

When program is flashed and executed, it goes through all tests every 3 seconds. Program displays clock ticks spent inside of every test, not counting the call to subroutine itself.

empty_call: 0 ticks
single_nop: 1 ticks
re_recurse: 11 ticks
1us_sleep: 125 ticks
10us_sleep: 845 ticks
100us_sleep: 8044 ticks
gpio_read_pin: 46 ticks
gpio_pull_updown: 97 ticks
gpio_status_read: 27 ticks
gpio_status_write: 9 ticks
postmessage: 121 ticks
system_get_time: 37 ticks
system_get_rtc_time: 71 ticks
system_get_cpufreq: 53 ticks
memcpy_1k_bytes: 887 ticks
memset_1k_bytes: 570 ticks
bzero_1k_bytes: 3114 ticks
Call overhead was 18 ticks

Analysis of results

To be noted, none of the test function are declared with ICACHE_FLASH_ATTR but they exist immediately in RAM during execution.

empty_call, single_nop, re_recurse

In first two rows, the empty_call was also calibration one. The time that took to complete call to function doing nothing is subtracted from every result to eliminate the call overhead. We know that single assembly NOP operation takes always exactly one clock cycle, so ticks spend in __test_single_call() should result exactly one tick - as it does. Correct.

Function __test_re_recurse() expresses how many ticks is spent on basic function call that do not pass any parameters either direction. Since re_recurse calls internally single_nop as test function, we must reduce it's tick count by one tick that is spent on that single assembly NOP. Result is, that function call, ie. branching into function, changing stack frame, and returning from there takes 10 clock cycles.

1us_sleep, 10us_sleep, 100us_sleep

Invoking os_delay_us() allows us to halt processing with microsecond granularity. Since there is always call overhead to calling os_delay_us -function, these tests express how long is really slept, when we expect one microsecond sleep.

We know that because esp8266's clock rate being 80Mhz, one microsecond is exactly 80 ticks. But invoking os_delay_us(1) actually took 125 ticks, so there must be 45 tick overhead on call to os_delay_us. This is verified by sleeping 10us. It should take exactly 800 ticks, but measurement shows 845, ie. with same 45 tick call overhead that previous test. Also the 100us sleep having accurate enough 44 tick call overhead, we can declare that call to os_delay_us takes always ~45 ticks extra.

In practice this means, if os_delay_us(1) you actually delay about os_delay_us(1.55). This might need to get account when driving something through GPIO pins.

gpio_read, _write, _status

Reading the GPIO pin states or pulling output up or down seems to be accomplished in 50 nanoseconds. This needs to be taken account, if chip you are driving requires specific width of pulses or timing between state changes..

gpio_status_read and gpio_status_write are used when whe listening and acknowledging gpio interrupts. The interrupts should be completed in 10us, so you have 800 ticks to spent. Enough for reading and acknowledging the state, collecting couple variables and posting message to process result on leisure time.

system_*

Getting time is quite cheap, no reason to avoid or cache it. Posting messages have 1.5us penalty, so no reason to avoid it either.

Copying memory

Difference between os_memset(buffer, 0, sizeof(buffer)) and os_bzero(buffer, sizeof(buffer)) is weird. Both functions does exactly same result, but os_bzero() does it six times slowed. Probably bug in underlying libs.

Memory copy is in overall quite economic. Copying 1KiB takes 10 microseconds, so copying 1MiB would take bit over 10 milliseconds. But because esp8266 has about 80KiB data ram available, this is not the bottlenect you are looking for.

The code

If you want to add tests, implement test function and add it to callbacks[] at __local_event_exec(). Variable __dummy_val and __dummy_buf* are to prevent GCC for optimizing unneeded code off.

#include <c_types.h>
#include <osapi.h>
#include <user_interface.h>
#include <ets_sys.h>
#include <gpio.h>

/* local defines... */
#define VE_TASKPRIO_LOW    0
#define VE_TASKPRIO_MED    1
#define VE_TASKPRIO_HIGH   2

#define VE_TIMER_NOREPEAT  0
#define VE_TIMER_REPEAT    1

/* need to prevent compiler optimizations */
#define __NOINLINE         __attribute__ ((noinline))

/* local helpers... */
#define __asize(x)  (sizeof(x) / sizeof(x[0]))
#define __d(x) { os_printf("%s:%u %s\n", __FUNCTION__, __LINE__, x); }

/* missing declarations in SDK-includes */
extern void os_printf_plus(const char *format, ...);
extern void ets_timer_setfn(os_timer_t *, os_timer_func_t *, void *);
extern void ets_timer_arm_new(os_timer_t *, uint32_t, int, int);
extern void ets_delay_us(uint16_t);
extern void ets_bzero(void *, size_t);
extern void ets_memcpy(void *, void *, size_t);
extern void ets_memset(void *, int, size_t);

extern void uart_div_modify(int uart_no, int baudrate);

/* local types... */
enum __local_event_t {
    LOCAL_EVENT_INIT,
    LOCAL_EVENT_EXEC
};

/* static test buffers to prevent compiler optimizations */
static volatile uint32_t  __dummy_val;
static char               __dummy_buf1[1024];
static char               __dummy_buf2[1024];

/* asm-helpers */
static inline int32_t asm_ccount(void) {
    int32_t r; asm volatile ("rsr %0, ccount" : "=r"(r)); return r; }

static inline void asm_nop() {
    asm volatile ("nop"); }

/* timing-tests */
void __NOINLINE
__timing_empty() {
    /* nothing */
}

void __NOINLINE
__timing_single_nop() {
    asm_nop();
}

void __NOINLINE
__timing_re_recurse() {
    __timing_single_nop();
}

void __NOINLINE
__timing_1us_sleep() {
    os_delay_us(1);
}

void __NOINLINE
__timing_10us_sleep() {
    os_delay_us(10);
}

void __NOINLINE
__timing_100us_sleep() {
    os_delay_us(100);
}

void __NOINLINE
__timing_gpio_read_pin() {
    /* read gpio4 -state */
    __dummy_val = GPIO_INPUT_GET(4);
}

void __NOINLINE
__timing_gpio_pull_updown() {
    GPIO_OUTPUT_SET(5, 1);
    GPIO_OUTPUT_SET(5, 0);
}

void __NOINLINE
__timing_gpio_status_read() {
    __dummy_val = GPIO_REG_READ(GPIO_STATUS_ADDRESS);
}

void __NOINLINE
__timing_gpio_status_write() {
    GPIO_REG_WRITE(GPIO_STATUS_W1TC_ADDRESS, 0);
}

void __NOINLINE
__timing_postmessage() {

    /* post dummymessage */
    system_os_post(VE_TASKPRIO_LOW, -1, -1);
}

void __NOINLINE
__timing_system_get_time() {
    __dummy_val = system_get_time();
}

void __NOINLINE
__timing_system_get_rtc_time() {
    __dummy_val = system_get_rtc_time();
}

void __NOINLINE
__timing_system_get_cpufreq() {
    __dummy_val = system_get_cpu_freq();
}

void __NOINLINE
__timing_memcpy_1k_bytes() {
    os_memcpy(__dummy_buf1, __dummy_buf2, sizeof(__dummy_buf1));
}

void __NOINLINE
__timing_memset_1k_bytes() {
    os_memset(__dummy_buf1, 0, sizeof(__dummy_buf1));
}

void __NOINLINE
__timing_bzero_1k_bytes() {
    os_bzero(__dummy_buf1, sizeof(__dummy_buf1));
}

static uint32_t
__measure(void (*callback)()) {

    int32_t btime, etime;

    btime = asm_ccount();
    callback();
    etime = asm_ccount();
    return ((uint32_t)(etime - btime));
}

static void
__local_event_exec() {

    static int callidx;
    static uint32_t calibration;
    static struct { /* callbacks to be tested */
	const char *name;
	void (*callback)();
	uint32_t consumed_time;
    } callbacks[] = {
	{ "         empty_call", __timing_empty, 0 },
	{ "         single_nop", __timing_single_nop, 0 },
	{ "         re_recurse", __timing_re_recurse, 0 },
	{ "          1us_sleep", __timing_1us_sleep, 0 },
	{ "         10us_sleep", __timing_10us_sleep, 0 },
	{ "        100us_sleep", __timing_100us_sleep, 0 },
	{ "      gpio_read_pin", __timing_gpio_read_pin, 0},
	{ "   gpio_pull_updown", __timing_gpio_pull_updown, 0 },
	{ "   gpio_status_read", __timing_gpio_status_read, 0 },
	{ "  gpio_status_write", __timing_gpio_status_write, 0 },
	{ "        postmessage", __timing_postmessage, 0 },
	{ "    system_get_time", __timing_system_get_time, 0 },
	{ "system_get_rtc_time", __timing_system_get_rtc_time, 0 },
	{ " system_get_cpufreq", __timing_system_get_cpufreq, 0 },
	{ "    memcpy_1k_bytes", __timing_memcpy_1k_bytes, 0 },
	{ "    memset_1k_bytes", __timing_memset_1k_bytes, 0 },
	{ "     bzero_1k_bytes", __timing_bzero_1k_bytes, 0 }
    };

    /* measure */
    callbacks[callidx].consumed_time =
	__measure(callbacks[callidx].callback);

    /* get calibration to compensate the call overhead from 
       (first) empty call */
    if (callbacks[callidx].callback == __timing_empty)
	calibration = callbacks[callidx].consumed_time;

    /* printout */
    os_printf("    %s: %u ticks\n",
	      callbacks[callidx].name,
	      callbacks[callidx].consumed_time - calibration);
    
    if (++callidx >= __asize(callbacks)) {

	/* rewind to beginning */
	os_printf("Call overhead was %u ticks\n\n", calibration);
	callidx = 0;

    } else {

	/* reinvoke self through call stack */
	system_os_post(VE_TASKPRIO_LOW, LOCAL_EVENT_EXEC, 0);
    }
}

static void
__local_timer_callback(void *arg) {

    /* send the actual SAYHELLO -message */
    system_os_post(VE_TASKPRIO_LOW, LOCAL_EVENT_EXEC, 0);
}

static void ICACHE_FLASH_ATTR
__local_event_init() {

    static os_timer_t __timer;

    /* rest of settings, gpio test pins */
    PIN_FUNC_SELECT(PERIPHS_IO_MUX_GPIO4_U, FUNC_GPIO4);
    PIN_FUNC_SELECT(PERIPHS_IO_MUX_GPIO5_U, FUNC_GPIO5);
    GPIO_DIS_OUTPUT(4);
    GPIO_OUTPUT_SET(5, 0);

    /* set timer for periodic invocations (3000msec) */
    os_timer_setfn(&__timer, __local_timer_callback, NULL);
    os_timer_arm(&__timer, 3000, VE_TIMER_REPEAT);
}

static void
__local_event_callback(os_event_t *ev) {

    enum __local_event_t event =
	(enum __local_event_t)ev->sig;

    switch(event) {
    case LOCAL_EVENT_INIT:
	return __local_event_init();
    case LOCAL_EVENT_EXEC:
	return __local_event_exec();
    }
}

void ICACHE_FLASH_ATTR
user_init() {

    static os_event_t  __local_task_queue[8];

    /* mandatory initialization
       - microsecond timer as first thing to do */
#ifdef USE_US_TIMER
    system_timer_reinit();
#endif
    /* baud-rate */
    uart_div_modify(0 /* uart0 */, UART_CLK_FREQ / 115200);

    /* gpio */
    gpio_init();

    /* initialize taskqueue */
    system_os_task(__local_event_callback, VE_TASKPRIO_LOW,
		   __local_task_queue, __asize(__local_task_queue));
    
    /* post callback */
    system_os_post(VE_TASKPRIO_LOW, LOCAL_EVENT_INIT, 0);

}

The results..

empty_call: 0 ticks
single_nop: 1 ticks
re_recurse: 11 ticks
1us_sleep: 125 ticks
10us_sleep: 845 ticks
100us_sleep: 8044 ticks
gpio_read_pin: 46 ticks
gpio_pull_updown: 97 ticks
gpio_status_read: 27 ticks
gpio_status_write: 9 ticks
postmessage: 121 ticks
system_get_time: 37 ticks
system_get_rtc_time: 71 ticks
system_get_cpufreq: 53 ticks
memcpy_1k_bytes: 887 ticks
memset_1k_bytes: 570 ticks
bzero_1k_bytes: 3114 ticks
Call overhead was 18 ticks