Thanks arfoll,
Yes, I did not expect that precaching the data would help much. I hacked up your mem version so that the setup caches the offset and mask:
dev->mem_offset = (dev->pin / 32) * sizeof(uint32_t); /* KJE: try precalculating */
dev->mem_mask = (uint32_t)1 << (dev->pin % 32); /* KJE: try precalculating; unsigned shift so bit 31 is safe */
I then changed the write to be:
*(volatile uint32_t*) (mmap_reg + dev->mem_offset + valoff) = dev->mem_mask;
I could also change the offset to pre-add mmap_reg, but I would not expect that to help much either. As I suspected, the change did not help much; maybe it gained a little. The logic analyzer showed times ranging from 0.23 to 0.27 us. Probably not worth mentioning.
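For anyone following along, here is a rough standalone sketch of the idea, not the actual mraa code. The struct gpio_ctx and the helpers gpio_precalc and gpio_write_mask are made-up names for illustration; only the offset and mask arithmetic and the store come from the lines above. It assumes the target register is one where writing the mask directly is what you want, as in the write shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-pin context; only the fields used here. */
typedef struct {
    unsigned int pin;
    size_t mem_offset;  /* byte offset of the 32-bit bank holding this pin */
    uint32_t mem_mask;  /* single-bit mask for this pin within its bank */
} gpio_ctx;

/* Do the divide and modulo once, at setup time. */
static void gpio_precalc(gpio_ctx *dev, unsigned int pin)
{
    assert(dev != NULL);
    dev->pin = pin;
    dev->mem_offset = (pin / 32) * sizeof(uint32_t);
    dev->mem_mask = (uint32_t)1 << (pin % 32); /* unsigned shift: safe for bit 31 */
}

/* Hot path: a single volatile store of the cached mask.
 * mmap_reg is the mapped register base, valoff the byte offset of the
 * register block being written (both supplied by the caller). */
static void gpio_write_mask(uint8_t *mmap_reg, size_t valoff, const gpio_ctx *dev)
{
    *(volatile uint32_t *)(mmap_reg + dev->mem_offset + valoff) = dev->mem_mask;
}
```

Pre-adding mmap_reg into the cached offset would only remove one more add from the write path, which fits with not expecting much from it.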
If you are interested, I could update my fork of mraa with this so you can take a look.
Kurt