Use specific instructions for moving bytes around in a word. This speeds
things up, and as a side-effect, slightly lowers code size.
ARIA_P3 and ARIA_P1 are now 1 single-cycle instruction each (those
instructions are available in all architecture versions starting from v6-M).
Note: ARIA_P3 was already translated to a single instruction by Clang 3.8 and
armclang 6.5, but not arm-gcc 5.4 nor armcc 5.06.
ARIA_P2 is already efficiently translated to the minimal number of
instruction (1 in ARM mode, 2 in thumb mode) by all tested compilers
Manually compiled and inspected generated code with the following compilers:
arm-gcc 5.4, clang 3.8, armcc 5.06 (with and without --gnu), armclang 6.5.
Size reduction (arm-none-eabi-gcc -march=armv6-m -mthumb -Os): 5288 -> 5044 B
Effect on executing time of self-tests on a few boards:
FRDM-K64F (Cortex-M4): 444 -> 385 us (-13%)
LPC1768 (Cortex-M3): 488 -> 432 us (-11%)
FRDM-KL64Z (Cortex-M0): 1429 -> 1134 us (-20%)
Measured using a config.h with no cipher mode and the following program with
aria.c and aria.h copy-pasted to the online compiler:
#include "mbed.h"
#include "aria.h"
int main() {
Timer t;
t.start();
int ret = mbedtls_aria_self_test(0);
t.stop();
printf("ret = %d; time = %d us\n", ret, t.read_us());
}