Cryptogams SHA

From OpenSSLWiki
Revision as of 09:38, 18 May 2019 by Jwalton (talk | contribs) (Add initial page.)

Cryptogams is Andy Polyakov's project used to develop high speed cryptographic primitives and share them with other developers. This wiki article shows you how to use the Cryptogams ARMv4 SHA-1 implementation. According to the head notes, the ARMv4 implementation runs at around 6.5 cycles per byte (cpb). Typical C/C++ implementations run at around 10 to 20 cpb, so Andy's routines should outperform them.

Andy's Cryptogams implementations are provided by OpenSSL, but they are also available standalone under a BSD license. The BSD style license is permissive and allows developers to use Andy's high speed cryptography without an OpenSSL dependency or OpenSSL's licensing terms.

There are six steps to the process. The first step obtains the sources. The second step creates an ASM source file. The third step compiles and assembles the source file into an object file. The fourth step determines the API. The fifth step creates a C header file. The final step integrates the object file into a program.

A few cautions before you begin. First, you are going to examine undocumented features of the OpenSSL library to learn how to work with the Cryptogams sources. The Cryptogams sources are stable, but things could change over time. Second, the ARMv4 implementation only hashes full SHA blocks. You are responsible for things like data alignment, padding and side channel countermeasures.
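To illustrate the padding responsibility mentioned above, here is a minimal sketch of standard SHA-1 padding (a 0x80 byte, zero fill, then the message length in bits as a 64-bit big-endian value). The helper name sha1_pad_tail and its parameters are hypothetical and are not part of Cryptogams or OpenSSL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Cryptogams or OpenSSL.
 * tail:      the last (total_len % 64) bytes of the message
 * total_len: length of the whole message in bytes
 * out:       receives one or two padded 64-byte blocks
 * Returns the number of blocks written to out (1 or 2).
 */
static size_t sha1_pad_tail(const uint8_t *tail, uint64_t total_len,
                            uint8_t out[128])
{
    assert(out != NULL);

    const size_t tail_len = (size_t)(total_len % 64);
    /* 0x80 plus the 8-byte length must fit, else spill into a second block */
    const size_t blocks = (tail_len + 1 + 8 <= 64) ? 1 : 2;
    const uint64_t bit_len = total_len * 8;

    memset(out, 0x00, 128);
    if (tail_len != 0)
        memcpy(out, tail, tail_len);
    out[tail_len] = 0x80;

    /* message length in bits, big-endian, in the last 8 bytes */
    for (size_t i = 0; i < 8; ++i)
        out[blocks * 64 - 1 - i] = (uint8_t)(bit_len >> (8 * i));

    return blocks;
}
```

For an empty message the helper produces a single block that starts with 0x80 and is otherwise zero. The returned count is the number of blocks you would then hand to the block function.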

Obtain Source Files

There are two source files you need for Cryptogams SHA. The first lives under crypto/perlasm/ and the second under crypto/sha/asm/ in the OpenSSL sources. The following commands fetch OpenSSL and then peel off the two Cryptogams files of interest.

# Clone OpenSSL for the latest Cryptogams sources
git clone

mkdir cryptogams/

cp ./openssl/crypto/perlasm/ ./cryptogams/
cp ./openssl/crypto/sha/asm/ ./cryptogams/

cd cryptogams/

Create ASM File

The second step is to run the SHA-1 Perl script to produce an assembly language source file that can be consumed by GCC. The script internally calls the translate program. linux32 is the flavor used by the translate program and sha1-armv4.S is the output filename. In the command below note the *.S file extension, which uses a capital S. Do not use a lowercase s: GCC must drive the compile and assemble step, and it only runs the C preprocessor on assembly files with the uppercase extension.

perl linux32 sha1-armv4.S

GCC is needed to drive the process because there are C macros in the source file. Some Cryptogam source files have this requirement, while some others do not. sha1-armv4 happens to have the requirement.

$ cat sha1-armv4.S
@ Copyright 2007-2018 The OpenSSL Project Authors. All Rights Reserved.

#ifndef __KERNEL__
# include "arm_arch.h"
# define __ARM_ARCH__ __LINUX_ARM_ARCH__

At this point there is an ASM file but it needs two small fixups. First, arm_arch.h is an OpenSSL source file so the dependency must be removed. Second, GCC defines __ARM_ARCH instead of __ARM_ARCH__ so a sed is needed.

To fix up the source file, execute the following two commands:

# Remove OpenSSL include
sed -i 's/# include "arm_arch.h"//g' sha1-armv4.S

# Fix GCC defines
sed -i 's/__ARM_ARCH__/__ARM_ARCH/g' sha1-armv4.S

Alternatively, instead of the two sed commands, you can open arm_arch.h, copy the defines and paste them directly into sha1-armv4.S. Take care when using arm_arch.h as it carries the OpenSSL license.

After the two fixups sha1-armv4.S is ready to be compiled by GCC.
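If you want to confirm which spelling your compiler predefines before relying on the sed above, a small check like the following can help. This is only a sketch. The function name arm_arch_level is hypothetical, and it returns 0 when the macro is absent (for example on a non-ARM host).

```c
#include <assert.h>

/* Hypothetical helper: reports the architecture level the compiler targets. */
static int arm_arch_level(void)
{
#if defined(__ARM_ARCH)
    return __ARM_ARCH;   /* the spelling GCC predefines, e.g. 4, 5 or 7 */
#else
    return 0;            /* macro not defined: not an ARM target */
#endif
}
```

Compiling the same function with different -march values shows how the value tracks the option you pass to GCC.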

Compile Source File

The source file is ready to be compiled and assembled. At this point there are two choices. First, you can use ARMv5t or higher which includes Thumb instructions. The following compiles the source file with ARMv5t.

$ gcc -march=armv5t -c sha1-armv4.S

The second choice uses ARMv4 and avoids Thumb instructions. If you want to avoid Thumb then add -marm to your compile command.

$ gcc -march=armv4 -marm -c sha1-armv4.S

You now have an object file with the following symbols. Symbols with a capital T are public and exported. Symbols with a lowercase t are private and should not be used.

$ gcc -march=armv4 -marm -c sha1-armv4.S
$ nm sha1-armv4.o
00000000 T sha1_block_data_order

And you can inspect the generated code with objdump.

$ objdump --disassemble sha1-armv4.o
sha1-armv4.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <sha1_block_data_order>:
   0:   e92d5ff0        push    {r4, r5, r6, r7, r8, r9, sl, fp, ip, lr}
   4:   e0812302        add     r2, r1, r2, lsl #6
   8:   e89000f8        ldm     r0, {r3, r4, r5, r6, r7}
   c:   e59f858c        ldr     r8, [pc, #1420] ; 5a0 <sha1_block_data_order+0x5a0>
  10:   e1a0e00d        mov     lr, sp
  14:   e24dd03c        sub     sp, sp, #60     ; 0x3c
  18:   e1a05f65        ror     r5, r5, #30
  1c:   e1a06f66        ror     r6, r6, #30


Determine API

The next step is to determine the API so you can call it from a C program. Unfortunately the API is not documented and you have to dig around the OpenSSL sources. Fortunately there is only one function of interest, called sha1_block_data_order.

A quick grep of OpenSSL sources reveals the following for sha1_block_data_order.

openssl$ grep -nIR sha1_block_data_order | grep '\.c'
crypto/evp/e_sha_cbc_hmac_sha1.c:95:    void sha1_block_data_order(void *c, const void *p, size_t len);
crypto/evp/e_sha_cbc_hmac_sha1.c:115:        sha1_block_data_order(c, ptr, len / SHA_CBLOCK);
crypto/evp/e_sha_cbc_hmac_sha1.c:615:        sha1_block_data_order(&key->md, data, 1);
crypto/evp/e_sha_cbc_hmac_sha1.c:631:        sha1_block_data_order(&key->md, data, 1);

We need several more symbols, and they are OPENSSL_armcap_P, ARMV7_NEON and ARMV8_SHA1.

$ grep -nIR OPENSSL_armcap_P
crypto/armcap.c:20:unsigned int OPENSSL_armcap_P = 0;

Lather, rinse, repeat for ARMV7_NEON and ARMV8_SHA1.

Create C Header

The fifth step creates a C header file based on information from Determine API. The header file is needed for two reasons. First, it removes the OpenSSL dependency from your project. Second, it avoids OpenSSL licensing violations.

Below is the C header file you can use. While it is not obvious, the len parameter seen in Determine API is a block count, not a byte count.

/* Header file for use with Cryptogam's ARMv4 SHA1.    */
/* Also see  */
/*  */

#ifndef CRYPTOGAMS_SHA1_ARMV4_H
#define CRYPTOGAMS_SHA1_ARMV4_H

#include <stddef.h>  /* size_t */

#ifdef __cplusplus
extern "C" {
#endif

extern unsigned int OPENSSL_armcap_P;
void sha1_block_data_order(void *state, const void *data, size_t blocks);

/* Auxval caps */
#ifndef HWCAP_ARM_NEON
# define HWCAP_ARM_NEON 4096
#endif
#ifndef HWCAP_NEON
# define HWCAP_NEON HWCAP_ARM_NEON
#endif
#ifndef HWCAP_SHA1
# define HWCAP_SHA1 (1 << 5)
#endif

/* OpenSSL caps */
#define ARMV7_NEON (1<<0)
#define ARMV8_SHA1 (1<<3)

#ifdef __cplusplus
}
#endif

#endif  /* CRYPTOGAMS_SHA1_ARMV4_H */
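Because the third argument is a block count, a caller holding a byte count has to split it into whole 64-byte blocks and a leftover. The sketch below shows that arithmetic only. The helper name sha1_split_length is hypothetical and is not part of the header above.

```c
#include <assert.h>
#include <stddef.h>

#define SHA1_BLOCK_BYTES 64u   /* SHA-1 processes 64-byte blocks */

/*
 * Hypothetical helper: splits a byte count into the number of whole blocks
 * (the value to pass as the third argument) and the leftover bytes that
 * still need to be buffered or padded by the caller.
 */
static void sha1_split_length(size_t nbytes, size_t *blocks, size_t *leftover)
{
    assert(blocks != NULL && leftover != NULL);
    *blocks = nbytes / SHA1_BLOCK_BYTES;
    *leftover = nbytes % SHA1_BLOCK_BYTES;
}
```

With nbytes equal to 130, for example, the block function would be called with a count of 2 and the final 2 bytes would be left for the padding step.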

Test Program

The final step is to test the integration of Cryptogams SHA with your program.

$ gcc -std=c99 sha1-armv4-test.c ./sha1-armv4.o -o sha1-armv4-test.exe
$ ./sha1-armv4-test.exe
SHA1 hash of empty message: DA39A3EE5E6B4B0D...
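The output above is truncated after the first eight bytes. If you need the full 20-byte digest, the five 32-bit state words can be serialized most significant byte first. The following is a minimal sketch of that conversion. The names sha1_state_to_digest and digest_to_hex are hypothetical helpers, not part of Cryptogams.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the five state words as 20 big-endian bytes. */
static void sha1_state_to_digest(const uint32_t state[5], uint8_t digest[20])
{
    assert(state != NULL && digest != NULL);
    for (size_t i = 0; i < 5; ++i) {
        digest[4 * i + 0] = (uint8_t)(state[i] >> 24);
        digest[4 * i + 1] = (uint8_t)(state[i] >> 16);
        digest[4 * i + 2] = (uint8_t)(state[i] >> 8);
        digest[4 * i + 3] = (uint8_t)(state[i] >> 0);
    }
}

/* Hypothetical helper: formats the digest as 40 uppercase hex characters. */
static void digest_to_hex(const uint8_t digest[20], char hex[41])
{
    static const char digits[] = "0123456789ABCDEF";
    assert(digest != NULL && hex != NULL);
    for (size_t i = 0; i < 20; ++i) {
        hex[2 * i + 0] = digits[digest[i] >> 4];
        hex[2 * i + 1] = digits[digest[i] & 0x0F];
    }
    hex[40] = '\0';
}
```

Applied to the state left behind after hashing the padded empty message, the hex string begins with the same DA39A3EE5E6B4B0D prefix shown above.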

And the test program is shown below.

$ cat sha1-armv4-test.c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>
#include "sha1-armv4.h"

/* processor caps */
unsigned int OPENSSL_armcap_P = 0;

int main(int argc, char* argv[])
{
    /* processor caps */
    if (getauxval(AT_HWCAP) & HWCAP_NEON)
        OPENSSL_armcap_P |= ARMV7_NEON;
    if (getauxval(AT_HWCAP) & HWCAP_SHA1)
        OPENSSL_armcap_P |= ARMV8_SHA1;

    /* empty message with padding */
    uint8_t message[64];
    memset(message, 0x00, sizeof(message));
    message[0] = 0x80;

    /* initial state */
    uint32_t state[5] = {0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0};

    /* hash exactly one 64-byte block */
    sha1_block_data_order(state, message, 1);

    /* first eight digest bytes, most significant byte first */
    const uint8_t b1 = (uint8_t)(state[0] >> 24);
    const uint8_t b2 = (uint8_t)(state[0] >> 16);
    const uint8_t b3 = (uint8_t)(state[0] >>  8);
    const uint8_t b4 = (uint8_t)(state[0] >>  0);
    const uint8_t b5 = (uint8_t)(state[1] >> 24);
    const uint8_t b6 = (uint8_t)(state[1] >> 16);
    const uint8_t b7 = (uint8_t)(state[1] >>  8);
    const uint8_t b8 = (uint8_t)(state[1] >>  0);

    /* DA39A3EE5E6B4B0D... */
    printf("SHA1 hash of empty message: ");
    printf("%02X%02X%02X%02X%02X%02X%02X%02X...\n",
        b1, b2, b3, b4, b5, b6, b7, b8);

    int success = ((b1 == 0xDA) && (b2 == 0x39) && (b3 == 0xA3) && (b4 == 0xEE) &&
                    (b5 == 0x5E) && (b6 == 0x6B) && (b7 == 0x4B) && (b8 == 0x0D));

    /* report a mismatch on stderr; the exit code carries the result */
    if (!success)
        fprintf(stderr, "Hash mismatch\n");

    return (success != 0 ? 0 : 1);
}


You can perform a rough benchmark using the code shown below. Prior to executing the benchmark program you should switch the CPU frequency governor from ondemand or powersave to performance.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <string.h>
#include <sys/auxv.h>
#include "sha1-armv4.h"

typedef unsigned char byte;

/* processor caps */
unsigned int OPENSSL_armcap_P = 0;

int main(int argc, char* argv[])
{
    /* set processor caps */
    if (getauxval(AT_HWCAP) & HWCAP_NEON)
        OPENSSL_armcap_P |= ARMV7_NEON;
    if (getauxval(AT_HWCAP) & HWCAP_SHA1)
        OPENSSL_armcap_P |= ARMV8_SHA1;

    const unsigned int STEPS = 128;
    byte* buf = (byte*)malloc(STEPS*64+64);
    if (buf == NULL)
        return 1;
    /* zero the whole buffer so no uninitialized memory is hashed */
    memset(buf, 0x00, STEPS*64+64);

    double elapsed = 0.0;
    size_t total = 0;

    struct timespec start, end;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);

    uint32_t state[5] = {0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0};

    do
    {
        /* hash STEPS consecutive blocks; the last argument is a block count */
        for (unsigned int i=0; i<STEPS; ++i)
            sha1_block_data_order(state, buf+i*64, 1);
        /* 64 bytes per block were processed in each step */
        total += 64*STEPS;
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
        elapsed = (end.tv_sec-start.tv_sec);
    }
    while (elapsed < 3 /* seconds */);

    /* Increase precision of elapsed time */
    elapsed = ((double)end.tv_sec-start.tv_sec) +
              ((double)end.tv_nsec-start.tv_nsec) / 1000 / 1000 / 1000;

    /* CPU freq of 1 GHz */
    const double cpuFreq = 1000.0*1000*1000;

    const double bytes = total;
    const double ghz = cpuFreq / 1000 / 1000 / 1000;
    const double mbs = bytes / elapsed / 1024 / 1024;
    const double cpb = elapsed * cpuFreq / bytes;
    printf("%.0f bytes\n", bytes);
    printf("%.02f mbs\n", mbs);
    printf("%.02f cpb\n", cpb);

    free(buf);
    return 0;
}

The results below are from a Libre Computer Tritium H3 with a Cortex-A7 Sun7i SoC running at 1 GHz. A C/C++ SHA implementation runs about 22 cpb on the dev-board. Notice sha1-armv4.S was compiled with -march=armv7.

$ gcc -std=c99 -march=armv7 -c sha1-armv4.S -o sha1-armv7.o
$ gcc -O3 -std=c99 sha1-armv7-test.c sha1-armv7.o -o sha1-armv7-test.exe
$ ./sha1-armv7-test.exe
180994048 bytes
57.59 mbs
16.56 cpb
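The two derived figures in the output follow directly from the byte count, the elapsed time and the assumed clock frequency. The sketch below isolates that arithmetic. The function names are hypothetical, and the clock frequency is an input you must supply yourself, as the benchmark does with its hard-coded 1 GHz.

```c
#include <assert.h>

/* Hypothetical helper: cycles spent per byte hashed. */
static double cycles_per_byte(double elapsed_sec, double cpu_hz, double bytes)
{
    assert(bytes > 0.0);
    return elapsed_sec * cpu_hz / bytes;
}

/* Hypothetical helper: throughput in units of 1024*1024 bytes per second. */
static double mib_per_second(double elapsed_sec, double bytes)
{
    assert(elapsed_sec > 0.0);
    return bytes / elapsed_sec / 1024.0 / 1024.0;
}
```

Working backwards from the output above, 180994048 bytes at 16.56 cpb and 1 GHz corresponds to about 3 seconds of CPU time, which matches the benchmark's time limit.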


If you are using Autotools you can add the following to your Autoconf and Automake input files to conditionally compile sha1-armv4.S for A-32 platforms. You will need to detect ARM A-32 and set IS_ARM32 to non-0. Also see Automake Assembly Support in section 8.13 of the manual.

First, the Autoconf recipe:

# Set ASM tools

# Used by to compile sha1-armv4.S
if test "$IS_ARM32" != "0"; then

   ## Save CFLAGS

   CFLAGS="-march=armv7-a -Wa,--noexecstack"
   AC_MSG_CHECKING([if $CC supports $CFLAGS])
      [AC_MSG_RESULT([yes]); AC_SUBST([tr_RESULT], [1])],
      [AC_MSG_RESULT([no]);  AC_SUBST([tr_RESULT], [0])]

   if test "$tr_RESULT" = "1"; then



      AC_MSG_CHECKING([if $CC supports $CFLAGS])
         [AC_MSG_RESULT([yes]); AC_SUBST([tr_RESULT], [1])],
         [AC_MSG_RESULT([no]);  AC_SUBST([tr_RESULT], [0])]

      if test "$tr_RESULT" = "1"; then

   ## Restore CFLAGS
   # Required for other platforms

Second, the Automake recipe:


  sha_armv4_la_SOURCES = sha1-armv4.S

  pkginclude_HEADERS += sha1-armv4.h