Fixing GCC's code generation for the MaverickCrunch FPU

Martin Guy <martinwguy@gmail.com> 8 September 2009
Last updated: 7 October 2015

Contents

Preamble
What it does
Correctness tests
Speed tests
Download
Using it
Building it from source
Patches for other packages
Thanks

Update 7 October 2015
I've released my work from July 2013:

The 64-bit integer instructions should now work. They are enabled by the new -mcirrus-di flag, which provides 64-bit add, sub, mul and neg and gives extra speed in some cases. This support now has no known bugs, but is disabled by default awaiting more exhaustive testing.

GCC-4.4 is now also supported.

Update 18 June 2013
The new release of Debian, "wheezy", has a minimum gcc version of 4.4. The existing gcc-4.2 and 4.3 packages can still be installed and work on it but to run gcc-4.3-crunch you also have to install the libgmp3 package from the previous release, for example:

wget http://ftp.debian.org/debian/pool/main/g/gmp/libgmp3c2_4.3.2+dfsg-1_armel.deb
sudo dpkg -i libgmp3c2_4.3.2+dfsg-1_armel.deb

Image found on the "Gears of War" site during a web search for "Maverick 9312"

Preamble

I've been working on GCC-4 to make it generate working code for the Cirrus Logic MaverickCrunch FPU, as found in their ARM-based EP9302, EP9307, EP9312 and EP9315 chips, making floating point-intensive code between 2.5 and 4 times faster.

This follows on from Hasjim Williams' earlier work with gcc-4.1.2 and 4.2.0, a bundle of his more recent ideas and more hacks from me.

If you want to understand the patches themselves, there is an article about the MaverickCrunch FPU and GCC's problems with it on the Debian wiki, and I have added commentary at the top of the individual patch files for gcc-4.4.7, gcc-4.3.6 and gcc-4.2.4.

Discussion about this (and other issues with these chips) happens on the linux-cirrus mailing list.

What it does

The 20090908 version

performs single and double precision floating point in the FPU (add, sub, mul, neg, abs, cmp and conversions from single and double precision floats to integral types).
by default, disables the floating point cfnegs and cfnegd instructions, which fail to convert 0 to -0 as they should. You can re-enable them with the -funsafe-math-optimizations flag, which is one of those enabled by -ffast-math (gcc-4.3 has an even more specific -fno-signed-zeros flag, which is one of those enabled by -funsafe-math-optimizations).
by default, does not respect denormalised values, so the smallest representable values are ±2^-126 for floats and ±2^-1022 for doubles instead of the usual ±2^-149 and ±2^-1074.
has a -mieee flag, which enables handling of denormalized values by disabling all the buggy instructions. With this, floating point addition, subtraction, negation, absolute value and conversion between floats and integer types are performed in software, leaving only floating point multiplication and comparison performed in hardware.
has no negative impact on regular ARM code generation.
always works round the hardware bugs in the FPU and no longer has the -mcirrus-fix-invalid-insns flag since chip development has stopped and all existing silicon has the same bugs except for the original revision D0 which is not supported.
passes GCC's IEEE testsuite except for the one specific test that checks for correct handling of denormalized values. With -mieee it passes all the math tests.
passes all other testsuites that I've tried (see below) including the stringent "paranoia" floating point IEEE conformance test.
produces the fastest Maverick code yet: 5.94 MFLOPS according to FFTW's tests/bench -opatient cf1024 benchmark, and LAME takes 2m25 to encode a 30-second WAV file on a 200MHz EP9307 (instead of X.XX with normal GCC).
does not use the FPU's buggy 64-bit integer instructions unless the new -mcirrus-di flag is given. Programs that do a lot of 64-bit integer operations (add, sub, mul, neg, abs, shifts) may be faster using this, but rigorous testing will be necessary to ensure that bad code is not being produced. OpenSSL's testsuite fails if this is enabled. There is more detail at the head of the arm-crunch-cirrus-di-flag.patch file.
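
One way to see whether the negate instructions are being emitted is to look at the assembly the compiler produces. The following is only a sketch: the helper name count_cfneg and the file neg.c are hypothetical, and the only facts taken from the text above are the instruction names cfnegs and cfnegd and the compiler flags.

```shell
# count_cfneg: read assembly on stdin and print the number of lines
# that contain a cfnegs or cfnegd instruction.
# "|| true" keeps the exit status zero when grep finds no match.
count_cfneg() {
    grep -c -E 'cfneg[sd]' || true
}

# Hypothetical usage on a machine with the crunch compiler installed.
# neg.c is an assumed file, e.g.:  float neg(float x) { return -x; }
#
#   gcc-4.3-crunch -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp \
#       -O2 -S -o - neg.c | count_cfneg
#   gcc-4.3-crunch -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp \
#       -O2 -funsafe-math-optimizations -S -o - neg.c | count_cfneg
```

If the behaviour is as described above, the first command should report no negate instructions and the second should report at least one.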

Correctness tests

The compiler passes the following floating point-intensive test suites:

GCC's IEEE testsuite:

make check-gcc RUNTESTFLAGS="ieee.exp --target_board=unix/-mcpu=ep9312/-mfpu=maverick/-mfloat-abi=softfp/-mieee"

The paranoia IEEE precision torture test

gcc-4.4-crunch -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -mieee -o paranoia2 paranoia2.c -lm
./paranoia2

FFTW2's correctness tests:

wget http://www.fftw.org/fftw-2.1.5.tar.gz
tar xfz fftw-2.1.5.tar.gz
cd fftw-2.1.5
# double precision floating point
./configure CC=gcc-4.4-crunch CFLAGS="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -ffast-math -O2"
make && make -C tests check
# single precision floating point
make clean
./configure --enable-float CC=gcc-4.4-crunch CFLAGS="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -ffast-math -O2"
make && make -C tests check

FFTW3's correctness tests:

wget http://www.fftw.org/fftw-3.3.3.tar.gz
tar xfz fftw-3.3.3.tar.gz
cd fftw-3.3.3
# double precision floating point
./configure CC=gcc-4.4-crunch CFLAGS="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -ffast-math -O2"
make; (cd tests; perl check.pl -a)
# single precision floating point
make clean
./configure --enable-single CC=gcc-4.4-crunch CFLAGS="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -ffast-math -O2"
make; (cd tests; perl check.pl -a)

liboil 0.3.17's testsuite:

wget http://liboil.freedesktop.org/download/liboil-0.3.17.tar.gz
tar xfz liboil-0.3.17.tar.gz
cd liboil-0.3.17
./configure CC=gcc-4.4-crunch CFLAGS="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -ffast-math -O2"
make check

libsndfile's testsuite:

wget http://www.mega-nerd.com/libsndfile/files/libsndfile-1.0.25.tar.gz
tar xfz libsndfile-1.0.25.tar.gz
cd libsndfile-1.0.25
./configure CC=gcc-4.4-crunch CFLAGS="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -ffast-math -O2"
make check

libsamplerate's testsuite:

wget http://www.mega-nerd.com/SRC/libsamplerate-0.1.8.tar.gz
tar xfz libsamplerate-0.1.8.tar.gz
cd libsamplerate-0.1.8
apt-get install libsndfile-dev libfftw3-dev libasound2-dev
./configure CC=gcc-4.4-crunch CFLAGS="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -ffast-math -O2"
make check

libvorbis's testsuite:

wget http://downloads.xiph.org/releases/vorbis/libvorbis-1.3.3.tar.gz
tar xfz libvorbis-1.3.3.tar.gz
cd libvorbis-1.3.3
# libvorbis supplies -O20 flag
./configure CC=gcc-4.4-crunch CFLAGS="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp"
make check

The 64-bit integer intensive OpenSSL testsuite:

wget https://www.openssl.org/source/openssl-1.0.1e.tar.gz
tar xfz openssl-1.0.1e.tar.gz
cd openssl-1.0.1e
./config
vi Makefile
:/^CC= gcc/s/$/-4.3-crunch/
:/^CFLAG= /s/$/ -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -mcirrus-di/
:wq
make
make test

Speed tests

To compare execution speed of crunch-intensive tasks, I used:

FFTW 3.2alpha2's tests/bench -opatient cf1024 benchmark, indicative of the maximum speed of small, highly optimised inner loops.
LAME 3.97 encoding a 30-second stereo CD-quality file (actually two identical mono tracks) with default options and the output written to /dev/null, indicative of unspecialized mixed use of floating point and integer code.

libgsm 1.0.13 encoding a 16-bit mono 8kHz 30-second file of speech.
Edit the Makefile to say:

CC=gcc-4.4-crunch
CCFLAGS=" ... -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -ffast-math"
MULHACK=-DUSE_FLOAT_MUL

then

make
time bin/toast < intro.l > /dev/null

openssl 1.0.1e hashing a megabyte of garbage, compiled as above, then

dd if=/dev/urandom of=crap bs=1M count=1
time apps/openssl sha -sha512 < crap

The results, on a 200MHz Cirrus Logic EP9307 revision E1 under Debian "armel":

Compiler/options	FFTW mflops	LAME seconds	libgsm(*) seconds	openssl seconds
Soft-float
gcc-4.3 -O2 -ffast-math (softfloat)	3.59	365	23.5
gcc-4.4 -O2 -ffast-math (softfloat)				0.51u
Hard-float
gcc-4.2-crunch -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -O2 -ffast-math	6.13	138	5.72
gcc-4.3-crunch -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -O2 -ffast-math -mieee	3.83	276	-
gcc-4.3-crunch -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -O2 -ffast-math	5.94	145	5.72
gcc-4.4-crunch -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -O2 -ffast-math	7.65	141(+)	6.52(+)	0.48u

In other words, using the full Maverick instruction set, LAME is 2.5 times faster than with softfloat (145 seconds instead of 365). Using just the -mieee subset, it takes about 25% less time than softfloat (276 seconds), which is about half the speed of the full set. gcc-4.2 produces faster code than gcc-4.3 or gcc-4.4 (138 seconds for LAME against 145 and 141), except for FFTW, in which gcc-4.4 excels (7.65 MFLOPS).

(*) Although crunch libgsm is 4 times faster than softfloat, libgsm also has a fixed-point encoder, selected with MULHACK='', which is faster still (the same is true of the speex encoder).
(+) These tests were run on a Sim.One board with a 16-bit RAM data path instead of a 32-bit data path, which slows down all RAM accesses for a net slowdown of about 10%.
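
The ratios quoted here can be recomputed from the timings in the table. This is a small sketch; the function name speedup is made up for illustration and the figures are the LAME and libgsm times from the table.

```shell
# speedup BASE NEW: print how many times faster NEW is than BASE,
# where both arguments are run times in seconds.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

speedup 365 145     # LAME, softfloat vs full Maverick set: prints 2.52
speedup 365 276     # LAME, softfloat vs -mieee subset:     prints 1.32
speedup 23.5 5.72   # libgsm, softfloat vs crunch:          prints 4.11
```

A ratio of 1.32 for -mieee is the same result as the run time falling from 365 to 276 seconds, a reduction of about 24%.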

Download

There are installable binary tarballs, Debian packages and patches.

There is also a repository of prebuilt crunch-accelerated Debian packages for armel's "lenny" and "squeeze" releases. See martinwguy.net/crunch/debian

Using it

To get MaverickCrunch instructions you always have to use:

  gcc-4.3-crunch -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp

Other relevant options are:

-fno-signed-zeros: (gcc-4.3 only) If you do need to handle Not-a-Number values and infinities in calculations but do not care about the difference between 0 and -0, you can use this flag to enable the Maverick 'negate' instructions for a little extra speed.
-ffinite-math-only: This tells the compiler that NaNs and infinities do not need to be handled; this allows further optimization.
-funsafe-math-optimizations: This enables more optimizations that may give results not in accordance with the strict IEEE-754 math standard. Among others, it enables -fno-signed-zeros and in GCC-4.2 is the least invasive way to enable the Crunch negate instructions.
-ffast-math: This is the most aggressive math optimization flag, enabling all of the above and more.
-mieee: Most of Crunch's instructions take denormal values as zero; this flag only enables the ones that work at full IEEE precision (just multiply and compare).
-mcirrus-di: The FPU also has 64-bit integer instructions but they appear to be buggy. This flag enables them (load, store, add, subtract, convert to/from 32 bit and logical shifts by up to 31 places). Caveat emptor.
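
As a rough illustration of how these options combine, here is a sketch of a shell helper that prints a CFLAGS string. The helper name crunch_cflags and its three mode names are invented for this example; the flags themselves are the ones listed above.

```shell
# crunch_cflags MODE: print a CFLAGS string for the given mode.
#   base  only the flags that are always required
#   ieee  base plus -mieee (full IEEE precision, fewer FPU instructions)
#   fast  base plus -ffast-math (most aggressive math optimization)
crunch_cflags() {
    base="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp"
    case "$1" in
        base) echo "$base" ;;
        ieee) echo "$base -mieee" ;;
        fast) echo "$base -ffast-math" ;;
        *)    echo "usage: crunch_cflags base|ieee|fast" >&2; return 1 ;;
    esac
}

# Hypothetical usage:
#   ./configure CC=gcc-4.3-crunch CFLAGS="$(crunch_cflags fast) -O2"
```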

When running configure scripts, I normally use:

  ./configure CC=gcc-4.3-crunch CFLAGS="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -ffast-math -O2"

However it's usually less trouble to make a directory of wrapper scripts replacing all of GCC's command names with the crunch version:

mkdir ~/crunch
cat > ~/crunch/gcc << 'EOF'
#! /bin/sh

exec gcc-4.3-crunch -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp -fno-signed-zeros "$@"
EOF
chmod 755 ~/crunch/gcc
ln -s gcc ~/crunch/cc
ln -s gcc ~/crunch/gcc-4.3
ln -s gcc ~/crunch/arm-linux-gnueabi-gcc

and fool the build system into using them:

PATH=~/crunch:$PATH ./configure
PATH=~/crunch:$PATH make
PATH=~/crunch:$PATH make install

or, to build accelerated Debian packages:

apt-get source foobar
sudo apt-get build-dep foobar
cd foobar-*
PATH=~/crunch:$PATH dpkg-buildpackage -rfakeroot -B
cd ..
dpkg -i foobar*.deb

Building it from source

Resource requirements

GCC keeps on growing. One of gcc-4.3's C source files, automatically generated during the build, insn-recog.c, is now over 4 MB in size and gcc-4.3 requires 219MB of virtual memory to compile it with normal optimization.

Memory: If you have less than 160MB of physical RAM plus 64MB of swap, you will need to stop the compilation when it reaches that file, compile that one file without optimisation by saying make CFLAGS=-g, interrupt the build once that file has been done, and then carry on as usual.

Disk space: The full sources unpack to 500MB (360MB for gcc-4.2) and a further 200MB (140MB for gcc-4.2) are needed to build the C compiler. If you have less space, you can fetch a "gcc-core" source tarball instead, which only contains the C compiler and unpacks to about 200MB, for a total of 400MB when built.
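
To find out in advance whether the memory workaround is likely to be needed, something like the following sketch could be used. It assumes that what matters is the sum of RAM and swap, takes 160MB + 64MB = 224MB as the threshold, and reads the Linux /proc/meminfo format; the function name enough_memory is hypothetical.

```shell
# enough_memory [FILE]: succeed if MemTotal + SwapTotal in FILE
# (default /proc/meminfo, values in kB) add up to at least 224MB.
# The 224MB threshold is an assumption: 160MB RAM plus 64MB swap.
enough_memory() {
    awk '/^MemTotal:|^SwapTotal:/ { kb += $2 }
         END { exit (kb >= 224 * 1024) ? 0 : 1 }' "${1:-/proc/meminfo}"
}

# Hypothetical usage before starting the build:
#   enough_memory || echo "insn-recog.c will need compiling with CFLAGS=-g"
```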

Build procedure

I go:

    wget ftp://sourceware.org/pub/gcc/releases/gcc-4.4.7/gcc-4.4.7.tar.bz2
    tar xjf gcc-4.4.7.tar.bz2
    cd gcc-4.4.7

    for a in `cat ../gcc-4.4.7-patches/series`
    do
	patch -p1 < ../gcc-4.4.7-patches/$a
    done

    cd ..
    mkdir gcc-4.4.7-build
    cd gcc-4.4.7-build

    # The same basic configuration as Debian
    ../gcc-4.4.7/configure \
	CONFIG_SHELL=/bin/sh \
	--enable-languages=c --prefix=/usr/local \
	--enable-shared --with-system-zlib --without-included-gettext \
	--enable-threads=posix --enable-nls --program-suffix=-4.4-crunch \
	--enable-clocale=gnu --enable-mpfr --disable-libssp \
	--disable-sjlj-exceptions --disable-bootstrap \
	--with-arch=armv4t armv4tl-crunch-linux-gnueabi
    make CFLAGS_FOR_TARGET="-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp"
    ../s/install
    ../s/tarball gcc-4.4-crunch

The tarball script dumps a .tar.gz of the essential installed files and another of the source patchset in the ../packages directory.

There is also a test directory here with some program fragments that I used to probe hardware bug presence and characteristics.

Patches for other packages

These are patches for GCC and work fine for all regular C software that I've tried. Some other software packages are known to need Crunch tweaks as well:

binutils

C++: Some C++ files will not compile, saying
".save {mv8}" Error: register expected
although the same files will compile with optimization disabled.

glibc

C: Values held in Maverick registers are not restored when performing a setjmp/longjmp pair.
C++: Similarly, exception unwinding (performing a throw back to a catch block in a different function) does not restore floating point and 64-bit values held in Maverick registers.
When libm is compiled with Maverick support, sin() goes into an infinite loop on some values, as demonstrated by this test program.

There are fixes to glibc and binutils to solve these and other issues in a message to the linux-cirrus mailing list, though I haven't seen whether this solves the sin() looping problem or not.

Thanks

Thanks to Hasjim Williams for the work that this is based on, to David Herring for prompting me to start working on this again and to simplemachines.it for funding the completion of these patches.

Martin Guy <martinwguy@gmail.com>
