cnhnln posted on 2005-3-28 18:16:33

Compiler Optimization for Speed

This article has some charts and tables,
so it's better to go read the original directly ^ ^
http://home.comcast.net/~jcunningham63/linux/GCC_Optimization_apollo.html
======================================
Compiler Optimization for Speed on an Athlon-XP System

Jeffrey K. Cunningham
last revised on July 24, 2003
These results were obtained by running a script with different sets of optimization flags and measuring the run times of the same floating-point-intensive analytic code on the same set of data. The script cleaned and recompiled the code with the given set of flags and logged the results. In order to reproduce the same background load on the machine each time, the xscreensaver daemon was killed, along with any other X applications (such as Mozilla, OpenOffice, etc.). The only other applications running were a few xterms, gkrellm, and a window manager (Enlightenment in this case). Due to asynchronous mail polling and other monitor updates, there was an observed timing uncertainty of roughly half a second, meaning that run-time differences of 1 second may not be statistically significant. For a more detailed discussion of the methodology, flags, and results for other systems, follow the links below.
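
As a concrete illustration (not the exact script used here - the real one appears at the end of this article), the basic shape of each trial was:


# Minimal sketch of one benchmark trial; "benchmark" is a placeholder
# for the actual analytic code (calcdr, shown later in this article).
killall xscreensaver              # remove the main background load
make clean && make                # rebuild with the flag set under test
time ./benchmark                  # repeat runs to gauge the ~0.5 s jitter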

Important Safety Tip: Be extremely careful using highly optimized flags when compiling your OS and basic tools, as you can break them badly. Some of these optimizations are incompatible with binary libraries compiled without them. In particular, be careful with -malign-double and -ffast-math: never use them to compile anything but stand-alone code, or code that only calls libraries compiled the same way.

Contents
Methodology
Flags affecting optimization
Results for a 1.8 GHz Athlon-XP system
Results for a 1.3 GHz Duron system
Results for a 1.7 GHz P4 Xeon system
A 1.8 GHz Athlon-XP System
This is a cheap system, just a few dollars more than the Duron system. Basically, it has a 233 MHz bus, a faster hard drive, and the Athlon-XP with its bigger cache (256K instead of 64K).

I am using GCC version 3.2.2. Here is the full output of gcc -v, in case it is useful:


Reading specs from /usr/lib/gcc-lib/i686-pc-linux-gnu/3.2.2/specs

Configured with: /var/tmp/portage/gcc-3.2.2/work/gcc-3.2.2/configure --prefix=/usr --bindir=/usr/i686-pc-linux-gnu/gcc-bin/3.2 --includedir=/usr/lib/gcc-lib/i686-pc-linux-gnu/3.2.2/include --datadir=/usr/share/gcc-data/i686-pc-linux-gnu/3.2 --mandir=/usr/share/gcc-data/i686-pc-linux-gnu/3.2/man --infodir=/usr/share/gcc-data/i686-pc-linux-gnu/3.2/info --enable-shared --host=i686-pc-linux-gnu --target=i686-pc-linux-gnu --with-system-zlib --enable-languages=c,c++,ada,f77,objc,java --enable-threads=posix --enable-long-long --disable-checking --enable-cstdio=stdio --enable-clocale=generic --enable-__cxa_atexit --enable-version-specific-runtime-libs --with-gxx-include-dir=/usr/lib/gcc-lib/i686-pc-linux-gnu/3.2.2/include/g++-v3 --with-local-prefix=/usr/local --enable-shared --disable-nls
Thread model: posix
gcc version 3.2.2


I used glibc version 2.3.1-r4 in all of the following tests. It was compiled with the following flags:


-march=athlon-xp -O3 -pipe -fomit-frame-pointer -mmmx -msse -m3dnow -mfpmath=sse,387


The system I performed this test on is based on an 1800 MHz Athlon-XP in an ASUS A7V8X motherboard, with a Radeon 7000 video card. According to the information in /proc/cpuinfo, the following flags are available:


flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow


The flags which I have been able to identify from this as being relevant to GCC optimizations are mmx and 3dnow.
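
If you want to repeat this check on your own machine, the same information can be pulled from /proc/cpuinfo with a short shell snippet (a sketch; the feature list to test for is just my pick of the interesting ones):


#!/bin/bash
# Show the kernel-reported CPU feature flags (the first CPU is enough).
grep -m1 '^flags' /proc/cpuinfo

# Test for the features GCC can exploit on this class of hardware.
for f in mmx sse sse2 3dnow; do
    if grep -m1 '^flags' /proc/cpuinfo | grep -qw "$f"; then
        echo "$f: present"
    else
        echo "$f: absent"
    fi
done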

A script called gcccpuopt is floating around the Internet which attempts to identify the GCC optimization flags that might be used for a particular hardware configuration; the copy I have is attributed to "pixelbeat". For my hardware, this script identifies the following flags:


-march=athlon-xp -mfpmath=sse -msse -mmmx -m3dnow


These formed the initial basis for the flags I tried; based on everything I had read, these were the flags I expected to give the best results.
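
Before committing to a long benchmark run, it is worth checking that your GCC actually accepts a candidate flag set (as seen below, some flags I had read about turn out not to exist for this target). A minimal sketch:


#!/bin/bash
# Compile a trivial program; GCC rejects any flag it does not know.
echo 'int main(void){ return 0; }' > /tmp/flagtest.c
if gcc -march=athlon-xp -mfpmath=sse -msse -mmmx -m3dnow \
       -o /tmp/flagtest /tmp/flagtest.c; then
    echo "flag set accepted"
else
    echo "at least one flag rejected"
fi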

Architecture Flag Effects
The fundamental optimization - the one that provided one of the greatest improvements in speed - was specifying the architecture. Using only the -O3 and -march=<val> flags, I obtained the following run times:

flags                     runtime (seconds)
-O3 -march=i386           350
-O3 -march=i486           350
-O3 -march=i586           347
-O3 -march=i686           330
-O3 -march=athlon         326
-O3 -march=athlon-xp      320
A reduction in run time from 350 to 320 seconds is a 9% increase in speed. The most significant improvement occurs going from i586 to i686, but the step from there to athlon-xp is nearly as great. It is interesting that there is no discernible difference between the i386 and i486 CPU and architecture (and very little between those and the i586). This would seem to indicate that on older machines there is probably little point in optimizing code - the standard pre-compiled binaries will do about as well. I am assuming, of course, that those machines would respond to the differences in these binaries similarly to the way this Athlon-XP system does. I believe that should be true, but I haven't tried it.

I should note that the uncertainty in all of these run-time estimates is about 0.5 seconds, due to multiple asynchronous processes running in the background. Thus, differences of 1 second appearing in these results may not be significant.
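
The architecture trials above are easy to automate; here is a sketch, assuming the Makefile reads an OPTFLAGS variable (command-line variable assignments override ones set inside the Makefile). The benchmark invocation is the one from the scripting section at the end of this article:


#!/bin/bash
# Rebuild and run the benchmark once for each architecture setting.
for arch in i386 i486 i586 i686 athlon athlon-xp; do
    make clean
    make OPTFLAGS="-O3 -march=$arch"
    echo "== -O3 -march=$arch ==" >> march_results
    ./calcdr -c bbm3b_max.cfg -C RTNX >> march_results
done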

Floating Point Processors

As I understand it, there are potentially three separate floating-point instruction sets available on an Athlon-XP: the standard Intel 387 set, and two others, identified as SSE and SSE2. It is commonly thought that significant performance gains may be had by enabling both in parallel, through use of the -mfpmath=sse,387 flag. I investigated this next. In all cases here and below, I assume the baseline -march=athlon-xp -O3 flags in addition to the ones I give explicitly.


flags (in addition to -march=athlon-xp -O3)     time (sec)
(none - baseline)                               320
-msse                                           319
-mfpmath=sse                                    320
-mfpmath=sse,387                                321
-msse -mfpmath=sse                              320
-msse -mfpmath=sse,387                          320
All these times are within 1 second of each other; I don't believe any significance can be drawn from these differences, given the slight loading variation mentioned previously. What is significant is that they are not demonstrably better than the baseline. My initial thought was that whatever speed increase these flags produce might be swamped by other inefficiencies yet to be optimized out, so I continued to try variations of these flags along with others through the course of my investigation. Nevertheless, I never saw any truly significant difference resulting from their use in any combination, in spite of the fact that this benchmark code makes intensive use of floating-point processing. I don't understand this, but the result is clear. What remains to be determined is whether this insensitivity to these flags is specific to the Athlon-XP or extends to Intel processors as well. I will be performing this analysis on another machine with an Intel Xeon processor shortly and will augment this study when I have those results.
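
One possible contributing factor (a conjecture, not something I have verified here): the Athlon-XP's SSE unit handles only single-precision arithmetic, so without SSE2 the -mfpmath=sse option cannot move double-precision work off the 387. Whether the flags change the generated code at all can be checked directly from the assembly; a sketch, with foo.c standing in for any floating-point source file:


#!/bin/bash
# Compare the instruction mix generated with and without SSE scalar math.
gcc -O3 -march=athlon-xp -S -o x87.s foo.c
gcc -O3 -march=athlon-xp -msse -mfpmath=sse -S -o sse.s foo.c

# Count lines with x87 arithmetic (f-prefixed) vs SSE scalar (*ss) opcodes.
echo "x87 float ops:  $(grep -cE 'f(add|sub|mul|div)' x87.s)"
echo "SSE scalar ops: $(grep -cE '(add|sub|mul|div)ss' sse.s)"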

Note: I also tried the -msse2 flag, since I had read several times that it was available on the Athlon-XP processor. It does not work - GCC issues an "unknown option" error.

Other Optimization Flags
Probably the two most commonly recommended additional flags for speed are -pipe and -fomit-frame-pointer. The former only improves compile time, by piping output through memory rather than to a file, and so is of no benefit to run time (see note 1). Another commonly suggested flag is -funroll-loops. I also experimented with the -ffast-math, -fno-trapping-math, and -fprefetch-loop-arrays flags. There are cautions associated with the -ffast-math flag which you should at least understand before using it. Based on what I've found, it will probably be worth your trouble. Here are the data:

flags (in addition to -march=athlon-xp -O3)             time (sec)
(none - baseline)                                       320
-fomit-frame-pointer                                    317
-funroll-loops                                          320
-ffast-math                                             254
-ffast-math -fno-trapping-math                          254
-ffast-math -fprefetch-loop-arrays                      256
-ffast-math -fomit-frame-pointer                        256
-ffast-math -fomit-frame-pointer -funroll-loops         248

The most significant result here is the discovery that the -ffast-math option is by far the most effective so far at increasing speed. The other two flags produced ambiguous results: in the absence of -ffast-math, the effect of -fomit-frame-pointer is just barely significant (I don't trust differences of less than 2 seconds) and the effect of -funroll-loops is not significant at all. With -ffast-math, however, -fomit-frame-pointer alone seems to have no effect, and only the further addition of -funroll-loops makes a difference. I'm not quite sure what to make of this.
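
The cautions around -ffast-math are easy to demonstrate. Here is a small sketch (my own illustration, not part of the benchmark): one of the things the flag does is let the compiler assume no NaNs exist, so self-comparisons can be folded away and code that detects NaN this way silently breaks:


#!/bin/bash
# Demonstrate a semantic change under -ffast-math: NaN self-comparison.
cat > /tmp/fm.c <<'EOF'
#include <stdio.h>
int main(void)
{
    volatile double zero = 0.0;
    double nan = 0.0 / zero;   /* produces NaN at run time */
    /* NaN != NaN is true in IEEE arithmetic, but under -ffast-math
       GCC may assume no NaNs and fold this comparison to false. */
    printf("nan != nan : %d\n", nan != nan);
    return 0;
}
EOF
gcc -O3             -o /tmp/fm_ieee /tmp/fm.c && /tmp/fm_ieee
gcc -O3 -ffast-math -o /tmp/fm_fast /tmp/fm.c && /tmp/fm_fast


With the GCC versions I am familiar with, the first binary prints 1 and the second typically prints 0, though the exact behavior depends on the compiler version.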

The Intel/AMD x86-Specific Flags
The next set of flags I threw into the mix were the remaining architecture/CPU-specific instructions. I wasn't expecting any performance increase from turning on MMX and 3DNow support, since the benchmark code doesn't explicitly use any such instructions, but I wasn't sure there weren't specific sequences of operations the compiler might implement more efficiently using these instructions were they available, so I tried them anyway. As I expected, they had no discernible effect. I only tried the -maccumulate-outgoing-args flag in a couple of combinations (only one appears below), but it seemed to have no significant effect either.


flags (in addition to -march=athlon-xp -O3)                     time (sec)
(none - baseline)                                               320
-malign-double                                                  311
-malign-double -ffast-math                                      242
-malign-double -ffast-math -maccumulate-outgoing-args           243
-malign-double -ffast-math -mmmx                                243
-malign-double -ffast-math -m3dnow                              243
-malign-double -ffast-math -funroll-loops                       248
-malign-double -ffast-math -funroll-loops -fomit-frame-pointer  241
The -malign-double flag, however, results in a distinct improvement. Note that there can be major problems with using it: it changes the alignment of doubles in structures so that they fall on even word boundaries, making them faster to access. Unfortunately, if you are linking to a library which expects structures aligned differently, your program will in all probability die a horrible death. For stand-alone code like mine it is great, but for code operating in an integrated environment it is probably best avoided - which is too bad, because it obviously makes a difference.
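
To see the layout change concretely, here is a sketch (a made-up structure, not from the benchmark) that prints the size of a struct mixing an int and a double. On i386 the double normally sits at offset 4; -malign-double pads it out to an 8-byte boundary, changing both the offsets and the total size:


#!/bin/bash
# Show how -malign-double changes structure layout (and thus the ABI).
cat > /tmp/align.c <<'EOF'
#include <stdio.h>
struct s { int i; double d; };
int main(void)
{
    printf("sizeof(struct s) = %u\n", (unsigned)sizeof(struct s));
    return 0;
}
EOF
gcc -O2                -o /tmp/align1 /tmp/align.c && /tmp/align1
gcc -O2 -malign-double -o /tmp/align2 /tmp/align.c && /tmp/align2


Two binaries disagreeing about this size is exactly the mismatch that kills mixed linking.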

I also tried the -mwide-multiply option mentioned in the GCC documentation, but it failed as well: the compiler gives an "unknown option" error message.

The full set of data I accumulated can be examined here in the form of an OpenOffice spreadsheet.

Summary
Benchmarks were run with a number of combinations of compiler optimization and option flags, including 6 architecture specifiers, 3 floating-point instruction-set flags, 5 other Intel/AMD x86-specific flags, and 6 general optimization flags not already included in the -O3 set. The -O3 optimization set was assumed throughout. Not all permutations of these flags were tried - that would require 20! distinct benchmark trials (more than 2.43 x 10^18). The space of possible flag combinations was instead sampled in sparse sets designed to flush out their effects, assuming separability of behavior - which may not always hold, but should in general, and the exceptions are probably not significant for most purposes.

The set of flags resulting in the best run times I was able to obtain for this hardware is either of the following:



-march=athlon-xp -O3 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer

-march=athlon-xp -O3 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -msse -mfpmath=sse,387


To give a rough breakdown of their individual effectiveness would be difficult, but their aggregate effect is to reduce my initial run time from 347 to 241 seconds, which amounts to roughly a 44% increase in speed (347 seconds was the run time for the flag set I was using routinely before this investigation). Not too shabby.

I added the floating-point instruction flags because they do not seem to harm performance, and it is possible they will improve it in other applications. One could make the same argument for the -mmmx and -m3dnow flags.

A Note on Scripting the Benchmark

Needless to say, it's a good idea to automate this testing process if you don't want to go crazy. I did it by modifying my benchmark Makefile so that it sets an OPTFLAGS variable by running a shell script which echoes the current flags I want to try. The Makefile line is:


OPTFLAGS = $(shell optflags.sh)
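
For context, here is roughly how that variable can be wired into the rest of the build (a sketch with illustrative file names - my actual Makefile differs):


# Re-evaluated on every make invocation, so each trial picks up
# whatever flags optflags.sh currently echoes.
OPTFLAGS = $(shell optflags.sh)
CFLAGS   = $(OPTFLAGS)

# (the recipe line below must begin with a tab)
calcdr: calcdr.c
	$(CC) $(CFLAGS) -o $@ $<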


Then I wrote a shell script that processed the sequence of trials, creating a new ~/bin/optflags.sh for each one before rebuilding and running the benchmark:


#!/bin/bash

# Write the given flag set into ~/bin/optflags.sh (which the Makefile
# runs via $(shell ...)), then rebuild and run the benchmark, logging
# the flags and the run time to ${ofile}.
function run()
{
    ofile="results3"
    echo "echo '$1'" > ~/bin/optflags.sh
    chmod 744 ~/bin/optflags.sh

    echo "Running flag set:" >> ${ofile}
    optflags.sh >> ${ofile}    # log the flags (assumes ~/bin is in $PATH)
    optflags.sh                # and echo them to the console
    make clean && make
    ./calcdr -c bbm3b_max.cfg -C RTNX >> ${ofile}
}

run "-march=athlon-xp -O3 -ffast-math -pipe -fomit-frame-pointer"
run "-march=athlon-xp -O3 -ffast-math -pipe -fomit-frame-pointer -funroll-loops"
run "-march=athlon-xp -O3 -msse"
run "-march=athlon-xp -O3 -mfpmath=sse"
run "-march=athlon-xp -O3 -mfpmath=sse,387"
run "-march=athlon-xp -O3 -msse -mfpmath=sse"
run "-march=athlon-xp -O3 -msse -mfpmath=sse,387"
run "-march=athlon-xp -O3 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -msse"
run "-march=athlon-xp -O3 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -msse -mfpmath=sse,387"
run "-march=athlon-xp -O3 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -fforce-mem"
run "-march=athlon-xp -O3 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -mmmx"
run "-march=athlon-xp -O3 -ffast-math -malign-double -funroll-loops -pipe -fomit-frame-pointer -m3dnow"


The command "./calcdr -c bbm3b_max.cfg -C RTNX" runs the benchmark.
The redirection ">> ${ofile}" appends the run-time results to a file.
Note: my benchmark program writes out its run time when it finishes.
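
If your benchmark does not print its own run time, GNU time can supply it externally (note the full path, to avoid the less capable shell built-in); a sketch of the equivalent line:


# GNU time writes its report to stderr, hence the 2>> redirection.
/usr/bin/time -f "elapsed: %e s" ./calcdr -c bbm3b_max.cfg -C RTNX 2>> ${ofile}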


Notes
(1) Thank you to aethyr on forums.gentoo.org for pointing out that the -pipe flag only affects compile time. On revisiting the data I realized that the variation I had attributed to it was not statistically significant, so I've revised this report to remove the references to it and avoid confusion.


Please feel free to email me with comments and corrections. If there is a way to squeeze more speed out of this sucker, I want to know about it.

[email protected]



linky_fan posted on 2005-4-1 13:45:02

It's good, but a bit dated (an old-core Athlon and an R7000), which only supports up to MMX. A while ago I spent a long time looking into the Pentium-M; according to GNU, the P-M is basically an enhanced P3 (with SSE2 added). :wink: