Path: nih-csl!darwin.sura.net!mlb.semi.harris.com!usenet.ufl.edu!gatech!swrinde!news.dell.com!tadpole.com!uunet!munnari.oz.au!news.uwa.edu.au!info.curtin.edu.au!ncrpda!rocky.curtin.edu.au!user From: peter.lewis@info.curtin.edu.au (Peter N Lewis) Newsgroups: alt.sources.mac,comp.sys.mac.programmer Subject: Square Root code Followup-To: comp.sys.mac.programmer Date: Tue, 20 Sep 1994 10:10:48 +0800 Organization: NCRPDA, Curtin University Lines: 519 Message-ID: References: <358r4b$1qs@search01.news.aol.com> <1994Sep19.120554.24684@cc.ic.ac.uk> NNTP-Posting-Host: ncrpda.curtin.edu.au >Well, assuming that you're just comparing distances, the most obvious >thing you could do is just calculate (dist.h*dist.h + dist.v*dist.v), Correct, the fastest square root is a NOP, but if you actually need the real distance, here's lots of different choices of code: Peter. ***** Path: ncrpda!info.curtin.edu.au!news.uwa.edu.au!harbinger.cc.monash.edu.au!msuinfo!agate!ihnp4.ucsd.edu!sdd.hp.com!nigel.msen.com!zib-berlin.de!news.rrz.uni-hamburg.de!news.dkrz.de!news.dfn.de!scsing.switch.ch!swidir.switch.ch!univ-lyon1.fr!jussieu.fr!nef.ens.fr!corvette!pottier From: pottier@corvette.ens.fr (Francois Pottier) Newsgroups: comp.sys.mac.programmer Subject: Re: Faster Square Root Algorithm Date: 4 May 1994 09:42:29 GMT Organization: Ecole Normale Superieure, PARIS, France Lines: 494 Message-ID: <2q7qm5$9uo@nef.ens.fr> References: NNTP-Posting-Host: corvette.ens.fr In article , John D. Busfield wrote: > Does anyone have an algorithm or code for computing square roots. There was a thread on this subject about a year ago. Here's what I kept: ------------------------------------------------------------------------- >Does somebody have code or an algorithm for extracting a long integer >square root and returning an integer? There was a long series of letters in Dr. Dobbs' Journal a few years back that I was a part of. Here are two 'competing' subroutines for the 68000 that I wrote. One is a Newton-Raphson iterator, the other a hybrid of three different subroutines using two different methods, heavily optimized for speed. The non-Newton one is faster in some cases and slower in others. Choosing one as 'best' depends on what kind of input ranges you expect to see most often. Newton's time doesn't vary much on the average based on the input argument. The non-Newton one ranges from a lot better (small arguments) to a little worse (large arguments), but is always better than Newton's worst case. Don't neglect the fact that on a FPU-equipped machine it may be faster to use floating point. I've done no research into this possibility, nor have I timed this stuff on any 68020+ system. The speed balance is no doubt different then. (For that matter, the 68000 testbed had no wait states nor interrupt system overhead, which doesn't necessarily apply to a 68000 PowerBook, and certainly doesn't apply to a Mac+ or earlier.) I'm personally fond of the non-Newton version, because the algorithm only uses shifts and adds, so it could be implemented in microcode with about same speed as a divide. ************************************************************************* * * * Integer Square Root (32 to 16 bit). * * * * (Newton-Raphson method). * * * * Call with: * * D0.L = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.L) * * * * Notes: Result fits in D0.W, but is valid in longword. * * Takes from 338 cycles (1 shift, 1 division) to * * 1580 cycles (16 shifts, 4 divisions) (including rts). * * Averages 854 cycles measured over first 65535 roots. * * Averages 992 cycles measured over first 500000 roots. * * * ************************************************************************* .globl lsqrt * Cycles lsqrt movem.l d1-d2,-(sp) (24) move.l d0,d1 (4) ; Set up for guessing algorithm. beq.s return (10/8) ; Don't process zero. moveq #1,d2 (4) guess cmp.l d2,d1 (6) ; Get a guess that is guaranteed to be bls.s newton (10/8) ; too high, but not by much, by dividing the add.l d2,d2 (8) ; argument by two and multiplying a 1 by 2 lsr.l #1,d1 (10) ; until the power of two passes the modified bra.s guess (10) ; argument, then average these two numbers. newton add.l d1,d2 (8) ; Average the two guesses. lsr.l #1,d2 (10) move.l d0,d1 (4) ; Generate the next approximation(s) divu d2,d1 (140) ; via the Newton-Raphson method. bvs.s done (10/8) ; Handle out-of-range input (cheats!) cmp.w d1,d2 (4) ; Have we converged? bls.s done (10/8) swap d1 (4) ; No, kill the remainder so the clr.w d1 (4) ; next average comes out right. swap d1 (4) bra.s newton (10) done clr.w d0 (4) ; Return a word answer in a longword. swap d0 (4) move.w d2,d0 (4) return movem.l (sp)+,d1-d2 (28) rts (16) end ************************************************************************* * * * Integer Square Root (32 to 16 bit). * * * * (Exact method, not approximate). * * * * Call with: * * D0.L = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.L) * * * * Notes: Result fits in D0.W, but is valid in longword. * * Takes from 122 to 1272 cycles (including rts). * * Averages 610 cycles measured over first 65535 roots. * * Averages 1104 cycles measured over first 500000 roots. * * * ************************************************************************* .globl lsqrt * Cycles lsqrt tst.l d0 (4) ; Skip doing zero. beq.s done (10/8) cmp.l #$10000,d0 (14) ; If is a longword, use the long routine. bhs.s glsqrt (10/8) cmp.w #625,d0 (8) ; Would the short word routine be quicker? bhi.s gsqrt (10/8) ; No, use general purpose word routine. * ; Otherwise fall into special routine. * * For speed, we use three exit points. * This is cheesy, but this is a speed-optimized subroutine! page ************************************************************************* * * * Faster Integer Square Root (16 to 8 bit). For small arguments. * * * * (Exact method, not approximate). * * * * Call with: * * D0.W = Unsigned number. * * * * Returns: * * D0.W = SQRT(D0.W) * * * * Notes: Result fits in D0.B, but is valid in word. * * Takes from 72 (d0=1) to 504 (d0=625) cycles * * (including rts). * * * * Algorithm supplied by Motorola. * * * ************************************************************************* * Use the theorem that a perfect square is the sum of the first * sqrt(arg) number of odd integers. * Cycles move.w d1,-(sp) (8) move.w #-1,d1 (8) qsqrt1 addq.w #2,d1 (4) sub.w d1,d0 (4) bpl qsqrt1 (10/8) asr.w #1,d1 (8) move.w d1,d0 (4) move.w (sp)+,d1 (12) done rts (16) page ************************************************************************* * * * Integer Square Root (16 to 8 bit). * * * * (Exact method, not approximate). * * * * Call with: * * D0.W = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.W) * * * * Uses: D1-D4 as temporaries -- * * D1 = Error term; * * D2 = Running estimate; * * D3 = High bracket; * * D4 = Loop counter. * * * * Notes: Result fits in D0.B, but is valid in word. * * * * Takes from 544 to 624 cycles (including rts). * * * * Instruction times for branch-type instructions * * listed as (X/Y) are for (taken/not taken). * * * ************************************************************************* * Cycles gsqrt movem.l d1-d4,-(sp) (40) move.w #7,d4 (8) ; Loop count (bits-1 of result). clr.w d1 (4) ; Error term in D1. clr.w d2 (4) sqrt1 add.w d0,d0 (4) ; Get 2 leading bits a time and add addx.w d1,d1 (4) ; into Error term for interpolation. add.w d0,d0 (4) ; (Classical method, easy in binary). addx.w d1,d1 (4) add.w d2,d2 (4) ; Running estimate * 2. move.w d2,d3 (4) add.w d3,d3 (4) cmp.w d3,d1 (4) bls.s sqrt2 (10/8) ; New Error term > 2* Running estimate? addq.w #1,d2 (4) ; Yes, we want a '1' bit then. addq.w #1,d3 (4) ; Fix up new Error term. sub.w d3,d1 (4) sqrt2 dbra d4,sqrt1 (10/14) ; Do all 8 bit-pairs. move.w d2,d0 (4) movem.l (sp)+,d1-d4 (44) rts (16) page ************************************************************************* * * * Integer Square Root (32 to 16 bit). * * * * (Exact method, not approximate). * * * * Call with: * * D0.L = Unsigned number. * * * * Returns: * * D0.L = SQRT(D0.L) * * * * Uses: D1-D4 as temporaries -- * * D1 = Error term; * * D2 = Running estimate; * * D3 = High bracket; * * D4 = Loop counter. * * * * Notes: Result fits in D0.W, but is valid in longword. * * * * Takes from 1080 to 1236 cycles (including rts.) * * * * Two of the 16 passes are unrolled from the loop so that * * quicker instructions may be used where there is no * * danger of overflow (in the early passes). * * * * Instruction times for branch-type instructions * * listed as (X/Y) are for (taken/not taken). * * * ************************************************************************* * Cycles glsqrt movem.l d1-d4,-(sp) (40) moveq #13,d4 (4) ; Loop count (bits-1 of result). moveq #0,d1 (4) ; Error term in D1. moveq #0,d2 (4) lsqrt1 add.l d0,d0 (8) ; Get 2 leading bits a time and add addx.w d1,d1 (4) ; into Error term for interpolation. add.l d0,d0 (8) ; (Classical method, easy in binary). addx.w d1,d1 (4) add.w d2,d2 (4) ; Running estimate * 2. move.w d2,d3 (4) add.w d3,d3 (4) cmp.w d3,d1 (4) bls.s lsqrt2 (10/8) ; New Error term > 2* Running estimate? addq.w #1,d2 (4) ; Yes, we want a '1' bit then. addq.w #1,d3 (4) ; Fix up new Error term. sub.w d3,d1 (4) lsqrt2 dbra d4,lsqrt1 (10/14) ; Do first 14 bit-pairs. add.l d0,d0 (8) ; Do 15-th bit-pair. addx.w d1,d1 (4) add.l d0,d0 (8) addx.l d1,d1 (8) add.w d2,d2 (4) move.l d2,d3 (4) add.w d3,d3 (4) cmp.l d3,d1 (6) bls.s lsqrt3 (10/8) addq.w #1,d2 (4) addq.w #1,d3 (4) sub.l d3,d1 (8) lsqrt3 add.l d0,d0 (8) ; Do 16-th bit-pair. addx.l d1,d1 (8) add.l d0,d0 (8) addx.l d1,d1 (8) add.w d2,d2 (4) move.l d2,d3 (4) add.l d3,d3 (8) cmp.l d3,d1 (6) bls.s lsqrt4 (10/8) addq.w #1,d2 (4) lsqrt4 move.w d2,d0 (4) movem.l (sp)+,d1-d4 (44) rts (16) end -- +----------------+ ! II CCCCCC ! Jim Cathey ! II SSSSCC ! ISC-Bunker Ramo ! II CC ! TAF-C8; Spokane, WA 99220 ! IISSSS CC ! UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com) ! II CCCCCC ! (509) 927-5757 +----------------+ ------------------------------------------------------------------------------- /* * ISqrt-- * Find square root of n. This is a Newton's method approximation, * converging in 1 iteration per digit or so. Finds floor(sqrt(n) + 0.5). */ int ISqrt(register int n) { register int i; register long k0, k1, nn; for (nn = i = n, k0 = 2; i > 0; i >>= 2, k0 <<= 1) ; nn <<= 2; for (;;) { k1 = (nn / k0 + k0) >> 1; if (((k0 ^ k1) & ~1) == 0) break; k0 = k1; } return (int) ((k1 + 1) >> 1); } -- Joseph Nathan Hall | Joseph's Rule of Thumb: Most folks aren't interested Software Architect | in your rules of thumb. Gorca Systems Inc. | ----- joseph@joebloe.maple-shade.nj.us (home) ----- (on assignment) | (602) 732-2549 jnhall@sat.mot.com (work) ------------------------------------------------------------------------------- Here's an article I saved from c.s.m.p four months ago. Strangely, it was only distributed in North America. I don't know how fast it works in practice, but there are no multiplications or divisions in its inner loop, which is promising. #include /* * It requires more space to describe this implementation of the manual * square root algorithm than it did to code it. The basic idea is that * the square root is computed one bit at a time from the high end. Because * the original number is 32 bits (unsigned), the root cannot exceed 16 bits * in length, so we start with the 0x8000 bit. * * Let "x" be the value whose root we desire, "t" be the square root * that we desire, and "s" be a bitmask. A simple way to compute * the root is to set "s" to 0x8000, and loop doing the following: * * t = 0; * s = 0x8000; * do { * if ((t + s) * (t + s) <= x) * t += s; * s >>= 1; * while (s != 0); * * The primary disadvantage to this approach is the multiplication. To * eliminate this, we begin simplying. First, we observe that * * (t + s) * (t + s) == (t * t) + (2 * t * s) + (s * s) * * Therefore, if we redefine "x" to be the original argument minus the * current value of (t * t), we can determine if we should add "s" to * the root if * * (2 * t * s) + (s * s) <= x * * If we define a new temporary "nr", we can express this as * * t = 0; * s = 0x8000; * do { * nr = (2 * t * s) + (s * s); * if (nr <= x) { * x -= nr; * t += s; * } * s >>= 1; * while (s != 0); * * We can improve the performance of this by noting that "s" is always a * power of two, so multiplication by "s" is just a shift. Also, because * "s" changes in a predictable manner (shifted right after each iteration) * we can precompute (0x8000 * t) and (0x8000 * 0x8000) and then adjust * them by shifting after each step. First, we let "m" hold the value * (s * s) and adjust it after each step by shifting right twice. We * also introduce "r" to hold (2 * t * s) and adjust it after each step * by shifting right once. When we update "t" we must also update "r", * and we do so by noting that (2 * (old_t + s) * s) is the same as * (2 * old_t * s) + (2 * s * s). Noting that (s * s) is "m" and that * (r + 2 * m) == ((r + m) + m) == (nr + m): * * t = 0; * s = 0x8000; * m = 0x40000000; * r = 0; * do { * nr = r + m; * if (nr <= x) { * x -= nr; * t += s; * r = nr + m; * } * s >>= 1; * r >>= 1; * m >>= 2; * } while (s != 0); * * Finally, we note that, if we were using fractional arithmetic, after * 16 iterations "s" would be a binary 0.5, so the value of "r" when * the loop terminates is (2 * t * 0.5) or "t". Because the values in * "t" and "r" are identical after the loop terminates, and because we * do not otherwise use "t" explicitly within the loop, we can omit it. * When we do so, there is no need for "s" except to terminate the loop, * but we observe that "m" will become zero at the same time as "s", * so we can use it instead. * * The result we have at this point is the floor of the square root. If * we want to round to the nearest integer, we need to consider whether * the remainder in "x" is greater than or equal to the difference * between ((r + 0.5) * (r + 0.5)) and (r * r). Noting that the former * quantity is (r * r + r + 0.25), we want to check if the remainder is * greater than or equal to (r + 0.25). Because we are dealing with * integers, we can't have equality, so we round up if "x" is strictly * greater than "r": * * if (x > r) * r++; */ unsigned int isqrt(unsigned long x) { unsigned long r, nr, m; r = 0; m = 0x40000000; do { nr = r + m; if (nr <= x) { x -= nr; r = nr + m; } r >>= 1; m >>= 2; } while (m != 0); if (x > r) r++; return(r); } -- (Dr.) John Bruner, Deputy Director bruner@csrd.uiuc.edu Center for Supercomputing Research & Development (217) 244-4476 (voice) University of Illinois at Urbana-Champaign (217) 244-1351 (FAX) 465 CSRL, MC-264; 1308 West Main St.; Urbana, IL 61801-2307 -- Jamie McCarthy Internet: k044477@kzoo.edu AppleLink: j.mccarthy ------------------------------------------------------------------------------- Then there is the easy way - use the toolbox: #define LongSqrt(x) ((short)(FracSqrt((Fract)(x)) >> 15)) FractSqrt is in FixMath.h. This works because given two fixed point number of the form x.y, where x is the number of bits before the decimal point and y is the number of bits after the decimal point, twos complement multiplication will yield a result of: x1.y1 * x2.y2 = x1+x2.y1+y2. If the numbers are the same format, this becomes x.y * x.y = 2x.2y. This holds for squaring a number also: x.y^2 = 2x.2y. The sqrt is then sqrt(x.y) = x/2.y/2. FracSqrt is of the form FracSqrt(2.30) = 2.30. Since we know how sqrt works this can also be expressed as FracSqrt(x.y) = x/2.y/2 << 15 (This is a "virtual" shift since the calculation is compleated with more precision the 15 bits shifted in are "real" and not 0). So if you want to use FracSqrt for a long you get FracSqrt(32.0) = 16.0 << 15. So you shift the result down be 15 to get a short (you can add 1 << 14 prior to shifting down by 15 to round instead of truncing your result). If you wanted a long sqrt with a 16.16 (Fixed) result you would use: #define LongFixedSqrt(x) ((Fixed)(FracSqrt((Fract)(x)) << 1)) If you want a Fixed (16.16) sqrt with a Fixed result (16.16) you would use: #define FixedSqrt(x) ((Fixed)FracSqrt((Fract)(x)) >> 7)) You can do this kind of work with all of the fixed point routines and standard operators. Sean Parent -- Francois Pottier pottier@dmi.ens.fr ------------------------------------------------------------------------------ This area dedicated to the preservation of endangered species. "Moof!" ***** -- Peter N Lewis - Macintosh TCP fingerpainter FTP my programs from redback.cs.uwa.edu.au:Others/PeterLewis/ or amug.org:pub/peterlewis/ or nic.switch.ch:software/mac/peterlewis/