Path: nih-csl!darwin.sura.net!mlb.semi.harris.com!usenet.ufl.edu!gatech!swrinde!news.dell.com!tadpole.com!uunet!munnari.oz.au!news.uwa.edu.au!info.curtin.edu.au!ncrpda!rocky.curtin.edu.au!user
From: peter.lewis@info.curtin.edu.au (Peter N Lewis)
Newsgroups: alt.sources.mac,comp.sys.mac.programmer
Subject: Square Root code
Followup-To: comp.sys.mac.programmer
Date: Tue, 20 Sep 1994 10:10:48 +0800
Organization: NCRPDA, Curtin University
Lines: 519
Message-ID: <peter.lewis-2009941010480001@rocky.curtin.edu.au>
References: <t.griffith-1509941559300001@17.202.32.50> <358r4b$1qs@search01.news.aol.com> <1994Sep19.120554.24684@cc.ic.ac.uk> <giles-1909940904060001@wiggin.med.cornell.edu>
NNTP-Posting-Host: ncrpda.curtin.edu.au

>Well, assuming that you're just comparing distances, the most obvious
>thing you could do is just calculate (dist.h*dist.h + dist.v*dist.v),

Correct, the fastest square root is a NOP, but if you actually need the
real distance, here's lots of different choices of code:
   Peter.

*****
Path:
ncrpda!info.curtin.edu.au!news.uwa.edu.au!harbinger.cc.monash.edu.au!msuinfo!agate!ihnp4.ucsd.edu!sdd.hp.com!nigel.msen.com!zib-berlin.de!news.rrz.uni-hamburg.de!news.dkrz.de!news.dfn.de!scsing.switch.ch!swidir.switch.ch!univ-lyon1.fr!jussieu.fr!nef.ens.fr!corvette!pottier
From: pottier@corvette.ens.fr (Francois Pottier)
Newsgroups: comp.sys.mac.programmer
Subject: Re: Faster Square Root Algorithm
Date: 4 May 1994 09:42:29 GMT
Organization: Ecole Normale Superieure, PARIS, France
Lines: 494
Message-ID: <2q7qm5$9uo@nef.ens.fr>
References: <busfield.767985832@hurricane>
NNTP-Posting-Host: corvette.ens.fr

In article <busfield.767985832@hurricane>,
John D. Busfield <busfield@hurricane.seas.ucla.edu> wrote:

>    Does anyone have an algorithm or code for computing square roots.


There was a thread on this subject about a year ago. Here's what
I kept:

-------------------------------------------------------------------------


>Does somebody have code or an algorithm for extracting a long integer
>square root and returning an integer?

There was a long series of letters in Dr.  Dobbs' Journal a few years
back that I was a part of.  Here are two 'competing' subroutines for the
68000 that I wrote.  One is a Newton-Raphson iterator, the other a
hybrid of three different subroutines using two different methods,
heavily optimized for speed.  The non-Newton one is faster in some cases
and slower in others.  Choosing one as 'best' depends on what kind of
input ranges you expect to see most often.  Newton's time doesn't vary
much on the average based on the input argument.  The non-Newton one
ranges from a lot better (small arguments) to a little worse (large
arguments), but is always better than Newton's worst case.  Don't
neglect the fact that on a FPU-equipped machine it may be faster to use
floating point.  I've done no research into this possibility, nor have I
timed this stuff on any 68020+ system.  The speed balance is no doubt
different then.  (For that matter, the 68000 testbed had no wait states
nor interrupt system overhead, which doesn't necessarily apply to a
68000 PowerBook, and certainly doesn't apply to a Mac+ or earlier.)

I'm personally fond of the non-Newton version, because the algorithm
only uses shifts and adds, so it could be implemented in microcode with
about same speed as a divide.

*************************************************************************
*                                                                       *
*       Integer Square Root (32 to 16 bit).                             *
*                                                                       *
*       (Newton-Raphson method).                                        *
*                                                                       *
*       Call with:                                                      *
*               D0.L = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.L)                                       *
*                                                                       *
*       Notes:  Result fits in D0.W, but is valid in longword.          *
*               Takes from 338 cycles (1 shift, 1 division) to          *
*               1580 cycles (16 shifts, 4 divisions)  (including rts).  *
*               Averages 854 cycles measured over first 65535 roots.    *
*               Averages 992 cycles measured over first 500000 roots.   *
*                                                                       *
*************************************************************************

        .globl lsqrt
*                       Cycles
lsqrt   movem.l d1-d2,-(sp) (24)
        move.l d0,d1    (4)     ; Set up for guessing algorithm.
        beq.s return    (10/8)  ; Don't process zero.
        moveq #1,d2     (4)

guess   cmp.l d2,d1     (6)     ; Get a guess that is guaranteed to be
        bls.s newton    (10/8)  ; too high, but not by much, by dividing the
        add.l d2,d2     (8)     ; argument by two and multiplying a 1 by 2
        lsr.l #1,d1     (10)    ; until the power of two passes the modified
        bra.s guess     (10)    ; argument, then average these two numbers.

newton  add.l d1,d2     (8)     ; Average the two guesses.
        lsr.l #1,d2     (10)
        move.l d0,d1    (4)     ; Generate the next approximation(s)
        divu d2,d1      (140)   ; via the Newton-Raphson method.
        bvs.s done      (10/8)  ; Handle out-of-range input (cheats!)
        cmp.w d1,d2     (4)     ; Have we converged?
        bls.s done      (10/8)
        swap d1         (4)     ; No, kill the remainder so the
        clr.w d1        (4)     ; next average comes out right.
        swap d1         (4)
        bra.s newton    (10)

done    clr.w d0        (4)     ; Return a word answer in a longword.
        swap d0         (4)
        move.w d2,d0    (4)
return  movem.l (sp)+,d1-d2  (28)
        rts             (16)

        end
*************************************************************************
*                                                                       *
*       Integer Square Root (32 to 16 bit).                             *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.L = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.L)                                       *
*                                                                       *
*       Notes:  Result fits in D0.W, but is valid in longword.          *
*               Takes from 122 to 1272 cycles (including rts).          *
*               Averages 610 cycles measured over first 65535 roots.    *
*               Averages 1104 cycles measured over first 500000 roots.  *
*                                                                       *
*************************************************************************

        .globl lsqrt
*                       Cycles
lsqrt   tst.l d0        (4)     ; Skip doing zero.
        beq.s done      (10/8)
        cmp.l #$10000,d0 (14)   ; If is a longword, use the long routine.
        bhs.s glsqrt    (10/8)
        cmp.w #625,d0   (8)     ; Would the short word routine be quicker?
        bhi.s gsqrt     (10/8)  ; No, use general purpose word routine.
*                               ; Otherwise fall into special routine.
*
*  For speed, we use three exit points.
*  This is cheesy, but this is a speed-optimized subroutine!

        page
*************************************************************************
*                                                                       *
*       Faster Integer Square Root (16 to 8 bit).  For small arguments. *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.W = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.W = SQRT(D0.W)                                       *
*                                                                       *
*       Notes:  Result fits in D0.B, but is valid in word.              *
*               Takes from 72 (d0=1) to 504 (d0=625) cycles             *
*               (including rts).                                        *
*                                                                       *
*       Algorithm supplied by Motorola.                                 *
*                                                                       *
*************************************************************************

* Use the theorem that a perfect square is the sum of the first
* sqrt(arg) number of odd integers.

*                       Cycles
        move.w d1,-(sp) (8)
        move.w #-1,d1   (8)
qsqrt1  addq.w #2,d1    (4)
        sub.w d1,d0     (4)
        bpl qsqrt1      (10/8)
        asr.w #1,d1     (8)
        move.w d1,d0    (4)
        move.w (sp)+,d1 (12)
done    rts             (16)

        page
*************************************************************************
*                                                                       *
*       Integer Square Root (16 to 8 bit).                              *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.W = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.W)                                       *
*                                                                       *
*       Uses:   D1-D4 as temporaries --                                 *
*               D1 = Error term;                                        *
*               D2 = Running estimate;                                  *
*               D3 = High bracket;                                      *
*               D4 = Loop counter.                                      *
*                                                                       *
*       Notes:  Result fits in D0.B, but is valid in word.              *
*                                                                       *
*               Takes from 544 to 624 cycles (including rts).           *
*                                                                       *
*               Instruction times for branch-type instructions          *
*               listed as (X/Y) are for (taken/not taken).              *
*                                                                       *
*************************************************************************

*                       Cycles
gsqrt   movem.l d1-d4,-(sp) (40)
        move.w #7,d4    (8)     ; Loop count (bits-1 of result).
        clr.w d1        (4)     ; Error term in D1.
        clr.w d2        (4)
sqrt1   add.w d0,d0     (4)     ; Get 2 leading bits a time and add
        addx.w d1,d1    (4)     ; into Error term for interpolation.
        add.w d0,d0     (4)     ; (Classical method, easy in binary).
        addx.w d1,d1    (4)
        add.w d2,d2     (4)     ; Running estimate * 2.
        move.w d2,d3    (4)
        add.w d3,d3     (4)
        cmp.w d3,d1     (4)
        bls.s sqrt2     (10/8)  ; New Error term > 2* Running estimate?
        addq.w #1,d2    (4)     ; Yes, we want a '1' bit then.
        addq.w #1,d3    (4)     ; Fix up new Error term.
        sub.w d3,d1     (4)
sqrt2   dbra d4,sqrt1   (10/14) ; Do all 8 bit-pairs.
        move.w d2,d0    (4)
        movem.l (sp)+,d1-d4 (44)
        rts             (16)

        page
*************************************************************************
*                                                                       *
*       Integer Square Root (32 to 16 bit).                             *
*                                                                       *
*       (Exact method, not approximate).                                *
*                                                                       *
*       Call with:                                                      *
*               D0.L = Unsigned number.                                 *
*                                                                       *
*       Returns:                                                        *
*               D0.L = SQRT(D0.L)                                       *
*                                                                       *
*       Uses:   D1-D4 as temporaries --                                 *
*               D1 = Error term;                                        *
*               D2 = Running estimate;                                  *
*               D3 = High bracket;                                      *
*               D4 = Loop counter.                                      *
*                                                                       *
*       Notes:  Result fits in D0.W, but is valid in longword.          *
*                                                                       *
*               Takes from 1080 to 1236 cycles (including rts.)         *
*                                                                       *
*               Two of the 16 passes are unrolled from the loop so that *
*               quicker instructions may be used where there is no      *
*               danger of overflow (in the early passes).               *
*                                                                       *
*               Instruction times for branch-type instructions          *
*               listed as (X/Y) are for (taken/not taken).              *
*                                                                       *
*************************************************************************

*                       Cycles
glsqrt  movem.l d1-d4,-(sp) (40)
        moveq #13,d4    (4)     ; Loop count (bits-1 of result).
        moveq #0,d1     (4)     ; Error term in D1.
        moveq #0,d2     (4)
lsqrt1  add.l d0,d0     (8)     ; Get 2 leading bits a time and add
        addx.w d1,d1    (4)     ; into Error term for interpolation.
        add.l d0,d0     (8)     ; (Classical method, easy in binary).
        addx.w d1,d1    (4)
        add.w d2,d2     (4)     ; Running estimate * 2.
        move.w d2,d3    (4)
        add.w d3,d3     (4)
        cmp.w d3,d1     (4)
        bls.s lsqrt2    (10/8)  ; New Error term > 2* Running estimate?
        addq.w #1,d2    (4)     ; Yes, we want a '1' bit then.
        addq.w #1,d3    (4)     ; Fix up new Error term.
        sub.w d3,d1     (4)
lsqrt2  dbra d4,lsqrt1  (10/14) ; Do first 14 bit-pairs.

        add.l d0,d0     (8)     ; Do 15-th bit-pair.
        addx.w d1,d1    (4)
        add.l d0,d0     (8)
        addx.l d1,d1    (8)
        add.w d2,d2     (4)
        move.l d2,d3    (4)
        add.w d3,d3     (4)
        cmp.l d3,d1     (6)
        bls.s lsqrt3    (10/8)
        addq.w #1,d2    (4)
        addq.w #1,d3    (4)
        sub.l d3,d1     (8)

lsqrt3  add.l d0,d0     (8)     ; Do 16-th bit-pair.
        addx.l d1,d1    (8)
        add.l d0,d0     (8)
        addx.l d1,d1    (8)
        add.w d2,d2     (4)
        move.l d2,d3    (4)
        add.l d3,d3     (8)
        cmp.l d3,d1     (6)
        bls.s lsqrt4    (10/8)
        addq.w #1,d2    (4)
lsqrt4  move.w d2,d0    (4)
        movem.l (sp)+,d1-d4 (44)
        rts             (16)

        end
-- 
+----------------+
! II      CCCCCC !  Jim Cathey
! II  SSSSCC     !  ISC-Bunker Ramo
! II      CC     !  TAF-C8;  Spokane, WA  99220
! IISSSS  CC     !  UUCP: uunet!isc-br!jimc (jimc@isc-br.isc-br.com)
! II      CCCCCC !  (509) 927-5757
+----------------+

-------------------------------------------------------------------------------

/*
 * ISqrt--
 *     Find square root of n.  This is a Newton's method approximation,
 * converging in 1 iteration per digit or so.  Finds floor(sqrt(n) + 0.5).
 */
int ISqrt(register int n)
{
    register int i;
    register long k0, k1, nn;

    for (nn = i = n, k0 = 2; i > 0; i >>= 2, k0 <<= 1)
        ;
    nn <<= 2;
    for (;;) {
        k1 = (nn / k0 + k0) >> 1;
        if (((k0 ^ k1) & ~1) == 0)
            break;
        k0 = k1;
    }
    return (int) ((k1 + 1) >> 1);       
}

-- 
Joseph Nathan Hall  |  Joseph's Rule of Thumb: Most folks aren't interested
Software Architect  |                          in your rules of thumb.
Gorca Systems Inc.  |  ----- joseph@joebloe.maple-shade.nj.us (home) -----
(on assignment)     |        (602) 732-2549 jnhall@sat.mot.com (work)

-------------------------------------------------------------------------------

Here's an article I saved from c.s.m.p four months ago.  Strangely, it
was only distributed in North America.  I don't know how fast it works
in practice, but there are no multiplications or divisions in its inner
loop, which is promising.


#include <math.h>
 
/*
 * It requires more space to describe this implementation of the manual
 * square root algorithm than it did to code it.  The basic idea is that
 * the square root is computed one bit at a time from the high end.  Because
 * the original number is 32 bits (unsigned), the root cannot exceed 16 bits
 * in length, so we start with the 0x8000 bit.
 *
 * Let "x" be the value whose root we desire, "t" be the square root
 * that we desire, and "s" be a bitmask.  A simple way to compute
 * the root is to set "s" to 0x8000, and loop doing the following:
 *
 *      t = 0;
 *      s = 0x8000;
 *      do {
 *              if ((t + s) * (t + s) <= x)
 *                      t += s;
 *              s >>= 1;
 *      while (s != 0);
 *
 * The primary disadvantage to this approach is the multiplication.  To
 * eliminate this, we begin simplying.  First, we observe that
 *
 *      (t + s) * (t + s) == (t * t) + (2 * t * s) + (s * s)
 *
 * Therefore, if we redefine "x" to be the original argument minus the
 * current value of (t * t), we can determine if we should add "s" to
 * the root if
 *
 *      (2 * t * s) + (s * s) <= x
 *
 * If we define a new temporary "nr", we can express this as
 *
 *      t = 0;
 *      s = 0x8000;
 *      do {
 *              nr = (2 * t * s) + (s * s);
 *              if (nr <= x) {
 *                      x -= nr;
 *                      t += s;
 *              }
 *              s >>= 1;
 *      while (s != 0);
 *
 * We can improve the performance of this by noting that "s" is always a
 * power of two, so multiplication by "s" is just a shift.  Also, because
 * "s" changes in a predictable manner (shifted right after each iteration)
 * we can precompute (0x8000 * t) and (0x8000 * 0x8000) and then adjust
 * them by shifting after each step.  First, we let "m" hold the value
 * (s * s) and adjust it after each step by shifting right twice.  We
 * also introduce "r" to hold (2 * t * s) and adjust it after each step
 * by shifting right once.  When we update "t" we must also update "r",
 * and we do so by noting that (2 * (old_t + s) * s) is the same as
 * (2 * old_t * s) + (2 * s * s).  Noting that (s * s) is "m" and that
 * (r + 2 * m) == ((r + m) + m) == (nr + m):
 *
 *      t = 0;
 *      s = 0x8000;
 *      m = 0x40000000;
 *      r = 0;
 *      do {
 *              nr = r + m;
 *              if (nr <= x) {
 *                      x -= nr;
 *                      t += s;
 *                      r = nr + m;
 *              }
 *              s >>= 1;
 *              r >>= 1;                
 *              m >>= 2;
 *      } while (s != 0);
 *
 * Finally, we note that, if we were using fractional arithmetic, after
 * 16 iterations "s" would be a binary 0.5, so the value of "r" when
 * the loop terminates is (2 * t * 0.5) or "t".  Because the values in
 * "t" and "r" are identical after the loop terminates, and because we
 * do not otherwise use "t"  explicitly within the loop, we can omit it.
 * When we do so, there is no need for "s" except to terminate the loop,
 * but we observe that "m" will become zero at the same time as "s",
 * so we can use it instead.
 *
 * The result we have at this point is the floor of the square root.  If
 * we want to round to the nearest integer, we need to consider whether
 * the remainder in "x" is greater than or equal to the difference
 * between ((r + 0.5) * (r + 0.5)) and (r * r).  Noting that the former
 * quantity is (r * r + r + 0.25), we want to check if the remainder is
 * greater than or equal to (r + 0.25).  Because we are dealing with
 * integers, we can't have equality, so we round up if "x" is strictly
 * greater than "r":
 *
 *      if (x > r)
 *              r++;
 */
 
unsigned int isqrt(unsigned long x)
{
        unsigned long r, nr, m;
 
        r = 0;
        m = 0x40000000;
        do {
                nr = r + m;
                if (nr <= x) {
                        x -= nr;
                        r = nr + m;
                }
                r >>= 1;
                m >>= 2;
        } while (m != 0);
 
        if (x > r)
                r++;
        return(r);
}
--
(Dr.) John Bruner, Deputy Director                      bruner@csrd.uiuc.edu
Center for Supercomputing Research & Development        (217) 244-4476 (voice)
University of Illinois at Urbana-Champaign              (217) 244-1351 (FAX)
465 CSRL, MC-264; 1308 West Main St.; Urbana, IL  61801-2307


-- 
 Jamie McCarthy         Internet: k044477@kzoo.edu      AppleLink: j.mccarthy

-------------------------------------------------------------------------------

Then there is the easy way - use the toolbox:

#define LongSqrt(x) ((short)(FracSqrt((Fract)(x)) >> 15))

FractSqrt is in FixMath.h. This works because given two fixed point number 
of the form x.y, where x is the number of bits before the decimal point 
and y is the number of bits after the decimal point, twos complement 
multiplication will yield a result of: x1.y1 * x2.y2 = x1+x2.y1+y2. If the 
numbers are the same format, this becomes x.y * x.y = 2x.2y. This holds 
for squaring a number also: x.y^2 = 2x.2y. The sqrt is then sqrt(x.y) = 
x/2.y/2. FracSqrt is of the form FracSqrt(2.30) = 2.30. Since we know how 
sqrt works this can also be expressed as FracSqrt(x.y) = x/2.y/2 << 15 
(This is a "virtual" shift since the calculation is compleated with more 
precision the 15 bits shifted in are "real" and not 0). So if you want to 
use FracSqrt for a long you get FracSqrt(32.0) = 16.0 << 15. So you shift 
the result down be 15 to get a short (you can add 1 << 14 prior to 
shifting down by 15 to round instead of truncing your result). If you 
wanted a long sqrt with a 16.16 (Fixed) result you would use:

#define LongFixedSqrt(x) ((Fixed)(FracSqrt((Fract)(x)) << 1))

If you want a Fixed (16.16) sqrt with a Fixed result (16.16) you would use:

#define FixedSqrt(x) ((Fixed)FracSqrt((Fract)(x)) >> 7))

You can do this kind of work with all of the fixed point routines and 
standard operators.

Sean Parent


-- 
Francois Pottier                                            pottier@dmi.ens.fr
------------------------------------------------------------------------------
This area dedicated to the preservation of endangered species.         "Moof!"
*****
-- 
Peter N Lewis <peter.lewis@info.curtin.edu.au> - Macintosh TCP fingerpainter
FTP my programs from redback.cs.uwa.edu.au:Others/PeterLewis/ or
amug.org:pub/peterlewis/ or nic.switch.ch:software/mac/peterlewis/