
Re: 64-bit subtract from vector unsigned int



On Tue, Apr 7, 2020 at 7:51 AM Jeffrey Walton <noloader@gmail.com> wrote:
>
> On Tue, Apr 7, 2020 at 5:51 AM Jeffrey Walton <noloader@gmail.com> wrote:
> >
> > Hi Everyone,
> >
> > I'm porting a 64-bit algorithm to 32-bit PowerPC (an old PowerMac).
> > The algorithm is simple when 64-bit is available, but it gets a little
> > ugly under 32-bit.
> > ...
> >
> > Here's what an "add with carry" looks like. The addc simply adds the
> > carry into the result after transposing the carry bits from columns 1
> > and 3 to columns 0 and 2.
> >
> > typedef __vector unsigned char uint8x16_p;
> > typedef __vector unsigned int uint32x4_p;
> > ...
> >
> > inline uint32x4_p VecAdd64(const uint32x4_p& vec1, const uint32x4_p& vec2)
> > {
> >     // 64-bit elements are available on POWER7 with VSX, but vaddudm requires POWER8
> > #if defined(_ARCH_PWR8)
> >     return (uint32x4_p)vec_add((uint64x2_p)vec1, (uint64x2_p)vec2);
> > #else
> >     const uint8x16_p cmask = {4,5,6,7, 16,16,16,16, 12,13,14,15, 16,16,16,16};
> >     const uint32x4_p zero = {0, 0, 0, 0};
> >
> >     uint32x4_p cy = vec_addc(vec1, vec2);
> >     cy = vec_perm(cy, zero, cmask);
> >     return vec_add(vec_add(vec1, vec2), cy);
> > #endif
> > }
>
> I think I found it... The complement of the carry was throwing me off.
> Subtract with borrow needs an extra vec_andc to un-complement the
> borrow:
>
>     const uint8x16_p bmask = {4,5,6,7, 16,16,16,16, 12,13,14,15, 16,16,16,16};
>     const uint32x4_p amask = {1, 1, 1, 1};
>     const uint32x4_p zero = {0, 0, 0, 0};
>
>     uint32x4_p bw = vec_subc(vec1, vec2);
>     bw = vec_andc(amask, bw);
>     bw = vec_perm(bw, zero, bmask);
>     return vec_sub(vec_sub(vec1, vec2), bw);

Sorry to dig up an old thread... I've been working with Steven Munroe,
a retired IBM engineer and the maintainer of pveclib
(https://github.com/munroesj52/pveclib). Munroe recommended avoiding
the load and permute and using a shift instead.

Here is an updated VecSub64 routine.

typedef __vector unsigned int uint32x4_p;
...

#if defined(__BIG_ENDIAN__)
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {0, 1, 0, 1};
#else
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {1, 0, 1, 0};
#endif

    uint32x4_p bw = vec_subc(vec1, vec2);
    uint32x4_p res = vec_sub(vec1, vec2);
    bw = vec_andc(mask, bw);
    bw = vec_sld (bw, zero, 4);
    return vec_sub(res, bw);

Jeff

