Problem

The full description:

Reference operator:
torch.lcm

Operator input value ranges:
N ∈ [1, 200]
N2 ∈ [1, 4000]
N3 ∈ [1, 3000]

Input characteristics:
Input x2 should have the same shape as x1; if it differs, the mismatched dimensions have size 1 and can be broadcast to x1's shape, so broadcasting must be implemented.
N through N3 may not be integer multiples of 32, so non-aligned cases must be handled.
int64 values can exceed the int32 range; do not use Cast to force int64 down to int32.

The reference implementation is torch.lcm.

Analysis

The computation itself

The basic computation is $\mathrm{lcm}(x, y) = x \cdot y / \gcd(x, y)$; for example, $\mathrm{lcm}(4, 6) = 24 / 2 = 12$.

The multiplication part can be done with the vector interface.

The key is computing $\gcd$. The only real option is a Euclidean-style algorithm, and the sequence of steps it takes is different for every pair $(x, y)$, so it is hard to express with the vector interface; the values have to be computed one by one.
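
As a reference point, here is a minimal scalar version in plain C++ (gcd_euclid and lcm_scalar are just illustrative names, not part of the operator code):

#include <cstdint>

// Scalar Euclidean gcd used as the per-element baseline; inputs are non-negative.
static int64_t gcd_euclid(int64_t a, int64_t b) {
    while (b != 0) {
        int64_t r = a % b;  // one remainder step of the Euclidean algorithm
        a = b;
        b = r;
    }
    return a;
}

// lcm via lcm(x, y) = |x| / gcd(|x|, |y|) * |y|; dividing first delays overflow.
static int64_t lcm_scalar(int64_t x, int64_t y) {
    if (x == 0 || y == 0) return 0;
    int64_t ax = x < 0 ? -x : x;
    int64_t ay = y < 0 ? -y : y;
    return ax / gcd_euclid(ax, ay) * ay;
}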

Broadcasting

There is actually another hidden difficulty: broadcasting. The two tensors may have different shapes and the computation can still be well defined. Broadcasting is allowed when the shapes are compatible: wherever x has some size in a dimension, y either has size 1 in that dimension or does not have that dimension at all.

Broadcasting can be implemented in two ways. One is to hand everything to a single core: that core sees all of the data (no tiling), enumerates the index of every dimension of the result tensor, i.e. walks over the whole result tensor, and uses the broadcast rules to work out which position of x and which position of y each result position is computed from.

The upside is simplicity; the downside is that it is slow, since it makes no use of multiple cores.

The other implementation tiles the result tensor: each core is assigned an interval of the result tensor, we enumerate the indices in that interval, and again compute for each position which positions of x and y it comes from.

(In theory there is a third option: tile x and y themselves, i.e. work out which intervals of x and y each core needs and hand those out, but that is too hard to compute.)
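
Either way, the core step is mapping a linear index in the result tensor back to indices into the two inputs. A minimal host-side sketch of that mapping (map_broadcast_index is a hypothetical helper; shapes are assumed right-aligned and padded with 1s to the same rank, which is how the tiling code below prepares n1/n2/ny):

#include <cstdint>

// Map a linear index into the broadcast result tensor back to linear indices
// into x1 and x2.  A broadcast dimension has size 1 in the corresponding input,
// so the modulo clamps its coordinate to 0.
static void map_broadcast_index(uint32_t linear, int rank,
                                const uint32_t* ny,   // result shape
                                const uint32_t* n1,   // x1 shape, padded with 1s
                                const uint32_t* n2,   // x2 shape, padded with 1s
                                uint32_t& idx1, uint32_t& idx2) {
    uint32_t coord[8];                        // per-dimension coordinate in the result
    for (int j = rank - 1; j >= 0; --j) {     // peel off dimensions, innermost first
        coord[j] = linear % ny[j];
        linear /= ny[j];
    }
    idx1 = 0;
    idx2 = 0;
    for (int j = 0; j < rank; ++j) {          // rebuild the input linear indices
        idx1 = idx1 * n1[j] + coord[j] % n1[j];
        idx2 = idx2 * n2[j] + coord[j] % n2[j];
    }
}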

Overflow

$x \cdot y$ can overflow here, and this processor's overflow behavior is not quite the same as C++'s or torch's, so you have to experiment by hand to find the rule and then special-case it. Concretely, for int32 and int64 you let the product overflow naturally and then take the absolute value.
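
For illustration only (the exact on-device wraparound rule has to be found by experiment, as noted above), here is a host-side sketch of "let it wrap, then take the absolute value" for int32. Unsigned arithmetic is used so the wraparound is well defined in standard C++; lcm_i32_wrapped is a hypothetical helper, not part of the operator code:

#include <cstdint>

// "Wrap naturally, then take the absolute value": compute |x|/g*|y| modulo 2^32,
// reinterpret it as signed, and negate if the wrapped result is negative.
static int32_t lcm_i32_wrapped(int32_t x, int32_t y, int32_t g /* gcd(|x|, |y|) */) {
    uint32_t ax = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;
    uint32_t ay = y < 0 ? 0u - (uint32_t)y : (uint32_t)y;
    int32_t ans = (int32_t)(ax / (uint32_t)g * ay);  // may wrap into a negative value
    return ans < 0 ? -ans : ans;                     // absolute value after the wrap
}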

Optimization

For the gcd, the Euclidean algorithm does a lot of modulo operations, and modulo is genuinely slow: it costs an order of magnitude more clock cycles than a multiplication or an addition. To cut down on modulo operations, is there a gcd algorithm that needs no modulo at all? There is.

It is the binary gcd (Stein's algorithm), which uses nothing but branches and bit operations. See this blog post for details.

A rough implementation looks like this:

// Binary gcd (Stein's algorithm); assumes a > 0 and b > 0.
inline int gcd(int a, int b)
{
	register int az=__builtin_ctz(a),bz=__builtin_ctz(b),z=az>bz?bz:az,diff; // z = shared power of two
	b>>=bz;                        // make b odd
	while(a)
	{
		a>>=az;                // make a odd
		diff=b-a;
		az=__builtin_ctz(diff);
		if(a<b)b=a;            // b = min(a, b)
		a=diff<0?-diff:diff;   // a = |a - b|
	}
	return b<<z;                   // restore the shared power of two
}

The __builtin functions here are optimized intrinsics: $O(1)$ rather than $O(\log)$.

And since Ascend C inherits much of the C standard library, this works there as well (the kernels below use the equivalent Ascend C intrinsics ScalarGetSFFValue and ScalarCountLeadingZero).

Also, for a workload with many gcd queries, a classic optimization is a lookup table: precompute the answers for small $(x, y)$ pairs.

An open-source implementation found online goes one step further. It does not store the precomputed $\gcd(x, y)$ in a table gcd[x][y], because that would still be fairly large. Instead it uses bit tricks, and the table is just 64 uint64 words (one per value from 1 to 64; the pre[] array in the kernels below): bit $d-1$ of the entry for $x$ is set iff $d$ divides $x$, so the table still encodes the full $64 \times 64$ information, and a lookup is just an AND of two masks followed by finding the highest set bit.
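
A minimal sketch of how such a table can be built and queried on the host (build_gcd_table and gcd_small are hypothetical names; building it this way reproduces the pre[] constants used in the kernels below, which use 64 - ScalarCountLeadingZero(pre[a] & pre[b]) instead of __builtin_clzll):

#include <cstdint>

// Bit (d-1) of table[x] is set iff d divides x, for 1 <= d <= 64.
static void build_gcd_table(uint64_t table[65]) {
    for (int x = 0; x <= 64; ++x) {
        table[x] = 0;
        for (int d = 1; d <= x; ++d) {
            if (x % d == 0) table[x] |= 1ull << (d - 1);
        }
    }
}

// gcd of two values in [1, 64]: the common set bits are the common divisors,
// and the highest one is the gcd.
static int gcd_small(const uint64_t table[65], int a, int b) {
    uint64_t common = table[a] & table[b];   // never zero: bit 0 (divisor 1) is always set
    return 64 - __builtin_clzll(common);     // 1-based position of the highest set bit
}

For example, gcd_small(table, 12, 18) returns 6.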

Implementation with tiling

op_kernel/lcm.cpp

#include "kernel_operator.h"
using namespace AscendC;
constexpr uint64_t pre[65] = {0, 1ull,3ull,5ull,11ull,17ull,39ull,65ull,139ull,261ull,531ull,1025ull,2095ull,4097ull,8259ull,16405ull,32907ull,65537ull,131367ull,262145ull,524827ull,1048645ull,2098179ull,4194305ull,8390831ull,16777233ull,33558531ull,67109125ull,134225995ull,268435457ull,536887863ull,1073741825ull,2147516555ull,4294968325ull,8590000131ull,17179869265ull,34359871791ull,68719476737ull,137439215619ull,274877911045ull,549756338843ull,1099511627777ull,2199024312423ull,4398046511105ull,8796095120395ull,17592186061077ull,35184376283139ull,70368744177665ull,140737496778927ull,281474976710721ull,562949970199059ull,1125899906908165ull,2251799847243787ull,4503599627370497ull,9007199321981223ull,18014398509483025ull,36028797153190091ull,72057594038190085ull,144115188344291331ull,288230376151711745ull,576460752840837695ull,1152921504606846977ull,2305843010287435779ull,4611686018428436805ull,9223372039002292363ull};


template<typename T> class BruteForce {
public:
    __aicore__ inline BruteForce() {}
    __aicore__ inline void Init(GM_ADDR x1, GM_ADDR x2, GM_ADDR y, uint32_t n1[3], uint32_t n2[3], uint32_t ny[3], uint32_t size, uint32_t length,uint32_t abs_flag,uint32_t dim_num) {
        //ASSERT(GetBlockNum() != 0 && "block dim can not be zero!");
        uint32_t total1 = 1, total2 = 1, totaly = 1;
        for (int i = 0; i < dim_num; ++i) {
            this->n1[i] = n1[i];
            this->n2[i] = n2[i];
            this->ny[i] = ny[i];
            total1 *= n1[i];
            total2 *= n2[i];
            totaly *= ny[i];
        }

        x1Gm.SetGlobalBuffer((__gm__ T*)x1, total1);
        x2Gm.SetGlobalBuffer((__gm__ T*)x2, total2);
        
        const unsigned num_cores = GetBlockNum();
        unsigned L = GetBlockIdx() * length;
        unsigned R = (GetBlockIdx() + 1) * length;
        if (R > size) {
            R = size;
        }
        this->L = L;
        this->R = R ;
        yGm.SetGlobalBuffer((__gm__ T*)y, totaly);
        
        this->abs_flag=abs_flag;
        this->dim_num=dim_num;
    }
    __aicore__ inline void Process() {
        if (L >= R) return;  // empty interval, nothing to do

        // --------------------------
        // 1. Initialization: compute the starting multi-dim index once
        //    (the only place division/modulo is used)
        // --------------------------
        uint32_t indices[3];  // per-dimension index (up to 3 dims)
        uint32_t temp = L;
        // compute the multi-dim index corresponding to L
        for (int j = dim_num-1; j >= 0; --j) {
            indices[j] = temp % ny[j];
            temp = temp / ny[j];
        }

        // compute the initial idx1 and idx2 (indices into x1 and x2)
        uint32_t idx1 = 0, idx2 = 0;
        for (int j = 0; j < dim_num; ++j) {
            idx1 = idx1 * n1[j] + (indices[j] % n1[j]);
            idx2 = idx2 * n2[j] + (indices[j] % n2[j]);
        }

        // --------------------------
        // 2. Handle the first element (i = L)
        // --------------------------
        int64_t a = x1Gm.GetValue(idx1);
        int64_t b = x2Gm.GetValue(idx2);
        if (a == 0 || b == 0) {
            yGm.SetValue(L, 0);
        } else {
            a = (a > 0 ? a : -a);
            b = (b > 0 ? b : -b);
            int64_t aa = a, bb = b;
            int64_t shift = ScalarGetSFFValue<1>(a | b);
            a >>= ScalarGetSFFValue<1>(a);
            do {
                b >>= ScalarGetSFFValue<1>(b);
                if (a <= 64 && b <= 64) {
                    a = 64 - ScalarCountLeadingZero(pre[a] & pre[b]);
                    break;
                }
                if (a > b) {
                    a ^= b ^= a ^= b;
                }
                b -= a;
            } while (b);
            a = (a << shift);
            T ans = aa / a * bb;
            if (ans < 0 && abs_flag) {
                ans = -ans;
            }
            yGm.SetValue(L, ans);
        }

        // --------------------------
        // 3. Handle the remaining elements (i = L+1 .. R-1): incremental
        //    updates using only additions/multiplications
        // --------------------------
        for (int i = L + 1; i < R; ++i) {
            // --------------------------
            // 3.1 Incrementally update indices (add + carry, no division/modulo)
            // --------------------------
            int carry = 1;  // add 1 at the last dimension (i has increased by 1)
            for (int j = dim_num-1; j >= 0 && carry; --j) {
                indices[j] += carry;  // bump the current dimension
                if (indices[j] >= ny[j]) {
                    // overflow: wrap to zero and carry into the next dimension
                    indices[j] = 0;
                    carry = 1;
                } else {
                    // no overflow: stop carrying
                    carry = 0;
                }
            }


            uint32_t idx1 = 0, idx2 = 0;

            for (int j = 0; j < dim_num; ++j) {
                int ind=indices[j];
                idx1 = idx1 * n1[j] + (ind>=n1[j]?ind%n1[j]:ind);  // broadcast mapping into x1
                idx2 = idx2 * n2[j] + (ind>=n2[j]?ind%n2[j]:ind);  // broadcast mapping into x2
            }

            // --------------------------
            // 3.2 Compute and write the current element (same logic as above)
            // --------------------------
            a = x1Gm.GetValue(idx1);
            b = x2Gm.GetValue(idx2);
            if (a == 0 || b == 0) {
                yGm.SetValue(i, 0);
            } else {
                a = (a > 0 ? a : -a);
                b = (b > 0 ? b : -b);
                int64_t aa = a, bb = b;
                int64_t shift = ScalarGetSFFValue<1>(a | b);
                a >>= ScalarGetSFFValue<1>(a);
                do {
                    b >>= ScalarGetSFFValue<1>(b);
                    if (a <= 64 && b <= 64) {
                        a = 64 - ScalarCountLeadingZero(pre[a] & pre[b]);
                        break;
                    }
                    if (a > b) {
                        a ^= b ^= a ^= b;
                    }
                    b -= a;
                } while (b);
                a = (a << shift);
                T ans = aa / a * bb;
                if (ans < 0 && abs_flag) {
                    ans = -ans;
                }
                yGm.SetValue(i, ans);
            }
        }
    }

private:
    GlobalTensor<T> x1Gm, x2Gm, yGm;
    uint32_t n1[3], n2[3], ny[3];
    uint32_t L, R;
    uint32_t abs_flag;
    uint32_t dim_num;
};
template<typename T> class LCMKernalFast {
    public:
        __aicore__ inline LCMKernalFast() {}
        __aicore__ inline void Init(GM_ADDR x1, GM_ADDR x2, GM_ADDR y, uint32_t size, uint32_t length,uint32_t abs_flag) {
            //ASSERT(GetBlockNum() != 0 && "block dim can not be zero!");
            const unsigned num_cores = GetBlockNum();
            unsigned L = GetBlockIdx() * length;
            unsigned R = (GetBlockIdx() + 1) * length;
            if (R > size) {
                R = size;
            }
            this->L = 0;
            this->R = R - L;
            x1Gm.SetGlobalBuffer((__gm__ T*)x1 + L, length);
            x2Gm.SetGlobalBuffer((__gm__ T*)x2 + L, length);
            yGm.SetGlobalBuffer((__gm__ T*)y + L, length);
            
            this->abs_flag=abs_flag;
        }
        __aicore__ inline void Process() {
            for (int i = L; i < R; ++i) {
                int64_t a = x1Gm.GetValue(i);
                int64_t b = x2Gm.GetValue(i);
                if (a == 0||b==0){
                    yGm.SetValue(i, 0);
                }
                else {
                    a = (a > 0 ? a : -a);
                    b = (b > 0 ? b : -b);
                    int64_t aa=a,bb=b;
                    int64_t shift = ScalarGetSFFValue<1>(a | b);
                    a >>= ScalarGetSFFValue<1>(a);
                    do {
                        b >>= ScalarGetSFFValue<1>(b);
                        if(a <= 64 && b <= 64){
                            a = 64 - ScalarCountLeadingZero(pre[a] & pre[b]);
                            break;
                        }
                        if (a > b) {
                            a ^= b ^= a ^= b;
                        }
                        b -= a;
                    } while (b);
                    a =(a<<shift);
                    T ans=aa/a*bb;
                    if (ans<0 && abs_flag) {
                        ans=-ans;
                    }
                    yGm.SetValue(i, ans);
                }
            }
        }
    
    private:
        GlobalTensor<T> x1Gm, x2Gm, yGm;
        uint32_t L, R;
        uint32_t abs_flag; 
};
extern "C" __global__ __aicore__ void lcm(GM_ADDR input, GM_ADDR other, GM_ADDR out, GM_ADDR workspace, GM_ADDR tiling) {
    GET_TILING_DATA(tiling_data, tiling);

    if (tiling_data.status == 0) {
        BruteForce<DTYPE_INPUT> op;
        op.Init(input, other, out, tiling_data.n1, tiling_data.n2, tiling_data.ny,tiling_data.size,tiling_data.length,tiling_data.abs_flag,tiling_data.dim_num);
        op.Process();
    }
    else {
        LCMKernalFast<DTYPE_INPUT> op;
        op.Init(input, other, out, tiling_data.size, tiling_data.length,tiling_data.abs_flag);
        op.Process();
    }
}

op_host/lcm.cpp


#include "lcm_tiling.h"
#include "register/op_def_registry.h"
#include "tiling/platform/platform_ascendc.h"
#include <iostream>
#include <algorithm>
#include <vector>


namespace optiling {
static ge::graphStatus TilingFunc(gert::TilingContext* context)
{

    LcmTilingData tiling;
    auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
    auto num_cores = ascendcPlatform.GetCoreNum();
    uint32_t sizeofdatatype,tot=16;
    auto dt = context->GetInputTensor(0)->GetDataType();
    if (dt == ge::DT_INT16) {
        sizeofdatatype = 2;
    }
    else if (dt == ge::DT_INT8) {
        sizeofdatatype = 1;
        tot=64;
    }
    else if (dt == ge::DT_INT32) {
        sizeofdatatype = 4;
        tiling.set_abs_flag(1);
    }
    else {
        sizeofdatatype = 8;
        tiling.set_abs_flag(1);
    }
    const uint32_t alignment = tot / sizeofdatatype;

    const gert::StorageShape* x1_shape = context->GetInputShape(0);
    auto dim1 = x1_shape->GetStorageShape().GetDimNum();
    uint32_t n1[3] = {1, 1, 1};
    for (int i = 0; i < dim1; ++i) {
        n1[i] = x1_shape->GetStorageShape().GetDim(i);
    }
    const gert::StorageShape* x2_shape = context->GetInputShape(1);
    auto dim2 = x2_shape->GetStorageShape().GetDimNum();
    uint32_t n2[3] = {1, 1, 1};
    for (int i = 0; i < dim2; ++i) {
        n2[i + (dim1 - dim2)] = x2_shape->GetStorageShape().GetDim(i);
    }
    int dim = std::max(dim1, dim2);
    tiling.set_dim_num(dim);
    uint32_t ny[3] = {1, 1, 1};
    uint32_t size = 1;
    for (int i = 0; i < dim; ++i) {
        ny[i] = std::max(n1[i], n2[i]);
        size *= ny[i];
    }
    tiling.set_n1(n1);
    tiling.set_n2(n2);
    tiling.set_ny(ny);
    //std::cout << "n1: " << n1[0] << " " << n1[1] << " " << n1[2] << " " << n1[3] << " " << n1[4] << " " << std::endl;
    //std::cout << "n2: " << n2[0] << " " << n2[1] << " " << n2[2] << " " << n2[3] << " " << n2[4] << " " << std::endl;
    //std::cout << "ny: " << ny[0] << " " << ny[1] << " " << ny[2] << " " << ny[3] << " " << ny[4] << " " << std::endl;
    int status = 2;
    for (int i = 0; i < 3; ++i) {
        if (n1[i] != n2[i]) {
            status = 0;
        }
    }
    if(size <= num_cores * alignment) status = 0;
    tiling.set_status(status);
    tiling.set_size(size);
    unsigned length = (size - 1) / num_cores + 1;
    while (length % alignment != 0) length += 1;
    tiling.set_length(length);
    if (status == 0) {
        //std::cout<<"Broadcast"<<std::endl;
        context->SetBlockDim(num_cores);
    }
    else {
        //std::cout << "Multicore" << std::endl;
        context->SetBlockDim(num_cores);
    }
    tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
    context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());

    return ge::GRAPH_SUCCESS;
}
}


namespace ge {
static ge::graphStatus InferShape(gert::InferShapeContext* context) {
    const gert::Shape* x1_shape = context->GetInputShape(0);
    const gert::Shape* x2_shape = context->GetInputShape(1);
    gert::Shape* y_shape = context->GetOutputShape(0);

    // 1. Compute the broadcast output dims (same logic as ny in TilingFunc)
    int dim1 = x1_shape->GetDimNum();
    int dim2 = x2_shape->GetDimNum();
    int max_dim = std::max(dim1, dim2);
    std::vector<int64_t> broadcast_dims;  // broadcast output dims

    for (int i = 0; i < max_dim; ++i) {
        int64_t d1 = (i >= max_dim - dim1) ? x1_shape->GetDim(i - (max_dim - dim1)) : 1;
        int64_t d2 = (i >= max_dim - dim2) ? x2_shape->GetDim(i - (max_dim - dim2)) : 1;
        if (d1 != d2 && d1 != 1 && d2 != 1) {
            std::cerr << "Broadcast failed: incompatible dims " << d1 << " vs " << d2 << std::endl;
            return ge::GRAPH_FAILED;
        }
        broadcast_dims.push_back(std::max(d1, d2));
    }

    // 2. Build the output shape via default construction + SetDim (avoids initializer_list)
    gert::Shape new_shape;  // default-constructed empty Shape
    new_shape.SetDimNum(broadcast_dims.size());
    for (int i=0;i<broadcast_dims.size();i++) {
        new_shape.SetDim(i,broadcast_dims[i]);  // set the dims one by one
    }
    *y_shape = new_shape;  // assign to the output shape

    return ge::GRAPH_SUCCESS;
}
static ge::graphStatus InferDataType(gert::InferDataTypeContext *context)
{
    const auto inputDataType = context->GetInputDataType(0);
    context->SetOutputDataType(0, inputDataType);
    return ge::GRAPH_SUCCESS;
}
}


namespace ops {
class Lcm : public OpDef {
public:
    explicit Lcm(const char* name) : OpDef(name)
    {
        this->Input("input")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
        this->Input("other")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
        this->Output("out")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});

        this->SetInferShape(ge::InferShape).SetInferDataType(ge::InferDataType);

        this->AICore()
            .SetTiling(optiling::TilingFunc);
        this->AICore().AddConfig("ascend910b");

    }
};

OP_ADD(Lcm);
}

op_host/lcm_tiling.h

#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(LcmTilingData)
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 3, n1);
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 3, n2);
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 3, ny);
  TILING_DATA_FIELD_DEF(uint32_t, status);
  TILING_DATA_FIELD_DEF(uint32_t, size);
  TILING_DATA_FIELD_DEF(uint32_t, length);
  TILING_DATA_FIELD_DEF(uint32_t, abs_flag);
  TILING_DATA_FIELD_DEF(uint32_t, dim_num);
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(Lcm, LcmTilingData)
}

Implementation without tiling

op_kernel/lcm.cpp

#include "kernel_operator.h"
using namespace AscendC;

constexpr uint64_t pre[65] = {0, 1ull,3ull,5ull,11ull,17ull,39ull,65ull,139ull,261ull,531ull,1025ull,2095ull,4097ull,8259ull,16405ull,32907ull,65537ull,131367ull,262145ull,524827ull,1048645ull,2098179ull,4194305ull,8390831ull,16777233ull,33558531ull,67109125ull,134225995ull,268435457ull,536887863ull,1073741825ull,2147516555ull,4294968325ull,8590000131ull,17179869265ull,34359871791ull,68719476737ull,137439215619ull,274877911045ull,549756338843ull,1099511627777ull,2199024312423ull,4398046511105ull,8796095120395ull,17592186061077ull,35184376283139ull,70368744177665ull,140737496778927ull,281474976710721ull,562949970199059ull,1125899906908165ull,2251799847243787ull,4503599627370497ull,9007199321981223ull,18014398509483025ull,36028797153190091ull,72057594038190085ull,144115188344291331ull,288230376151711745ull,576460752840837695ull,1152921504606846977ull,2305843010287435779ull,4611686018428436805ull,9223372039002292363ull};

template<typename T> class BruteForce {
public:
    __aicore__ inline BruteForce() {}
    __aicore__ inline void Init(GM_ADDR x1, GM_ADDR x2, GM_ADDR y, uint32_t n1[5], uint32_t n2[5], uint32_t ny[5]) {
        ASSERT(GetBlockNum() != 0 && "block dim can not be zero!");
        uint32_t total1 = 1, total2 = 1, totaly = 1;
        for (int i = 0; i < 5; ++i) {
            this->n1[i] = n1[i];
            this->n2[i] = n2[i];
            this->ny[i] = ny[i];
            total1 *= n1[i];
            total2 *= n2[i];
            totaly *= ny[i];
        }

        x1Gm.SetGlobalBuffer((__gm__ T*)x1, total1);
        x2Gm.SetGlobalBuffer((__gm__ T*)x2, total2);
        yGm.SetGlobalBuffer((__gm__ T*)y, totaly);
    }
    __aicore__ inline void Process() {
        for (uint32_t i0 = 0; i0 < ny[0]; ++i0) {
            for (uint32_t i1 = 0; i1 < ny[1]; ++i1) {
                for (uint32_t i2 = 0; i2 < ny[2]; ++i2) {
                    for (uint32_t i3 = 0; i3 < ny[3]; ++i3) {
                        for (uint32_t i4 = 0; i4 < ny[4]; ++i4) {
                            uint32_t indices[5] = {i0, i1, i2, i3, i4};
                            uint32_t idx1 = 0, idx2 = 0, idxy = 0;
                            for (int j = 0; j < 5; ++j) {
                                idx1 = idx1 * n1[j] + indices[j] % n1[j];
                                idx2 = idx2 * n2[j] + indices[j] % n2[j];
                                idxy = idxy * ny[j] + indices[j];
                            }
                            int64_t a = x1Gm.GetValue(idx1);
                            int64_t b = x2Gm.GetValue(idx2);
                            if (a < 0) {
                                a = -a;
                            }
                            if (b < 0) {
                                b = -b;
                            }
                            if (a == 0 && b == 0) {
                                // both inputs zero: lcm(0, 0) = 0 (also avoids the division by zero below)
                                yGm.SetValue(idxy, 0);
                                continue;
                            }
                            int64_t aa=a,bb=b;
                            while (b) {
                                int64_t A = b;
                                int64_t B = a % b;
                                a = A;
                                b = B;
                            }
                            T ans=aa/a*bb;
                            if constexpr (std::is_same<T, int64_t>::value) {
                                if(ans<0)ans=-ans;
                            }
                            if constexpr (std::is_same<T, int32_t>::value) {
                                if(ans<0)ans=-ans;
                            }
                            yGm.SetValue(idxy, ans);
                        }
                    }
                }
            }
        }
    }

private:
    GlobalTensor<T> x1Gm, x2Gm, yGm;
    uint32_t n1[5], n2[5], ny[5];
};
template<typename T> class GCDKernalFast {
    public:
        __aicore__ inline GCDKernalFast() {}
        __aicore__ inline void Init(GM_ADDR x1, GM_ADDR x2, GM_ADDR y, uint32_t size, uint32_t length) {
            ASSERT(GetBlockNum() != 0 && "block dim can not be zero!");
            const unsigned num_cores = GetBlockNum();
            unsigned L = GetBlockIdx() * length;
            unsigned R = (GetBlockIdx() + 1) * length;
            if (R > size) {
                R = size;
            }
            this->L = 0;
            this->R = R - L;
            x1Gm.SetGlobalBuffer((__gm__ T*)x1 + L, length);
            x2Gm.SetGlobalBuffer((__gm__ T*)x2 + L, length);
            yGm.SetGlobalBuffer((__gm__ T*)y + L, length);
        }
        __aicore__ inline void Process() {
            for (int i = L; i < R; ++i) {
                int64_t a = x1Gm.GetValue(i);
                int64_t b = x2Gm.GetValue(i);
                a = (a > 0 ? a : -a);
                b = (b > 0 ? b : -b);
                int64_t aa=a,bb=b;
                if (a == 0 || b == 0) {
                    // lcm with zero is zero
                    yGm.SetValue(i, 0);
                }
                else {
                    int64_t shift = ScalarGetSFFValue<1>(a | b);
                    a >>= ScalarGetSFFValue<1>(a);
                    do {
                        b >>= ScalarGetSFFValue<1>(b);
                        if(a <= 64 && b <= 64){
                            a = 64 - ScalarCountLeadingZero(pre[a] & pre[b]);
                            break;
                        }
                        if (a > b) {
                            a ^= b ^= a ^= b;
                        }
                        b -= a;
                    } while (b);
                    a =(a<<shift);
                    T ans=aa/a*bb;
                    if constexpr (std::is_same<T, int64_t>::value) {
                        if(ans<0)ans=-ans;
                    }
                    if constexpr (std::is_same<T, int32_t>::value) {
                        if(ans<0)ans=-ans;
                    }
                    yGm.SetValue(i, ans);
                }
            }
        }
    
    private:
        GlobalTensor<T> x1Gm, x2Gm, yGm;
        uint32_t L, R;
};
extern "C" __global__ __aicore__ void lcm(GM_ADDR input, GM_ADDR other, GM_ADDR out, GM_ADDR workspace, GM_ADDR tiling) {
    GET_TILING_DATA(tiling_data, tiling);

    if (tiling_data.status == 0) {
        BruteForce<DTYPE_INPUT> op;
        op.Init(input, other, out, tiling_data.n1, tiling_data.n2, tiling_data.ny);
        op.Process();
    }
    else {
        GCDKernalFast<DTYPE_INPUT> op;
        op.Init(input, other, out, tiling_data.size, tiling_data.length);
        op.Process();
    }
        // BruteForce<DTYPE_INPUT> op;
        // op.Init(input, other, out, tiling_data.n1, tiling_data.n2, tiling_data.ny);
        // op.Process();
}

op_host/lcm.cpp


#include "lcm_tiling.h"
#include "register/op_def_registry.h"
#include "tiling/platform/platform_ascendc.h"
#include <iostream>
#include <algorithm>


namespace optiling {
static ge::graphStatus TilingFunc(gert::TilingContext* context)
{

  LcmTilingData tiling;
    auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
    auto num_cores = ascendcPlatform.GetCoreNum();
    uint32_t sizeofdatatype = 2;
    auto dt = context->GetInputTensor(0)->GetDataType();
    if (dt == ge::DT_INT16) {
        sizeofdatatype = 2;
    }
    else if (dt == ge::DT_INT32) {
        sizeofdatatype = 4;
    }
    else {
        sizeofdatatype = 8;
    }
    const uint32_t alignment = 64 / sizeofdatatype;

    const gert::StorageShape* x1_shape = context->GetInputShape(0);
    auto dim1 = x1_shape->GetStorageShape().GetDimNum();
    uint32_t n1[5] = {1, 1, 1, 1, 1};
    for (int i = 0; i < dim1; ++i) {
        n1[i] = x1_shape->GetStorageShape().GetDim(i);
    }
    const gert::StorageShape* x2_shape = context->GetInputShape(1);
    auto dim2 = x2_shape->GetStorageShape().GetDimNum();
    uint32_t n2[5] = {1, 1, 1, 1, 1};
    for (int i = 0; i < dim2; ++i) {
        n2[i + (dim1 - dim2)] = x2_shape->GetStorageShape().GetDim(i);
    }
    int dim = std::max(dim1, dim2);
    uint32_t ny[5] = {1, 1, 1, 1, 1};
    uint32_t size = 1;
    for (int i = 0; i < dim; ++i) {
        ny[i] = std::max(n1[i], n2[i]);
        size *= ny[i];
    }
    tiling.set_n1(n1);
    tiling.set_n2(n2);
    tiling.set_ny(ny);
    std::cout << "n1: " << n1[0] << " " << n1[1] << " " << n1[2] << " " << n1[3] << " " << n1[4] << " " << std::endl;
    std::cout << "n2: " << n2[0] << " " << n2[1] << " " << n2[2] << " " << n2[3] << " " << n2[4] << " " << std::endl;
    std::cout << "ny: " << ny[0] << " " << ny[1] << " " << ny[2] << " " << ny[3] << " " << ny[4] << " " << std::endl;
    int status = 2;
    for (int i = 0; i < 5; ++i) {
        if (n1[i] != n2[i]) {
            status = 0;
        }
    }
    if(size <= num_cores * alignment) status = 0;
    tiling.set_status(status);
    tiling.set_size(size);
    unsigned length = (size - 1) / num_cores + 1;
    while (length % alignment != 0) length += 1;
    tiling.set_length(length);
    if (status == 0) {
        context->SetBlockDim(1);
    }
    else {
        std::cout << "Multicore" << std::endl;
        context->SetBlockDim(num_cores);
    }
    tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
    context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());

    return ge::GRAPH_SUCCESS;
}
}


namespace ge {
static ge::graphStatus InferShape(gert::InferShapeContext* context)
{
    const gert::Shape* x1_shape = context->GetInputShape(0);
    gert::Shape* y_shape = context->GetOutputShape(0);
    *y_shape = *x1_shape;
    return GRAPH_SUCCESS;
}
static ge::graphStatus InferDataType(gert::InferDataTypeContext *context)
{
    const auto inputDataType = context->GetInputDataType(0);
    context->SetOutputDataType(0, inputDataType);
    return ge::GRAPH_SUCCESS;
}
}


namespace ops {
class Lcm : public OpDef {
public:
    explicit Lcm(const char* name) : OpDef(name)
    {
        this->Input("input")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
        this->Input("other")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
        this->Output("out")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});

        this->SetInferShape(ge::InferShape).SetInferDataType(ge::InferDataType);

        this->AICore()
            .SetTiling(optiling::TilingFunc);
        this->AICore().AddConfig("ascend910b");

    }
};

OP_ADD(Lcm);
}

op_host/lcm_tiling.h

#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(LcmTilingData)
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 5, n1);
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 5, n2);
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 5, ny);
  TILING_DATA_FIELD_DEF(uint32_t, status);
  TILING_DATA_FIELD_DEF(uint32_t, size);
  TILING_DATA_FIELD_DEF(uint32_t, length);
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(Lcm, LcmTilingData)
}

Environment setup

The Ascend C installed by the provided script does not live under the user's home directory, so it disappears after a reboot. You need to rerun the install script after every reboot.

Creating the operator project

It can be generated with a single msopgen command; see the documentation for details.

The basic usage is

msopgen gen -i {*.json} -f {framework type} -c {Compute Resource} -lan cpp -out {Output Path}

The .json file is the operator project's configuration file; it declares the operator's name and the names and types of its parameters.

[
    {
        "op": "BitwiseLeftShift",
        "language": "cpp",
        "input_desc": [
            {
                "name": "input",
                "param_type": "required",
                "format": [
                    "ND",
                    "ND",
                    "ND",
                    "ND"
                ],
                "type": [
                    "int8",
                    "int16",
                    "int32",
                    "int64"
                ]
            },
            {
                "name": "other",
                "param_type": "required",
                "format": [
                    "ND",
                    "ND",
                    "ND",
                    "ND"
                ],
                "type": [
                    "int8",
                    "int16",
                    "int32",
                    "int64"
                ]
            }
        ],
        "output_desc": [
            {
                "name": "out",
                "param_type": "required",
                "format": [
                    "ND",
                    "ND",
                    "ND",
                    "ND"
                ],
                "type": [
                    "int8",
                    "int16",
                    "int32",
                    "int64"
                ]
            }
        ]
    }
]

-c has the form ai_core-{soc version}; for the 910B it is ai_core-910B.

-out can be omitted; if it is, the project is created in the current directory by default.

Build

The operator project directory contains a build.sh; just run it. On success you get an executable build_out/custom_opp_euleros_aarch64.run; running it installs the operator into the Ascend C operator library.

Note that only one operator can be installed this way at a time: the install name is always the same, so a later install overwrites an earlier one.

Testing

In the archive provided by the competition, first run init_pybind.sh to set up the environment; this only needs to be done once.

Then enter the folder for the corresponding problem, e.g. case_910b/Lcm for the Lcm problem. It contains a run.sh script; running bash run.sh x executes the x-th test case defined in test_op.py. If the operator's wheel file is not found at run time, it is built and installed first.

If you change the operator implementation and want to retest, delete the dist and build folders; otherwise the wheel is not reinstalled and the latest implementation will not be tested.
