Problem

The full description:

Reference operator:
torch.lcm

Operator input value ranges:
N ∈ [1, 200]
N2 ∈ [1, 4000]
N3 ∈ [1, 3000]

Input characteristics:
Input x2 should have the same shape as x1; if it differs, the mismatched dimensions have size 1 and can be broadcast to x1's shape, so broadcasting must be implemented.
N through N3 may not be integer multiples of 32, so non-aligned cases must be handled.
int64 values can exceed the int32 range; do not use Cast to force int64 down to int32.

The reference implementation is torch.lcm.

Analysis

The computation itself

The basic computation is $\mathrm{lcm}(x, y) = x \cdot y / \gcd(x, y)$; for example, $\mathrm{lcm}(4, 6) = 24 / 2 = 12$.

The multiplication part can be done with the vector interface.

The key is computing $\gcd$. The only real option is a Euclidean-style algorithm, and the sequence of steps it takes is different for every pair $(x, y)$, so it is hard to express with the vector interface; the values have to be computed one by one.
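
As a reference point, here is a minimal scalar version in plain C++ (gcd_euclid and lcm_scalar are just illustrative names, not part of the operator code):

#include <cstdint>

// Scalar Euclidean gcd used as the per-element baseline; inputs are non-negative.
static int64_t gcd_euclid(int64_t a, int64_t b) {
    while (b != 0) {
        int64_t r = a % b;  // one remainder step of the Euclidean algorithm
        a = b;
        b = r;
    }
    return a;
}

// lcm via lcm(x, y) = |x| / gcd(|x|, |y|) * |y|; dividing first delays overflow.
static int64_t lcm_scalar(int64_t x, int64_t y) {
    if (x == 0 || y == 0) return 0;
    int64_t ax = x < 0 ? -x : x;
    int64_t ay = y < 0 ? -y : y;
    return ax / gcd_euclid(ax, ay) * ay;
}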

Broadcasting

There is actually another hidden difficulty: broadcasting. The two tensors may have different shapes and the computation can still be well defined. Broadcasting is allowed when the shapes are compatible: wherever x has some size in a dimension, y either has size 1 in that dimension or does not have that dimension at all.

Broadcasting can be implemented in two ways. One is to hand everything to a single core: that core sees all of the data (no tiling), enumerates the index of every dimension of the result tensor, i.e. walks over the whole result tensor, and uses the broadcast rules to work out which position of x and which position of y each result position is computed from.

The upside is simplicity; the downside is that it is slow, since it makes no use of multiple cores.

The other implementation tiles the result tensor: each core is assigned an interval of the result tensor, we enumerate the indices in that interval, and again compute for each position which positions of x and y it comes from.

(In theory there is a third option: tile x and y themselves, i.e. work out which intervals of x and y each core needs and hand those out, but that is too hard to compute.)
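
Either way, the core step is mapping a linear index in the result tensor back to indices into the two inputs. A minimal host-side sketch of that mapping (map_broadcast_index is a hypothetical helper; shapes are assumed right-aligned and padded with 1s to the same rank, which is how the tiling code below prepares n1/n2/ny):

#include <cstdint>

// Map a linear index into the broadcast result tensor back to linear indices
// into x1 and x2.  A broadcast dimension has size 1 in the corresponding input,
// so the modulo clamps its coordinate to 0.
static void map_broadcast_index(uint32_t linear, int rank,
                                const uint32_t* ny,   // result shape
                                const uint32_t* n1,   // x1 shape, padded with 1s
                                const uint32_t* n2,   // x2 shape, padded with 1s
                                uint32_t& idx1, uint32_t& idx2) {
    uint32_t coord[8];                        // per-dimension coordinate in the result
    for (int j = rank - 1; j >= 0; --j) {     // peel off dimensions, innermost first
        coord[j] = linear % ny[j];
        linear /= ny[j];
    }
    idx1 = 0;
    idx2 = 0;
    for (int j = 0; j < rank; ++j) {          // rebuild the input linear indices
        idx1 = idx1 * n1[j] + coord[j] % n1[j];
        idx2 = idx2 * n2[j] + coord[j] % n2[j];
    }
}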

Overflow

$x \cdot y$ can overflow here, and this processor's overflow behavior is not quite the same as C++'s or torch's, so you have to experiment by hand to find the rule and then special-case it. Concretely, for int32 and int64 you let the product overflow naturally and then take the absolute value.
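
For illustration only (the exact on-device wraparound rule has to be found by experiment, as noted above), here is a host-side sketch of "let it wrap, then take the absolute value" for int32. Unsigned arithmetic is used so the wraparound is well defined in standard C++; lcm_i32_wrapped is a hypothetical helper, not part of the operator code:

#include <cstdint>

// "Wrap naturally, then take the absolute value": compute |x|/g*|y| modulo 2^32,
// reinterpret it as signed, and negate if the wrapped result is negative.
static int32_t lcm_i32_wrapped(int32_t x, int32_t y, int32_t g /* gcd(|x|, |y|) */) {
    uint32_t ax = x < 0 ? 0u - (uint32_t)x : (uint32_t)x;
    uint32_t ay = y < 0 ? 0u - (uint32_t)y : (uint32_t)y;
    int32_t ans = (int32_t)(ax / (uint32_t)g * ay);  // may wrap into a negative value
    return ans < 0 ? -ans : ans;                     // absolute value after the wrap
}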

Optimization

For the gcd, the Euclidean algorithm does a lot of modulo operations, and modulo is genuinely slow: it costs an order of magnitude more clock cycles than a multiplication or an addition. To cut down on modulo operations, is there a gcd algorithm that needs no modulo at all? There is.

It is the binary gcd (Stein's algorithm), which uses nothing but branches and bit operations. See this blog post for details.

A rough implementation looks like this:

// Binary gcd (Stein's algorithm); assumes a > 0 and b > 0.
inline int gcd(int a, int b)
{
	register int az=__builtin_ctz(a),bz=__builtin_ctz(b),z=az>bz?bz:az,diff; // z = shared power of two
	b>>=bz;                        // make b odd
	while(a)
	{
		a>>=az;                // make a odd
		diff=b-a;
		az=__builtin_ctz(diff);
		if(a<b)b=a;            // b = min(a, b)
		a=diff<0?-diff:diff;   // a = |a - b|
	}
	return b<<z;                   // restore the shared power of two
}

The __builtin functions here are optimized intrinsics: $O(1)$ rather than $O(\log)$.

And since Ascend C inherits much of the C standard library, this works there as well (the kernels below use the equivalent Ascend C intrinsics ScalarGetSFFValue and ScalarCountLeadingZero).

Also, for a workload with many gcd queries, a classic optimization is a lookup table: precompute the answers for small $(x, y)$ pairs.

An open-source implementation found online goes one step further. It does not store the precomputed $\gcd(x, y)$ in a table gcd[x][y], because that would still be fairly large. Instead it uses bit tricks, and the table is just 64 uint64 words (one per value from 1 to 64; the pre[] array in the kernels below): bit $d-1$ of the entry for $x$ is set iff $d$ divides $x$, so the table still encodes the full $64 \times 64$ information, and a lookup is just an AND of two masks followed by finding the highest set bit.
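
A minimal sketch of how such a table can be built and queried on the host (build_gcd_table and gcd_small are hypothetical names; building it this way reproduces the pre[] constants used in the kernels below, which use 64 - ScalarCountLeadingZero(pre[a] & pre[b]) instead of __builtin_clzll):

#include <cstdint>

// Bit (d-1) of table[x] is set iff d divides x, for 1 <= d <= 64.
static void build_gcd_table(uint64_t table[65]) {
    for (int x = 0; x <= 64; ++x) {
        table[x] = 0;
        for (int d = 1; d <= x; ++d) {
            if (x % d == 0) table[x] |= 1ull << (d - 1);
        }
    }
}

// gcd of two values in [1, 64]: the common set bits are the common divisors,
// and the highest one is the gcd.
static int gcd_small(const uint64_t table[65], int a, int b) {
    uint64_t common = table[a] & table[b];   // never zero: bit 0 (divisor 1) is always set
    return 64 - __builtin_clzll(common);     // 1-based position of the highest set bit
}

For example, gcd_small(table, 12, 18) returns 6.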

Implementation with tiling

op_kernel/lcm.cpp

#include "kernel_operator.h"
using namespace AscendC;
constexpr uint64_t pre[65] = {0, 1ull,3ull,5ull,11ull,17ull,39ull,65ull,139ull,261ull,531ull,1025ull,2095ull,4097ull,8259ull,16405ull,32907ull,65537ull,131367ull,262145ull,524827ull,1048645ull,2098179ull,4194305ull,8390831ull,16777233ull,33558531ull,67109125ull,134225995ull,268435457ull,536887863ull,1073741825ull,2147516555ull,4294968325ull,8590000131ull,17179869265ull,34359871791ull,68719476737ull,137439215619ull,274877911045ull,549756338843ull,1099511627777ull,2199024312423ull,4398046511105ull,8796095120395ull,17592186061077ull,35184376283139ull,70368744177665ull,140737496778927ull,281474976710721ull,562949970199059ull,1125899906908165ull,2251799847243787ull,4503599627370497ull,9007199321981223ull,18014398509483025ull,36028797153190091ull,72057594038190085ull,144115188344291331ull,288230376151711745ull,576460752840837695ull,1152921504606846977ull,2305843010287435779ull,4611686018428436805ull,9223372039002292363ull};


template<typename T> class BruteForce {
public:
    __aicore__ inline BruteForce() {}
    __aicore__ inline void Init(GM_ADDR x1, GM_ADDR x2, GM_ADDR y, uint32_t n1[3], uint32_t n2[3], uint32_t ny[3], uint32_t size, uint32_t length,uint32_t abs_flag,uint32_t dim_num) {
        //ASSERT(GetBlockNum() != 0 && "block dim can not be zero!");
        uint32_t total1 = 1, total2 = 1, totaly = 1;
        for (int i = 0; i < dim_num; ++i) {
            this->n1[i] = n1[i];
            this->n2[i] = n2[i];
            this->ny[i] = ny[i];
            total1 *= n1[i];
            total2 *= n2[i];
            totaly *= ny[i];
        }

        x1Gm.SetGlobalBuffer((__gm__ T*)x1, total1);
        x2Gm.SetGlobalBuffer((__gm__ T*)x2, total2);
        
        const unsigned num_cores = GetBlockNum();
        unsigned L = GetBlockIdx() * length;
        unsigned R = (GetBlockIdx() + 1) * length;
        if (R > size) {
            R = size;
        }
        this->L = L;
        this->R = R ;
        yGm.SetGlobalBuffer((__gm__ T*)y, totaly);
        
        this->abs_flag=abs_flag;
        this->dim_num=dim_num;
    }
    __aicore__ inline void Process() {
        if (L >= R) return;  // empty interval, nothing to do

        // --------------------------
        // 1. Initialization: compute the starting multi-dim index once
        //    (the only place division/modulo is used)
        // --------------------------
        uint32_t indices[3];  // per-dimension index (up to 3 dims)
        uint32_t temp = L;
        // compute the multi-dim index corresponding to L
        for (int j = dim_num-1; j >= 0; --j) {
            indices[j] = temp % ny[j];
            temp = temp / ny[j];
        }

        // compute the initial idx1 and idx2 (indices into x1 and x2)
        uint32_t idx1 = 0, idx2 = 0;
        for (int j = 0; j < dim_num; ++j) {
            idx1 = idx1 * n1[j] + (indices[j] % n1[j]);
            idx2 = idx2 * n2[j] + (indices[j] % n2[j]);
        }

        // --------------------------
        // 2. Handle the first element (i = L)
        // --------------------------
        int64_t a = x1Gm.GetValue(idx1);
        int64_t b = x2Gm.GetValue(idx2);
        if (a == 0 || b == 0) {
            yGm.SetValue(L, 0);
        } else {
            a = (a > 0 ? a : -a);
            b = (b > 0 ? b : -b);
            int64_t aa = a, bb = b;
            int64_t shift = ScalarGetSFFValue<1>(a | b);
            a >>= ScalarGetSFFValue<1>(a);
            do {
                b >>= ScalarGetSFFValue<1>(b);
                if (a <= 64 && b <= 64) {
                    a = 64 - ScalarCountLeadingZero(pre[a] & pre[b]);
                    break;
                }
                if (a > b) {
                    a ^= b ^= a ^= b;
                }
                b -= a;
            } while (b);
            a = (a << shift);
            T ans = aa / a * bb;
            if (ans < 0 && abs_flag) {
                ans = -ans;
            }
            yGm.SetValue(L, ans);
        }

        // --------------------------
        // 3. Handle the remaining elements (i = L+1 .. R-1): incremental
        //    updates using only additions/multiplications
        // --------------------------
        for (int i = L + 1; i < R; ++i) {
            // --------------------------
            // 3.1 Incrementally update indices (add + carry, no division/modulo)
            // --------------------------
            int carry = 1;  // add 1 at the last dimension (i has increased by 1)
            for (int j = dim_num-1; j >= 0 && carry; --j) {
                indices[j] += carry;  // bump the current dimension
                if (indices[j] >= ny[j]) {
                    // overflow: wrap to zero and carry into the next dimension
                    indices[j] = 0;
                    carry = 1;
                } else {
                    // no overflow: stop carrying
                    carry = 0;
                }
            }


            uint32_t idx1 = 0, idx2 = 0;

            for (int j = 0; j < dim_num; ++j) {
                int ind=indices[j];
                idx1 = idx1 * n1[j] + (ind>=n1[j]?ind%n1[j]:ind);  // broadcast mapping into x1
                idx2 = idx2 * n2[j] + (ind>=n2[j]?ind%n2[j]:ind);  // broadcast mapping into x2
            }

            // --------------------------
            // 3.2 Compute and write the current element (same logic as above)
            // --------------------------
            a = x1Gm.GetValue(idx1);
            b = x2Gm.GetValue(idx2);
            if (a == 0 || b == 0) {
                yGm.SetValue(i, 0);
            } else {
                a = (a > 0 ? a : -a);
                b = (b > 0 ? b : -b);
                int64_t aa = a, bb = b;
                int64_t shift = ScalarGetSFFValue<1>(a | b);
                a >>= ScalarGetSFFValue<1>(a);
                do {
                    b >>= ScalarGetSFFValue<1>(b);
                    if (a <= 64 && b <= 64) {
                        a = 64 - ScalarCountLeadingZero(pre[a] & pre[b]);
                        break;
                    }
                    if (a > b) {
                        a ^= b ^= a ^= b;
                    }
                    b -= a;
                } while (b);
                a = (a << shift);
                T ans = aa / a * bb;
                if (ans < 0 && abs_flag) {
                    ans = -ans;
                }
                yGm.SetValue(i, ans);
            }
        }
    }

private:
    GlobalTensor<T> x1Gm, x2Gm, yGm;
    uint32_t n1[3], n2[3], ny[3];
    uint32_t L, R;
    uint32_t abs_flag;
    uint32_t dim_num;
};
template<typename T> class LCMKernalFast {
    public:
        __aicore__ inline LCMKernalFast() {}
        __aicore__ inline void Init(GM_ADDR x1, GM_ADDR x2, GM_ADDR y, uint32_t size, uint32_t length,uint32_t abs_flag) {
            //ASSERT(GetBlockNum() != 0 && "block dim can not be zero!");
            const unsigned num_cores = GetBlockNum();
            unsigned L = GetBlockIdx() * length;
            unsigned R = (GetBlockIdx() + 1) * length;
            if (R > size) {
                R = size;
            }
            this->L = 0;
            this->R = R - L;
            x1Gm.SetGlobalBuffer((__gm__ T*)x1 + L, length);
            x2Gm.SetGlobalBuffer((__gm__ T*)x2 + L, length);
            yGm.SetGlobalBuffer((__gm__ T*)y + L, length);
            
            this->abs_flag=abs_flag;
        }
        __aicore__ inline void Process() {
            for (int i = L; i < R; ++i) {
                int64_t a = x1Gm.GetValue(i);
                int64_t b = x2Gm.GetValue(i);
                if (a == 0||b==0){
                    yGm.SetValue(i, 0);
                }
                else {
                    a = (a > 0 ? a : -a);
                    b = (b > 0 ? b : -b);
                    int64_t aa=a,bb=b;
                    int64_t shift = ScalarGetSFFValue<1>(a | b);
                    a >>= ScalarGetSFFValue<1>(a);
                    do {
                        b >>= ScalarGetSFFValue<1>(b);
                        if(a <= 64 && b <= 64){
                            a = 64 - ScalarCountLeadingZero(pre[a] & pre[b]);
                            break;
                        }
                        if (a > b) {
                            a ^= b ^= a ^= b;
                        }
                        b -= a;
                    } while (b);
                    a =(a<<shift);
                    T ans=aa/a*bb;
                    if (ans<0 && abs_flag) {
                        ans=-ans;
                    }
                    yGm.SetValue(i, ans);
                }
            }
        }
    
    private:
        GlobalTensor<T> x1Gm, x2Gm, yGm;
        uint32_t L, R;
        uint32_t abs_flag; 
};
extern "C" __global__ __aicore__ void lcm(GM_ADDR input, GM_ADDR other, GM_ADDR out, GM_ADDR workspace, GM_ADDR tiling) {
    GET_TILING_DATA(tiling_data, tiling);

    if (tiling_data.status == 0) {
        BruteForce<DTYPE_INPUT> op;
        op.Init(input, other, out, tiling_data.n1, tiling_data.n2, tiling_data.ny,tiling_data.size,tiling_data.length,tiling_data.abs_flag,tiling_data.dim_num);
        op.Process();
    }
    else {
        LCMKernalFast<DTYPE_INPUT> op;
        op.Init(input, other, out, tiling_data.size, tiling_data.length,tiling_data.abs_flag);
        op.Process();
    }
}

op_host/lcm.cpp


#include "lcm_tiling.h"
#include "register/op_def_registry.h"
#include "tiling/platform/platform_ascendc.h"
#include <iostream>
#include <algorithm>
#include <vector>


namespace optiling {
static ge::graphStatus TilingFunc(gert::TilingContext* context)
{

    LcmTilingData tiling;
    auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
    auto num_cores = ascendcPlatform.GetCoreNum();
    uint32_t sizeofdatatype,tot=16;
    auto dt = context->GetInputTensor(0)->GetDataType();
    if (dt == ge::DT_INT16) {
        sizeofdatatype = 2;
    }
    else if (dt == ge::DT_INT8) {
        sizeofdatatype = 1;
        tot=64;
    }
    else if (dt == ge::DT_INT32) {
        sizeofdatatype = 4;
        tiling.set_abs_flag(1);
    }
    else {
        sizeofdatatype = 8;
        tiling.set_abs_flag(1);
    }
    const uint32_t alignment = tot / sizeofdatatype;

    const gert::StorageShape* x1_shape = context->GetInputShape(0);
    auto dim1 = x1_shape->GetStorageShape().GetDimNum();
    uint32_t n1[3] = {1, 1, 1};
    for (int i = 0; i < dim1; ++i) {
        n1[i] = x1_shape->GetStorageShape().GetDim(i);
    }
    const gert::StorageShape* x2_shape = context->GetInputShape(1);
    auto dim2 = x2_shape->GetStorageShape().GetDimNum();
    uint32_t n2[3] = {1, 1, 1};
    for (int i = 0; i < dim2; ++i) {
        n2[i + (dim1 - dim2)] = x2_shape->GetStorageShape().GetDim(i);
    }
    int dim = std::max(dim1, dim2);
    tiling.set_dim_num(dim);
    uint32_t ny[3] = {1, 1, 1};
    uint32_t size = 1;
    for (int i = 0; i < dim; ++i) {
        ny[i] = std::max(n1[i], n2[i]);
        size *= ny[i];
    }
    tiling.set_n1(n1);
    tiling.set_n2(n2);
    tiling.set_ny(ny);
    //std::cout << "n1: " << n1[0] << " " << n1[1] << " " << n1[2] << " " << n1[3] << " " << n1[4] << " " << std::endl;
    //std::cout << "n2: " << n2[0] << " " << n2[1] << " " << n2[2] << " " << n2[3] << " " << n2[4] << " " << std::endl;
    //std::cout << "ny: " << ny[0] << " " << ny[1] << " " << ny[2] << " " << ny[3] << " " << ny[4] << " " << std::endl;
    int status = 2;
    for (int i = 0; i < 3; ++i) {
        if (n1[i] != n2[i]) {
            status = 0;
        }
    }
    if(size <= num_cores * alignment) status = 0;
    tiling.set_status(status);
    tiling.set_size(size);
    unsigned length = (size - 1) / num_cores + 1;
    while (length % alignment != 0) length += 1;
    tiling.set_length(length);
    if (status == 0) {
        //std::cout<<"Broadcast"<<std::endl;
        context->SetBlockDim(num_cores);
    }
    else {
        //std::cout << "Multicore" << std::endl;
        context->SetBlockDim(num_cores);
    }
    tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
    context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());

    return ge::GRAPH_SUCCESS;
}
}


namespace ge {
static ge::graphStatus InferShape(gert::InferShapeContext* context) {
    const gert::Shape* x1_shape = context->GetInputShape(0);
    const gert::Shape* x2_shape = context->GetInputShape(1);
    gert::Shape* y_shape = context->GetOutputShape(0);

    // 1. Compute the broadcast output dims (same logic as ny in TilingFunc)
    int dim1 = x1_shape->GetDimNum();
    int dim2 = x2_shape->GetDimNum();
    int max_dim = std::max(dim1, dim2);
    std::vector<int64_t> broadcast_dims;  // broadcast output dims

    for (int i = 0; i < max_dim; ++i) {
        int64_t d1 = (i >= max_dim - dim1) ? x1_shape->GetDim(i - (max_dim - dim1)) : 1;
        int64_t d2 = (i >= max_dim - dim2) ? x2_shape->GetDim(i - (max_dim - dim2)) : 1;
        if (d1 != d2 && d1 != 1 && d2 != 1) {
            std::cerr << "Broadcast failed: incompatible dims " << d1 << " vs " << d2 << std::endl;
            return ge::GRAPH_FAILED;
        }
        broadcast_dims.push_back(std::max(d1, d2));
    }

    // 2. Build the output shape via default construction + SetDim (avoids initializer_list)
    gert::Shape new_shape;  // default-constructed empty Shape
    new_shape.SetDimNum(broadcast_dims.size());
    for (int i=0;i<broadcast_dims.size();i++) {
        new_shape.SetDim(i,broadcast_dims[i]);  // set the dims one by one
    }
    *y_shape = new_shape;  // assign to the output shape

    return ge::GRAPH_SUCCESS;
}
static ge::graphStatus InferDataType(gert::InferDataTypeContext *context)
{
    const auto inputDataType = context->GetInputDataType(0);
    context->SetOutputDataType(0, inputDataType);
    return ge::GRAPH_SUCCESS;
}
}


namespace ops {
class Lcm : public OpDef {
public:
    explicit Lcm(const char* name) : OpDef(name)
    {
        this->Input("input")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
        this->Input("other")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
        this->Output("out")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});

        this->SetInferShape(ge::InferShape).SetInferDataType(ge::InferDataType);

        this->AICore()
            .SetTiling(optiling::TilingFunc);
        this->AICore().AddConfig("ascend910b");

    }
};

OP_ADD(Lcm);
}

op_host/lcm_tiling.h

#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(LcmTilingData)
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 3, n1);
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 3, n2);
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 3, ny);
  TILING_DATA_FIELD_DEF(uint32_t, status);
  TILING_DATA_FIELD_DEF(uint32_t, size);
  TILING_DATA_FIELD_DEF(uint32_t, length);
  TILING_DATA_FIELD_DEF(uint32_t, abs_flag);
  TILING_DATA_FIELD_DEF(uint32_t, dim_num);
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(Lcm, LcmTilingData)
}

Implementation without tiling

op_kernel/lcm.cpp

#include "kernel_operator.h"
using namespace AscendC;

constexpr uint64_t pre[65] = {0, 1ull,3ull,5ull,11ull,17ull,39ull,65ull,139ull,261ull,531ull,1025ull,2095ull,4097ull,8259ull,16405ull,32907ull,65537ull,131367ull,262145ull,524827ull,1048645ull,2098179ull,4194305ull,8390831ull,16777233ull,33558531ull,67109125ull,134225995ull,268435457ull,536887863ull,1073741825ull,2147516555ull,4294968325ull,8590000131ull,17179869265ull,34359871791ull,68719476737ull,137439215619ull,274877911045ull,549756338843ull,1099511627777ull,2199024312423ull,4398046511105ull,8796095120395ull,17592186061077ull,35184376283139ull,70368744177665ull,140737496778927ull,281474976710721ull,562949970199059ull,1125899906908165ull,2251799847243787ull,4503599627370497ull,9007199321981223ull,18014398509483025ull,36028797153190091ull,72057594038190085ull,144115188344291331ull,288230376151711745ull,576460752840837695ull,1152921504606846977ull,2305843010287435779ull,4611686018428436805ull,9223372039002292363ull};

template<typename T> class BruteForce {
public:
    __aicore__ inline BruteForce() {}
    __aicore__ inline void Init(GM_ADDR x1, GM_ADDR x2, GM_ADDR y, uint32_t n1[5], uint32_t n2[5], uint32_t ny[5]) {
        ASSERT(GetBlockNum() != 0 && "block dim can not be zero!");
        uint32_t total1 = 1, total2 = 1, totaly = 1;
        for (int i = 0; i < 5; ++i) {
            this->n1[i] = n1[i];
            this->n2[i] = n2[i];
            this->ny[i] = ny[i];
            total1 *= n1[i];
            total2 *= n2[i];
            totaly *= ny[i];
        }

        x1Gm.SetGlobalBuffer((__gm__ T*)x1, total1);
        x2Gm.SetGlobalBuffer((__gm__ T*)x2, total2);
        yGm.SetGlobalBuffer((__gm__ T*)y, totaly);
    }
    __aicore__ inline void Process() {
        for (uint32_t i0 = 0; i0 < ny[0]; ++i0) {
            for (uint32_t i1 = 0; i1 < ny[1]; ++i1) {
                for (uint32_t i2 = 0; i2 < ny[2]; ++i2) {
                    for (uint32_t i3 = 0; i3 < ny[3]; ++i3) {
                        for (uint32_t i4 = 0; i4 < ny[4]; ++i4) {
                            uint32_t indices[5] = {i0, i1, i2, i3, i4};
                            uint32_t idx1 = 0, idx2 = 0, idxy = 0;
                            for (int j = 0; j < 5; ++j) {
                                idx1 = idx1 * n1[j] + indices[j] % n1[j];
                                idx2 = idx2 * n2[j] + indices[j] % n2[j];
                                idxy = idxy * ny[j] + indices[j];
                            }
                            int64_t a = x1Gm.GetValue(idx1);
                            int64_t b = x2Gm.GetValue(idx2);
                            if (a < 0) {
                                a = -a;
                            }
                            if (b < 0) {
                                b = -b;
                            }
                            if (a == 0 && b == 0) {
                                // both inputs zero: lcm(0, 0) = 0 (also avoids the division by zero below)
                                yGm.SetValue(idxy, 0);
                                continue;
                            }
                            int64_t aa=a,bb=b;
                            while (b) {
                                int64_t A = b;
                                int64_t B = a % b;
                                a = A;
                                b = B;
                            }
                            T ans=aa/a*bb;
                            if constexpr (std::is_same<T, int64_t>::value) {
                                if(ans<0)ans=-ans;
                            }
                            if constexpr (std::is_same<T, int32_t>::value) {
                                if(ans<0)ans=-ans;
                            }
                            yGm.SetValue(idxy, ans);
                        }
                    }
                }
            }
        }
    }

private:
    GlobalTensor<T> x1Gm, x2Gm, yGm;
    uint32_t n1[5], n2[5], ny[5];
};
template<typename T> class GCDKernalFast {
    public:
        __aicore__ inline GCDKernalFast() {}
        __aicore__ inline void Init(GM_ADDR x1, GM_ADDR x2, GM_ADDR y, uint32_t size, uint32_t length) {
            ASSERT(GetBlockNum() != 0 && "block dim can not be zero!");
            const unsigned num_cores = GetBlockNum();
            unsigned L = GetBlockIdx() * length;
            unsigned R = (GetBlockIdx() + 1) * length;
            if (R > size) {
                R = size;
            }
            this->L = 0;
            this->R = R - L;
            x1Gm.SetGlobalBuffer((__gm__ T*)x1 + L, length);
            x2Gm.SetGlobalBuffer((__gm__ T*)x2 + L, length);
            yGm.SetGlobalBuffer((__gm__ T*)y + L, length);
        }
        __aicore__ inline void Process() {
            for (int i = L; i < R; ++i) {
                int64_t a = x1Gm.GetValue(i);
                int64_t b = x2Gm.GetValue(i);
                a = (a > 0 ? a : -a);
                b = (b > 0 ? b : -b);
                int64_t aa=a,bb=b;
                if (a == 0 || b == 0) {
                    // lcm with zero is zero
                    yGm.SetValue(i, 0);
                }
                else {
                    int64_t shift = ScalarGetSFFValue<1>(a | b);
                    a >>= ScalarGetSFFValue<1>(a);
                    do {
                        b >>= ScalarGetSFFValue<1>(b);
                        if(a <= 64 && b <= 64){
                            a = 64 - ScalarCountLeadingZero(pre[a] & pre[b]);
                            break;
                        }
                        if (a > b) {
                            a ^= b ^= a ^= b;
                        }
                        b -= a;
                    } while (b);
                    a =(a<<shift);
                    T ans=aa/a*bb;
                    if constexpr (std::is_same<T, int64_t>::value) {
                        if(ans<0)ans=-ans;
                    }
                    if constexpr (std::is_same<T, int32_t>::value) {
                        if(ans<0)ans=-ans;
                    }
                    yGm.SetValue(i, ans);
                }
            }
        }
    
    private:
        GlobalTensor<T> x1Gm, x2Gm, yGm;
        uint32_t L, R;
};
extern "C" __global__ __aicore__ void lcm(GM_ADDR input, GM_ADDR other, GM_ADDR out, GM_ADDR workspace, GM_ADDR tiling) {
    GET_TILING_DATA(tiling_data, tiling);

    if (tiling_data.status == 0) {
        BruteForce<DTYPE_INPUT> op;
        op.Init(input, other, out, tiling_data.n1, tiling_data.n2, tiling_data.ny);
        op.Process();
    }
    else {
        GCDKernalFast<DTYPE_INPUT> op;
        op.Init(input, other, out, tiling_data.size, tiling_data.length);
        op.Process();
    }
        // BruteForce<DTYPE_INPUT> op;
        // op.Init(input, other, out, tiling_data.n1, tiling_data.n2, tiling_data.ny);
        // op.Process();
}

op_host/lcm.cpp


#include "lcm_tiling.h"
#include "register/op_def_registry.h"
#include "tiling/platform/platform_ascendc.h"
#include <iostream>
#include <algorithm>


namespace optiling {
static ge::graphStatus TilingFunc(gert::TilingContext* context)
{

  LcmTilingData tiling;
    auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
    auto num_cores = ascendcPlatform.GetCoreNum();
    uint32_t sizeofdatatype = 2;
    auto dt = context->GetInputTensor(0)->GetDataType();
    if (dt == ge::DT_INT16) {
        sizeofdatatype = 2;
    }
    else if (dt == ge::DT_INT32) {
        sizeofdatatype = 4;
    }
    else {
        sizeofdatatype = 8;
    }
    const uint32_t alignment = 64 / sizeofdatatype;

    const gert::StorageShape* x1_shape = context->GetInputShape(0);
    auto dim1 = x1_shape->GetStorageShape().GetDimNum();
    uint32_t n1[5] = {1, 1, 1, 1, 1};
    for (int i = 0; i < dim1; ++i) {
        n1[i] = x1_shape->GetStorageShape().GetDim(i);
    }
    const gert::StorageShape* x2_shape = context->GetInputShape(1);
    auto dim2 = x2_shape->GetStorageShape().GetDimNum();
    uint32_t n2[5] = {1, 1, 1, 1, 1};
    for (int i = 0; i < dim2; ++i) {
        n2[i + (dim1 - dim2)] = x2_shape->GetStorageShape().GetDim(i);
    }
    int dim = std::max(dim1, dim2);
    uint32_t ny[5] = {1, 1, 1, 1, 1};
    uint32_t size = 1;
    for (int i = 0; i < dim; ++i) {
        ny[i] = std::max(n1[i], n2[i]);
        size *= ny[i];
    }
    tiling.set_n1(n1);
    tiling.set_n2(n2);
    tiling.set_ny(ny);
    std::cout << "n1: " << n1[0] << " " << n1[1] << " " << n1[2] << " " << n1[3] << " " << n1[4] << " " << std::endl;
    std::cout << "n2: " << n2[0] << " " << n2[1] << " " << n2[2] << " " << n2[3] << " " << n2[4] << " " << std::endl;
    std::cout << "ny: " << ny[0] << " " << ny[1] << " " << ny[2] << " " << ny[3] << " " << ny[4] << " " << std::endl;
    int status = 2;
    for (int i = 0; i < 5; ++i) {
        if (n1[i] != n2[i]) {
            status = 0;
        }
    }
    if(size <= num_cores * alignment) status = 0;
    tiling.set_status(status);
    tiling.set_size(size);
    unsigned length = (size - 1) / num_cores + 1;
    while (length % alignment != 0) length += 1;
    tiling.set_length(length);
    if (status == 0) {
        context->SetBlockDim(1);
    }
    else {
        std::cout << "Multicore" << std::endl;
        context->SetBlockDim(num_cores);
    }
    tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
    context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());

    return ge::GRAPH_SUCCESS;
}
}


namespace ge {
static ge::graphStatus InferShape(gert::InferShapeContext* context)
{
    const gert::Shape* x1_shape = context->GetInputShape(0);
    gert::Shape* y_shape = context->GetOutputShape(0);
    *y_shape = *x1_shape;
    return GRAPH_SUCCESS;
}
static ge::graphStatus InferDataType(gert::InferDataTypeContext *context)
{
    const auto inputDataType = context->GetInputDataType(0);
    context->SetOutputDataType(0, inputDataType);
    return ge::GRAPH_SUCCESS;
}
}


namespace ops {
class Lcm : public OpDef {
public:
    explicit Lcm(const char* name) : OpDef(name)
    {
        this->Input("input")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
        this->Input("other")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
        this->Output("out")
            .ParamType(REQUIRED)
            .DataType({ge::DT_INT8, ge::DT_INT16, ge::DT_INT32, ge::DT_INT64})
            .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
            .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});

        this->SetInferShape(ge::InferShape).SetInferDataType(ge::InferDataType);

        this->AICore()
            .SetTiling(optiling::TilingFunc);
        this->AICore().AddConfig("ascend910b");

    }
};

OP_ADD(Lcm);
}

op_host/lcm_tiling.h

#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(LcmTilingData)
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 5, n1);
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 5, n2);
  TILING_DATA_FIELD_DEF_ARR(uint32_t, 5, ny);
  TILING_DATA_FIELD_DEF(uint32_t, status);
  TILING_DATA_FIELD_DEF(uint32_t, size);
  TILING_DATA_FIELD_DEF(uint32_t, length);
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(Lcm, LcmTilingData)
}

Environment setup

The Ascend C installed by the provided script does not live under the user's home directory, so it disappears after a reboot. You need to rerun the install script after every reboot.

Creating the operator project

It can be generated with a single msopgen command; see the documentation for details.

The basic usage is

msopgen gen -i {*.json} -f {framework type} -c {Compute Resource} -lan cpp -out {Output Path}

The .json file is the operator project's configuration file; it declares the operator's name and the names and types of its parameters.

[
    {
        "op": "BitwiseLeftShift",
        "language": "cpp",
        "input_desc": [
            {
                "name": "input",
                "param_type": "required",
                "format": [
                    "ND",
                    "ND",
                    "ND",
                    "ND"
                ],
                "type": [
                    "int8",
                    "int16",
                    "int32",
                    "int64"
                ]
            },
            {
                "name": "other",
                "param_type": "required",
                "format": [
                    "ND",
                    "ND",
                    "ND",
                    "ND"
                ],
                "type": [
                    "int8",
                    "int16",
                    "int32",
                    "int64"
                ]
            }
        ],
        "output_desc": [
            {
                "name": "out",
                "param_type": "required",
                "format": [
                    "ND",
                    "ND",
                    "ND",
                    "ND"
                ],
                "type": [
                    "int8",
                    "int16",
                    "int32",
                    "int64"
                ]
            }
        ]
    }
]

-c has the form ai_core-{soc version}; for the 910B it is ai_core-910B.

-out can be omitted; if it is, the project is created in the current directory by default.

Build

The operator project directory contains a build.sh; just run it. On success you get an executable build_out/custom_opp_euleros_aarch64.run; running it installs the operator into the Ascend C operator library.

Note that only one operator can be installed this way at a time: the install name is always the same, so a later install overwrites an earlier one.

Testing

In the archive provided by the competition, first run init_pybind.sh to set up the environment; this only needs to be done once.

Then enter the folder for the corresponding problem, e.g. case_910b/Lcm for the Lcm problem. It contains a run.sh script; running bash run.sh x executes the x-th test case defined in test_op.py. If the operator's wheel file is not found at run time, it is built and installed first.

If you change the operator implementation and want to retest, delete the dist and build folders; otherwise the wheel is not reinstalled and the latest implementation will not be tested.
