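Alibaba Cloud Short Sentence Recognition (ASR/STT) Development Guide

Calling and developing against the Alibaba Cloud short sentence recognition API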

Aou君 · Published 2024-08-23 15:32:14

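Background

When hardware resources are limited, it is hard for a locally deployed ASR/STT service to expose an IP and port for joint debugging with other projects. (Intranet tunneling requires ICP filing, and with ngrok the domain keeps changing, so the request URL keeps changing too, which is inconvenient.)

A third-party API/SDK is therefore one of the better options for deployment on a low-spec server.

Before using it, consult the relevant Alibaba Cloud documentation:
Short Sentence Recognition Python SDK Guide, Intelligent Speech Interaction (ISI), Alibaba Cloud Help Center (aliyun.com)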

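SDK installation

Download the Python SDK.

Get the Python SDK from GitHub, or download alibabacloud-nls-python-sdk-1.0.0.zip directly.
Install the SDK dependencies.

From the SDK root directory, install the dependencies with: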
```
python -m pip install -r requirements.txt
```
Install the SDK.

Once the dependencies are installed, install the SDK with:
```
python -m pip install .
```
After installation, import the SDK with:
```
# -*- coding: utf-8 -*-
import nls
```
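
Before going further, it can help to confirm that the install actually made the module visible to the interpreter you are using. This is a minimal sketch using only the standard library; the helper name `sdk_available` is my own and not part of the SDK.

```python
import importlib.util

def sdk_available(module_name="nls"):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    if sdk_available():
        print("nls SDK found")
    else:
        print("nls SDK not found; re-run the install steps")
```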

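Copy test_sr.py from the SDK directory above into the project root so that the nls library can be imported easily.

Code changes

Reading the source shows that the token is obtained from an AccessKey and appid preset in environment variables.

For a production deployment, read:

Obtaining a Token via the SDK, Intelligent Speech Interaction (ISI), Alibaba Cloud Help Center (aliyun.com)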
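
For reference, reading the credentials from environment variables can be sketched roughly as follows. The variable names `ALIYUN_AK_ID`, `ALIYUN_AK_SECRET`, and `NLS_APP_KEY` are assumptions for illustration, so use whatever names your deployment defines; `load_credentials` is a hypothetical helper, not an SDK function.

```python
import os

# Assumed variable names; adjust them to your own deployment.
REQUIRED_VARS = ("ALIYUN_AK_ID", "ALIYUN_AK_SECRET", "NLS_APP_KEY")

def load_credentials(env=None):
    """Read the credentials from a mapping (os.environ by default).

    Raises RuntimeError naming every missing variable, so a misconfigured
    deployment fails at startup rather than on the first request.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Keeping the keys out of the source file also means the script can be committed to version control without leaking them.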

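Getting the token

I was doing a test deployment for joint debugging and opening an API on a server, and the project was on a tight schedule, so configuring these parameters felt like a waste of time. Following the official documentation, I wrote a Python token fetcher that takes the keys explicitly, which also makes debugging easier.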

import json
from aliyunsdkcore.client import AcsClient
from aliyunsdkcore.request import CommonRequest

class AliyunNlsToken:
    def __init__(self, ak_id, ak_secret, region):
        # Create the AcsClient instance
        self.client = AcsClient(ak_id, ak_secret, region)

    def get_token(self):
        # Build the CreateToken request and set its parameters
        request = CommonRequest()
        request.set_method('POST')
        request.set_domain('nls-meta.cn-shanghai.aliyuncs.com')
        request.set_version('2019-02-28')
        request.set_action_name('CreateToken')

        try:
            # Send the request and parse the response
            response = self.client.do_action_with_exception(request)
            response_json = json.loads(response)
            if 'Token' in response_json and 'Id' in response_json['Token']:
                # Return the token and its expiry time
                return response_json['Token']['Id'], response_json['Token']['ExpireTime']
            else:
                return None, None
        except Exception as e:
            # Print the error and signal failure to the caller
            print(e)
            return None, None

if __name__ == '__main__':
    ak_id = 'your-access-key-id'
    ak_secret = 'your-access-key-secret'
    region = "cn-shenzhen"

    # Create an AliyunNlsToken instance
    token_client = AliyunNlsToken(ak_id, ak_secret, region)
    token, expire_time = token_client.get_token()
    if token:
        print("token =", token)
        print("expireTime =", expire_time)
    else:
        print("Failed to get token")

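Save the code above as sdk_get_token.py in the project root.

How to obtain the two variables ak_id and ak_secret used in the code is explained in detail in
Getting Started with Intelligent Speech Interaction, Intelligent Speech Interaction (ISI), Alibaba Cloud Help Center (aliyun.com): https://help.aliyun.com/zh/isi/getting-started/start-here?spm=a2c4g.11186623.0.0.50eb43c5h1zKMf

Once your Alibaba Cloud account has completed real-name verification, follow Create an AccessKey Pair, Resource Access Management (RAM), Alibaba Cloud Help Center (aliyun.com): https://help.aliyun.com/zh/ram/user-guide/create-an-accesskey-pair?spm=a2c4g.11186623.0.0.393b2664xHGqZ5#section-ynu-63z-ujz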
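
Because `get_token()` returns an expiry time alongside the token, a long-running service does not have to request a new token on every call. Below is a minimal sketch of a cache that refreshes only when the token is close to expiring. It assumes `ExpireTime` is a Unix timestamp in seconds; `TokenCache` and the 60-second margin are my own illustrative choices, and the fetch function is passed in so the sketch runs without network access.

```python
import time

class TokenCache:
    """Cache a (token, expire_time) pair and refresh it near expiry."""

    def __init__(self, fetch, margin_seconds=60, clock=time.time):
        self._fetch = fetch            # callable returning (token, expire_time)
        self._margin = margin_seconds  # refresh this long before expiry
        self._clock = clock            # injectable so the logic can be tested
        self._token = None
        self._expire_time = 0

    def get(self):
        # Refresh if there is no token yet or it expires within the margin.
        if self._token is None or self._clock() >= self._expire_time - self._margin:
            token, expire_time = self._fetch()
            if token is None:
                raise RuntimeError("failed to fetch token")
            self._token, self._expire_time = token, expire_time
        return self._token
```

With the class from sdk_get_token.py this would be used as `cache = TokenCache(token_client.get_token)`, calling `cache.get()` wherever a token is needed.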

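Modifying the sample code

In Alibaba Cloud's sample code, replace the parts that read environment variables and the token with explicit plaintext variables, and fetch the token at run time with the class defined above.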

import time
import threading
import sys
from sdk_get_token import AliyunNlsToken
import nls
# Alibaba Cloud AccessKey ID and AccessKey Secret
ak_id = ''
ak_secret = ''
region = "cn-shenzhen"  # Alibaba Cloud region

# Create an AliyunNlsToken instance to obtain the access token
token_client = AliyunNlsToken(ak_id, ak_secret, region)
token, expire_time = token_client.get_token()  # token and its expiry time
URL = "wss://nls-gateway-cn-shenzhen.aliyuncs.com/ws/v1"  # WebSocket URL of the speech recognition gateway
TOKEN = token  # the token obtained above
APPKEY = ""  # the project's Appkey, from the Alibaba Cloud console

class TestSt:
    def __init__(self, tid, test_file):
        self.__th = threading.Thread(target=self.__test_run)
        self.__id = tid
        self.__test_file = test_file
   
    def loadfile(self, filename):
        with open(filename, "rb") as f:
            self.__data = f.read()
    
    def start(self):
        self.loadfile(self.__test_file)
        self.__th.start()

    def test_on_sentence_begin(self, message, *args):
        print("test_on_sentence_begin:{}".format(message))

    def test_on_sentence_end(self, message, *args):
        print("test_on_sentence_end:{}".format(message))

    def test_on_start(self, message, *args):
        print("test_on_start:{}".format(message))

    def test_on_error(self, message, *args):
        print("on_error args=>{}".format(args))

    def test_on_close(self, *args):
        print("on_close: args=>{}".format(args))

    def test_on_result_chg(self, message, *args):
        print("test_on_chg:{}".format(message))

    def test_on_completed(self, message, *args):
        print("on_completed:args=>{} message=>{}".format(args, message))


    def __test_run(self):
        print("thread:{} start..".format(self.__id))
        sr = nls.NlsSpeechTranscriber(
                    token=TOKEN,
                    appkey=APPKEY,
                    on_sentence_begin=self.test_on_sentence_begin,
                    on_sentence_end=self.test_on_sentence_end,
                    on_start=self.test_on_start,
                    on_result_changed=self.test_on_result_chg,
                    on_completed=self.test_on_completed,
                    on_error=self.test_on_error,
                    on_close=self.test_on_close,
                    callback_args=[self.__id]
                )
        while True:
            print("{}: session start".format(self.__id))
            r = sr.start(aformat="pcm",
                    enable_intermediate_result=True,
                    enable_punctuation_prediction=True,
                    enable_inverse_text_normalization=True)

            self.__slices = zip(*(iter(self.__data),) * 640)
            for i in self.__slices:
                sr.send_audio(bytes(i))
                time.sleep(0.01)

            sr.ctrl(ex={"test":"tttt"})
            time.sleep(1)

            r = sr.stop()
            print("{}: sr stopped:{}".format(self.__id, r))
            time.sleep(5)

def multiruntest(num=500):
    for i in range(0, num):
        name = "thread" + str(i)
        t = TestSt(name, "tests/test1.wav")
        t.start()

nls.enableTrace(True)
multiruntest(1)

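That is the code. Fill in the AccessKey ID and Secret obtained earlier, create a project at the link below, then copy its appkey into the designated place (at the top of the code).

Intelligent Speech Interaction console (aliyun.com): https://nls-portal.console.aliyun.com/applist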

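Running

Running it directly at this point produces a long stream of streaming output.

However, the thread has no shutdown path and Ctrl+C cannot exit it, so I had to close the terminal.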
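
One way to make such a loop interruptible, sketched here without the SDK: run the worker as a daemon thread and have it check a `threading.Event` instead of looping on `while True`. The `run_session` callable stands in for one start/send/stop cycle and is a placeholder of mine, not something from the sample.

```python
import threading
import time

class StoppableWorker:
    """Run `run_session` repeatedly until stop() is called."""

    def __init__(self, run_session, pause_seconds=0.0):
        self._run_session = run_session
        self._pause = pause_seconds
        self._stop_event = threading.Event()
        # daemon=True lets the interpreter exit even if the thread is still running
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self, timeout=None):
        # Ask the loop to finish, then wait for the thread to end.
        self._stop_event.set()
        self._thread.join(timeout)

    def _loop(self):
        while not self._stop_event.is_set():
            self._run_session()
            # wait() returns early once stop() is called, unlike time.sleep()
            self._stop_event.wait(self._pause)
```

In the main thread, call `worker.start()`, then idle in a `try` / `except KeyboardInterrupt` loop with short `time.sleep()` calls and call `worker.stop()` in the handler, so Ctrl+C is handled in the main thread.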

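Based on testing with the code above, I summarized the shortcomings of Alibaba Cloud short sentence recognition as an STT service, and my requirements for it, as follows

~~Then I fixed bugs one by one following my logic diagram; a few are still unfixed~~

Note: I had not read the official Alibaba Cloud documentation carefully when I wrote this post. WAV files can in fact be recognized directly, as long as the sample rate is 8000 or 16000 Hz, so the PCM conversion in the code can be commented out.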
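
To decide whether a given WAV file can be sent without conversion, its header can be inspected with Python's standard `wave` module. A minimal sketch: `wav_is_supported` is a hypothetical helper, and the mono, 16-bit check is my own assumption layered on top of the sample-rate requirement in the note above.

```python
import wave

SUPPORTED_RATES = (8000, 16000)  # sample rates named in the note above

def wav_is_supported(path, require_mono_16bit=True):
    """Return True if the WAV header matches the expected format.

    require_mono_16bit is an assumption of this sketch (one channel,
    2 bytes per sample); set it to False to check the sample rate only.
    """
    with wave.open(path, "rb") as wav:
        rate_ok = wav.getframerate() in SUPPORTED_RATES
        shape_ok = wav.getnchannels() == 1 and wav.getsampwidth() == 2
    return rate_ok and (shape_ok or not require_mono_16bit)
```

A file that fails the check can still go through the pydub conversion path; one that passes can skip it.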

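After removing the sleep() calls from the original code, recognition of a local recording became fast, taking roughly 0.7-1 s. The two sleep calls together had added about 2.5 s of blocking. I may develop streaming recognition later, so this is worth keeping in mind.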
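
Some arithmetic on why the per-chunk sleep adds up: at 16000 Hz, 16-bit mono, one second of audio is 32000 bytes, so a 640-byte chunk holds 20 ms of audio, and sleeping 0.01 s per chunk costs roughly half the clip's duration in waiting. The sketch below computes the chunk duration and also slices the data so that a trailing partial chunk is kept, which the `zip(*(iter(data),) * 640)` idiom silently drops. The function names are my own, and I have not verified how the service treats a shorter final chunk.

```python
def chunk_duration_ms(chunk_bytes=640, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Milliseconds of audio contained in one chunk of raw PCM."""
    bytes_per_second = sample_rate * bytes_per_sample * channels
    return 1000.0 * chunk_bytes / bytes_per_second

def iter_chunks(data, chunk_bytes=640):
    """Yield consecutive slices of `data`, including a shorter final one."""
    for offset in range(0, len(data), chunk_bytes):
        yield data[offset:offset + chunk_bytes]
```

In the send loop this could replace the zip idiom with `for chunk in iter_chunks(data): sr.send_audio(chunk)`, which also avoids building a tuple of 640 integers per chunk.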

Core code

import time
import json
from pydub import AudioSegment
from sdk_get_token import AliyunNlsToken
import nls
# Alibaba Cloud AccessKey ID and AccessKey Secret
ak_id = ''
ak_secret = ''
region = "cn-shenzhen"  # Alibaba Cloud region

# Create an AliyunNlsToken instance to obtain the access token
token_client = AliyunNlsToken(ak_id, ak_secret, region)
token, expire_time = token_client.get_token()  # token and its expiry time
print(f"Expiry time: {expire_time}, token: {token}")
URL = "wss://nls-gateway-cn-shenzhen.aliyuncs.com/ws/v1"  # WebSocket URL of the speech recognition gateway
TOKEN = token
APPKEY = ""  # the project's Appkey, from the Alibaba Cloud console

# Convert WAV to raw PCM (16000 Hz, mono, signed 16-bit little-endian)
def convert_wav_to_pcm(wav_file, pcm_file, target_sample_rate=16000):
    audio = AudioSegment.from_wav(wav_file)
    audio = audio.set_frame_rate(target_sample_rate).set_channels(1)
    audio.export(pcm_file, format="s16le")
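# Test class for speech transcription
class TestSt:
    def __init__(self, tid, test_file):
        self.__id = tid
        self.__test_file = test_file
        self.__pcm_file = "converted.pcm"  # path of the converted PCM file
        # Initialize up front so __test_run can return it even if no
        # sentence-end event has fired yet
        self.final_result = ""

        # Load and convert the file
        self.loadfile(self.__test_file)

        # Run recognition immediately and store the final result
        self.final_result = self.__test_run()

    def loadfile(self, filename):
        # Convert the WAV file to PCM in the required format (16000 Hz sample rate)
        convert_wav_to_pcm(filename, self.__pcm_file)
        # Read the PCM file as binary data
        with open(self.__pcm_file, "rb") as f:
            self.__data = f.read()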

    # Callbacks that handle the events raised during recognition
    def test_on_sentence_begin(self, message, *args):
        pass  # optional; print here when debugging

    def test_on_sentence_end(self, message, *args):
        print(f"Raw message: {message}")  # raw message, for debugging
        try:
            # Parse the message if it arrives as a string
            if isinstance(message, str):
                message_dict = json.loads(message)  # JSON string to dict
            else:
                message_dict = message  # already parsed, use as-is

            self.final_result = message_dict.get('payload', {}).get('result', '')
            print(f"Final recognition result: {self.final_result}")
        except json.JSONDecodeError:
            print("Failed to decode message, check the format.")

    def test_on_start(self, message, *args):
        pass  # optional; print here when debugging

    def test_on_error(self, message, *args):
        # Surface errors so an empty result is not left unexplained
        print(f"on_error: {message}")

    def test_on_close(self, *args):
        pass

    def test_on_result_chg(self, message, *args):
        pass

    def test_on_completed(self, message, *args):
        pass

    def __test_run(self):
        print(f"thread:{self.__id} start..")
        # Create the NlsSpeechTranscriber and register the callbacks
        sr = nls.NlsSpeechTranscriber(
                    token=TOKEN,
                    appkey=APPKEY,
                    on_sentence_begin=self.test_on_sentence_begin,
                    on_sentence_end=self.test_on_sentence_end,
                    on_start=self.test_on_start,
                    on_result_changed=self.test_on_result_chg,
                    on_completed=self.test_on_completed,
                    on_error=self.test_on_error,
                    on_close=self.test_on_close,
                    callback_args=[self.__id]
                )

        print(f"{self.__id}: session start")
        # Start the recognition session
        r = sr.start(aformat="pcm",
                    enable_intermediate_result=True,
                    enable_punctuation_prediction=True,
                    enable_inverse_text_normalization=True)

        # Send the audio in 640-byte chunks
        # (this zip idiom drops a trailing chunk shorter than 640 bytes)
        self.__slices = zip(*(iter(self.__data),) * 640)
        for i in self.__slices:
            sr.send_audio(bytes(i))
            #time.sleep(0.01)  # sleep to simulate real-time audio streaming

        sr.ctrl(ex={"test": "tttt"})
        #time.sleep(1)

        r = sr.stop()
        print(f"{self.__id}: sr stopped: {r}")

        # Return the final recognition result
        return self.final_result

if __name__ == "__main__":
    start = time.time()
    t = TestSt("thread0", "tests/本身城市就是相对来说比较安静.wav")
    result = t.final_result  # the final recognition result
    print(f"Final returned result: {result}")
    end = time.time()
    tol = end - start
    print(f"Total time: {tol}")
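
One thing to note about the code above: `final_result` is overwritten on every sentence-end event, so for audio containing several sentences only the last one survives. Below is a minimal sketch of accumulating them instead, assuming the message has the `payload.result` shape that the callback above already relies on. `SentenceCollector` is a hypothetical helper of mine, not part of the SDK.

```python
import json

class SentenceCollector:
    """Accumulate the text of every sentence-end message."""

    def __init__(self):
        self.sentences = []

    def on_sentence_end(self, message, *args):
        # The message may be a JSON string or an already-parsed dict.
        try:
            data = json.loads(message) if isinstance(message, str) else message
        except json.JSONDecodeError:
            return  # ignore malformed messages in this sketch
        text = data.get("payload", {}).get("result", "")
        if text:
            self.sentences.append(text)

    @property
    def text(self):
        # Joined without a separator; adjust if your language needs spaces.
        return "".join(self.sentences)
```

To use it, create one collector per session, pass `on_sentence_end=collector.on_sentence_end` when constructing the transcriber, and read `collector.text` after `sr.stop()` returns.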

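As usual, I tested with my favorite Kunkun dataset.

Summary

A possible next step is to reuse the recording part of my SenseVoice secondary-development code to build STT/ASR with low hardware requirements.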
