部署AI语音助手，实现本地Siri

andmore。

言京谅

2241人浏览 · 2024-09-18 11:05:03

言京谅 · 2024-09-18 11:05:03 发布

在这里插入图片描述

This is a project to deploy a local AI assistant. This combines FunAsr,Ollama and CosyVoice.

在这里插入图片描述

Preparation

FunAsr

You need to guarantee the WSL and Windows host can share the docker, this can be opened WSL in the docker desktop.

在这里插入图片描述

Follow this tutorial to deploy FunAsr: FunASR Realtime Transcribe Service.

Download workspace and run the local Asr server:

$ curl -O https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/shell/funasr-runtime-deploy-online-cpu-zh.sh
$ sudo bash funasr-runtime-deploy-online-cpu-zh.sh install --workspace ./funasr-runtime-resources

# Restart the container
$ sudo bash funasr-runtime-deploy-online-cpu-zh.sh restart

In the container, it will download models from the modelscope. After that you should see:

$ docker ps -a
$ docker exec -it <contianer ID> bash

# In container:
$ watch -n 0.1 "cat FunASR/runtime/log.txt | tail -n 10"
I20240915 21:57:20.544512    56 ct-transformer-online.cpp:21] Successfully load model from /workspace/models/damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727-onnx/model_quant.onnx
I20240915 21:57:20.647238    56 itn-processor.cpp:33] Successfully load model from /workspace/models/thuduj12/fst_itn_zh/zh_itn_tagger.fst
I20240915 21:57:20.648470    56 itn-processor.cpp:35] Successfully load model from /workspace/models/thuduj12/fst_itn_zh/zh_itn_verbalizer.fst
I20240915 21:57:20.648491    56 websocket-server-2pass.cpp:580] initAsr run check_and_clean_connection
I20240915 21:57:20.648558    56 websocket-server-2pass.cpp:583] initAsr run check_and_clean_connection finished
I20240915 21:57:20.648564    56 funasr-wss-server-2pass.cpp:565] decoder-thread-num: 16
I20240915 21:57:20.648567    56 funasr-wss-server-2pass.cpp:566] io-thread-num: 4
I20240915 21:57:20.648571    56 funasr-wss-server-2pass.cpp:567] model-thread-num: 2
I20240915 21:57:20.648572    56 funasr-wss-server-2pass.cpp:568] asr model init finished. listen on port:10095

VAD

We need to stop recording when the mic is not active. Here the technology is called VAD.

We also need to judge when the users input stop, not by the mic, but the results count from FunAsr server. The count of sentences(messages) sent from the ASR server increases when user is saying. When user stopped, the count will stop increasing. I set the latency to 2s, when there is no more sentences coming from the server, the thread of wait_end_and_send_to_ollama will block and wait the results from the Ollama server.

FIX: Note that when assistant is saying, the users input at the same time will be set as next input. An important feature is that user can interrupt the assistant.

Ollama

Download and install ollama in windows. After that, run ollama run llama3.1 or ollama run qwen:7b in the Powershell to download the model. Then start the ollama server:

$ ollama serve
...
time=2024-09-15T18:25:35.068+08:00 level=INFO source=images.go:753 msg="total blobs: 5"
time=2024-09-15T18:25:35.069+08:00 level=INFO source=images.go:760 msg="total unused blobs removed: 0"
time=2024-09-15T18:25:35.070+08:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.10)"
time=2024-09-15T18:25:35.070+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v6.1]"
time=2024-09-15T18:25:35.070+08:00 level=INFO source=gpu.go:200 msg="looking for compatible GPUs"
time=2024-09-15T18:25:35.251+08:00 level=INFO source=gpu.go:292 msg="detected OS VRAM overhead" id=GPU-e2aae3a3-4bf5-4f72-0920-b864cb97001c library=cuda compute=8.9 driver=12.6 name="NVIDIA GeForce RTX 4060" overhead="124.9 MiB"
time=2024-09-15T18:25:35.259+08:00 level=INFO source=types.go:107 msg="inference compute" id=GPU-e2aae3a3-4bf5-4f72-0920-b864cb97001c library=cuda variant=v12 compute=8.9 driver=12.6 name="NVIDIA GeForce RTX 4060" total="8.0 GiB" available="6.9 GiB"

You can easily follow the Ollama PyPi tutorial to use ollama APIs.

import ollama

# Response streaming can be enabled by setting stream=True, modifying function
# calls to return a Python generator where each part is an object in the stream.
stream = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

CosyVoice

Follow the tutorial: CosyVoice

Cuda 11.8 torch and torchaudio:

$ pip install torch==2.0.1 --index-url https://download.pytorch.org/whl/cu118

If you want to clone audio, transform audio file recorded from windows recorder:

This is a very funny feature provided by CosyVoice, for example, you can clone Trump’s voice. This is already realized several years ago, but here it supports Chinese and uses LLM. Note that you should follow the law and privacy policy, it is very significant.
```
$ ffmpeg -i input.m4a output.wav
```
ONNX Runtime Issue: onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcufft.so.10: cannot open shared object file: No such file or directory
```
$ pip3 install onnxruntime-gpu==1.18.1 -i https://mirrors.aliyun.com/pypi/simple/
$ pip3 install onnxruntime==1.18.1 -i https://mirrors.aliyun.com/pypi/simple/
```

Glibc issue: version GLIBCXX_3.4.29 not found

$ find ~ -name "libstdc++.so.6*"
$ strings .conda/envs/cosyvoice/lib/libstdc++.so.6 | grep -i "glibcxx"
$ sudo cp .conda/envs/cosyvoice/lib/libstdc++.so.6.0.33 /lib/x86_64-linux-gnu
$ sudo rm /usr/lib/x86_64-linux-gnu/libstdc++.so.6
$ sudo ln -s /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33 /usr/lib/x86_64-linux-gnu/libstdc++.so.6

Server and client

# 安装依赖
$ cd runtime/python/grpc && python3 -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. cosyvoice.proto

# 将文字请求发送至server，并返回语音文件 demo.wav
$ python3 runtime/python/grpc/server.py --port 50000 --max_conc 4 --model_dir pretrained_models/CosyVoice-300M && sleep infinit
$ python3 runtime/python/grpc/client.py --port 50000 --mode sft

# Fast API
$ python3 runtime/python/fastapi/server.py --port 50000 --model_dir pretrained_models/CosyVoice-300M && sleep infinity
$ python3 runtime/python/fastapi/client.py --port 50000 --mode sft

Play the audio in python

After get the .wav files from the server, you can use simpleaudio python library to play the audio.

import simpleaudio as sa
 
# Load audio file
filename = 'demo.wav'
wave_obj = sa.WaveObject.from_wave_file(filename)
 
# Play the audio
play_obj = wave_obj.play()
play_obj.wait_done()

TODO

Streaming…

Deploy

Script:

# -*- encoding: utf-8 -*-
import os
import websockets, ssl
import asyncio
import argparse
import json
from ollama import Client
import logging
from multiprocessing import Process

logging.basicConfig(level=logging.ERROR)

parser = argparse.ArgumentParser()
parser.add_argument("--host", type=str, default="localhost", required=False, help="host ip, localhost, 0.0.0.0")
parser.add_argument("--port", type=int, default=10095, required=False, help="grpc server port")
parser.add_argument("--chunk_size", type=str, default="5, 10, 5", help="chunk")
parser.add_argument("--chunk_interval", type=int, default=10, help="chunk")
parser.add_argument("--hotword", type=str, default="", help="hotword file path, one hotword perline (e.g.:阿里巴巴 20)")
parser.add_argument("--words_max_print", type=int, default=10000, help="chunk")
parser.add_argument("--use_itn", type=int, default=1, help="1 for using itn, 0 for not itn")
parser.add_argument("--powershell", type=int, default=0, help="work under powershell")
parser.add_argument("--llamahost", type=str, default="0.0.0.0:11434", help="Ollama server")
parser.add_argument("--llm_model", type=str, default="llama3.1", help="Ollama model")

args = parser.parse_args()
args.chunk_size = [int(x) for x in args.chunk_size.split(",")]

msg_cnt = 0
msg_end = False
text_print = ""
text_print_2pass_online = ""
text_print_2pass_offline = ""
messages = []

async def record_microphone():
    is_finished = False
    import pyaudio
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 16000
    chunk_size = 60 * args.chunk_size[1] / args.chunk_interval
    CHUNK = int(RATE / 1000 * chunk_size)

    p = pyaudio.PyAudio()

    stream = p.open(format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK)
    # hotwords
    fst_dict = {}
    hotword_msg = ""
    if args.hotword.strip() != "":
        f_scp = open(args.hotword, encoding='utf-8')
        hot_lines = f_scp.readlines()
        for line in hot_lines:
            words = line.strip().split(" ")
            if len(words) < 2:
                print("Please checkout format of hotwords")
                continue
            try:
                fst_dict[" ".join(words[:-1])] = int(words[-1])
            except ValueError:
                print("Please checkout format of hotwords")
        hotword_msg=json.dumps(fst_dict)

    use_itn=True
    if args.use_itn == 0:
        use_itn=False
    
    message = json.dumps({"mode": "2pass", "chunk_size": args.chunk_size, "chunk_interval": args.chunk_interval,
                          "wav_name": "microphone", "is_speaking": True, "hotwords":hotword_msg, "itn": use_itn})
    await websocket.send(message)
    
    if args.powershell:
        os.system('powershell -command "Clear-Host"')
    else:
        os.system('clear')

    print('---------------------------------------------------------------')
    print('Welcome to use local AI assistant. You can say: Tell me a joke!')
    print('---------------------------------------------------------------')
    print("User: ")

    while True:
        data = stream.read(CHUNK)
        message = data
        await websocket.send(message)
        await asyncio.sleep(0.005)

async def wait_end_and_send_to_ollama():
    while True:
        global msg_cnt, msg_end, text_print, text_print_2pass_online, text_print_2pass_offline

        cur_cnt = msg_cnt
        await asyncio.sleep(2)
        if (msg_cnt == cur_cnt and text_print):
            prompt = text_print
            assistant_log = ""

            # Clear ASR texts
            text_print_2pass_online = ""
            text_print_2pass_offline = ""
            text_print = ""

            print("\n\nAssistant: ")
            messages.append({'role': 'user', 'content': prompt})
            client = Client(host='http://' + args.llamahost)
            stream = client.chat(model=args.llm_model, messages=messages, stream=True)
            for chunk in stream:
                assistant_log += chunk['message']['content']
                print(chunk['message']['content'], end='', flush=True)
            messages.append({'role': 'assistant', 'content': assistant_log})
            assistant_log = ""
            print("\n-------------------------------------------------------------")
            print("User: ")


async def message(id):
    global websocket

    try:
       while True:
            global msg_cnt, msg_end, text_print, text_print_2pass_online, text_print_2pass_offline

            meg = await websocket.recv()
            msg_cnt += 1

            meg = json.loads(meg)
            text = meg["text"]

            if 'mode' not in meg:
                continue
            else:
                if meg["mode"] == "2pass-online":
                    text_print_2pass_online += "{}".format(text)
                    text_print = text_print_2pass_offline + text_print_2pass_online
                else:
                    text_print_2pass_online = ""
                    text_print = text_print_2pass_offline + "{}".format(text)
                    text_print_2pass_offline += "{}".format(text)
                text_print = text_print[-args.words_max_print:]

                # Fix: Delete the laster punctuation mark in the laster sentence
                if (text_print[0] in ["。", "，", "？", "！"]):
                    text_print = text_print[1:]

                print("\r" + text_print, end='')

    except Exception as e:
            print("Exception:", e)

async def ws_client(id, chunk_begin, chunk_size):
    chunk_begin=0
    chunk_size=1
    global websocket
    
    for i in range(chunk_begin,chunk_begin+chunk_size):

        ssl_context = ssl.SSLContext()
        ssl_context.check_hostname = False
        ssl_context.verify_mode = ssl.CERT_NONE
        uri = "wss://{}:{}".format(args.host, args.port)

        print("connect to", uri)
        async with websockets.connect(uri, subprotocols=["binary"], ping_interval=None, ssl=ssl_context) as websocket:
            task1 = asyncio.create_task(record_microphone())
            task2 = asyncio.create_task(message(str(id)+"_"+str(i))) #processid+fileid
            task3 = asyncio.create_task(wait_end_and_send_to_ollama())
            await asyncio.gather(task1, task2, task3)
    exit(0)

def one_thread(id, chunk_begin, chunk_size):
    asyncio.get_event_loop().run_until_complete(ws_client(id, chunk_begin, chunk_size))
    asyncio.get_event_loop().run_forever()

if __name__ == '__main__':
    p = Process(target=one_thread, args=(0, 0, 0))
    p.start()
    p.join()
    print('end')

Windows

Run the script:

pip3 install websockets pyaudio ollama
python3 funasr_client.py --host "127.0.0.1" --port 10095 --hotword hotword.txt --powershell 1 --llm_mode llama3.1 --llamahost "localhost:11434"

在这里插入图片描述

WSL

If ollama runs in the Windows host, you should enable wsl to access it in LAN (For other devices, this should also be enabled). In Powershell:

$ [Environment]::SetEnvironmentVariable('OLLAMA_HOST', '0.0.0.0:11434', 'Process')
$ [Environment]::SetEnvironmentVariable('OLLAMA_ORIGINS', '*', 'Process')
$ ollama serve

Run ipconfig in Poweshell to get the IPv4 of Host, for example: 172.20.10.2.
The audio may not work due to the audio card. A way to solve the problem:
```
$ sudo apt-get install python3-pyaudio pulseaudio portaudio19-dev
```

Run the scripts:

$ pip3 install websockets pyaudio ollama
$ python3 funasr_client.py --host "127.0.0.1" --port 10095 --hotword hotword.txt --llamahost "172.20.10.2:11434" --llm_model "qwen:7b"

References

2048 AI社区

有“AI”的1024 = 2048，欢迎大家加入2048 AI社区

更多推荐

【大白话】浅析Transformer的自注意力机制：从“小纸条”到改变AI的核心魔法

在Transformer模型诞生之前，自然语言处理（NLP）领域主要由循环神经网络（RNN）及其变体（如LSTM）主导。顺序处理，难以并行：必须一个字一个字地处理序列，计算速度慢。长距离依赖问题：当句子很长时，模型容易“忘记”开头的信息。比如在句子“我出生在法国，……，所以我流利地说法语”中，RNN很难建立“法国”和“法语”之间的遥远联系。Attention机制的初衷，就是解决“长距离依赖”问题。

2048 AI社区

AI算力革命2025：从百亿烧钱竞赛到盈利破局

2025年AI行业迎来关键转折，训练成本逼近百亿美元，推理日耗达千万美元。行业从"参数竞赛"转向"成本控制"，资本更看重算力投入产出比。五大创新范式应运而生：小模型逆袭、智能路由优化、全域缓存体系、专用芯片突破和精准定价策略。垂直场景的小模型表现优异，专用芯片效率提升15倍，95%请求实现零推理响应。AI从业者角色重塑，成本优化师成为稀缺人才。行业共识表明，

2048 AI社区

每日AI学习笔记----Qwen3-Omni

最近作者开始上班了~上班两个多月，终于也是找到一点点工作的节奏~~。也深感到自己的不足，常在思考，选择这个行业是否正确，但是既然选择了，那么去深入也是乐趣所在。没有什么比静下心来学习能让你更踏实。浮躁了就去学习，想谈恋爱了就去学习，烦了就去学习吧，孩子。因此作者决定只要工作不加班到很晚，每天都要坚持至少一小时的AI新知识和技术的学习。