Deploying and Using FunASR Locally

Posted on 2025-11-04 | Category: AI | Word count: 3.2k | Reading time ≈ 8 min

A brief dabble in speech recognition

Preface

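The original plan was just to add simple recognition of user voice input to a project, nothing more.
Then the scope kept ballooning, and I gradually forgot what I had set out to do...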

Installation

pip install funasr modelscope huggingface_hub torch torchaudio onnx onnxconverter_common

Basic Usage

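Because you pass in model names, any model that isn't found at runtime is downloaded automatically, so it's about as foolproof as it gets: just follow along.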

from funasr import AutoModel

# Use models that support timestamps
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc-c",
    spk_model="cam++",  # optional; only needed for speaker diarization
    disable_update=True
)

# Generate results with timestamps
res = model.generate(
    input="C:\\Users\\X.J\\Desktop\\live2d_ai\\3.wav",
    batch_size_s=60,
    return_timestamp=True  # return timestamps; set to False for models without this feature
)

print(res)

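Run it, wait for the models to finish downloading, and it just works...
The first time I tried it I was honestly shocked: is it really this foolproof?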

Using Timestamps

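At the moment, the only languages FunASR seems to support fully are Chinese and English. Why do I say that?
Japanese models do exist, and recognition itself works fine, but I looked around ModelScope and none of the ones I found support timestamps!
Fortunately, at least some of the more mature Chinese and English models do.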

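So what is a timestamp?
Anyone who listens to music, or to other audio that comes with subtitles, has seen one: it is essentially a timeline. In other words, some models can not only recognize speech accurately but also report the start and end time of each sentence within the audio.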
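To make the idea concrete, here is a minimal sketch of sentence-level timestamps as plain Python data. The entries are hand-written sample data, not real model output; they are assumed to follow the same `text` / `start` / `end` layout (offsets in milliseconds) that the conversion code in this post reads from `sentence_info`. The helper name `sentence_durations` is made up for this illustration.

```python
# Hand-written sample data, assumed to mirror the shape of `sentence_info`
# entries: the text plus start/end offsets in milliseconds.
sample_sentences = [
    {"text": "从前有个可爱的小姑娘，", "start": 130, "end": 2250},
    {"text": "谁见了都喜欢。", "start": 2870, "end": 3870},
]


def sentence_durations(sentences):
    """Return (text, duration in seconds) for each timestamped sentence."""
    return [(s["text"], (s["end"] - s["start"]) / 1000) for s in sentences]


for text, seconds in sentence_durations(sample_sentences):
    print(f"{seconds:.2f}s  {text}")
```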

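The rest of this post uses VTT, a lyrics/subtitle (text) file format commonly paired with certain kinds of audio, as the example: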

A VTT file looks roughly like this:

WEBVTT

1
00:00:00.130 --> 00:00:02.250
从前有个可爱的小姑娘，

2
00:00:02.870 --> 00:00:03.870
谁见了都喜欢。

3
00:00:04.270 --> 00:00:06.250
但最喜欢她的是他的奶奶，

With a VTT file, the matching audio or video can display lyrics or subtitles in sync.
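As a way to see the structure of the format, the following is a rough sketch of reading such a file back into (start, end, text) tuples. It assumes the simple layout shown above: a `WEBVTT` header, blank-line-separated cues, and full `HH:MM:SS.mmm` timings with no cue settings after them. The names `TIMING_RE`, `to_ms` and `parse_vtt` are invented for this example, and this is not a complete WebVTT parser.

```python
import re

# Matches a cue timing line such as "00:00:00.130 --> 00:00:02.250".
# Assumes the full HH:MM:SS.mmm form; WebVTT also allows omitting the
# hours and appending cue settings, which this sketch does not handle.
TIMING_RE = re.compile(
    r"(\d{2,}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2,}):(\d{2}):(\d{2})\.(\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert hour/minute/second/millisecond strings to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_vtt(content):
    """Parse VTT text into a list of (start_ms, end_ms, text) tuples."""
    cues = []
    # Cues are separated by blank lines; the header block has no timing
    # line, so it is skipped automatically.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING_RE.fullmatch(line.strip())
            if match:
                parts = match.groups()
                text = "\n".join(lines[i + 1:])
                cues.append((to_ms(*parts[:4]), to_ms(*parts[4:]), text))
                break
    return cues


sample = """WEBVTT

1
00:00:00.130 --> 00:00:02.250
从前有个可爱的小姑娘，

2
00:00:02.870 --> 00:00:03.870
谁见了都喜欢。
"""

for start, end, text in parse_vtt(sample):
    print(start, end, text)
```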

Since we now have a speech recognition model that returns timestamps, all that's left is converting its output into this text format:


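def format_vtt_time(milliseconds):
    """Convert milliseconds to the VTT time format: HH:MM:SS.mmm"""
    # Integer arithmetic avoids float rounding ever producing "60.000" seconds
    total_seconds, millis = divmod(int(round(milliseconds)), 1000)
    minutes, secs = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"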

def generate_vtt_file(recognition_results, output_path="lyrics.vtt"):
    """Write the sentence-level timestamps in the results to a VTT file."""
    vtt_content = ["WEBVTT", ""]  # VTT file header

    cue_index = 1

    for result in recognition_results:
        if 'sentence_info' in result:
            for sentence in result['sentence_info']:
                text = sentence['text'].strip()
                start_ms = sentence['start']
                end_ms = sentence['end']

                # Format the start and end times
                start_time = format_vtt_time(start_ms)
                end_time = format_vtt_time(end_ms)

                # Append one cue: index, timing line, text
                vtt_content.append(f"{cue_index}")
                vtt_content.append(f"{start_time} --> {end_time}")
                vtt_content.append(text)
                vtt_content.append("")  # blank line separates cues

                cue_index += 1

    # Write the file
    try:
        with open(output_path, 'w', encoding='utf-8') as f:
            f.write('\n'.join(vtt_content))
        print(f"VTT lyrics file generated: {output_path}")
        return True
    except Exception as e:
        print(f"Error while generating VTT file: {e}")
        return False

# Process the recognition results and generate the VTT file
if res and len(res) > 0:
    print("Recognition details:")
    for i, result in enumerate(res):
        print(f"\n--- Result segment {i+1} ---")

        if 'sentence_info' in result:
            for sentence in result['sentence_info']:
                text = sentence['text']
                start_time = sentence['start']
                end_time = sentence['end']

                print(f"Sentence: {text}")
                print(f"Time range: {format_vtt_time(start_time)} --> {format_vtt_time(end_time)}")

    # Generate the VTT file
    generate_vtt_file(res)

    # Show the VTT file contents in the console
    print("\n=== Generated VTT file contents ===")
    try:
        with open("lyrics.vtt", 'r', encoding='utf-8') as f:
            print(f.read())
    except FileNotFoundError:
        print("VTT file not found")
else:
    print("No recognition result returned")
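Since the model above is loaded with an optional `spk_model`, one possible extension is to label each cue with its speaker. The sketch below is only an illustration under assumptions: it assumes each sentence entry may carry an integer `spk` field (I believe FunASR adds one when a speaker model is configured, but check your own output), and it uses the WebVTT voice tag `<v Name>` to mark the speaker. The sample data is hand-written, and `ms_to_vtt` and `cues_with_speakers` are hypothetical helper names.

```python
def ms_to_vtt(ms):
    """Format integer milliseconds as HH:MM:SS.mmm."""
    seconds, millis = divmod(int(ms), 1000)
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def cues_with_speakers(sentences):
    """Build VTT text, adding a voice tag to cues that have a 'spk' field."""
    lines = ["WEBVTT", ""]
    for index, sentence in enumerate(sentences, start=1):
        text = sentence["text"].strip()
        # Assumption: 'spk' is an integer speaker id; entries without it
        # are written as plain cues.
        if "spk" in sentence:
            text = f"<v Speaker {sentence['spk']}>{text}"
        timing = f"{ms_to_vtt(sentence['start'])} --> {ms_to_vtt(sentence['end'])}"
        lines += [str(index), timing, text, ""]
    return "\n".join(lines)


# Hand-written sample entries, not real model output.
sample = [
    {"text": "从前有个可爱的小姑娘，", "start": 130, "end": 2250, "spk": 0},
    {"text": "谁见了都喜欢。", "start": 2870, "end": 3870, "spk": 1},
]

print(cues_with_speakers(sample))
```

Players that understand voice tags can style or display the speaker name; others generally just show the cue text.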