MARS5: A New Speech Model with Innovative Voice Capabilities

Introduction

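MARS5 is a state-of-the-art TTS model developed by CAMB.AI that runs on a two-stage AR-NAR pipeline. Given as little as 5 seconds of audio and a snippet of text, it can generate high-quality speech across a wide range of prosodic scenarios.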

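The model takes text and a reference audio clip as input and produces natural-sounding speech even when the prosody varies widely. It performs particularly well in prosodically demanding scenarios such as sports commentary and anime. What sets MARS5 apart is the novel design of its NAR component; details are in the Architecture document.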

MARS5 differs from earlier TTS models in several ways. Where earlier models focused mainly on plain text-to-speech conversion, MARS5 is built to handle complex prosody. Compared with Google's Tacotron 2, for example, MARS5 can generate high-quality speech from a shorter reference clip.

Key Features

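AR-NAR pipeline: a two-stage pipeline that takes text and reference audio as input and generates high-quality speech.
Prosody control: prosody can be steered naturally with simple textual cues such as commas and capitalization.
Deep clone: supplying the transcript of the reference audio yields a higher-quality cloned voice.
Versatile use: suitable for a range of scenarios, including sports commentary and anime.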
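
As an illustration of the prosody-control and deep-clone features above, the sketch below prepares inputs for a synthesis call. The helper names `emphasize` and `clone_kwargs` are hypothetical and not part of MARS5; only the `deep_clone` keyword comes from the usage example in this post. It assumes that capitalizing a word is how one would ask for emphasis, and that a missing transcript means falling back to a shallow clone.

```python
import re
from typing import Optional


def emphasize(text: str, word: str) -> str:
    """Upper-case every whole-word occurrence of `word` (case-insensitive)."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return pattern.sub(lambda m: m.group(0).upper(), text)


def clone_kwargs(ref_transcript: Optional[str]) -> dict:
    """Request a deep clone only when a non-empty reference transcript exists."""
    has_transcript = bool(ref_transcript and ref_transcript.strip())
    return {"deep_clone": has_transcript}


# Capitalization and commas are the textual cues mentioned above.
prompt = emphasize("And he scores, what a goal!", "goal")
settings = clone_kwargs("<transcript of the reference audio>")
```

The returned dictionary is meant to be merged into the keyword arguments of the inference config, so the same script can run with or without a transcript.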

Usage

Installation

Install the required libraries:

pip install --upgrade torch torchaudio librosa vocos encodec
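
Before loading the model it can help to confirm that the install succeeded. The following is a minimal sketch using only the standard library; the helper `missing_packages` is a hypothetical name, and it assumes each package is imported under the same name it is installed with.

```python
from importlib.util import find_spec

# Package names taken from the pip command above.
REQUIRED = ["torch", "torchaudio", "librosa", "vocos", "encodec"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```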

Using the model

# Load the model
import torch, librosa

mars5, config_class = torch.hub.load('Camb-ai/mars5-tts', 'mars5_english', trust_repo=True)

# Load the reference audio and its transcript
wav, sr = librosa.load('<path to arbitrary 24kHz waveform>.wav', sr=mars5.sr, mono=True)
wav = torch.from_numpy(wav)
ref_transcript = "<transcript of the reference audio>"

# Generate speech
# deep_clone=True uses the reference transcript for a higher-quality clone
deep_clone = True
cfg = config_class(deep_clone=deep_clone, rep_penalty_window=100, top_k=100, temperature=0.7, freq_penalty=3)

ar_codes, output_audio = mars5.tts("The quick brown rat.", wav, ref_transcript, cfg=cfg)
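
The call above returns the generated audio, but the post does not show how to store it. Here is a minimal sketch, assuming `output_audio` is a one-dimensional float waveform in the range [-1, 1] sampled at `mars5.sr` (24 kHz here). The helper `save_wav` is hypothetical and uses only the Python standard library, so it takes a plain sequence of floats rather than a tensor.

```python
import struct
import wave


def save_wav(samples, path, sample_rate=24000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    Returns the number of frames written.
    """
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(path, "wb") as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))
    return len(frames) // 2
```

With the objects from the example above, this might be called as `save_wav(output_audio.tolist(), "output.wav", sample_rate=mars5.sr)`.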

MARS5 GitHub repository

https://github.com/Camb-ai/MARS5-TTS?utm_source=pytorchkr&ref=pytorchkr

Further reading / Related projects

TransFusion

https://github.com/RF5/transfusion-asr?utm_source=pytorchkr&ref=pytorchkr

Multinomial diffusion

https://github.com/ehoogeboom/multinomial_diffusion?utm_source=pytorchkr&ref=pytorchkr

Mistral-src

https://github.com/mistralai/mistral-inference?utm_source=pytorchkr&ref=pytorchkr

minbpe

https://github.com/karpathy/minbpe?utm_source=pytorchkr&ref=pytorchkr

gemelo-ai's encodec in Vocos

https://github.com/gemelo-ai/vocos?utm_source=pytorchkr&ref=pytorchkr
