Generative Audio: It's Just Token Prediction
Dec 30, 2025 • 20 min read
Generating images means predicting values on a 2D spatial grid. Generating music means predicting a 1D temporal sequence at tens of thousands of discrete samples per second. At CD quality (44.1 kHz), a 30-second piece of music is about 1.3 million samples — far too long for a standard transformer to model directly. The breakthrough came from treating audio generation as a compressed token prediction problem: convert the raw waveform into a compact discrete token representation using a neural audio codec, then train a transformer to predict those tokens. MusicGen (part of Meta's AudioCraft suite) and Stable Audio both use this approach, achieving high-quality music generation from text descriptions.
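The sequence-length arithmetic is worth making explicit. A quick sketch, using the sample rate above and the codec token rate covered in the next section:

```python
# Back-of-envelope: why raw waveforms are intractable for a transformer.
SAMPLE_RATE = 44_100          # CD-quality samples per second
DURATION_S = 30

raw_len = SAMPLE_RATE * DURATION_S
print(raw_len)                # 1323000 raw samples to model

# After a neural codec: 4 codebooks x 50 frames/sec = 200 tokens/sec
codec_tokens_per_s = 4 * 50
codec_len = codec_tokens_per_s * DURATION_S
print(codec_len)              # 6000 tokens -- comfortably within context

print(raw_len / codec_len)    # ~220x shorter sequence
```

Six thousand tokens is an ordinary language-modeling workload; 1.3 million is not.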
1. The Architecture: EnCodec Neural Audio Codec
Meta's EnCodec is the key enabling technology. It's a convolutional encoder-decoder (similar to a VAE, but with discrete quantization) that compresses raw PCM audio into parallel streams of discrete tokens using Residual Vector Quantization (RVQ). For the variant MusicGen uses:
- Input: 32 kHz mono audio (32,000 samples/second)
- Output: 4 codebooks × 50 frames/second = 200 tokens/second
- Compression: 160x fewer discrete values than raw samples (each token indexes a 2048-entry codebook)
- Quality: perceptually close reconstruction — trained with reconstruction, perceptual, and adversarial losses
- Why RVQ? The first codebook captures broad harmonic structure; each subsequent codebook quantizes the residual left by the previous ones, capturing progressively finer detail — a hierarchical audio representation
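The residual mechanism is easy to demonstrate on toy data. A minimal NumPy sketch — the random codebooks here are purely illustrative stand-ins for EnCodec's learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_codebooks = 8, 16, 4

# Stand-in for learned codebooks: one lookup table per quantizer stage
codebooks = rng.normal(size=(n_codebooks, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Quantize x stage by stage: each stage encodes the residual
    left over from the previous stages (coarse first, detail later)."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        idx = np.argmin(((cb - residual) ** 2).sum(axis=1))  # nearest code
        tokens.append(int(idx))
        residual = residual - cb[idx]     # next stage sees what's left
    return tokens

def rvq_decode(tokens, codebooks):
    # Reconstruction is just the sum of the selected codewords
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

x = rng.normal(size=dim)
tokens = rvq_encode(x, codebooks)
x_hat = rvq_decode(tokens, codebooks)
print(tokens)                             # 4 integers, one per codebook
print(np.linalg.norm(x - x_hat))          # residual reconstruction error
```

With learned (rather than random) codebooks, each added stage shrinks the residual error — which is exactly the coarse-to-fine hierarchy described above.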
MusicGen's transformer predicts these 4 codebook streams simultaneously using a "delay" interleaving pattern: codebook k is shifted right by k timesteps, so the model emits one token per codebook at every autoregressive step and codebook k at a given frame can condition on codebooks 0 to k−1 for that same frame, which were generated in earlier steps.
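The delay pattern is easiest to see laid out on a grid. A toy sketch — the token labels `c{codebook}t{frame}` are illustrative, and `.` stands in for the special padding token:

```python
# Each ROW is one codebook stream, shifted right by its codebook index.
# Each COLUMN is one transformer step, which emits 4 tokens at once.
n_codebooks, n_frames = 4, 6
PAD = "."

grid = [
    [PAD] * k
    + [f"c{k}t{t}" for t in range(n_frames)]
    + [PAD] * (n_codebooks - 1 - k)
    for k in range(n_codebooks)
]
for row in grid:
    print(" ".join(row))
# c0t0 c1t0 appear in adjacent columns: frame 0's fine-detail tokens
# trail its coarse token by one step each.
```

Note the total step count is only `n_frames + n_codebooks - 1` (9 here), versus `n_frames * n_codebooks` (24) if the four streams were fully flattened — the main reason this pattern makes generation fast.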
2. Running MusicGen Locally
pip install audiocraft
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torch
# Model variants:
# musicgen-small (300M): Fast, ~6GB VRAM, lower quality
# musicgen-medium (1.5B): Good quality, ~10GB VRAM
# musicgen-large (3.3B): Best quality, ~16GB VRAM, 2-3 min for 30s audio
# musicgen-stereo-large: same but stereo output (double VRAM)
# musicgen-melody: melody conditioning variant
model = MusicGen.get_pretrained('facebook/musicgen-large')
# Generation parameters
model.set_generation_params(
    duration=30,        # seconds of audio to generate
    temperature=1.0,    # 0.5=conservative, 1.0=default, 1.2=experimental
    top_k=250,          # top-k sampling for each token step
    top_p=0.0,          # top-p (nucleus) sampling (0=disabled when using top_k)
    cfg_coef=3.0,       # classifier-free guidance scale — higher=follows prompt more
                        # 1.0=no guidance, 3.0=standard, 5.0=very strict
    extend_stride=18,   # stride (seconds) when generating >30s audio (overlap handling)
)
# Generate from text prompts (batch processing supported)
descriptions = [
    "Upbeat synthwave with driving arpeggiated bass, punchy kick drum, 128 BPM, retro 80s sound",
    "Cinematic orchestral score, dramatic strings, rising tension, film score, Hans Zimmer style",
    "Lo-fi hip hop beat, mellow jazz chords, soft drums, rain ambience, relaxing and warm",
]
# Returns tensor of shape [batch, channels, samples]
wav_tensor = model.generate(descriptions)
# Save each generated track
for idx, wav in enumerate(wav_tensor):
    # audio_write handles normalization and format conversion
    audio_write(
        f'generated_track_{idx}',
        wav.cpu(),                  # move to CPU for I/O
        model.sample_rate,          # 32000 Hz for musicgen-large
        strategy="loudness",        # normalize loudness to -14 LUFS (streaming standard)
        loudness_compressor=True,   # light dynamic compression
        format="wav",               # or "mp3" (requires ffmpeg)
    )
print(f"Generated {len(descriptions)} tracks!")

3. Melody Conditioning: Transform Any Audio Into New Arrangements
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
# musicgen-melody conditions generation on a chromagram (pitch-class energy
# over time) extracted from the input audio.
# The generated music will follow the same melody but with different instruments.
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=20, temperature=1.0, cfg_coef=3.0)
# Load your input melody (any format torchaudio supports)
melody_waveform, sample_rate = torchaudio.load("reference_melody.mp3")
# Tile for batch processing (or use different melody per description)
melody_wavs = melody_waveform.unsqueeze(0).expand(3, -1, -1) # 3 descriptions
descriptions = [
    "Orchestral arrangement, full string orchestra, dramatic and cinematic",
    "Jazz quartet arrangement, piano, upright bass, trumpet, brushed drums",
    "Electronic rearrangement, synthesizers, driving beat, club music",
]
# Generate while conditioning on the melody from reference_melody.mp3
wav = model.generate_with_chroma(
    descriptions=descriptions,
    melody_wavs=melody_wavs,          # your reference melody input
    melody_sample_rate=sample_rate,   # must specify SR of input
    progress=True,                    # show tqdm progress bar
)
for idx, one_wav in enumerate(wav):
    # Index-based names are safer than slicing the description into a filename
    audio_write(f'arrangement_{idx}', one_wav.cpu(), model.sample_rate)
# Use cases:
# - Hum a melody into your phone → generate full orchestration
# - Reinterpret a reference track in a completely different style/genre
# - Create background music variations that follow your edited melody

4. Music Generation Model Comparison
| Model | Provider | Access | Strengths |
|---|---|---|---|
| MusicGen-Large | Meta | Open source (local) | Best open model; text + melody conditioning; batch generation |
| Stable Audio Open | Stability AI | Open source (local) | Duration-conditioned (specify exact length); high quality stereo |
| Suno v3/v4 | Suno AI | Cloud API ($) | Best overall quality; handles vocals; production-ready sound |
| Udio | Udio | Cloud API ($) | Excellent for specific genres; strong vocal capability |
| AudioCraft (all) | Meta | Open source | Suite: MusicGen + AudioGen (SFX) + EnCodec + MAGNeT |
| Stable Audio 2.0 | Stability AI | Cloud API | 3-minute tracks; high quality; commercial licensing |
Frequently Asked Questions
Can I use MusicGen output commercially?
MusicGen-Large's weights are released under CC BY-NC 4.0 — non-commercial only (the AudioCraft code itself is MIT-licensed). For commercial use with open-weight models, check Stable Audio Open, whose Stability AI community license permits some commercial use within its terms. Suno and Udio provide commercial licensing tiers in their paid plans. Always check the current license terms before using AI-generated music commercially, as these are evolving rapidly.
How do I improve prompt following for specific musical elements?
Be explicit about: BPM (beats per minute), key signature, specific instruments rather than genres, mood descriptors, reference artists (for style guidance — results vary), and era/decade for sonic character. The CFG scale (cfg_coef) significantly affects how strictly the model follows the prompt: increase it from 3.0 toward 5.0 for more literal prompt interpretation, but at higher values generation quality often decreases. Common pattern: generate at cfg_coef=3 with temperature=1.0 for exploration, then refine with cfg_coef=4+ for the final version.
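That explore-then-refine pattern can be scripted as a small parameter sweep. A sketch of the grid-building half — the resulting dicts are meant to be passed to `model.set_generation_params(**cfg)` with a loaded MusicGen model, which is not done here:

```python
from itertools import product

def cfg_sweep(cfg_coefs=(3.0, 4.0, 5.0), temperatures=(0.8, 1.0)):
    """Build generation-parameter dicts for an explore-then-refine pass.
    Short 10s clips keep the sweep cheap; regenerate the winner longer."""
    return [
        {"duration": 10, "cfg_coef": c, "temperature": t, "top_k": 250}
        for c, t in product(cfg_coefs, temperatures)
    ]

configs = cfg_sweep()
for cfg in configs:
    print(cfg)
# Usage with a loaded model (not run here):
#   model.set_generation_params(**cfg)
#   wav = model.generate(["lo-fi hip hop beat, mellow jazz chords"])
```

Listening across the grid makes the CFG/quality trade-off audible directly, rather than guessing a single setting up front.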
Conclusion
AI music generation in 2025 is genuinely useful: local MusicGen-Large produces production-quality background music for video content, game soundtracks, and creative exploration. The neural codec approach (EnCodec → transformer token prediction → EnCodec decoding) has proven to be the right architecture for audio generation, enabling everything from 10-second loops to 3-minute structured pieces with consistent musical style. For applications requiring top commercial quality, Suno v4 leads; for open-source control and local deployment, MusicGen-Large is the standard.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.