Generative Audio: It's Just Token Prediction
Dec 30, 2025 • 20 min read
Generating images means predicting values on a 2D spatial grid. Generating music means predicting a 1D temporal sequence at tens of thousands of discrete samples per second. At CD quality (44.1 kHz), a 30-second piece of music is about 1.3 million samples — far too long for a standard transformer to model directly. The breakthrough came from treating audio generation as a compressed token prediction problem: convert the raw waveform into a compact discrete token representation using a neural audio codec, then train a transformer to predict those tokens. MusicGen (part of Meta's AudioCraft suite) and Stable Audio both use this approach, achieving high-quality music generation from text descriptions.
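The sequence-length arithmetic is worth making explicit. A quick sketch, using the sample rate above and the codec token rate covered in the next section:

```python
# Back-of-envelope: why raw waveforms are intractable for a transformer.
SAMPLE_RATE = 44_100          # CD-quality samples per second
DURATION_S = 30

raw_len = SAMPLE_RATE * DURATION_S
print(raw_len)                # 1323000 raw samples to model

# After a neural codec: 4 codebooks x 50 frames/sec = 200 tokens/sec
codec_tokens_per_s = 4 * 50
codec_len = codec_tokens_per_s * DURATION_S
print(codec_len)              # 6000 tokens -- comfortably within context

print(raw_len / codec_len)    # ~220x shorter sequence
```

Six thousand tokens is an ordinary language-modeling workload; 1.3 million is not.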
1. The Architecture: EnCodec Neural Audio Codec
Meta's EnCodec is the key enabling technology. It's a convolutional encoder-decoder (similar to a VAE, but with discrete quantization) that compresses raw PCM audio into parallel streams of discrete tokens using Residual Vector Quantization (RVQ). For the variant MusicGen uses:
- Input: 32 kHz mono audio (32,000 samples/second)
- Output: 4 codebooks × 50 frames/second = 200 tokens/second
- Compression: 160x fewer discrete values than raw samples (each token indexes a 2048-entry codebook)
- Quality: perceptually close reconstruction — trained with reconstruction, perceptual, and adversarial losses
- Why RVQ? The first codebook captures broad harmonic structure; each subsequent codebook quantizes the residual left by the previous ones, capturing progressively finer detail — a hierarchical audio representation
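The residual mechanism is easy to demonstrate on toy data. A minimal NumPy sketch — the random codebooks here are purely illustrative stand-ins for EnCodec's learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_codebooks = 8, 16, 4

# Stand-in for learned codebooks: one lookup table per quantizer stage
codebooks = rng.normal(size=(n_codebooks, codebook_size, dim))

def rvq_encode(x, codebooks):
    """Quantize x stage by stage: each stage encodes the residual
    left over from the previous stages (coarse first, detail later)."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        idx = np.argmin(((cb - residual) ** 2).sum(axis=1))  # nearest code
        tokens.append(int(idx))
        residual = residual - cb[idx]     # next stage sees what's left
    return tokens

def rvq_decode(tokens, codebooks):
    # Reconstruction is just the sum of the selected codewords
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

x = rng.normal(size=dim)
tokens = rvq_encode(x, codebooks)
x_hat = rvq_decode(tokens, codebooks)
print(tokens)                             # 4 integers, one per codebook
print(np.linalg.norm(x - x_hat))          # residual reconstruction error
```

With learned (rather than random) codebooks, each added stage shrinks the residual error — which is exactly the coarse-to-fine hierarchy described above.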
MusicGen's transformer predicts these 4 codebook streams simultaneously using a "delay" interleaving pattern: codebook k is shifted right by k timesteps, so the model emits one token per codebook at every autoregressive step and codebook k at a given frame can condition on codebooks 0 to k−1 for that same frame, which were generated in earlier steps.
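The delay pattern is easiest to see laid out on a grid. A toy sketch — the token labels `c{codebook}t{frame}` are illustrative, and `.` stands in for the special padding token:

```python
# Each ROW is one codebook stream, shifted right by its codebook index.
# Each COLUMN is one transformer step, which emits 4 tokens at once.
n_codebooks, n_frames = 4, 6
PAD = "."

grid = [
    [PAD] * k
    + [f"c{k}t{t}" for t in range(n_frames)]
    + [PAD] * (n_codebooks - 1 - k)
    for k in range(n_codebooks)
]
for row in grid:
    print(" ".join(row))
# c0t0 c1t0 appear in adjacent columns: frame 0's fine-detail tokens
# trail its coarse token by one step each.
```

Note the total step count is only `n_frames + n_codebooks - 1` (9 here), versus `n_frames * n_codebooks` (24) if the four streams were fully flattened — the main reason this pattern makes generation fast.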
2. Running MusicGen Locally
pip install audiocraft
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
import torch
# Model variants:
# musicgen-small (300M): Fast, ~6GB VRAM, lower quality
# musicgen-medium (1.5B): Good quality, ~10GB VRAM
# musicgen-large (3.3B): Best quality, ~16GB VRAM, 2-3 min for 30s audio
# musicgen-stereo-large: same but stereo output (double VRAM)
# musicgen-melody: melody conditioning variant
model = MusicGen.get_pretrained('facebook/musicgen-large')
# Generation parameters
model.set_generation_params(
    duration=30,        # seconds of audio to generate
    temperature=1.0,    # 0.5=conservative, 1.0=default, 1.2=experimental
    top_k=250,          # top-k sampling for each token step
    top_p=0.0,          # top-p (nucleus) sampling (0=disabled when using top_k)
    cfg_coef=3.0,       # classifier-free guidance scale — higher=follows prompt more
                        # 1.0=no guidance, 3.0=standard, 5.0=very strict
    extend_stride=18,   # stride (seconds) when generating >30s audio (overlap handling)
)
# Generate from text prompts (batch processing supported)
descriptions = [
    "Upbeat synthwave with driving arpeggiated bass, punchy kick drum, 128 BPM, retro 80s sound",
    "Cinematic orchestral score, dramatic strings, rising tension, film score, Hans Zimmer style",
    "Lo-fi hip hop beat, mellow jazz chords, soft drums, rain ambience, relaxing and warm",
]
# Returns tensor of shape [batch, channels, samples]
wav_tensor = model.generate(descriptions)
# Save each generated track
for idx, wav in enumerate(wav_tensor):
    # audio_write handles normalization and format conversion
    audio_write(
        f'generated_track_{idx}',
        wav.cpu(),                  # move to CPU for I/O
        model.sample_rate,          # 32000 Hz for musicgen-large
        strategy="loudness",        # normalize loudness to -14 LUFS (streaming standard)
        loudness_compressor=True,   # light dynamic compression
        format="wav",               # or "mp3" (requires ffmpeg)
    )
print(f"Generated {len(descriptions)} tracks!")

3. Melody Conditioning: Transform Any Audio Into New Arrangements
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write
# musicgen-melody conditions generation on a chromagram (pitch-class energy
# over time) extracted from the input audio.
# The generated music will follow the same melody but with different instruments.
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=20, temperature=1.0, cfg_coef=3.0)
# Load your input melody (any format torchaudio supports)
melody_waveform, sample_rate = torchaudio.load("reference_melody.mp3")
# Tile for batch processing (or use different melody per description)
melody_wavs = melody_waveform.unsqueeze(0).expand(3, -1, -1) # 3 descriptions
descriptions = [
    "Orchestral arrangement, full string orchestra, dramatic and cinematic",
    "Jazz quartet arrangement, piano, upright bass, trumpet, brushed drums",
    "Electronic rearrangement, synthesizers, driving beat, club music",
]
# Generate while conditioning on the melody from reference_melody.mp3
wav = model.generate_with_chroma(
    descriptions=descriptions,
    melody_wavs=melody_wavs,          # your reference melody input
    melody_sample_rate=sample_rate,   # must specify SR of input
    progress=True,                    # show tqdm progress bar
)
for idx, one_wav in enumerate(wav):
    # Index-based names are safer than slicing the description into a filename
    audio_write(f'arrangement_{idx}', one_wav.cpu(), model.sample_rate)
# Use cases:
# - Hum a melody into your phone → generate full orchestration
# - Reinterpret a reference track in a completely different style/genre
# - Create background music variations that follow your edited melody

4. Music Generation Model Comparison
| Model | Provider | Access | Strengths |
|---|---|---|---|
| MusicGen-Large | Meta | Open source (local) | Best open model; text + melody conditioning; batch generation |
| Stable Audio Open | Stability AI | Open source (local) | Duration-conditioned (specify exact length); high quality stereo |
| Suno v3/v4 | Suno AI | Cloud API ($) | Best overall quality; handles vocals; production-ready sound |
| Udio | Udio | Cloud API ($) | Excellent for specific genres; strong vocal capability |
| AudioCraft (all) | Meta | Open source | Suite: MusicGen + AudioGen (SFX) + EnCodec + MAGNeT |
| Stable Audio 2.0 | Stability AI | Cloud API | 3-minute tracks; high quality; commercial licensing |
Frequently Asked Questions
Can I use MusicGen output commercially?
MusicGen-Large's weights are released under CC BY-NC 4.0 — non-commercial only (the AudioCraft code itself is MIT-licensed). For commercial use with open-weight models, check Stable Audio Open, whose Stability AI community license permits some commercial use within its terms. Suno and Udio provide commercial licensing tiers in their paid plans. Always check the current license terms before using AI-generated music commercially, as these are evolving rapidly.
How do I improve prompt following for specific musical elements?
Be explicit about: BPM (beats per minute), key signature, specific instruments rather than genres, mood descriptors, reference artists (for style guidance — results vary), and era/decade for sonic character. The CFG scale (cfg_coef) significantly affects how strictly the model follows the prompt: increase it from 3.0 toward 5.0 for more literal prompt interpretation, but at higher values generation quality often decreases. Common pattern: generate at cfg_coef=3 with temperature=1.0 for exploration, then refine with cfg_coef=4+ for the final version.
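That explore-then-refine pattern can be scripted as a small parameter sweep. A sketch of the grid-building half — the resulting dicts are meant to be passed to `model.set_generation_params(**cfg)` with a loaded MusicGen model, which is not done here:

```python
from itertools import product

def cfg_sweep(cfg_coefs=(3.0, 4.0, 5.0), temperatures=(0.8, 1.0)):
    """Build generation-parameter dicts for an explore-then-refine pass.
    Short 10s clips keep the sweep cheap; regenerate the winner longer."""
    return [
        {"duration": 10, "cfg_coef": c, "temperature": t, "top_k": 250}
        for c, t in product(cfg_coefs, temperatures)
    ]

configs = cfg_sweep()
for cfg in configs:
    print(cfg)
# Usage with a loaded model (not run here):
#   model.set_generation_params(**cfg)
#   wav = model.generate(["lo-fi hip hop beat, mellow jazz chords"])
```

Listening across the grid makes the CFG/quality trade-off audible directly, rather than guessing a single setting up front.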
Conclusion
AI music generation in 2025 is genuinely useful: local MusicGen-Large produces production-quality background music for video content, game soundtracks, and creative exploration. The neural codec approach (EnCodec → transformer token prediction → EnCodec decoding) has proven to be the right architecture for audio generation, enabling everything from 10-second loops to 3-minute structured pieces with consistent musical style. For applications requiring top commercial quality, Suno v4 leads; for open-source control and local deployment, MusicGen-Large is the standard.
Vivek
AI Engineer
Full-stack AI engineer with 4+ years building LLM-powered products, autonomous agents, and RAG pipelines. I've shipped AI features to production for startups and worked hands-on with GPT-4o, LangChain, LlamaIndex, and the Vercel AI SDK. I started OpnCrafter to share everything I wish I had when learning — no fluff, just working code and real-world context.