COMP4949 | Predictive Analytics | BIG DATA!

Chord
Progression
Predictor

An honest attempt at data-driven exploration of 30 years of Billboard pop harmony, built with an N-gram language model trained on 886 real songs.

886
Billboard Songs
95K
Chord Tokens
1958-91
Dataset Range
01 | Motivation

The Four Chord Song

Every now and then, this wonderfully awesome video appears in my feed. In 2009, Australian comedy band Axis of Awesome performed a medley of over 40 pop hits that all use the exact same four chords. So given this assignment, I thought to myself...

Is pop music really that simple?

Axis of Awesome - "4 Chords Song" (2009). Every song in the medley uses I - V - vi - IV.

"What if you could train a model on a real Billboard dataset and let it answer that question empirically?"

Rather than accepting the claim at face value, I wanted to build something that could measure it using the lessons we learned (Thanks Pat!). The four chord progression in question, written in Roman numerals as I → V → vi → IV, is famously ubiquitous. But is it actually the most common pattern in pop music? Well... Let's see!

This project is a different take on the N-gram language model. Instead of predicting the next word in a sentence, it predicts the next chord in a progression.

02 | The Data

McGill Billboard Dataset

The McGill Billboard Project, the dataset used here, provides chord annotations for 886 songs from the Billboard Hot 100, spanning 1958 to 1991.

886
Songs
Billboard Hot 100 from 1958-1991
95,199
Chord Tokens
After transposing to Roman numerals
107
Avg. Chords/Song
Range: 12 - 480

The Process: Normalization

The dataset provides chord annotations that look like this: A:min, C:maj, G:maj. Songs like Let It Be (key of C) and With or Without You (key of D) use the same harmonic pattern but completely different note names, so comparing them directly is like comparing sentences written in different alphabets.

The solution is to normalize the chords using Roman numeral transposition by converting every chord to its position in the song's major scale. Both songs then become a standardized I → V → vi → IV.

01
Raw chords
A:min · C:maj
G:maj · F:maj
02
Detect key
tonic: C
(salami metadata)
03
Transpose
A:min → vi
C:maj → I
G:maj → V
04
Corpus
886 sequences
95K tokens
14 symbols
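The transposition step above can be sketched in a few lines. This is a minimal illustration, not the actual project code: the chord-label parsing is simplified to the `root:quality` format shown earlier, sharps-only note names are assumed, and non-diatonic offsets are omitted for brevity.

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Roman numeral for each semitone offset from the tonic in a major key;
# chromatic (non-diatonic) offsets are left out to keep the sketch short.
DEGREES = {0: "I", 2: "II", 4: "III", 5: "IV", 7: "V", 9: "VI", 11: "VII"}

def to_roman(chord: str, tonic: str) -> str:
    """Convert e.g. 'A:min' in the key of C to 'vi' (lowercase = minor)."""
    root, quality = chord.split(":")
    offset = (NOTES.index(root) - NOTES.index(tonic)) % 12
    numeral = DEGREES[offset]
    return numeral.lower() if quality == "min" else numeral

song = ["C:maj", "G:maj", "A:min", "F:maj"]   # Let It Be, key of C
print([to_roman(c, "C") for c in song])       # ['I', 'V', 'vi', 'IV']
```

Run over all 886 songs, this is what collapses the corpus down to the 14-symbol vocabulary.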
03 | The Model

Chords as Tokens

N-gram models are used to find recurring word phrases, like we did for that one lesson with movie reviews. Here, the exact same technique is applied to music: chords are tokens and progressions are sentences.

Analog | This Project
Vocabulary of words | 14 Roman numeral symbols
Tokenized sentence | Chord sequence per song
Text corpus | 886 Billboard songs
Bigram "sky is" | Chord bigram I → V
Next word prediction | Next chord prediction
Word frequency (hot words) | Roman numeral frequency

An N-gram model learns conditional probabilities from the training corpus. A bigram looks 1 chord back: "after I, what chord appears most often next?" A trigram looks 2 chords back, giving more context, and hopefully better predictions.
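The bigram case can be sketched as a pair of counting passes. This is a toy stand-in for the 886-song corpus, with illustrative names; it just shows how "after I, what comes next?" becomes a probability table.

```python
from collections import Counter, defaultdict

# Toy corpus: two tiny "songs" of Roman numeral tokens.
corpus = [["I", "V", "vi", "IV", "I", "V", "vi", "IV"],
          ["I", "IV", "V", "I"]]

counts = defaultdict(Counter)            # context tuple -> next-chord counts
for song in corpus:
    for prev, nxt in zip(song, song[1:]):
        counts[(prev,)][nxt] += 1

def predict(context):
    """Distribution over the next chord given a 1-chord context."""
    dist = counts[context]
    total = sum(dist.values())
    return {c: n / total for c, n in dist.most_common()}

print(predict(("I",)))   # after I: V with probability 2/3, IV with 1/3
```

A trigram model is the same idea with 2-chord context tuples as keys.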

Improvement attempts were made...

Laplace Smoothing (k = 0.5)

A naive N-gram assigns zero probability to any chord it never saw after a given context. Laplace smoothing adds a small constant to every count. That way, unseen chords get a tiny probability instead of zero.
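Concretely, add-k smoothing with k = 0.5 looks like this. The count table and vocabulary here are illustrative (the vocabulary is trimmed from the real 14 symbols), but the formula is the standard one: add k to every count and k times the vocabulary size to the denominator.

```python
from collections import Counter

VOCAB = ["I", "ii", "iii", "IV", "V", "vi", "vii"]  # trimmed for brevity
k = 0.5

def smoothed_prob(counts: Counter, chord: str) -> float:
    """P(chord | context) with add-k (Laplace) smoothing over the vocabulary."""
    total = sum(counts.values()) + k * len(VOCAB)
    return (counts[chord] + k) / total

ctx = Counter({"V": 6, "IV": 4})        # toy counts observed after some context
print(smoothed_prob(ctx, "V"))          # seen chord: still dominant
print(smoothed_prob(ctx, "iii"))        # unseen chord: small but nonzero
```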

Stupid Backoff

If a trigram context has fewer than 5 observations, the model backs off to the bigram (scaled by 0.4). If that's sparse too, it falls back to raw unigram frequencies. Trigram → Bigram → Unigram.
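The backoff chain can be sketched as a single scoring function. The count tables, the 0.4 scale, and the 5-observation threshold mirror the description above; variable names are illustrative, not the project's actual code. Note that Stupid Backoff produces relative scores, not true probabilities.

```python
ALPHA, MIN_OBS = 0.4, 5

def score(context, chord, tri, bi, uni, total_tokens):
    """Relative score for `chord` after a 2-chord `context` tuple."""
    tri_ctx = tri.get(context, {})
    if sum(tri_ctx.values()) >= MIN_OBS:            # enough trigram evidence
        return tri_ctx.get(chord, 0) / sum(tri_ctx.values())
    bi_ctx = bi.get(context[-1:], {})               # back off to last chord
    if sum(bi_ctx.values()) >= MIN_OBS:
        return ALPHA * bi_ctx.get(chord, 0) / sum(bi_ctx.values())
    return ALPHA * ALPHA * uni.get(chord, 0) / total_tokens

# Toy count tables:
tri = {("V", "vi"): {"V": 7, "IV": 5}}
bi = {("vi",): {"IV": 10, "V": 8}}
uni = {"I": 30, "V": 25}
print(score(("V", "vi"), "V", tri, bi, uni, 100))   # trigram used: 7/12
print(score(("X", "vi"), "IV", tri, bi, uni, 100))  # backs off: 0.4 * 10/18
```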

Training output

N=2 (bigram): 14 unique chords, 14 contexts, 95,199 tokens
N=3 (trigram): 14 unique chords, 180 contexts observed; 16 of the 196 possible two-chord contexts never appeared in the data, so backoff is used there.

04 | Results

What the Model Learned

Chord frequency across 886 songs

Although this project idea hinged on the 4 Chord Song, it's statistically more probable for songs to use just three chords: I, IV, and V alone account for roughly 68% of all chord tokens. Teal bars are major chords, gold bars are minor.

The Four Chord Song Test

After V → vi, does the model predict IV to complete the famous loop?

Bigram — context: vi → ?
vi → ?
Trigram — context: V → vi → ?
V → vi → ?
Key finding

The bigram correctly identifies IV first (29.4%). But the trigram, with one extra chord of context, finds that V is more common (35%) than IV (25%). So it seems like The Four Chord Song is NOT the most frequent progression.

Where the Model Fails

A common minor-key progression, i → VI → III → VII, used in songs like Creep, Mad World, Africa, and hundreds of post-90s hits, scored poorly. The model predicts V at 35%, not VII.

The dataset ends in 1991. This particular pattern exploded in the 1990s and 2000s, well outside the training window. So it's not that the model is necessarily wrong when applied to songs of my generation; the model can only reflect what's available.

05 | Interactive Demo

Try It Yourself

Click chord buttons to build a progression. This model attempts to predict what comes next. Toggle between bigram and trigram to compare.

major chords
minor chords
Suggested: try I → V → vi and compare bigram vs trigram.

06 | Reflection

Limitations & What's Next

Dataset era (1958-1991)

Modern progressions that emerged in the 90s to now are underrepresented. The model is missing current data.

Simplified chord types

Chords are reduced to major/minor only. 7ths, suspensions, and extended chords typically used in jazz are lost; but let's be honest, how can you predict jazz??

No generalisation

N-gram models can only predict from contexts they've seen. An unseen trigram always backs off, meaning that it can't infer from similar patterns.

Realistic context window

A trigram only sees 2 chords back. Real musical memory spans entire sections; structure across verse, chorus, and bridge is invisible to N-grams.