An honest attempt at data-driven exploration of 30 years of Billboard pop harmony, built with an N-gram language model trained on 886 real songs.
Every now and then, this wonderfully awesome video appears in my feed. In 2009, Australian comedy band Axis of Awesome performed a medley of over 40 pop hits that all use the exact same four chords. So given this assignment, I thought to myself...
Is pop music really that simple?
Axis of Awesome - "4 Chords Song" (2009). Every song in the medley uses I - V - vi - IV.
"What if you could train a model on a real Billboard dataset and let it answer that question empirically?"
Rather than accepting the claim at face value, I wanted to build something that could measure it using the lessons we learned (Thanks Pat!). The four-chord progression in question, written in Roman numerals as I → V → vi → IV, is apparently famous. But is it actually the most common pattern in pop music? Well... Let's see!
This project is a different take on the N-gram language model. Instead of predicting the next word in a sentence, it predicts the next chord in a progression.
The McGill Billboard Project dataset used here includes chord annotations for 886 songs from the Billboard Hot 100, spanning 1958 to 1991.
The dataset provides chord annotations that look like this: A:min, C:maj, G:maj. Songs like Let It Be (key of C) and With or Without You (key of D) use the same harmonic pattern but completely different note names, so comparing them directly is like comparing sentences written in different alphabets.
The solution is to normalize the chords using Roman numeral transposition by converting every chord to its position in the song's major scale. Both songs then become a standardized I → V → vi → IV.
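The normalization step can be sketched in a few lines of Python. This is a hypothetical helper, not the project's actual code, and it assumes simple `root:quality` annotations and a known major key for each song:

```python
# Sketch of Roman numeral transposition: map each chord root to its
# scale degree in the song's major key. Helper names are illustrative.
PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE = {0: "I", 2: "II", 4: "III", 5: "IV", 7: "V", 9: "VI", 11: "VII"}

def to_roman(chord: str, key: str) -> str:
    """Convert e.g. 'G:maj' in key 'C' to 'V', or 'A:min' in key 'C' to 'vi'."""
    root, quality = chord.split(":")
    interval = (PITCHES.index(root) - PITCHES.index(key)) % 12
    numeral = MAJOR_SCALE.get(interval, "?")  # non-diatonic roots left as '?'
    return numeral.lower() if quality == "min" else numeral

# "Let It Be" (key of C) and "With or Without You" (key of D)
# collapse to the same normalized pattern:
progression_c = [to_roman(c, "C") for c in ["C:maj", "G:maj", "A:min", "F:maj"]]
progression_d = [to_roman(c, "D") for c in ["D:maj", "A:maj", "B:min", "G:maj"]]
print(progression_c)  # ['I', 'V', 'vi', 'IV']
print(progression_d)  # ['I', 'V', 'vi', 'IV']
```

Once every song is expressed this way, progressions in different keys become directly comparable token sequences.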
N-gram models are used to find recurring word phrases, like we did in that one lesson with movie reviews. Here, the exact same technique is applied to music: chords are tokens and progressions are sentences.
| NLP Analog | This Project |
|---|---|
| Vocabulary of words | 14 Roman numeral symbols |
| Tokenized sentence | Chord sequence per song |
| Text corpus | 886 Billboard songs |
| Bigram "sky is" | Chord bigram I → V |
| Next word prediction | Next chord prediction |
| Word frequency (hot words) | Roman numeral frequency |
An N-gram model learns conditional probabilities from the training corpus. A bigram looks one chord back: "after I, what chord appears most often next?" A trigram looks two chords back, giving more context and, hopefully, better predictions.
A naive N-gram assigns zero probability to any chord it never saw after a given context. Laplace smoothing adds a small constant to every count. That way, unseen chords get a tiny probability instead of zero.
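A minimal sketch of the idea, using a toy corpus and illustrative names rather than the project's actual code:

```python
# Bigram estimation with Laplace (add-k) smoothing on a toy chord corpus.
from collections import Counter

VOCAB = ["I", "ii", "iii", "IV", "V", "vi", "vii"]  # toy subset of the 14 symbols

def bigram_prob(prev, nxt, counts, context_totals, k=1.0):
    """P(nxt | prev) with add-k smoothing over the vocabulary."""
    return (counts[(prev, nxt)] + k) / (context_totals[prev] + k * len(VOCAB))

# Toy corpus: two progressions (each song is a "sentence" of chord tokens)
songs = [["I", "V", "vi", "IV"], ["I", "IV", "V", "I"]]
counts, context_totals = Counter(), Counter()
for song in songs:
    for prev, nxt in zip(song, song[1:]):
        counts[(prev, nxt)] += 1
        context_totals[prev] += 1

print(round(bigram_prob("I", "V", counts, context_totals), 3))   # 0.222 (seen pair)
print(round(bigram_prob("I", "iii", counts, context_totals), 3)) # 0.111 (unseen, but nonzero)
```

The unseen pair I → iii still gets a small probability instead of zero, which is the whole point of the smoothing.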
If a trigram context has fewer than 5 observations, the model backs off to the bigram (scaled by 0.4). If that's sparse too, it falls back to raw unigram frequencies. Trigram -> Bigram -> Unigram.
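The backoff chain can be sketched like this. The threshold (5 observations) and scale factor (0.4) mirror the text; the function and counter names are hypothetical, and the "sparse bigram" test (zero observations) is an assumption:

```python
# Trigram -> Bigram -> Unigram backoff, as described above.
from collections import Counter, defaultdict

MIN_OBS = 5    # trigram contexts with fewer observations back off
BACKOFF = 0.4  # scale factor applied when backing off to the bigram

def score(context, chord, tri, bi, uni, n_tokens):
    """Score a candidate next chord given up to two previous chords."""
    if len(context) >= 2:
        ctx = (context[-2], context[-1])
        if sum(tri[ctx].values()) >= MIN_OBS:
            return tri[ctx][chord] / sum(tri[ctx].values())
    if len(context) >= 1 and sum(bi[context[-1]].values()) > 0:
        return BACKOFF * bi[context[-1]][chord] / sum(bi[context[-1]].values())
    return uni[chord] / n_tokens  # raw unigram frequency

# Build counts from a toy corpus (the real project uses 886 songs).
songs = [["I", "V", "vi", "IV"], ["I", "IV", "V", "I"], ["I", "V", "vi", "V"]]
tri, bi, uni = defaultdict(Counter), defaultdict(Counter), Counter()
for s in songs:
    for i, chord in enumerate(s):
        uni[chord] += 1
        if i >= 1:
            bi[s[i - 1]][chord] += 1
        if i >= 2:
            tri[(s[i - 2], s[i - 1])][chord] += 1

# (V, vi) has only 2 trigram observations, so this backs off to the bigram:
print(score(["V", "vi"], "IV", tri, bi, uni, sum(uni.values())))  # 0.2 (= 0.4 * 1/2)
```

On this toy data the trigram context (V, vi) is below the threshold, so the score is the bigram estimate P(IV | vi) = 1/2 scaled by 0.4.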
N=2 (bigram): 14 unique chords, 14 contexts, 95,199 tokens
N=3 (trigram): 14 unique chords, 180 observed contexts; 16 of the 196 possible two-chord combinations never appeared in the data, so backoff is used for them.
Although this project idea hinged on the 4 Chord Song, it's statistically more probable for songs to use just three chords: I, IV, and V alone account for roughly 68% of all chord tokens. Teal bars are major chords, gold bars are minor.
After V → vi, does the model predict IV to complete the famous loop?
The bigram correctly ranks IV first (29.4%). But the trigram, with one extra chord of context, finds that V is more common (35%) than IV (25%). So it seems the Four Chord Song loop is NOT the most frequent progression.
A common minor progression, i → VI → III → VII, used in songs like Creep, Mad World, Africa, and hundreds of post-90s hits, scored poorly. The model predicts V at 35%, not VII.
The dataset ends in 1991. This particular pattern exploded in the 1990s and 2000s, well outside the training window. So it's not that the model is necessarily wrong when applied to songs of my generation; the model can only reflect what's available.
Click chord buttons to build a progression. This model attempts to predict what comes next. Toggle between bigram and trigram to compare.
Suggested: try I → V → vi and compare bigram vs trigram.
Modern progressions that emerged from the 90s onward are underrepresented. The model simply has no data past 1991.
Chords are reduced to major/minor only. Sevenths, suspensions, and the extended chords typical of jazz are lost; but let's be honest, who can predict jazz??
N-gram models can only predict from contexts they've seen. An unseen trigram always backs off, meaning the model can't generalize from similar patterns.
A trigram only sees two chords back, while real musical memory spans entire sections. Structure at the level of verse, chorus, and bridge is invisible to N-grams.