Grantland was a long-form blog owned by ESPN. Grantland was known for its award-winning writing, and its contributors brilliantly mixed sports, popular culture, and data analytics & visualization into riveting stories and analysis. Grantland was writing and media as it should be, and it was shut down because it was much harder to monetize than low-effort clickbait. This analysis is a lighthearted tribute to the site and its contributors.
Read on to see how I used unsupervised clustering methods to group articles by topic, and supervised learning methods to predict the author of each article.
This project can be forked from my github here.
I scraped 126 articles by selected contributors from Grantland.com using BeautifulSoup.
Contributors were chosen fairly arbitrarily, based on my memory of which writers I enjoyed most, which means the text is biased towards my interests: mostly basketball and occasionally football.
I also added a few extra contributors at random for variety.
The scraper pulled the most recent 10 posts for each contributor, then excluded podcast posts. Bill Simmons’s most recent posts were almost all podcasts, so his were selected manually. Ultimately, the corpus broke down by contributor this way:
Before any features could be created, I had to remove special characters and expand contractions into full words. I also removed "stop words" - simple words like "and" or "the" that are common and don't contain meaningful information (on their own) in the context of a sports/pop culture blog.
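As a rough sketch, that cleanup step can look like the following - note that the contraction map and stop-word list here are tiny illustrative stand-ins for the much fuller lists a real pipeline would use:

```python
import re

# Small illustrative maps; the real pipeline used much fuller lists.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "can not"}
STOP_WORDS = {"and", "the", "a", "of", "is", "it", "to", "do", "not"}

def clean(text):
    text = text.lower()
    # Expand contractions before stripping punctuation
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)
    # Remove special characters, keeping letters and spaces
    text = re.sub(r"[^a-z\s]", " ", text)
    # Drop stop words
    return [w for w in text.split() if w not in STOP_WORDS]

print(clean("It's the playoffs, and the Cavs don't rebound!"))
# → ['playoffs', 'cavs', 'rebound']
```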
I then chose to use lemmas instead of the original text. Lemmas can be thought of as the "reduced" form of a word; for example, "decide", "decides", and "decided" would all be reduced to the base "decide". Remember - to a machine, words are just groups of characters, and a machine can't tell that "decide" and "decides" indicate similar ideas any more than it can tell that "right" and "rights" have different meanings. That's why lemmas can improve NLP models in many cases - we lose a small degree of context, but indicate to the machine that words from the original text convey near-identical ideas.
NLP packages have these word-to-lemma relationships mapped out thoroughly for general-audience text; I used spacy in this case.
I then trained gensim's doc2vec model on 94 of the 126 articles, and held out the rest as a test set. The model was set to output a vector of length 65, and was trained for 100 epochs. I ignored any words that appeared less than 7 times. The result was a 65-dimension vector describing each article based on its contents; let's break this down.
In a word2vec model, each word is assigned a vector of 65 numbers, which is initialized randomly in a neural network. In each epoch, each word's vector is iteratively nudged such that its 65-dimensional "coordinates" (or weights) land near those of words used in similar contexts, and further from those of words used in very different contexts. If the model is well trained, we would expect the vector for "basketball" to be close/similar to the vector for "game", and far from the vector for "movie" or "goalie".
The word2vec model is useful for making predictions at the word-level, but the goal here is to make predictions about entire articles, so we need a doc2vec model. The doc2vec algorithm starts as a word2vec; at the end, it creates an additional vector for the article itself, and weights it appropriately based on the vectors of all the words within it.
With 65 features to describe each article, our hope is that we have created features that can make meaningful differentiations. However, we don't know how many topics there are to differentiate. Below are 3 different clustering methods I tried, each of which was insightful and revealed something different about the data.
K-Means is a great place to start due to its simplicity; however, it has 2 key limitations that matter:
One way to estimate how many clusters are present in the data is to calculate silhouette scores for different numbers of K-means clusters. The silhouette score is a rough measure of how well separated the clusters are from each other:
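In code, that search looks roughly like this - synthetic blobs stand in for the real 65-dimension doc2vec vectors:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 94 doc2vec article vectors
X, _ = make_blobs(n_samples=94, n_features=65, centers=3, random_state=0)

# Score each candidate number of clusters
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # with 3 well-separated blobs, k=3 scores highest
```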
Based on the above, I started with 4 clusters, but quickly moved to 3 as it appeared to be a cleaner option. Obviously, we can't visualize 65-dimensional spaces, so I used Principal Component Analysis to capture the 2 dimensions that explained the greatest amount of variance, and we see that this method has done a really clean job separating the 3 clusters, even in just the first 2 dimensions!
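The projection step can be sketched like so, again with synthetic data standing in for the article vectors:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the 94 doc2vec article vectors
X, _ = make_blobs(n_samples=94, n_features=65, centers=3, random_state=0)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Project down to the 2 components explaining the most variance
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print(coords.shape)  # (94, 2) - ready to scatter-plot, colored by label
print(pca.explained_variance_ratio_)
```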
Let's look at word clouds to see what each of these clusters contain:
Not bad! Loosely, the first topic is basketball, the second is football, and the third seems to be a generic "pop culture" cluster. This is a great start.
However, the assumption that each cluster is the same size doesn't sound right; I read about basketball more often than I do football, and MUCH more often than pop culture topics. It's likely that my choice of authors would have skewed towards basketball articles.
Spectral clustering still requires that we specify how many clusters we want to see, but because it uses eigenvectors of the Laplacian matrix of a similarity graph, it can identify clusters of different sizes - this matches our intuition better.
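A sketch of that run, using unevenly sized synthetic blobs to mimic the uneven topic coverage:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

# Blobs of unequal size, mimicking more basketball than football than hockey
X, _ = make_blobs(n_samples=[60, 24, 10], n_features=65, random_state=0)

# affinity="nearest_neighbors" builds the similarity graph whose
# Laplacian eigenvectors embed the points before clustering
sc = SpectralClustering(n_clusters=3, affinity="nearest_neighbors",
                        n_neighbors=8, random_state=0)
labels = sc.fit_predict(X)
print(sorted(np.bincount(labels).tolist()))  # cluster sizes are free to differ
```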
Once again, I used 3 clusters, and used PCA to visualize the best 2 dimensions:
As expected, we have various sizes of clusters, which indicates that we have topics in varying quantities. In 2D, we have more overlap than before, which is not desirable. Let's examine the word clouds of each:
This is VERY insightful. Let's break it down. The first cluster is still primarily basketball, but is a bit less focused now; we see words like "people, thing, feel, go" which previously were present in the pop culture topic. The second cluster is still football, and is perhaps more specific to football. Finally, the third topic is completely new - Hockey!
The interpretation here is that 3 clusters is the wrong number. One solution is missing hockey, and the other is missing pop culture. Those articles are being lumped in with one of the other topics.
Affinity Propagation can identify a variable number of clusters AND clusters of varying sizes. The downside is that this solution is very sensitive to small differences between points, and tends to really overestimate how many clusters exist. When we run Affinity Propagation on this data, it returns 11 clusters! In 2 dimensions, they are not well separated, either:
However - it's an interesting starting point. From there I examined word clouds of each of the 11, and realized I could consolidate down to a 6-cluster solution that is really clean! With more data, I suspect the especially bizarre "Future" cluster would not remain consistent - it strikes me as a catch-all cluster for some irreverent outliers, whereas the other 5 are well defined and consistent:
Unlike topics, we know the authors in advance, so predicting authorship is a supervised problem. To predict the author, I tried 3 classification techniques:
Additionally, I tried two different sets of features to describe the data:
Latent Semantic Analysis (LSA) is a technique that uses Singular Value Decomposition on a sparse matrix containing term counts per document. Ultimately, the output is a feature vector for each document with information about the concepts within.
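A small sketch of the LSA pipeline with scikit-learn, on a hypothetical four-document corpus:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "lebron drives to the basket and scores",
    "the quarterback throws a touchdown pass",
    "the film explores fame and celebrity culture",
    "the point guard passes and the center scores",
]

# Sparse matrix of term counts: one row per document, one column per term
counts = CountVectorizer().fit_transform(docs)

# Truncated SVD compresses the counts into dense "concept" features
lsa = TruncatedSVD(n_components=2, random_state=0)
features = lsa.fit_transform(counts)
print(features.shape)  # (4, 2): one concept vector per document
```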
I used F1 scores to compare the methods, which is a handy way of ensuring that models are performing well on both precision and recall.
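For a multi-class problem like this one, the macro-averaged F1 treats every author equally regardless of how many articles they wrote; here's a toy sketch with hypothetical author labels and predictions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true authors vs. a model's predictions
y_true = ["simmons", "simmons", "lowe", "lowe", "barnwell", "barnwell"]
y_pred = ["simmons", "lowe",    "lowe", "lowe", "barnwell", "simmons"]

# Macro averaging computes each metric per author, then takes the mean
p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.722 0.667 0.656
```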
Despite the success of clustering on doc2vec vectors, those features performed poorly for classification: for all 3 techniques, LSA features were more successful.
Additionally, Logistic Regression outperformed the other 2 techniques regardless of the feature set. The performance of the best solution is below:
I extracted the most used words and most unique words used by each contributor - this can be viewed at the bottom of this presentation. Enjoy!