Quantifying and Predicting Beer Preference with the Untappd API

About us

Ian Nightingale

Python Guru

Alexander Jaffe

Data Analyst

Brian Mendel

Data Wrangler

Background and Motivation

Craft beer has exploded in popularity in recent years, representing a market of more than $10B in the US alone (see figure on right). This surge in popularity, along with the rise of social drinking apps like RateBeer, Beer Advocate, and Untappd, has generated huge amounts of data that could reveal consumer interest and help guide the craft beer movement as it solidifies its place in the traditional American beer business. However, to our knowledge, this type of data is not commonly used by breweries or venues to judge consumer taste. As both beer geeks and data scientists, we wanted to explore these data and find out what makes “good beer” good to the people who drink it. We recognize that beer rating is fairly subjective, dictated both by personal tastes and by other external factors. Our study, then, is an attempt to quantify these variables, determining which contribute to the overall rating of a beer and which can be used to predict a person’s reception of a given beer.
Note: we do not approve of the choice to use a donut chart.

The Data

6,377 users

606,083 ratings

88,028 unique beers

8,035 breweries


Style and Attributes

In our exploratory analysis, we looked for relationships among variables and for any correlations they might have with beer rating. We saw that different styles of beer cluster by their alcohol by volume (ABV) and bitterness, which we expected given that these metrics relate to the ingredients and flavor profile of a given beer. Looking more closely, linear regressions revealed that some of these variables correlated with user rating, suggesting that current tastes skew towards more alcoholic and hoppier beers. Given this, we weren’t surprised to see that Russian imperial stouts, wild ales, IPAs, and double IPAs are consistently rated higher. We also found that the locality of a beer was not correlated with its rating: there is good beer everywhere!
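As a rough illustration of the kind of regression described above, the sketch below fits a straight line of rating against alcohol by volume using ordinary least squares and reports the Pearson correlation. The numbers are made up for the example and the `linear_fit` helper is hypothetical; it is not the code or data used in this project.

```python
from math import sqrt

def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept, plus Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and of cross-deviations from the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r

# Made-up (ABV %, mean rating) pairs, for illustration only.
abv = [4.2, 5.0, 6.5, 7.2, 8.5, 10.0]
rating = [3.3, 3.4, 3.7, 3.8, 4.0, 4.2]

slope, intercept, r = linear_fit(abv, rating)
print(f"rating = {slope:.3f} * ABV + {intercept:.3f} (r = {r:.3f})")
```

A positive slope with a correlation near 1 would correspond to the "more alcoholic beers rate higher" trend; real rating data would be far noisier than this toy sample.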

Beer Descriptions

We also wanted to explore textual features of our dataset, in particular those contained in the corpus of beer descriptions. We tokenized these descriptions and created a word cloud to visualize the most frequently used terms.
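The tokenize-and-count step behind a word cloud can be sketched roughly as follows. This is a minimal version using only the Python standard library; the stop-word list, the regular expression, and the sample descriptions are all assumptions made for illustration, not the project's actual pipeline or data.

```python
import re
from collections import Counter

# A small, illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"a", "an", "and", "the", "with", "of", "in", "is", "this", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(descriptions):
    """Count how often each non-stop-word token appears across all descriptions."""
    counts = Counter()
    for description in descriptions:
        counts.update(t for t in tokenize(description) if t not in STOP_WORDS)
    return counts

# Made-up beer descriptions, for illustration only.
descriptions = [
    "A hazy IPA brewed with citrus hops and pale malt.",
    "Roasted malt and chocolate notes in this imperial stout.",
    "Dry-hopped pale ale with tropical hops.",
]

frequencies = term_frequencies(descriptions)
print(frequencies.most_common(3))
```

The resulting term counts are what a word cloud scales its font sizes by, so a mapping like `frequencies` is typically the input to whichever plotting tool draws the cloud.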

Machine Learning

We then wondered: could we use these textual and numerical features to predict user rating of a beer? We implemented two random forest classifiers, the first using words from our natural language processing analysis and the second using the beer attributes. For the word analysis, the most important features were common words like malt and hops, whereas for beer attributes, the global rating of the beer was the best predictor. Our classifiers were not highly accurate, but performance improved once we binarized our data into good and bad beer. Using the results of our analysis, we built a recommendation tool that tells a user whether or not they will like a given beer.
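To make the attribute-based approach more concrete, here is a minimal sketch of a random forest classifier trained on binarized ratings, assuming NumPy and scikit-learn are available. The data are synthetic, and the feature names, the 3.75 "good beer" cutoff, and the hyperparameters are all assumptions chosen for illustration; none of them are taken from the project itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for beer attributes (not the project's data).
abv = rng.uniform(3.5, 12.0, n)           # alcohol by volume, %
ibu = rng.uniform(5.0, 100.0, n)          # bitterness
global_rating = rng.uniform(2.5, 4.5, n)  # average rating across all users

# Synthetic user rating: tracks the global rating, plus individual noise.
user_rating = np.clip(global_rating + rng.normal(0.0, 0.4, n), 0.0, 5.0)

# Binarize: "good" (1) if the user's rating reaches a hypothetical cutoff.
GOOD_THRESHOLD = 3.75
y = (user_rating >= GOOD_THRESHOLD).astype(int)

feature_names = ["abv", "ibu", "global_rating"]
X = np.column_stack([abv, ibu, global_rating])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Held-out accuracy and the relative importance of each attribute.
accuracy = model.score(X_test, y_test)
importances = dict(zip(feature_names, model.feature_importances_))
print(f"accuracy: {accuracy:.2f}")
print(importances)
```

Because the synthetic user rating is built from the global rating, that feature dominates the importances here by construction; the sketch only shows the mechanics of binarizing, fitting, and inspecting feature importances, not a reproduction of our results.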


Given the relatively low accuracy of our classifiers, the predictive value of our variables remains somewhat unclear. Although our analysis identified important trends in user preference, more information may be needed to build a more accurate algorithm. Future work could include other important attributes like price, aroma, taste, and color, all of which could contribute to a user's perception of quality. A beer's perceived quality may also be, to some degree, a product of psychology, which is difficult to detect or predict given the variables we worked with here. Ultimately, we were able to build a large database of beers and perform preliminary analyses that explored its major features and predictive possibilities. This large dataset also allowed us to visualize global trends in a novel way, providing some insight into the existing field of craft beer and the preferences of those who drink it.