
Predicting Upvotes and Popularity on Reddit

Authors: Andrew Paul, Chigozie Nna


Introduction

Reddit is an American social news aggregation, web content rating, and discussion website. The site quickly gained popularity after its creation in 2005 by two University of Virginia students, Steven Huffman and Alexis Ohanian. Reddit's users submit content to the site in the form of links, text posts, and images, which other users can then vote up or down. Posts are organized into communities called “subreddits”, where users share content related to a specific topic or interest. In its early years, NSFW, Programming, and Science were among the top trending subreddits. By 2008, a wave of new subreddits helped popularize the site further, and by 2010 Reddit had gained enough traction to overtake its competitor Digg. Its rise did not stop there: only a year later, Reddit reached a total of one billion page views per month. As of 2019, Reddit is ranked the 18th top site globally, according to Alexa Internet.

In this tutorial, we analyzed all Reddit posts from January 2016 to August 2019 (over 510 million posts!). The goal was to learn which factors of a post (such as title length and time posted) have the greatest effect on upvotes, downvotes, score, and general reaction. Posts vary in topic, argument, time posted, and many other variables, but we suspected that popularity depends heavily on a post's title length and the time it was posted. We set out to determine which lengths are too short to gain attention and which are long enough to bore an audience. We also looked at the most popular subreddits and the time of day a post goes up to see how they relate to upvotes. We hope to give readers unfamiliar with the platform enough information and analysis to provide clarity, understanding, and a newfound interest, and that fellow Reddit users will gain some insight into how to optimize their posts to gain the most traction.

Getting started with the Data

We decided to use Python 3 and SQL to gather and analyze our data. Crucial libraries used to help us were: pandas, matplotlib, seaborn, and scikit-learn.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn import model_selection
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

We used SQL commands to query the data, and Pandas DataFrames to read it in and analyze it.

Processing and Receiving Data

We used a SQL command through Google's BigQuery to take data from "Pushshift" (a third-party Reddit API) that tracks nearly all of Reddit's post history. We took in data from January 2016 to August 2019, which comes to about 510 million rows. Luckily, with Google's BigQuery, that process only takes a matter of seconds.


In this SQL query, we get the length of every title, the average score for each title length, the average number of comments for each title length, and the number of posts with that many characters. This is done with SQL's GROUP BY command. BigQuery converted the result into a .csv file, making it easy to parse and analyze.
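The original query was shown as an image; below is a minimal sketch of what an equivalent query might look like when run through the BigQuery Python client. The `fh-bigquery.reddit_posts.201*` wildcard table path and the client setup are assumptions, not our exact query.

from google.cloud import bigquery

client = bigquery.Client()  # assumes Google Cloud credentials are configured

query = """
SELECT
  LENGTH(title) AS length_title,
  AVG(score) AS avg_score,
  AVG(num_comments) AS avg_comments,
  COUNT(*) AS num_posts
FROM `fh-bigquery.reddit_posts.201*`  -- hypothetical table path
GROUP BY length_title
ORDER BY length_title
"""
lengths = client.query(query).to_dataframe()

In our case, we exported the query result from the BigQuery console as LengthScoreComments.csv rather than pulling it down through the client.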

Reading the Data

We first used Python's Pandas module to read in the .csv file and convert it into a Pandas DataFrame: a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

data = pd.read_csv('LengthScoreComments.csv', sep=',')
data[:10]
length_title avg_score avg_comments num_posts
0 1 40.804163 1.789499 472705
1 2 65.521526 2.796440 739424
2 3 65.577614 3.039975 1361269
3 4 92.408734 3.559541 2588850
4 5 84.223042 3.203536 2210213
5 6 129.382588 3.370371 3850745
6 7 105.844544 3.985902 2858907
7 8 99.618349 4.208025 2939891
8 9 104.936504 4.531499 3818682
9 10 103.995719 4.758954 3773035

In the DataFrame above you can see:

  • length_title: number of characters in the title (1-300).
  • avg_score: average score (upvotes minus downvotes) of the posts with the certain title length.
  • avg_comments: average amount of comments on a post with the certain title length.
  • num_posts: number of posts with that certain title length.


Graphing

In this first graph, we looked at the potential relationship between the length of a post's title and the score the post received. This can give readers insight into how long to make their titles to gain the most traction.

X = data['length_title']
Y = data['avg_score']
Size = data['num_posts']/200000  # scale marker size by the number of posts at each length
plt.scatter(x = X, y = Y, s = Size )
plt.title('Length of Post Title vs Average Score of Post')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Score of Post')
plt.show()

We can see a clear relationship between the number of characters in the post title and the score received. Based on the graph, the highest-scoring posts have title lengths between 5-25 characters and 153-300 characters. This split most likely reflects very different types of posts. There are many popular subreddits, such as "r/me_irl", where every post is titled with the same six characters, "me_irl". Other popular subreddits are seemingly the opposite, such as "r/futurology", where most post titles are scientific headlines with over 200 characters. One explanation for what is going on here is that there are many, many times more posts with 50-character titles than with, say, 261, so there is a dramatic increase in variability toward the upper end of the character limit. As the number of characters increases, posts tend to concern actual issues or phenomena/facts that require longer explanation, and thus receive more views and reactions (e.g., a president's quote means more characters and more responses).


Comments

We then wanted to see the relationship between a post's title length and the number of comments the post received. People will upvote anything they find funny or interesting; we wanted to see which posts actually get people commenting. We deemed comments the truest reactions to posts, since they require more effort from viewers than the single click of an upvote.

X = data['length_title']
Y = data['avg_comments']
Size = data['num_posts']/200000
plot = plt.scatter(x = X, y = Y, s = Size)
plt.title('Avg Number of Comments vs Length of Post title')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Comments')
plt.show()

Based on this relationship, we can see that the number of comments actually grows with the character length of the title. One explanation is that the longer the title, the more likely the post concerns a controversial issue, quote, or topic that requires a lot of explanation, so people are more likely to comment and offer their input. For example, posts on the subreddit "r/news" tend to have longer titles (because of headlines) and thus generate more comments as people further the discussion. This contrasts with memes captioned in a few words, or three-line jokes, which mostly receive upvotes rather than comments.


Filtered Data

We then decided to filter our data to include only the most popular subreddits, excluding ones like "r/me_irl", as discussed earlier. We wanted to see the relationship between title length and upvotes for typical Reddit users, not for posts driven by small, silly trends like "r/hmmm".

In this SQL query, we created an entirely new .csv file covering only the top 15 subreddits (by subscriber count), to examine the relationship again with the subreddits most of Reddit uses day to day. This gave us a more accurate picture of the relationship between post title length and popularity.
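Again, the exact query was shown as an image; here is a hedged sketch of the shape it likely took. The subreddit list below is illustrative (a handful of very large communities), not our precise top-15 filter, and the table path is the same assumption as before.

query = """
SELECT
  LENGTH(title) AS length_title,
  AVG(score) AS avg_score,
  COUNT(*) AS num_posts
FROM `fh-bigquery.reddit_posts.201*`  -- hypothetical table path, as before
WHERE subreddit IN ('funny', 'AskReddit', 'gaming', 'aww', 'pics')  -- illustrative subset of the top 15
GROUP BY length_title
ORDER BY length_title
"""

As before, we exported the result as a .csv file (FilteredSubLengthScore.csv) and read it back in with Pandas.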

Unbias = pd.read_csv('FilteredSubLengthScore.csv', sep=',')
Unbias[:10]
length_title avg_score num_posts
0 1 90.522604 9401
1 2 210.944914 18934
2 3 207.167737 35812
3 4 218.983280 61842
4 5 241.415093 57298
5 6 271.754140 62556
6 7 285.368175 77455
7 8 288.794141 94686
8 9 316.807680 123414
9 10 313.843326 141357

In the DataFrame above you can see:

  • length_title: number of characters in the title.
  • avg_score: average score (upvotes minus downvotes) of the posts with the certain title length.
  • num_posts: number of posts with that certain title length.


Graphing Filtered Data

We used the same method as above to graph the relationship between post title length and the average score received, this time looking only at the top 15 subreddits.

X = Unbias['length_title']
Y = Unbias['avg_score']
Size = Unbias['num_posts']/10000
plt.scatter(x = X, y = Y, s = Size)
plt.title('Length of Post Title vs Average Score of Post (Filtered)')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Score of Post')
plt.show()

The new graph shows a less skewed range of values and a more distinct drop-off in average score after roughly 25 characters. The optimal title length holds even with the data filtered, which supports the conclusion that titles between 5 and 25 characters perform best.


Linear & Polynomial Regression

To get a predictive model of the relationship, we fit both a linear regression and a polynomial trend line to the filtered data.

Linear

Y = Unbias['avg_score']
X = Unbias['length_title']

linear_regression = LinearRegression()
reshapedX = X.values.reshape(-1, 1)
linear_regression.fit(reshapedX, Y)
model = linear_regression.predict(reshapedX)

plt.figure(figsize=(10,8));
plt.scatter(X, Y);
plt.plot(X, model);
plt.title('Length of Post Title vs Average Score of Post (Filtered)')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Score of Post')
plt.show()

Polynomial

poly_reg = PolynomialFeatures(degree=2)
reshapedX = X.values.reshape(-1, 1)
poly = poly_reg.fit_transform(reshapedX)


linear_regression2 = LinearRegression()
reshapedY = Y.values.reshape(-1, 1)
linear_regression2.fit(poly, reshapedY)
y_pred = linear_regression2.predict(poly)
plt.figure(figsize=(10,8));
plt.scatter(X, Y);
plt.plot(X, y_pred);
plt.title('Length of Post Title vs Average Score of Post (Filtered)')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Score of Post')
plt.show()

The polynomial regression fits the data noticeably better than the linear regression line; the relationship between title length and average upvotes is clearly nonlinear. Reddit users can use this to decide whether to go with a short title to average around 300 upvotes, or a lengthier title to average 300+.
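Once fitted, the polynomial model can also predict the expected average score for a specific title length. A quick sketch, where 20 characters is just an example input:

# Expected average score for a 20-character title (illustrative input)
example_length = poly_reg.transform([[20]])
print(linear_regression2.predict(example_length))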



Time Matters

In this section, we analyzed whether the time of day something is posted correlates with its popularity. Reddit users can use this knowledge to hold off on posting in order to get the most traction. We started by getting a new batch of data from BigQuery.

Our goal was to figure out what time of day a post is most likely to maximize its popularity, so we organized our data into 168 rows (24 hours a day × 7 days a week), each row holding the average score a post receives in that hour. We only considered posts with a score of 100 or more, so our results are not biased by the large majority of posts that get little to no upvotes, which would drag the average down.
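As with the earlier datasets, the query behind TimeVsScore.csv was a BigQuery aggregation; here is a sketch of its likely shape, using the same hypothetical table path. Note that BigQuery's EXTRACT works in UTC, so a shift to EST would be applied either in the query or after the fact.

query = """
SELECT
  EXTRACT(DAYOFWEEK FROM TIMESTAMP_SECONDS(created_utc)) AS dayofweek,  -- 1 = Sunday
  EXTRACT(HOUR FROM TIMESTAMP_SECONDS(created_utc)) AS hourofday,
  AVG(score) AS avg_score
FROM `fh-bigquery.reddit_posts.201*`  -- hypothetical table path, as before
WHERE score >= 100  -- skip the mass of posts with little to no upvotes
GROUP BY dayofweek, hourofday
ORDER BY dayofweek, hourofday
"""

The DAYOFWEEK convention (1 = Sunday) is what the conversion helpers below assume.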

# Converts the numbers 0-23 to their respective times
def convertHourToTime(num):
    if num == 0:
        timeOfDay = '12 AM'
    elif num <= 11:
        timeOfDay = str(num) + ' AM'
    elif num == 12:
        timeOfDay = '12 PM'
    else:
        timeOfDay = str(num - 12) + ' PM'
    return timeOfDay

# Converts the numbers 1-7 to their respective weekdays (1 = Sunday)
def convertNumToDay(num):
    weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
    return weekdays[num - 1]

timeScore = pd.read_csv('TimeVsScore.csv', sep=',')
formattedTimeScore = timeScore.copy()
formattedTimeScore['hourofday'] = formattedTimeScore['hourofday'].apply(convertHourToTime)
formattedTimeScore['dayofweek'] = formattedTimeScore['dayofweek'].apply(convertNumToDay)

formattedTimeScore
avg_score dayofweek hourofday
0 817.979886 Sunday 12 AM
1 832.602126 Sunday 1 AM
2 926.316992 Sunday 2 AM
3 1013.444329 Sunday 3 AM
4 1134.617892 Sunday 4 AM
... ... ... ...
163 971.682509 Saturday 7 PM
164 942.749931 Saturday 8 PM
165 911.208568 Saturday 9 PM
166 880.878130 Saturday 10 PM
167 823.783345 Saturday 11 PM

168 rows × 3 columns

We wanted to display our data in a heatmap, but the data was organized into three columns. We had to reorganize the data in order for the heatmap to work correctly.

timeScoreMatrix = timeScore.pivot(index='dayofweek', columns='hourofday', values='avg_score')
# Move the hour-23 column to the front so the heatmap columns line up with the tick labels set below
cols = timeScoreMatrix.columns.tolist()
cols.insert(0, cols.pop(cols.index(23)))
fixedTimeScoreMatrix = timeScoreMatrix.reindex(columns=cols)
timeScoreMatrix
hourofday 0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
dayofweek
1 817.979886 832.602126 926.316992 1013.444329 1134.617892 1283.984606 1400.718513 1384.293555 1339.667450 1227.160091 ... 940.857548 968.624444 937.781498 929.255882 967.731765 942.091881 964.019933 963.388646 853.194138 802.564590
2 815.127267 862.858846 928.686203 1042.522746 1189.814349 1386.860338 1369.992439 1293.379859 1220.160733 1125.224842 ... 957.458976 938.846684 938.656955 939.169165 937.429680 957.803950 937.148762 886.015139 810.725206 800.888918
3 807.148566 859.623517 974.795660 1067.599219 1179.800492 1357.187672 1392.787999 1288.674100 1203.942667 1137.456890 ... 968.803341 962.026646 934.731859 944.079777 949.844017 973.267883 955.275513 890.547645 838.271816 822.564901
4 820.053926 864.788059 966.607819 1086.004091 1201.194851 1373.418265 1386.699401 1285.982900 1201.614869 1116.236775 ... 969.912780 957.093768 941.838205 947.765502 929.672191 966.943421 925.359047 874.487371 815.822274 818.863226
5 810.921800 883.710208 969.234269 1078.066438 1207.298409 1366.675789 1366.740784 1282.025036 1212.700262 1112.079900 ... 958.369804 932.961184 935.878545 949.448003 930.227652 964.164243 938.896549 885.560502 840.863545 813.365210
6 824.531597 883.380917 946.547062 1009.453494 1188.025956 1347.901959 1383.533918 1283.798372 1211.544344 1119.912744 ... 943.259381 940.068199 915.703810 920.797151 924.610139 937.115934 920.370725 894.035268 857.822298 790.034420
7 799.677349 843.552096 933.552917 1037.488966 1152.136520 1315.464118 1392.310930 1390.827688 1304.164204 1162.523854 ... 947.246423 940.004676 945.921120 943.630010 936.120690 971.682509 942.749931 911.208568 880.878130 823.783345

7 rows × 24 columns

After reorganizing our data with a pivot table in Pandas, it becomes easier to see how it can be formed into a heatmap.

import matplotlib.ticker as ticker
import matplotlib.cm as cm
import matplotlib as mpl

fig, ax = plt.subplots(1, 1, figsize=(15, 15))
heatmap = ax.imshow(fixedTimeScoreMatrix, cmap='BuPu')
ax.set_xticklabels(np.append('', formattedTimeScore.hourofday.unique())) # columns
ax.set_yticklabels(np.append('', formattedTimeScore.dayofweek.unique())) # index

tick_spacing = 1
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
ax.set_title("Time of Day Posted vs Average Score")
ax.set_xlabel('Time of Day Posted (EST)')
ax.set_ylabel('Day of Week Posted')

from mpl_toolkits.axes_grid1 import make_axes_locatable
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", "3%", pad="1%")
fig.colorbar(heatmap, cax=cax)
plt.show()

As we can see from the data, the highest-scoring posts are posted between 6:00am and 9:00am EST. 7:00am has the highest average score across all seven days of the week, and Saturday and Sunday appear to be the most favorable days to post. One possible explanation is that most Reddit posts take a few hours to reach the "Front Page", or the top of their subreddit; uploading in the morning gives a post time to rise to the top, so users see it when they first check Reddit. After heavy analysis over all of these datasets, we can finally make an informed guess at what is best for a Reddit user's post popularity. Overall, Reddit users have the best chance to score upvotes if they post on a Saturday, at 7am EST, with roughly 5-25 characters in their title. The odds improve further if the post is in a top-15 subreddit.


Conclusion and More

Reddit has grown into an outstanding social media website, yet many people still have not nailed down how to get their posts viewed and seen by the public. This project taught us much more about the website and the kind of data continuously being collected by third parties. As this analysis shows, many factors go into a Reddit post that ultimately decide how many views and upvotes it will get. We hope this helps new Reddit users get a great jump on when to post and what kind of titles to use. If you are interested in Reddit and its many datasets, we recommend Google's BigQuery and Pushshift; the API can handle over 1,000 Reddit calls a second, so selecting data from January 2016 to August 2019 gave us plenty of data points to work with. This tutorial covers only a small fraction of what can be done with Reddit's data. If you've made it this far, thank you for the read, and we hope you learned something new!