Predicting Upvotes and Popularity on Reddit
Authors: Andrew Paul, Chigozie Nna
Introduction
Reddit is an American social news aggregation, web content rating, and discussion website. The site quickly gained popularity after its creation in 2005 by two University of Virginia students, Steve Huffman and Alexis Ohanian. Reddit's users submit content to the site in the form of links, text posts, and images, which other users can then vote up or down. Posts are categorized into groups called “subreddits”, communities devoted to specific topics or interests. In its early years, Reddit rose steadily in popularity, with NSFW, Programming, and Science among the top trending subreddits of the time. By 2008, a wave of new subreddits helped popularize the site further, and by 2010 Reddit had overtaken its competitor Digg. Its rise did not stop there: only a year later, Reddit reached one billion page views per month. As of 2019, Reddit is ranked the 18th top site globally, according to Alexa Internet.
In this tutorial, we analyzed all Reddit posts from January 2016 to August 2019 (over 510 million posts!). The goal was to learn which factors of a post (such as title length and time posted) have the greatest effect on upvotes, downvotes, score, and the general reaction to a post. Posts vary in topic, argument, time posted, and many other variables, but we suspected that popularity depends heavily on a post's title length and the time it was posted. We were able to determine which lengths are too short to gain attention, and which are long enough to bore an audience. We also looked at the most popular subreddits and the time of day to see any relation to upvotes. We hope to give readers unfamiliar with the platform enough information and analysis to provide clarity, understanding, and a newfound interest.
And hopefully fellow Reddit users will gain some insight on how to optimize their posts to gain the most traction.
Getting started with the Data
We decided to use Python 3 and SQL to gather and analyze our data. Crucial libraries used to help us were: pandas, matplotlib, seaborn, and scikit-learn.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn import model_selection
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
Processing and Receiving Data
We used the following SQL command through Google's BigQuery to pull data from Pushshift (a third-party Reddit API that tracks nearly all of Reddit's post history). We took in data from January 2016 to August 2019, which contains about 510 million rows. Luckily, with Google's BigQuery, that process takes only a matter of seconds.
In this SQL query we get the length of every single title, the average score for each title length, the average number of comments for each title length, and the number of posts with that many characters. This is done using SQL's GROUP BY clause. BigQuery converted this data into a .csv file, making it easy to parse and analyze.
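The query we ran was shaped roughly like the sketch below. The table name `fh-bigquery.reddit_posts.2019_08` is an assumption (the public Pushshift mirror on BigQuery splits posts into monthly tables); the tutorial actually spanned 2016-01 through 2019-08.

```python
# A sketch of the aggregation query described above (table name is an assumption).
QUERY = """
SELECT
  LENGTH(title)     AS length_title,
  AVG(score)        AS avg_score,
  AVG(num_comments) AS avg_comments,
  COUNT(*)          AS num_posts
FROM `fh-bigquery.reddit_posts.2019_08`
GROUP BY length_title
ORDER BY length_title
"""

# With google-cloud-bigquery installed and credentials configured, the result
# can be pulled into pandas and saved:
#   from google.cloud import bigquery
#   df = bigquery.Client().query(QUERY).to_dataframe()
#   df.to_csv('LengthScoreComments.csv', index=False)
```

Running the query in the BigQuery web console and exporting the result table as CSV works just as well.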
Reading the Data
We first used Python's pandas module to read in the .csv file and convert it into a pandas DataFrame: a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
data = pd.read_csv('LengthScoreComments.csv', sep=',')
data[:10]
In the DataFrame above you can see:
- length_title: number of characters in the title (1-300).
- avg_score: average score (upvotes minus downvotes) of posts with that title length.
- avg_comments: average number of comments on posts with that title length.
- num_posts: number of posts with that title length.
Graphing
In this first graph, we looked at the potential relationship between the length of the post title and the score the post received. This can help readers gauge how long they should make their titles to gain the most traction.
X = data['length_title']
Y = data['avg_score']
Size = data['num_posts']/200000
plt.scatter(x = X, y = Y, s = Size )
plt.title('Length of Post Title vs Average Score of Post')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Score of Post')
plt.show()
Comments
We then wanted to see the relationship between a post's title length and the number of comments the post received. People will upvote anything they find funny or interesting; however, we wanted to see which posts actually get people commenting. We deemed comments the true reactions to posts, since they require viewers to put in more effort than a single click for an upvote.
X = data['length_title']
Y = data['avg_comments']
Size = data['num_posts']/200000
plot = plt.scatter(x = X, y = Y, s = Size)
plt.title('Length of Post Title vs Average Number of Comments')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Comments')
plt.show()
Reddit Artwork
Filtered Data
We then decided to filter our data to only the most popular subreddits, excluding ones like "r/me_irl". We wanted to see the relationship between title length and upvotes for typical Reddit users, rather than posts driven by small, silly trends like "r/hmmm".
In this SQL query, we created an entirely new .csv file containing only the top 15 subreddits by subscriber count, to examine the relationship again using the subreddits most of Reddit visits day to day. This gave us a more accurate picture of the relationship between post title length and popularity.
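For readers without BigQuery access, the same filtering idea can be sketched in pandas on a raw post dump. The toy data and column names below are assumptions for illustration only; the real query restricted to the 15 most-subscribed subreddits, while this sketch uses post count as a rough proxy:

```python
import pandas as pd

# Toy stand-in for a raw post dump; real data would come from Pushshift.
posts = pd.DataFrame({
    'subreddit': ['askreddit', 'askreddit', 'funny', 'me_irl', 'funny', 'askreddit'],
    'title':     ['Why?', 'What is your favorite book?', 'lol', 'me irl', 'How?', 'ok'],
    'score':     [120, 540, 80, 15, 200, 60],
})

# Keep only the N most active subreddits (post count as a proxy for subscribers).
top = posts['subreddit'].value_counts().nlargest(2).index
filtered = posts[posts['subreddit'].isin(top)]

# Aggregate score by title length, mirroring the SQL GROUP BY.
by_length = (filtered.assign(length_title=filtered['title'].str.len())
                     .groupby('length_title')['score']
                     .agg(avg_score='mean', num_posts='count')
                     .reset_index())
```

The resulting frame has the same `length_title` / `avg_score` / `num_posts` shape as `FilteredSubLengthScore.csv`.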
Unbias = pd.read_csv('FilteredSubLengthScore.csv', sep=',')
Unbias[:10]
In the DataFrame above you can see:
- length_title: number of characters in the title.
- avg_score: average score (upvotes minus downvotes) of posts with that title length.
- num_posts: number of posts with that title length.
Graphing Filtered Data
We used the same method as above to graph the relationship between post title length and the average score it received, except only looking at the top 15 subreddits.
X = Unbias['length_title']
Y = Unbias['avg_score']
Size = Unbias['num_posts']/10000
plt.scatter(x = X, y = Y, s = Size)
plt.title('Length of Post Title vs Average Score of Post (Filtered)')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Score of Post')
plt.show()
Based on the new graph, we can see a less skewed range of values and a clearer drop-off in average score after around 25 characters. The optimal number of characters appears unchanged despite the filtering, which supports the conclusion that it is best to keep titles between 5 and 25 characters.
Linear & Polynomial Regression
We decided to get a relative prediction of what the outcome would be by creating both a linear regression of the data, as well as a polynomial trend line.
Linear
Y = Unbias['avg_score']
X = Unbias['length_title']
linear_regression = LinearRegression()
reshapedX = X.values.reshape(-1, 1)
linear_regression.fit(reshapedX, Y)
model = linear_regression.predict(reshapedX)
plt.figure(figsize=(10,8));
plt.scatter(X, Y);
plt.plot(X, model);
plt.title('Length of Post Title vs Average Score of Post (Filtered)')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Score of Post')
plt.show()
Polynomial
poly_reg = PolynomialFeatures(degree=2)
reshapedX = X.values.reshape(-1, 1)
poly = poly_reg.fit_transform(reshapedX)
linear_regression2 = LinearRegression()
reshapedY = Y.values.reshape(-1, 1)
linear_regression2.fit(poly, reshapedY)
y_pred = linear_regression2.predict(poly)
plt.figure(figsize=(10,8));
plt.scatter(X, Y);
plt.plot(X, y_pred);
plt.title('Length of Post Title vs Average Score of Post (Filtered)')
plt.xlabel('Length of Post Title (# of Characters)')
plt.ylabel('Average Score of Post')
plt.show()
The polynomial regression describes the data more accurately than the linear regression line: the relationship between title length and average score is clearly nonlinear. Reddit users can use this knowledge to decide whether to go with a short title and average around 300 upvotes, or a lengthier title to average 300+ upvotes.
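One way to quantify "more accurate" is to compare the R² of the two fits. A sketch on synthetic, deliberately curved data (illustrative only; the real comparison would use the Unbias frame):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = np.arange(1, 301).reshape(-1, 1)                  # title lengths 1-300
# Synthetic scores with a curved (U-shaped) trend plus noise
y = 300 + 0.01 * (X.ravel() - 150) ** 2 + rng.normal(0, 20, 300)

linear = LinearRegression().fit(X, y)
r2_linear = r2_score(y, linear.predict(X))

poly_features = PolynomialFeatures(degree=2).fit_transform(X)
quadratic = LinearRegression().fit(poly_features, y)
r2_poly = r2_score(y, quadratic.predict(poly_features))

# On curved data, the degree-2 model explains far more of the variance.
```

A near-zero R² for the straight line on U-shaped data is exactly why the linear fit above looks so unconvincing.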
Typical Reddit Home Page
Time Matters
In this section, we analyzed whether the time of day something is posted is correlated with its popularity. With this knowledge, Reddit users could hold off on posting in order to get the most traction on their post. We started by getting a new batch of data from BigQuery.
Our goal was to figure out what time of day a post is most likely to maximize its popularity. We organized our data into 168 rows (24 hours in a day x 7 days a week), each row holding the average score of posts made in that hour. We only considered posts with a score of 100 or more, so that the large majority of posts with little to no upvotes would not drag the averages down and bias the results.
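For context, the `dayofweek` and `hourofday` columns in that batch can be derived from a post's Unix timestamp. A minimal sketch, where the 1 = Sunday convention and the US/Eastern conversion are assumptions matching the helper functions used in this section:

```python
import pandas as pd

# Hypothetical raw rows: Pushshift stores creation time as a Unix epoch in
# `created_utc`. The timestamps below are 2019-01-01 and 2019-01-02, 00:00 UTC.
posts = pd.DataFrame({'created_utc': [1546300800, 1546387200]})

est = (pd.to_datetime(posts['created_utc'], unit='s', utc=True)
         .dt.tz_convert('US/Eastern'))

# Match the conventions in TimeVsScore.csv (an assumption about the original
# query): dayofweek runs 1-7 with 1 = Sunday, hourofday runs 0-23.
posts['dayofweek'] = (est.dt.dayofweek + 1) % 7 + 1   # pandas uses Monday = 0
posts['hourofday'] = est.dt.hour
```

Both sample timestamps land at 7 PM Eastern the previous evening, which is the kind of shift that makes the timezone choice matter for this analysis.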
# Converts the numbers 0-23 to their respective times
def convertHourToTime(num):
    if num == 0:
        timeOfDay = '12 AM'
    elif num <= 11:
        timeOfDay = str(num) + ' AM'
    elif num == 12:
        timeOfDay = '12 PM'
    else:
        timeOfDay = str(num - 12) + ' PM'
    return timeOfDay

# Converts the numbers 1-7 to their respective weekdays
def convertNumToDay(num):
    weekdays = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
    return weekdays[num - 1]

timeScore = pd.read_csv('TimeVsScore.csv', sep=',')
formattedTimeScore = timeScore.copy()
formattedTimeScore['hourofday'] = formattedTimeScore['hourofday'].apply(convertHourToTime)
formattedTimeScore['dayofweek'] = formattedTimeScore['dayofweek'].apply(convertNumToDay)
formattedTimeScore
timeScoreMatrix = timeScore.pivot(index='dayofweek', columns='hourofday', values='avg_score')
# Move the hour-23 column to the front so the columns match the label order below
cols = timeScoreMatrix.columns.tolist()
cols.insert(0, cols.pop(cols.index(23)))
fixedTimeScoreMatrix = timeScoreMatrix.reindex(columns=cols)
fixedTimeScoreMatrix
import matplotlib.ticker as ticker

fig, ax = plt.subplots(1, 1, figsize=(15, 15))
heatmap = ax.imshow(fixedTimeScoreMatrix, cmap='BuPu')
ax.set_xticklabels(np.append('', formattedTimeScore.hourofday.unique())) # columns
ax.set_yticklabels(np.append('', formattedTimeScore.dayofweek.unique())) # index
tick_spacing = 1
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
ax.yaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
ax.set_title("Time of Day Posted vs Average Score")
ax.set_xlabel('Time of Day Posted (EST)')
ax.set_ylabel('Day of Week Posted')
from mpl_toolkits.axes_grid1 import make_axes_locatable
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", "3%", pad="1%")
fig.colorbar(heatmap, cax=cax)
plt.show()
As we can see from the data, the posts with the highest scores were posted between 6:00 AM and 9:00 AM EST. 7:00 AM has the highest average score across all seven days of the week, and the effect is strongest on Saturday and Sunday. One possible explanation is that most Reddit posts take a few hours to reach the "Front Page", or the top of their subreddit; uploading in the morning gives a post time to rise to the top by the time users first check Reddit that day.
After analyzing all of these datasets, we can finally make an informed guess about what is best for a Reddit post's popularity. Overall, Reddit users have the best chance of scoring upvotes if they post on a Saturday at 7:00 AM EST, with roughly 5-25 characters in their title. The odds can be improved further by posting in a top-15 subreddit.
Conclusion and More
Reddit has grown to be an outstanding social media website. However, many people still do not know how to get their posts viewed and seen by the public. This project taught us a great deal about the website and the kind of data continuously being collected by third-party services. As this analysis shows, many factors go into a Reddit post that ultimately decide how many views and upvotes it will get. We hope this helps new Reddit users get a great jump on when to post and what kind of titles to use.
If you are interested in Reddit and its many datasets, we recommend Google's BigQuery and Pushshift. The Pushshift API can handle over 1,000 Reddit calls a second, so a selection of data from January 2016 to August 2019 gave us plenty of data points to work with. This tutorial covers only a small fraction of what can be done with Reddit's data. If you've made it this far, thank you for the read, and we hope you learned something new!