Looking for pearls in the internet forum

Dr Lim Wern Han

Sifting through the Internet’s sea of content can be exhausting. When you look online for quality answers, how do you decide what to read: Only what’s on top? Only what many voters have “liked”? Now, imagine the challenge in massive Internet forums like Reddit, where thousands of new posts and comments are submitted daily by users.

How can social media sites organise all this content so that you, the individual user, can easily find what’s relevant, accurate, or meaningful? Three Monash University academics made a contribution toward answering this question, through their research on algorithms.

Using big data to predict good content

The researchers were Lim Wern Han (Monash Malaysia), Mark James Carman (Monash Caulfield), and Jojo Wong (Monash Clayton), each from the Faculty of Information Technology at their respective campuses.

Drawing on their experience in the data science field, the researchers proposed several algorithms and tested how well each could predict good-quality content on Reddit, one of the world’s most popular websites.

Sporting over 300 million active users, Reddit is an Internet forum for sharing news, asking questions, and discussing topics of interest such as politics, science, DIY, or dating.

Lim, Carman, and Wong found that it is possible to predict whether a forum user will produce quality comments in the future, by using algorithms that learn user behaviours over large sets of data. These findings can be applied beyond Reddit to other websites such as online question-answering platforms. The application is especially useful for websites that lack the manpower of editors or moderators, yet handle thousands of comments posted by users.

The Reddit challenge

The study covered five popular sub-forums on Reddit (“science”, “world news”, “gaming”, “explain like I’m five”, and “(U.S.) politics”), encompassing 300,000 posts and 1.5 million comments. The “gaming” sub-forum alone had a daily average of 1,257 new posts, with some posts receiving more than 9,000 comments.

Given the volume of content, it is a challenge for Reddit users to quickly identify helpful comments in popular discussions, despite Reddit’s voting feature. Users can “up-vote” a comment they like (akin to giving brownie points) or show their disapproval with a “down-vote” (akin to penalty points).

The number of votes is displayed next to each comment, and it affects the commenter’s overall “karma” score. However, the system is still limited in its ability to identify quality comments and push good content to the top of the page quickly, because it takes time for votes to be collected. This limitation is known as the Cold Start problem.

Lim, Carman, and Wong’s research could improve how content is organised in websites like Reddit. The algorithms they tested can be used to predict which users will contribute good-quality content and which users are malicious and unhelpful. Their study focused on measuring user “expertise”, with the assumption that users of higher expertise tend to produce higher quality content.

Features to measure user expertise

The researchers experimented with various online features to estimate user “expertise”, or the ability to produce good comments. These features include the frequency of “good” comments and the fuzzed vote difference on user posts, among many others. The features are used to deduce the “credit” users gain for their contributions. This information can then be propagated through a statistical pairwise comparison approach to quantify user expertise.
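To give a feel for what a pairwise comparison might look like, here is a minimal sketch in Python. The article does not name the specific method the researchers used, so this example swaps in a generic Elo-style rating update as a stand-in: every pair of comments in the same thread is treated as a "match", and the commenter with the higher vote score wins. The function names, the starting rating of 1500, and the constants `k` and `scale` are all assumptions made for illustration.

```python
from itertools import combinations


def expected_win(rating_a, rating_b, scale=400.0):
    """Chance that A's comment outscores B's under a logistic (Elo-style) model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_ratings(ratings, thread, k=16.0, default=1500.0):
    """Update user ratings in place from one thread.

    `thread` is a list of (user, vote_score) pairs, one per comment.
    All values here (k, default, scale) are illustrative assumptions.
    """
    deltas = {}
    for (user_a, score_a), (user_b, score_b) in combinations(thread, 2):
        if user_a == user_b:
            continue  # a user is not compared against themselves
        # Ratings are read before any update from this thread is applied.
        rating_a = ratings.get(user_a, default)
        rating_b = ratings.get(user_b, default)
        if score_a > score_b:
            outcome = 1.0
        elif score_a < score_b:
            outcome = 0.0
        else:
            outcome = 0.5  # tied vote scores count as a draw
        shift = k * (outcome - expected_win(rating_a, rating_b))
        deltas[user_a] = deltas.get(user_a, 0.0) + shift
        deltas[user_b] = deltas.get(user_b, 0.0) - shift
    for user, delta in deltas.items():
        ratings[user] = ratings.get(user, default) + delta
    return ratings


ratings = {}
update_ratings(ratings, [("alice", 120), ("bob", 3), ("carol", -5)])
```

In this toy thread, the user whose comment beats both of the others gains rating, and the user who loses both comparisons drops. Because each gain is matched by an equal loss, the total rating across users stays constant.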

Users should be rewarded with higher scores for posting content that is better than average—rather than having their scores inflated or normalised, as is done by some Internet forums. Meanwhile, users who submit poor-quality or malicious content should be penalised in score.
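One simple way to read "better than average" is to compare each comment's vote score with the average score in its own discussion thread. The Python sketch below does exactly that. It is an illustrative assumption, not the scoring rule from the study, and the function name `relative_credit` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def relative_credit(threads):
    """Sum, per user, how far each comment scored above or below its thread average.

    `threads` is a list of threads; each thread is a list of
    (user, vote_score) pairs. This is a hypothetical scoring rule.
    """
    credit = defaultdict(float)
    for thread in threads:
        if not thread:
            continue  # skip empty threads, which have no average
        average = mean(score for _, score in thread)
        for user, score in thread:
            # Positive when the comment beat the thread average, negative otherwise.
            credit[user] += score - average
    return dict(credit)


sample_threads = [
    [("alice", 10), ("bob", 2), ("carol", 0)],
    [("alice", 4), ("bob", 6)],
]
credit = relative_credit(sample_threads)
```

Under this rule, a user gains credit only by outperforming the other comments in the same discussion, and a poorly received comment costs its author credit.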

Based on users’ relative performance, websites can better promote or restrict their posts. The identified experts can even be approached to contribute new high-quality content or to solve complex problems.

Quality comments can be predicted

User votes can be used to predict how likely a user is to submit quality comments. This research outcome enables social platforms to avoid the expensive processing of natural language and other multimedia content. Given the high volume of user-generated content today, such a capability is highly desirable.

By using algorithms, social platforms can improve how their content is ranked or organised for readers, resulting in a higher density of knowledge on the World Wide Web.