Monday, January 9, 2012

Self-organizing Reddit?

Reddit and other sites use categorization instead.  So I prototyped a self-organizing news site and it worked great.  Here’s the background, and how it worked:

A few months ago, on Hacker News, there was a link to an old 2005 Coding Horror article “A Group Is Its Own Worst Enemy”.  Jeff Atwood - Mr CodingHorror - is nowadays known for his partnership with Joel Spolsky on StackOverflow.  They’d likely rather I linked to StackExchange, perhaps, which is part of what this blog post is all about ;)

Reddit and StackExchange are both built around categorization.  The poster picks a category, the users browse categories.  Moderation makes sure posts stay on-topic.  There are generally quite a lot of complaints if anything falls between established categories, or overlaps several, and a risk of things getting zapped.  Hacker News (HN) is again like this except it really is a single category.

A user’s status - karma or such - is carefully counted and displayed.  Reddit is very much built around the community; there’s a /r/programming and a splinter group /r/coding and a gazillion other overlapping, conflicting and sometimes not-on-speaking-terms sub-reddits.  As a user you pick a few reddits and you become loyal to them for a time.  StackOverflow is a little bit about community too, although most visitors arrive by search and most early adopters have likely moved on.  It’s less a place to hang out.  Hacker News is all about community, and is in effect one big broad things-techies-are-interested-in sub-reddit.

For a long time I have been wondering - influenced perhaps by the various strategies for genetic programming I played with in my Core War days - if there isn’t a better way than categorization.  I commented on the HN link:

I’m curious: is there any site where:

1) logged in users can vote for what they like, both up and down. A profile is built identifying other users who vote like you, and so your vote influences the post prominence for those others more than for other users who have different voting patterns

2) anonymous users get post prominence based on a smattering of different group votes

It’s a bit like movie recommendations. In this way, trolls will quickly be grouped with other trolls and those who consistently vote up articles they like will see post prominence from like-minded devotees, and a single board can have a wide range of subjects that end up not needing categorization because they are self-organising.
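To make the idea in that comment concrete, here is a minimal Python sketch of similarity-weighted voting. It is only an illustration under assumptions: votes are +1 or -1, the names `similarity` and `prominence` are made up for this example, and letting a dissimilar user's vote count negatively is one possible reading of "vote like you", not something the comment specifies.

```python
from collections import defaultdict


def similarity(votes_a, votes_b):
    """Agreement between two users on the posts both voted on, in [-1, 1]."""
    shared = set(votes_a) & set(votes_b)
    if not shared:
        return 0.0  # no overlap: we know nothing about how alike they are
    # +1 for each shared post they agree on, -1 for each they disagree on
    return sum(votes_a[p] * votes_b[p] for p in shared) / len(shared)


def prominence(viewer, votes):
    """Score posts the viewer has not voted on, weighting every other
    user's vote by how similarly that user votes to the viewer."""
    scores = defaultdict(float)
    for other, other_votes in votes.items():
        if other == viewer:
            continue
        weight = similarity(votes[viewer], other_votes)
        for post, vote in other_votes.items():
            if post not in votes[viewer]:
                scores[post] += weight * vote
    return dict(scores)


# Hypothetical data: bob votes like alice, troll votes the opposite way.
votes = {
    "alice": {"p1": 1, "p2": -1},
    "bob": {"p1": 1, "p2": -1, "p3": 1},
    "troll": {"p1": -1, "p2": 1, "p4": 1},
}
print(prominence("alice", votes))
```

With this toy data, the post bob liked rises for alice and the post the troll liked sinks, without anyone having assigned a category to anything.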

It all seemed too obvious.

It’s worth following up on some background reading; here are some links for later:

A long time ago, Joel - whose articles have deeply influenced and refined my opinions about our industry - wrote about building online communities.  He built an exceptionally high-quality low-noise forum community based on the principles he described, and I witnessed it working.  This small community was centred around discussion (rather than news or links) that the kind of people who read Joel’s articles would find interesting.  It was actually very Hacker News-like, audience-wise and subject-wise.

Then, abruptly, Joel seemed to become disenchanted with what he had built, first imposing stronger moderation (causing a splinter group to go off to a new forum called crazyontap) and then, suddenly, closing down the whole forum for what felt like the flimsiest of excuses.

Then Joel moved sideways towards job boards and then Q&A sites.  Most recently, Jeff has posted again about forums and social sites in the Q&A vein, which is very much about categorization and moderation.

Back to doing it differently.  I made a prototype.  I tried to work out how to make it super simple to use.  Here’s a description of how it worked:

The page is defined by the fold.  The area above the fold is divided into two lists.  The top half is a handful of random unvisited links: links that we don’t have enough data about to know how to organize.  The bottom half, still above the fold, is items we match you with, so you can see more of what you normally like.
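As a rough sketch of that two-list layout (not the prototype's actual code), something like the following would do. The function name, the parameters and the idea that "not enough data" means "no predicted score yet" are all assumptions made for the example.

```python
import random


def build_front_page(links, visited, predicted,
                     n_explore=5, n_matched=10, rng=random):
    """Return (explore, matched) lists for the area above the fold.

    links:     all link ids currently in play
    visited:   set of link ids this user has already clicked
    predicted: {link id: predicted score for this user}; links missing
               from it are treated as "not enough data yet" (assumption)
    """
    unvisited = [link for link in links if link not in visited]

    # Top half: a random handful of links we cannot score yet.
    unknown = [link for link in unvisited if link not in predicted]
    explore = rng.sample(unknown, min(n_explore, len(unknown)))

    # Bottom half: the unvisited links with the highest predicted score.
    known = [link for link in unvisited if link in predicted]
    matched = sorted(known, key=lambda link: predicted[link],
                     reverse=True)[:n_matched]
    return explore, matched


# Hypothetical usage with a seeded generator so runs are repeatable.
explore, matched = build_front_page(
    links=["a", "b", "c", "d", "e"],
    visited={"a"},
    predicted={"b": 0.2, "c": 0.9},
    n_explore=2, n_matched=2, rng=random.Random(0),
)
```

Here `matched` comes back ordered best-first, and `explore` holds the two links nobody has enough data on.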

Clicks away go through a URL redirector so we can snag what you visit; just visiting a link says something about the link, and can be scored as an up-vote.  Passing over a link and clicking on a link below it also says something about it, and could be counted as a weak down-vote.  Only if you have visited a link can you come back and give it a big up- or down-vote.  So there’s a more nuanced voting grade than simple plus and minus ones, but it’s mostly done by simply passively following links, rather than clicking on voting arrows.
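The passive scoring could be sketched along these lines. The numeric weights and the function names are invented for illustration; the post only says that a visit is an up-vote, a pass-over is a weak down-vote, and an explicit vote is bigger and requires a prior visit.

```python
# Hypothetical weights, chosen only to illustrate the ordering
# "explicit vote" > "visit" > "passed over".
CLICK = 1.0
SKIP = -0.25
EXPLICIT = 3.0


def record_click(ratings, shown, clicked):
    """Update one user's ratings after they follow `clicked`.

    shown is the list of link ids in page order; every link listed
    above the clicked one was passed over.
    """
    position = shown.index(clicked)
    for link in shown[:position]:
        # Weak down-vote, but never overwrite an existing opinion.
        ratings.setdefault(link, SKIP)
    # A visit counts as an up-vote unless an explicit vote already exists.
    if abs(ratings.get(clicked, 0.0)) != EXPLICIT:
        ratings[clicked] = CLICK


def record_explicit(ratings, link, up):
    """Big up- or down-vote, allowed only after the link was visited."""
    if link not in ratings or ratings[link] == SKIP:
        raise ValueError("visit the link before voting on it")
    ratings[link] = EXPLICIT if up else -EXPLICIT


ratings = {}
record_click(ratings, ["a", "b", "c"], "c")  # a and b passed over
record_explicit(ratings, "c", up=False)      # came back and down-voted c
```

The resulting per-user dictionary of scores is the kind of input a collaborative filter can work from.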

A cron-job scraped RSS feeds to get news headlines; it scraped some reddit categories I browse, and HN, and a few very good blogs.  The scraping was just a strategy to have something to organize because I didn’t have lots of users submitting stuff; the stuffing of the content was just to make the prototype meaningful.  Now that there were lots of news items, we can move on…

Users are anonymous but tracked with a secure cookie describing their recent click history.
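One way to read "secure cookie" is a cookie that is signed so the user cannot tamper with it. The sketch below assumes that reading and uses only the Python standard library; the secret, the function names and the JSON payload format are placeholders, not details from the prototype.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"change-me"  # hypothetical server-side secret


def encode_cookie(history, secret=SECRET):
    """Serialize the click history and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(history).encode("utf-8"))
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_cookie(cookie, secret=SECRET):
    """Return the click history, or None if the cookie is malformed
    or its signature does not match."""
    try:
        payload, signature = cookie.encode("ascii").rsplit(b".", 1)
    except ValueError:  # no "." separator, or non-ASCII input
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking signature prefixes.
    if not hmac.compare_digest(signature, expected.encode("ascii")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


cookie = encode_cookie(["link-17", "link-42"])
history = decode_cookie(cookie)
```

Note that signing makes the history tamper-evident, not secret: anyone holding the cookie can still read it.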

The site was tracking only the last n - e.g. 500 - submitted links.  Anything older just falls out of the selection window and is ignored.
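A bounded window like this is easy to sketch with `collections.deque`, which discards the oldest entry once `maxlen` is reached. The class name and methods are hypothetical.

```python
from collections import deque


class LinkWindow:
    """Keep only the most recent `size` submitted links."""

    def __init__(self, size=500):
        self._links = deque(maxlen=size)

    def submit(self, link):
        # When the deque is full, appending drops the oldest link.
        self._links.append(link)

    def __contains__(self, link):
        return link in self._links

    def current(self):
        """Links still inside the window, oldest first."""
        return list(self._links)


window = LinkWindow(size=3)
for link_id in range(5):
    window.submit(link_id)
```

Votes for anything that has fallen out of the window can then simply be skipped when building the page.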

And behind the scenes when it generates the page for you, it was doing a Slope One bit of collaborative filtering.  Just like shopping sites do.
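For readers who haven't met Slope One, here is a compact sketch of the textbook weighted scheme, not the prototype's code. The function names are made up; the `weighted=False` switch corresponds to treating every weight as 1, and the ratings are toy numbers.

```python
from collections import defaultdict


def train(ratings):
    """ratings: {user: {item: score}}.

    Returns (dev, freq): dev[i][j] is the average of (score_i - score_j)
    over users who rated both items, freq[i][j] is how many such users.
    """
    dev = defaultdict(lambda: defaultdict(float))
    freq = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for i, score_i in user_ratings.items():
            for j, score_j in user_ratings.items():
                if i != j:
                    dev[i][j] += score_i - score_j
                    freq[i][j] += 1
    for i in dev:
        for j in dev[i]:
            dev[i][j] /= freq[i][j]  # turn sums into averages
    return dev, freq


def predict(user_ratings, dev, freq, weighted=True):
    """Predict scores for items the user has not rated yet."""
    totals = defaultdict(float)
    weights = defaultdict(float)
    for i in dev:
        if i in user_ratings:
            continue
        for j, score_j in user_ratings.items():
            if j in dev[i]:
                w = freq[i][j] if weighted else 1
                totals[i] += (dev[i][j] + score_j) * w
                weights[i] += w
    return {i: totals[i] / weights[i] for i in totals}


# Toy data: three users, three links.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
dev, freq = train(ratings)
print(predict(ratings["lucy"], dev, freq))
```

In a site like the one described, the scores would presumably be the implicit click-derived votes rather than star ratings.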

The implementation was very prototypy, using Python and such.  You could obviously make massive speed gains - caching, batching, native code and such - whilst keeping the central idea stable.  I think, despite the algorithmic complexity, it could be done for a large community with reasonable hardware.  You can play with rephrasing it as item-based instead of user-based and so on.

With regard to the forum part of all this, I didn’t want to go applying the same to comments.  Reddit and HN are built around the comments, which in a sense move the user away from the content creator and towards the aggregation site.  It is also where the horsepower goes (well, before self-organizing gives your CPUs something to crunch).

The approach I’d favor for comments is to scrape the source links and look for the well-known comment systems; then link directly to the content’s own comment system rather than making your own.

I got a few friends to use it.  It worked!  I degraded the Slope One by treating all weights as 1, and it still worked.

My personal passion wasn’t there to develop it further; knowing the idea works, I want the sites I loyally use to adopt it, rather than to create a new site to compete with them.  I rather hope that StackExchange, Reddit, even Hacker News all think about how a bit of self-organization can be integrated into news lists, Q&A and forums.  You could imagine hybrids as well as total adoption.

How does StackExchange look right now?  All too often I’ve seen the interesting problems - those I pause and have to think about and often don’t feel able to answer - being moved, or recommended for moving, to more specialised sibling sites.  And then you get these sibling sites and they are ghost-towns.  The sibling sites I keep an eye on, like gamedev, are failing to get any kind of traction with the professionals they aim at; on gamedev you can at best talk to rank hobbyists like myself about game-development, sadly; I have a faded t-shirt to prove it :(

How might a StackExchange look with collaborative filtering?  It’d be just StackOverflow, and we could be a bit less confused about that.  99% of visits would come from off-site search e.g. Google, and search engines really don’t care about the categorization side of things; it likely hinders as much as it helps.  On the other hand, the regular users, or people navigating away from a question they found by search, will not have to worry if their question overlaps math or combines a bit of admin or is best asked in some Ubuntu sibling rather than a generic Debian one or whatever nonsense.  In fact, a question about optimal grass clipping patterns could just as easily pique the interest of the maths crowd as it could the gardening crowd; the difference between theory and practice is different in practice, right?

The answer/comment system on StackExchange works well and doesn’t need changing.

I think StackOverflow could also build a more loyal core community by expanding into opinions beyond Q&A; users with a sufficient number of points could post blog-like entries, or be allowed to post things marked as ‘subjective’ and such.  Because in a collaborative filtering world, that vocal opinion piece that Joel and Jeff think so poisonous to community, and yet is so necessary, can be self-suppressed by users.

Reddit is harder to imagine.  There’s the reddit that is the page of links you might want to read, and that’s easy enough to imagine self-organizing.  But there’s also the reddit that is the community; people who sub-reddit because they want to protect their opinions and such.  And it would very much be about self-organizing the comments too.  It’s almost like an alternative site.  It would actually work best perhaps as a filter you can lay over the existing reddit site, and then when it works well people migrate and give up their sub-redditing identities; after all, they are going to tire and move on soon enough anyway; all opinionated communities seem to have churn.  What reddit does have a problem with is trolling, although everyone seems to disagree about who the trolls are.  In my eyes it’s the teenage boys who obliquely hint at sex every time a girl posts something; they are what I’d like to filter out despite their up-votes.  If enough people like me down-vote them, they will soon disappear from our view of the comment threads whilst those twits themselves will think they rule.  And they’d say the same about us, so it’s all win.

The whole collaborative filtering idea kind of breaks down at NSFW content.  In my prototype, I thought it very important to show everyone a smattering of atypical or un-profiled links right at the top, so you broaden your reading habits rather than narrow them.  However, this completely fails with NSFW.  For this reason, I think NSFW has to remain a categorization or a moderation system.

Hacker News, if it had collaborative filtering, could simply grow sideways into more domains.  If there were gardening links, for example, it could go do gardening news without needing a distinct site; it’s just that gardening-themed items would self-organize and it’d be great for people who like both.

So that’s thinking about it in terms of some existing sites; how about running it as a new site?  It’s not really viable for me to run this prototype for the benefit of you, I’m afraid.  But code is available on request; although really, this is the kind of thing you could now make yourself; there’s no secret sauce in it and I’ve spilled the beans on my implementation now.

So, can we get Reddit and StackExchange to un-split the sub-sites architecture?

