Creating Sparks

Gabe and I were talking on Twitter a few days ago when we hit on the subject of sparks. If you use Shaun Inman’s Fever, you’re familiar with these. If not, Shaun defines them thusly:

Sparks are inessential feeds that increase the temperature of links in the Hot view. Their unread items will never appear in the Kindling supergroup or in any of your custom groups. Link blogs and sites that frequently repost content are excellent candidates for Sparks.

Everybody’s sparks are going to be different. However, they’re important because they have a very real and direct impact on what Fever shows to be hot. If you don’t have decent sparks, your Hot list on Fever will never be dependable or useful.

So Gabe and I were talking and I thought hey, wouldn’t it be cool to have groups of starter sparks. These would be a list of feeds that you may not read but are read by people you enjoy reading. For example, I enjoy reading Daring Fireball. Gruber probably reads a lot of sites I don’t read. These would be good sparks.

So I shot an email off to Shaun and verified that, to the best of his knowledge, no one had developed a sparks repository. He also said:

Rather than organize it by vertical market it might be interesting to group them by popular feeds. So say you really like Daring Fireball. You’d analyze John’s linking history and select his most linked sites and create an OPML from their feeds (and maybe branch another layer to their most linked sites).

He basically laid out how it needed to be done. Nice!

Also: more difficult than I wanted! So I closed the email and proceeded to live my life as though none of this had ever happened. Until h1ro got involved:

h1ro tweet

Thanks, dude. I happened to be out looking at granite samples with my wife when h1ro sent that, so I had more than a few spare brain cycles available. The wheels started turning.

What’s necessary is a script that monitors Daring Fireball’s RSS feed and strips the links out. If we’ve seen the link before, skip it. If not, add the link to the list of links we’ve seen before. Also, add the domain to a list of popular domains. This “popular domains” file is eventually going to be important. It’ll be something like this:

Domain Counter
bloomberg.com 10
marco.org 7
nytimes.com 8

Et cetera. (I made those numbers up.) Obviously this data will take a while to build, so the script will have to run every couple of days for a few months or more before we have usable data.
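To make the bookkeeping concrete, here is a minimal sketch in Python of the skip-if-seen and count-the-domain steps. It assumes an RSS 2.0 feed where each item's link element already points at the outside article. The names update_counts, domain_of, seen_links and own_domain are made up for illustration, and fetching the feed and saving the two lists between runs are left out.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a URL, lowercased, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def update_counts(feed_xml, seen_links, domain_counts, own_domain):
    """Tally the domain of every link in the feed that we have not seen before.

    Assumption: RSS 2.0 layout, i.e. <item><link>...</link></item>.
    """
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link or link in seen_links:
            continue  # nothing to record, or already counted on an earlier run
        seen_links.add(link)
        domain = domain_of(link)
        # Links back to the site we are monitoring tell us nothing new.
        if domain and domain != own_domain:
            domain_counts[domain] += 1


# Usage: call this on every run, reusing the same set and Counter.
# seen, counts = set(), Counter()
# update_counts(feed_text, seen, counts, "daringfireball.net")
# counts.most_common() then gives the "popular domains" table.
```

Running it twice on the same feed leaves the counts unchanged, which is the point of keeping the list of seen links.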

Challenge 1

Unfortunately, not all RSS feeds are created equal. What’s necessary to strip URLs from one feed will almost certainly fail on a different site’s feed. If you doubt this, compare the following:

With each of these sites, clicking the title URL will normally take you to another site, the location of the article they’re discussing. This is the URL we want. However, not everyone structures their sites this way. Usually, clicking the title URL keeps you on the same site and takes you to the single-page entry for that post. The URL that is the subject of the post is usually buried somewhere in the first paragraph or two of the article. Most WordPress sites work this way.

So we’ll need a different script for each web site we’re tracking. This practically guarantees that I won’t be involved in the project for more time than it takes to generate 3-5 different OPMLs. But we’re already riding, so let’s see where the donkey takes us. (I have working code for Daring Fireball and parislemon.) Other sites may lend themselves better to XML::RSS::Parser.
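The two layouts can at least be told apart mechanically. The following is a rough Python sketch, not the code I have running: if the title URL points off-site, use it; otherwise fall back to the first off-site link found in the post body. The function name subject_url and its parameters are hypothetical, and a real feed will still have quirks that need per-site handling.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def host(url):
    """Lowercased host without a leading 'www.'; empty for relative URLs."""
    h = urlparse(url).netloc.lower()
    return h[4:] if h.startswith("www.") else h


def subject_url(title_url, body_html, own_domain):
    """Guess which URL a post is about.

    Link-blog layout: the title URL already points at another site.
    Permalink layout: the title URL stays on-site, so take the first
    off-site link in the body instead. Returns None if nothing qualifies.
    """
    if host(title_url) != own_domain:
        return title_url
    collector = LinkCollector()
    collector.feed(body_html)
    for href in collector.hrefs:
        h = host(href)
        if h and h != own_domain:
            return href
    return None
```

Taking the first off-site link is only a heuristic. A post that opens with a link to a sponsor or to an earlier piece elsewhere will fool it.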

Challenge 2

Once we have a meaningful amount of data, then we’ll need a script to hit each domain and locate their RSS feed. This should simply be a matter of running curl on each domain in the “popular domains” file and looking for something like the following in the HTML:

or

Once parsed, we dump the href field and a few others into the OPML and we’re done.
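As a sketch of that last step in Python: the code below assumes the markup we are hunting for is the conventional feed autodiscovery tag, a link element with rel="alternate" and a type of application/rss+xml or application/atom+xml. It takes HTML that has already been fetched (by curl or anything else). The names find_feeds and build_opml, and the "Starter sparks" title, are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect (title, href) for every feed autodiscovery <link> tag."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        mime = (a.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and a.get("href"):
            self.feeds.append((a.get("title") or "", a["href"]))


def find_feeds(html, base_url):
    """Return (title, absolute feed URL) pairs found in a page's HTML."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return [(title, urljoin(base_url, href)) for title, href in finder.feeds]


def build_opml(feeds, list_title="Starter sparks"):
    """Build an OPML document with one <outline> per feed."""
    opml = ET.Element("opml", version="1.0")
    head = ET.SubElement(opml, "head")
    ET.SubElement(head, "title").text = list_title
    body = ET.SubElement(opml, "body")
    for title, url in feeds:
        # Fall back to the URL when the page gave the feed no title.
        ET.SubElement(body, "outline", type="rss", text=title or url, xmlUrl=url)
    return ET.tostring(opml, encoding="unicode")
```

The urljoin call matters because plenty of sites give the feed href as a relative path. Sites that advertise no feed in their HTML at all will come back empty and need a manual look.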

As I mentioned, this is pretty slow going. First, stripping links out of RSS is tricky and prone to error. Second, building the “popular domains” table is going to take time. Third, as projects go, this one is pretty boring. So nobody gets to hold my feet to the fire. I’m open to a few suggestions as to sites we should use. I’ve obviously covered the two I’m mainly interested in.

I’ll post more as things progress.