Why People Like BuzzFeed Headlines?

Nicolas Dirand, September 26, 2013

You may know that HuffPost and BuzzFeed craft their content very carefully.

In fact, HuffPost's success was largely due to its search engine optimization.

Most of their articles are crafted in such a way that you feel compelled to read them!

You can also see a range of specialized content networks that promote these kinds of ‘shock articles’ all over the internet.

It works! It works so well that I had to take a more serious look at it!

Because I was taking the Social Network Analysis course on Coursera (by Lada Adamic), I saw an opportunity to make my final project about this very specific subject 🙂

Here it is: the small white paper I wrote for the course:

 

How are popular article titles crafted to make people click?

It’s no secret that content marketing is highly competitive, and some websites seem to be ahead of the competition.

For example, BuzzFeed and Huffington Post are well known for their highly popular headlines.

What makes them popular? Is it only pure keyword spamming?

Let’s get some data

I wrote a little shell script to get all the article titles, classified by ‘category’.

This covers a wide variety of subjects, ranging from the funniest to the most unpleasant kinds of content.

I did not want to focus on a single category, but rather to get a general view.

—————————–8< ——-[shell script] —— 8< —————————–

#!/bin/sh
# download the xml sitemap which contains all the links to the category pages
wget -U Mozilla -O tags.xml http://www.buzzfeed.com/go/sitemap_tags.xml
# extract the urls of those categories by stripping the surrounding xml tags
egrep -o '<loc>[^<]+</loc>' tags.xml | sed -e 's/<[^>]*>//g' > cleanurl.html
# download each url into a numbered .buzz file
P=0
cat cleanurl.html | while read A; do wget -O ${P}.buzz "${A}" ; P=$(($P+1)) ; done
# raw html is stored in the .buzz files
# next step: run each of them through lynx to get an easy-to-parse dump of the html
# we also drop every line mentioning buzzfeed itself (self-promotion etc.), not interesting for us
ls *.buzz | while read A; do mv $A tmp.html; lynx -dump tmp.html > clean.txt ; egrep '^\[' clean.txt ; done | grep -v -i buzzfeed | tee ready.txt
# keep only the title part of each lynx link line and deduplicate
cat ready.txt | cut -f 2 -d ']' | sort -u > articlestmp.txt
# remove single and double quotes; articles.txt is created at c:\python26\ and contains the titles ready to process
cat articlestmp.txt | sed -e "s/\x27//g" | sed -e "s/\x22//g" > /cygdrive/c/Python26/articles.txt

—————————–8< ——-[shell script] —— 8< —————————–

Processing the Data and Creating a Graph

Now articles.txt contains a list of titles that looks like this:

….

32 Is The Realistic Parody Of Taylor Swifts 22 You’ve Been Waiting For

7 Minutes In Heaven With Jon Hamm

Adventure Time And The Return Of Fiona And Cake

American Horror Story Has The Best Music Of Any TV Show Right Now

Aperture R&D Is The Portal 2 Series You’ve Been Waiting For

Arrested Development Gets 9 New Posters

Arrested Development On Netflix Gets A Teaser Poster And A

Avatar: The Last Airbender Is Searching For Answers

Bah Humbug! Supercut

Bane Protests Bain In NYC

Bel Ami Trailer Is Robert Pattinsons Escape Plan

….

Now it’s time to create a graph from the titles.

I will explain it step by step.

  • Read a line.
  • Tokenize the line, remove all the ‘stopwords’ (very common English words like ‘the’), and convert the tokens to lowercase.
  • Slide a window of three words over the cleaned tokens to build trigrams.

e.g:

>>> line = ['hello','my','dear','reviewer']

>>> nltk.ngrams(line,3)

[('hello', 'my', 'dear'), ('my', 'dear', 'reviewer')]

  • Apply a Porter stemmer to each word of the trigram (this strips plurals, suffixes, etc.).

e.g:

>>> st = nltk.PorterStemmer()

>>> st.stem('interest')

'interest'

>>> st.stem('interesting')

'interest'

This helps us avoid creating different nodes for similar verbs or adjectives.

  • Now, for each trigram, we take all pairwise combinations of its words (3 choose 2), then create a node for word1 and link it with word2. Each time we see the same association again, we increment the edge weight by one (a small networkx sketch of the weight increment follows the example below).

e.g:

>>> nltk.ngrams(line,3)[0]

('hello', 'my', 'dear')

>>> for t in itertools.combinations((nltk.ngrams(line,3)[0]),2):

...     print t

('hello', 'my')  <== node(hello) <-- link[weight++] --> node(my)

('hello', 'dear')

('my', 'dear')
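
The weight increment itself is not shown in the snippet above. Here is a minimal networkx sketch of that part only; the graph name G and the sample pairs are just for illustration:

>>> import networkx as nx
>>> G = nx.Graph()
>>> for a, b in [('hello', 'my'), ('hello', 'my'), ('hello', 'dear')]:
...     val = G[a][b]['weight'] if G.has_edge(a, b) else 0
...     G.add_edge(a, b, weight=val + 1)
...
>>> G['hello']['my']['weight']   # this pair was seen twice, so its weight is 2
2
>>> G['hello']['dear']['weight']
1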

 

The idea behind this way of building the network is to strongly associate words that are often used in the same context. We use trigrams because there is very little reason to take a bigger window.

Bigrams and trigrams are standard sliding window sizes. Using a larger window could make us associate totally unrelated words.

Don’t forget that we removed all the stop words, which makes each line pretty succinct in the end.
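
As an illustration, here is roughly what one of the sample titles above looks like after tokenization, stopword removal, and stemming (a minimal sketch; the exact tokens may vary a little with the NLTK version and stopword list):

>>> import nltk
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words('english')
>>> st = nltk.PorterStemmer()
>>> title = "7 Minutes In Heaven With Jon Hamm"
>>> clean = [w.lower() for w in nltk.word_tokenize(title) if w.lower() not in stoplist and len(w) > 2]
>>> [st.stem(w) for w in clean]
['minut', 'heaven', 'jon', 'hamm']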

 

 

—————————–8< ——-[ python code ] —— 8< —————————

__author__ = 'root'
import itertools
import nltk
from nltk.corpus import stopwords
import networkx as nx

# read every title, tokenize it, drop the stopwords and very short tokens, lowercase everything
data = open('articles.txt').readlines()
line_tag = []
stoplist = stopwords.words('english')
for line in data:
    clean = [w.lower() for w in nltk.word_tokenize(line.rstrip()) if w.lower() not in stoplist and len(w) > 2]
    line_tag.append(clean)

st = nltk.PorterStemmer()
G = nx.Graph()
for line in line_tag:
    # slide a trigram window over each cleaned title
    for ngram in nltk.ngrams(line, 3):
        # take every pair of words inside the trigram
        for it in itertools.combinations(ngram, 2):
            # order the pair so (a,b) and (b,a) map to the same edge
            if it[0] > it[1]:
                a, b = it[1], it[0]
            else:
                a, b = it[0], it[1]
            # strip non-ascii characters, then stem (Python 2 str handling)
            a = st.stem(a.decode('utf-8').encode('ascii', 'ignore'))
            b = st.stem(b.decode('utf-8').encode('ascii', 'ignore'))
            if len(a) < 1 or len(b) < 1:
                continue
            # each time the same association is seen, increment the edge weight
            val = 0
            if G.get_edge_data(a, b):
                val = G[a][b]['weight']
            val = val + 1
            #print a + " -> " + b + " " + str(val)
            G.add_weighted_edges_from([(a, b, val)])

nx.write_graphml(G, "buzztriunstem.graphml")

—————————–8< ——-[ python code ] —— 8< —————————

Analysis of the Network

Done in Gephi.

So once the graph was open, I first computed the following measures (a small networkx sketch of the same computations comes right after the list).

1) Degree:

The degree of a node is one of the crudest but most efficient ways to see how central a node is by itself; a must-have measure before going anywhere else.

2) Weighted Degree:

Because the network is weighted, this measure tells us which node has been seen the most.

3) Eigenvector centrality (600 iterations):

Eigenvector centrality tells us which nodes are highly connected because of their neighbors.

4) PageRank:

PageRank is another centrality measure that I use. It measures how much a node is linked to high-degree nodes.

PageRank is less sensitive to the effect of having tons of low-degree nodes connected to you.

We can think of it like this: a link from CNN.com is worth 1000 links from random and unknown blogs.

5) Modularity

Essential! It classifies each node into a community of nodes that are more or less close to each other.
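
The original analysis was done in Gephi, but the same measures can be computed directly on the exported graphml. Here is a minimal networkx sketch of roughly equivalent computations; the file name comes from the script above, everything else is illustrative:

—————————–8< ——-[ python code ] —— 8< —————————

import networkx as nx

# load the graph produced by the previous script
G = nx.read_graphml("buzztriunstem.graphml")

# 1) degree and 2) weighted degree
degree = dict(G.degree())
weighted_degree = dict(G.degree(weight='weight'))

# 3) eigenvector centrality (Gephi was run with 600 iterations)
eig = nx.eigenvector_centrality(G, max_iter=600)

# 4) PageRank, using the co-occurrence counts as edge weights
pr = nx.pagerank(G, weight='weight')

# 5) modularity classes: Gephi uses the Louvain method; networkx itself only
# ships a greedy alternative in nx.algorithms.community

# top 10 words by weighted degree
for node, val in sorted(weighted_degree.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(node, val)

—————————–8< ——-[ python code ] —— 8< —————————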

Now it is time to visualize the network

By default there are way too many nodes, so I add a degree filter from 64 to 856.

Then I resize each node by degree and color it by its modularity class.
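
The filtering itself was done in Gephi's UI; if you want the same cut outside Gephi, a minimal networkx sketch (same 64 to 856 degree range, variable names are illustrative):

>>> import networkx as nx
>>> G = nx.read_graphml("buzztriunstem.graphml")
>>> keep = [n for n, d in dict(G.degree()).items() if 64 <= d <= 856]
>>> H = G.subgraph(keep)   # only the most connected words remain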

[Figure: BuzzFeed title analysis network, nodes sized by degree and colored by modularity class]

 

Continue to Part 2.

