Flipboard’s Approach to Automatic Summarization

October 1, 2014

Bringing the beauty of print to the mobile interface is our all-encompassing vision at Flipboard; in doing so, we’ve learned that it’s necessary to provide our users with an experience dedicated solely to their content. With powerful magazine and topical recommendations, we’ve nearly perfected the way our users find stories, but until now we have never tinkered with how our users read them.

Why Flipboard Needs Summarization

At Flipboard, we’re known for our polished and beautiful dynamic layouts. Constructing these layouts is a challenge; with a user base fragmented across iPhone, iPad, Android, and Windows, it’s important for us to optimize our content appropriately to support varying screen sizes and content layouts. Solving for these size constraints becomes much easier when we can emphasize the key parts of content and filter out the less important.

Sentence Graphs in Summarization

Our focus has been on extractive summarization.

In extractive summarization, the objective is to identify the essential, or central, sentences in a document. One way of modeling a document is as a graph, with each sentence of the document represented with a node and the relationships between those sentences represented with weighted edges.

[Figure: sentence graph]

We model sentences as bags of words, and the strength of interaction between two sentences as the similarity between their respective word sets. There are several standard measures for this, such as Jaccard similarity and Hamming distance. Having selected one, we then normalize the edge weights so that the weights of each node’s outgoing edges sum to one.
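As a rough illustration of the graph construction described above, here is a minimal Python sketch using Jaccard similarity. It is not Flipboard’s Java implementation: the function names (`tokenize`, `jaccard`, `build_transition_matrix`), the simple regex tokenizer, and the uniform fallback for sentences that share no words with any other sentence are all assumptions made for this example.

```python
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of word tokens."""
    # Assumption: a simple regex tokenizer; a real system may use stemming, stop words, etc.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_transition_matrix(sentences):
    """Return a row-normalized similarity matrix (each row sums to one)."""
    bags = [tokenize(s) for s in sentences]
    n = len(bags)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # no self-loops
                weights[i][j] = jaccard(bags[i], bags[j])
    for i, row in enumerate(weights):
        total = sum(row)
        if total > 0:
            weights[i] = [w / total for w in row]
        else:
            # Assumption: a sentence with no overlap links uniformly to all nodes,
            # so that every row still sums to one.
            weights[i] = [1.0 / n] * n
    return weights


matrix = build_transition_matrix([
    "the cat sat",
    "the cat ran",
    "dogs bark loudly",
])
```

Each row of `matrix` holds the normalized outgoing edge weights of one sentence, which is the form a centrality computation over the graph would consume.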

The normalized adjacency matrix of the graph is thus row-stochastic. Given that, we can consider the centrality of nodes (sentences) in the graph; in particular, we can compute the PageRank centrality measure for each sentence in the document. Higher-scoring sentences are more central and more typical of the document.[1]
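The centrality computation can be sketched with a plain power iteration over a row-stochastic matrix. This is only an illustrative sketch: the function name `pagerank`, the damping factor of 0.85, the tolerance, and the iteration cap are assumptions for the example, since the post does not state which parameters the production system uses.

```python
def pagerank(transition, damping=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank over a row-stochastic matrix (list of lists).

    transition[i][j] is the normalized weight of the edge from node i to node j.
    The damping factor and stopping criteria are illustrative assumptions.
    """
    n = len(transition)
    if n == 0:
        return []
    scores = [1.0 / n] * n  # start from a uniform distribution
    for _ in range(max_iter):
        new_scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy graph: node 0 is linked from both other nodes, so it should score highest.
example = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
scores = pagerank(example)
```

In the toy graph, sentence 0 receives all of the outgoing weight from the other two sentences, so it ends up with the highest score, while sentences 1 and 2 tie.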

We then sort the sentences by their scores, select the top n (depending on the amount of space available on-screen for the summary), and reorder them by their order of appearance in the original document.
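The selection step can be sketched in a few lines of Python. The function name `select_summary` and the sample data are hypothetical; the scores here are made up for illustration and would in practice come from the centrality computation.

```python
def select_summary(sentences, scores, n):
    """Pick the n highest-scoring sentences and return them in document order."""
    # Rank sentence indices by score, highest first (ties keep document order).
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the top n indices, then restore their original order of appearance.
    chosen = sorted(ranked[:max(n, 0)])
    return [sentences[i] for i in chosen]


sentences = ["A", "B", "C", "D"]
scores = [0.1, 0.4, 0.2, 0.3]  # made-up scores for illustration
summary = select_summary(sentences, scores, 2)  # ["B", "D"]
```

Here `n` stands in for however many sentences fit in the space available on-screen.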

All of this work can be done very quickly. Our Java implementation, running on an AWS EC2 instance, averaged 16 ms per document over 50,000 documents, with a standard deviation of 48 ms. The 95th-percentile measurement was 65 ms.

Summarization in Action

As an example, here’s a summary of this blog post about Star Wars:

Summary:

Excerpt (first four to five sentences):

Notice that the first four sentences of the article, which are mostly authorial color and commentary, aren’t in the summary. The extracted summary provides a much better synopsis of the post, which is even more apparent in-app:

[Screenshot: excerpt shown in-app]

[Screenshot: summary shown in-app]

Here’s another case—a summary of a blog post about Alaska Airlines:

Summary:

Excerpt (first four to five sentences):

Surfacing important textual content becomes even more crucial when screen space is severely limited; for instance, in a push notification:

[Screenshot: push notification using the excerpt]

[Screenshot: push notification using the summary]

The difference in quality here is immediately apparent.

The Future

While a combination of the central parts of a piece of content is a good guess at what’s important to a reader, it isn’t perfect, because importance is inherently subjective. A culinary novice might want a basic ingredients list from an article about mushroom risotto, while a more experienced chef might want to learn new techniques from the same article. Putting user-specific data to work in creating personalized summaries is something we are excited to explore.

We’re looking forward to rolling out summarization across Flipboard in the coming months.

We’re always looking for ways to improve Flipboard through tackling interesting problems like summarization. If working on NLP or data-related projects is interesting to you, we’re hiring.

  1. This approach was developed by Erkan and collaborators in their 2004 paper, “LexRank: Graph-based Lexical Centrality as Salience in Text Summarization.”

Special thanks to Andrew, Boris, David, Jerry and Cecily for edits and suggestions.

by Yonatan Oren