## Who will your friends be next week? The link prediction problem

Sitting in the student center of our university, I am surrounded by hundreds of students enjoying their lunch and socializing. They’re strengthening (and in some cases weakening) their social ties. Given the ability to observe this social network over time, we would see that some relationships flourish, while others disappear altogether.

This situation is not unique to university students. In fact, whether we’re studying the spreading of an infectious disease, or the growth of an organized crime network, the reality is that the relationships in these social networks are dynamic. They change. Being able to describe the current state of a network is important. Our goal is to predict the future state of the network by forecasting who will connect to whom. If we could make these sorts of predictions, then we may be able to better recommend or warn of future probable links, as well as come to understand mechanisms which may be driving the evolution of the network.

This brings us to the link prediction problem. Liben-Nowell and Kleinberg defined it as follows:

The link prediction problem asks, “Given a snapshot of a network at time t, can we predict new links which will occur at time t+1?”

In our work, we explore how topological similarity indices of a network can be combined with node specific data to develop a link prediction tool. Rather than pre-supposing that all similarity indices are of equal importance, we employ an evolutionary algorithm to evolve coefficients to be used in a linear combination of these similarity indices. Our approach has the advantage of being able to detect which similarity indices are more salient predictors and doesn’t require any knowledge of the type of network one may be working with.

To test our method, we begin by examining one of the largest social networks in existence, namely Twitter. Below, we visualize the network of reciprocal replies for users who interacted within a single week in late 2008.

Given information about interactions in a given week, we seek to predict links in a future week.

Our evolved predictors exhibit a thousand fold improvement over random link prediction, and a substantial improvement over individual indices used in isolation. The predictor also suggests possible factors which may be driving the evolution of Twitter reciprocal reply networks during the timespan of this study.

Our predictor reveals which topological or user specific indices are most important in link detection for a given a network. For example, index B is most often the top ranking index detected by the predictor, while indices E and I are also important. This type of output can be helpful in detecting the salient features which may be driving the network’s evolution over time.

Returning to our original question, we ask: Given a snapshot of a Twitter reciprocal reply network, is it possible to predict the links which will occur in the future?

One approach is to predict a link between all of the people in the network, but clearly this is not the approach we wish to take. For example, in the case of a modestly sized network (say N=30,000 nodes), the number of potential links is roughly half a million! If we’re trying to recommend books to a shopper or potential dates to a person using a dating service, we certainly would not want to suggest half a million people for potential dates. It would be nice if we could recommend 10 books (or dates) and have the majority of those suggested links be successful.

In the language of signal processing, we hope to have a true positive rate (e.g., the % of time you’re actually right about the links that you predict) that is greater than the false positive rate (e.g., the % of time you’ve issued a false alarm). This relationship can be captured by the Receiver Operating Curve (ROC) shown below.

The Receiver Operating Curve (ROC) compares the true positive rate (TPR) and the false positive rate (FPR). The ROC curve shown here depicts a classifier (our link predictor) for which TPR>FPR.

The area under the curve (AUC) provides one way of measuring the relative success of one’s method. An AUC of .50 suggests that the true positive rate is equal to the false positive rate, while an AUC > 0.50 indicates that the true positive rate is greater than the false positive rate. As shown in the picture above, our AUC is well above 0.50, in fact it is approximately 0.72, which is good!

These results are exciting! We’ve put together a research paper in which we describe our analysis and algorithm in more detail (http://arxiv.org/abs/1304.6257). Although we focused on Twitter in this investigation, our methodology is general and may apply to link prediction in many other types of time varying networks, such as disease networks or crime. It could also improve the “friends you may know” feature offered by many social networking services.

Here is a short video summarizing our work on link prediction:

Filed under mathematics, networks, social phenomena

## The Daily Unraveling of the Human Mind

Each morning we find ourselves in wide flung arms of drowsy possibilites. Cradled by the warm embrace of our beds, we begin our day, rebooted and rejuvenated. Having not eaten for a full eight hours, we can enjoy a guilt free breakfast, setting a blissful tone for the day.

Hourly frequency of meal references on twitter.
See figure 1 page 3 of our paper for details.

Last night’s dreams of victory and triumph bolster our delusions of adequacy, preparing us to surmount any of life’s challenges. But the moment we step outside, reality commences its slow and insidious descent. Its weight, compressing our spine, crushing our dreams, alters the course of the day completely.  The soul crushing litany of work, interacting with people, and generally living our lives takes its toll. As our sanity unravels, apathy takes root. The profane becomes our standard of expression. In the throes of despair, we swear just to feel something. We swear increasingly as we realize the inevitability of repeating this all again tomorrow.

F***, that’s a terrifying thought.

This ephemeral pattern is reflected in our tweets, our spontaneous burst of being. Below, we see our happiness peaks during the early hours of the day, and degrades as the hours progress (yellow circles). The proportion of profanity in our tweets, however, follows a reverse cycle. Profanity appears in a smaller percent of tweets at the start of each day, and increases gradually as time wears on.

Daily Unraveling
See figure 10 page 15 of our paper for details.

Remarkably, the relative frequency of these five expressions of frustration (a******,  b****, s***, f***, m***********) are quite similar.

Well done, humans.

To avoid experiencing the daily unraveling, we recommend eating organic, local dark chocolate all day long.

Filed under mathematics, psychology, social phenomena

## What’s the Most Important Theorem?

Mathematical truths are organized in an incredibly structured manner. We start with the basic properties of the natural numbers, called axioms, and slowly, painfully work our way up, reaching the real numbers, the joys of calculus, and far, far beyond. To prove new theorems, we make use of old theorems, creating a network of interconnected results—a mathematical house of cards.

So what’s the big picture view of this web of theorems? Here, we take a first look at a part of the `Theorem Network’, and uncover surprising facts about the ones that are important.  This is blatantly fun for us. Really.

Let’s go through an example starting with the real numbers.  Mathematicians like to write these numbers as $\mathbb{R}$, and here we’ll start by bravely assuming that they exist. One result that follows from the existence of $\mathbb{R}$: Given a real number $a$ belonging to $\mathbb{R}$, we can find a natural number $n$ (e.g. 1, 2, 3 …) such that $n>a$. This is known as the Archimedean property.  To visualize this relationship, we draw an arrow from the existence of $\mathbb{R}$ to the Archimedean property:

Now, the fact that real numbers satisfy the Archimedean property tells us something about sets that contain them. For example, more than a century ago, two guys named Heine and Borel used the Archimedean property to help prove their glorious, eponymous theorem.  We’ll now add an arrow leading from the Archimedean Property to the Heine Borel theorem, and we’ll include the one other component Heine and Borel needed:

All right: who is this De Morgan and what are his laws?  Back in the mid 1800′s, Augustus De Morgan dropped this bit of logical wizardry on the masses: “the negation of a conjunction is the disjunction of the negation.” We know, really exciting words.  If it’s not true that both A and B are true, then this is the same as saying either A or B or both are not true.  Better?

Before diving into a larger network, let’s think some more about these links.  One could prove the Fundamental Theorem of Calculus (which sounds important but could be just good branding) with nothing more than the axioms of ZFC set theory. But such a proof would be so long and tedious that any hope of conveying a clear understanding  would be lost.  Imagine taking all the atoms that make up a duck and trying to stick them together to create a duck; this would be the worst Lego kit ever.  And so in any mathematical analysis textbook, the theorems contain small stories of logic that are meaningful to mathematicians, and theorems that are connected are neither too close or too far apart.

For this post, what we’ve done is to take all of the theorems contained in the third edition of Walter Rudin’s Principles of Mathematical Analysis, and displayed them as nodes in a network. As for our simple networks above, directed edges are drawn from Theorem $A$ to Theorem $B$ if the proof of $B$ relied on $A$ explicitly. Here’s the full network:

###### Node size weighted by total incoming degree, colored by chapter, and laid out by Gephi’s Force Atlas.

We find that Lebesgue theory (capstoned by Lebesgue Dominated Convergence) lives on the fringe, not nearly as tied up with the properties of the real numbers as the Riemann-Stieltjes integral or the integration of differential forms. Visually, it appears that the integration of differential forms and functions of several variables rely the most on prior results. Over on the right, we’ve got things going on with sequences and series, where the well-known Cauchy Convergence criterion is labeled. By sizing the nodes proportional to their outgoing degree (i.e., the number of theorems they lead to), we observe that the basic properties of $\mathbb{R}$, of sets, and of topology (purple) lie at the core.

By considering the difference between outgoing and incoming degrees, we can find the most fundamental result (highest differential in outgoing and incoming degree, or net outgoing degree), and the most important or “end of the road” result (highest differential in incoming and outgoing degrees, or net incoming degree).  In Rudin’s text, the most fundamental result is De Morgan’s Laws, and the most important result is Multivariate Change of Variables in Integration Theorem (MCVIT, that’s a mouthful).

So the Fundamental Theorem of Calculus falls short of the mark with a net incoming degree 19, not even half of MCVIT’s net incoming degree of 45. And it is not the axioms of the real numbers that are the most fundamental, with the Existence of $\mathbb{R}$ having a net outgoing degree of 94, but instead the properties of sets shown by De Morgan with a whopping net outgoing degree of 122. Larry Page’s PageRank (the original algorithm behind Google) and Jon Kleinberg’s HITS algorithm also both rate the MCVIT as the most important result.

Would you agree that MCVIT is the most “important” result in Rudin’s text? It could just be the most technical.  We have only used a few lenses through which one might choose to evaluate the importance of theorems, so let us know what you think, or give it a try. Here’s a link to the Gephi files, containing all of the data used here.

Lastly, the network itself can be built differently by changing which theorems are included, or which are used in proofs. The resulting structures combine historical development with the author’s understanding. The goal of new textbooks is, in part, to organize the results in the most understandable fashion. With this view, we can start to think of the Theorem Network as the natural structuring of complex mathematical ideas for the human mind.

Now, one might idly think of extending this analysis to all of human knowledge. In that direction, Griff over at Griff’s Graphs has been making some very nice pictures leveraging the work of all those who edit Wikipedia.

Filed under mathematics, networks

## If you’re happy and we know it … are your friends?

Do your friends influence your behavior?  Of course they do.  But it’s hard to actually measure their influence.  Social contagion is difficult to distinguish from homophily, the tendency we have to seek relationships with people like ourselves.

In response to the “happiness is contagious” phenomenon promoted by Nicholas Christakis and James Fowler, we here at onehappybird were wondering whether happy Twitter users were more likely to be connected to each other.  In other words, is happiness assortative in the Twitter social network?  (See related work here.)

In the image below, each circle represents a person in the social network of the center node.  We color nodes by the happiness of their tweets during a single week.  Pink colors are happier, gray colors are sadder, and nodes depicted with the color black did not meet our thresholding criteria (50 labMT words).

We established a friendship link between two users if they both replied directly to the other at least once during the week.

As users are added to this network, it quickly becomes difficult to tell whether pink nodes are disproportionately connected to each other, so instead we look at the correlation of their happiness scores.  The plot below shows the Spearman correlation coefficient of the happiness ranks for roughly 100,000 people, with blue squares and green diamonds indicating different word thresholds, and red circles representing the same network but with randomly shuffled happiness scores.

The larger correlation for friends indicates that happy users are likely to be connected to each other, as are sad users. Moving further away from one’s local social neighborhood to friends of friends, and friends of friends of friends, the strength of assortativity decreases as expected.

We also looked at the average happiness of users as a function of their number of friends (degree k). Happiness increases gradually with popularity, with large degree nodes demonstrating a larger average happiness than small degree nodes.

The most popular users used words such as “you,” “thanks,” and “lol” more frequently than small degree nodes, while the latter group used words such as “damn,” “hate,” and “tired” more frequently.  The transition appears to occur near Dunbar’s number (around 150), demonstrating a quantitative difference between personal and professional relationships.

Finally, here we show a visualization of the reciprocal-reply network for the day of October 28, 2008.

The size of the nodes is proportional to their degree, and colors indicate communities detected by Gephi’s community detection algorithm.

For more details, see the publication:

C. A. Bliss, I. M. Kloumann, K. D. Harris, C. M. Danforth, P. S. Dodds.  Twitter Reciprocal Reply Networks Exhibit Assortativity with Respect to Happiness. Journal of Computational Science. 2012. [pdf]

Abstract: Based on nearly 40 million message pairs posted to Twitter between September 2008 and February 2009, we construct and examine the revealed social network structure and dynamics over the time scales of days, weeks, and months. At the level of user behavior, we employ our recently developed hedonometric analysis methods to investigate patterns of sentiment expression. We find users’ average happiness scores to be positively and significantly correlated with those of users one, two, and three links away. We strengthen our analysis by proposing and using a null model to test the effect of network topology on the assortativity of happiness. We also find evidence that more well connected users write happier status updates, with a transition occurring around Dunbar’s number. More generally, our work provides evidence of a social sub-network structure within Twitter and raises several methodological points of interest with regard to social network reconstructions.

Filed under psychology, social phenomena

## Question: Where is the happiest place in New York City?

1. Immediately adjacent to any hot dog stand.
2. Madison Square Garden during moments of Linsanity.
3. Tim Tebow’s new apartment building.

No really though, let’s measure some stuff.

Facts: (1) New York City is the most populous city in the US and (2) Manhattan streets are arranged on a rectangular grid. We have already seen how cities, airports, and even streets can be identified using geotagged tweets – here we use more than a half million messages from 2011 to investigate the happiness of NYC streets and avenues (clearly visible in the image below, as is Central Park).

Binning tweets by avenue and street, we use the labMT word list to measure happiness in tweets as a function of avenue and street number:

The results suggest that the west side is slightly happier than the east side, and that happiness actually declines as one moves further uptown. Next we bin by intersection and plot a heat map showing the distribution of happiness over all of the street corners in Manhattan:

The happiest “corner” is actually just inside the western edge of Central Park, where the intersection of 7th and 77th would be (this is just north of the lake and east of the Hayden Planetarium)*. This corner elicits tweets with a relatively high abundance of the positive words “loves” and “sky”, and proportionally less negative words like “not”, “fear” and “no”. Many of the happiest locations actually fall within Central Park!

* Please note that the results reported in this post have not been vetted through panels of experts, statistical tests of significance, or scientific peer review.  They are intended to be a fun and lighthearted exploration of our more formal research interests.

Filed under psychology, social phenomena

## Does QWERTY Affect Happiness?

Last week, news broke of a paper published in the Psychonomic Bulletin and Review by Kyle Jasmin and Daniel Casasanto claiming to observe a positive relationship between the “right-handedness” of a word and its emotional valence. This is being called the ‘QWERTY effect’. (You may recall that ‘valence’ is psych-speak for ‘happiness’ associated with words.  What I called right-handedness they call the right side advantage of a word, $\text{RSA} = (\text{\# of right side letters}) - (\text{\# of left side letters})$ when typed using the ubiquitous QWERTY keyboard. )  You can read the original paper here, and there’s a Wired article that explains their conclusions.

Particularly interesting for the group here in Vermont was Jasmin and Casasanto’s use of the Affective Norms for English Words (ANEW, from Bradley and Lang (1999)) dataset, along with comparable data for Spanish and Dutch, in their analysis. The hedonometric work we’ve done on blogs, music lyrics, Twitter, etc. was initially based on the happiness scores from the ANEW study. The 1034 ANEW words were handpicked to represent the emotional spectrum, and as such don’t represent a uniform selection of words found in English-language texts. We merged the 5,000 most common words from 4 corpora (Twitter, Google Books, the New York Times, and music lyrics) and had Mechanical Turk users evaluate their valence in the same way as was done for ANEW, producing a list of ~10,000 words and their associated happiness scores. We’re calling this dataset LabMT-1.0, for Language assessment by Mechanical Turk. Since LabMT words were picked by frequency of usage, they provide much better coverage (i.e. the percent of words identified in a text) than ANEW.

When Jasmin and Casasanto’s paper appeared and achieved the impressive press coverage that it did, it also attracted the scrutiny of other language researchers who weren’t so sure of the significance of the QWERTY effect. A public debate has taken place between Mark Liberman of the Language Log blog and the authors of the study. See post1, post2, the response from J&C, and the response back. After being informed by (our) Peter Dodds of the LabMT data, Liberman made the second post, in which he calculated the RSA of our LabMT words but continued to find no or little QWERTY effect.

As we’ve explored in our hedonometrics papers, the hedonometer can be thought of as a tunable instrument when you remove neutral-valence stop words, effectively increasing the sensitivity of happiness measurements for texts. I wanted to see if tuning $\Delta h_\text{avg}$ changed anything. In the process, I also repeated Liberman’s analysis of the LabMT data and am making available the R scripts and data that went into that.

analyze_rsa_labmt.R – script for the analysis and plots
labMT.rsa.txt – Liberman’s computation of RSA for the subset of LabMT-1.0 words containing only alphabetic characters

We haven’t seen any more evidence than Liberman did when looking simply at the relationship between RSA and $h_\text{avg}$. If the QWERTY effect is real, then it is exceedingly small, but the above data point to it being indistinguishable from zero.  It’s useful to look at the raw data, binned in both variables.

Raw data binned (RSA spacing of 1, $h_\text{avg}$ spacing of 0.1) and plotted.

There is not any obvious, visually distinguishable correlation.

Now, if you take the average happiness of words for each RSA value, you can do a linear regression on that data, weighting each point by the number of words for that RSA value.

Data binned by RSA with the line indicating the linear regression weighted by the number of words for that RSA. Note that this is the same as a linear fit of the unbinned data, but the resulting plot is less cluttered and easier to read.

The trend actually runs in the negative direction, but with a p-value of 0.74, meaning there is no effect. Jasmin and Casasanto controlled for more variables in a different dataset, and independent evaluation of the significance of the correlations they observed, controlling for these other attributes, would be possible if all the original data were released. Sure, the data sources are listed, but it would be a significant effort to recreate the entire set. I’d also be curious to see if similar correlations could be observed in the other affective variables measured in ANEW (arousal and dominance).

Final note: Changing $\Delta h_\text{avg}$, our tuning knob, does change the magnitude of the correlation. (Imagine removing a horizontal band from the binned plot above; this changes the correlation.) However, it is still impossible to conclude that the effect is significant. Also, analyzing positive and negative words separately shows opposite trends for $\Delta h_\text{avg} = 1$. The code for this is all included in the script above.

Filed under psychology, social phenomena

## Chaos in an Experimental Toy Climate

In the 1960’s, MIT meteorologist Edward Lorenz was investigating the effects of nonlinearity on short-term weather prediction in a model of convection. In his ground-breaking paper “Deterministic Nonperiodic Flow,” Lorenz showed that numerical solutions of the model exhibit sensitive dependence on their initial position, leading virtually indistinguishable states to diverge quickly. This phenomenon, which became known as chaos, is a major contributor to inaccuracies in weather and climate forecasts.

The thermal convection loop is an experimental analog of Lorenz’s system in the form of a hula-hoop shaped tube, filled with fluid, and oriented vertically like a wheel. The bottom half of the tube is warmed uniformly by a bath of hot water and the top half is cooled. Under certain conditions, a steady state is never reached, and the fluid switches direction in an unpredictable pattern.

In the past few years, we have used Computational Fluid Dynamics (CFD) simulations of the loop as a testbed for data assimilation, ensemble forecasting, and model error experiments in weather and climate prediction. Our team is developing algorithms to improve forecasts and uncertainty quantification using this simple but realistic toy climate. Successful techniques are then implemented on more realistic weather and climate models.

Details:

K. D. Harris, E.-H. Ridouane, D. L. Hitt, C. M. Danforth. 2012. Predicting Flow Reversals in Chaotic Natural Convection using Data Assimilation. Tellus A, 64, 17598. [pdf]

N. Allgaier, K. D. Harris, C. M. Danforth. 2012. Empirical Correction of a Toy Climate Model. Physical Review E. 85, 026201. [pdf]

R. Lieb-Lappen, C. M. Danforth. 2012. Aggressive Shadowing of a Low-Dimensional Model of Atmospheric Dynamics. Physica D. Volume 241, Issue 6, Pages 637–648. [pdf]

E.-H. Ridouane, C. M. Danforth, D. L. Hitt. 2009. A 2-D Numerical Study Of Chaotic Flow In A Natural Convection Loop. International Journal of Heat and Mass Transfer. [pdf]

and a lecture on the topic given by Danforth to the Applied Dynamics graduate course at UNC Chapel Hill:

Funding from the project comes from NASA and NSF through the Mathematics and Climate Research Network.

1 Comment

Filed under physics