On clique detection and PageRank

It's time for some graph theory

Aug 05, 2021

We start by letting Wikipedia do the lifting:

In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines). A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically. Graphs are one of the principal objects of study in discrete mathematics.

Graph theory is generally considered a form of pure mathematics. There are many interesting1 algorithmic problems regarding calculations of various functions of graphs2.

One problem is to find the largest clique3 of a graph. This is an NP-complete problem; and the way you demonstrate that is by adding structure. You create a graph with labeled vertices for each node in a 3-SAT problem, connect non-contradictory vertices in different nodes (everything except contradictions), and if the largest clique on this multi-partite node is large enough, it will be a solution.

A second problem asks: what is the relationship between the largest clique of a graph, and the largest clique of the complement of the graph? This leads into the entire branch of mathematics known as Ramsey theory. Even simple boundaries based on graph size are virtually impossible to calculate.

There are a variety of clever algorithms to try to solve this for an arbitrary graph. However, arbitrary graphs are rarely interesting. We are more interested in properties of graphs that encode useful information. Such as your Facebook friends. Here, the limit of 5000 edges per person makes the problem much more tractable.

Or the World Wide Web.

A terminology note: once upon a time there was a difference between the World Wide Web and the Internet. The World Wide Web referred to a specific Hypertext system. It used the Hypertext Transfer Protocol and was intended to share hypertext documents. Everything else (Usenet, email) was part of the Internet, but not the World Wide Web.

Today, we use browser apps for everything. And yet we want a word for “just the pages that have information on them”. As such, certain types of web apps I consider to be separate from the “World Wide Web”. While I won’t try to develop a firm dichotomy, certainly GMail and Facebook should be considered as a separate part of the Internet.

What we are left with as the World Wide Web is a graph of billions of pages. (Our node is of web pages, not web sites4.) This graph has unidirectional edges, which means that our clique problem above is different, but that’s not a terribly useful problem. Cliques are generally just internal navigation within a single web site. A more useful question is: what are the most “popular” nodes?

The answer to that is (roughly) the eigenvector of the adjacency matrix. You know the algorithm to calculate it as PageRank. On the World Wide Web of 20 years ago, this worked very well. There are a few founts of PageRank - initially university homepages5. Faculty websites in the 1990s often had many lists of links to useful pages. In the end, the homepages of the most popular web sites (and http://www.adobe.com/products/acrobat/readstep2.html ) had the most PageRank.

Of course, this stopped working as soon as people tried to optimize for it. What followed were twenty years of epicycles, spam detection, additional algorithms, knowledge graphs, OneBoxes, and universal search. Today, the World Wide Web is too large, and is filled with infinite rabbit-holes such as calendars. Nobody even bothers to try to index all of it, and most of the time nobody even notices what is missing. For many queries, either “just ask Wikipedia” or “just ask who will pay the most” is a better result anyhow.

In a sense, this simply highlights that PageRank is not intended as an adversarial algorithm. It works as desired only so long as there aren’t agents who want the system to act differently. Next week, we will highlight some algorithms that are designed to handle adversaries and the Byzantine Generals problem. So stay tuned.

I use “interesting” in a mathematician’s sense. Which, normally, means “non-obvious to the researcher and useless to anyone else”.

Most of the time, you think of a function as f(x) = 3x+7 or f(x,y,z) = x+y+z. If you’re a programmer, you may think of main(argc, argv) {}. All of these take arguments that are either numbers or character-strings. But there’s no reason we can’t pass a graph as a function argument. Metaphorically, we are not passing a data structure representation of a graph, we are passing a graph. Similarly, f(λ, 力, 🏙️) doesn’t require English descriptions of characters.

A clique is a fully-connected sub-graph. If the vertices have identifiers, it makes sense to return the result. If the vertices are indistinguishable, the only useful property is the size of the largest clique; all fully-connected sub-graphs are the same.

We use “site” rather than “domain” (as in a Domain Name) for clarity. A single logical site may be on several domain namespaces. Also, “domain” and “range” have reserved meanings when talking about functions.

The three founts used initially were the homepages of Page and Brin’s almae matres: Stanford, UMichigan, and UMaryland. This worked for some languages and verticals.

the Tisatsar Newslettr

Discussion about this post