Yesterday I was at a meeting on this subject at Royal Holloway, organised by Simon Blackburn.
The subject is relatively young; it began in 1995 with a paper by Chor, Goldreich, Kushilevitz and Sudan. At the meeting, Alex Vardy gave us a very clear account of the theory (on which what is below is mostly based), before describing his own contribution.
Suppose Alice wants to download a file from an online database, without the database learning which file she is interested in. There is one simple and sure way to do this, though potentially rather expensive; she can download the entire database, and then privately select the file she wants. In fact, there is no protocol in general which does better.
So is this the end of the story? No, there are two approaches which have been adopted. One is computational: the database manager may be able to learn Alice’s choice, but the computation required to discover this is prohibitive. There are protocols which achieve this, but at the expense of being themselves computationally intensive. The other approach, which was the concern of the meeting, is information-theoretic. This makes the crucial assumption that the data is stored on a number (say k) of servers, which do not communicate with one another.
To simplify matters, we assume that the database consists of a binary string x of length n, and Alice wants the ith bit. Of course she probably wants a file whose size may be gigabytes, but (apart from re-scaling the resources required) the principle is the same.
To show that the goal is not impossible, here is a simple protocol for the case k = 2. Alice generates a random string u of length n. Let u′ be the result of changing the ith bit of u. Alice’s requests to the two servers are the strings u and u′. The servers return the two bits x·u and x·u′ (dot products modulo 2); by adding them modulo 2, Alice gets x·(u + u′) = x_i, since u + u′ has a single 1, in position i. Since each server sees a random string from Alice, neither learns anything about her interests. (Of course, if the servers do communicate, they can do the same calculation as Alice; we must also assume that the communications between Alice and the servers are secure.)
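Here is a minimal Python sketch of the two-server protocol just described, with the database held as a list of bits. The function names (`dot_mod2`, `make_queries`, `server_answer`, `retrieve_bit`) are my own, chosen for illustration, and both "servers" are simulated in one process, so this shows the arithmetic only, not a secure deployment.

```python
import secrets


def dot_mod2(x, u):
    """Dot product of two equal-length bit lists, reduced modulo 2."""
    return sum(a & b for a, b in zip(x, u)) % 2


def make_queries(n, i):
    """Alice's side: a uniformly random string u, and u' = u with bit i flipped."""
    u = [secrets.randbelow(2) for _ in range(n)]
    u_prime = u.copy()
    u_prime[i] ^= 1
    return u, u_prime


def server_answer(x, query):
    """Server's side: it sees only the query and returns a single bit."""
    return dot_mod2(x, query)


def retrieve_bit(x, i):
    """Run the protocol against two (simulated) servers that both hold x."""
    u, u_prime = make_queries(len(x), i)
    answer_1 = server_answer(x, u)        # sent to server 1
    answer_2 = server_answer(x, u_prime)  # sent to server 2
    return answer_1 ^ answer_2            # addition modulo 2


if __name__ == "__main__":
    x = [1, 0, 1, 1, 0, 0, 1, 0]
    recovered = [retrieve_bit(x, i) for i in range(len(x))]
    print(recovered == x)  # prints True
```

Each query on its own is a uniformly random string, whichever bit Alice wants; only the pair (u, u′) reveals i.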
This protocol is resource-intensive. We require each server to store the entire database; Alice must generate a string of the same length, and transmit two near-identical copies; and the servers must touch every item in the database to compute the dot products. Most research has focussed on reducing the communication cost. For example, the amount of communication required has been reduced to sub-polynomial if there are more than two servers.
Vardy’s work takes a different approach. This is based on the fact that multiple servers may use protocols which involve each server storing only part of the information. Typically it is required that full information can be reconstructed by accessing a prescribed number of servers (this may be done, for example, with a Reed–Solomon code), or that if one server fails, the information it holds can be recovered from the others, or perhaps a limited number of other servers.
His first example of using this idea for PIR showed how to use a 3-server protocol which only has a storage overhead of 2 (equivalent to using 2 servers – a 3-server protocol might be better in terms of the amount of information needing to be transmitted). This involves breaking the data into four pieces, and storing these (or linear combinations of them) on eight servers, each of which has to store 1/4 of the entire data, giving the overhead 8/4 = 2. The scheme is simply a linear code of length 8 and dimension 4, with a certain property which he called “3-server PIR”.
In general, a binary code with an s×m generator matrix G has the k-server PIR property if, for each column of G, there are k pairwise disjoint sets of coordinates such that, for each set, the sum of the columns of G with indices in that set is the given column. Such a code enables us to emulate any k-server PIR protocol with a database distributed over m servers, each storing 1/s of the original information (so with storage overhead m/s, which may be much smaller than k). Much classical coding theory (e.g. majority logic decoding) and combinatorics (Steiner systems, linear hypergraphs) appeared in the constructions he gave.
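To make the emulation concrete, here is a small Python sketch using a toy code of my own choosing, not one from the talk: s = 2, m = 3, with the columns of G being (1,0), (0,1) and (1,1). The three servers store x1, x2 and x1 + x2, so the storage overhead is 3/2, and the simple two-server protocol (a random u, and u′ with one bit flipped) is emulated on top. The names `G_COLUMNS`, `RECOVERY_SETS`, `build_servers` and `retrieve_piece_bit` are hypothetical, and I only use the recovery sets for the two unit-vector columns, which is all that retrieving a bit of x1 or x2 needs.

```python
import secrets

# Toy generator matrix, s = 2 rows and m = 3 columns (an illustrative choice).
G_COLUMNS = [(1, 0), (0, 1), (1, 1)]

# For each data piece, two disjoint sets of servers whose columns sum
# (modulo 2) to the unit vector of that piece.
RECOVERY_SETS = {
    0: [[0], [1, 2]],
    1: [[1], [0, 2]],
}


def dot_mod2(a, b):
    """Dot product of two equal-length bit lists, reduced modulo 2."""
    return sum(p & q for p, q in zip(a, b)) % 2


def recovery_sets_are_valid():
    """Check that every recovery set sums to the right unit vector."""
    for piece, sets in RECOVERY_SETS.items():
        target = tuple(int(r == piece) for r in range(2))
        for server_set in sets:
            total = tuple(
                sum(G_COLUMNS[j][r] for j in server_set) % 2 for r in range(2)
            )
            if total != target:
                return False
    return True


def build_servers(x1, x2):
    """Server j stores the combination of (x1, x2) given by column j of G."""
    return [
        [(c1 & a) ^ (c2 & b) for a, b in zip(x1, x2)]
        for (c1, c2) in G_COLUMNS
    ]


def retrieve_piece_bit(servers, piece, j):
    """Privately fetch bit j of the chosen piece via the two-server protocol."""
    n = len(servers[0])
    u = [secrets.randbelow(2) for _ in range(n)]
    u_prime = u.copy()
    u_prime[j] ^= 1
    result = 0
    # The first recovery set plays the role of server 1 (query u),
    # the second that of server 2 (query u').
    for query, server_set in zip((u, u_prime), RECOVERY_SETS[piece]):
        for srv in server_set:
            result ^= dot_mod2(servers[srv], query)
    return result


if __name__ == "__main__":
    x1, x2 = [1, 0, 1, 1], [0, 1, 1, 0]
    servers = build_servers(x1, x2)
    print([retrieve_piece_bit(servers, 0, j) for j in range(4)] == x1)
    print([retrieve_piece_bit(servers, 1, j) for j in range(4)] == x2)
```

Each server receives exactly one query, either u or u′, whichever piece is wanted, and each of these is uniformly random on its own, so (as in the two-server protocol) no single server learns anything.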
I will describe the other talks more briefly. Vijay Kumar surveyed coding for distributed storage, dealing with both regenerating codes and codes with locality. Salim El Rouayheb continued Alex Vardy’s theme by allowing the possibility that some of the servers are “spies” and may collude to attempt to get information about Alice’s choice.
Finally, Tuvi Etzion (who is working with Simon on an EPSRC-funded project) talked about connections with network coding. He was dealing with multicast coding, where one sender has a collection of messages, and a number of receivers require all of the messages. He gave us an example to show that vector networks can beat scalar networks. (For a scalar network, the messages are taken from a finite field of order q, say, and a node can form a linear combination of its inputs to send on as its output. It is known that this is possible for sufficiently large fields. In a vector network, the symbols are t-tuples over a field of order r (and, for comparison with the scalar network, we take q = r^t); a node can apply linear maps to its inputs and sum the result to produce its output. He gave an example of a network where, for a given value of q = r^t, the vector network could succeed whereas the scalar network required a field of size about r^(t^2/2).)
In the final part of his talk, he described connections between network coding and PIR, but I am afraid my shingles-affected brain was not really processing this information efficiently.