Stanford quantifies the privacy-stripping power of metadata

jrcii · on May 17, 2016

An MIT project reached the same conclusion a couple years ago http://www.independent.co.uk/life-style/gadgets-and-tech/mit...

Someone from MIT also contributed to this study in the same vein http://www.nature.com/articles/srep01376 From the abstract, "[I]n a dataset where the location of an individual is specified hourly, and with a spatial resolution equal to that given by the carrier's antennas, four spatio-temporal points are enough to uniquely identify 95% of the individuals."
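To make the "four points" claim concrete, here is a toy sketch of how that kind of uniqueness could be measured. It is not the paper's method or data: the traces are randomly generated, and the names (`make_traces`, `unicity`) and all the sizes are assumptions picked for illustration.

```python
import random

def make_traces(n_users, n_antennas, n_hours, points_per_user, seed=0):
    """Build synthetic traces: each user is a set of (hour, antenna) points."""
    rng = random.Random(seed)
    traces = []
    for _ in range(n_users):
        # Each user is observed at distinct hours, at a random antenna each time.
        hours = rng.sample(range(n_hours), points_per_user)
        traces.append({(h, rng.randrange(n_antennas)) for h in hours})
    return traces

def unicity(traces, k, seed=1):
    """Fraction of users whose k randomly chosen points match no other trace."""
    rng = random.Random(seed)
    unique = 0
    for trace in traces:
        known = set(rng.sample(sorted(trace), k))
        # Count every trace (including the user's own) containing all k points.
        matches = sum(1 for other in traces if known <= other)
        if matches == 1:
            unique += 1
    return unique / len(traces)

# Hypothetical population: 500 users, 20 antennas, two weeks of hourly slots.
traces = make_traces(n_users=500, n_antennas=20, n_hours=24 * 14, points_per_user=50)
for k in (1, 2, 4):
    print(k, unicity(traces, k))
```

Real mobility is far less random than this, so the numbers it prints mean nothing on their own. It only shows the mechanism: each extra known point shrinks the set of matching traces.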

notsnowden · on May 17, 2016

Those projects are related, but definitely different. The first is about email metadata. The second is about re-identifying cellphone location data. This paper is about telephone call and text metadata, which is what the NSA collects.

rayiner · on May 17, 2016

> The law currently treats call content and metadata separately and makes it easier for government agencies to obtain metadata, in part because it assumes that it shouldn’t be possible to infer specific sensitive details about people based on metadata alone.

This quote from the Stanford News article is incorrect. The reason metadata is carved out is that there is Supreme Court precedent carving it out: https://en.wikipedia.org/wiki/Smith_v._Maryland. And that case has nothing to do with what can or cannot be inferred from metadata.[1] It distinguishes call content from call metadata because the latter is routinely recorded and used by phone companies for various purposes:

> First, we doubt that people in general entertain any actual expectation of privacy in the numbers they dial. All telephone users realize that they must "convey" phone numbers to the telephone company, since it is through telephone company switching equipment that their calls are completed. All subscribers realize, moreover, that the phone company has facilities for making permanent records of the numbers they dial, for they see a list of their long-distance (toll) calls on their monthly bills.

[1] Because that's totally irrelevant to the 4th amendment.

tripzilch · on May 20, 2016

I know this is not how the law and precedents work in the USA, but for the sake of common sense, someone needs to call this out:

The quoted line of reasoning is absolutely disingenuous when considered in a modern and realistic setting.

What is being wilfully ignored is the change in quality of the information gleaned from the analysis of data as it is being done in bulk.

Nobody in 1979 could have foreseen the sort of information that can be extracted from the unimaginably large fire-hose of metadata we generate today.

It's a completely different thing if you had a list of who-called-who-when in 1979. How was this data kept? Well, for starters it probably wasn't centralized. Was it even digital? Probably, yes? Even if it was digital, the computers of that day could only handle trivial amounts of data. Factor in the ubiquity of phones and phone-usage today versus back then, to even consider the concept vaguely comparable is ridiculous.

It's the difference between getting records from the electricity company so that you know which parts of your house were illuminated when, versus getting "records" from all the individual CCD elements of various cameras installed in your house so that you know the same thing, which (tiny) parts of your house were illuminated when. That's the same thing right? It's just a tiny bit more fine-grained (/s).

Just because people might be okay with the former (say because you can see what lights are on from behind the curtains, on the street), doesn't mean they'd be fine with the latter.

The quality, and therefore the privacy-expectations, of the information extracted from the data changes as you blow up the number of records by some orders of magnitude. Almost nobody from 1979 could have imagined what that would mean. Almost nobody could even have fathomed the amount of computational power a desktop computer could throw at it. Hell, not even most people today can grasp that.

So there's "innocent" metadata that via some unfathomable process can be transmuted into some rather more detailed and revealing information. It's really quite hard to get a proper perspective on it (our brains aren't made for reasoning about graphs this size). I think it's fairer to call this process "magic" than anything else. And in that case it really doesn't matter where it came from. Say magic works: does it matter whether the government knows all the details of your life from divining tea leaves or from divining phone metadata?

Now add to this that, thanks to our sufficiently advanced technology, it is also possible, using similar techniques of "magic", for the phone companies to keep records for billing purposes in a really clever (magic/encrypted) way that shields the data from those kinds of divination while still letting them do their billing. The only reasonable expectation I have is for these paradigm shifts to be applied on both sides, equally.

stepvhen · on May 17, 2016

Five years or so ago I was on a jury in a case concerning sexual misconduct, and spent a good hour or so of the deliberation scanning the submitted call records, looking for gaps in correspondence between the two relevant phone numbers. It was pretty easy to identify moments when the two parties were together (that's when they stopped texting), and I checked that they matched up with all of the testimonies. They did, and it solidified a guilty verdict with my fellow jurors (sans one, on one count, but we had 7 already).

jacquesm · on May 17, 2016

That hinges strongly on the meaning of the word 'all' in your comment, if all is < 5 then that's not a very strong argument at all, if all is > 100 then it probably is (but it would still need analysis of a much larger volume of calls of pairs of phones to determine how often that situation would occur due to chance).

stepvhen · on May 17, 2016

I should mention that it wasn't the only piece of evidence we looked over. We deliberated for around 6 hours, I believe. Forgive me if I wasn't clear; I didn't intend to make it seem like I was the lone seeker of truth and made a grand argument to my peers; there was a lot discussed, by everybody. Also keep in mind that when you're on a jury, you only have so much to go on, and few resources with which to analyze. That's why the phrasing is "beyond reasonable doubt."

ifdefdebug · on May 17, 2016

I think if "only so much" is too little, and "few resources" are not enough, then you have no means to reach "beyond reasonable doubt", and you should say so.

zepto · on May 17, 2016

Do you know the probability that this correspondence happened by chance?

stepvhen · on May 17, 2016

I was convinced beyond reasonable doubt, having phone records that matched rather well with testimonies. I also might have fixated on it and ignored chance, but that's why there are 12 jurors.

jstanley · on May 17, 2016

If someone on HN ignores the statistics, what are the chances that the other jurors paid any attention to statistics?

abritinthebay · on May 17, 2016

Based on most of the posts here I'd say "someone on HN" is just as likely - if not more, due to misplaced confidence - to ignore statistics.

stonecraftwolf · on May 17, 2016

Reasonable doubt is not defined in terms of statistical confidence.

nitrogen · on May 17, 2016

Maybe it should be.

lucb1e · on May 17, 2016

Or all humans fall victim to largely the same kinds of fallacies. That's why there is no jury in all countries that I know of except America.

I'm not sure if a jury is a bad idea, when I first learned of it, I kinda liked the idea, but I'm worried about this kind of reasoning...

dragonwriter · on May 17, 2016

> That's why there is no jury in all countries that I know of except America.

The UK uses juries in much the same way as the US does, as do many other countries whose legal systems derive from that of the UK (which are a fair number, because colonialism).

piptastic · on May 17, 2016

What do other countries do?

Is the decision made by one person that has these same fallacies? At least the risk is spread out some in the jury situation.

burkaman · on May 17, 2016

Other countries definitely use juries, just not as universally as the US. They more often just have a judge or a panel of judges, or sometimes a sort of jury that includes judges.

antris · on May 17, 2016

Proper training alleviates these fallacies. I trust a judge much more than a jury of random people.

yompers888 · on May 17, 2016

On the other hand, the random jury doesn't have the same perverse incentives and web of relationships that support conviction.

r00fus · on May 17, 2016

Would you still trust that judge that was elected to office with lobbyist money?

burkaman · on May 17, 2016

I think electing judges might also be a uniquely American thing, and less than half the country does it. It's a much weirder practice than using juries. Obviously you can still say a judge was appointed by people elected with lobbyist money, but it's not that direct.

lucb1e · on May 17, 2016

Might be my lack of knowledge in the subject, but I've never heard of electing judges. Isn't it just a study and a background check or so, at least in western Europe?

burkaman · on May 17, 2016

Not only do some states have elections, but they have partisan elections, which is just obviously a dumb idea. I have no idea how it works in other countries, but in most US states judges are either elected by the legislature or appointed by the governor.

antris · on May 17, 2016

Judges are not elected in my country, so can't comment on that.

seren · on May 17, 2016

This page contains a list of countries using juries in various ways.

https://en.wikipedia.org/wiki/Jury

zepto · on May 17, 2016

Did any of the other jurors calculate the probability of the result being by chance?

danso · on May 17, 2016

I wonder if the defense attorney brought in a statistician to argue that?

tsunamifury · on May 17, 2016

What a tragedy of justice that you used an absence of evidence as a deciding factor. There is so much logically wrong with that, even in correlation.

Terr_ · on May 17, 2016

Suppose somebody's alibi is "I was driving between these two cities at the time, and then I spent the night working."

But their car-odometer didn't change, and their home-electricity usage dropped to zero that day.

Both of those observations are surely "evidence". Similarly, a digital signal still carries information, even though the zeroes are an "absence" of voltage.

tsunamifury · on May 17, 2016

That's a lot different from "you weren't talking to someone, so you must have been doing X", whereas in real life X could be anything.

woodman · on May 17, 2016

The logic isn't even close:

Odometers increment on driven cars. The odometer did not increment. The car was driven. Unsatisfied.

People having sex are together. People who are together don't text each other. During a certain hour a person didn't text 68 contacts. At this certain hour, this certain person had sex with 68 people. Satisfiable.

stepvhen · on May 17, 2016

It's more like: these two numbers are in constant contact all other hours of the day. At this hour midday, when others, including one of the relevant parties, say those two numbers are together, there is no correspondence. Later, the correspondence picks up as before.

woodman · on May 17, 2016

You seem to be aware, at least on a subconscious level, of the larger point - as you've demonstrated by including additional evidence. The larger point being that the absence of phone records can, at best, disconfirm. There are far too many alternative explanations for it to be used alone or as confirmation.

rayiner · on May 17, 2016

Remember, in court we're not talking about logical proof (eliminating every alternative but the desired conclusion), but statistical proof (eliminating alternative conclusions with 51% or 95% certainty). Given an assumed or established statistical model of the probability of related events (e.g. knowing that a secretary at a business makes a log entry 99.9% of the time when she sends out a mailing), a non-happening (say the absence of a log entry) can easily establish a conclusion to the desired level of certainty.

woodman · on May 17, 2016

Darn, I already replied to this point moments ago. There is a difference in the application of this logic in the court room where it can be challenged, and the juror room where it cannot - right?

ifdefdebug · on May 17, 2016

I see your point of high probability they might have been together. But I don't see evidence for "sexual misconduct" during the time they were together. They could have just talked.

Anyway, the whole charge is disgusting. Leave them alone dammit.

rayiner · on May 17, 2016

There is nothing logically wrong with that. Evidence is anything that makes a conclusion more or less likely (in the Bayesian sense). All else being equal, knowing of a gap in an otherwise constant stream of communication makes an in-person meet-up more likely than it is without that knowledge.
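As a rough illustration of "evidence in the Bayesian sense", here is a sketch of the update a texting gap would produce. Every number in it is invented for the example; none of them come from the case or from this thread, and the `posterior` helper is just a hypothetical name for Bayes' rule on a yes/no hypothesis.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for a binary hypothesis given one piece of evidence."""
    num = prior * p_evidence_if_true
    den = num + (1 - prior) * p_evidence_if_false
    return num / den

# Assumed values, purely illustrative:
prior = 0.30              # belief the two were together, before seeing the records
p_gap_if_together = 0.90  # chance of a texting gap if they were together
p_gap_if_apart = 0.20     # chance of the same gap if they were apart (asleep, busy, ...)

post = posterior(prior, p_gap_if_together, p_gap_if_apart)
likelihood_ratio = p_gap_if_together / p_gap_if_apart

print(f"likelihood ratio: {likelihood_ratio:.1f}")
print(f"posterior: {post:.2f}")
```

Under these made-up inputs the gap raises the probability without making it certain. How much it moves depends entirely on how often such gaps happen by chance, which is the number several people upthread were asking about.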

woodman · on May 17, 2016

Not every piece of evidence (in the Bayesian sense) is admissible in every court though, polygraphs being an obvious example. That is why courts get all pissy when jurors go off the reservation and start doing their own research. I'd be a little surprised if the judge in this case enthusiastically endorsed this sort of analysis going on in jury deliberation, but given your background - I'm interested to hear if I'm mistaken.

rayiner · on May 17, 2016

Evidence can be excluded if it's not relevant (in the Bayesian sense), or unreliable. You might have some argument that the phone records don't make the conclusion sufficiently more likely to warrant admitting it. But the bar for relevance is not high.

Moreover, this is kind of what juries do best: inferring what people did from other things they did or didn't do. While juries are subject to a lot of biases, I think humans have a good statistical model of the behaviors of other humans.

woodman · on May 17, 2016

I guess I've been mistaken in my understanding of the whole process. I was under the impression that all the controls are in place specifically to restrict the juror's natural inclination to infer, because humans really suck at logical inference (and not just because of bias). I'd always assumed that an innovation in natural language processing in combination with formal logic would be welcome in the court room, but it doesn't sound like that would be the case if empiricism is preferred to rationality.

SilasX · on May 18, 2016

This is just not true -- there are tons of highly informative, relevant kinds of Bayesian evidence that are excluded. For example, remaining silent, illegally-obtained evidence, hearsay, opinion of experts (in many cases).

wglb · on May 18, 2016

The evidence is the phone record with a string of calls. A gap between calls is as much evidence as a phone call is. A gap is not absence of evidence.

duaneb · on May 17, 2016

What would you have used as a deciding factor? Often times circumstantial evidence is all there is to work with.

tsunamifury · on May 17, 2016

A case built on circumstantial evidence is the very definition of reasonable doubt and should be tossed out! I don't understand why so many people think the purpose of court is to convict people! Your question could literally be reduced to "But how are we supposed to find them guilty if we don't have evidence?"

duaneb · on May 19, 2016

Ultimately, what is "reasonable doubt" is up to the jury.

I can't say I agree with the use of circumstantial evidence, but that's how the court works. Additionally, many people do get off based on lack of clear evidence?

dTal · on May 17, 2016

It's off-topic, but I'm deathly curious - what kind of sexual misconduct is so serious that two people consensually meeting warrants a jury trial, complete with subpoenaed phone logs?

I suppose if one of them were underage...

stepvhen · on May 17, 2016

Both parties were consenting, but one was underage at the time of the incident(s). The mother was the one who pressed charges, if I recall correctly.

Lawtonfogle · on May 17, 2016

Odd that the phone records were needed. If I thought the victim was telling the truth about being abused then the records wouldn't matter, and if I had reason to doubt them then I don't see matching phone records satisfying the doubt.

privong · on May 17, 2016

> what kind of sexual misconduct is so serious that two people consensually meeting warrants a jury trial, complete with subpoenaed phone logs?

Their meeting up may have been consensual, but what happened while they were together might not have been. Also, the parent doesn't say both parties wanted to meet up. So, the meeting-up could have not been consensual. It could have been a stalking situation or maybe they just happened to run into each other by chance.

shas3 · on May 17, 2016

I am reminded of an interesting observation by the mathematician Terence Tao about how our anonymity on the Internet and in a connected world is so fragile [1]. Basically, because there are only 3 billion internet users, every person can be uniquely identified by a 31 bit number. The uncovering of each bit gets one closer to the identity of the person. Seen in this light, one would expect metadata to uncover quite a few of the 31 bits!

[1] https://plus.google.com/+TerenceTao27/posts/8vmpA9fgRMq?iem=...
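A quick check of that arithmetic, taking the 3 billion figure from the comment at face value: log2 of 3 billion is about 31.5, so a fully unique ID technically needs 32 bits. The second half, showing how many bits a single fact reveals, uses a made-up city size and a hypothetical `bits_revealed` helper.

```python
import math

internet_users = 3_000_000_000  # figure quoted in the comment above

# Bits needed to single out one user among all of them.
bits_needed = math.log2(internet_users)
print(f"{bits_needed:.2f} bits")

def bits_revealed(population, matching):
    """Bits of identifying information gained by narrowing
    `population` candidates down to `matching` candidates."""
    return math.log2(population / matching)

# Hypothetical example: learning that someone lives in a city of 1 million.
city_bits = bits_revealed(internet_users, 1_000_000)
print(f"{city_bits:.2f} bits from knowing the city")
```

Bits from different facts only add up cleanly when the facts are independent, so treat the sum as an upper bound on how fast the candidate pool shrinks.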

UVB-76 · on May 17, 2016

Link to the actual article: http://www.pnas.org/content/early/2016/05/10/1508081113.full

jegoodwin3 · on May 17, 2016

Thanks for this. I would have been happier if the authors had phrased their findings in terms of differential privacy rather than the effectiveness of the algorithms they were able to achieve.

https://en.wikipedia.org/wiki/Differential_privacy

When setting policy, it is better to have theoretical mathematical results rather than empirical effectiveness, since you can bet the technological frontier of privacy violation is a moving one. As with cryptography, you want solid foundations in unbreakable maths -- not 'we can't break this cipher with what we know today'. Probably, someone can.
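For anyone unfamiliar with differential privacy, here is a minimal sketch of the textbook Laplace mechanism applied to a counting query. It is not something the paper does; the call records, the `noisy_count` name, and the epsilon value are all assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query with epsilon-differential privacy.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical call records: every tenth caller phoned a clinic.
rng = random.Random(42)
calls = [
    {"caller": i, "callee_type": "clinic" if i % 10 == 0 else "other"}
    for i in range(1000)
]

answer = noisy_count(calls, lambda r: r["callee_type"] == "clinic", epsilon=0.5, rng=rng)
print(f"noisy count of clinic callers: {answer:.1f}")
```

The point of the guarantee is that it holds no matter what side information or future algorithm an attacker brings, which is the "solid foundations" property the comment is asking for. Smaller epsilon means more noise and stronger privacy.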

ErikAugust · on May 17, 2016

This was particularly striking to me:

"We kill people based on metadata" - General Michael Hayden

e12e · on May 18, 2016

I'd probably strengthen that to "We kill bystanders based on metadata".

aandon · on May 17, 2016

Great Snowden interview (by Neil deGrasse Tyson) where he explains why collecting "only metadata" is no excuse: https://soundcloud.com/startalk/a-conversation-with-edward-s...

ljk · on May 17, 2016

> cross-referenced with social networking information and other public data sets, such as Yelp and Google Places

Probably would be safer not to publicize your every move too, even if it'd only slow the process down a little

arca_vorago · on May 17, 2016

The real problem with issues like this is that while I have chosen to remove myself from certain social media (I have a facebook but it's poisoned data), friends and family aren't as paranoid or knowledgeable about privacy implications as I am, so I have to remind people to not tag me, not upload photos of me, etc.

Friends and family are reporting on their friends and family without even understanding the implications of what they are doing.

It may seem benign at the moment, but given the nature of the turn-key totalitarian state, it's when that key gets turned and the cat starts getting walked back that this sort of information leakage from unexpected sources really becomes an issue.

ljk · on May 17, 2016

Good point, and facebook generates "phantom profiles" for people without an account too to make the data gathering even easier for them

arca_vorago · on May 17, 2016

It does, hence why I suggest people manage their own but spike the data... and be sure to use tor or similar to connect. (Which I only check every six months or so, just to make sure facebook doesn't try to pull any more "oh hey, we made your entire profile public" stunts again.)

dredmorbius · on May 17, 2016

How and what do you check for?

arca_vorago · on May 18, 2016

Just review privacy settings, check for tags from friends, simple stuff.

dimino · on May 17, 2016

> in part because it assumes that it shouldn’t be possible to infer specific sensitive details about people based on metadata alone.

> One of the government’s justifications for allowing law enforcement and national security agencies to access metadata without warrants is the underlying belief that it’s not sensitive information.

Is this true? I didn't think this was an actual part of the argument for using metadata, but that metadata wasn't covered under current laws, and was therefore easier to get.

I was working under the assumption that it was an unintentional oversight, not an intentional hole in legislation.

DanBC · on May 17, 2016

In England the government say (when they want to expand laws to make it easier for them to get metadata) that it doesn't contain any content, it's just about who you call or who calls you and for how long. They never say that individuals cannot be identified from this - information about identified individuals is the point of gathering the data.

So they seem clear about the difference between content and meta, and that metadata will identify people.

They're less clear about the further de-anonymisation aspects of "just" metadata, and it's hard to know if that's because they don't know or don't care.

drallison · on May 17, 2016

This paper will be read by two of the authors in the Stanford EE Computer Systems Colloquium, EE380, http://ee380.stanford.edu. EE380 is a public lecture--anyone is welcome to attend or watch the live stream video. The talk video will be posted to YouTube the day following the presentation. For details, see the announcement at http://ee380.stanford.edu/Abstracts/160518.html.

zipwitch · on May 17, 2016

There was an interesting and informative blog post along similar lines (that I think was linked here not that long ago), on, "Using Metadata to find Paul Revere".

https://kieranhealy.org/blog/archives/2013/06/09/using-metad...

lucb1e · on May 17, 2016

You can draw social graphs from who calls who, just like Facebook can? No shit. (And Facebook is scary right?)

You can find a person's city of residence in 57% of the cases? That's pretty bad, I'd almost feel oddly relieved my data is saying so little, but I'm afraid the NSA would do a better job.

You can predict who is pregnant? And who owns a rifle? Alright now we are getting somewhere. The article didn't mention how many cases succeed here so I'm not sure if I should be impressed, since the rifle hotline or a licensing agency thing (I don't know how that works) would be pretty obvious.

asdf333 · on May 17, 2016

There are studies like this from 20-30 years ago using medical data... It isn't new, but it's good that people are aware of it.

beefsack · on May 18, 2016

I feel the media has taken a useful word in the technology world and made it almost useless for general usage. I'm scared to even mention "metadata" to people even in relevant technical context as the word has become politicised and loaded, just like I can't call myself a "hacker" any more.

alexchantavy · on May 17, 2016

> In combination with independent reviews that have found bulk metadata surveillance to be an ineffective intelligence strategy, our findings should give policymakers pause when authorizing such programs.

If metadata has such power, why do they say that it is an "ineffective intelligence strategy"?

jmcgough · on May 17, 2016

I think they mean that it's not as useful in discovering and preventing unknown threats, but it's great for tracking a particular person (so it's a better tool for surveillance of citizens).