October 3, 2012

Complexity and Scams

by Steve Bellovin

All of us use gadgets—cars, phones, computers, what have you—that we don’t really understand.  We can use them, and use them very effectively, but only because a great deal of effort has been put into making them seem simple.  That’s all well and good—until suddenly you have to deal with the complexity.  Sometimes, that happens because something has broken and needs to be fixed; other times, though, scammers try to exploit the complexity.  The complaints released today by the FTC illustrate this nicely (the press release is here; it contains links to the actual filings), with lessons for both consumers and software developers.  (It turns out that programmers speak a different language than normal people do—who knew?)

It’s a long story, but I can summarize it easily enough: scammers call people claiming to be from a reputable vendor.  They trick their victims into thinking that their computers are infected, and persuade them to fork over $100 or more.

The scam starts innocently enough: people receive a call telling them that their computer “may” be infected.  (The call itself may be illegal if it’s a “robocall”—do you know about the forthcoming FTC Robocall Summit?  Even if it’s not a robocall, it’s illegal if the recipient is on the Do Not Call list.)  The caller will claim to be from a computer or computer security company. (I received such a call, well before I joined the FTC; that person claimed to be from Microsoft.  Yes, I’m on the Do Not Call list.)  The victim will be talked through some steps designed to “demonstrate” that their computer is infected, and is then given the “opportunity” to pay the caller to fix it.

Lesson 1: Be extremely skeptical if someone calls you; reputable security companies don’t “cold call” people.  If you have any doubt whatsoever about the legitimacy of the caller, call back using a number you’ve learned independently, perhaps from the company’s web site.  (This issue is broader than just this scam.  For example, if a caller claims to be from your credit card company, don’t give out any information; instead call back using the number on the back of your card.  And don’t believe Caller ID; it’s easily spoofed.  There are also lessons here for developers, but I’ll save those for another post.)

This is the most important lesson to learn: “Don’t call me; I’ll call you.”

It’s worth noting that scammers in this case did in fact use Caller ID spoofing, to make the calls appear to be coming from the U.S. rather than India.  That turns out to be remarkably easy to do.  Here’s the crucial question: when a call starts on one phone company’s network but terminates on another’s, how does the receiving company know the caller’s number?  Answer: the receiving company believes whatever it’s told, whether the information is coming from another phone company or a private branch exchange (PBX).  This worked tolerably well when there were only a few, large telcos; now, though, there are very many—and every Voice over IP (VoIP) gateway to the phone network counts as a telco or PBX.

That trust model no longer works.  There are many more telephone companies than there once were, and there are very many VoIP gateways.  If even one doesn’t check the Caller ID asserted by its customers—and there are valid technical reasons not to, in some situations; consider the case of an employee who wants to make an expensive international call via the company PBX—it’s very easy for a malefactor to claim any number he or she wishes.  (Note that using fake Caller ID “for the purposes of defrauding or otherwise causing harm” is illegal.)  One of the accused firms here claimed to be calling from Quinnipiac University or New York City; another claimed to be from Texas, etc.
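
To see why, consider how a call is actually set up.  In the Session Initiation Protocol (SIP) used by much VoIP equipment, the caller’s own gear writes whatever number and name it likes into the headers of the call-setup message, and a gateway that forwards the message unchecked passes that claim along as Caller ID.  The Python sketch below just formats such a message to show where the claim lives; the numbers and domains are invented, and a real message carries many more headers.

def call_setup(dialed_number, claimed_number, claimed_name):
    # Illustrative only: the From header is plain text chosen by the caller's
    # equipment, and nothing downstream is obliged to verify it.  A real SIP
    # INVITE carries several more headers; these are the ones that matter here.
    return (
        f"INVITE sip:{dialed_number}@gateway.example.net SIP/2.0\n"
        f'From: "{claimed_name}" <sip:{claimed_number}@pbx.example.org>;tag=1928\n'
        f"To: <sip:{dialed_number}@gateway.example.net>\n"
    )

# The recipient's phone company sees whatever the From header asserts.
print(call_setup("12025550123", "18005550199", "Friendly Tech Support"))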

In this particular scam, the victim is asked to run a program called the “Event Viewer”.  Most computer systems log various things that take place, including mildly anomalous conditions; Event Viewer is the way to display such logs on Windows.  The information is often quite cryptic, but invaluable to support personnel.  Cryptic?  Yes, cryptic, as you can see below.

The point is that you’re not expected to understand it; it’s information for a technician if you need help.

The consumer is then directed to look for “Warnings”.  That sounds scary, right?  Your computer is warning you about something.  Lesson 2: Programmers use words differently.  On most computer systems, warnings are less serious than errors; you generally don’t need to do anything about a warning.  Contrast “Warning, your disk is 90% full” with “Error: no space left on disk.”  That isn’t normal usage (to the Weather Service, a storm warning is more serious than a storm watch, which is why programmers get confused when they listen to weather reports…), which gave the scammers one more thing to exploit.
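
To see how programmers use the two words, here is a tiny Python sketch; the messages are invented, but the ranking is typical of most logging systems, Event Viewer included: a warning sits below an error.

import logging

logging.basicConfig(format="%(levelname)s: %(message)s", level=logging.WARNING)

# A warning notes something worth keeping an eye on; nothing has failed yet.
logging.warning("disk is 90% full")

# An error reports an operation that actually failed.
logging.error("no space left on disk; could not save file")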

Next, of course, it’s time to scare the consumer—“Jesus, did you say warning?”—followed by completely bogus cautions to avoid clicking on the warnings.  (What happens if you do click on one of those messages?  Nothing bad; you just get a new window with more information, and a URL to click on to get even more details.  That’s what the screen shot shows.)

[Screen shot: what happens if you click on a warning or error]

It’s also worth realizing that even the errors logged are mostly irrelevant and harmless.  That isn’t always the case, but more or less any machine will experience many transient or otherwise meaningless failures, perhaps induced by things like momentary connectivity outages.

There’s also a lot of technical doubletalk, presumably intended to impress the victim with the caller’s expertise.  Most of this is pure nonsense, such as (in one call from an FTC investigator) learning that “DNS” is “dynamic network set-up”.  The DNS, of course, is really the “Domain Name System”, the Internet mechanism that translates things like www.ftc.gov into a set of numbers that the underlying hardware really understands.  My favorite was hearing that “the Javascript in your computer has been fully corrupted”.  Javascript is indeed a programming language, but its primary use is creating dynamic web pages.  It’s not normally “on” your computer in any permanent sense; rather, Javascript programs are downloaded to your  web browser when you visit most commercial web sites.  (These programs are run in what is called a “sandbox”, which in theory means that they can’t affect anything on your computer.)
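
If you’re curious, you can watch the real DNS at work from a Python prompt; this just asks your computer’s resolver for the numeric address behind a name (the exact address you get back will vary):

import socket

# Look up the numeric (IPv4) address that the underlying network uses
# for a human-readable name.  The answer depends on where and when you ask.
print(socket.gethostbyname("www.ftc.gov"))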

Lesson 3: Just because someone can spout technical terms doesn’t mean they’re knowledgeable or legitimate.  Of course, asking them to explain what they’re saying doesn’t prove much; they can respond with more glib doubletalk.  A legitimate support tech can probably explain things somewhat more simply; however, while a lack of technical detail might be a good reason for suspicion, its presence says very little.

The victim is then told to download and run a program from the scammer’s web site.  That’s bad, too—you should never run a program from an unknown source—but of course by this time the victim does trust the caller.  And it’s really dangerous: once you run someone else’s code, it could be game over for your computer; it’s really, really hard to disinfect a machine thoroughly.  The same applies to your credit card number: once you give it out, you could be charged far more, and far more often, than the one-time payment you agreed to.

Where does this leave us?  Like most con artists, the callers here are trying to gain your trust before ripping you off.  The best thing is to cut them off at the start.  Remember Lesson 1—“Don’t call me; I’ll call you.” —and use a number that you’ve looked up on your own.

September 19, 2012

Password Compromises

by Steve Bellovin

There are many problems with passwords; Ed Felten explained the issues very well a few months ago.  One big problem is that they can be stolen: the bad guys can somehow get hold of your password and use it themselves.  But how can this happen?  It turns out that there are many different ways, each of which requires a different protective strategy.

The first risky spot is your own computer.  There’s a class of malware — malicious software — known as keystroke loggers, programs that copy everything you type.  Such programs can record what websites you log into, your userid, and your password.  (How do these programs know it’s a login screen?  Apart from seeing words like “password” before a text box, how do you know when you’re entering one?  Easy – characters in passwords aren’t echoed as you type them.  The malware notices the same thing.)  Keystroke loggers are one of the most common threats.

The best defense here is the obvious: run a clean computer.  Keep current on patches, run up-to-date antivirus software, don’t install software from untrusted sources, and don’t visit dubious web sites that might infect your computer.

A second way passwords are stolen is during transmission: an attacker can eavesdrop on the login sequence.  Encryption is the obvious countermeasure, and it works very well.  Most web sites are very good about using encryption for login pages, though a distressing number of mail servers don’t.  Even without encryption, though, eavesdropping on passwords isn’t that big a threat for home users – it’s difficult for the attacker to get to the right place, and the payoff is low because of the ubiquity of encryption.  Think of it as “herd immunity” for computers: even if you don’t use encryption, you’re protected by all of the other folk who do.  All that said, you should use encryption whenever possible, especially on your home wireless network; there are many other threats that encryption can avert.  WiFi in a public place – a coffee shop, a hotel, an airport, etc. – is a different matter; never send an unencrypted password from such a network!

Passwords can also be stolen from servers: web sites, mail systems, and so on.  This is a high-payoff attack for the bad guys; they can steal millions of passwords with a single attack.  The defensive onus here is on the server operator (and it’s hard for users to tell if site operators are doing it properly); among other things, sites should “hash” – mathematically irreversibly scramble – user passwords, generally after incorporating a “salt”.  (For a good discussion of hashes, see Ed Felten’s post.  I’ll write more on salting some other time.)  This means that password recovery – being able to send you your old password, as opposed to you or the site creating a new one – is a dangerous ability, since it implies that the passwords haven’t been hashed.
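
For the developers in the audience, here is a minimal sketch of what hashing with a salt looks like, using Python’s standard library and a slow, salted hash (PBKDF2); the function names and the iteration count are illustrative choices, not a complete password-storage recipe.

import hashlib, hmac, os

def hash_password(password, salt=None):
    # A fresh random salt per user; the site stores (salt, digest), never the password.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def check_password(password, salt, stored_digest):
    # Recompute with the stored salt and compare in constant time.
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

A site that stores only the salt and the digest can check your password when you log in, but it cannot send your old password back to you; that is exactly why password recovery is a warning sign.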

What about everyone’s least favorite aspect of passwords, their “strength”?  Where does strength come in?  We’ve known for more than 30 years that many people pick easy-to-guess passwords, things like “123456” or “password”.  An attacker with a suitable list of likely guesses (and such lists can include multiple languages, names of movie or book characters, etc.) can try each guess in turn.  This may be done online – repeatedly trying to log in as you – or it may be an offline attack against a hashed password file stolen from a hacked server.  (Salting makes guessing in offline attacks much more expensive for the attacker, which is why it’s a good idea.)  Strong passwords help defeat guessing, but they do nothing to protect you against keystroke loggers or phishing sites.  Against online attacks, the best defense is to limit the number of guesses an attacker can make, or at least limit the rate of guessing.  Think of it this way: if the password dictionary has 1,000,000 entries (and that’s actually a small dictionary) and guesses are limited to one per second, it will take more than 11 days to try every possibility.  If the rate limit changes to one every 10 seconds after the first five failures, it will take several months.  The risk, of course, is that the legitimate account owner can be locked out by an attacker’s attempts, but there are more sophisticated variants that can minimize that problem.  Rate-limiting doesn’t help against offline attacks, but there’s another defense: the attacker first has to break into the server.
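
The arithmetic is easy to check for yourself; the figures below are the ones from this paragraph.

dictionary_size = 1_000_000                     # a small guessing dictionary

# One guess per second:
print(dictionary_size / 86_400)                 # about 11.6 days

# One guess every 10 seconds after the first five failures:
throttled_seconds = 5 + (dictionary_size - 5) * 10
print(throttled_seconds / 86_400)               # about 116 days, i.e., several months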

What, then, should a user do?  The single most important defense is to avoid reusing passwords.  That way, if a site is compromised you only have to change your password in one place, not several dozen.  No one can remember many different strong passwords; you have to record them somewhere.  A piece of paper may be ideal – no hacker is going to reach into your wallet, Hollywood movies notwithstanding – but you do need to keep it safe.  Safe from whom?  Protect it from anyone who might want to get access to your accounts; that can include disgruntled coworkers, family members, and so on.  You may want to leave a few very high value passwords, such as online banking credentials, off the list; those can and should be memorized.  Consider using a password manager program; these will encrypt the list, and perhaps provide easy synchronization across different computers and mobile devices.  (The encryption password you choose should of course be very, very strong.)  Remember that you need to guard the password list against loss, too; this means that the file should be backed up or the piece of paper copied.  Don’t use an ordinary, unencrypted file on your computer, however obscurely named; similarly, don’t store passwords in an email account.

Finally, for sites that offer you the option, opt for two-factor authentication (often using something like a text message to your phone), especially for high-value accounts.  Two-factor authentication can be somewhat inconvenient to use, but it offers a very large increase in security, and largely nullifies the problem of password theft.

September 7, 2012

Howdy!

by Steve Bellovin

I’m delighted to succeed Ed Felten as Chief Technologist of the Federal Trade Commission. He’s a hard act to follow! But what does the FTC do, and what is the role of a technologist?

The FTC polices the online marketplace. While that often involves addressing complex issues, one essential requirement is that companies must keep the promises they make to consumers. If an organization’s privacy policy says that it won’t sell your personal information but it does, that’s deception under the FTC Act. Similarly, if it promises to “keep your personal information secure” but doesn’t follow industry-standard practices, that, too, can constitute deception. In such cases, the FTC can act.

Consumers have a role, too. How do you read a privacy policy? How can you tell if a web site is safe enough? Education is a big part of the FTC’s job as well.

I’ll be using this blog to discuss all of these issues. Going forward, expect substantive posts on these topics and more.

August 9, 2012

FTC Settles with Google over Cookie Control Override

by Ed Felten

[Updated (4:35pm EDT, August 9, 2012): Added to the description of the HTML file quoted in this post, to say when I recorded it.]

Today the FTC announced a settlement with Google, in which the company agreed to pay $22.5 million to settle charges that it misled consumers about its use of tracking cookies on the Safari browser.  The Complaint and Order, which were approved by the Commission, are the official statement of the FTC’s position on the case.  In this post I’ll explain some of the technical background in more detail–speaking just for myself.

Google’s DoubleClick ad network uses tracking cookies to record a history of a user’s activities across different web sites.   A DoubleClick tracking cookie looks like this:

id: c5bffdc4700000c||t=1343680985|et=730|cs=002213fd484b7cb9af91248086

Google also uses cookies to offer an opt-out.  If a consumer clicks the opt-out button, Google creates an opt-out cookie, which clobbers any tracking cookie that was in place before.  The opt-out cookie looks like this:

id: OPT_OUT

If you have the opt-out cookie, Google won’t place a tracking cookie on your computer.   On most browsers this all works as described.

But Apple’s Safari browser–the default browser on Macs, iPhones, and iPads–puts more stringent limits on how sites can use cookies.  In its default setting (“Block cookies: From third parties and advertisers”) Safari blocks most cookies coming from third parties.    Users can change this setting, but very few do change it, so from here on, let’s assume that Safari is in its default configuration.

Safari allows a site to deposit a cookie onto your computer whenever at least one of the following things is true:

  1. you are visiting the site directly–that is, it is the “first party” site whose URL appears in the browser’s address bar, or
  2. the site already has a cookie present in your browser, or
  3. the site is responding to a form that you submitted.

One consequence of this design is that Google’s opt-out cookie mechanism doesn’t work for Safari users–Google’s attempt to deliver the opt-out cookie will fail because none of the three conditions hold.
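
[For geeks: the rule described above can be sketched roughly as follows.  The function and argument names are mine, not Apple’s, and the real browser logic (circa 2012, default settings) is of course more involved.]

def safari_accepts_cookie(setting_site, first_party_site,
                          sites_with_existing_cookies, form_submitted_to):
    # Condition 1: the cookie comes from the site you are visiting directly.
    if setting_site == first_party_site:
        return True
    # Condition 2: the site already has a cookie in this browser.
    if setting_site in sites_with_existing_cookies:
        return True
    # Condition 3: the site is responding to a form the user submitted.
    if setting_site == form_submitted_to:
        return True
    return False

# DoubleClick as a third party on a hypothetical publisher site,
# with no prior cookie and no form submitted:
print(safari_accepts_cookie("doubleclick.net", "news.example.com", set(), None))  # False

The question, then, is how DoubleClick tracking cookies got set despite this check; the two ways are described below.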

The FTC alleged that Google told Safari users that they didn’t need to worry about the unavailability of opt-out, because Safari’s cookie controls would provide the same protection as the opt-out.

Unfortunately, according to the FTC, this promise wasn’t kept.  Google ended up placing tracking cookies in many Safari users’ browsers despite its promise to give those users the same treatment as opted-out users.

Google placed the tracking cookies in two different ways.

First, if you went to the doubleclick.net website, perhaps by typing in the URL but more likely by clicking an ad placed by DoubleClick, then you would be given a DoubleClick tracking cookie.  Safari allowed this because it treated DoubleClick as playing a first-party role in this interaction–but no cookie would have been given to an opted-out user of another browser.

(An important detail here: Though people sometimes talk about “first-party cookies” versus “third-party cookies,” there is nothing about the cookie itself that is marked as either first-party or third-party.   Instead, first-party and third-party are roles that a site can play in a particular interaction–in the same way that “home team” is not a permanent attribute of a sports team but merely a role that the team might occupy in today’s game.    When somebody says “first-party cookie,” you should read that as “cookie associated with a site that is playing a first-party role at the moment.” )

The second way that Safari users got DoubleClick tracking cookies was more complicated–and this is the one that has gotten the most attention.   This part of the story starts with Google wanting to put a “social advertising” cookie onto users’ computers.  “Social advertising” is a feature that lets you click a “+1” button on an ad you like–and then shows the same ad to your friends with an indication that you liked it.   If implemented in a straightforward way, this wouldn’t work on Safari because Safari would block the placement of Google’s social advertising cookie.

So Google overrode Safari’s cookie controls.   They sent Safari a file that looked like this:

<html>
<head></head>
<body> 
    <form id="drt_form" method=post action="/pagead/drt/si?p=XXX&ut=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX">
    </form> 
    <script> 
        document.getElementById('drt_form').submit(); 
    </script> 
</body> 
</html>

I recorded this file in mid-December, 2011.  The line that starts with “document….” is Javascript code that told the browser that the user had submitted a form–even though the user had done no such thing.   (The “form” was invisible and had neither content nor a Submit button,  so the user could not actually submit it.)   Safari, believing that the user had chosen to submit a form, would then allow Google to put a DoubleClick cookie on the user’s computer.   This was allowed under condition number 3 above.

Once the first cookie was in place, Safari would then–according to condition number 2 above–allow Google to deliver additional cookies from doubleclick.net, including the DoubleClick tracking cookie.   So the end result of Google’s form submission was to put DoubleClick tracking cookies on Safari users’ browsers, despite Google’s alleged promise not to do so.

If you use Safari, you probably received a DoubleClick tracking cookie from Google during the relevant time period.  As part of the settlement, Google agreed to destroy as many as possible of the DoubleClick tracking cookies placed on Safari users’ computers during the relevant period.   To its credit, Google started destroying those cookies early, without waiting for the settlement to be finalized, so virtually all of the relevant cookies should be gone by now.

[Note:  I modified the HTML snippet above, putting 'X' characters in place of parts of the URL in the form tag.   I am not disclosing any of the exact URLs that we saw in our experiments, as a precaution against the possibility that they might reveal something about our investigative procedure.]

July 23, 2012

Reasoning about information: an example

by Ed Felten

One of the reasons it’s hard to think carefully about privacy is that privacy is fundamentally about information, and our (uneducated) intuition about information is often unreliable.

As a teacher, I have tried different approaches to helping students get over this barrier.  It’s not too hard to teach the theory, so that students learn how to manipulate logical formulas to answer contrived story problems about information and inference.  What is more difficult is augmenting the formal theory with a more accurate intuition that is useful outside the classroom.

One trick I find useful for building privacy intuition is to abstract away from the formality of logic and the complexities of human relationships, and consider how information behaves in another setting: simple algebra.

[To math-phobic readers: hang on--this won't hurt a bit.  I'll use the simplest possible examples, and I promise I won't ask you to solve any equations.]

Suppose we’re interested in knowing the value of X.  We start out with no knowledge about X, so X could have any value, large or small, positive or negative.   Now we learn a fact:

X – Y = 2

We have learned a fact about X, but that fact doesn’t help us narrow down what the value of X might be–it’s still the case that X could take on absolutely any value.

Some time later, we learn another fact:

X + 2Y – Z = 5

That’s another fact about X, but we still can’t narrow down what the value of X might be–it’s still the case that X could take on absolutely any value.  Your algebra teacher would say that we can’t find a solution because we have fewer equations (two) than unknowns (three: X, Y, and Z).

So at this point, we know nothing about X, right?   Or is it better to say that we know two things about X, even though our uncertainty about the value of X has not been reduced at all?   Information is odd that way.

The next day, we learn yet another fact:

Z – Y = 1

This new fact is obviously not about X.  It doesn’t mention X at all–it’s just a fact about the relationship between Y and Z.  How could that possibly tell us anything about the value of X?

But as it turns out, this last fact is the key to unlocking the value of X.   Given the three facts we now know, we can dust off our algebra skills and solve the three equations in three unknowns, to learn that X=4, Y=2, and Z=3.

The key to unlocking the value of X, as it turned out, was a fact (Z-Y=1) that wasn’t even about X.   Or maybe it was a fact about X, despite not mentioning X at all.  Information is odd that way.
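
[For readers who want to check the arithmetic: the three facts form a small system of linear equations, and any solver will confirm the answer.  Here it is with numpy; the code is just a check, not part of the argument.]

import numpy as np

# The three facts, written as a linear system A x = b with x = (X, Y, Z):
A = np.array([[1, -1,  0],    # X - Y      = 2
              [1,  2, -1],    # X + 2Y - Z = 5
              [0, -1,  1]])   # Z - Y      = 1
b = np.array([2, 5, 1])

print(np.linalg.solve(A, b))  # [4. 2. 3.]  ->  X = 4, Y = 2, Z = 3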

This example also helps to illustrate how easy it is to make mistakes when reasoning about information.   For example, suppose we create a concept of X-identifying information (XII), and we say that a fact is XII if and only if that fact allows someone who learns it to determine the value of X.   So the fact “X = 6” is XII, but the fact “U + V = 7” is not XII.

Now we might try to use XII to reason about our example.  We could look at each of the three facts in isolation, and argue that they are all non-XII, because each of them in isolation does not reveal the value of X.   We might then try to argue that in revealing the three facts, we never revealed any XII, and therefore there is no reason to worry that the value of X might have been revealed.

Of course, such an argument would be incorrect, because the three facts did in fact reveal the value of X, when taken together.  To put it another way, if somebody tells us that “no XII was revealed” that statement by itself does not imply anything about whether X was revealed.

Information is odd that way.

[Extra-credit homework assignment: Devise an "XII removal" method that can take any fact that is XII, and transform it into an equivalent set of facts that (considered individually) are non-XII.]

July 10, 2012

Controlling Robocalls

by Ed Felten

Today the FTC is announcing new initiatives to address robocalls: those annoying automated sales calls from businesses you have never heard of.

We get over 200,000 consumer complaints about robocalls every month.   A great many of these calls violate rules governing telemarketing, which the FTC enforces.  These rules generally prohibit prerecorded sales calls to consumers, unless the telemarketer has obtained permission in writing from the consumer.  (There are some exceptions, such as purely informational calls letting you know that your flight is delayed, your shipment will be delivered tomorrow, etc.)

I encourage you to check out the FTC’s robocall page, at ftc.gov/robocalls, and watch for information about the robocall summit we’ll be convening on October 18 in Washington, DC.

In this post, I want to talk about some technical issues connected to robocalls.

The growth in robocalls has been enabled by technological changes that have drastically decreased the cost of making phone calls.  The same technologies that let us talk to people around the world for almost no cost have also, unfortunately, opened the door to exploitation by bad actors.   In the old days, when phone calls were expensive, most businesses would think twice before shelling out the money to make a call to you–many would only call you if they had some reason to believe you wanted to do business with them.  But now, with their cost per call down around one cent or less, they can afford to call and call and call–assuming they’re willing to break the law and risk FTC enforcement.

Robocall companies use technical tricks to lower their costs even more.  For example, some experts believe that the robo-voices that you hear on their calls are chosen, in part, because they are especially compressible–they can be transmitted at low data rates and still sound good.   This allows the robocallers to cram more simultaneous calls through the same Internet connection.

Another interesting tech question is how to catch the robocallers and their confederates.   They have long since figured out how to evade or misuse Caller ID, a system that was never really designed to provide any kind of strong proof of the caller’s identity.   Caller ID works well when the participants in routing a call are cooperating nicely, but it relies on callers or their technological proxies to send accurate identifying information–an assumption that no longer holds now that the phone system is open to connections from almost anybody rather than being run by a few well-established companies.   Again, the thriving, diverse ecosystem of companies providing phone services is a good thing on the whole, having unleashed innovation and lowered prices, but it does have a dark side.  The good news is that there are things we can do to track down robocallers by using a combination of technical and legal methods.

But more needs to be done.   That’s why we will be calling on the technology community to work on innovative approaches to attack the robocall problem.   Can you help consumers protect themselves?   Can you help law enforcement identify robocallers more quickly?  Can you think of some other way to frustrate robocallers?   Stay tuned for details about our technology challenge.

July 3, 2012

Privacy by Design: Frequency Capping

by Ed Felten

One of the principles of Privacy by Design, as advocated in the FTC Privacy Report, is that when you design a business process, it’s a best practice to think carefully about how to minimize the information you collect, retain, and use in that process.  Often, you can implement the feature you want, with a smaller privacy footprint, if you think carefully about your design alternatives.

As an example, let’s look at frequency capping in the online ad market.  Advertisers want to limit the number of times a particular user sees ads from a particular ad campaign.  This is called a “frequency cap” for the campaign.  The more times a user sees an ad, the less likely it is that one more viewing will get them to buy, and the more likely that they’ll find the repeated ad annoying.

One way to implement frequency caps is to use third-party tracking.  The ad network assigns each user a unique userID (a pseudonym), stored in a cookie on the user’s computer, and the ad network records which userIDs saw which ads on which sites.  The ad network uses these records to keep a count of how many times each userID has seen each ad, and to avoid repeating ads too many times.   This approach works, but it gathers a lot of data–full tracking of user activities across all sites served by the ad network.

There are at least two ways to do frequency capping without gathering so much data.

The first way is to move information storage to the client (i.e., the user’s computer).  The idea is to keep a count of how many ads the user has seen from each campaign, and store those counts on the client’s computer rather than on the ad network’s computers.   A blog post by Jonathan Mayer and Arvind Narayanan gives more details.  The main advantage of this approach is that, because the information is stored on the user’s own computer, the user can always delete the information if they’re concerned about the privacy implications.   The main drawback is that the ad network would have to re-engineer how they choose which ads to place, because ad placement decisions are normally made on the ad network’s servers but the frequency information will now be stored elsewhere.

The second way to do frequency capping with less information collection is to store information on the ad network’s server, but to think carefully about how to minimize what is stored and how to reduce its linkability back to the user.  In this approach, the user still gets a unique pseudonym, stored in a cookie, but the ad network does not store a complete record of what the user did online.  Instead, the ad network just keeps a count of how many times each pseudonym has seen ads from each campaign.

For example, if you see an ad for the new Monster Mega Pizza, the ad network will remember that you (i.e., your pseudonym) have seen that ad–but it won’t remember which site you were reading when you saw that ad.  And for ads that aren’t frequency-capped, it won’t store anything at all.  Of course, the data about you seeing the Monster Mega Pizza ad campaign can be deleted once that campaign is over.

In practice, an ad network might want to collect and retain more information, in order to make other uses of that information later.   But users will probably want the ad network to be straightforward about what it is doing, and to admit that it is collecting more information than it needs for frequency capping, because it wants to make other uses of the data.

[Bonus content for geeks: The ad network can use crypto to store information with even better privacy properties. Rather than using the pseudonymous userID as a key for storing and retrieving the frequency counts, the ad network can hash the userID together with the advertiser's campaignID and use the resulting value as the storage key.  Then (assuming the userID is neither recorded nor guessable) the ad network won't be able to determine whether the person who saw the Monster Mega Pizza ad also saw some other ad from a different campaign.   This is easy to do and provides some extra protection for the user's privacy, while still allowing frequency capping.]
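
Here is roughly what that hashing trick might look like in code; the names and the choice of SHA-256 are illustrative, and a real system might also mix in a secret key so that stored keys can’t be recomputed from a guessed userID.

import hashlib
from collections import Counter

def frequency_key(user_pseudonym, campaign_id):
    # The stored key, on its own, reveals neither the pseudonym nor the campaign.
    return hashlib.sha256(f"{user_pseudonym}|{campaign_id}".encode()).hexdigest()

counts = Counter()

def should_show(user_pseudonym, campaign_id, cap):
    key = frequency_key(user_pseudonym, campaign_id)
    if counts[key] >= cap:
        return False
    counts[key] += 1          # record one more impression for this (user, campaign)
    return True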

[Thanks to participants in the W3C Tracking Protection Working Group for suggesting the second approach, including the hashing trick.]

[Extra-credit homework for serious geeks: How can you use Bloom Filters to store this information more efficiently?  Assume it's acceptable to refuse to show an ad to a user even though that user hasn't yet hit the cap for that ad, as long as the probability that this happens is small.]

June 21, 2012

Protecting privacy by adding noise

by Ed Felten

When I wrote previously about differential privacy, a mathematical framework that allows rigorous reasoning about privacy preservation, I promised to work through an example to show how the theory works.    Here goes.

Suppose that Alice has access to a detailed database about everyone in the United States.  Bob wants to do some statistical analysis, to get aggregate statistics about the population.  But Alice wants to make sure Bob can’t infer anything about an individual.  Rather than giving Bob the raw data—which would surely undermine privacy—Alice will let Bob send her queries about the data, and Alice will answer Bob’s queries.  The trick is to make sure that Alice’s answers don’t leak private information.

Recall from a previous post that even if Bob only asks for aggregate data, the result still might not be safe.   But differential privacy gives Alice a way to answer the queries in a way that is provably safe.

The key idea is for Alice to add random “noise” to the results of queries.  If Bob asks how many Americans are left-handed, Alice will compute the true result, then add a little bit of random noise to the true result, to get the altered result that she will return to Bob.    Differential privacy tells Alice exactly how much noise she needs to add, and in exactly which form, to get the necessary privacy guarantees.
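
[For geeks: for counting queries, the standard recipe is the so-called Laplace mechanism, sketched below in Python.  The sketch assumes a query whose answer any one person can change by at most 1; nothing later in this post depends on these details.]

import numpy as np

def noisy_count(true_count, epsilon):
    # Laplace mechanism for a counting query: one person changes the count by
    # at most 1, so noise with scale 1/epsilon gives epsilon-differential privacy.
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# With epsilon around 0.01, the typical added error is on the order of
# 1/epsilon = 100 people, no matter how large the true count is.
print(noisy_count(true_count=30_000_000, epsilon=0.01))   # hypothetical true count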

The point is that if Bob attempts to extract facts about you, as an individual, from the answers he gets, the noise that Alice added will wash out the effect of your individual data.  Like movie spies who turn on music to drown out the sound of a whispered conversation—and thereby frustrate listening devices—the noise in Alice’s answers drowns out the contribution of any one person’s data.

For the method to succeed, the noise must be just loud enough to drown out an individual’s records, but not so loud that Bob loses the ability to detect trends in the population.  There is a rich technical literature on how to do this and when it tends to work well.

One interesting aspect of this approach is that it treats errors in data as a good thing.  This might seem at first to be in tension with traditional privacy principles, which generally treat error as something to be avoided. For example, the Fair Information Practices include a principle of data accuracy, and a right to correct inaccurate data.   But on further investigation, these two approaches to error turn out not to be in contradiction.   One of the main points of the differential privacy approach is that Bob, who receives the erroneous information, knows that it is erroneous, and knows roughly how much error there is.  So Bob knows that there is no point in trying to rely on this information to make decisions about individuals.   By contrast, errors are problematic in traditional privacy settings because the data recipient doesn’t know much about the distribution of errors and is likely to assume that data are more accurate than they really are.

So how much noise will Alice need to add to Bob’s query about the number of left-handed Americans?   If we assume that a 1% level of differential privacy is required—meaning that Bob can get no more than a 1% advantage over random guessing if we challenge him to guess whether or not your data is included in the data set— then the typical size of the error will have to be about 100.    The error might be bigger or smaller in a particular case, but the magnitude of the added error will on average be about 100 people.   Compared to the number of left-handed people in America, that is a small error.

If Bob wants to ask a lot of questions, then the error in each response will have to be bigger—but the good news is that Bob can ask any question of the form “How many Americans are …” and this same mechanism can provide a provable level of privacy protection.

Differential privacy can handle an ever-growing set of situations.  One thing it can’t provide, though—because no privacy mechanism can—is a free lunch.  In order to get privacy, you have to trade away some utility.  The good news is that if you do things right, you might be able to get a lot of privacy without requiring Bob to give up much utility.

May 29, 2012

The Problem with Passwords

by Ed Felten

We use passwords all the time.  Sometimes they’re called “PINs” or “access codes” or “lock combinations,” but they amount to the same thing: a sequence of symbols that must be provided in order to get access to something.  Passwords have one big advantage: ease of use.  But this comes with several disadvantages.

  • People have to use passwords in many places–I have passwords on more than 100 web sites–and studies show that most people have a few passwords that they re-use across different sites. So an adversary who gets access to one password can potentially access many accounts.
  • Passwords are subject to replay attacks: an adversary who sees your password once, perhaps by looking over your shoulder as you type it or eavesdropping on the network as your password goes by, can replay the password later.  (More secure approaches use a cryptographic trick called a zero-knowledge proof, in which you can prove that you know a secret value without revealing the secret to an eavesdropper.)
  • People have a hard time picking good passwords.  A good password is supposed to be easy for you to remember but very, very difficult for an adversary to guess.   The best password is a truly random string, but those are too hard to remember, so we tend to build patterns into our passwords, which make them easier to guess.  And brute force password-guessing gets easier every year because computers get faster.  (Some rough numbers appear just after this list.)
  • People tend to forget their passwords, and the resulting password-recovery or -reset procedures can be trouble-prone.  (During the 2008 presidential campaign, Sarah Palin’s email account was compromised via the password recovery mechanism.)
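
To make the third point concrete, here is some rough arithmetic; the “pattern” is just an example of the kind of password people actually pick, and the figures are illustrative.

import math

# A truly random 8-character password drawn from the 94 printable ASCII symbols:
random_space = 94 ** 8
# A typical patterned choice: one of ~100,000 common words plus two digits:
patterned_space = 100_000 * 100

print(f"{random_space:.1e} vs {patterned_space:.1e}")        # ~6.1e15 vs 1.0e7
print(math.log2(random_space), math.log2(patterned_space))   # ~52 bits vs ~23 bits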

The drawbacks of passwords have been evident for a long time, and security experts have been looking for something better.  There are plenty of more secure alternatives, but they have had trouble getting adopted, partly because passwords are familiar and easy to use, and partly because competing technologies have failed to get critical mass.

A recent study compared passwords against “two decades of proposals to replace text passwords,” grading each system on twenty-five factors, and found that although many of the alternatives beat passwords on some factors, every one loses to passwords on other factors.   In other words, there is no alternative out there that beats passwords hands down.  The best system will depend on your circumstances.

Despite the lack (so far) of a great leap forward, we are seeing more modest innovations start to get traction.   One example is two-factor authentication, which augments your password with another layer of checking.  It is now supported by companies such as Google (which calls it 2-Step Verification) and Facebook (which calls it Login Approvals).   These systems notice when you log in from a computer or device that you haven’t used lately, and they respond by requiring you to enter a secret code you get from your mobile phone.  The code comes either from a special app or from a text message that the company sends you.  Other two-factor systems rely on a little fob that displays ever-changing numbers, or on biometrics such as a fingerprint.  I recommend using two-factor authentication where it is available.

Last year the White House released its National Strategy for Trusted Identities in Cyberspace (NSTIC, pronounced “EN-stick”), which described what a better online authentication system would look like, and laid out a strategy for government to facilitate the creation by industry of such a system.   Now NIST is working to execute that strategy.   I’m hoping that industry, with appropriate encouragement from government, will step up and keep improving authentication practices.

Until that happens, we’ll have to keep muddling through with passwords.

[Bonus password-related trivia question:  Fill in the blank in this line from the Marx Brothers movie Horse Feathers, spoken by a character called Baravelli (played by Chico): "Hey, what's-a matter, you no understand English? You can't come in here unless you say  _____.  Now I'll give you one more guess."  This one-word password has been used in many books and movies.]

May 21, 2012

Is aggregate data always private?

by Ed Felten

I have been writing recently about data and privacy.   Today I want to continue by talking about aggregate data.   A common intuition is that aggregate data–information averaged or summed over a large population–is inherently free of privacy implications.   As we’ll see, that isn’t always right.

Suppose there is a database about all FTC employees, and you’re allowed to query it to get the total salary of all FTC employees as of any particular date.   So you ask about the date January 4, 2011, and you get back a number that includes the salaries of the roughly 1100 FTC employees as of that date.   The result seems to be privacy-safe, because it is aggregated over so many people.

Next, you ask another aggregate query: What was the total salary of all FTC employees on January 5, 2011?   Again, you get a result aggregated over 1100-ish employees–aggregate data, which might seem safe.

But if you subtract the two aggregate values, what you get is the difference in total salary between January 4 and January 5 of 2011.    Assuming that the only change in the employee roster in that one-day period is that I joined the FTC, the result will be equal to my salary, which is personal information about me.

What happened here is that subtraction caused the salaries of almost all employees to cancel out, leaving information about only one employee (me).   Doing simple math on aggregate values can give you a non-aggregate result.
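
For concreteness, here is the whole calculation; every number below is invented.

# Two aggregate answers, each summed over roughly 1100 employees:
total_salary_jan_4 = 152_300_000      # hypothetical total on January 4, 2011
total_salary_jan_5 = 152_465_000      # hypothetical total on January 5, 2011

# If the only roster change that day was one new hire, the difference is that
# one person's salary: an individual fact derived from two aggregate answers.
print(total_salary_jan_5 - total_salary_jan_4)    # 165000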

You might think that you can solve this problem by watching for a sequence of queries that are too closely related and having the system refuse to answer the last one.  But that turns out not to be feasible.   A clever analyst can find ways to ask three, or four, or any number of queries that combine to cause trouble; and queries can be related in too many subtle and complicated ways. It turns out that there is no feasible procedure for deciding whether a sequence of aggregate queries allows inferences about an individual.

This is not meant to say that aggregate data is always dangerous, or that it is never safe to release aggregated data.   Indeed, aggregated data is released safely all the time.  What I am saying is more modest: the simple argument that “it’s aggregate data, therefore safe to release” is not by itself sufficient.

There are lots of examples of aggregate data turning out not to be safe.   One example comes from my own research (done before I joined the FTC).    Joe Calandrino, Ann Kilzer, Arvind Narayanan, Vitaly Shmatikov, and I published a paper titled “You Might Also Like: Privacy Risks of Collaborative Filtering” in which we showed that collaborative filtering systems, which recommend items based on the past activities of a population of users, can sometimes leak information about the activities of individual users.   If a system tells you that people who watch the TV show “Alf” also watch “Dallas,” this fact is aggregate information–essentially a correlation that is calculated across the entire user population.  But given enough of this aggregate data, over time, it can become possible (depending on the details of the system) to infer what individual users have purchased and watched.  Our paper gave examples where we made individual inferences using data from real systems.  In other words, aggregate data can be used to infer individual private information–sometimes.

Nowadays, many collaborative filtering systems have safeguards built in that try to address exactly this kind of inference.   They make updates to the recommendations less frequent and less predictable; they show less precise information about correlations; they suppress items that have data from relatively few users; or they add random noise to the rankings.   Sometimes they let users opt out from having their data used in these calculations.  Done right, these kinds of precautions can protect privacy while maintaining the system’s usefulness.

Are there general techniques that can make aggregate data provably safe to release?   It turns out that there are, at least in some cases.   I’ll give an example in a future post.
