Can Amazon Serve Customers and Protect Privacy at the Same Time? (What Data Should DBAs Keep and Discard)

One of the biggest challenges for DBAs and programmers is deciding when to keep information and when to discard it. Keeping good records is both a natural, protective instinct and a requirement for many businesses. Websites and businesses also need detailed tracking information to offer the most customization and assistance.

But this information is also turning into a big responsibility and a potential liability. Customers expect their privacy to be protected, and everyone is well aware that the right information can create opportunities for crime when it falls into the wrong hands. Identity fraud is a frightening new crime.

When is it the right time to keep data and when should it be destroyed completely? Sometimes a DBA can take both paths. They can destroy the information while retaining the ability to use it.

I recently completed a case study of Amazon’s computer system in order to test some of the work I’ve been doing on my book Translucent Databases. Amazon is well-known for providing some of the best service by keeping detailed records on their customers. I wanted to see how many of these services I could provide without keeping detailed personalized records.

As far as I can tell, all of Amazon’s personalized service falls into these basic classes:

  • Special Choices: If someone bought a book by John Grisham last year, Amazon displays a picture of his new book when they return to the website.
  • Faster Checkout: There’s no need to type in your shipping address and credit card information each trip.
  • Spam Notices: Anyone who bought a book by John Grisham may receive an email notice telling them that a new book by the author has become available.

The first two of these services can be offered without keeping the names, email addresses, or other personal information on the server. The key is relying upon a special cryptographically secure hash function like SHA to scramble the names and passwords for the customers. Instead of storing their identities, Amazon could keep inscrutable numbers that act like surrogates. When a person logged in, Amazon would compute this hash function and identify them. (See the case study for examples and working JavaScript code.)
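The surrogate idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the case study: SHA-256 stands in for "SHA", and the names `surrogate_id`, `known_customers`, and `login` are hypothetical. A real deployment would probably prefer a slow, salted key-derivation function, because a plain hash of a weak password is easy to guess by brute force.

```python
import hashlib

def surrogate_id(name: str, password: str) -> str:
    """Return a hex digest that stands in for the customer's identity."""
    h = hashlib.sha256()
    # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
    for field in (name, password):
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()

# The server stores only digests, never the name or the password.
known_customers = {surrogate_id("Alice Example", "correct horse")}

def login(name: str, password: str) -> bool:
    # Recompute the digest from what the visitor typed and look it up.
    return surrogate_id(name, password) in known_customers
```

Calling `login("Alice Example", "correct horse")` finds the stored digest, while any other name or password produces an unrelated number that matches nothing.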

This trick can be used on all levels of the database. The choices people make, the books they investigate, the size of the pants they buy, and every other tidbit in Amazon’s database can be linked to this inscrutable number. When someone logs in, Amazon can customize the site by unlocking all of the old data related to old visits, but when they leave, the linkage can be thrown away until they return.
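Here is a rough Python sketch of keying activity records to the surrogate rather than to the person. The in-memory `activity` table and the helpers `record` and `history_for_session` are hypothetical names chosen for illustration, and the simplified SHA-256 surrogate is an assumption.

```python
import hashlib
from collections import defaultdict

def surrogate_id(name: str, password: str) -> str:
    # Simplified stand-in: hash of the credentials joined by a NUL byte.
    return hashlib.sha256(f"{name}\0{password}".encode("utf-8")).hexdigest()

# Hypothetical activity table: surrogate -> list of (event, item) rows.
activity = defaultdict(list)

def record(surrogate: str, event: str, item: str) -> None:
    """Append one row of browsing or purchase history under a surrogate."""
    activity[surrogate].append((event, item))

def history_for_session(name: str, password: str) -> list:
    """Return the rows for whoever just logged in.

    The link from person to rows exists only while this call runs;
    nothing stored in `activity` names the customer.
    """
    return list(activity.get(surrogate_id(name, password), []))
```

Once the session ends, the server keeps only the rows and the digest they hang from, so the history can be found again at the next login but not traced back to a name in between.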

The personal information, like the shipping address or the credit card number, could also be encrypted with this name and password combination. The website would discard the decrypted data when the customer left, ensuring that only the customer could unlock it. Any errant clerks or malicious hackers wouldn’t be able to steal addresses or credit card numbers.
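The following Python sketch shows the shape of encrypting a field under a key derived from the name and password. It rests on several assumptions: PBKDF2 as the key derivation, hypothetical helper names (`derive_key`, `seal`, `unseal`), and a toy keystream built from HMAC-SHA256, used only because Python's standard library has no block cipher. It provides no integrity check, so a real system would use a vetted authenticated cipher instead.

```python
import hashlib
import hmac
import os

def derive_key(name: str, password: str, salt: bytes) -> bytes:
    # PBKDF2 stretches the credentials into a 32-byte key.
    secret = f"{name}\0{password}".encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", secret, salt, 100_000)

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Toy keystream (illustration only): HMAC over nonce plus a block counter.
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = nonce + counter.to_bytes(8, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])

def seal(name: str, password: str, plaintext: str) -> dict:
    """Encrypt a field; the returned row is all the server stores."""
    salt, nonce = os.urandom(16), os.urandom(16)
    key = derive_key(name, password, salt)
    data = plaintext.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    cipher = bytes(a ^ b for a, b in zip(data, stream))
    return {"salt": salt, "nonce": nonce, "cipher": cipher}

def unseal(name: str, password: str, row: dict) -> str:
    """Decrypt a stored row using the credentials typed at login."""
    key = derive_key(name, password, row["salt"])
    stream = _keystream(key, row["nonce"], len(row["cipher"]))
    return bytes(a ^ b for a, b in zip(row["cipher"], stream)).decode("utf-8")
```

The stored row holds a salt, a nonce, and ciphertext. The key is never stored, so it can be rebuilt only when the customer supplies the name and password again.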

These techniques do not hurt the marketing department. They can still investigate who buys what things in which combination. They can cross-correlate, take inventory, make projections, and data mine without knowing the name of the person behind this surrogate number.
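As a small illustration of that kind of analysis, the Python sketch below counts which items are bought together using nothing but surrogate numbers. The `purchases` rows, the shortened surrogate values, and the function name `co_purchase_counts` are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase rows: (surrogate id, item). No names appear anywhere.
purchases = [
    ("9f2c", "The Firm"), ("9f2c", "The Pelican Brief"),
    ("41ab", "The Firm"), ("41ab", "The Pelican Brief"),
    ("77d0", "The Firm"),
]

def co_purchase_counts(rows):
    """Count how often each pair of items was bought by the same surrogate."""
    baskets = {}
    for surrogate, item in rows:
        baskets.setdefault(surrogate, set()).add(item)
    pairs = Counter()
    for items in baskets.values():
        # Sort so each pair is always counted under the same ordering.
        pairs.update(combinations(sorted(items), 2))
    return pairs
```

The grouping key is the surrogate, so the query works the same way it would with a customer ID column, except that nobody running it learns who the customers are.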

These techniques can also help speed up databases, although often by only a small amount. Cryptographically secure hash functions distribute values across an entire range, ensuring that the indices are well-balanced and as fast as possible.
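The even-spread property can be checked with a short Python experiment. This is only a sketch: it assumes SHA-256, 16 buckets, and made-up keys of the form `customer0`, `customer1`, and so on, and it does not measure the speed of any real database index.

```python
import hashlib
from collections import Counter

def bucket(key: str, buckets: int = 16) -> int:
    """Map a key to a bucket using the first 8 bytes of its SHA-256 digest."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def bucket_counts(keys, buckets: int = 16) -> Counter:
    """Tally how many keys land in each bucket."""
    return Counter(bucket(k, buckets) for k in keys)

# Sequential, nearly identical keys still spread out roughly evenly.
counts = bucket_counts(f"customer{i}" for i in range(16_000))
```

Each of the 16 buckets should end up with close to 1,000 keys even though the inputs differ only in a trailing number.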

Unfortunately, I could not find a way to offer the spam-like notifications that Amazon sends out about new books. While this is a service that I enjoy, I’m sure there are others who are so inundated by spam that they hate all unsolicited messages. The spammers have poisoned email so much now that shutting down this service probably isn’t a big loss.

Peter Wayner is the author of 12 books including Translucent Databases, an exploration of many different techniques for building databases that do useful work without holding any useful information.

His latest book, Java RAMBO Manifesto, investigates how and when you can speed up your application by throwing away your database.

Published with the express written permission of the author. Copyright 20

