Thanks to corporate monitoring of our email transactions and browsing habits, navigating the Web is becoming more and more convenient, with personalized ads and recommendations popping up everywhere we go. This convenience, however, comes at a price – with no oversight of data mining firmly in place, information gleaned from our activities online can be used against us in any number of ways.
To remedy the situation, a team of researchers led by Roxana Geambasu from Columbia Engineering and the Data Science Institute has developed a second-generation tool called Sunlight. It builds on its predecessor, XRay, which is capable of linking ads shown to Gmail users with text in their emails, and recommendations on Amazon and YouTube with their shopping and viewing patterns.
Whereas XRay and other similar tools traced specific ads, product recommendations and prices to specific inputs like location, search terms and gender one by one, Sunlight is the first platform to analyse numerous inputs and outputs simultaneously, and form hypotheses that are tested on a separate dataset carved out from the original.
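The two-stage idea described above, forming hypotheses on one part of the data and testing them on a held-out part, can be illustrated with a small Python sketch. This is not Sunlight's actual code: the synthetic profiles, the names `make_synthetic_profiles`, `generate_hypotheses` and `audit`, the "largest lift" selection rule, and the choice of a one-sided Fisher exact test with Holm correction are all assumptions made for illustration.

```python
import math
import random


def counts(profiles, inp, ad):
    """2x2 table: (input & ad, input & no ad, no input & ad, neither)."""
    a = b = c = d = 0
    for inputs, ads in profiles:
        has_in, has_ad = inp in inputs, ad in ads
        if has_in and has_ad:
            a += 1
        elif has_in:
            b += 1
        elif has_ad:
            c += 1
        else:
            d += 1
    return a, b, c, d


def fisher_one_sided(a, b, c, d):
    """Hypergeometric tail: chance of seeing the ad at least `a` times among
    profiles that have the input, if input and ad were unrelated."""
    n_with, n_ad, n = a + b, a + c, a + b + c + d
    tail = sum(
        math.comb(n_ad, k) * math.comb(n - n_ad, n_with - k)
        for k in range(a, min(n_with, n_ad) + 1)
    )
    return tail / math.comb(n, n_with)


def generate_hypotheses(train, inputs, ads):
    """Stage 1 (training split only): for each ad, propose the input with
    the largest difference in ad rate. A stand-in for a real model."""
    def lift(inp, ad):
        a, b, c, d = counts(train, inp, ad)
        return a / max(a + b, 1) - c / max(c + d, 1)
    return {ad: max(inputs, key=lambda inp: lift(inp, ad)) for ad in ads}


def holm(pvalues):
    """Holm-Bonferroni adjusted p-values, to account for testing many ads."""
    ordered = sorted(pvalues.items(), key=lambda kv: kv[1])
    m, running, adjusted = len(ordered), 0.0, {}
    for i, (key, p) in enumerate(ordered):
        running = max(running, min(1.0, (m - i) * p))
        adjusted[key] = running
    return adjusted


def audit(profiles, inputs, ads, alpha=0.05, seed=0):
    """Split the data, hypothesize on one half, test on the other half."""
    shuffled = profiles[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    hyps = generate_hypotheses(train, inputs, ads)
    # Stage 2: each hypothesis is checked only on data it has never seen.
    pvals = {ad: fisher_one_sided(*counts(test, inp, ad))
             for ad, inp in hyps.items()}
    adj = holm(pvals)
    return {ad: (hyps[ad], adj[ad]) for ad in ads if adj[ad] <= alpha}


def make_synthetic_profiles(n=400, seed=1):
    """Hypothetical data: 'loan_ad' is planted to follow the input 'debt';
    'generic_ad' is shown independently of every input."""
    rng = random.Random(seed)
    inputs = ["debt", "travel", "fitness", "cooking"]
    profiles = []
    for _ in range(n):
        present = {i for i in inputs if rng.random() < 0.5}
        shown = set()
        if rng.random() < (0.8 if "debt" in present else 0.05):
            shown.add("loan_ad")
        if rng.random() < 0.3:
            shown.add("generic_ad")
        profiles.append((present, shown))
    return inputs, ["loan_ad", "generic_ad"], profiles


if __name__ == "__main__":
    inputs, ads, profiles = make_synthetic_profiles()
    for ad, (inp, p) in audit(profiles, inputs, ads).items():
        print(f"{ad}: appears targeted on '{inp}' (adjusted p = {p:.2g})")
```

Keeping the test split separate means a pattern that appears by chance in the training half has to show up again in unseen data before it is reported, and the correction step limits false alarms when many ads are checked at once.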
“We’re trying to strike a balance between statistical confidence and scale so that we can start to see what’s happening across the Web as a whole,” said co-author Daniel Hsu.
Over the span of a month last fall, Geambasu’s team set up 119 Gmail accounts and sent 300 messages with sensitive words in the subject line and body of the email. About 15 percent of the ads that followed seemed to be targeted, with some apparently violating Google’s policy of not targeting ads based “on race, religion, sexual orientation, health or sensitive financial categories”.
The researchers also created fake browsing profiles and surfed the 40 most popular sites on the Web, finding that only 5 percent of the ads appeared to be targeted. Some of these, however, seemed to violate Google’s advertising ban on products and services facilitating drug use. The algorithms also picked up on the political leanings of popular news sites, pitching related ads to users.
Ominous as this may sound, the violations are not necessarily deliberate, Geambasu noted: the flow of data on the Web has become so complex that companies themselves are not always aware of how targeting works.
The tool is designed mostly for regulators, consumer watchdogs and journalists, allowing them to explore how personal data is used, and decide where closer inspection might be warranted.
“Sunlight is distinctive in that it can examine multiple types of inputs simultaneously (e.g., gender, age, browsing activity) to develop hypotheses about which of these inputs impact certain outputs (e.g., ads on Gmail),” said Anupam Datta, a researcher at Carnegie Mellon who led the development of AdFisher, a tool similar to Sunlight, and was not involved in the current study. “This tool takes us closer to the critical goal of discovering personal data use effects at scale.”
The tool, along with a related study, will be presented on October 14 in Denver, at the annual security conference of the Association for Computing Machinery.