Commercial Online Health Data Research: Weighing Privacy Concerns against Potential for Medical Insights

Bridget M. Kuehn

Search engines are one of the first places many Americans turn when looking for health information, according to a 2013 survey by the Pew Research Center. But what they may not know is that the data from these searches is collected by the search engine and is increasingly being used for health research and public health surveillance.

The data has enormous potential to help researchers better understand pressing public health issues and perhaps even to identify individuals at risk of developing serious disease. But this emerging venue for health research also poses new questions about what constitutes consent for research use of online health information and what role corporations, which own the data, should play in the process.

“Innovation is crucial in our world, and these approaches that have shown promise should be pursued if we develop appropriate methods to ensure the benefit to society and individual patients,” said Mauricio Santillana, PhD, a member of the faculty at Boston Children’s Hospital and an associate at Harvard’s Institute for Applied Computational Science.

Emerging field

Epidemiologists have been at the leading edge of using search data, often combined with data from electronic health records or social media sites, Santillana said. His group at Harvard University has partnered with Google to use its data for tracking and forecasting epidemics of the flu and other infectious diseases.

While initial attempts to develop a Google search–based flu-tracking system were stymied, methods have improved substantially since then (Yang S, et al. Proc Natl Acad Sci USA 2015; 112:14473–14478). Now, Santillana and his colleagues can produce very accurate outbreak estimates in real time and accurate forecasts of flu trends about 1 to 2 weeks ahead.

“The field has evolved quite a bit,” Santillana said. “We basically show data from Google searches may be noisy and may not be straightforward to interpret, but by developing robust methods we can minimize the effect of the noise and produce accurate forecasts.”

Other types of research being explored are more longitudinal and focus on individuals. For example, researchers from Microsoft recently showed that Bing search data might be useful to identify individuals with symptoms of pancreatic cancer even before diagnosis (Paparrizos J, et al. J Oncol Pract. pii: JOPR010504 [Published online June 7, 2016]). Such early identification might help improve patient outcomes, because many patients with pancreatic cancer receive diagnoses too late to be treated effectively, wrote lead author John Paparrizos, MSc, a computer scientist at Columbia University, and his colleagues from Microsoft.

“The results highlight the promise of using Web search logs as a new direction for screening for pancreatic carcinoma,” the authors wrote.

Privacy and oversight

But concerns have been raised about protecting individuals’ privacy and the oversight of online health data research.

“People have a very different sense of privacy around their medical data,” said Elizabeth Buchanan, PhD, an ethicist at the University of Wisconsin-Stout in Menomonie, Wisconsin.

They may also have different expectations for privacy depending on whether they are posting health information on a social media site or whether they are conducting a search, Buchanan said. Most companies’ terms-of-use policies outline that user data will be logged and possibly used for research or other purposes, including commercial ones, Buchanan said.

“We should be aware that third party apps are collecting, repackaging, and repurposing our data whether it is posted in a public space or if it is something we consider more private like a search query,” she said.

It’s important to be aware of how this data might be used in ways that are beneficial, for example, for disease surveillance or for patient outreach, while also understanding the ways that composite online health data might be used to identify an individual or even used to harm them, Buchanan said.

“The promise of personalized medicine and predictive analytics is that it can help,” she said. “But we want to be careful of the larger more dangerous uses of these kinds of data.”

Santillana said his group protects the privacy of searchers’ health information by using aggregate data and trying to ensure that individuals can’t be re-identified through the data.

“We do our best to maintain the anonymity of the population we are trying to help,” he said.

Other potentially promising uses that track an individual’s search behavior may trigger greater public concerns about privacy, Santillana said. For example, what if insurance companies got access to the information and used it to refuse to sell the person insurance?

“If a patient is identified as likely to get a diagnosis based on search history, who gets the information?” Santillana said. “The benefit could be great, but the implementation is not clear.”

He emphasized that he supports industry-academic partnerships provided the goals are very clearly outlined.

“I’m a big supporter of innovation and a big supporter of partnering with industry with the understanding that the goal is to improve social good and patient-centered care,” he said.

There are also questions about oversight of search-data–based research. In traditional biomedical studies, academic scientists and medical professionals at universities must get approval from institutional review boards (IRBs) (Vayena E, et al. Am J Public Health 2012; 102:2225–2230). Corporations may have less strict review processes, said Buchanan.

For population-level or aggregate data research, such as that done by Santillana’s group, IRB approval is not required even at academic institutions.

Government oversight of such research is limited. The Department of Health and Human Services’ Office for Human Research Protections (OHRP) has published non-binding recommendations about online health data research, which Buchanan co-authored. The recommendations call on researchers to be sensitive to the unique privacy and security concerns associated with online health data.

The US Department of Health and Human Services National Coordinator for Health Information Technology has issued recommendations for “Big Data” health research that call for more transparency about the computer algorithms used to collect and analyze health data online. The recommendations also call for policies to protect online health data that would fall outside of the Health Insurance Portability and Accountability Act (HIPAA).

The European Union (EU) has been ahead of the curve in regulating the use of search data and ensuring that the public is informed, Santillana said. For example, on a recent trip to England he searched for information about fevers on Google and immediately received a notification that his information could be used for research purposes and was given the option of saying yes or no to that use of his data.

“The EU based on their history has become very aware of the harmful potential of having a single entity control information that is sensitive,” Santillana said.

While debate continues about the regulation and uses of personal data in the US, Santillana said, “People should be informed.”

Buchanan agreed. “It comes back to our communal sense of data and social media literacy,” she said. “All of us need to understand what is happening behind the scenes. We need to be aware of the trails of data we are creating and how they are being used.”