Leveraging Open Data with a National Open Computing Strategy

November 19, 2020

Lara Mangravite, John Wilbanks

The Challenge: Private Cloud Computing Hampers Open Data Efforts

Open data mandates and investments in public data resources, such as the Human Genome Project or the U.S. National Oceanic and Atmospheric Administration Data Discovery Portal, have provided essential data sets at a scale not possible without government support. By responsibly sharing data for wide reuse, federal policy can spur innovation inside the academy and in citizen science communities. These approaches are enabled by private-sector advances in cloud computing services, and the government has benefited from innovation in this domain. However, the use of commercial products to manage the storage of and access to public data resources poses several challenges.

First, too many cloud computing systems fail to properly secure data against breaches,¹ improperly share copies of data with other vendors,² or use data to add to their own secretive and proprietary models.³ As a result, the public does not trust technology companies to responsibly manage public data, particularly the private data of individual citizens. These fears are exacerbated by the market power of the major cloud computing providers, which may limit the ability of individuals or institutions to negotiate appropriate terms. This distrust reduces the willingness of U.S. citizens to have their personal information included in these databases.

Second, open data solutions are springing up across multiple sectors without coordination. The federal government is funding a series of independent programs that are working to solve the same problem, leading to a costly duplication of effort across programs.

Third, and most importantly, the high costs of data storage, transfer, and analysis preclude many academics, scientists, and researchers from taking advantage of governmental open data resources. Cloud computing has radically lowered the costs of high-performance computing, but it is still not free. The cost of building the wrong model at the wrong time can quickly run into tens of thousands of dollars.

Scarce resources mean that many academic data scientists are unable or unwilling to spend their limited funds to reuse data in exploratory analyses outside their narrow projects. And citizen scientists must use personal funds, which are especially scarce in communities traditionally underrepresented in research. The vast majority of public data made available through existing open science policy is therefore left unused, either as reference material or as “foreground” for new hypotheses and discoveries.⁴

The Solution: Public Cloud Computing

It is necessary to extend existing commitments to open science by ensuring that cloud computing on open scientific data is as safe and inexpensive as possible. This commitment should be made not just to academics, but to those with the lived experience represented in the data. The federal government can do this in the short term by negotiating on behalf of citizens for cloud computing, resulting in a deal that is inexpensive because of scale, and protective of individual privacy by contractual default. And it can accomplish this in the long term by creating a market competitor in cloud computing that operates on a “utility” business model that protects privacy and is optimized for U.S. scientific research.

By providing $1 billion in short-term vouchers to U.S. data scientists and investing another $1 billion to construct and operate a competitive public cloud computing platform, the federal government can make computing resources available to all users who meet a minimum threshold of qualifications and agree to a social contract of open science ethics (for example, agreeing to respect restrictions on the use of personally identifiable data). Doing so would immediately increase the number of shots on goal against challenges related to biology and climate change. And by leveraging its negotiating power, the federal government can protect federal resources and the privacy of citizens whose data are analyzed.

There is precedent for this proposal. The National Institutes of Health has recognized the potential of subsidized computing power to accelerate the use of data in the All of Us Research Program and the National COVID Cohort Collaboratory. In each of these large federal open data projects, deeply personal data about genetics and health are held in secure cloud repositories where users can visit the databanks, execute queries, upload their own data, and run exploratory analytics. They cannot, however, download the data, which preserves the privacy of those represented and makes oversight of data users more tractable.

But there is not yet a uniform policy or strategy to pair open data resources with low-cost, publicly available, privacy-protecting open data usage. Over the long term, the voucher model could address concerns associated with private cloud computing services by creating a public competitor that integrates privacy and security at a high level.

This proposal would also support equity and inclusion. Many researchers from communities underrepresented in data science are hamstrung by resource constraints that do not apply to wealthy, white communities. By easing resource constraints, the federal government can cultivate a generation of data scientists within those communities, empowered to explore questions and issues that are relevant to their own contexts and experiences.

Further, the proposal contributes to job creation. Public cloud vouchers will make it cheap and easy for entrepreneurs and community organizations to make data science a part of normal operations. These services will need to be staffed, representing an opportunity to cultivate jobs in data curation, cybersecurity, data analysis, and other areas, including in communities underserved by the knowledge economy. These jobs could be virtual and thus open to rural and urban communities across the country.

Conclusion

The benefits of a national open data computing strategy extend beyond getting processors humming on open data. Open data is desirable because it benefits individual citizens and the country as a whole. A $2 billion investment would immediately turbocharge the use of open data to solve challenges related to cancer, the coronavirus pandemic, social determinants of health, climate change, agriculture, and many other essential areas for resilience and innovation. This is the moment to accelerate U.S. data science.

Lara Mangravite is the president of Sage Bionetworks.

John Wilbanks is the chief commons officer at Sage Bionetworks.

¹ Somayeh Sobati Moghadam and Amjad Fayoumi, “Toward Securing Cloud-Based Data Analytics: A Discussion on Current Solutions and Open Issues,” IEEE Access, vol. 7, 2019.

² DJ Pangburn, “Despite the Controversy Plenty of Smaller Tech Startups Work with ICE,” Fast Company, October 4, 2019.

³ Mark Harris, “How Peter Thiel’s Secretive Data Company Pushed Into Policing,” Wired, August 9, 2017.

⁴ Christine Borgman and Irene V. Pasquetto, “How and Why Do Scientists Reuse Others’ Data to Produce New Knowledge?,” Cochrane Colloquium: Fringe Event, Edinburgh, September 15, 2018.