Keeping research data secure: Common mistakes and how to avoid them

5 minute read

Written by: Susan Glick - Privacy & Security Specialist

Landing squarely at the feet of the Hutch Integrated Data and Archives (HIDRA) project, I soon understood the textured value of data. Data generation, capture, curation and computation are both the vessel and lifeblood of research. Data generation and curation are driven by the confluence of improved next generation sequencing capabilities, electronic health records (EHR), and open science. Analysis rests on the strength of computational abilities, including scripting and compute power. In essence, it is the ability to master the dynamic convergence and ramification of multiple data streams that pushes information towards medical discovery.

However, in the shadows of rapid data exploration and robust computational pipelines, safeguarding sensitive data quietly stands central. Salient privacy and security challenges are not fully addressed by regulatory agencies (e.g., de-identification of genomic data or knitting such information into a patient’s medical record), leaving individual institutions at the local front of security and privacy guidance. Researchers often try to navigate these muddy waters with little experience or few resources on how to be ethical, compliant and secure with their data.

Being ethical, compliant and secure with data starts at the beginning, with a data management plan (DMP). Many examples can be found at the National Institutes of Health (NIH). A primary key to a good plan is to understand just how sensitive your data is. Is the data generated through research, or extracted from an EHR or direct care provider? Is the research likely to be shared within a multisite effort or collaboration? DMP components such as password management, data gatekeeping processes, and account authorization policies can be scaled to match the information’s sensitivity. Yet once you have solidified your plan and set the expectation that it will be used, ongoing awareness all too often fades into the background. And still, incidents and accidents do happen.

Common incidents and how to avoid them

Perhaps reviewing a list of what has actually happened will provide some insight into how to avoid incident creep. Because, honestly, incidents tend to happen where DMPs don’t reach.

Oversharing via GitHub

GitHub, GitLab or Git“X” is a love-meh relationship as far as security is concerned. A favorite for versioning and sharing code and data analysis, Git“X” doesn’t care whether your code contains sensitive information. For that reason, sensitive information and data must be carefully handled in a multiple-party, open, sharing environment. Scripting account credentials into a Git“X” repository is an all too common occurrence. In one instance, within a few minutes of finding the credentials, a trolling bot/program accessed a Fred Hutch account and used Fred Hutch resources to mine for cryptocurrency. Over $1,000 of compute charges accrued to Fred Hutch, but luckily no sensitive data was breached. Being aware of security issues is key to solving them, as there are often tools or methods available to keep your data secure. For example, git-secrets from AWS Labs specifically addresses the problem of accidentally sharing passwords or credentials via GitHub.
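
To make the idea concrete, here is a minimal Python sketch of the kind of check a secret scanner performs before code is shared. It is not how git-secrets itself is implemented; the two patterns, the function name find_secrets and the sample text are assumptions chosen for illustration, and a real scanner covers far more credential formats.

```python
import re

# Assumed patterns, for illustration only; real scanners cover many more formats.
SECRET_PATTERNS = {
    # AWS access key IDs: a known prefix followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    # A quoted string assigned to something called "password" (case-insensitive).
    "hardcoded_password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
}


def find_secrets(text):
    """Return (line_number, pattern_name) for every suspicious line in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, name))
    return hits


# Hypothetical file contents; the key is the placeholder used in AWS documentation.
sample = 'region = "us-west-2"\nkey = "AKIAIOSFODNN7EXAMPLE"\npassword = "hunter2"\n'
print(find_secrets(sample))
# [(2, 'aws_access_key_id'), (3, 'hardcoded_password')]
```

A check like this is most useful when it runs automatically before every commit, so that a credential is caught before it ever reaches a shared repository.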

Losing track of vulnerabilities and weak spots

Misconfiguring a server, cloud database, storage container, or search engine is one of the most common errors resulting in security incidents. Vulnerabilities that allow access may be exploited on machines that lack sufficient endpoint security, run older operating systems or are not patched. Leaky AWS S3 buckets are a common soft spot. An example: a database specific to a protocol within REDCap lingered while database administration responsibilities shifted between individuals. After various revisions, a misconfiguration allowed Google to index the database information, opening the data to the public through Google search. The breach made it possible to re-identify participant-donors. Database management, including configuration requirements, should be well documented, persistent and accessible through the life of the project to prevent such incidents.
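
As one illustration of documenting and checking configuration, the sketch below inspects an S3-style bucket policy (the JSON document, already parsed into a Python dictionary) and flags statements that grant access to everyone. The function name public_statements, the bucket name and the sample policy are hypothetical, and the check is deliberately simplified: it ignores Condition blocks and every other way a bucket can become public, so treat it as a starting point rather than an audit.

```python
def public_statements(policy):
    """Return the Allow statements in a bucket policy that apply to everyone."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        aws = principal.get("AWS") if isinstance(principal, dict) else None
        # "*" as the principal means "anyone", including anonymous users.
        if principal == "*" or aws == "*" or (isinstance(aws, list) and "*" in aws):
            flagged.append(statement)
    return flagged


# Hypothetical policy for a hypothetical bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LabReadWrite",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/lab-analyst"},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-study-bucket/*",
        },
        {
            "Sid": "OopsPublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-study-bucket/*",
        },
    ],
}

for statement in public_statements(policy):
    print("Public access granted by statement:", statement["Sid"])
```

Running a check like this whenever the configuration changes, and keeping the result with the project documentation, helps when administration responsibilities move from one person to another.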

Going rogue

Even the best data management plan does not work if actions are taken that are not covered in the plan. Oftentimes these are “one-off activities”, or something that a researcher did not account for because it only needed to be done “just this once”. The “one-off activity” breaks not only the routine, but also the compliance mindset. The “one-off” is the time to be most vigilant. Two incidents related to a “one-off” help explain this. First, a research project was moving data via Dropbox to support a research-related, one-time event. The person using Dropbox was a more peripheral member of the research team. The Dropbox account was not secure, and recipients included individuals not authorized to access the research data. An unauthorized individual recognized a family member as a research subject. Second, a collaboration using medical record information built a proof-of-concept visualization of the data for a presentation; the application used to develop the visualization was misconfigured to be open to the public. The medical records of 11 individuals were public for over nine months. A comprehensive data management plan should prevent the need for “one-offs”.

Mitigating security risks


Training the expert is important. Many researchers learn computational methods or scripting on the side, using a brew of Python or R and Excel to move and transform data. This largely leaves the information security side out of their computational education. Conversely, labs will hire expert software developers from private industry, but are slow to educate them on the compliance and security regulations specific to healthcare and research. Fred Hutch is slowly moving to improve training offerings; the revised security and confidentiality training is an excellent start.

Understanding the risks of de-identified data

De-identification is more than just removing a participant’s name. In a Dropbox incident, a recipient was able to use age, census tract of home address, date of encounter, race, and location of enrollment to identify a research subject. This incident illustrates how participants can be identified even without their name or specific demographic information. Re-identification is an especially important consideration for small subgroupings of participants, such as those in small minority groups or with very rare diseases. Learn more about de-identification at HDC Data Requests, Data Compliance, Data Security.
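
One common way to reason about this risk is to count how many records share each combination of indirect identifiers (often called quasi-identifiers) and to flag combinations shared by fewer than some threshold k of participants. The Python sketch below is a minimal version of that idea; the function name small_groups, the field names, the records and the threshold are all made up for illustration, and passing such a check does not by itself make a dataset de-identified.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k):
    """Return each quasi-identifier combination shared by fewer than k records."""
    counts = Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical records with names already removed.
records = [
    {"age": 34, "census_tract": "A", "race": "X"},
    {"age": 34, "census_tract": "A", "race": "X"},
    {"age": 71, "census_tract": "B", "race": "Y"},
]

print(small_groups(records, ["age", "census_tract", "race"], k=2))
# {(71, 'B', 'Y'): 1}
```

In this made-up example the third participant is the only one with that combination of age, census tract and race, so anyone who knows those three facts could single them out. Typical remedies are to coarsen the fields (age ranges instead of exact ages, larger geographic areas) or to withhold the smallest groups.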

Utilizing your institution’s resources

The open source alternative may be more development-intensive than it appears. If you adopt or adapt open source products for use with sensitive data, the application must adhere to your institution’s Information Security Policy and Standards in areas such as audit and identity management. As a Fred Hutch employee, you can find these policies and standards by searching ‘ISO policy and standards’ in Centernet.

With great data comes great responsibility

Whether research is computational or experimental, research ethics, compliance and security morph in parallel with the evolving technology that powers analysis and computation. Datasets are bigger, so a single breach carries more risk. Emerging methods and technologies to de-identify human data are unfamiliar to IRBs, creating delays. Open science means matching the sensitivity of your information to the governance structure of the selected third-party repository. Data transport between collaborators is all too often not considered a pressure point. The research lab’s data pipeline, with its multiple linkages, needs audit capability. These may not be traditional factors of privacy and security, but they may become so as the discipline becomes more data-driven.