Data Anonymization in Qualitative and Mixed Methods Research
Why Do We De-Identify Data?
The de-identification of data is of the highest importance in any social sciences study. Maintaining the confidentiality of human research subjects is a core ethical principle of research (World Medical Association, 2013). For those of us who collect and analyze participant information, it is our professional obligation to take all reasonable steps to protect participant data. In the current age of rapidly evolving AI, managing this responsibility is even more difficult. In addition to concerns of AI and data privacy, other changes in the social sciences have made appropriate data de-identification more relevant than ever.
Qualitative data transparency, or the sharing of qualitative data with a broad audience, is desirable for many purposes and increasingly recommended as a strategy for improving rigor in social sciences research. Additionally, many institutions that fund research (such as the National Institutes of Health in the United States) are requiring data transparency – in this case, the sharing of data in public repositories – in their funded projects. The rationale for these regulations is to maximize the potential value of the data being gathered by allowing others to access, review, and re-analyze the data. The fundamental mandate of data repositories like the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan, the Qualitative Data Repository (QDR) at Syracuse University, and the UK Data Service is to protect, preserve, and provide authorized access to quantitative, qualitative, and mixed methods data sets. This opens data analysis possibilities to many who may not otherwise have the institutional resources or connections to collect data – while contributing to the core social sciences mission of advancing our understanding of social issues.
At this point in history, the stakes for data privacy are perhaps higher than they have ever been. Mandates for increased data transparency have coincided with unprecedented advances in computer science and technology. Large language models (ChatGPT, Claude, Gemini, etc.) are rapidly evolving and can be used to query datasets and return results in minutes, creating new challenges for data de-identification and privacy protection. These capabilities are further evidence that participant data must be rigorously de-identified as early as possible in the research process to maintain their protection.
What Data Are Identifiable?
Maintaining confidentiality can look different depending on the purpose and use of the data throughout a study. Before data collection begins, it is recommended to ask the individuals and communities participating in the research for guidance about what data they feel are potentially sensitive or identifiable. And of course, the process for maintaining participant confidentiality – as well as how the data will be used – is a crucial element of the informed consent process.
Many countries publish national guidelines for what constitutes personally identifiable information (PII) – information that must be removed for data to be considered de-identified to acceptable levels. In the United States, the federal government has published standards for what data are considered identifiable. One such standard for de-identification is the removal of any information classified as protected data under the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule (Office for Civil Rights, United States Department of Health and Human Services, 2025). Consistent with guidelines in other countries (for example, the GDPR in the European Union and the Data Protection Act 2018 in the United Kingdom), the HIPAA Safe Harbor method designates 18 identifiers that must be removed for data to be shared without infringing upon HIPAA protections, including information like individual names, addresses, birth dates, and phone numbers.
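As a concrete illustration, a few of these identifier categories can be flagged with simple pattern matching. The sketch below is a minimal, hypothetical Python example: the patterns cover only phone numbers, email addresses, and slash-formatted dates, not the full Safe Harbor list, and identifiers like names require more context-aware methods than regular expressions can provide.

```python
import re

# Illustrative patterns for a few of the 18 Safe Harbor identifier
# categories. These are a sketch, not an exhaustive or validated set.
PATTERNS = {
    "[PHONE]": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "[DATE]": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def redact(text: str) -> str:
    """Replace each matched identifier with a category placeholder."""
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

# Hypothetical example sentence:
print(redact("Call me at 208-555-0147 or email jane.doe@example.org before 3/14/2022."))
# → Call me at [PHONE] or email [EMAIL] before [DATE].
```

Placeholders like `[PHONE]` (rather than deletion) preserve some analytic utility by recording what kind of information was removed.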
Data that are considered identifiable can also depend on the study context. The narrative data elicited in qualitative studies can be particularly challenging to de-identify. Certain elements, such as an individual's name, are universally considered PII to be removed from study data. However, other elements beyond names, dates, and locations can be present in narrative data that could make it possible to re-identify an individual. For example, in interviews where an individual discloses their experiences with divorce, the details of the divorce would also be known by their ex-spouse, which poses a risk for re-identification.
We finally decided to end things in February 2022. The Covid-19 lockdown really illuminated some problems in our relationship that we just could not get past–the phone time was just one thing. After 6 months of talking to Dr. Gallo, we were still stuck. We were sitting at Slowbrew coffee when we just looked at each other and knew it was time. It was amicable.
The example above demonstrates how some elements of the data beyond names and dates could still be identifiable depending on the context (for example, the separation date, the therapist's name, and the coffee shop). In their 2023 paper, Campbell and colleagues propose a multi-step method for anonymizing qualitative data that includes assessing every data point (e.g., words or phrases in a qualitative transcript) for its potential risk of re-identification. In their process, they outline that researchers seeking to anonymize data must ask themselves: 1) who else would know the information? 2) how would they know that information? and 3) what other records contain that information? These questions then serve as the decision point for whether to retain the data (keep it as-is in the dataset), remediate the data (obscure or change the data to make it less identifiable), or redact the data (remove it entirely) (Campbell et al., 2023).
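The three questions and three dispositions can be sketched as a simple triage helper. To be clear, this is a hypothetical illustration of the decision structure only: the boolean inputs and risk thresholds below are our assumptions, not Campbell et al.'s published rubric, which calls for human judgment at each step.

```python
from dataclasses import dataclass

@dataclass
class DataPoint:
    """A word or phrase flagged during de-identification review."""
    text: str
    known_by_others: bool     # Q1: would someone else know this information?
    easily_linked: bool       # Q2: could they link it back to the participant?
    in_other_records: bool    # Q3: does it appear in other records?

def triage(point: DataPoint) -> str:
    """Map re-identification risk to retain / remediate / redact.

    The thresholds are illustrative assumptions, not a validated rubric.
    """
    risk = sum([point.known_by_others, point.easily_linked, point.in_other_records])
    if risk == 0:
        return "retain"      # keep as-is in the dataset
    if risk <= 2:
        return "remediate"   # obscure or generalize the detail
    return "redact"          # remove it entirely

print(triage(DataPoint("coffee shop name", True, True, False)))  # → remediate
```

In practice, a structure like this is most useful as a documentation aid: it records, for each data point, which questions drove the retain/remediate/redact decision.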
When to Start Thinking about Data Anonymization
Researchers should think about data anonymization throughout the research process and plan to allocate time and resources for data de-identification after data are collected. Prior to conducting research, your research team should determine the level of anonymization that must be applied to the research data (e.g., using pseudonyms instead of real names) at each juncture in the research life cycle. In many cases, you will work with your Institutional Review Board or Research Ethics Committee to facilitate this process in line with institutional requirements. For others in the private sector, looking to other data privacy regulations – such as HIPAA and the GDPR, as well as your institution’s regulations – will help guide how you maintain data integrity.
As you consider your data de-identification strategy, it is very important to be intentional with your method of de-identifying data and follow a repeatable and documented process that can be communicated to others. This documentation allows future investigators to understand and account for the ‘distance’ between the raw data and what is currently available for research access. Remember that de-identifying the data is vital, but you also want to retain the utility of the data for analysis for others. QDR researchers put together a helpful checklist for what qualitative study elements researchers should document and plan to report to facilitate transparency in their research process (Frohwirth, Karcher, & Lever, 2023).
De-ID: A New Tool that Streamlines Data Anonymization
De-identification of qualitative datasets can be an intensive and time-consuming process when it is done correctly. In the past, de-identification of text data has relied on more ‘manual’ redaction methods, such as visually scanning documents and/or using find/replace methods in a word processing program. Newer tools, like the recently released De-ID, are available to help ease the data anonymization process for you.
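To give a sense of what the ‘manual’ find/replace baseline involves, the sketch below applies a replacement mapping in Python while logging each substitution, so the distance between the raw and de-identified text stays documented and repeatable. The terms echo the earlier interview excerpt; the mapping and placeholders themselves are hypothetical.

```python
# A minimal sketch of find/replace pseudonymization with a logged
# mapping. The specific terms and placeholders are hypothetical.
mapping = {
    "Dr. Gallo": "[THERAPIST]",
    "Slowbrew": "[COFFEE SHOP]",
    "February 2022": "[MONTH, YEAR]",
}

def pseudonymize(text: str, mapping: dict) -> tuple:
    """Apply each replacement and record which substitutions were made."""
    log = []
    for original, pseudonym in mapping.items():
        if original in text:
            log.append((original, pseudonym))
            text = text.replace(original, pseudonym)
    return text, log

transcript = "We decided to end things in February 2022 after talking to Dr. Gallo."
clean, log = pseudonymize(transcript, mapping)
print(clean)
# → We decided to end things in [MONTH, YEAR] after talking to [THERAPIST].
```

Even this simple approach shows why manual methods are fragile: misspellings, nicknames, and indirect references slip past exact string matching, which is the gap purpose-built tools aim to close.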
De-ID is an example of a customizable data anonymization tool to assist in de-identifying sensitive data. In De-ID, you can choose your level of anonymization and add custom potential identifiers for the app to flag. Tools such as this help streamline the data anonymization process, saving time while also enabling sharing of data in safe and purposeful ways. More information about De-ID can be found at de-idapp.com.
At a time of emerging and powerful technologies, the stakes for data anonymization and data privacy are as high as ever, particularly as we see large language models being integrated into more aspects of the research life cycle. Exposing data to generative AI – particularly large language models, which have poorly defined protections for data privacy – poses risks to participant data privacy. Importantly, De-ID does not depend on generative AI to anonymize data; instead, it relies on natural language processing algorithms with human-driven oversight, interaction, and approval.

Conclusion
It is important to appreciate that there is no one-size-fits-all solution for data anonymization. Natural language is complex and nuanced in its meanings, and every qualitative data set will be unique in the PII being collected, the nature of the data collection process itself, and the fundamental purpose of the data use – the core ‘value’ in the data as defined by the research project’s intention. These factors are all important in determining how best to approach the protection of research participants while preserving the utility of the data.
If you are new to the practice of data management and data anonymization, don’t be discouraged if these details feel new and daunting. While data anonymization is a serious responsibility and a fundamental principle of data stewardship, the practical strategies used to maintain participant confidentiality are not often discussed in research methods courses. However, good information exists, and we recommend looking into training options that might be available through your institution, such as the CITI training modules on data management in the social sciences, or free resources such as those available on the Qualitative Data Repository’s website or the De-ID website.
References
Campbell, R., Javorka, M., Engleton, J., Fishwick, K., Gregory, K., & Goodman-Williams, R. (2023). Open-science guidance for qualitative research: An empirically validated approach for de-identifying sensitive narrative data. Advances in Methods and Practices in Psychological Science, 6(4), 25152459231205832. https://doi.org/10.1177/25152459231205832
Frohwirth, L., Karcher, S., & Lever, T. A. (2023). A transparency checklist for qualitative research. Preprint at https://doi.org/10.31235/osf.io/wc35g.
Office for Civil Rights, United States Department of Health and Human Services. (2025). Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. United States Department of Health and Human Services. https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html
World Medical Association. (2013). World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA, 310(20), 2191-2194. https://doi.org/10.1001/jama.2013.281053
Suggested Citation:
Calvert, H. G., Lieber, E., & Quash, T. M. (2026). Thinking Through Data Anonymization in Qualitative and Mixed Methods Research. Institute for Mixed Methods Research. www.immrglobal.org
Note:
No generative AI was used in the conceptualization or writing of this research note. All content and writing was created by the authors.
