Which anonymization technique is best for which NLP task? - It depends.
A Systematic Study on Clinical Text Processing

Iyadh Ben Cheikh Larbi Aljoscha Burchardt Roland Roller
Speech and Language Technology Lab
German Research Center for Artificial Intelligence (DFKI)
Alt-Moabit 91c, Berlin, Germany
firstname.lastname@dfki.de

Abstract

Clinical text processing has gained more and more attention in recent years. Access to sensitive patient data, on the other hand, is still a big challenge, as text cannot be shared without legal hurdles and without removing personal information. There are many techniques to modify or remove patient-related information, each with different strengths. This paper investigates the influence of different anonymization techniques on the performance of ML models using multiple datasets corresponding to five different NLP tasks. Several learnings and recommendations are presented. This work confirms that stronger anonymization techniques in particular lead to a significant drop in performance. In addition, most of the presented techniques are not secure against a re-identification attack based on similarity search.

1 Introduction

While clinical text processing has gained more and more attention in recent years, access to data remains a major challenge, as the data typically contains sensitive, patient-related information. A straightforward solution is to apply one of the many existing de-identification and anonymization techniques and to control access to the data, as for instance in Kittner et al. (2021) or Henry et al. (2019). But each technique has different properties, and modifying the source text affects machine learning. What happens when you train a model on an anonymized corpus and test it on your own local data (not anonymized)? In which way does this affect the performance of your model?

To put it more generally: If each anonymization technique has different characteristics, how do they affect different natural language processing (NLP) tasks? Is there some rule of thumb we can follow when choosing an anonymization technique, e.g., to share data for a specific task? To explore those questions in detail, this work conducts a systematic analysis of the influence of different anonymization techniques on the performance of (state-of-the-art) machine learning (ML) models. In the course of this, we train and test the models using six different datasets corresponding to five different NLP tasks. The main contributions of this work are a set of learnings and recommendations regarding text anonymization for NLP tasks, as well as a small, fictitious re-identification experiment to explore the (in-)effectiveness of the different techniques. The software implementations of the different anonymization techniques will be made publicly available.

2 Related Work

In accordance with the HIPAA Safe Harbor method HIPAA (2022), we define de-identification as the removal of protected health information (PHI) that directly relates to an individual, such as name, address, birth date, etc. However, de-identification does not guarantee anonymity for data subjects. Anonymization, on the other hand, is defined as any irreversible procedure after which no information can be linked to any individual Meystre et al. (2010), making the data subjects anonymous and no longer identifiable.

A range of different text anonymization approaches exist in the literature, which modify the text structure within a dataset, delete, replace, or introduce synthetic information, to make it harder to identify or infer factual information on the patient. The following approaches have been explored for this work:

Suppression Mamede et al. (2016) is a technique that either completely removes certain words or sentences or masks them with a neutral label denoting their suppression.
Perturbation Zuo et al. (2021) modifies data through permutation or data swapping; in the case of text, this resembles data augmentation, e.g. flipping characters or changing the order of words.
Substitution Mamede et al. (2016) replaces certain information with more general terms.
Finally, Aggregation (k-anonymity) Samarati and Sweeney (1998) groups individual data subjects together, e.g. by their attribute values, to make it more difficult to identify a single individual.

Only limited work has systematically described the influence of text anonymization on the performance of ML models; most work targets de-identification only. Meystre et al. (2014), for instance, evaluate information loss after de-identification by examining the text-level changes rather than the performance of machine learning models. The work concludes that only 1.2-3% of the clinical concepts are changed by de-identification and that the overall impact on the clinical information is minimal but not negligible.

The work of Obeid et al. (2019) analyzes the impact of de-identification on a binary classification task and concludes that there is no significant difference in performance between training on the original texts and training on the de-identified texts. Similarly, Berg et al. (2020) examine the effect of different PHI concealment strategies on named entity recognition (NER) tasks and show that using moderate- to high-precision de-identification models with the right concealment strategy leads to similar performance. Furthermore, Vakili et al. (2022) explore the effects of two approaches to de-identification, namely pseudonymizing PHI in a text and removing sentences that include PHI. The work concludes that there is no negative impact on the performance of the models on downstream NLP tasks such as NER, text classification, etc.

Lange et al. (2020) also explore the performance of concept extraction using de-identified data and conclude that the performance drop is only marginal. Finally, although not working with clinical text, Lampoltshammer et al. (2019) show that anonymization can cause significant negative changes in sentiment analysis performance on Twitter data.

This work however goes beyond existing related work, as we carry out a structured analysis regarding anonymization of clinical text, testing seven different techniques, on six datasets, including five different NLP tasks.

3 Data and Methods

Corpus	DeI	MNr	ShS	RaS 20%	RaS 100%	SyR 20%	SyR 100%	CnR	Ag2	Ag3	Ag4
Smoking	+1.43	+0.27	+1.05	-5.09*	-5.46*	-4.74	-8.31*	+0.22	-6.34*	-6.80*	-7.25*
Obesity	+0.80	-0.61	-2.55*	-1.94*	-5.09*	-2.99*	-8.96*	-1.31*	-12.48*	-22.59*	-36.97*
MedNLI	+1.55*	+0.14	-	-1.13	-1.93*	-2.52*	-8.42*	-0.73	-7.98*	-13.34*	-14.81*
ClinSTS	-1.21	-0.12	-	-1.36	-0.95	-1.92	-21.96*	-1.84*	-3.30*	-7.26*	-24.31*
2010	-0.32	-0.50*	-	-4.34*	-16.94*	-5.96*	-15.77*	-2.48*	-	-	-
2018	-0.83	-5.10*	-	-3.04*	-25.12*	-2.73*	-9.72*	-1.19*	-	-	-
mean	+0.368	-0.855	-0.355	-2.692	-3.353	-8.907	-12.232	-1.092	-7.342	-12.315	-20.655

Table 1: Anonymization effects: average performance drop/gain across all runs, in percent, in comparison to the best performing system on the corresponding task according to Table 2. Significant (p<0.05) results are marked with *.

The experiments in this work are based on the following datasets and tasks:

2010 i2b2/VA Uzuner et al. (2011) (named entity recognition, NER)
2018 n2c2 Henry et al. (2019) (named entity recognition, NER)
2006 Smoking Challenge Uzuner et al. (2008) (multi-class classification, MCC)
2008 Obesity Challenge Uzuner (2009) (multi-label classification, MLC)
MedNLI Shivade (2019) (natural language inference, NLI)
ClinSTS Wang et al. (2020) (semantic textual similarity, STS)

While the first four datasets include annotated discharge summaries, the last two datasets include pairs of sentences extracted from MIMIC-III Johnson et al. (2016). Due to limited space, we refer the reader to the source papers and to the appendix.

Using those datasets, different text anonymization techniques are applied to the training split. The following techniques are used, based on Suppression, Perturbation, Substitution and Aggregation, as described above:

De-identification (DeI) Using the tool Philter Norgeot et al. (2020), all PHI data in the text is replaced by "XXXX".
Mask Numbers (MNr) All occurrences of numbers in a given text, in both numerical and alphabetical form, are replaced by "XX".
Shuffle Sentences (ShS) Sentences in a given text are shuffled.
Random Swap (RaS) A certain percentage of the words are randomly chosen and swapped across the document.
Synonym Replacement (SyR) A certain percentage of the non-stop words in the document are replaced with WordNet synonyms.
Clinical Concept Synonym Replacement (CnR) All signs/symptoms, diseases/disorders, and medications are replaced by a random UMLS synonym, using cTAKES Savova et al. (2010) for entity linking.
Text Aggregation (AgX) A certain number (X) of shuffled documents are merged into one.
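Three of the simpler techniques in the list above (MNr, ShS and RaS) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the small `NUMBER_WORDS` list, the regex-based sentence splitter and the whitespace tokenization are all hypothetical choices made for this sketch.

```python
import random
import re

# Hypothetical, deliberately small list of spelled-out numbers (assumption).
NUMBER_WORDS = {"zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}


def mask_numbers(text, mask="XX"):
    """MNr sketch: mask digit sequences and spelled-out numbers."""
    text = re.sub(r"\d+(?:[.,]\d+)*", mask, text)
    pattern = r"\b(?:" + "|".join(sorted(NUMBER_WORDS)) + r")\b"
    return re.sub(pattern, mask, text, flags=re.IGNORECASE)


def shuffle_sentences(text, rng):
    """ShS sketch: naive split after sentence-final punctuation, then shuffle."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rng.shuffle(sentences)
    return " ".join(sentences)


def random_swap(text, fraction, rng):
    """RaS sketch: permute the words found at a random fraction of positions."""
    words = text.split()
    k = round(len(words) * fraction)
    positions = rng.sample(range(len(words)), k)
    targets = positions[:]
    rng.shuffle(targets)
    result = words[:]
    for src, dst in zip(positions, targets):
        result[dst] = words[src]
    return " ".join(result)
```

Passing an explicit `random.Random` instance keeps the non-deterministic techniques reproducible, which is convenient when the anonymization has to be repeated several times.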

Finally, in order to examine the effect of anonymization on the performance of state-of-the-art machine learning models, we rely on existing BERT solutions which have achieved promising results on the different datasets in the recent past. Particularly we rely on BERT base (uncased) Devlin et al. (2019), Bio+Clinical BERT Alsentzer et al. (2019), as well as BERT long document classification Mulyar et al. (2019).

4 Experiments

4.1 Setup

For our experiments, we rely, where possible, on the setup and configuration described in the original publications. Each clinical corpus is split into training and test data, and the anonymization is applied to the training data only. For each anonymization technique, a model is trained on the anonymized training data and then evaluated on the original (not anonymized) text of the test split. For each technique, the model is trained and evaluated five times. If the anonymization technique is not deterministic and produces a different anonymized dataset each time, we additionally repeat the text anonymization five times, which results in 25 runs. The results of each approach are averaged and compared to the performance of the base model (without anonymization).
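The run protocol described above can be sketched as follows. This is only an illustration of the counting and averaging logic: `anonymize` and `train_and_evaluate` are hypothetical callbacks standing in for an anonymization technique and for fine-tuning a model and scoring it on the original test split.

```python
import random
import statistics


def run_protocol(train_docs, anonymize, train_and_evaluate, deterministic,
                 n_train_runs=5, n_anonymizations=5, seed=0):
    """Return (mean score, number of runs) for one anonymization technique.

    anonymize(doc, rng) -> str and train_and_evaluate(docs, rng) -> float
    are hypothetical callbacks (assumptions of this sketch).
    """
    rng = random.Random(seed)
    # A deterministic technique always yields the same corpus, so it is
    # applied once; otherwise the anonymization itself is repeated.
    n_versions = 1 if deterministic else n_anonymizations
    scores = []
    for _ in range(n_versions):
        anonymized = [anonymize(doc, rng) for doc in train_docs]
        for _ in range(n_train_runs):
            scores.append(train_and_evaluate(anonymized, rng))
    return statistics.mean(scores), len(scores)
```

With the default arguments, a deterministic technique yields 5 runs and a non-deterministic one yields 5 x 5 = 25 runs, matching the counts given in the text.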

All experiments are conducted with BERT base and Bio+Clinical BERT. The experiments corresponding to the classification tasks (Smoking and Obesity) are additionally conducted with BERT long document classification, as documents in those tasks are quite long. In the case of random swap and synonym replacement, the technique is applied to 20% and 100% of the eligible words.

4.2 Results

First, each model has been trained and tested on the original data, without applying any anonymization beforehand. Results are presented in Table 2. Note that our base results differ slightly from the results in the reference papers, even though we use the same setup.

Model	Smoking	Obesity	MedNLI	ClinSTS	2010	2018
BERT	77.89	67.58	76.9	83.88	82.62	87.84
BioC	75.48	70.73	80.49	84.83	84.54	89.03
LDoc	87.69	82.51	-	-	-	-
Eval	F1	F1	Acc.	Pearson	F1	F1

Table 2: Base results on all datasets in terms of average scores across all runs, using BERT base (BERT), Bio+Clinical BERT (BioC) and BERT long document classification (LDoc).

Next we apply the different text anonymization techniques to the training data, train the models and test them on the original data. The results of the different techniques, in comparison to the best performing base system on that task, are presented in Table 1.

	DeI	MNr	ShS	RaS 20%	RaS 100%	SyR 20%	SyR 100%	CnR	Ag2	Ag3	Ag4
found	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	0.9063	0.6351	0.2789
a/o sim	0.9529	0.8949	1.0	0.9986	0.9986	0.5486	0.2442	0.7512	0.5758	0.4261	0.3470
avg-sim	0.1502	0.1458	0.1524	0.1518	0.1518	0.1084	0.0589	0.1388	0.1646	0.1603	0.1531

Table 3: Re-identification of patients using different text anonymization techniques. found refers to the fraction of cases in which the highest ranked (most similar) document in the original dataset was the correct one; a/o sim describes the similarity between the anonymized document and its original version; avg-sim describes the average similarity between a given anonymized document and all 3500 original documents.

4.3 Analysis

The suppression methods applied here, de-identification (DeI) and mask numbers (MNr), mask some information with a neutral label (‘XXXX’ or ‘XX’). In most cases the general effect is rather minimal. Particularly in the case of DeI, Table 1 shows a slight improvement in performance on several datasets. This is surprising and might be connected to the random model initialization. Another reason could be that, from a model perspective, less relevant information has been discarded. However, only in the case of MedNLI are the results significantly better. MNr on the 2018 task causes a moderate performance loss due to entities related to numerical values, such as dosage or strength. Overall, the results are in line with the findings presented by Berg et al. (2020) and Lange et al. (2020).

In our experiment, perturbation changes the sentence order (shuffle sentences; ShS) and the order of the words within the document (random swap; RaS). Compared to suppression, perturbation shows a stronger performance loss, particularly in the case of RaS. In most cases, the more words are swapped across the document, the stronger the drop in performance (RaS 20% versus 100%). The technique has a particularly strong influence on the NER tasks, in which word order plays an important role. Sentence shuffling, in contrast, shows a significant negative effect on the Obesity task.

The substitution techniques, (WordNet) synonym replacement and (UMLS) clinical concept synonym replacement, behave similarly. Generally, both techniques lead to a drop in performance, which for synonym replacement grows with the number of affected words (20% versus 100%). The drop is notably stronger in the case of synonym replacement, as more words are affected and, depending on the context, wrong synonyms might have been inserted. In the case of clinical concept synonym replacement, the performance loss is notably smaller, possibly because fewer words are affected. Also, depending on the frequency of UMLS mentions, in various cases the preferred concept mention might have been chosen.

Finally, text aggregation, which merges documents according to different characteristics, has the strongest effect on model performance. For all tasks we observe that the more files are aggregated, the stronger the drop in performance. We stopped at a maximum of four documents (Ag4), as the merged case reports would otherwise have become too long. In the case of multi-class classification (Smoking), NLI and STS, documents with the same label have been merged together, thus the effect might not be too strong. However, in the case of multi-label classification (Obesity), the new aggregated documents are not only longer but also contain more labels.

4.4 Re-Identification Experiment

The previous experiment presented the influence of each anonymization technique on the different NLP tasks. To assess the effectiveness and robustness of each technique, we conduct a small and simple fictitious re-identification experiment. The question we investigate is how difficult it would be to link an anonymized text to a particular patient, assuming that an attacker has the anonymized text and access to the original patient database. We conduct this small experiment using 3500 texts from MIMIC-III. The setup is as follows: first we run the different anonymization techniques on the data, and then we run a similarity search by calculating the word-level Jaccard similarity between each anonymized document and all 3500 (original) MIMIC texts.
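The similarity search can be sketched as below, using the word-level Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| over the word sets of two texts. The function names, the lowercasing and the whitespace tokenization are assumptions of this sketch, as is the one-to-one pairing of `anonymized[i]` with `originals[i]` (aggregated documents, which stem from several originals, would need a different notion of "correct").

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two texts."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def reidentify(originals, anonymized):
    """Rank all originals for each anonymized text and summarize the result.

    Assumes anonymized[i] was derived from originals[i].
    """
    hits, own_sims, avg_sims = 0, [], []
    for i, anon in enumerate(anonymized):
        sims = [jaccard(anon, orig) for orig in originals]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += best == i                       # top-ranked document is the source
        own_sims.append(sims[i])                # similarity to its own original
        avg_sims.append(sum(sims) / len(sims))  # similarity to the whole collection
    n = len(anonymized)
    return {"found": hits / n,
            "a/o sim": sum(own_sims) / n,
            "avg-sim": sum(avg_sims) / n}
```

Because the measure works on word sets, any technique that only reorders words leaves the similarity to the original at 1.0, which is why purely perturbation-based techniques offer no protection in this scenario.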

Our setup describes a worst-case scenario, which we hope is unlikely to happen. However, the scenario shows how much the anonymized document differs from its original version and how easily the original could be found using a simple word-based similarity search. As depicted in Table 3, the average similarity (avg-sim) of an anonymized document to the documents in the MIMIC dataset is mostly about 0.15. In contrast, the similarity to the correct document (a/o sim) is always above this average score. However, while the a/o sim score is about 0.9–1 for suppression and perturbation techniques, the similarity decreases strongly with substitution and aggregation, most notably with SyR 100% and Ag4. Nevertheless, only in the case of aggregation is the highest-ranked document not necessarily the corresponding original document, thus providing some (minor) security against a possible re-identification in our scenario. Based on these outcomes, we consider an anonymization ‘stronger’ the lower its a/o sim and found values are.

5 Recommendations and Learnings

Based on the outcomes of the previous two experiments, we draw the following conclusions regarding clinical text anonymization:

There is no one-size-fits-all anonymization technique that can always be recommended. The optimal technique needs to be selected depending on the (security) requirements, the sensitivity of the data, as well as the underlying NLP task. Overall, the results indicate a correlation between performance loss and strength of the anonymization technique, and each technique comes at a cost. While some techniques can be applied quickly, such as sentence shuffling or aggregation, others, such as DeI or SyR, require additional tools and resources, which can prohibit their use in some scenarios.

Text aggregation is the strongest of the presented techniques. It offers relatively good security against re-identification, but leads to the strongest performance loss. This technique not only aggregates the texts and their contexts but also results in less training data, which is one of the reasons for the performance loss, as seen in the appendix. Although text aggregation is generally the technique of choice for providing maximum security, it suffers a strong performance drop in the case of multi-label classification. In this case, we recommend relying on substitution techniques, such as synonym replacement. To provide further abstraction, substituted data could be shuffled and possibly enriched with additional sentences.

Another disadvantage of text aggregation is that long text documents get even longer. Depending on the NLP task, documents need to be processed at once (e.g. document classification), and standard BERT models can only deal with up to 512 input tokens. Relevant information might get lost, resulting in lower model performance.

Tasks which can be conducted at sentence level can be detached from a patient more easily, and thus weaker anonymization techniques can be applied. Finally, the applied perturbation techniques appear not to be useful for the given tasks due to their weak anonymization.

6 Conclusion

This work presented a structured analysis of text anonymization and its influence on machine learning model performance. Our experiments tested seven different anonymization techniques on multiple datasets, covering five different clinical NLP tasks. In addition, we conducted a simple fictitious re-identification experiment to examine the robustness of each technique. Together with the results, we present some recommendations and learnings. In short, we did not find a one-size-fits-all anonymization technique that would perform best in all tasks. The particular decision depends on several factors. In addition, we provide the software to conduct the text anonymization studies. In future work, more de-identification and anonymization techniques could be added to hopefully arrive at a comprehensive overall picture.

References

E. Alsentzer, J. Murphy, W. Boag, W. Weng, D. Jindi, T. Naumann, and M. McDermott (2019) Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, Minnesota, USA, pp. 72–78.
H. Berg, A. Henriksson, and H. Dalianis (2020) The Impact of De-identification on Downstream Named Entity Recognition in Clinical Text. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, pp. 1–11.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv abs/1810.04805.
S. Henry, K. Buchan, M. Filannino, A. Stubbs, and O. Uzuner (2019) 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association 27 (1), pp. 3–12. ISSN 1527-974X.
HIPAA (2022).
A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-Wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific data 3 (1), pp. 1–9.
M. Kittner, M. Lamping, D. T. Rieke, J. Götze, B. Bajwa, I. Jelas, G. Rüter, H. Hautow, M. Sänger, M. Habibi, et al. (2021) Annotation and initial evaluation of a large annotated German oncological corpus. JAMIA open 4 (2), pp. ooab025.
T. J. Lampoltshammer, L. Thurnay, G. Eibl, et al. (2019) Impact of Anonymization on Sentiment Analysis of Twitter Postings. In Data Science–Analytics and Applications, pp. 41–48.
L. Lange, H. Adel, and J. Strötgen (2020) Closing the Gap: Joint De-Identification and Concept Extraction in the Clinical Domain. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6945–6952.
N. Mamede, J. Baptista, and F. Dias (2016) Automated anonymization of text documents. In 2016 IEEE congress on evolutionary computation (CEC), pp. 1287–1294.
S. M. Meystre, O. Ferrández, F. J. Friedlin, B. R. South, S. Shen, and M. H. Samore (2014) Text de-identification for privacy protection: a study of its impact on clinical text information content. Journal of biomedical informatics 50, pp. 142–150.
S. M. Meystre, F. J. Friedlin, B. R. South, S. Shen, and M. H. Samore (2010) Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC medical research methodology 10 (1), pp. 1–16.
A. Mulyar, E. Schumacher, M. Rouhizadeh, and M. Dredze (2019) Phenotyping of Clinical Notes with Improved Document Classification Models Using Contextualized Neural Language Models. ArXiv abs/1910.13664.
B. Norgeot, K. Muenzen, T. A. Peterson, X. Fan, B. S. Glicksberg, G. Schenk, E. Rutenberg, B. Oskotsky, M. Sirota, J. Yazdany, et al. (2020) Protected Health Information filter (Philter): accurately and securely de-identifying free-text clinical notes. NPJ digital medicine 3 (1), pp. 1–8.
J. S. Obeid, P. M. Heider, E. R. Weeda, A. J. Matuskowitz, C. M. Carr, K. Gagnon, T. Crawford, and S. M. Meystre (2019) Impact of de-identification on clinical text classification using traditional and deep learning classifiers. Studies in health technology and informatics 264, pp. 283.
P. Samarati and L. Sweeney (1998) Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression.
G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, and C. G. Chute (2010) Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association 17 (5), pp. 507–513.
C. Shivade (2019) MedNLI - A Natural Language Inference Dataset For The Clinical Domain (version 1.0.0). PhysioNet.
Ö. Uzuner, I. Goldstein, Y. Luo, and I. Kohane (2008) Identifying Patient Smoking Status from Medical Discharge Records. Journal of the American Medical Informatics Association 15 (1), pp. 14–24. ISSN 1067-5027.
Ö. Uzuner, B. R. South, S. Shen, and S. L. DuVall (2011) 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association 18 (5), pp. 552–556. ISSN 1067-5027.
Ö. Uzuner (2009) Recognizing Obesity and Comorbidities in Sparse Data. Journal of the American Medical Informatics Association 16 (4), pp. 561–570. ISSN 1067-5027.
T. Vakili, A. Lamproudis, A. Henriksson, and H. Dalianis (2022) Downstream task performance of BERT models pre-trained using automatically de-identified clinical data. In Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022).
Y. Wang, S. Fu, F. Shen, S. Henry, O. Uzuner, and H. Liu (2020) The 2019 n2c2/OHNLP track on clinical semantic textual similarity: overview. JMIR Medical Informatics 8 (11), pp. e23375.
Z. Zuo, M. Watson, D. Budgen, R. Hall, C. Kennelly, N. Al Moubayed, et al. (2021) Data Anonymization for Pervasive Health Care: Systematic Literature Mapping Study. JMIR medical informatics 9 (10), pp. e29871.

Dataset Name	Smoking	Obesity	MedNLI	ClinSTS	2010	2018
NLP Task	multi class classification (MCC)	multi label classification (MLC)	natural language inference (NLI)	semantic textual similarity (STS)	named entity recognition (NER)	named entity recognition (NER)
Train size	389	730	11232	1642	170	303
Dev size	104*	507*	1395	412*	256*	202*
Test size	104	507	1422	412	256	202
Type of data	discharge summaries	discharge summaries	sentence pairs	sentence pairs	discharge summaries**	discharge summaries**
Avg # token	1100.61	1935.72	37.22	57.65	15.41	63.657

Table 4: Overview of the datasets used for the anonymization experiments. (*) For all datasets except MedNLI, no development (dev) set was provided; therefore, the test set was also used as the dev set. (**) The discharge summaries of the NER tasks have been divided into individual sentences, which have been used to train and test the models. The anonymization techniques were applied directly to those sentences.

Appendix A

A.1 Overview of the data

In total, six text-based clinical datasets have been chosen in this work to examine the effects of anonymization on their respective tasks. Table 4 provides an overview of these datasets as well as some details about their sizes and types.

A.2 Text anonymization techniques - some more details

Here are some additional details that we left out of the main article due to limited space:

De-identification

All datasets we used for our experiments were already de-identified (PHI was replaced with pseudonyms). However, we still tested de-identification in our experiments, as we wanted to explore the influence of masking PHI.

Percentage

The two techniques random swap and synonym replacement are applied to a ‘certain percentage of (non-stop) words’. We have tested those techniques with different percentages of modified words; however, in the main article we report only 20% and 100%.
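How the percentage parameter enters synonym replacement can be sketched as follows. Note the assumptions: the paper uses WordNet, whereas this sketch substitutes a tiny hypothetical lookup table (`TOY_SYNONYMS`) and stop-word list so that it runs without external resources, and it tokenizes on whitespace only.

```python
import random

# Toy stand-ins (assumptions): a real setup would use a full stop-word
# list and WordNet instead of these hand-written tables.
STOP_WORDS = {"the", "a", "of", "and", "is", "with", "was"}
TOY_SYNONYMS = {"pain": ["ache"], "severe": ["intense"], "patient": ["subject"]}


def replace_synonyms(text, fraction, rng, synonyms=TOY_SYNONYMS):
    """Replace a given fraction of the non-stop words with a synonym."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() not in STOP_WORDS]
    k = round(len(candidates) * fraction)
    for i in rng.sample(candidates, k):
        options = synonyms.get(words[i].lower())
        if options:  # words without a known synonym stay unchanged
            words[i] = rng.choice(options)
    return " ".join(words)
```

Calling `replace_synonyms(text, 0.2, rng)` and `replace_synonyms(text, 1.0, rng)` would correspond to the two settings reported in the main article.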

Clinical Concept Synonym Replacement

The technique replaces each detected sign/symptom, disease/disorder, and medication with a synonym provided by UMLS. This synonym is selected randomly from all English mentions corresponding to the given concept unique identifier (CUI). This means that the original entity mention can also be re-inserted by this random selection. In this way, more common mentions are favored over rare entity mentions in UMLS.
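The effect of uniform random selection over all mentions can be illustrated with a small sketch. The mention table below is entirely hypothetical (the key `CUI_A` is a placeholder, not a real identifier), and the sketch assumes that the mention list of a concept may contain the same surface form several times, which is what makes frequent forms more likely to be drawn.

```python
import random
from collections import Counter

# Hypothetical mention list for one concept (assumption): the same
# surface form may be listed more than once.
TOY_MENTIONS = {
    "CUI_A": ["heart attack", "heart attack", "myocardial infarction", "MI"],
}


def replace_concept(cui, rng, mentions=TOY_MENTIONS):
    """Pick a replacement uniformly at random from all listed mentions."""
    return rng.choice(mentions[cui])


rng = random.Random(0)
draws = Counter(replace_concept("CUI_A", rng) for _ in range(2000))
print(draws.most_common())
```

Since the draw is uniform over list entries rather than over distinct strings, a mention listed twice is selected about twice as often as one listed once, and the original mention can be drawn again.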

Aggregation

Text aggregation merges two or more files together into one file. This significantly increases the document size and number of tokens, which makes discharge summaries harder to process for standard BERT models (512 tokens). For this reason we also used BERT long document classification. Moreover, text aggregation decreases the size of the training data by a factor equal to the number of files that are merged together. Merging, for instance, four files into one decreases the data to 25% of its original size. Repeating the aggregation step several times with different files would create a larger dataset, but makes it much easier to re-identify single patient documents. Thus, the aggregation would no longer have any effect. See more in Section A.4.

Text aggregation merges documents according to their target label. In the case of the Smoking task, for instance, only documents reporting the exact same smoking status could be merged together (randomly). However, this did not work in the case of multi-label classification. Thus, in the case of Obesity, random documents were aggregated.
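Label-wise aggregation as described above can be sketched as follows. The function name and the handling of leftovers are assumptions of this sketch: groups that cannot be filled completely are dropped here so that every released document mixes exactly `group_size` sources, which may differ from what the original implementation does.

```python
import random
from collections import defaultdict


def aggregate(docs, labels, group_size, rng):
    """Merge `group_size` randomly chosen documents that share a label."""
    by_label = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_label[label].append(doc)

    merged_docs, merged_labels = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        for start in range(0, len(group), group_size):
            chunk = group[start:start + group_size]
            if len(chunk) == group_size:  # assumption: drop incomplete groups
                merged_docs.append(" ".join(chunk))
                merged_labels.append(label)
    return merged_docs, merged_labels
```

With `group_size=4`, eight documents collapse into two, i.e. 25% of the original training size, which illustrates the data reduction discussed above.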

A.3 Implemented Models

To perform the experiments, we implemented six different models in which the pre-trained BERT models (BERT base uncased and Bio+Clinical BERT) are adapted and fine-tuned to perform the tasks behind the corresponding datasets. The configuration and hyper-parameters of the models are presented in Table 5.

Models	Smoking	Obesity	MedNLI	ClinSTS	2010	2018
Token sequence length	512	512	150	150	150	200
Epochs	40*	40*	3	3	3	3
Linear layer’s input	[CLS] token’s output embedding	[CLS] token’s output embedding	[CLS] token’s output embedding	[CLS] token’s output embedding	Each token’s output embedding	Each token’s output embedding
Linear layer’s input size	768	768	768	768	768	768
Linear layer’s output size	5	16	3	1	9	22
Loss function	Cross entropy loss	Binary cross entropy loss	Cross entropy loss	Mean squared error	Cross entropy loss	Cross entropy loss
Activation function	Softmax	Sigmoid	Softmax	Identity function	Softmax	Softmax
Evaluation function	Micro-F1	Micro-F1	Accuracy	Pearson	Micro-F1	Micro-F1
Optimizer	AdamW	BertAdam	BertAdam	BertAdam	BertAdam	BertAdam

Table 5: Models overview.

In addition to the implemented models, the BERT long document classification model has been downloaded from GitHub (https://github.com/AndriyMulyar/bert_document_classification) and used as is on the text classification datasets and their anonymized versions. Some preprocessing might be necessary to input the anonymized data to the model, but the architecture is left intact. Furthermore, during the preprocessing of the NER texts, the words have been annotated according to the IOB2 format.
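For readers unfamiliar with IOB2, the following sketch converts token-level entity spans into IOB2 tags: the first token of an entity receives a `B-` prefix, subsequent tokens an `I-` prefix, and all other tokens the tag `O`. The function name, the span representation and the example label are illustrative assumptions.

```python
def to_iob2(tokens, spans):
    """Convert entity spans to IOB2 tags.

    spans: list of (start, end, label) token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # every entity starts with B-
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining tokens are inside
    return tags


tokens = ["Patient", "denies", "chest", "pain", "and",
          "shortness", "of", "breath"]
tags = to_iob2(tokens, [(2, 4, "problem"), (5, 8, "problem")])
print(list(zip(tokens, tags)))
```

Per-token tags of this kind are what a token-level classification head is trained to predict.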

A.4 Augmented Text Aggregation

Among the anonymization techniques tested during this work is one called Augmented Text Aggregation (AAgX). It consists of repeating the simple Text Aggregation technique presented in Section 3 $n$ independent times. The resulting datasets are then simply merged together. We introduced this technique to generate more training data, as the Text Aggregation technique reduces the training size to a factor of $1/n$. This technique should not be used in practice, as it makes it easier, with more resources, to single out patients that are present in multiple texts.
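A minimal sketch of the repetition step is given below. It is an illustration only: label-wise grouping is omitted for brevity, incomplete groups are dropped, and the function names are hypothetical.

```python
import random


def aggregate_once(docs, group_size, rng):
    """One round of simple aggregation: shuffle, then merge fixed-size groups."""
    shuffled = docs[:]
    rng.shuffle(shuffled)
    return [" ".join(shuffled[i:i + group_size])
            for i in range(0, len(shuffled) - group_size + 1, group_size)]


def augmented_aggregate(docs, group_size, n_repeats, rng):
    """Repeat the aggregation independently and concatenate the results."""
    merged = []
    for _ in range(n_repeats):
        merged.extend(aggregate_once(docs, group_size, rng))
    return merged
```

In this sketch, every source document ends up in `n_repeats` different merged documents, each time with different neighbours. That overlap is exactly what an attacker could exploit to isolate a single patient's text.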

Table 6 highlights the performance drops or gains after applying this anonymization technique to the datasets, in comparison to the reported results in Table 1. In all cases, the results of Augmented Text Aggregation are better than those of simple Text Aggregation, most notably for ClinSTS.

	AAg2	AAg3	AAg4
Smoking	+2.89	-4.88	-6.25
Obesity	-5.84	-14.48	-31.68
MedNLI	-6.60	-11.09	-12.13
ClinSTS	-1.45	-3.04	-3.14
2010	-	-	-
2018	-	-	-
mean	-2.55	-8.19	-13.12

Table 6: Augmented Text Aggregation
