Description
The introduction of the European General Data Protection Regulation (GDPR) in 2018 had far-reaching effects on the handling and use of personal data. Anonymized data is exempt from the GDPR because, ideally, no conclusions can be drawn about natural persons. In response, global interest in data anonymization has greatly increased, which is reflected in the development of various new anonymization techniques. Anonymization is of particular interest for Large Language Models (LLMs), since it has been shown that training data can be extracted retrospectively. To achieve results in accordance with the GDPR, well-performing anonymization models are necessary. While many such models exist for the English language, models for German texts are lacking.

The main goal of the NERMAN project is to research machine learning models for

- the identification of personal information in German texts and
- methods for adequate anonymization of the identified data.

To achieve this goal, we plan to develop a Named Entity Recognition (NER) model with a focus on the detection of personal data. This is to be realised on the basis of two use cases to be defined in the project. A special focus is the anonymization of texts provided by the BMI, which mostly consist of email and chat correspondence. To develop a performant model, the quality of the training data is crucial. To acquire such data, we plan to implement a combination of web scraping and synthetic data generation. The data will also be compared to a ground truth via statistical metrics to ensure that it is valid. As there is currently a lack of German NER datasets, we will provide a benchmark dataset based on our acquired data. The developed models will be thoroughly evaluated in terms of performance, efficiency, resource use, and ethical and legal aspects.
The ethical and legal framework constructed as a result will also be of value for future evaluations of anonymization quality in the context of Artificial Intelligence (AI). To demonstrate the practical application of our research, a proof of concept will be built. As a major innovation of the NERMAN project, an NER model tailored to German-language email and chat data will be developed for the first time. Furthermore, we plan to use LLMs to generate synthetic data exhibiting specific text characteristics for NER model training. The generated data will be made publicly available as a benchmark dataset, a significant development because it is the first time such data will be provided for such a highly confidential sector. Lastly, a quantitative framework for evaluating the quality of personal data and anonymization methods will be constructed.
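To illustrate the two-step idea of identifying personal data and then anonymizing it, the following minimal Python sketch replaces already-detected entity spans with placeholder labels. It is not the NERMAN model or method: the entity spans are written by hand rather than predicted by a NER model, and the names `Entity`, `find_span`, `anonymize` and the labels `PERSON` and `EMAIL` are assumptions chosen purely for illustration.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """A detected span of personal data (hypothetical structure)."""
    start: int   # index of the first character of the span
    end: int     # index one past the last character of the span
    label: str   # entity type, e.g. "PERSON" (illustrative label set)


def find_span(text: str, surface: str, label: str) -> Entity:
    """Stand-in for a NER model: locate a known string in the text."""
    start = text.index(surface)
    return Entity(start, start + len(surface), label)


def anonymize(text: str, entities: List[Entity]) -> str:
    """Replace each entity span with a placeholder such as [PERSON]."""
    parts = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        if ent.start < cursor:
            raise ValueError("overlapping entity spans are not supported")
        parts.append(text[cursor:ent.start])  # keep text before the entity
        parts.append(f"[{ent.label}]")        # substitute the placeholder
        cursor = ent.end
    parts.append(text[cursor:])               # keep the remaining text
    return "".join(parts)


text = "Hallo Anna Muster, bitte schreiben Sie an anna@example.org."
entities = [
    find_span(text, "Anna Muster", "PERSON"),
    find_span(text, "anna@example.org", "EMAIL"),
]
print(anonymize(text, entities))
```

Placeholder substitution is only one possible anonymization strategy; whether it is adequate in the GDPR sense depends on the context and on how reliably the entities are detected in the first place.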
Details
| Duration | 01/10/2025 - 31/03/2027 |
|---|---|
| Funding | FFG |
| Program | |
| Department | |
| Principal investigator for the project (University for Continuing Education Krems) | Mag. Anna-Sophie Novak, LL.M. |
| Project members | |