Re: Idea: Reducing GDPR risk via automated log and data minimization

Thank you for your reply and for sharing your perspective.

I would like to clarify one point, because I may not have expressed myself clearly.

My concern is not about having AI “read” or analyze personal data as such. I fully understand that this can itself create additional GDPR and ethical risks. The point I was trying to raise comes more from an organizational angle.

Given that there are currently no dedicated people in a GDPR-focused role, my worry is that privacy-related work may end up being purely reactive, with someone having to act as a “firefighter” on top of their main responsibilities. I was thinking about whether there could be more proactive approaches to data minimization, so that fewer problematic records exist in the first place.

I am not claiming that my idea is the right solution, nor that Debian should use AI for this. I only wanted to express a concern about privacy, which I consider a very important value in Debian, and to share a possible angle for discussion.

I also noticed that there is a debian-ai mailing list, and since I am new to Debian mailing lists, it is possible that this was not the most appropriate list to bring up this idea. If so, I apologize for the noise and appreciate the guidance.

Thank you for taking the time to reply.

Best regards,
pipo

On Wed, 7 Jan 2026 at 14:11, Bart Martens (<bartm@debian.org>) wrote:

On Wed, Jan 07, 2026 at 01:33:55AM -0300, pedro vezzosi wrote:
> Hello,
>
> I would like to share a conceptual idea for discussion, not a concrete
> implementation proposal.
>
> One of the current challenges for large and long-lived projects like Debian
> is the accumulation of historical logs, archives, and public records that
> may contain personal data (IPs, emails, names), especially for oldstable
> and EOL releases.
>
> My idea is a layered approach to data minimization:
>
> 1. Strict retention periods for raw logs (for example 30–90 days).
> 2. Automatic sanitization and anonymization of historical public records.
> 3. Use of an AI-assisted classification step (human-in-the-loop), where:

I would rather make that: "protect personal data from artificial intelligence",
so the opposite of AI-assisted classification of personal data. Frankly, we
should start erasing personal data before we no longer can.

>    - Clear personal data is anonymized automatically.
>    - Ambiguous cases are isolated for human review.
> 4. Preservation of technical knowledge via summarized, signed incident
>    records, instead of keeping large volumes of raw personal data.
>
> The goal would be to reduce GDPR exposure while keeping technical value,
> without rewriting history or removing useful information.
>
> I am not proposing to implement this myself, only offering an idea that
> could be discussed or explored in the future.
>
> Thank you for your time.
>
> Best regards,
> pipo
