
Re: Community renewal and project obsolescence



On 12/28/23 10:34, Rafael Laboissière wrote:

* M. Zhou <lumin@debian.org> [2023-12-27 19:00]:

Thanks for the code and the figure. Indeed, the trend is confirmed by fitting a linear model count ~ year to the new-members list. The coefficient is -1.39 members/year, which is significantly different from zero (F[1,22] = 11.8, p < 0.01). Even when we take out the data from year 2001, which could be interpreted as an outlier, the trend is still significant, with a drop of 0.98 members/year (F[1,21] = 8.48, p < 0.01).
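For the record, a minimal sketch of such a linear fit, using invented yearly counts rather than the real nm.debian.org data (the numbers below are assumptions for illustration only):

```python
# Sketch of a linear fit of new-member count against year.
# The counts here are made up; the real data come from nm.debian.org.
import numpy as np
from scipy.stats import linregress

years = np.arange(2000, 2024)                       # 24 hypothetical years
counts = np.array([52, 44, 39, 41, 35, 38, 30, 33,  # invented counts with a
                   28, 31, 25, 27, 24, 22, 25, 20,  # downward trend
                   18, 21, 16, 17, 14, 15, 12, 13])

res = linregress(years, counts)
print(f"slope = {res.slope:.2f} members/year, p = {res.pvalue:.4f}")
```

With real data, res.slope gives the fitted change in members per year and res.pvalue tests whether it differs from zero.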

I thought about using some models from population statistics, so we could estimate the DD birth rate and the DD retirement/departure rate, as well as make a prediction. But since the descendants of DDs are not naturally new DDs, the typical population models are unlikely to work well. The birth of a DD is more like a mutation, sort of.

Anyway, we do not need sophisticated mathematical models to conclude that Debian is an aging community. And yet we don't seem to have a good way to reshape the curve using Debian's funds -- this is one of the key problems behind the data.

P.S.1: The correct way to do the analysis above is with a generalized linear model, treating the counts as draws from a Poisson distribution (or, perhaps, as overdispersed count data). I will eventually add this to my code in Git.

Why not integrate them into nm.debian.org when they are ready?

P.S.2: In your Python code, it is possible to get the data frame directly from the web page, without copying&pasting. Just replace the line:

    df = pd.read_csv('members.csv', sep='\t')

by:

    df = pd.read_html("https://nm.debian.org/members/")[0]

I am wondering whether ChatGPT could have figured this out…

I just specified the CSV input format based on what I had copied. It produces well-formatted code with detailed documentation most of the time. I deleted a lot from its output to keep the snippet short.

I have to clarify one thing to avoid giving you a wrong impression of large language models. In fact, the performance of an LLM (such as ChatGPT) varies greatly with the prompt and the context people provide to it. Exploring this in-context learning capability is still a cutting-edge research topic. For current LLMs, the answers on boilerplate code like plotting (matplotlib) and simple statistics (pandas) are almost frighteningly good.

