* M. Zhou <lumin@debian.org> [2023-12-27 19:00]:
Thanks for sharing the figure. The data seems correlated with the number of new Debian accounts. See the figure below:

Python Code for this figure:

```
# modified from ChatGPT.
# XXX: members.csv is copy-pasted from https://nm.debian.org/members/
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('members.csv', sep='\t')
df = df[df['Since'] != '(unknown)']  # filter out invalid data
df['Since'] = pd.to_datetime(df['Since'])
df['Year'] = df['Since'].dt.year
account_counts = df['Year'].value_counts().sort_index()
smoothed_counts = account_counts.rolling(window=3).mean()

plt.figure(figsize=(10, 6))
plt.bar(account_counts.index, account_counts.values, color='skyblue')
plt.plot(smoothed_counts.index, smoothed_counts.values, color='orange',
         label=f'Smoothed (Window=3)')
plt.xlabel('Year')
plt.ylabel('Number of Accounts Created')
plt.title('Number of Accounts Created Each Year')
plt.legend()
plt.savefig('nm-year.png')
```
Thanks for the code and the figure. Indeed, the trend is confirmed by fitting a linear model count ~ year to the new members list. The coefficient is -1.39 members/year, which is significantly different from zero (F[1,22] = 11.8, p < 0.01). Even when we take out the data from year 2001, which could be interpreted as an outlier, the trend is still significant, with a drop of 0.98 members/year (F[1,21] = 8.48, p < 0.01).
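For anyone who wants to try a fit of this kind in Python, here is a minimal sketch of an ordinary least squares fit of count ~ year together with its F statistic. The function name `fit_linear_trend` and the counts are made up for illustration; this is neither the nm.debian.org data nor the code that produced the numbers above.

```python
import numpy as np

def fit_linear_trend(years, counts):
    """Least-squares fit of counts ~ year.

    Returns the slope (counts per year), the F statistic for the slope,
    and its degrees of freedom (1, n - 2).
    """
    x = np.asarray(years, dtype=float)
    y = np.asarray(counts, dtype=float)
    n = len(x)
    xc = x - x.mean()                      # centering leaves the slope unchanged
    slope, intercept = np.polyfit(xc, y, 1)
    fitted = intercept + slope * xc
    ss_res = np.sum((y - fitted) ** 2)         # residual sum of squares
    ss_reg = np.sum((fitted - y.mean()) ** 2)  # sum of squares explained by the fit
    f_stat = ss_reg / (ss_res / (n - 2))
    return slope, f_stat, (1, n - 2)

# Synthetic yearly counts, for illustration only (NOT the real member data)
rng = np.random.default_rng(0)
years = np.arange(2000, 2024)
counts = 60 - 1.4 * (years - 2000) + rng.normal(0, 5, size=years.size)

slope, f_stat, dof = fit_linear_trend(years, counts)
print(f"slope = {slope:.2f} per year, F[{dof[0]},{dof[1]}] = {f_stat:.1f}")
```

The p-value follows from the upper tail of the F distribution with those degrees of freedom (for example via `scipy.stats.f.sf`, if SciPy is available).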
Best,

Rafael Laboissière

P.S.1: The correct way to do the analysis above is by using a generalized linear model, with the count data from a Poisson distribution (or, perhaps, by considering overdispersed data). I will eventually add this to my code in Git.
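A rough sketch of what such a Poisson model could look like in Python is given here, assuming a log link and a hand-rolled iteratively reweighted least squares (IRLS) loop in NumPy. The function name `fit_poisson_trend` and the data are hypothetical; this is not the code in Git mentioned above, and in practice a statistics package would be the more usual choice.

```python
import numpy as np

def fit_poisson_trend(years, counts, n_iter=50, tol=1e-10):
    """Poisson regression log(E[count]) = b0 + b1 * (year - mean year), via IRLS.

    Returns the coefficients (b0, b1) and the Pearson dispersion estimate;
    a dispersion well above 1 hints at overdispersed data.
    """
    x = np.asarray(years, dtype=float)
    y = np.asarray(counts, dtype=float)
    X = np.column_stack([np.ones_like(x), x - x.mean()])
    beta = np.array([np.log(y.mean()), 0.0])   # start from a constant-rate model
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu                # working response
        XtW = X.T * mu                         # Poisson working weights are mu
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    mu = np.exp(X @ beta)
    dispersion = np.sum((y - mu) ** 2 / mu) / (len(y) - X.shape[1])
    return beta, dispersion

# Synthetic yearly counts, for illustration only (NOT the real member data)
rng = np.random.default_rng(1)
years = np.arange(2000, 2024)
counts = rng.poisson(np.exp(4.0 - 0.03 * (years - years.mean())))

beta, dispersion = fit_poisson_trend(years, counts)
print(f"relative change per year: {np.exp(beta[1]) - 1:+.1%}")
print(f"Pearson dispersion: {dispersion:.2f}")
```

With a log link the slope is multiplicative, so `exp(b1) - 1` reads as a relative change per year rather than a fixed number of members per year.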
P.S.2: In your Python code, it is possible to get the data frame directly from the web page, without copying&pasting. Just replace the line:

    df = pd.read_csv('members.csv', sep='\t')

by:

    df = pd.read_html("https://nm.debian.org/members/")[0]

I am wondering whether ChatGPT could have figured this out…