
Introducing Free-Threaded Python to Debian



Hello everyone,
In my previous email (https://lists.debian.org/debian-python/2025/11/msg00002.html), I introduced my work on bringing free-threaded Python to Debian. I'm now pleased to announce that building the nogil variant alongside the GIL version is complete and has passed the necessary testing.
Benefits of Introducing Free-Threaded Python:
1. Significant improvement in multi-threaded parallel performance
2. No need to migrate to multiprocessing or alternative languages - today multiprocessing is required to use multiple cores for CPU-bound Python code, and it comes with substantial serialization and process-startup overhead (see the sketch after this list)
3. Reduced maintenance burden for C/C++ extensions - currently complex GIL management or custom thread pools are needed to work around the GIL
4. Alignment with future direction - PEP 703 has been accepted, and upstream's stated long-term goal is for free-threading to eventually become the default
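To illustrate point 2: with the concurrent.futures API, moving a CPU-bound job from a process pool to a thread pool is essentially a one-line change, so code that today pays multiprocessing's pickling and startup costs can make that switch once it runs on a free-threaded interpreter. A minimal sketch (the workload and sizes are placeholders of my own, not taken from the packaging work):

import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def busy(n):
    # Placeholder CPU-bound work: a pure-Python loop
    s = 0
    for i in range(n):
        s += i * i
    return s

def run(executor_cls, jobs=8, n=2_000_000):
    # Same job either way; only the executor class differs
    t0 = time.perf_counter()
    with executor_cls(max_workers=jobs) as ex:
        list(ex.map(busy, [n] * jobs))
    return time.perf_counter() - t0

if __name__ == "__main__":
    # On a GIL build only the process pool scales; on a nogil build the
    # thread pool scales too, without the inter-process overhead.
    print("processes:", round(run(ProcessPoolExecutor), 3))
    print("threads:  ", round(run(ThreadPoolExecutor), 3))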
Current Progress:
A merge request has been opened on salsa: https://salsa.debian.org/cpython-team/python3/-/merge_requests/41
An immediately testable package repository is available at: https://salsa.debian.org/ben0i0d/python3t-repo
The related Debian Bug report (#1117718) is also relevant to this work.
Design Decisions:
Based on discussions with Stefano, the current approach:
- Maintains isolated standard libraries
- Does NOT isolate dist-packages
- Is not a separate new package, but rather a variant of python3
My guiding principle remains: introducing free-threaded Python should not create new problems, break existing functionality, or hinder anyone's work.
Technical Implementation Details:
Key changes in python3.14:
1. Introduced ABI variant control: 'ABI_VARIANTS := gil nogil', with '--disable-gil --with-suffix=t' enabled for nogil
(Note: '--enable-experimental-jit' cannot be used with '--disable-gil')
2. Build-system refactoring: the per-ABI build steps are generated with '$(foreach abi,$(ABI_VARIANTS),$(eval $(call BUILD_STATIC,$(abi))))', so the nogil tasks reuse the existing rules instead of duplicating them
3. Test adjustments: Added 'TEST_EXCLUDES += test_tools' since 'Tools/freeze/test/Makefile' uses hardcoded 'python'
4. venv fix: Added 'add-abiflags-sitepackages.diff' so venvs created from the nogil build recognize the abiflag-suffixed site-packages path (this has been fixed upstream; the patch will be dropped after the 3.14.1 release). A quick runtime check of the abiflags handling is sketched after this list.
5. Minimal installation: The nogil version excludes documentation, idle, tk, desktop, binfmt, etc., ensuring it's an extension rather than a rewrite of the original python3
6. Code readability: Made harmless text adjustments to rules file, such as consolidating scattered 'TEST_EXCLUDES +=' statements
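As a quick sanity check of the suffix/abiflags handling mentioned in items 1 and 4, the snippet below (purely illustrative, not part of the packaging) reports whether the running interpreter is the free-threaded build and where its site-packages directory lives; running it under both python3.14 and python3.14t shows the difference:

import sys
import sysconfig

# The free-threaded build carries a "t" in its ABI flags (hence python3.14t)
print("abiflags:       ", sys.abiflags)
# Py_GIL_DISABLED is 1 when CPython was configured with --disable-gil
print("Py_GIL_DISABLED:", sysconfig.get_config_var("Py_GIL_DISABLED"))
# Available since 3.13: whether the GIL is actually active in this process
if hasattr(sys, "_is_gil_enabled"):
    print("GIL enabled:    ", sys._is_gil_enabled())
# The site-packages path that add-abiflags-sitepackages.diff is concerned with
print("purelib:        ", sysconfig.get_path("purelib"))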
Build Process:
git clone git@salsa.debian.org:ben0i0d/python3t.git
cd python3t
git checkout python3
uscan --download-current-version --verbose
dpkg-source -b .
sudo env DEB_BUILD_OPTIONS="nocheck nobench" pbuilder build ../python3.14_3.14.0-5.dsc 2>&1 | tee ../log.txt
Test Environment:
- OS: Debian GNU/Linux forky/sid (forky) x86_64
- CPU: AMD Ryzen 5 9600X (12) @ 5.68 GHz
- Memory: 30.47 GiB
- Kernel: 6.16.12+deb14+1-amd64
Test Results:
1. Basic Performance Test
# GIL version, single-threaded
python3.14 benchmark.py --n 512 --threads 1
Elapsed: 1.836 s
# GIL version, 8 threads
python3.14 benchmark.py --n 512 --threads 8
Elapsed: 2.026 s
# nogil version, single-threaded
python3.14t benchmark.py --n 512 --threads 1
Elapsed: 2.408 s
# nogil version, 8 threads
python3.14t benchmark.py --n 512 --threads 8
Elapsed: 0.674 s
With the GIL, 8 threads are slightly slower than a single thread (2.026 s vs 1.836 s); with nogil, 8 threads are roughly 3.6x faster than single-threaded nogil (0.674 s vs 2.408 s) and about 2.7x faster than the single-threaded GIL build, despite the higher single-threaded overhead of the free-threaded interpreter.
2. NumPy Compatibility Test
Both versions successfully create virtual environments, install numpy, and run the attached NumPy benchmark (a multi-threaded smoke test is also sketched below):
GIL environment:
python3.14 -m venv gil
source gil/bin/activate
pip install numpy
# Runs normally, Elapsed: 0.003 s
nogil environment:
python3.14t -m venv nogil
source nogil/bin/activate
pip install numpy
# Runs normally, Elapsed: 0.009 s
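For a slightly stronger check than creating the venv and importing numpy, something along these lines (an ad-hoc sketch of mine, not one of the attached scripts) exercises NumPy from several threads at once, which is the pattern the nogil build is meant to enable:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def work(seed):
    # Each thread multiplies its own independent pair of matrices
    rng = np.random.default_rng(seed)
    a = rng.random((256, 256))
    b = rng.random((256, 256))
    return float((a @ b).sum())

with ThreadPoolExecutor(max_workers=8) as ex:
    results = list(ex.map(work, range(8)))

print("ok:", [round(r, 1) for r in results])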
Important Notes:
- The GIL-variant build results are identical to those produced from the current master branch
- Building only the GIL version is supported, but building only nogil is not (this ensures nogil cannot become the default)
- No backport to python3.13 is planned since nogil support there is still experimental
Regarding dist-packages:
I understand some colleagues have concerns about not isolating dist-packages. For now, I recommend using the nogil interpreter inside venvs, and I will fully support subsequent migration efforts to help more people transition smoothly.
The current implementation still needs more testing and refinement. I'm not a CPython expert, so I genuinely welcome suggestions from the community and will actively take part in improving this work.
Please refer to the attachments for detailed build logs and test scripts.
I look forward to your feedback!
Best regards,
Xu Chen (ben0i0d)
"""
Multi-threaded matrix multiplication benchmark (pure Python loops).

This script measures how much parallel speedup you can get from multi-threading
in a CPU-bound workload — perfect for comparing GIL vs. no-GIL Python interpreters.

Example:
  python3.14 benchmark.py --n 512 --threads 1
  python3.14 benchmark.py --n 512 --threads 8
  python3.14t benchmark.py --n 512 --threads 1
  python3.14t benchmark.py --n 512 --threads 8
"""

import time
import random
import argparse
from concurrent.futures import ThreadPoolExecutor

def matmul(A, B):
    n = len(A)
    m = len(B[0])
    p = len(B)

    # Transpose B for better cache locality
    Bt = list(zip(*B))

    C = [[0.0] * m for _ in range(n)]
    t0 = time.perf_counter()

    for i in range(n):
        row_A = A[i]
        row_C = C[i]

        for j in range(m):
            col_B = Bt[j]
            s = 0.0
            # Use local variables in innermost loop
            for k in range(p):
                s += row_A[k] * col_B[k]
            row_C[j] = s

    t1 = time.perf_counter()
    return C, t1 - t0

def matmul_range(A, Bt, start_row, end_row):
    m = len(Bt)     # columns of B (Bt is B transposed)
    p = len(Bt[0])  # inner dimension: rows of B == columns of A

    C_part = [[0.0] * m for _ in range(end_row - start_row)]

    for i_local, i_global in enumerate(range(start_row, end_row)):
        row_A = A[i_global]
        row_C = C_part[i_local]
        for j in range(m):
            col_B = Bt[j]
            s = 0.0
            for k in range(p):
                s += row_A[k] * col_B[k]
            row_C[j] = s

    return start_row, C_part

def matmul_threaded(A, B, threads=1):
    n = len(A)

    if threads == 1:
        return matmul(A, B)

    # Transpose B for better cache locality
    Bt = list(zip(*B))

    # Multi-threaded path
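    # Ceiling division: each worker handles at most `step` consecutive rows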
    step = (n + threads - 1) // threads
    futures = []
    
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as executor:
        for i in range(threads):
            start = i * step
            end = min((i + 1) * step, n)
            if start < end:
                futures.append(executor.submit(matmul_range, A, Bt, start, end))
        
        # Collect and combine results
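        # Futures were submitted in ascending row order, so extending in
        # submission order reassembles C row by row; the returned start_row
        # is informational and not needed for reordering here.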
        C = []
        for future in futures:
            start_row, C_part = future.result()
            C.extend(C_part)
    
    t1 = time.perf_counter()
    return C, t1 - t0

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Optimized matrix multiplication benchmark.")
    parser.add_argument("--n", type=int, default=300, help="Matrix size (n x n)")
    parser.add_argument("--threads", type=int, default=4, help="Number of threads")
    args = parser.parse_args()

    n = args.n

    print(f"Matrix size: {n}x{n}, threads: {args.threads}")

    # Generate matrices
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]

    if args.threads == 1:
        _, elapsed = matmul(A, B)
    else:
        _, elapsed = matmul_threaded(A, B, threads=args.threads)

    print(f"Elapsed: {elapsed:.3f} s")
"""
NumPy matrix multiplication benchmark.

This script provides a comparable version of the pure-Python benchmark,
but using NumPy for the core matmul operation.
"""

import time
import argparse
import numpy as np

def matmul_numpy(A, B):
    t0 = time.perf_counter()
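    # A @ B dispatches to NumPy's compiled matmul (typically a BLAS routine),
    # which is why the elapsed time is far below the pure-Python version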
    C = A @ B
    t1 = time.perf_counter()
    return C, t1 - t0

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="NumPy matrix multiplication benchmark.")
    parser.add_argument("--n", type=int, default=300, help="Matrix size (n x n)")
    args = parser.parse_args()

    n = args.n

    print(f"Matrix size: {n}x{n}")

    # Generate matrices (keep behavior same as Python version)
    A = np.random.rand(n, n).astype(float)
    B = np.random.rand(n, n).astype(float)

    _, elapsed = matmul_numpy(A, B)

    print(f"Elapsed: {elapsed:.3f} s")
