PyTorch Streamlines Data Packaging with scikit-build-core

PyTorch is transitioning its non-Python data file packaging from traditional setup.py globs to CMake install() rules, facilitated by a new cmake/PackageData.cmake file. This change, implemented on May 4, 2026, is a key step in migrating PyTorch’s build system to scikit-build-core, aiming to standardize and improve the reliability of including essential assets like type stubs, benchmark utilities, and Inductor codegen files within PyTorch wheels.

PyTorch has introduced cmake/PackageData.cmake to manage the inclusion of non-Python data files into its distribution wheels.
This new CMake-based mechanism replaces the previous reliance on package_data globs within setup.py for scikit-build-core builds.
The change specifically covers critical files such as type stubs, C++ benchmark sources, Inductor codegen templates, and export serialization schemas.
For legacy setuptools builds, the package_data declarations in setup.py remain necessary, and the CMake install rules act as harmless self-copies.

What changed

The core change in PyTorch’s packaging involves the introduction of cmake/PackageData.cmake. Historically, Python packages, including PyTorch, have used setup.py with package_data globs to specify non-Python files (e.g., images, configuration files, C++ headers) that should be included in the final distribution wheel. This approach, while functional, can become complex and error-prone for large, multi-language projects like PyTorch.
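To make the failure mode concrete, the sketch below mimics glob-style selection with pathlib against a throwaway directory. The file names and patterns are hypothetical and do not reflect PyTorch’s actual layout or its setup.py; the point is only that a file sitting one directory deeper than the patterns anticipate is silently left out.

```python
import tempfile
from pathlib import Path


def collect(package_root, patterns):
    """Return the files (as POSIX-style relative paths) selected by any pattern."""
    root = Path(package_root)
    found = set()
    for pattern in patterns:
        for path in root.glob(pattern):
            if path.is_file():
                found.add(path.relative_to(root).as_posix())
    return sorted(found)


def demo():
    """Build a tiny fake package and report which files the globs pick up."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical files; "share/viewer/code.js" is nested one level deeper.
        for rel in ["__init__.pyi", "py.typed",
                    "share/report.html", "share/viewer/code.js"]:
            target = root / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("")

        # Hypothetical package_data-style patterns. "*" does not cross
        # directory boundaries, so "share/*.js" never sees share/viewer/.
        patterns = ["*.pyi", "py.typed", "share/*.html", "share/*.js"]

        included = collect(root, patterns)
        everything = sorted(p.relative_to(root).as_posix()
                            for p in root.rglob("*") if p.is_file())
        missed = [f for f in everything if f not in included]
        return included, missed


included, missed = demo()
print("included:", included)
print("missed:  ", missed)
```

Nothing fails at build time in a case like this; the file is simply absent from the result, which is why such gaps tend to surface only at runtime.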

The new PackageData.cmake file now defines CMake install() rules to explicitly manage these files. This is a direct response to PyTorch’s ongoing migration to scikit-build-core, a modern build backend designed to streamline the compilation and packaging of Python projects with native code dependencies. The files covered by this new system are diverse and critical, including *.pyi type stubs, py.typed markers, C++ sources for benchmarks, HTML/JS/MJS files for model dumps, Inductor codegen headers and templates, and export serialization schemas (YAML and Thrift). For example, the inclusion of *.pyi files ensures better static analysis and IDE support for PyTorch users.

PyTorch builds that still go through setuptools continue to rely on the package_data declarations in setup.py as the primary mechanism for file inclusion. In that mode the new CMake install commands target the existing <root>/torch/ directory, where the files already live, so they amount to “harmless self-copies” and keep both build paths compatible during the transition period.
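One way to picture why a self-copy is harmless: when the install destination is the source tree itself, source and destination resolve to the same file and there is nothing to do. The Python sketch below models that idea only. install_file is a hypothetical helper, not part of CMake, setuptools, or PyTorch, and it makes no claim about how CMake performs the check internally.

```python
import os
import shutil
import tempfile


def install_file(src, dest_dir):
    """Copy src into dest_dir; report 'skipped' if that would copy a file onto itself."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src))
    if os.path.exists(dest) and os.path.samefile(src, dest):
        return "skipped"  # destination is the source: a self-copy
    shutil.copy2(src, dest)
    return "copied"


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical source tree containing a single marker file.
        src_dir = os.path.join(tmp, "torch")
        os.makedirs(src_dir)
        src = os.path.join(src_dir, "py.typed")
        open(src, "w").close()

        # Installing into the source tree itself (the legacy, in-place case).
        in_place = install_file(src, src_dir)
        # Installing into a separate staging tree (the wheel-staging case).
        staged = install_file(src, os.path.join(tmp, "staging", "torch"))
        return in_place, staged


print(demo())
```

The same rule therefore does real work when the destination is a separate staging directory and does nothing when it points back at the source tree.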

How it works

The new system leverages CMake’s build and installation capabilities. When PyTorch is built with scikit-build-core, the rules in cmake/PackageData.cmake run as part of the CMake install step. The file contains a series of install() commands, each specifying a particular category of non-Python data files and its target directory within the Python package structure. For instance, type stubs are placed alongside the modules they describe under torch/, while Inductor codegen files go under torch/_inductor/codegen/.

CMake’s install() command is useful here because it allows precise control over file placement, permissions, and pattern-based filtering of directory contents. This contrasts with package_data in setup.py, which relies on glob patterns that are less explicit and harder to debug when files are missed or incorrectly included. By centralizing these definitions in CMake, PyTorch gains a single source of truth for its non-Python assets in scikit-build-core builds, in line with common practice for hybrid Python/C++ projects.

The integration with scikit-build-core means that the Python packaging layer now delegates the responsibility of collecting and placing these native-adjacent assets to the CMake build system. This modularity separates the concerns of Python metadata from native build logic, leading to a cleaner and more maintainable build process. This approach is similar to how other complex projects manage their native components, ensuring that all necessary files are present in the final wheel, regardless of the user’s specific build environment or operating system.

Why it matters for operators

For operators – be they ML engineers managing deployment pipelines, founders building on PyTorch, or consultants advising on ML infrastructure – this seemingly minor packaging change has significant downstream implications for stability, maintainability, and debugging. The shift from setup.py globs to CMake install() rules for non-Python data files is not just an internal refactor; it’s a move towards a more robust and predictable build process for PyTorch itself.

First, enhanced build consistency is paramount. In complex environments, missing files in a deployed PyTorch wheel can lead to obscure runtime errors. By using CMake’s explicit install rules, PyTorch reduces the likelihood of such packaging mishaps. This means fewer “it works on my machine” scenarios for operators building PyTorch from source or integrating custom extensions. Debugging issues related to missing type stubs or incorrect Inductor codegen files becomes less about guessing packaging issues and more about core logic.
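Operators who build wheels themselves can check for missing data files before deployment rather than at runtime. The following is a minimal sketch using only the standard library. The helper names and the required patterns are assumptions chosen for illustration, and the demo uses an in-memory archive instead of a real PyTorch wheel.

```python
import fnmatch
import io
import zipfile


def missing_from_wheel(wheel, required_patterns):
    """Return the patterns that match no file inside the wheel archive.

    `wheel` may be a filesystem path or a file-like object. Matching uses
    fnmatch semantics, where "*" also matches "/" and so spans subdirectories.
    """
    with zipfile.ZipFile(wheel) as zf:
        names = zf.namelist()
    return [pattern for pattern in required_patterns
            if not any(fnmatch.fnmatchcase(name, pattern) for name in names)]


def build_fake_wheel(names):
    """Create an in-memory zip with empty members, standing in for a wheel."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    buf.seek(0)
    return buf


# Hypothetical contents and requirements; adjust to the assets you depend on.
wheel = build_fake_wheel(["pkg/__init__.py", "pkg/__init__.pyi", "pkg/py.typed"])
print(missing_from_wheel(wheel, ["pkg/py.typed", "pkg/*.pyi", "pkg/*.yaml"]))
```

For a real build, pass the path of the produced .whl file and fail the pipeline if the returned list is non-empty.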

Second, this change signals PyTorch’s continued investment in modern build tooling via scikit-build-core. Operators who maintain their own Python packages with native code may want to evaluate similar practices, since implicit, glob-based packaging tends to accumulate technical debt as a project grows. Adopting scikit-build-core for such projects could simplify build pipelines and improve developer experience, especially when dealing with complex dependencies like custom CUDA kernels or C++ extensions.

Finally, the explicit inclusion of files like type stubs (*.pyi) directly impacts developer productivity. For operators writing PyTorch code, robust type hints improve IDE auto-completion, catch errors earlier, and make large codebases more navigable. While this isn’t a new feature, ensuring these are reliably packaged means better tooling support out-of-the-box. Operators should ensure their internal PyTorch-dependent projects are leveraging these type hints to maximize their own development efficiency.
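A quick way to confirm that typing assets actually arrived in an environment is to inspect the installed package directory. The sketch below is an assumption-laden illustration: typing_assets is a hypothetical helper, and it only reports whether a py.typed marker exists and how many *.pyi files are present, without importing the package.

```python
import importlib.util
from pathlib import Path


def typing_assets(package_name):
    """Report typing-related files for an installed package.

    Returns None if the name is not found or is not a package with a
    directory on disk; otherwise a dict with the marker status and stub count.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    root = Path(next(iter(spec.submodule_search_locations)))
    return {
        "py_typed": (root / "py.typed").is_file(),
        "stub_count": sum(1 for _ in root.rglob("*.pyi")),
    }


# Prints None when torch is not installed in the current environment.
print(typing_assets("torch"))
```

Looking up a top-level name with find_spec does not import the package, so the check stays cheap even for a large package.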

Risks and open questions

Transition Complexity: While the change aims for consistency, the dual-path approach (CMake for scikit-build-core, setup.py for setuptools) introduces a period of increased complexity. Operators building PyTorch from source or maintaining custom forks need to be aware of which build system is active and how it impacts file inclusion.
Potential for Build Breakage: Any significant change to a core project’s build system carries the risk of introducing new build failures, particularly in less common configurations or older environments. While the “harmless self-copies” for setuptools are intended to mitigate this, unexpected interactions are always possible.
Learning Curve for Contributors: For new contributors to PyTorch or those maintaining extensions, understanding the new CMake-centric packaging approach for non-Python assets will require an adjustment. This could temporarily increase the barrier to entry for certain types of contributions.

Sources

PyTorch GitHub Release: Add cmake/PackageData.cmake for scikit-build-core migration

Author

Siegfried Kamgo

Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.
