The following commit has been merged in the master branch:

commit 5bfa1afbefa3e5bd26c957cebb0d84e7740f3de2
Merge: 4a1a736e56370d13f19a878f28fa554fa4fd8d9a e26a543cdd84f21d29ed25a25a0e0642dfd81cfb
Author: Stephen Rothwell <sfr@canb.auug.org.au>
Date:   Mon Sep 16 15:51:36 2024 +1000

    Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git

    # Conflicts:
    #	arch/arm64/Kconfig
diff --combined Documentation/admin-guide/media/vivid.rst
index ac233b142a279,c9d301ab46a38..034ca7c77fb97
--- a/Documentation/admin-guide/media/vivid.rst
+++ b/Documentation/admin-guide/media/vivid.rst
@@@ -328,7 -328,7 +328,7 @@@ and an HDMI input, one input for each i
 detail below.
 
 Special attention has been given to the rate at which new frames become
-available. The jitter will be around 1 jiffie (that depends on the HZ
+available. The jitter will be around 1 jiffy (that depends on the HZ
 configuration of your kernel, so usually 1/100, 1/250 or 1/1000 of a
 second), but the long-term behavior is exactly following the framerate.
 So a framerate of 59.94 Hz is really different from 60 Hz. If the framerate
@@@ -1343,7 -1343,7 +1343,7 @@@ Some Future Improvement
 Just as a reminder and in no particular order:
 
 - Add a virtual alsa driver to test audio
-- Add virtual sub-devices and media controller support
+- Add virtual sub-devices
 - Some support for testing compressed video
 - Add support to loop raw VBI output to raw VBI input
 - Add support to loop teletext sliced VBI output to VBI input
@@@ -1358,4 -1358,4 +1358,4 @@@
 - Make a thread for the RDS generation, that would help in particular for
   the "Controls" RDS Rx I/O Mode as the read-only RDS controls could be
   updated in real-time.
-- Changing the EDID should cause hotplug detect emulation to happen.
+- Changing the EDID doesn't wait 100 ms before setting the HPD signal.
diff --combined Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst
index c146e5bba8818,731c266beb1a1..dc728c739e28d
--- a/Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst
+++ b/Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst
@@@ -14,10 -14,10 +14,10 @@@ Gestor de tareas CF
 CFS viene de las siglas en inglés de "Gestor de tareas totalmente justo"
 ("Completely Fair Scheduler"), y es el nuevo gestor de tareas de escritorio
-implementado por Ingo Molnar e integrado en Linux 2.6.23. Es el sustituto de
-el previo gestor de tareas SCHED_OTHER.
-
-Nota: El planificador EEVDF fue incorporado más recientemente al kernel.
+implementado por Ingo Molnar e integrado en Linux 2.6.23. Es el sustituto
+del previo gestor de tareas SCHED_OTHER. Hoy en día se está abriendo camino
+para el gestor de tareas EEVDF, cuya documentación se puede ver en
+Documentation/scheduler/sched-eevdf.rst
 
 El 80% del diseño de CFS puede ser resumido en una única frase: CFS
 básicamente modela una "CPU ideal, precisa y multi-tarea" sobre hardware
@@@ -109,7 -109,7 +109,7 @@@ para que se ejecute, y la tarea en ejec
 ==================================
 
 CFS usa una granularidad de nanosegundos y no depende de ningún
-jiffie o detalles como HZ. De este modo, el gestor de tareas CFS no tiene
+jiffy o detalles como HZ. De este modo, el gestor de tareas CFS no tiene
 noción de "ventanas de tiempo" de la forma en que tenía el gestor de
 tareas previo, y tampoco tiene heurísticos. Únicamente hay un parámetro
 central ajustable (se ha de cambiar en CONFIG_SCHED_DEBUG):
diff --combined MAINTAINERS
index 684f820f422ea,cce4291e223dc..97569171356ef
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@@ -334,7 -334,6 +334,7 @@@ L: linux-acpi@vger.kernel.or
 L:	linux-arm-kernel@lists.infradead.org (moderated for non-subscribers)
 S:	Maintained
 F:	drivers/acpi/arm64
+F:	include/linux/acpi_iort.h
ACPI FOR RISC-V (ACPI/riscv) M: Sunil V L sunilvl@ventanamicro.com @@@ -538,17 -537,6 +538,17 @@@ F: drivers/leds/leds-adp5520. F: drivers/mfd/adp5520.c F: drivers/video/backlight/adp5520_bl.c
+ADP5585 GPIO EXPANDER, PWM AND KEYPAD CONTROLLER DRIVER +M: Laurent Pinchart laurent.pinchart@ideasonboard.com +L: linux-gpio@vger.kernel.org +L: linux-pwm@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/*/adi,adp5585*.yaml +F: drivers/gpio/gpio-adp5585.c +F: drivers/mfd/adp5585.c +F: drivers/pwm/pwm-adp5585.c +F: include/linux/mfd/adp5585.h + ADP5588 QWERTY KEYPAD AND IO EXPANDER DRIVER (ADP5588/ADP5587) M: Michael Hennerich michael.hennerich@analog.com S: Supported @@@ -1025,13 -1013,6 +1025,13 @@@ S: Supporte T: git https://gitlab.freedesktop.org/agd5f/linux.git F: drivers/gpu/drm/amd/display/
+AMD DISPLAY CORE - DML +M: Chaitanya Dhere chaitanya.dhere@amd.com +M: Jun Lei jun.lei@amd.com +S: Supported +F: drivers/gpu/drm/amd/display/dc/dml/ +F: drivers/gpu/drm/amd/display/dc/dml2/ + AMD FAM15H PROCESSOR POWER MONITORING DRIVER M: Huang Rui ray.huang@amd.com L: linux-hwmon@vger.kernel.org @@@ -1172,13 -1153,6 +1172,13 @@@ S: Supporte F: arch/arm64/boot/dts/amd/amd-seattle-xgbe*.dtsi F: drivers/net/ethernet/amd/xgbe/
+AMLOGIC BLUETOOTH DRIVER +M: Yang Li yang.li@amlogic.com +L: linux-bluetooth@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/net/bluetooth/amlogic,w155s2-bt.yaml +F: drivers/bluetooth/hci_aml.c + AMLOGIC DDR PMU DRIVER M: Jiucheng Xu jiucheng.xu@amlogic.com L: linux-amlogic@lists.infradead.org @@@ -1228,13 -1202,6 +1228,13 @@@ W: https://ez.analog.com/linux-software F: Documentation/devicetree/bindings/iio/dac/adi,ad3552r.yaml F: drivers/iio/dac/ad3552r.c
+ANALOG DEVICES INC AD4000 DRIVER +M: Marcelo Schmitt marcelo.schmitt@analog.com +L: linux-iio@vger.kernel.org +S: Supported +W: https://ez.analog.com/linux-software-drivers +F: Documentation/devicetree/bindings/iio/adc/adi,ad4000.yaml + ANALOG DEVICES INC AD4130 DRIVER M: Cosmin Tanislav cosmin.tanislav@analog.com L: linux-iio@vger.kernel.org @@@ -1642,14 -1609,6 +1642,14 @@@ F: Documentation/admin-guide/perf/xgene F: Documentation/devicetree/bindings/perf/apm-xgene-pmu.txt F: drivers/perf/xgene_pmu.c
+APPLIED MICRO QT2025 PHY DRIVER +M: FUJITA Tomonori fujita.tomonori@gmail.com +R: Trevor Gross tmgross@umich.edu +L: netdev@vger.kernel.org +L: rust-for-linux@vger.kernel.org +S: Maintained +F: drivers/net/phy/qt2025.rs + APTINA CAMERA SENSOR PLL M: Laurent Pinchart Laurent.pinchart@ideasonboard.com L: linux-media@vger.kernel.org @@@ -1778,17 -1737,6 +1778,17 @@@ F: drivers/mtd/maps/physmap-versatile. F: drivers/power/reset/arm-versatile-reboot.c F: drivers/soc/versatile/
+ARM INTERCONNECT PMU DRIVERS +M: Robin Murphy robin.murphy@arm.com +S: Supported +F: Documentation/admin-guide/perf/arm-cmn.rst +F: Documentation/admin-guide/perf/arm-ni.rst +F: Documentation/devicetree/bindings/perf/arm,cmn.yaml +F: Documentation/devicetree/bindings/perf/arm,ni.yaml +F: drivers/perf/arm-cmn.c +F: drivers/perf/arm-ni.c +F: tools/perf/pmu-events/arch/arm64/arm/cmn/ + ARM KOMEDA DRM-KMS DRIVER M: Liviu Dudau liviu.dudau@arm.com S: Supported @@@ -1806,7 -1754,6 +1806,7 @@@ L: dri-devel@lists.freedesktop.or S: Supported T: git https://gitlab.freedesktop.org/drm/misc/kernel.git F: Documentation/gpu/panfrost.rst +F: drivers/gpu/drm/ci/xfails/panfrost* F: drivers/gpu/drm/panfrost/ F: include/uapi/drm/panfrost_drm.h
@@@ -2485,7 -2432,6 +2485,7 @@@ N: lpc18x
ARM/LPC32XX SOC SUPPORT M: Vladimir Zapolskiy vz@mleia.com +M: Piotr Wojtaszczyk piotr.wojtaszczyk@timesys.com L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) S: Maintained T: git git://github.com/vzapolskiy/linux-lpc32xx.git @@@ -2498,14 -2444,6 +2498,14 @@@ F: drivers/usb/host/ohci-nxp. F: drivers/watchdog/pnx4008_wdt.c N: lpc32xx
+LPC32XX DMAMUX SUPPORT +M: J.M.B. Downing jonathan.downing@nautel.com +M: Piotr Wojtaszczyk piotr.wojtaszczyk@timesys.com +R: Vladimir Zapolskiy vz@mleia.com +L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) +S: Maintained +F: Documentation/devicetree/bindings/dma/nxp,lpc3220-dmamux.yaml + ARM/Marvell Dove/MV78xx0/Orion SOC support M: Andrew Lunn andrew@lunn.ch M: Sebastian Hesselbarth sebastian.hesselbarth@gmail.com @@@ -2793,7 -2731,7 +2793,7 @@@ F: drivers/iommu/msm F: drivers/mfd/ssbi.c F: drivers/mmc/host/mmci_qcom* F: drivers/mmc/host/sdhci-msm.c -F: drivers/pci/controller/dwc/pcie-qcom.c +F: drivers/pci/controller/dwc/pcie-qcom* F: drivers/phy/qualcomm/ F: drivers/power/*/msm* F: drivers/reset/reset-qcom-* @@@ -3848,9 -3786,10 +3848,9 @@@ F: Documentation/filesystems/befs.rs F: fs/befs/
BFQ I/O SCHEDULER -M: Paolo Valente paolo.valente@unimore.it -M: Jens Axboe axboe@kernel.dk +M: Yu Kuai yukuai3@huawei.com L: linux-block@vger.kernel.org -S: Maintained +S: Odd Fixes F: Documentation/block/bfq-iosched.rst F: block/bfq-*
@@@ -3997,7 -3936,7 +3997,7 @@@ F: Documentation/devicetree/bindings/ii F: drivers/iio/imu/bmi323/
BPF JIT for ARC -M: Shahab Vahedi shahab@synopsys.com +M: Shahab Vahedi list+bpf@vahedi.org L: bpf@vger.kernel.org S: Maintained F: arch/arc/net/ @@@ -4164,7 -4103,6 +4164,7 @@@ F: include/uapi/linux/btf F: include/uapi/linux/filter.h F: kernel/bpf/ F: kernel/trace/bpf_trace.c +F: lib/buildid.c F: lib/test_bpf.c F: net/bpf/ F: net/core/filter.c @@@ -4285,7 -4223,6 +4285,7 @@@ L: bpf@vger.kernel.or S: Maintained F: kernel/bpf/stackmap.c F: kernel/trace/bpf_trace.c +F: lib/buildid.c
BROADCOM ASP 2.0 ETHERNET DRIVER M: Justin Chen justin.chen@broadcom.com @@@ -5164,8 -5101,10 +5164,8 @@@ F: Documentation/devicetree/bindings/me F: drivers/media/cec/platform/cec-gpio/
CELL BROADBAND ENGINE ARCHITECTURE -M: Arnd Bergmann arnd@arndb.de L: linuxppc-dev@lists.ozlabs.org -S: Supported -W: http://www.ibm.com/developerworks/power/cell/ +S: Orphan F: arch/powerpc/include/asm/cell*.h F: arch/powerpc/include/asm/spu*.h F: arch/powerpc/include/uapi/asm/spu*.h @@@ -5258,7 -5197,7 +5258,7 @@@ F: Documentation/dev-tools/checkpatch.r
CHINESE DOCUMENTATION M: Alex Shi alexs@kernel.org -M: Yanteng Si siyanteng@loongson.cn +M: Yanteng Si si.yanteng@linux.dev S: Maintained F: Documentation/translations/zh_CN/
@@@ -5885,9 -5824,6 +5885,9 @@@ CPU POWER MONITORING SUBSYSTE M: Thomas Renninger trenn@suse.com M: Shuah Khan shuah@kernel.org M: Shuah Khan skhan@linuxfoundation.org +M: John B. Wyatt IV jwyatt@redhat.com +M: John B. Wyatt IV sageofredondo@gmail.com +M: John Kacur jkacur@redhat.com L: linux-pm@vger.kernel.org S: Maintained F: tools/power/cpupower/ @@@ -6570,7 -6506,6 +6570,7 @@@ F: Documentation/devicetree/bindings/re F: Documentation/devicetree/bindings/regulator/dlg,da9*.yaml F: Documentation/devicetree/bindings/regulator/dlg,slg51000.yaml F: Documentation/devicetree/bindings/sound/da[79]*.txt +F: Documentation/devicetree/bindings/sound/dlg,da7213.yaml F: Documentation/devicetree/bindings/thermal/dlg,da9062-thermal.yaml F: Documentation/devicetree/bindings/watchdog/dlg,da9062-watchdog.yaml F: Documentation/hwmon/da90??.rst @@@ -6731,7 -6666,6 +6731,7 @@@ F: drivers/dma-buf/dma-heap. F: drivers/dma-buf/heaps/* F: include/linux/dma-heap.h F: include/uapi/linux/dma-heap.h +F: tools/testing/selftests/dmabuf-heaps/
DMC FREQUENCY DRIVER FOR SAMSUNG EXYNOS5422 M: Lukasz Luba lukasz.luba@arm.com @@@ -6783,7 -6717,6 +6783,7 @@@ DOCUMENTATION PROCES M: Jonathan Corbet corbet@lwn.net L: workflows@vger.kernel.org S: Maintained +F: Documentation/dev-tools/ F: Documentation/maintainer/ F: Documentation/process/
@@@ -6791,7 -6724,6 +6791,7 @@@ DOCUMENTATION REPORTING ISSUE M: Thorsten Leemhuis linux@leemhuis.info L: linux-doc@vger.kernel.org S: Maintained +F: Documentation/admin-guide/bug-bisect.rst F: Documentation/admin-guide/quickly-build-trimmed-linux.rst F: Documentation/admin-guide/reporting-issues.rst F: Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst @@@ -7406,10 -7338,10 +7406,10 @@@ F: drivers/gpu/drm/udl
DRM DRIVER FOR VIRTUAL KERNEL MODESETTING (VKMS) M: Rodrigo Siqueira rodrigosiqueiramelo@gmail.com -M: Melissa Wen melissa.srw@gmail.com M: Maíra Canal mairacanal@riseup.net R: Haneen Mohammed hamohammed.sa@gmail.com -R: Daniel Vetter daniel@ffwll.ch +R: Simona Vetter simona@ffwll.ch +R: Melissa Wen melissa.srw@gmail.com L: dri-devel@lists.freedesktop.org S: Maintained T: git https://gitlab.freedesktop.org/drm/misc/kernel.git @@@ -7442,7 -7374,7 +7442,7 @@@ F: drivers/gpu/drm/panel/panel-widechip
DRM DRIVERS M: David Airlie airlied@gmail.com -M: Daniel Vetter daniel@ffwll.ch +M: Simona Vetter simona@ffwll.ch L: dri-devel@lists.freedesktop.org S: Maintained B: https://gitlab.freedesktop.org/drm @@@ -7538,6 -7470,7 +7538,6 @@@ M: Kyungmin Park <kyungmin.park@samsung L: dri-devel@lists.freedesktop.org S: Supported T: git git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos.git -F: Documentation/devicetree/bindings/display/exynos/ F: Documentation/devicetree/bindings/display/samsung/ F: drivers/gpu/drm/exynos/ F: include/uapi/drm/exynos_drm.h @@@ -8412,7 -8345,6 +8412,7 @@@ F: include/linux/mii. F: include/linux/of_net.h F: include/linux/phy.h F: include/linux/phy_fixed.h +F: include/linux/phy_link_topology.h F: include/linux/phylib_stubs.h F: include/linux/platform_data/mdio-bcm-unimac.h F: include/linux/platform_data/mdio-gpio.h @@@ -8428,7 -8360,6 +8428,7 @@@ L: netdev@vger.kernel.or L: rust-for-linux@vger.kernel.org S: Maintained F: rust/kernel/net/phy.rs +F: rust/kernel/net/phy/reg.rs
EXEC & BINFMT API, ELF R: Eric Biederman ebiederm@xmission.com @@@ -8536,13 -8467,6 +8536,13 @@@ F: lib/bootconfig. F: tools/bootconfig/* F: tools/bootconfig/scripts/*
+EXTRON DA HD 4K PLUS CEC DRIVER +M: Hans Verkuil hverkuil@xs4all.nl +L: linux-media@vger.kernel.org +S: Maintained +T: git git://linuxtv.org/media_tree.git +F: drivers/media/cec/usb/extron-da-hd-4k-plus/ + EXYNOS DP DRIVER M: Jingoo Han jingoohan1@gmail.com L: dri-devel@lists.freedesktop.org @@@ -8616,9 -8540,8 +8616,9 @@@ F: drivers/net/wan/farsync. FAULT INJECTION SUPPORT M: Akinobu Mita akinobu.mita@gmail.com S: Supported -F: Documentation/fault-injection/ +F: Documentation/dev-tools/fault-injection/ F: lib/fault-inject.c +F: tools/testing/fault-injection/
FBTFT Framebuffer drivers L: dri-devel@lists.freedesktop.org @@@ -8680,7 -8603,6 +8680,7 @@@ M: Christian Brauner <brauner@kernel.or R: Jan Kara jack@suse.cz L: linux-fsdevel@vger.kernel.org S: Maintained +T: git https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git F: fs/* F: include/linux/fs.h F: include/linux/fs_types.h @@@ -8893,7 -8815,7 +8893,7 @@@ W: https://floatingpoint.billm.au F: arch/x86/math-emu/
FRAMEBUFFER CORE -M: Daniel Vetter daniel@ffwll.ch +M: Simona Vetter simona@ffwll.ch S: Odd Fixes T: git https://gitlab.freedesktop.org/drm/misc/kernel.git F: drivers/video/fbdev/core/ @@@ -9090,7 -9012,6 +9090,7 @@@ M: Herve Codina <herve.codina@bootlin.c L: linuxppc-dev@lists.ozlabs.org S: Maintained F: Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml +F: Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,qe-ucc-qmc.yaml F: drivers/soc/fsl/qe/qmc.c F: include/soc/fsl/qe/qmc.h
@@@ -9106,11 -9027,9 +9106,11 @@@ M: Herve Codina <herve.codina@bootlin.c L: linuxppc-dev@lists.ozlabs.org S: Maintained F: Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-tsa.yaml +F: Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,qe-tsa.yaml F: drivers/soc/fsl/qe/tsa.c F: drivers/soc/fsl/qe/tsa.h F: include/dt-bindings/soc/cpm1-fsl,tsa.h +F: include/dt-bindings/soc/qe-fsl,tsa.h
FREESCALE QUICC ENGINE UCC ETHERNET DRIVER L: netdev@vger.kernel.org @@@ -11060,7 -10979,6 +11060,7 @@@ T: git https://gitlab.freedesktop.org/d F: Documentation/devicetree/bindings/gpu/img,powervr-rogue.yaml F: Documentation/devicetree/bindings/gpu/img,powervr-sgx.yaml F: Documentation/gpu/imagination/ +F: drivers/gpu/drm/ci/xfails/powervr* F: drivers/gpu/drm/imagination/ F: include/uapi/drm/pvr_drm.h
@@@ -11186,17 -11104,10 +11186,17 @@@ F: Documentation/devicetree/bindings/se F: Documentation/input/ F: drivers/input/ F: include/dt-bindings/input/ +F: include/linux/gameport.h +F: include/linux/i8042.h F: include/linux/input.h F: include/linux/input/ +F: include/linux/libps2.h +F: include/linux/serio.h +F: include/uapi/linux/gameport.h F: include/uapi/linux/input-event-codes.h F: include/uapi/linux/input.h +F: include/uapi/linux/serio.h +F: include/uapi/linux/uinput.h
INPUT MULTITOUCH (MT) PROTOCOL M: Henrik Rydberg rydberg@bitmath.org @@@ -11223,16 -11134,6 +11223,16 @@@ T: git git://git.kernel.org/pub/scm/lin F: security/integrity/ F: security/integrity/ima/
+INTEGRITY POLICY ENFORCEMENT (IPE) +M: Fan Wu wufan@linux.microsoft.com +L: linux-security-module@vger.kernel.org +S: Supported +T: git https://github.com/microsoft/ipe.git +F: Documentation/admin-guide/LSM/ipe.rst +F: Documentation/security/ipe.rst +F: scripts/ipe/ +F: security/ipe/ + INTEL 810/815 FRAMEBUFFER DRIVER M: Antonino Daplas adaplas@gmail.com L: linux-fbdev@vger.kernel.org @@@ -11837,7 -11738,6 +11837,7 @@@ T: git git://git.kernel.org/pub/scm/lin F: drivers/iommu/dma-iommu.c F: drivers/iommu/dma-iommu.h F: drivers/iommu/iova.c +F: include/linux/iommu-dma.h F: include/linux/iova.h
IOMMU SUBSYSTEM @@@ -12408,7 -12308,6 +12408,7 @@@ L: kvm@vger.kernel.or L: loongarch@lists.linux.dev S: Maintained T: git git://git.kernel.org/pub/scm/virt/kvm/kvm.git +F: Documentation/virt/kvm/loongarch/ F: arch/loongarch/include/asm/kvm* F: arch/loongarch/include/uapi/asm/kvm* F: arch/loongarch/kvm/ @@@ -13618,7 -13517,7 +13618,7 @@@ S: Maintaine F: Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml F: drivers/input/misc/88pm886-onkey.c F: drivers/mfd/88pm886.c -F: drivers/regulators/88pm886-regulator.c +F: drivers/regulator/88pm886-regulator.c F: include/linux/mfd/88pm886.h
MARVELL ARMADA 3700 PHY DRIVERS @@@ -13677,6 -13576,7 +13677,6 @@@ M: Sebastian Hesselbarth <sebastian.hes L: netdev@vger.kernel.org S: Maintained F: drivers/net/ethernet/marvell/mv643xx_eth.* -F: include/linux/mv643xx.h
MARVELL MV88X3310 PHY DRIVER M: Russell King linux@armlinux.org.uk @@@ -14328,8 -14228,8 +14328,8 @@@ M: Sean Wang <sean.wang@mediatek.com L: linux-bluetooth@vger.kernel.org L: linux-mediatek@lists.infradead.org (moderated for non-subscribers) S: Maintained +F: Documentation/devicetree/bindings/net/bluetooth/mediatek,bluetooth.txt F: Documentation/devicetree/bindings/net/bluetooth/mediatek,mt7921s-bluetooth.yaml -F: Documentation/devicetree/bindings/net/mediatek-bluetooth.txt F: drivers/bluetooth/btmtkuart.c
MEDIATEK BOARD LEVEL SHUTDOWN DRIVERS @@@ -14607,7 -14507,7 +14607,7 @@@ MELLANOX ETHERNET DRIVER (mlx4_en M: Tariq Toukan tariqt@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlx4/en_*
@@@ -14616,7 -14516,7 +14616,7 @@@ M: Saeed Mahameed <saeedm@nvidia.com M: Tariq Toukan tariqt@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlx5/core/en_*
@@@ -14624,7 -14524,7 +14624,7 @@@ MELLANOX ETHERNET INNOVA DRIVER R: Boris Pismenny borisp@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlx5/core/en_accel/* F: drivers/net/ethernet/mellanox/mlx5/core/fpga/* @@@ -14635,7 -14535,7 +14635,7 @@@ M: Ido Schimmel <idosch@nvidia.com M: Petr Machata petrm@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlxsw/ F: tools/testing/selftests/drivers/net/mlxsw/ @@@ -14644,7 -14544,7 +14644,7 @@@ MELLANOX FIRMWARE FLASH LIBRARY (mlxfw M: mlxsw@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlxfw/
@@@ -14663,7 -14563,7 +14663,7 @@@ M: Tariq Toukan <tariqt@nvidia.com L: netdev@vger.kernel.org L: linux-rdma@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlx4/ F: include/linux/mlx4/ @@@ -14672,7 -14572,7 +14672,7 @@@ MELLANOX MLX4 IB drive M: Yishai Hadas yishaih@nvidia.com L: linux-rdma@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: http://patchwork.kernel.org/project/linux-rdma/list/ F: drivers/infiniband/hw/mlx4/ F: include/linux/mlx4/ @@@ -14685,7 -14585,7 +14685,7 @@@ M: Tariq Toukan <tariqt@nvidia.com L: netdev@vger.kernel.org L: linux-rdma@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: Documentation/networking/device_drivers/ethernet/mellanox/ F: drivers/net/ethernet/mellanox/mlx5/core/ @@@ -14695,7 -14595,7 +14695,7 @@@ MELLANOX MLX5 IB drive M: Leon Romanovsky leonro@nvidia.com L: linux-rdma@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: http://patchwork.kernel.org/project/linux-rdma/list/ F: drivers/infiniband/hw/mlx5/ F: include/linux/mlx5/ @@@ -14923,7 -14823,6 +14923,7 @@@ M: Alexander Duyck <alexanderduyck@fb.c M: Jakub Kicinski kuba@kernel.org R: kernel-team@meta.com S: Supported +F: Documentation/networking/device_drivers/ethernet/meta/ F: drivers/net/ethernet/meta/
METHODE UDPU SUPPORT @@@ -15070,13 -14969,6 +15070,13 @@@ L: netdev@vger.kernel.or S: Maintained F: drivers/net/ethernet/microchip/lan743x_*
+MICROCHIP LAN8650/1 10BASE-T1S MACPHY ETHERNET DRIVER +M: Parthiban Veerasooran parthiban.veerasooran@microchip.com +L: netdev@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/net/microchip,lan8650.yaml +F: drivers/net/ethernet/microchip/lan865x/lan865x.c + MICROCHIP LAN87xx/LAN937x T1 PHY DRIVER M: Arun Ramadoss arun.ramadoss@microchip.com R: UNGLinuxDriver@microchip.com @@@ -15326,12 -15218,6 +15326,12 @@@ S: Maintaine F: Documentation/hwmon/surface_fan.rst F: drivers/hwmon/surface_fan.c
+MICROSOFT SURFACE SENSOR THERMAL DRIVER +M: Maximilian Luz luzmaximilian@gmail.com +L: linux-hwmon@vger.kernel.org +S: Maintained +F: drivers/hwmon/surface_temp.c + MICROSOFT SURFACE GPE LID SUPPORT DRIVER M: Maximilian Luz luzmaximilian@gmail.com L: platform-driver-x86@vger.kernel.org @@@ -15584,9 -15470,6 +15584,9 @@@ F: include/dt-bindings/clock/mobileye,e
MODULE SUPPORT M: Luis Chamberlain mcgrof@kernel.org +R: Petr Pavlu petr.pavlu@suse.com +R: Sami Tolvanen samitolvanen@google.com +R: Daniel Gomez da.gomez@samsung.com L: linux-modules@vger.kernel.org L: linux-kernel@vger.kernel.org S: Maintained @@@ -15905,7 -15788,6 +15905,7 @@@ M: Breno Leitao <leitao@debian.org S: Maintained F: Documentation/networking/netconsole.rst F: drivers/net/netconsole.c +F: tools/testing/selftests/drivers/net/netcons_basic.sh
NETDEVSIM M: Jakub Kicinski kuba@kernel.org @@@ -16949,7 -16831,6 +16949,7 @@@ OMNIVISION OG01A1B SENSOR DRIVE M: Sakari Ailus sakari.ailus@linux.intel.com L: linux-media@vger.kernel.org S: Maintained +F: Documentation/devicetree/bindings/media/i2c/ovti,og01a1b.yaml F: drivers/media/i2c/og01a1b.c
OMNIVISION OV01A10 SENSOR DRIVER @@@ -17220,14 -17101,6 +17220,14 @@@ L: linux-rdma@vger.kernel.or S: Supported F: drivers/infiniband/ulp/opa_vnic
+OPEN ALLIANCE 10BASE-T1S MACPHY SERIAL INTERFACE FRAMEWORK +M: Parthiban Veerasooran parthiban.veerasooran@microchip.com +L: netdev@vger.kernel.org +S: Maintained +F: Documentation/networking/oa-tc6-framework.rst +F: drivers/include/linux/oa_tc6.h +F: drivers/net/ethernet/oa_tc6.c + OPEN FIRMWARE AND FLATTENED DEVICE TREE M: Rob Herring robh@kernel.org M: Saravana Kannan saravanak@google.com @@@ -17539,7 -17412,7 +17539,7 @@@ PCI DRIVER FOR ALTERA PCIE I M: Joyce Ooi joyce.ooi@intel.com L: linux-pci@vger.kernel.org S: Supported -F: Documentation/devicetree/bindings/pci/altera-pcie.txt +F: Documentation/devicetree/bindings/pci/altr,pcie-root-port.yaml F: drivers/pci/controller/pcie-altera.c
PCI DRIVER FOR APPLIEDMICRO XGENE @@@ -17771,7 -17644,7 +17771,7 @@@ PCI MSI DRIVER FOR ALTERA MSI I M: Joyce Ooi joyce.ooi@intel.com L: linux-pci@vger.kernel.org S: Supported -F: Documentation/devicetree/bindings/pci/altera-pcie-msi.txt +F: Documentation/devicetree/bindings/pci/altr,msi-controller.yaml F: drivers/pci/controller/pcie-altera-msi.c
PCI MSI DRIVER FOR APPLIEDMICRO XGENE @@@ -17924,7 -17797,6 +17924,7 @@@ M: Manivannan Sadhasivam <manivannan.sa L: linux-pci@vger.kernel.org L: linux-arm-msm@vger.kernel.org S: Maintained +F: drivers/pci/controller/dwc/pcie-qcom-common.c F: drivers/pci/controller/dwc/pcie-qcom.c
PCIE DRIVER FOR ROCKCHIP @@@ -17961,7 -17833,6 +17961,7 @@@ L: linux-pci@vger.kernel.or L: linux-arm-msm@vger.kernel.org S: Maintained F: Documentation/devicetree/bindings/pci/qcom,pcie-ep.yaml +F: drivers/pci/controller/dwc/pcie-qcom-common.c F: drivers/pci/controller/dwc/pcie-qcom-ep.c
PCMCIA SUBSYSTEM @@@ -18917,7 -18788,7 +18917,7 @@@ M: Bryan O'Donoghue <bryan.odonoghue@li L: linux-media@vger.kernel.org S: Maintained F: Documentation/admin-guide/media/qcom_camss.rst -F: Documentation/devicetree/bindings/media/*camss* +F: Documentation/devicetree/bindings/media/qcom,*camss* F: drivers/media/platform/qcom/camss/
QUALCOMM CLOCK DRIVERS @@@ -18932,6 -18803,7 +18932,6 @@@ F: include/dt-bindings/clock/qcom, QUALCOMM CLOUD AI (QAIC) DRIVER M: Jeffrey Hugo quic_jhugo@quicinc.com R: Carl Vanderlip quic_carlv@quicinc.com -R: Pranjal Ramajor Asha Kanojiya quic_pkanojiy@quicinc.com L: linux-arm-msm@vger.kernel.org L: dri-devel@lists.freedesktop.org S: Supported @@@ -19026,7 -18898,6 +19026,7 @@@ L: linux-arm-msm@vger.kernel.or S: Maintained F: Documentation/devicetree/bindings/interconnect/qcom,msm8998-bwmon.yaml F: drivers/soc/qcom/icc-bwmon.c +F: drivers/soc/qcom/trace_icc-bwmon.h
QUALCOMM IOMMU M: Rob Clark robdclark@gmail.com @@@ -19366,7 -19237,6 +19366,7 @@@ S: Supporte W: https://oss.oracle.com/projects/rds/ F: Documentation/networking/rds.rst F: net/rds/ +F: tools/testing/selftests/net/rds/
RDT - RESOURCE ALLOCATION M: Fenghua Yu fenghua.yu@intel.com @@@ -19826,10 -19696,12 +19826,10 @@@ L: linux-riscv@lists.infradead.or S: Maintained Q: https://patchwork.kernel.org/project/linux-riscv/list/ T: git https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux.git/ -F: Documentation/devicetree/bindings/riscv/ -F: arch/riscv/boot/dts/ -X: arch/riscv/boot/dts/allwinner/ -X: arch/riscv/boot/dts/renesas/ -X: arch/riscv/boot/dts/sophgo/ -X: arch/riscv/boot/dts/thead/ +F: arch/riscv/boot/dts/canaan/ +F: arch/riscv/boot/dts/microchip/ +F: arch/riscv/boot/dts/sifive/ +F: arch/riscv/boot/dts/starfive/
RISC-V PMU DRIVERS M: Atish Patra atishp@atishpatra.org @@@ -19867,14 -19739,6 +19867,14 @@@ F: Documentation/ABI/*/sysfs-driver-hid F: drivers/hid/hid-roccat* F: include/linux/hid-roccat*
+ROCKCHIP CAN-FD DRIVER +M: Marc Kleine-Budde mkl@pengutronix.de +R: kernel@pengutronix.de +L: linux-can@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/net/can/rockchip,rk3568v2-canfd.yaml +F: drivers/net/can/rockchip/ + ROCKCHIP CRYPTO DRIVERS M: Corentin Labbe clabbe@baylibre.com L: linux-crypto@vger.kernel.org @@@ -19901,13 -19765,6 +19901,13 @@@ F: Documentation/userspace-api/media/v4 F: drivers/media/platform/rockchip/rkisp1 F: include/uapi/linux/rkisp1-config.h
+ROCKCHIP RK3568 RANDOM NUMBER GENERATOR SUPPORT +M: Daniel Golle daniel@makrotopia.org +M: Aurelien Jarno aurelien@aurel32.net +S: Maintained +F: Documentation/devicetree/bindings/rng/rockchip,rk3568-rng.yaml +F: drivers/char/hw_random/rockchip-rng.c + ROCKCHIP RASTER 2D GRAPHIC ACCELERATION UNIT DRIVER M: Jacob Chen jacob-chen@iotwrt.com M: Ezequiel Garcia ezequiel@vanguardiasur.com.ar @@@ -20024,26 -19881,12 +20024,26 @@@ T: git git://linuxtv.org/media_tree.gi F: Documentation/devicetree/bindings/media/allwinner,sun8i-a83t-de2-rotate.yaml F: drivers/media/platform/sunxi/sun8i-rotate/
+RPMB SUBSYSTEM +M: Jens Wiklander jens.wiklander@linaro.org +L: linux-kernel@vger.kernel.org +S: Supported +F: drivers/misc/rpmb-core.c +F: include/linux/rpmb.h + RPMSG TTY DRIVER M: Arnaud Pouliquen arnaud.pouliquen@foss.st.com L: linux-remoteproc@vger.kernel.org S: Maintained F: drivers/tty/rpmsg_tty.c
+RTASE ETHERNET DRIVER +M: Justin Lai justinlai0215@realtek.com +M: Larry Chiu larry.chiu@realtek.com +L: netdev@vger.kernel.org +S: Maintained +F: drivers/net/ethernet/realtek/rtase/ + RTL2830 MEDIA DRIVER L: linux-media@vger.kernel.org S: Orphan @@@ -20308,16 -20151,6 +20308,16 @@@ B: mailto:linux-samsung-soc@vger.kernel F: Documentation/devicetree/bindings/sound/samsung* F: sound/soc/samsung/
+SAMSUNG EXYNOS850 SoC SUPPORT +M: Sam Protsenko semen.protsenko@linaro.org +L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) +L: linux-samsung-soc@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/clock/samsung,exynos850-clock.yaml +F: arch/arm64/boot/dts/exynos/exynos850* +F: drivers/clk/samsung/clk-exynos850.c +F: include/dt-bindings/clock/exynos850.h + SAMSUNG EXYNOS PSEUDO RANDOM NUMBER GENERATOR (RNG) DRIVER M: Krzysztof Kozlowski krzk@kernel.org L: linux-crypto@vger.kernel.org @@@ -21705,8 -21538,10 +21705,8 @@@ F: include/linux/spmi. F: include/trace/events/spmi.h
SPU FILE SYSTEM -M: Jeremy Kerr jk@ozlabs.org L: linuxppc-dev@lists.ozlabs.org -S: Supported -W: http://www.ibm.com/developerworks/power/cell/ +S: Orphan F: Documentation/filesystems/spufs/spufs.rst F: arch/powerpc/platforms/cell/spufs/
@@@ -22635,7 -22470,6 +22635,7 @@@ M: Jens Wiklander <jens.wiklander@linar R: Sumit Garg sumit.garg@linaro.org L: op-tee@lists.trustedfirmware.org S: Maintained +F: Documentation/ABI/testing/sysfs-class-tee F: Documentation/driver-api/tee.rst F: Documentation/tee/ F: Documentation/userspace-api/tee.rst @@@ -22681,7 -22515,6 +22681,7 @@@ M: Thierry Reding <thierry.reding@gmail R: Krishna Reddy vdumpa@nvidia.com L: linux-tegra@vger.kernel.org S: Supported +F: drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c F: drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c F: drivers/iommu/tegra*
@@@ -22787,11 -22620,12 +22787,11 @@@ F: Documentation/devicetree/bindings/so F: Documentation/devicetree/bindings/sound/ti,tas2562.yaml F: Documentation/devicetree/bindings/sound/ti,tas2770.yaml F: Documentation/devicetree/bindings/sound/ti,tas27xx.yaml +F: Documentation/devicetree/bindings/sound/ti,tpa6130a2.yaml F: Documentation/devicetree/bindings/sound/ti,pcm1681.yaml F: Documentation/devicetree/bindings/sound/ti,pcm3168a.yaml F: Documentation/devicetree/bindings/sound/ti,tlv320*.yaml F: Documentation/devicetree/bindings/sound/ti,tlv320adcx140.yaml -F: Documentation/devicetree/bindings/sound/tlv320aic31xx.txt -F: Documentation/devicetree/bindings/sound/tpa6130a2.txt F: include/sound/tas2*.h F: include/sound/tlv320*.h F: include/sound/tpa6130a2-plat.h @@@ -23369,7 -23203,6 +23369,7 @@@ Q: https://patchwork.kernel.org/project T: git git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd.git F: Documentation/devicetree/bindings/tpm/ F: drivers/char/tpm/ +F: tools/testing/selftests/tpm2/
TPS546D24 DRIVER M: Duke Du dukedu83@gmail.com @@@ -23382,8 -23215,9 +23382,8 @@@ TQ SYSTEMS BOARD & DRIVER SUPPOR L: linux@ew.tq-group.com S: Supported W: https://www.tq-group.com/en/products/tq-embedded/ -F: arch/arm/boot/dts/imx*mba*.dts* -F: arch/arm/boot/dts/imx*tqma*.dts* -F: arch/arm/boot/dts/mba*.dtsi +F: arch/arm/boot/dts/nxp/imx/*mba*.dts* +F: arch/arm/boot/dts/nxp/imx/*tqma*.dts* F: arch/arm64/boot/dts/freescale/fsl-*tqml*.dts* F: arch/arm64/boot/dts/freescale/imx*mba*.dts* F: arch/arm64/boot/dts/freescale/imx*tqma*.dts* @@@ -24591,20 -24425,6 +24591,20 @@@ F: include/uapi/linux/vsockmon. F: net/vmw_vsock/ F: tools/testing/vsock/
+VMA +M: Andrew Morton akpm@linux-foundation.org +R: Liam R. Howlett Liam.Howlett@oracle.com +R: Vlastimil Babka vbabka@suse.cz +R: Lorenzo Stoakes lorenzo.stoakes@oracle.com +L: linux-mm@kvack.org +S: Maintained +W: https://www.linux-mm.org +T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm +F: mm/vma.c +F: mm/vma.h +F: mm/vma_internal.h +F: tools/testing/vma/ + VMALLOC M: Andrew Morton akpm@linux-foundation.org R: Uladzislau Rezki urezki@gmail.com @@@ -24994,6 -24814,17 +24994,17 @@@ T: git git://git.kernel.org/pub/scm/lin F: Documentation/arch/x86/ F: Documentation/devicetree/bindings/x86/ F: arch/x86/ + F: tools/testing/selftests/x86 + + X86 CPUID DATABASE + M: Borislav Petkov bp@alien8.de + M: Thomas Gleixner tglx@linutronix.de + M: x86@kernel.org + R: Ahmed S. Darwish darwi@linutronix.de + L: x86-cpuid@lists.linux.dev + S: Maintained + W: https://x86-cpuid.org + F: tools/arch/x86/kcpuid/cpuid.csv
X86 ENTRY CODE M: Andy Lutomirski luto@kernel.org @@@ -25436,19 -25267,6 +25447,19 @@@ S: Maintaine F: drivers/spi/spi-xtensa-xtfpga.c F: sound/soc/xtensa/xtfpga-i2s.c
+XZ EMBEDDED +M: Lasse Collin lasse.collin@tukaani.org +S: Maintained +W: https://tukaani.org/xz/embedded.html +B: https://github.com/tukaani-project/xz-embedded/issues +C: irc://irc.libera.chat/tukaani +F: Documentation/staging/xz.rst +F: include/linux/decompress/unxz.h +F: include/linux/xz.h +F: lib/decompress_unxz.c +F: lib/xz/ +F: scripts/xz_wrap.sh + YAM DRIVER FOR AX.25 M: Jean-Paul Roubelat jpr@f6fbb.org L: linux-hams@vger.kernel.org @@@ -25473,6 -25291,7 +25484,6 @@@ F: tools/net/ynl
YEALINK PHONE DRIVER M: Henk Vergonet Henk.Vergonet@gmail.com -L: usbb2k-api-dev@nongnu.org S: Maintained F: Documentation/input/devices/yealink.rst F: drivers/input/misc/yealink.* diff --combined arch/arm64/Kconfig index 3943898f62c93,e68ea648e085b..a77453de94bc5 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@@ -24,7 -24,6 +24,7 @@@ config ARM6 select ARCH_HAS_CURRENT_STACK_POINTER select ARCH_HAS_DEBUG_VIRTUAL select ARCH_HAS_DEBUG_VM_PGTABLE + select ARCH_HAS_DMA_OPS if XEN select ARCH_HAS_DMA_PREP_COHERENT select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_FAST_MULTIPLIER @@@ -35,7 -34,6 +35,7 @@@ select ARCH_HAS_KERNEL_FPU_SUPPORT if KERNEL_MODE_NEON select ARCH_HAS_KEEPINITRD select ARCH_HAS_MEMBARRIER_SYNC_CORE + select ARCH_HAS_MEM_ENCRYPT select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_PTE_DEVMAP @@@ -101,7 -99,7 +101,8 @@@ select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_SUPPORTS_PAGE_TABLE_CHECK select ARCH_SUPPORTS_PER_VMA_LOCK + select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE + select ARCH_SUPPORTS_RT select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT select ARCH_WANT_DEFAULT_BPF_JIT @@@ -426,7 -424,7 +427,7 @@@ config AMPERE_ERRATUM_AC03_CPU_3 default y help This option adds an alternative code sequence to work around Ampere - erratum AC03_CPU_38 on AmpereOne. + errata AC03_CPU_38 and AC04_CPU_10 on AmpereOne.
The affected design reports FEAT_HAFDBS as not implemented in ID_AA64MMFR1_EL1.HAFDBS, but (V)TCR_ELx.{HA,HD} are not RES0 @@@ -2103,8 -2101,7 +2104,8 @@@ config ARM64_MT depends on ARM64_PAN select ARCH_HAS_SUBPAGE_FAULTS select ARCH_USES_HIGH_VMA_FLAGS - select ARCH_USES_PG_ARCH_X + select ARCH_USES_PG_ARCH_2 + select ARCH_USES_PG_ARCH_3 help Memory Tagging (part of the ARMv8.5 Extensions) provides architectural support for run-time, always-on detection of @@@ -2141,29 -2138,6 +2142,29 @@@ config ARM64_EPA if the cpu does not implement the feature. endmenu # "ARMv8.7 architectural features"
+menu "ARMv8.9 architectural features" + +config ARM64_POE + prompt "Permission Overlay Extension" + def_bool y + select ARCH_USES_HIGH_VMA_FLAGS + select ARCH_HAS_PKEYS + help + The Permission Overlay Extension is used to implement Memory + Protection Keys. Memory Protection Keys provides a mechanism for + enforcing page-based protections, but without requiring modification + of the page tables when an application changes protection domains. + + For details, see Documentation/core-api/protection-keys.rst + + If unsure, say y. + +config ARCH_PKEY_BITS + int + default 3 + +endmenu # "ARMv8.9 architectural features" + config ARM64_SVE bool "ARM Scalable Vector Extension support" default y diff --combined arch/loongarch/include/asm/irq.h index ce85d4c7d225d,9c2ca785faa9b..a0ca84da8541d --- a/arch/loongarch/include/asm/irq.h +++ b/arch/loongarch/include/asm/irq.h @@@ -39,11 -39,22 +39,22 @@@ void spurious_interrupt(void)
#define NR_IRQS_LEGACY 16
+ /* + * 256 Vectors Mapping for AVECINTC: + * + * 0 - 15: Mapping classic IPs, e.g. IP0-12. + * 16 - 255: Mapping vectors for external IRQ. + * + */ + #define NR_VECTORS 256 + #define NR_LEGACY_VECTORS 16 + #define IRQ_MATRIX_BITS NR_VECTORS + #define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace void arch_trigger_cpumask_backtrace(const struct cpumask *mask, int exclude_cpu);
#define MAX_IO_PICS 2 - #define NR_IRQS (64 + (256 * MAX_IO_PICS)) + #define NR_IRQS (64 + NR_VECTORS * (NR_CPUS + MAX_IO_PICS))
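This change grows the IRQ descriptor space from a fixed per-PIC budget to one that scales with the CPU count, matching the AVECINTC mapping comment above (256 vectors per CPU, 16 of them legacy). The standalone snippet below only re-does that arithmetic; the NR_CPUS value is an assumed example, not something taken from this patch.

#include <stdio.h>

#define NR_VECTORS   256   /* per-CPU vector space from the comment above */
#define MAX_IO_PICS  2     /* unchanged by this patch */
#define NR_CPUS      64    /* assumed example configuration */

int main(void)
{
	int old_nr_irqs = 64 + (256 * MAX_IO_PICS);                  /* 576 */
	int new_nr_irqs = 64 + NR_VECTORS * (NR_CPUS + MAX_IO_PICS); /* 16960 */

	printf("old NR_IRQS = %d\n", old_nr_irqs);
	printf("new NR_IRQS = %d\n", new_nr_irqs);
	return 0;
}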
struct acpi_vector_group { int node; @@@ -54,7 -65,6 +65,7 @@@ extern struct acpi_vector_group pch_gro extern struct acpi_vector_group msi_group[MAX_IO_PICS];
#define CORES_PER_EIO_NODE 4 +#define CORES_PER_VEIO_NODE 256
#define LOONGSON_CPU_UART0_VEC 10 /* CPU UART0 */ #define LOONGSON_CPU_THSENS_VEC 14 /* CPU Thsens */ @@@ -66,7 -76,7 +77,7 @@@ #define LOONGSON_LPC_LAST_IRQ (LOONGSON_LPC_IRQ_BASE + 15)
#define LOONGSON_CPU_IRQ_BASE 16 - #define LOONGSON_CPU_LAST_IRQ (LOONGSON_CPU_IRQ_BASE + 14) + #define LOONGSON_CPU_LAST_IRQ (LOONGSON_CPU_IRQ_BASE + 15)
#define LOONGSON_PCH_IRQ_BASE 64 #define LOONGSON_PCH_ACPI_IRQ (LOONGSON_PCH_IRQ_BASE + 47) @@@ -89,20 -99,8 +100,8 @@@ struct acpi_madt_bio_pic struct acpi_madt_msi_pic; struct acpi_madt_lpc_pic;
- int liointc_acpi_init(struct irq_domain *parent, - struct acpi_madt_lio_pic *acpi_liointc); - int eiointc_acpi_init(struct irq_domain *parent, - struct acpi_madt_eio_pic *acpi_eiointc); - - int htvec_acpi_init(struct irq_domain *parent, - struct acpi_madt_ht_pic *acpi_htvec); - int pch_lpc_acpi_init(struct irq_domain *parent, - struct acpi_madt_lpc_pic *acpi_pchlpc); - int pch_msi_acpi_init(struct irq_domain *parent, - struct acpi_madt_msi_pic *acpi_pchmsi); - int pch_pic_acpi_init(struct irq_domain *parent, - struct acpi_madt_bio_pic *acpi_pchpic); - int find_pch_pic(u32 gsi); + void complete_irq_moving(void); + struct fwnode_handle *get_pch_msi_handle(int pci_segment);
extern struct acpi_madt_lio_pic *acpi_liointc; diff --combined arch/loongarch/include/asm/loongarch.h index 24a3f4925cfb2,631d249b3ef26..04bf1a7f903a2 --- a/arch/loongarch/include/asm/loongarch.h +++ b/arch/loongarch/include/asm/loongarch.h @@@ -119,7 -119,6 +119,7 @@@ #define CPUCFG6_PMP BIT(0) #define CPUCFG6_PAMVER GENMASK(3, 1) #define CPUCFG6_PMNUM GENMASK(7, 4) +#define CPUCFG6_PMNUM_SHIFT 4 #define CPUCFG6_PMBITS GENMASK(13, 8) #define CPUCFG6_UPM BIT(14)
@@@ -161,8 -160,16 +161,8 @@@
/* * CPUCFG index area: 0x40000000 -- 0x400000ff - * SW emulation for KVM hypervirsor + * SW emulation for KVM hypervirsor, see arch/loongarch/include/uapi/asm/kvm_para.h */ -#define CPUCFG_KVM_BASE 0x40000000 -#define CPUCFG_KVM_SIZE 0x100 - -#define CPUCFG_KVM_SIG (CPUCFG_KVM_BASE + 0) -#define KVM_SIGNATURE "KVM\0" -#define CPUCFG_KVM_FEATURE (CPUCFG_KVM_BASE + 4) -#define KVM_FEATURE_IPI BIT(1) -#define KVM_FEATURE_STEAL_TIME BIT(2)
#ifndef __ASSEMBLY__
@@@ -246,8 -253,8 +246,8 @@@ #define CSR_ESTAT_EXC_WIDTH 6 #define CSR_ESTAT_EXC (_ULCAST_(0x3f) << CSR_ESTAT_EXC_SHIFT) #define CSR_ESTAT_IS_SHIFT 0 - #define CSR_ESTAT_IS_WIDTH 14 - #define CSR_ESTAT_IS (_ULCAST_(0x3fff) << CSR_ESTAT_IS_SHIFT) + #define CSR_ESTAT_IS_WIDTH 15 + #define CSR_ESTAT_IS (_ULCAST_(0x7fff) << CSR_ESTAT_IS_SHIFT)
#define LOONGARCH_CSR_ERA 0x6 /* ERA */
@@@ -642,6 -649,13 +642,13 @@@
#define LOONGARCH_CSR_CTAG 0x98 /* TagLo + TagHi */
+ #define LOONGARCH_CSR_ISR0 0xa0 + #define LOONGARCH_CSR_ISR1 0xa1 + #define LOONGARCH_CSR_ISR2 0xa2 + #define LOONGARCH_CSR_ISR3 0xa3 + + #define LOONGARCH_CSR_IRR 0xa4 + #define LOONGARCH_CSR_PRID 0xc0
/* Shadow MCSR : 0xc0 ~ 0xff */ @@@ -1004,7 -1018,7 +1011,7 @@@ /* * CSR_ECFG IM */ - #define ECFG0_IM 0x00001fff + #define ECFG0_IM 0x00005fff #define ECFGB_SIP0 0 #define ECFGF_SIP0 (_ULCAST_(1) << ECFGB_SIP0) #define ECFGB_SIP1 1 @@@ -1047,6 -1061,7 +1054,7 @@@ #define IOCSRF_EIODECODE BIT_ULL(9) #define IOCSRF_FLATMODE BIT_ULL(10) #define IOCSRF_VM BIT_ULL(11) + #define IOCSRF_AVEC BIT_ULL(15)
#define LOONGARCH_IOCSR_VENDOR 0x10
@@@ -1058,6 -1073,7 +1066,7 @@@ #define IOCSR_MISC_FUNC_SOFT_INT BIT_ULL(10) #define IOCSR_MISC_FUNC_TIMER_RESET BIT_ULL(21) #define IOCSR_MISC_FUNC_EXT_IOI_EN BIT_ULL(48) + #define IOCSR_MISC_FUNC_AVEC_EN BIT_ULL(51)
#define LOONGARCH_IOCSR_CPUTEMP 0x428
@@@ -1380,9 -1396,10 +1389,10 @@@ __BUILD_CSR_OP(tlbidx #define INT_TI 11 /* Timer */ #define INT_IPI 12 #define INT_NMI 13 + #define INT_AVEC 14
/* ExcCodes corresponding to interrupts */ - #define EXCCODE_INT_NUM (INT_NMI + 1) + #define EXCCODE_INT_NUM (INT_AVEC + 1) #define EXCCODE_INT_START 64 #define EXCCODE_INT_END (EXCCODE_INT_START + EXCCODE_INT_NUM - 1)
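The wider ExcCode range follows directly from the new INT_AVEC vector defined earlier in this file. A quick check using only values visible in this hunk (INT_NMI = 13, INT_AVEC = 14, EXCCODE_INT_START = 64):

#include <stdio.h>

#define INT_NMI            13
#define INT_AVEC           14
#define EXCCODE_INT_START  64

int main(void)
{
	int old_num = INT_NMI + 1;   /* 14 interrupt ExcCodes before AVEC */
	int new_num = INT_AVEC + 1;  /* 15 with AVEC added */

	printf("old: NUM=%d END=%d\n", old_num, EXCCODE_INT_START + old_num - 1); /* 14, 77 */
	printf("new: NUM=%d END=%d\n", new_num, EXCCODE_INT_START + new_num - 1); /* 15, 78 */
	return 0;
}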
diff --combined arch/loongarch/kernel/paravirt.c index 708eda025ed88,4d736a4e488dd..a5fc61f8b3482 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@@ -13,7 -13,6 +13,7 @@@ static int has_steal_clock struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); +DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
static u64 native_steal_clock(int cpu) { @@@ -135,6 -134,11 +135,11 @@@ static irqreturn_t pv_ipi_interrupt(in info->ipi_irqs[IPI_IRQ_WORK]++; }
+ if (action & SMP_CLEAR_VECTOR) { + complete_irq_moving(); + info->ipi_irqs[IPI_CLEAR_VECTOR]++; + } + return IRQ_HANDLED; }
@@@ -152,14 -156,11 +157,14 @@@ static void pv_init_ipi(void } #endif
-static bool kvm_para_available(void) +bool kvm_para_available(void) { int config; static int hypervisor_type;
+ if (!cpu_has_hypervisor) + return false; + if (!hypervisor_type) { config = read_cpucfg(CPUCFG_KVM_SIG); if (!memcmp(&config, KVM_SIGNATURE, 4)) @@@ -169,22 -170,17 +174,22 @@@ return hypervisor_type == HYPERVISOR_KVM; }
-int __init pv_ipi_init(void) +unsigned int kvm_arch_para_features(void) { - int feature; + static unsigned int feature;
- if (!cpu_has_hypervisor) - return 0; if (!kvm_para_available()) return 0;
- feature = read_cpucfg(CPUCFG_KVM_FEATURE); - if (!(feature & KVM_FEATURE_IPI)) + if (!feature) + feature = read_cpucfg(CPUCFG_KVM_FEATURE); + + return feature; +} + +int __init pv_ipi_init(void) +{ + if (!kvm_para_has_feature(KVM_FEATURE_IPI)) return 0;
#ifdef CONFIG_SMP @@@ -215,7 -211,7 +220,7 @@@ static int pv_enable_steal_time(void }
addr |= KVM_STEAL_PHYS_VALID; - kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, addr); + kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, BIT(KVM_FEATURE_STEAL_TIME), addr);
return 0; } @@@ -223,7 -219,7 +228,7 @@@ static void pv_disable_steal_time(void) { if (has_steal_clock) - kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, 0); + kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, BIT(KVM_FEATURE_STEAL_TIME), 0); }
#ifdef CONFIG_SMP @@@ -267,9 -263,15 +272,9 @@@ static struct notifier_block pv_reboot_
int __init pv_time_init(void) { - int r, feature; + int r;
- if (!cpu_has_hypervisor) - return 0; - if (!kvm_para_available()) - return 0; - - feature = read_cpucfg(CPUCFG_KVM_FEATURE); - if (!(feature & KVM_FEATURE_STEAL_TIME)) + if (!kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) return 0;
has_steal_clock = 1; @@@ -303,13 -305,3 +308,13 @@@
return 0; } + +int __init pv_spinlock_init(void) +{ + if (!cpu_has_hypervisor) + return 0; + + static_branch_enable(&virt_spin_lock_key); + + return 0; +} diff --combined arch/loongarch/kernel/smp.c index 482b3c7e3042d,9787871fffa08..9afc2d8b34141 --- a/arch/loongarch/kernel/smp.c +++ b/arch/loongarch/kernel/smp.c @@@ -72,6 -72,7 +72,7 @@@ static const char *ipi_types[NR_IPI] __ [IPI_RESCHEDULE] = "Rescheduling interrupts", [IPI_CALL_FUNCTION] = "Function call interrupts", [IPI_IRQ_WORK] = "IRQ work interrupts", + [IPI_CLEAR_VECTOR] = "Clear vector interrupts", };
void show_ipi_list(struct seq_file *p, int prec) @@@ -248,6 -249,11 +249,11 @@@ static irqreturn_t loongson_ipi_interru per_cpu(irq_stat, cpu).ipi_irqs[IPI_IRQ_WORK]++; }
+ if (action & SMP_CLEAR_VECTOR) { + complete_irq_moving(); + per_cpu(irq_stat, cpu).ipi_irqs[IPI_CLEAR_VECTOR]++; + } + return IRQ_HANDLED; }
@@@ -509,8 -515,6 +515,8 @@@ void __init smp_prepare_boot_cpu(void rr_node = next_node_in(rr_node, node_online_map); } } + + pv_spinlock_init(); }
/* called from main before smp_init() */ diff --combined arch/riscv/Kconfig index 801ee681059ca,3c78edf8e5b90..a9be4dca380dc --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@@ -13,7 -13,6 +13,7 @@@ config 32BI config RISCV def_bool y select ACPI_GENERIC_GSI if ACPI + select ACPI_MCFG if (ACPI && PCI) select ACPI_PPTT if ACPI select ACPI_REDUCED_HARDWARE_ONLY if ACPI select ACPI_SPCR_TABLE if ACPI @@@ -65,6 -64,7 +65,7 @@@ select ARCH_SUPPORTS_LTO_CLANG_THIN if LLD_VERSION >= 140000 select ARCH_SUPPORTS_PAGE_TABLE_CHECK if MMU select ARCH_SUPPORTS_PER_VMA_LOCK if MMU + select ARCH_SUPPORTS_RT select ARCH_SUPPORTS_SHADOW_CALL_STACK if HAVE_SHADOW_CALL_STACK select ARCH_USE_CMPXCHG_LOCKREF if 64BIT select ARCH_USE_MEMTEST @@@ -93,7 -93,6 +94,7 @@@ select GENERIC_ATOMIC64 if !64BIT select GENERIC_CLOCKEVENTS_BROADCAST if SMP select GENERIC_CPU_DEVICES + select GENERIC_CPU_VULNERABILITIES select GENERIC_EARLY_IOREMAP select GENERIC_ENTRY select GENERIC_GETTIMEOFDAY if HAVE_GENERIC_VDSO @@@ -158,7 -157,6 +159,7 @@@ select HAVE_KERNEL_LZO if !XIP_KERNEL && !EFI_ZBOOT select HAVE_KERNEL_UNCOMPRESSED if !XIP_KERNEL && !EFI_ZBOOT select HAVE_KERNEL_ZSTD if !XIP_KERNEL && !EFI_ZBOOT + select HAVE_KERNEL_XZ if !XIP_KERNEL && !EFI_ZBOOT select HAVE_KPROBES if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 @@@ -191,7 -189,6 +192,7 @@@ select OF_EARLY_FLATTREE select OF_IRQ select PCI_DOMAINS_GENERIC if PCI + select PCI_ECAM if (ACPI && PCI) select PCI_MSI if PCI select RISCV_ALTERNATIVE if !XIP_KERNEL select RISCV_APLIC diff --combined arch/x86/Kconfig index 47a2ff9096dad,d45d22fa83f7c..a8cf61c52b063 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@@ -28,7 -28,6 +28,7 @@@ config X86_6 select ARCH_HAS_GIGANTIC_PAGE select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 select ARCH_SUPPORTS_PER_VMA_LOCK + select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE select HAVE_ARCH_SOFT_DIRTY select MODULES_USE_ELF_RELA select NEED_DMA_MAP_STATE @@@ -80,7 -79,6 +80,7 @@@ config X8 select ARCH_HAS_DEBUG_VIRTUAL select ARCH_HAS_DEBUG_VM_PGTABLE if !X86_PAE select ARCH_HAS_DEVMEM_IS_ALLOWED + select ARCH_HAS_DMA_OPS if GART_IOMMU || XEN select ARCH_HAS_EARLY_DEBUG if KGDB select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_FAST_MULTIPLIER @@@ -109,6 -107,7 +109,7 @@@ select ARCH_HAS_DEBUG_WX select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_NMI_SAFE_CMPXCHG + select ARCH_HAVE_EXTRA_ELF_NOTES select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT @@@ -124,6 -123,7 +125,7 @@@ select ARCH_USES_CFI_TRAPS if X86_64 && CFI_CLANG select ARCH_SUPPORTS_LTO_CLANG select ARCH_SUPPORTS_LTO_CLANG_THIN + select ARCH_SUPPORTS_RT select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF if X86_CMPXCHG64 select ARCH_USE_MEMTEST @@@ -298,7 -298,6 +300,7 @@@ select NEED_PER_CPU_EMBED_FIRST_CHUNK select NEED_PER_CPU_PAGE_FIRST_CHUNK select NEED_SG_DMA_LENGTH + select NUMA_MEMBLKS if NUMA select PCI_DOMAINS if PCI select PCI_LOCKLESS_CONFIG if PCI select PERF_EVENTS @@@ -946,6 -945,7 +948,6 @@@ config DM
config GART_IOMMU bool "Old AMD GART IOMMU support" - select DMA_OPS select IOMMU_HELPER select SWIOTLB depends on X86_64 && PCI && AMD_NB @@@ -1601,6 -1601,14 +1603,6 @@@ config X86_64_ACPI_NUM help Enable ACPI SRAT based node topology detection.
-config NUMA_EMU - bool "NUMA emulation" - depends on NUMA - help - Enable NUMA emulation. A flat machine will be split - into virtual nodes when booted with "numa=fake=N", where N is the - number of nodes. This is only useful for debugging. - config NODES_SHIFT int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP range 1 10 @@@ -1800,7 -1808,6 +1802,7 @@@ config X86_PA def_bool y prompt "x86 PAT support" if EXPERT depends on MTRR + select ARCH_USES_PG_ARCH_2 help Use PAT attributes to setup page level cache control.
@@@ -1812,6 -1819,10 +1814,6 @@@
If unsure, say Y.
-config ARCH_USES_PG_UNCACHED - def_bool y - depends on X86_PAT - config X86_UMIP def_bool y prompt "User Mode Instruction Prevention" if EXPERT @@@ -1880,10 -1891,6 +1882,10 @@@ config X86_INTEL_MEMORY_PROTECTION_KEY
If unsure, say y.
+config ARCH_PKEY_BITS + int + default 4 + choice prompt "TSX enable mode" depends on CPU_SUP_INTEL @@@ -2421,6 -2428,14 +2423,14 @@@ config CFI_AUTO_DEFAUL
source "kernel/livepatch/Kconfig"
+ config X86_BUS_LOCK_DETECT + bool "Split Lock Detect and Bus Lock Detect support" + depends on CPU_SUP_INTEL || CPU_SUP_AMD + default y + help + Enable Split Lock Detect and Bus Lock Detect functionalities. + See file:Documentation/arch/x86/buslock.rst for more information. + endmenu
config CC_HAS_NAMED_AS @@@ -2605,24 -2620,15 +2615,15 @@@ config MITIGATION_SL against straight line speculation. The kernel image might be slightly larger.
- config MITIGATION_GDS_FORCE - bool "Force GDS Mitigation" + config MITIGATION_GDS + bool "Mitigate Gather Data Sampling" depends on CPU_SUP_INTEL - default n + default y help - Gather Data Sampling (GDS) is a hardware vulnerability which allows - unprivileged speculative access to data which was previously stored in - vector registers. - - This option is equivalent to setting gather_data_sampling=force on the - command line. The microcode mitigation is used if present, otherwise - AVX is disabled as a mitigation. On affected systems that are missing - the microcode any userspace code that unconditionally uses AVX will - break with this option set. - - Setting this option on systems not vulnerable to GDS has no effect. - - If in doubt, say N. + Enable mitigation for Gather Data Sampling (GDS). GDS is a hardware + vulnerability which allows unprivileged speculative access to data + which was previously stored in vector registers. The attacker uses gather + instructions to infer the stale vector register data.
config MITIGATION_RFDS bool "RFDS Mitigation" @@@ -2645,6 -2651,107 +2646,107 @@@ config MITIGATION_SPECTRE_BH indirect branches. See file:Documentation/admin-guide/hw-vuln/spectre.rst
+ config MITIGATION_MDS + bool "Mitigate Microarchitectural Data Sampling (MDS) hardware bug" + depends on CPU_SUP_INTEL + default y + help + Enable mitigation for Microarchitectural Data Sampling (MDS). MDS is + a hardware vulnerability which allows unprivileged speculative access + to data which is available in various CPU internal buffers. + See also file:Documentation/admin-guide/hw-vuln/mds.rst + + config MITIGATION_TAA + bool "Mitigate TSX Asynchronous Abort (TAA) hardware bug" + depends on CPU_SUP_INTEL + default y + help + Enable mitigation for TSX Asynchronous Abort (TAA). TAA is a hardware + vulnerability that allows unprivileged speculative access to data + which is available in various CPU internal buffers by using + asynchronous aborts within an Intel TSX transactional region. + See also file:Documentation/admin-guide/hw-vuln/tsx_async_abort.rst + + config MITIGATION_MMIO_STALE_DATA + bool "Mitigate MMIO Stale Data hardware bug" + depends on CPU_SUP_INTEL + default y + help + Enable mitigation for MMIO Stale Data hardware bugs. Processor MMIO + Stale Data Vulnerabilities are a class of memory-mapped I/O (MMIO) + vulnerabilities that can expose data. The vulnerabilities require the + attacker to have access to MMIO. + See also + file:Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst + + config MITIGATION_L1TF + bool "Mitigate L1 Terminal Fault (L1TF) hardware bug" + depends on CPU_SUP_INTEL + default y + help + Mitigate L1 Terminal Fault (L1TF) hardware bug. L1 Terminal Fault is a + hardware vulnerability which allows unprivileged speculative access to data + available in the Level 1 Data Cache. + See <file:Documentation/admin-guide/hw-vuln/l1tf.rst + + config MITIGATION_RETBLEED + bool "Mitigate RETBleed hardware bug" + depends on (CPU_SUP_INTEL && MITIGATION_SPECTRE_V2) || MITIGATION_UNRET_ENTRY || MITIGATION_IBPB_ENTRY + default y + help + Enable mitigation for RETBleed (Arbitrary Speculative Code Execution + with Return Instructions) vulnerability. RETBleed is a speculative + execution attack which takes advantage of microarchitectural behavior + in many modern microprocessors, similar to Spectre v2. An + unprivileged attacker can use these flaws to bypass conventional + memory security restrictions to gain read access to privileged memory + that would otherwise be inaccessible. + + config MITIGATION_SPECTRE_V1 + bool "Mitigate SPECTRE V1 hardware bug" + default y + help + Enable mitigation for Spectre V1 (Bounds Check Bypass). Spectre V1 is a + class of side channel attacks that takes advantage of speculative + execution that bypasses conditional branch instructions used for + memory access bounds check. + See also file:Documentation/admin-guide/hw-vuln/spectre.rst + + config MITIGATION_SPECTRE_V2 + bool "Mitigate SPECTRE V2 hardware bug" + default y + help + Enable mitigation for Spectre V2 (Branch Target Injection). Spectre + V2 is a class of side channel attacks that takes advantage of + indirect branch predictors inside the processor. In Spectre variant 2 + attacks, the attacker can steer speculative indirect branches in the + victim to gadget code by poisoning the branch target buffer of a CPU + used for predicting indirect branch addresses. + See also file:Documentation/admin-guide/hw-vuln/spectre.rst + + config MITIGATION_SRBDS + bool "Mitigate Special Register Buffer Data Sampling (SRBDS) hardware bug" + depends on CPU_SUP_INTEL + default y + help + Enable mitigation for Special Register Buffer Data Sampling (SRBDS). 
+ SRBDS is a hardware vulnerability that allows Microarchitectural Data + Sampling (MDS) techniques to infer values returned from special + register accesses. An unprivileged user can extract values returned + from RDRAND and RDSEED executed on another core or sibling thread + using MDS techniques. + See also + file:Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst + + config MITIGATION_SSB + bool "Mitigate Speculative Store Bypass (SSB) hardware bug" + default y + help + Enable mitigation for Speculative Store Bypass (SSB). SSB is a + hardware security vulnerability and its exploitation takes advantage + of speculative execution in a similar way to the Meltdown and Spectre + security vulnerabilities. + endif
config ARCH_HAS_ADD_PAGES diff --combined arch/x86/include/asm/mmu_context.h index 80f2a3187aa66,19091ebb86338..2886cb668d7fa --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@@ -88,7 -88,13 +88,13 @@@ static inline void switch_ldt(struct mm #ifdef CONFIG_ADDRESS_MASKING static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm) { - return mm->context.lam_cr3_mask; + /* + * When switch_mm_irqs_off() is called for a kthread, it may race with + * LAM enablement. switch_mm_irqs_off() uses the LAM mask to do two + * things: populate CR3 and populate 'cpu_tlbstate.lam'. Make sure it + * reads a single value for both. + */ + return READ_ONCE(mm->context.lam_cr3_mask); }
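The sketch below is illustrative only: it restates the pattern the new comment describes in plain userspace C, with a C11 relaxed atomic load standing in for the kernel's READ_ONCE(). The mask is loaded exactly once into a local, so both consumers of it (the CR3 image and the per-CPU copy) see the same snapshot even if another thread enables LAM concurrently. The type and variable names here are invented for the example.

#include <stdatomic.h>
#include <stdio.h>

struct mm_ctx {
	_Atomic unsigned long lam_cr3_mask;   /* may be updated by another thread */
};

static unsigned long snapshot_lam(struct mm_ctx *ctx)
{
	/* One load, one value: callers never see two different masks. */
	return atomic_load_explicit(&ctx->lam_cr3_mask, memory_order_relaxed);
}

int main(void)
{
	struct mm_ctx ctx = { .lam_cr3_mask = 1UL << 62 };
	unsigned long lam = snapshot_lam(&ctx);

	unsigned long cr3_image     = 0x1000UL | lam;  /* consumer 1 */
	unsigned long tlbstate_copy = lam;             /* consumer 2 */

	printf("cr3=%#lx tlbstate=%#lx\n", cr3_image, tlbstate_copy);
	return 0;
}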
static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm) @@@ -232,6 -238,11 +238,6 @@@ static inline bool is_64bit_mm(struct m } #endif
-static inline void arch_unmap(struct mm_struct *mm, unsigned long start, - unsigned long end) -{ -} - /* * We only want to enforce protection keys on the current process * because we effectively have no access to PKRU for other diff --combined arch/x86/include/asm/processor.h index 775acbdea1a96,399f7d1c4c61f..4a686f0e5dbf6 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@@ -582,7 -582,8 +582,8 @@@ extern void switch_gdt_and_percpu_base( extern void load_direct_gdt(int); extern void load_fixmap_gdt(int); extern void cpu_init(void); - extern void cpu_init_exception_handling(void); + extern void cpu_init_exception_handling(bool boot_cpu); + extern void cpu_init_replace_early_idt(void); extern void cr4_init(void);
extern void set_task_blockstep(struct task_struct *task, bool on); @@@ -691,6 -692,8 +692,6 @@@ static inline u32 per_cpu_l2c_id(unsign }
#ifdef CONFIG_CPU_SUP_AMD -extern u32 amd_get_highest_perf(void); - /* * Issue a DIV 0/1 insn to clear any division data from previous DIV * operations. @@@ -703,6 -706,7 +704,6 @@@ static __always_inline void amd_clear_d
extern void amd_check_microcode(void); #else -static inline u32 amd_get_highest_perf(void) { return 0; } static inline void amd_clear_divider(void) { } static inline void amd_check_microcode(void) { } #endif diff --combined arch/x86/kernel/cpu/sgx/main.c index d01deb3863955,3a79105455f1d..9ace84486499b --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@@ -475,24 -475,25 +475,25 @@@ struct sgx_epc_page *__sgx_alloc_epc_pa { struct sgx_epc_page *page; int nid_of_current = numa_node_id(); - int nid = nid_of_current; + int nid_start, nid;
- if (node_isset(nid_of_current, sgx_numa_mask)) { - page = __sgx_alloc_epc_page_from_node(nid_of_current); - if (page) - return page; - } - - /* Fall back to the non-local NUMA nodes: */ - while (true) { - nid = next_node_in(nid, sgx_numa_mask); - if (nid == nid_of_current) - break; + /* + * Try local node first. If it doesn't have an EPC section, + * fall back to the non-local NUMA nodes. + */ + if (node_isset(nid_of_current, sgx_numa_mask)) + nid_start = nid_of_current; + else + nid_start = next_node_in(nid_of_current, sgx_numa_mask);
+ nid = nid_start; + do { page = __sgx_alloc_epc_page_from_node(nid); if (page) return page; - } + + nid = next_node_in(nid, sgx_numa_mask); + } while (nid != nid_start);
return ERR_PTR(-ENOMEM); } @@@ -732,7 -733,7 +733,7 @@@ out return 0; }
- /** + /* * A section metric is concatenated in a way that @low bits 12-31 define the * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the * metric. @@@ -847,6 -848,13 +848,13 @@@ static bool __init sgx_page_cache_init( return false; }
+ for_each_online_node(nid) { + if (!node_isset(nid, sgx_numa_mask) && + node_state(nid, N_MEMORY) && node_state(nid, N_CPU)) + pr_info("node%d has both CPUs and memory but doesn't have an EPC section\n", + nid); + } + return true; }
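The reworked __sgx_alloc_epc_page() above folds "try the local node, then every other node that has an EPC section" into a single do/while round-robin. A self-contained sketch of that walk, with plain arrays standing in for the node mask and per-node free pages (all names and values here are made up for illustration):

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4

static bool node_has_epc[NR_NODES] = { false, true, false, true };
static int  free_pages[NR_NODES]   = { 0, 0, 0, 3 };

/* next node with an EPC section, wrapping around (like next_node_in()) */
static int next_epc_node(int nid)
{
    for (int i = 1; i <= NR_NODES; i++) {
        int n = (nid + i) % NR_NODES;

        if (node_has_epc[n])
            return n;
    }
    return nid;
}

static int alloc_epc_page(int nid_of_current)
{
    int nid_start, nid;

    /* Try the local node first, otherwise start at the next EPC node. */
    nid_start = node_has_epc[nid_of_current] ?
            nid_of_current : next_epc_node(nid_of_current);

    nid = nid_start;
    do {
        if (free_pages[nid] > 0) {
            free_pages[nid]--;
            return nid;    /* "page" taken from this node */
        }
        nid = next_epc_node(nid);
    } while (nid != nid_start);

    return -1;             /* -ENOMEM in the real code */
}

int main(void)
{
    printf("allocated from node %d\n", alloc_epc_page(0));
    return 0;
}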
@@@ -895,10 -903,10 +903,10 @@@ int sgx_set_attribute(unsigned long *al { struct fd f = fdget(attribute_fd);
- if (!f.file) + if (!fd_file(f)) return -EINVAL;
- if (f.file->f_op != &sgx_provision_fops) { + if (fd_file(f)->f_op != &sgx_provision_fops) { fdput(f); return -EINVAL; } diff --combined block/blk-mq.c index 82219f0e9a256,aa28157b1aafc..4b2c8e940f591 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@@ -376,7 -376,9 +376,7 @@@ static struct request *blk_mq_rq_ctx_in rq->io_start_time_ns = 0; rq->stats_sectors = 0; rq->nr_phys_segments = 0; -#if defined(CONFIG_BLK_DEV_INTEGRITY) rq->nr_integrity_segments = 0; -#endif rq->end_io = NULL; rq->end_io_data = NULL;
@@@ -1126,7 -1128,7 +1126,7 @@@ static void blk_complete_reqs(struct ll rq->q->mq_ops->complete(rq); }
- static __latent_entropy void blk_done_softirq(struct softirq_action *h) + static __latent_entropy void blk_done_softirq(void) { blk_complete_reqs(this_cpu_ptr(&blk_cpu_done)); } @@@ -2544,9 -2546,6 +2544,9 @@@ static void blk_mq_bio_to_request(struc rq->__sector = bio->bi_iter.bi_sector; rq->write_hint = bio->bi_write_hint; blk_rq_bio_prep(rq, bio, nr_segs); + if (bio_integrity(bio)) + rq->nr_integrity_segments = blk_rq_count_integrity_sg(rq->q, + bio);
/* This can't fail, since GFP_NOIO includes __GFP_DIRECT_RECLAIM. */ err = blk_crypto_rq_bio_prep(rq, bio, GFP_NOIO); @@@ -2754,7 -2753,6 +2754,7 @@@ static void blk_mq_dispatch_plug_list(s void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) { struct request *rq; + unsigned int depth;
/* * We may have been called recursively midway through handling @@@ -2765,7 -2763,6 +2765,7 @@@ */ if (plug->rq_count == 0) return; + depth = plug->rq_count; plug->rq_count = 0;
if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) { @@@ -2773,7 -2770,6 +2773,7 @@@
rq = rq_list_peek(&plug->mq_list); q = rq->q; + trace_block_unplug(q, depth, true);
/* * Peek first request and see if we have a ->queue_rqs() hook. @@@ -2943,7 -2939,7 +2943,7 @@@ void blk_mq_submit_bio(struct bio *bio struct blk_plug *plug = current->plug; const int is_sync = op_is_sync(bio->bi_opf); struct blk_mq_hw_ctx *hctx; - unsigned int nr_segs = 1; + unsigned int nr_segs; struct request *rq; blk_status_t ret;
@@@ -2985,10 -2981,11 +2985,10 @@@ goto queue_exit; }
- if (unlikely(bio_may_exceed_limits(bio, &q->limits))) { - bio = __bio_split_to_limits(bio, &q->limits, &nr_segs); - if (!bio) - goto queue_exit; - } + bio = __bio_split_to_limits(bio, &q->limits, &nr_segs); + if (!bio) + goto queue_exit; + if (!bio_integrity_prep(bio)) goto queue_exit;
diff --combined drivers/gpu/drm/i915/i915_utils.c index b34a2d3d331d6,f2ba51c20e975..2576f8f6c0f69 --- a/drivers/gpu/drm/i915/i915_utils.c +++ b/drivers/gpu/drm/i915/i915_utils.c @@@ -11,10 -11,51 +11,10 @@@ #include "i915_reg.h" #include "i915_utils.h"
-#define FDO_BUG_MSG "Please file a bug on drm/i915; see " FDO_BUG_URL " for details." - -void -__i915_printk(struct drm_i915_private *dev_priv, const char *level, - const char *fmt, ...) -{ - static bool shown_bug_once; - struct device *kdev = dev_priv->drm.dev; - bool is_error = level[1] <= KERN_ERR[1]; - bool is_debug = level[1] == KERN_DEBUG[1]; - struct va_format vaf; - va_list args; - - if (is_debug && !drm_debug_enabled(DRM_UT_DRIVER)) - return; - - va_start(args, fmt); - - vaf.fmt = fmt; - vaf.va = &args; - - if (is_error) - dev_printk(level, kdev, "%pV", &vaf); - else - dev_printk(level, kdev, "[" DRM_NAME ":%ps] %pV", - __builtin_return_address(0), &vaf); - - va_end(args); - - if (is_error && !shown_bug_once) { - /* - * Ask the user to file a bug report for the error, except - * if they may have caused the bug by fiddling with unsafe - * module parameters. - */ - if (!test_taint(TAINT_USER)) - dev_notice(kdev, "%s", FDO_BUG_MSG); - shown_bug_once = true; - } -} - void add_taint_for_CI(struct drm_i915_private *i915, unsigned int taint) { - __i915_printk(i915, KERN_NOTICE, "CI tainted:%#x by %pS\n", - taint, (void *)_RET_IP_); + drm_notice(&i915->drm, "CI tainted: %#x by %pS\n", + taint, __builtin_return_address(0));
/* Failures that occur during fault injection testing are expected */ if (!i915_error_injected()) @@@ -33,9 -74,9 +33,9 @@@ int __i915_inject_probe_error(struct dr if (++i915_probe_fail_count < i915_modparams.inject_probe_failure) return 0;
- __i915_printk(i915, KERN_INFO, - "Injecting failure %d at checkpoint %u [%s:%d]\n", - err, i915_modparams.inject_probe_failure, func, line); + drm_info(&i915->drm, "Injecting failure %d at checkpoint %u [%s:%d]\n", + err, i915_modparams.inject_probe_failure, func, line); + i915_modparams.inject_probe_failure = 0; return err; } @@@ -69,7 -110,7 +69,7 @@@ void set_timer_ms(struct timer_list *t * Paranoia to make sure the compiler computes the timeout before * loading 'jiffies' as jiffies is volatile and may be updated in * the background by a timer tick. All to reduce the complexity - * of the addition and reduce the risk of losing a jiffie. + * of the addition and reduce the risk of losing a jiffy. */ barrier();
diff --combined drivers/gpu/drm/v3d/v3d_bo.c index ecb80fd75b1a0,9eafe53a8f41a..ebe52bef4ffb8 --- a/drivers/gpu/drm/v3d/v3d_bo.c +++ b/drivers/gpu/drm/v3d/v3d_bo.c @@@ -26,17 -26,6 +26,17 @@@ #include "v3d_drv.h" #include "uapi/drm/v3d_drm.h"
+static enum drm_gem_object_status v3d_gem_status(struct drm_gem_object *obj) +{ + struct v3d_bo *bo = to_v3d_bo(obj); + enum drm_gem_object_status res = 0; + + if (bo->base.pages) + res |= DRM_GEM_OBJECT_RESIDENT; + + return res; +} + /* Called DRM core on the last userspace/kernel unreference of the * BO. */ @@@ -74,7 -63,6 +74,7 @@@ static const struct drm_gem_object_func .vmap = drm_gem_shmem_object_vmap, .vunmap = drm_gem_shmem_object_vunmap, .mmap = drm_gem_shmem_object_mmap, + .status = v3d_gem_status, .vm_ops = &drm_gem_shmem_vm_ops, };
@@@ -291,7 -279,7 +291,7 @@@ v3d_wait_bo_ioctl(struct drm_device *de else args->timeout_ns = 0;
- /* Asked to wait beyond the jiffie/scheduler precision? */ + /* Asked to wait beyond the jiffy/scheduler precision? */ if (ret == -ETIME && args->timeout_ns) ret = -EAGAIN;
diff --combined drivers/hwmon/k10temp.c index 85a7632f3b50a,f96b91e433126..7dc19c5d62ac3 --- a/drivers/hwmon/k10temp.c +++ b/drivers/hwmon/k10temp.c @@@ -438,21 -438,16 +438,21 @@@ static int k10temp_probe(struct pci_de data->disp_negative = true; }
- if (boot_cpu_data.x86 == 0x15 && + data->is_zen = cpu_feature_enabled(X86_FEATURE_ZEN); + if (data->is_zen) { + data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; + data->read_tempreg = read_tempreg_nb_zen; + } else if (boot_cpu_data.x86 == 0x15 && ((boot_cpu_data.x86_model & 0xf0) == 0x60 || (boot_cpu_data.x86_model & 0xf0) == 0x70)) { data->read_htcreg = read_htcreg_nb_f15; data->read_tempreg = read_tempreg_nb_f15; - } else if (boot_cpu_data.x86 == 0x17 || boot_cpu_data.x86 == 0x18) { - data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; - data->read_tempreg = read_tempreg_nb_zen; - data->is_zen = true; + } else { + data->read_htcreg = read_htcreg_pci; + data->read_tempreg = read_tempreg_pci; + }
+ if (boot_cpu_data.x86 == 0x17 || boot_cpu_data.x86 == 0x18) { switch (boot_cpu_data.x86_model) { case 0x1: /* Zen */ case 0x8: /* Zen+ */ @@@ -474,6 -469,10 +474,6 @@@ break; } } else if (boot_cpu_data.x86 == 0x19) { - data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; - data->read_tempreg = read_tempreg_nb_zen; - data->is_zen = true; - switch (boot_cpu_data.x86_model) { case 0x0 ... 0x1: /* Zen3 SP3/TR */ case 0x8: /* Zen3 TR Chagall */ @@@ -497,6 -496,13 +497,6 @@@ k10temp_get_ccd_support(data, 12); break; } - } else if (boot_cpu_data.x86 == 0x1a) { - data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; - data->read_tempreg = read_tempreg_nb_zen; - data->is_zen = true; - } else { - data->read_htcreg = read_htcreg_pci; - data->read_tempreg = read_tempreg_pci; }
for (i = 0; i < ARRAY_SIZE(tctl_offset_table); i++) { @@@ -542,6 -548,7 +542,7 @@@ static const struct pci_device_id k10te { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F3) }, { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_1AH_M00H_DF_F3) }, { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_1AH_M20H_DF_F3) }, + { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_1AH_M60H_DF_F3) }, { PCI_VDEVICE(HYGON, PCI_DEVICE_ID_AMD_17H_DF_F3) }, {} }; diff --combined drivers/iommu/intel/iommu.h index 428d253f13484,5c0e93042f7a2..1497f3112b12c --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@@ -584,23 -584,11 +584,23 @@@ struct iommu_domain_info * to VT-d spec, section 9.3 */ };
+/* + * We start simply by using a fixed size for the batched descriptors. This + * size is currently sufficient for our needs. Future improvements could + * involve dynamically allocating the batch buffer based on actual demand, + * allowing us to adjust the batch size for optimal performance in different + * scenarios. + */ +#define QI_MAX_BATCHED_DESC_COUNT 16 +struct qi_batch { + struct qi_desc descs[QI_MAX_BATCHED_DESC_COUNT]; + unsigned int index; +}; + struct dmar_domain { int nid; /* node id */ struct xarray iommu_array; /* Attached IOMMU array */
- u8 has_iotlb_device: 1; u8 iommu_coherency: 1; /* indicate coherency of iommu access */ u8 force_snooping : 1; /* Create IOPTEs with snoop control */ u8 set_pte_snp:1; @@@ -621,7 -609,6 +621,7 @@@
spinlock_t cache_lock; /* Protect the cache tag list */ struct list_head cache_tags; /* Cache tag list */ + struct qi_batch *qi_batch; /* Batched QI descriptors */
int iommu_superpage;/* Level of superpages supported: 0 == 4KiB (no superpages), 1 == 2MiB, @@@ -700,8 -687,6 +700,6 @@@ struct iommu_pmu DECLARE_BITMAP(used_mask, IOMMU_PMU_IDX_MAX); struct perf_event *event_list[IOMMU_PMU_IDX_MAX]; unsigned char irq_name[16]; - struct hlist_node cpuhp_node; - int cpu; };
#define IOMMU_IRQ_ID_OFFSET_PRQ (DMAR_UNITS_SUPPORTED) @@@ -1080,115 -1065,6 +1078,115 @@@ static inline unsigned long nrpages_to_ return npages << VTD_PAGE_SHIFT; }
+static inline void qi_desc_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, + unsigned int size_order, u64 type, + struct qi_desc *desc) +{ + u8 dw = 0, dr = 0; + int ih = 0; + + if (cap_write_drain(iommu->cap)) + dw = 1; + + if (cap_read_drain(iommu->cap)) + dr = 1; + + desc->qw0 = QI_IOTLB_DID(did) | QI_IOTLB_DR(dr) | QI_IOTLB_DW(dw) + | QI_IOTLB_GRAN(type) | QI_IOTLB_TYPE; + desc->qw1 = QI_IOTLB_ADDR(addr) | QI_IOTLB_IH(ih) + | QI_IOTLB_AM(size_order); + desc->qw2 = 0; + desc->qw3 = 0; +} + +static inline void qi_desc_dev_iotlb(u16 sid, u16 pfsid, u16 qdep, u64 addr, + unsigned int mask, struct qi_desc *desc) +{ + if (mask) { + addr |= (1ULL << (VTD_PAGE_SHIFT + mask - 1)) - 1; + desc->qw1 = QI_DEV_IOTLB_ADDR(addr) | QI_DEV_IOTLB_SIZE; + } else { + desc->qw1 = QI_DEV_IOTLB_ADDR(addr); + } + + if (qdep >= QI_DEV_IOTLB_MAX_INVS) + qdep = 0; + + desc->qw0 = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) | + QI_DIOTLB_TYPE | QI_DEV_IOTLB_PFSID(pfsid); + desc->qw2 = 0; + desc->qw3 = 0; +} + +static inline void qi_desc_piotlb(u16 did, u32 pasid, u64 addr, + unsigned long npages, bool ih, + struct qi_desc *desc) +{ + if (npages == -1) { + desc->qw0 = QI_EIOTLB_PASID(pasid) | + QI_EIOTLB_DID(did) | + QI_EIOTLB_GRAN(QI_GRAN_NONG_PASID) | + QI_EIOTLB_TYPE; + desc->qw1 = 0; + } else { + int mask = ilog2(__roundup_pow_of_two(npages)); + unsigned long align = (1ULL << (VTD_PAGE_SHIFT + mask)); + + if (WARN_ON_ONCE(!IS_ALIGNED(addr, align))) + addr = ALIGN_DOWN(addr, align); + + desc->qw0 = QI_EIOTLB_PASID(pasid) | + QI_EIOTLB_DID(did) | + QI_EIOTLB_GRAN(QI_GRAN_PSI_PASID) | + QI_EIOTLB_TYPE; + desc->qw1 = QI_EIOTLB_ADDR(addr) | + QI_EIOTLB_IH(ih) | + QI_EIOTLB_AM(mask); + } +} + +static inline void qi_desc_dev_iotlb_pasid(u16 sid, u16 pfsid, u32 pasid, + u16 qdep, u64 addr, + unsigned int size_order, + struct qi_desc *desc) +{ + unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1); + + desc->qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) | + QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE | + QI_DEV_IOTLB_PFSID(pfsid); + + /* + * If S bit is 0, we only flush a single page. If S bit is set, + * The least significant zero bit indicates the invalidation address + * range. VT-d spec 6.5.2.6. + * e.g. address bit 12[0] indicates 8KB, 13[0] indicates 16KB. + * size order = 0 is PAGE_SIZE 4KB + * Max Invs Pending (MIP) is set to 0 for now until we have DIT in + * ECAP. + */ + if (!IS_ALIGNED(addr, VTD_PAGE_SIZE << size_order)) + pr_warn_ratelimited("Invalidate non-aligned address %llx, order %d\n", + addr, size_order); + + /* Take page address */ + desc->qw1 = QI_DEV_EIOTLB_ADDR(addr); + + if (size_order) { + /* + * Existing 0s in address below size_order may be the least + * significant bit, we must set them to 1s to avoid having + * smaller size than desired. + */ + desc->qw1 |= GENMASK_ULL(size_order + VTD_PAGE_SHIFT - 1, + VTD_PAGE_SHIFT); + /* Clear size_order bit to indicate size */ + desc->qw1 &= ~mask; + /* Set the S bit to indicate flushing more than 1 page */ + desc->qw1 |= QI_DEV_EIOTLB_SIZE; + } +} + /* Convert value to context PASID directory size field coding. */ #define context_pdts(pds) (((pds) & 0x7) << 9)
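The comment in qi_desc_dev_iotlb_pasid() above describes the VT-d "least significant zero bit" range encoding: bits PAGE_SHIFT through PAGE_SHIFT + size_order - 1 are first set and the highest of them is then cleared, so the position of the lowest zero bit tells the hardware how large the invalidation is (bit 12 clear means 8KB, bit 13 clear means 16KB, and so on). A standalone sketch of just that arithmetic, assuming the 4KB page shift of 12 used in the comment and leaving out the descriptor flag bits:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12ULL

/* Encode a 2^size_order page range into the low address bits (S bit not shown). */
static uint64_t encode_range(uint64_t addr, unsigned int size_order)
{
    uint64_t qw = addr & ~((1ULL << PAGE_SHIFT) - 1);   /* page-aligned base */

    if (size_order) {
        uint64_t ones = ((1ULL << size_order) - 1) << PAGE_SHIFT;
        uint64_t top  = 1ULL << (PAGE_SHIFT + size_order - 1);

        qw |= ones;    /* set bits PAGE_SHIFT .. PAGE_SHIFT + size_order - 1 */
        qw &= ~top;    /* clear the highest of them: the lowest zero bit marks the size */
    }
    return qw;
}

int main(void)
{
    /* order 2: bit 13 ends up zero (16KB); order 3: bit 14 ends up zero (32KB) */
    printf("%llx\n", (unsigned long long)encode_range(0x120000, 2));
    printf("%llx\n", (unsigned long long)encode_range(0x120000, 3));
    return 0;
}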
@@@ -1220,15 -1096,13 +1218,15 @@@ void qi_flush_pasid_cache(struct intel_
int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, unsigned int count, unsigned long options); + +void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, + unsigned int size_order, u64 type); /* * Options used in qi_submit_sync: * QI_OPT_WAIT_DRAIN - Wait for PRQ drain completion, spec 6.5.2.8. */ #define QI_OPT_WAIT_DRAIN BIT(0)
-void domain_update_iotlb(struct dmar_domain *domain); int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu); void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu); void device_block_translation(struct device *dev); diff --combined drivers/irqchip/irq-loongson-eiointc.c index 2057388470660,e24db71a8783c..9a46af3083d61 --- a/drivers/irqchip/irq-loongson-eiointc.c +++ b/drivers/irqchip/irq-loongson-eiointc.c @@@ -14,10 -14,11 +14,12 @@@ #include <linux/irqdomain.h> #include <linux/irqchip/chained_irq.h> #include <linux/kernel.h> +#include <linux/kvm_para.h> #include <linux/syscore_ops.h> #include <asm/numa.h>
+ #include "irq-loongson.h" + #define EIOINTC_REG_NODEMAP 0x14a0 #define EIOINTC_REG_IPMAP 0x14c0 #define EIOINTC_REG_ENABLE 0x1600 @@@ -25,37 -26,15 +27,37 @@@ #define EIOINTC_REG_ISR 0x1800 #define EIOINTC_REG_ROUTE 0x1c00
+#define EXTIOI_VIRT_FEATURES 0x40000000 +#define EXTIOI_HAS_VIRT_EXTENSION BIT(0) +#define EXTIOI_HAS_ENABLE_OPTION BIT(1) +#define EXTIOI_HAS_INT_ENCODE BIT(2) +#define EXTIOI_HAS_CPU_ENCODE BIT(3) +#define EXTIOI_VIRT_CONFIG 0x40000004 +#define EXTIOI_ENABLE BIT(1) +#define EXTIOI_ENABLE_INT_ENCODE BIT(2) +#define EXTIOI_ENABLE_CPU_ENCODE BIT(3) + #define VEC_REG_COUNT 4 #define VEC_COUNT_PER_REG 64 #define VEC_COUNT (VEC_REG_COUNT * VEC_COUNT_PER_REG) #define VEC_REG_IDX(irq_id) ((irq_id) / VEC_COUNT_PER_REG) #define VEC_REG_BIT(irq_id) ((irq_id) % VEC_COUNT_PER_REG) #define EIOINTC_ALL_ENABLE 0xffffffff +#define EIOINTC_ALL_ENABLE_VEC_MASK(vector) (EIOINTC_ALL_ENABLE & ~BIT(vector & 0x1f)) +#define EIOINTC_REG_ENABLE_VEC(vector) (EIOINTC_REG_ENABLE + ((vector >> 5) << 2)) +#define EIOINTC_USE_CPU_ENCODE BIT(0)
#define MAX_EIO_NODES (NR_CPUS / CORES_PER_EIO_NODE)
+/*
+ * Routing registers are 32-bit, and there is an 8-bit route setting for
+ * every interrupt vector, so one route register carries the routing
+ * information for four vectors.
+ */
+#define EIOINTC_REG_ROUTE_VEC(vector) (EIOINTC_REG_ROUTE + (vector & ~0x03))
+#define EIOINTC_REG_ROUTE_VEC_SHIFT(vector) ((vector & 0x03) << 3)
+#define EIOINTC_REG_ROUTE_VEC_MASK(vector) (0xff << EIOINTC_REG_ROUTE_VEC_SHIFT(vector))
+
 static int nr_pics;
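Since every 32-bit route register packs four 8-bit per-vector route settings, the macros above boil down to "register offset = vector rounded down to a multiple of four, byte lane = vector mod four". A standalone sketch of the same packing (the register base mirrors EIOINTC_REG_ROUTE above; the CPU map value is purely illustrative):

#include <stdint.h>
#include <stdio.h>

#define REG_ROUTE_BASE 0x1c00   /* EIOINTC_REG_ROUTE in the driver above */

static unsigned int route_reg(unsigned int vector)
{
    return REG_ROUTE_BASE + (vector & ~0x03u);   /* four vectors per 32-bit register */
}

static uint32_t route_set(uint32_t reg_val, unsigned int vector, uint8_t cpu_map)
{
    unsigned int shift = (vector & 0x03u) << 3;  /* byte lane inside the register */

    reg_val &= ~(0xffu << shift);
    reg_val |= (uint32_t)cpu_map << shift;
    return reg_val;
}

int main(void)
{
    uint32_t val = route_set(0, 6, 0x04);   /* vector 6 sits in byte 2 of register 0x1c04 */

    printf("reg %#x = %#010x\n", route_reg(6), (unsigned int)val);
    return 0;
}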
struct eiointc_priv { @@@ -65,7 -44,6 +67,7 @@@ cpumask_t cpuspan_map; struct fwnode_handle *domain_handle; struct irq_domain *eiointc_domain; + int flags; };
static struct eiointc_priv *eiointc_priv[MAX_IO_PICS]; @@@ -81,10 -59,7 +83,10 @@@ static void eiointc_enable(void
static int cpu_to_eio_node(int cpu) { - return cpu_logical_map(cpu) / CORES_PER_EIO_NODE; + if (!kvm_para_has_feature(KVM_FEATURE_VIRT_EXTIOI)) + return cpu_logical_map(cpu) / CORES_PER_EIO_NODE; + else + return cpu_logical_map(cpu) / CORES_PER_VEIO_NODE; }
#ifdef CONFIG_SMP @@@ -116,17 -91,6 +118,17 @@@ static void eiointc_set_irq_route(int p
static DEFINE_RAW_SPINLOCK(affinity_lock);
+static void veiointc_set_irq_route(unsigned int vector, unsigned int cpu) +{ + unsigned long reg = EIOINTC_REG_ROUTE_VEC(vector); + unsigned int data; + + data = iocsr_read32(reg); + data &= ~EIOINTC_REG_ROUTE_VEC_MASK(vector); + data |= cpu_logical_map(cpu) << EIOINTC_REG_ROUTE_VEC_SHIFT(vector); + iocsr_write32(data, reg); +} + static int eiointc_set_irq_affinity(struct irq_data *d, const struct cpumask *affinity, bool force) { unsigned int cpu; @@@ -143,24 -107,18 +145,24 @@@ }
vector = d->hwirq; - regaddr = EIOINTC_REG_ENABLE + ((vector >> 5) << 2); - - /* Mask target vector */ - csr_any_send(regaddr, EIOINTC_ALL_ENABLE & (~BIT(vector & 0x1F)), - 0x0, priv->node * CORES_PER_EIO_NODE); - - /* Set route for target vector */ - eiointc_set_irq_route(vector, cpu, priv->node, &priv->node_map); - - /* Unmask target vector */ - csr_any_send(regaddr, EIOINTC_ALL_ENABLE, - 0x0, priv->node * CORES_PER_EIO_NODE); + regaddr = EIOINTC_REG_ENABLE_VEC(vector); + + if (priv->flags & EIOINTC_USE_CPU_ENCODE) { + iocsr_write32(EIOINTC_ALL_ENABLE_VEC_MASK(vector), regaddr); + veiointc_set_irq_route(vector, cpu); + iocsr_write32(EIOINTC_ALL_ENABLE, regaddr); + } else { + /* Mask target vector */ + csr_any_send(regaddr, EIOINTC_ALL_ENABLE_VEC_MASK(vector), + 0x0, priv->node * CORES_PER_EIO_NODE); + + /* Set route for target vector */ + eiointc_set_irq_route(vector, cpu, priv->node, &priv->node_map); + + /* Unmask target vector */ + csr_any_send(regaddr, EIOINTC_ALL_ENABLE, + 0x0, priv->node * CORES_PER_EIO_NODE); + }
irq_data_update_effective_affinity(d, cpumask_of(cpu));
@@@ -184,23 -142,17 +186,23 @@@ static int eiointc_index(int node
static int eiointc_router_init(unsigned int cpu) { - int i, bit; - uint32_t data; - uint32_t node = cpu_to_eio_node(cpu); - int index = eiointc_index(node); + int i, bit, cores, index, node; + unsigned int data; + + node = cpu_to_eio_node(cpu); + index = eiointc_index(node);
if (index < 0) { pr_err("Error: invalid nodemap!\n"); - return -1; + return -EINVAL; }
- if ((cpu_logical_map(cpu) % CORES_PER_EIO_NODE) == 0) { + if (!(eiointc_priv[index]->flags & EIOINTC_USE_CPU_ENCODE)) + cores = CORES_PER_EIO_NODE; + else + cores = CORES_PER_VEIO_NODE; + + if ((cpu_logical_map(cpu) % cores) == 0) { eiointc_enable();
for (i = 0; i < eiointc_priv[0]->vec_count / 32; i++) { @@@ -216,9 -168,7 +218,9 @@@
for (i = 0; i < eiointc_priv[0]->vec_count / 4; i++) { /* Route to Node-0 Core-0 */ - if (index == 0) + if (eiointc_priv[index]->flags & EIOINTC_USE_CPU_ENCODE) + bit = cpu_logical_map(0); + else if (index == 0) bit = BIT(cpu_logical_map(0)); else bit = (eiointc_priv[index]->node << 4) | 1; @@@ -412,6 -362,9 +414,9 @@@ static int __init acpi_cascade_irqdomai if (r < 0) return r;
+ if (cpu_has_avecint) + return 0; + r = acpi_table_parse_madt(ACPI_MADT_TYPE_MSI_PIC, pch_msi_parse_madt, 1); if (r < 0) return r; @@@ -422,7 -375,7 +427,7 @@@ static int __init eiointc_init(struct eiointc_priv *priv, int parent_irq, u64 node_map) { - int i; + int i, val;
node_map = node_map ? node_map : -1ULL; for_each_possible_cpu(i) { @@@ -442,28 -395,14 +447,28 @@@ return -ENOMEM; }
+ if (kvm_para_has_feature(KVM_FEATURE_VIRT_EXTIOI)) { + val = iocsr_read32(EXTIOI_VIRT_FEATURES); + /* + * With EXTIOI_ENABLE_CPU_ENCODE set + * interrupts can route to 256 vCPUs. + */ + if (val & EXTIOI_HAS_CPU_ENCODE) { + val = iocsr_read32(EXTIOI_VIRT_CONFIG); + val |= EXTIOI_ENABLE_CPU_ENCODE; + iocsr_write32(val, EXTIOI_VIRT_CONFIG); + priv->flags = EIOINTC_USE_CPU_ENCODE; + } + } + eiointc_priv[nr_pics++] = priv; eiointc_router_init(0); irq_set_chained_handler_and_data(parent_irq, eiointc_irq_dispatch, priv);
if (nr_pics == 1) { register_syscore_ops(&eiointc_syscore_ops); - cpuhp_setup_state_nocalls(CPUHP_AP_IRQ_LOONGARCH_STARTING, - "irqchip/loongarch/intc:starting", + cpuhp_setup_state_nocalls(CPUHP_AP_IRQ_EIOINTC_STARTING, + "irqchip/loongarch/eiointc:starting", eiointc_router_init, NULL); }
diff --combined fs/proc/base.c index 1ad51858528f7,632cf1fc8f8c1..b31283d81c52e --- a/fs/proc/base.c +++ b/fs/proc/base.c @@@ -85,7 -85,6 +85,7 @@@ #include <linux/elf.h> #include <linux/pid_namespace.h> #include <linux/user_namespace.h> +#include <linux/fs_parser.h> #include <linux/fs_struct.h> #include <linux/slab.h> #include <linux/sched/autogroup.h> @@@ -118,40 -117,6 +118,40 @@@ static u8 nlink_tid __ro_after_init; static u8 nlink_tgid __ro_after_init;
+enum proc_mem_force { + PROC_MEM_FORCE_ALWAYS, + PROC_MEM_FORCE_PTRACE, + PROC_MEM_FORCE_NEVER +}; + +static enum proc_mem_force proc_mem_force_override __ro_after_init = + IS_ENABLED(CONFIG_PROC_MEM_NO_FORCE) ? PROC_MEM_FORCE_NEVER : + IS_ENABLED(CONFIG_PROC_MEM_FORCE_PTRACE) ? PROC_MEM_FORCE_PTRACE : + PROC_MEM_FORCE_ALWAYS; + +static const struct constant_table proc_mem_force_table[] __initconst = { + { "always", PROC_MEM_FORCE_ALWAYS }, + { "ptrace", PROC_MEM_FORCE_PTRACE }, + { "never", PROC_MEM_FORCE_NEVER }, + { } +}; + +static int __init early_proc_mem_force_override(char *buf) +{ + if (!buf) + return -EINVAL; + + /* + * lookup_constant() defaults to proc_mem_force_override to preseve + * the initial Kconfig choice in case an invalid param gets passed. + */ + proc_mem_force_override = lookup_constant(proc_mem_force_table, + buf, proc_mem_force_override); + + return 0; +} +early_param("proc_mem.force_override", early_proc_mem_force_override); + struct pid_entry { const char *name; unsigned int len; @@@ -862,31 -827,12 +862,31 @@@ static int __mem_open(struct inode *ino
static int mem_open(struct inode *inode, struct file *file) { - int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH); - - /* OK to pass negative loff_t, we can catch out-of-range */ - file->f_mode |= FMODE_UNSIGNED_OFFSET; + if (WARN_ON_ONCE(!(file->f_op->fop_flags & FOP_UNSIGNED_OFFSET))) + return -EINVAL; + return __mem_open(inode, file, PTRACE_MODE_ATTACH); +}
- return ret; +static bool proc_mem_foll_force(struct file *file, struct mm_struct *mm) +{ + struct task_struct *task; + bool ptrace_active = false; + + switch (proc_mem_force_override) { + case PROC_MEM_FORCE_NEVER: + return false; + case PROC_MEM_FORCE_PTRACE: + task = get_proc_task(file_inode(file)); + if (task) { + ptrace_active = READ_ONCE(task->ptrace) && + READ_ONCE(task->mm) == mm && + READ_ONCE(task->parent) == current; + put_task_struct(task); + } + return ptrace_active; + default: + return true; + } }
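The new /proc/<pid>/mem behaviour reduces to a three-way policy: "always" keeps the historical FOLL_FORCE semantics, "ptrace" only grants it to the active ptracer of the target, and "never" drops it entirely, with unknown boot parameter values falling back to the compiled-in default (which is what the lookup_constant() call above relies on). A compact userspace sketch of the same table lookup and decision, with all names invented for illustration:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum mem_force { FORCE_ALWAYS, FORCE_PTRACE, FORCE_NEVER };

static const struct { const char *name; enum mem_force val; } force_table[] = {
    { "always", FORCE_ALWAYS },
    { "ptrace", FORCE_PTRACE },
    { "never",  FORCE_NEVER  },
};

/* Unknown strings keep the built-in default, like the early_param above. */
static enum mem_force parse_force(const char *buf, enum mem_force def)
{
    for (size_t i = 0; i < sizeof(force_table) / sizeof(force_table[0]); i++)
        if (!strcmp(buf, force_table[i].name))
            return force_table[i].val;
    return def;
}

static bool use_foll_force(enum mem_force policy, bool writer_is_active_ptracer)
{
    switch (policy) {
    case FORCE_NEVER:
        return false;
    case FORCE_PTRACE:
        return writer_is_active_ptracer;
    default:
        return true;
    }
}

int main(void)
{
    enum mem_force policy = parse_force("ptrace", FORCE_ALWAYS);

    printf("FOLL_FORCE allowed: %d\n", use_foll_force(policy, false));
    return 0;
}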
static ssize_t mem_rw(struct file *file, char __user *buf, @@@ -909,9 -855,7 +909,9 @@@ if (!mmget_not_zero(mm)) goto free;
- flags = FOLL_FORCE | (write ? FOLL_WRITE : 0); + flags = write ? FOLL_WRITE : 0; + if (proc_mem_foll_force(file, mm)) + flags |= FOLL_FORCE;
while (count > 0) { size_t this_len = min_t(size_t, count, PAGE_SIZE); @@@ -988,7 -932,6 +988,7 @@@ static const struct file_operations pro .write = mem_write, .open = mem_open, .release = mem_release, + .fop_flags = FOP_UNSIGNED_OFFSET, };
static int environ_open(struct inode *inode, struct file *file) @@@ -2333,8 -2276,8 +2333,8 @@@ proc_map_files_instantiate(struct dentr inode->i_op = &proc_map_files_link_inode_operations; inode->i_size = 64;
- d_set_d_op(dentry, &tid_map_files_dentry_operations); - return d_splice_alias(inode, dentry); + return proc_splice_unmountable(inode, dentry, + &tid_map_files_dentry_operations); }
static struct dentry *proc_map_files_lookup(struct inode *dir, @@@ -2513,13 -2456,13 +2513,13 @@@ static void *timers_start(struct seq_fi if (!tp->sighand) return ERR_PTR(-ESRCH);
- return seq_list_start(&tp->task->signal->posix_timers, *pos); + return seq_hlist_start(&tp->task->signal->posix_timers, *pos); }
static void *timers_next(struct seq_file *m, void *v, loff_t *pos) { struct timers_private *tp = m->private; - return seq_list_next(v, &tp->task->signal->posix_timers, pos); + return seq_hlist_next(v, &tp->task->signal->posix_timers, pos); }
static void timers_stop(struct seq_file *m, void *v) @@@ -2548,7 -2491,7 +2548,7 @@@ static int show_timer(struct seq_file * [SIGEV_THREAD] = "thread", };
- timer = list_entry((struct list_head *)v, struct k_itimer, list); + timer = hlist_entry((struct hlist_node *)v, struct k_itimer, list); notify = timer->it_sigev_notify;
seq_printf(m, "ID: %d\n", timer->it_id); @@@ -2626,10 -2569,11 +2626,11 @@@ static ssize_t timerslack_ns_write(stru }
task_lock(p); - if (slack_ns == 0) - p->timer_slack_ns = p->default_timer_slack_ns; - else - p->timer_slack_ns = slack_ns; + if (rt_or_dl_task_policy(p)) + slack_ns = 0; + else if (slack_ns == 0) + slack_ns = p->default_timer_slack_ns; + p->timer_slack_ns = slack_ns; task_unlock(p);
out: @@@ -3927,12 -3871,12 +3928,12 @@@ static int proc_task_readdir(struct fil if (!dir_emit_dots(file, ctx)) return 0;
- /* f_version caches the tgid value that the last readdir call couldn't - * return. lseek aka telldir automagically resets f_version to 0. + /* We cache the tgid value that the last readdir call couldn't + * return and lseek resets it to 0. */ ns = proc_pid_ns(inode->i_sb); - tid = (int)file->f_version; - file->f_version = 0; + tid = (int)(intptr_t)file->private_data; + file->private_data = NULL; for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns); task; task = next_tid(task), ctx->pos++) { @@@ -3947,7 -3891,7 +3948,7 @@@ proc_task_instantiate, task, NULL)) { /* returning this tgid failed, save it as the first * pid for the next readir call */ - file->f_version = (u64)tid; + file->private_data = (void *)(intptr_t)tid; put_task_struct(task); break; } @@@ -3972,24 -3916,6 +3973,24 @@@ static int proc_task_getattr(struct mnt return 0; }
+/* + * proc_task_readdir() set @file->private_data to a positive integer + * value, so casting that to u64 is safe. generic_llseek_cookie() will + * set @cookie to 0, so casting to an int is safe. The WARN_ON_ONCE() is + * here to catch any unexpected change in behavior either in + * proc_task_readdir() or generic_llseek_cookie(). + */ +static loff_t proc_dir_llseek(struct file *file, loff_t offset, int whence) +{ + u64 cookie = (u64)(intptr_t)file->private_data; + loff_t off; + + off = generic_llseek_cookie(file, offset, whence, &cookie); + WARN_ON_ONCE(cookie > INT_MAX); + file->private_data = (void *)(intptr_t)cookie; /* serialized by f_pos_lock */ + return off; +} + static const struct inode_operations proc_task_inode_operations = { .lookup = proc_task_lookup, .getattr = proc_task_getattr, @@@ -4000,7 -3926,7 +4001,7 @@@ static const struct file_operations proc_task_operations = { .read = generic_read_dir, .iterate_shared = proc_task_readdir, - .llseek = generic_file_llseek, + .llseek = proc_dir_llseek, };
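Both proc_task_readdir() and proc_dir_llseek() above rely on the same trick: a small non-negative integer (the tgid to resume from) rides in the void *private_data slot via intptr_t casts, replacing the old f_version field. A minimal round-trip sketch of that pattern (a bare void * stands in for file->private_data here):

#include <stdint.h>
#include <stdio.h>

/* Stash a small non-negative int in a pointer-sized slot. */
static void stash_tid(void **slot, int tid)
{
    *slot = (void *)(intptr_t)tid;
}

/* Fetch it back and clear the slot, as the readdir path does. */
static int fetch_tid(void **slot)
{
    int tid = (int)(intptr_t)*slot;

    *slot = NULL;
    return tid;
}

int main(void)
{
    void *private_data = NULL;

    stash_tid(&private_data, 4711);
    printf("resume from tid %d\n", fetch_tid(&private_data));
    return 0;
}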
void __init set_proc_pid_nlink(void) diff --combined fs/select.c index f8880a6c069cf,ad171b7a5c11f..6a179c4c2632e --- a/fs/select.c +++ b/fs/select.c @@@ -77,19 -77,16 +77,16 @@@ u64 select_estimate_accuracy(struct tim { u64 ret; struct timespec64 now; + u64 slack = current->timer_slack_ns;
- /* - * Realtime tasks get a slack of 0 for obvious reasons. - */ - - if (rt_task(current)) + if (slack == 0) return 0;
ktime_get_ts64(&now); now = timespec64_sub(*tv, now); ret = __estimate_accuracy(&now); - if (ret < current->timer_slack_ns) - return current->timer_slack_ns; + if (ret < slack) + return slack; return ret; }
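With the rt_task() special case gone, select_estimate_accuracy() keys off the task's timer slack alone: a zero slack now stands in for the old realtime check (the timerslack_ns write path above likewise forces realtime tasks to zero), otherwise the estimate is clamped so it is never smaller than the slack. The rule itself, as a tiny standalone function:

#include <stdio.h>

/* accuracy is 0 when slack is 0, otherwise at least the slack itself */
static unsigned long long clamp_accuracy(unsigned long long estimate,
                                         unsigned long long slack)
{
    if (slack == 0)
        return 0;
    return estimate < slack ? slack : estimate;
}

int main(void)
{
    printf("%llu %llu\n", clamp_accuracy(10, 0), clamp_accuracy(10, 50000));
    return 0;
}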
@@@ -532,10 -529,10 +529,10 @@@ static noinline_for_stack int do_select continue; mask = EPOLLNVAL; f = fdget(i); - if (f.file) { + if (fd_file(f)) { wait_key_set(wait, in, out, bit, busy_flag); - mask = vfs_poll(f.file, wait); + mask = vfs_poll(fd_file(f), wait);
fdput(f); } @@@ -840,7 -837,7 +837,7 @@@ SYSCALL_DEFINE1(old_select, struct sel_ struct poll_list { struct poll_list *next; unsigned int len; - struct pollfd entries[]; + struct pollfd entries[] __counted_by(len); };
#define POLLFD_PER_PAGE ((PAGE_SIZE-sizeof(struct poll_list)) / sizeof(struct pollfd)) @@@ -864,13 -861,13 +861,13 @@@ static inline __poll_t do_pollfd(struc goto out; mask = EPOLLNVAL; f = fdget(fd); - if (!f.file) + if (!fd_file(f)) goto out;
/* userland u16 ->events contains POLL... bitmap */ filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP; pwait->_key = filter | busy_flag; - mask = vfs_poll(f.file, pwait); + mask = vfs_poll(fd_file(f), pwait); if (mask & busy_flag) *can_busy_poll = true; mask &= filter; /* Mask out unneeded events. */ diff --combined fs/signalfd.c index 777e889ab0e80,d0333bce015e9..736bebf935918 --- a/fs/signalfd.c +++ b/fs/signalfd.c @@@ -159,7 -159,7 +159,7 @@@ static ssize_t signalfd_dequeue(struct DECLARE_WAITQUEUE(wait, current);
spin_lock_irq(¤t->sighand->siglock); - ret = dequeue_signal(current, &ctx->sigmask, info, &type); + ret = dequeue_signal(&ctx->sigmask, info, &type); switch (ret) { case 0: if (!nonblock) @@@ -174,7 -174,7 +174,7 @@@ add_wait_queue(¤t->sighand->signalfd_wqh, &wait); for (;;) { set_current_state(TASK_INTERRUPTIBLE); - ret = dequeue_signal(current, &ctx->sigmask, info, &type); + ret = dequeue_signal(&ctx->sigmask, info, &type); if (ret != 0) break; if (signal_pending(current)) { @@@ -289,10 -289,10 +289,10 @@@ static int do_signalfd4(int ufd, sigset fd_install(ufd, file); } else { struct fd f = fdget(ufd); - if (!f.file) + if (!fd_file(f)) return -EBADF; - ctx = f.file->private_data; - if (f.file->f_op != &signalfd_fops) { + ctx = fd_file(f)->private_data; + if (fd_file(f)->f_op != &signalfd_fops) { fdput(f); return -EINVAL; } diff --combined include/linux/cleanup.h index a3d3e888cf1f3,9c6b4f2c01765..038b2d523bf88 --- a/include/linux/cleanup.h +++ b/include/linux/cleanup.h @@@ -4,6 -4,142 +4,142 @@@
#include <linux/compiler.h>
+ /** + * DOC: scope-based cleanup helpers + * + * The "goto error" pattern is notorious for introducing subtle resource + * leaks. It is tedious and error prone to add new resource acquisition + * constraints into code paths that already have several unwind + * conditions. The "cleanup" helpers enable the compiler to help with + * this tedium and can aid in maintaining LIFO (last in first out) + * unwind ordering to avoid unintentional leaks. + * + * As drivers make up the majority of the kernel code base, here is an + * example of using these helpers to clean up PCI drivers. The target of + * the cleanups are occasions where a goto is used to unwind a device + * reference (pci_dev_put()), or unlock the device (pci_dev_unlock()) + * before returning. + * + * The DEFINE_FREE() macro can arrange for PCI device references to be + * dropped when the associated variable goes out of scope:: + * + * DEFINE_FREE(pci_dev_put, struct pci_dev *, if (_T) pci_dev_put(_T)) + * ... + * struct pci_dev *dev __free(pci_dev_put) = + * pci_get_slot(parent, PCI_DEVFN(0, 0)); + * + * The above will automatically call pci_dev_put() if @dev is non-NULL + * when @dev goes out of scope (automatic variable scope). If a function + * wants to invoke pci_dev_put() on error, but return @dev (i.e. without + * freeing it) on success, it can do:: + * + * return no_free_ptr(dev); + * + * ...or:: + * + * return_ptr(dev); + * + * The DEFINE_GUARD() macro can arrange for the PCI device lock to be + * dropped when the scope where guard() is invoked ends:: + * + * DEFINE_GUARD(pci_dev, struct pci_dev *, pci_dev_lock(_T), pci_dev_unlock(_T)) + * ... + * guard(pci_dev)(dev); + * + * The lifetime of the lock obtained by the guard() helper follows the + * scope of automatic variable declaration. Take the following example:: + * + * func(...) + * { + * if (...) { + * ... + * guard(pci_dev)(dev); // pci_dev_lock() invoked here + * ... + * } // <- implied pci_dev_unlock() triggered here + * } + * + * Observe the lock is held for the remainder of the "if ()" block not + * the remainder of "func()". + * + * Now, when a function uses both __free() and guard(), or multiple + * instances of __free(), the LIFO order of variable definition order + * matters. GCC documentation says: + * + * "When multiple variables in the same scope have cleanup attributes, + * at exit from the scope their associated cleanup functions are run in + * reverse order of definition (last defined, first cleanup)." + * + * When the unwind order matters it requires that variables be defined + * mid-function scope rather than at the top of the file. 
Take the + * following example and notice the bug highlighted by "!!":: + * + * LIST_HEAD(list); + * DEFINE_MUTEX(lock); + * + * struct object { + * struct list_head node; + * }; + * + * static struct object *alloc_add(void) + * { + * struct object *obj; + * + * lockdep_assert_held(&lock); + * obj = kzalloc(sizeof(*obj), GFP_KERNEL); + * if (obj) { + * LIST_HEAD_INIT(&obj->node); + * list_add(obj->node, &list): + * } + * return obj; + * } + * + * static void remove_free(struct object *obj) + * { + * lockdep_assert_held(&lock); + * list_del(&obj->node); + * kfree(obj); + * } + * + * DEFINE_FREE(remove_free, struct object *, if (_T) remove_free(_T)) + * static int init(void) + * { + * struct object *obj __free(remove_free) = NULL; + * int err; + * + * guard(mutex)(&lock); + * obj = alloc_add(); + * + * if (!obj) + * return -ENOMEM; + * + * err = other_init(obj); + * if (err) + * return err; // remove_free() called without the lock!! + * + * no_free_ptr(obj); + * return 0; + * } + * + * That bug is fixed by changing init() to call guard() and define + + * initialize @obj in this order:: + * + * guard(mutex)(&lock); + * struct object *obj __free(remove_free) = alloc_add(); + * + * Given that the "__free(...) = NULL" pattern for variables defined at + * the top of the function poses this potential interdependency problem + * the recommendation is to always define and assign variables in one + * statement and not group variable definitions at the top of the + * function when __free() is used. + * + * Lastly, given that the benefit of cleanup helpers is removal of + * "goto", and that the "goto" statement can jump between scopes, the + * expectation is that usage of "goto" and cleanup helpers is never + * mixed in the same function. I.e. for a given routine, convert all + * resources that need a "goto" cleanup to scope-based cleanup, or + * convert none of them. + */ + /* * DEFINE_FREE(name, type, free): * simple helper macro that defines the required wrapper for a __free() @@@ -98,7 -234,7 +234,7 @@@ const volatile void * __must_check_fn(c * DEFINE_CLASS(fdget, struct fd, fdput(_T), fdget(fd), int fd) * * CLASS(fdget, f)(fd); - * if (!f.file) + * if (!fd_file(f)) * return -EBADF; * * // use 'f' without concern diff --combined include/linux/pci_ids.h index 2c94d4004dd50,91182aa1d2ec5..e4bddb9277956 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@@ -580,6 -580,7 +580,7 @@@ #define PCI_DEVICE_ID_AMD_19H_M78H_DF_F3 0x12fb #define PCI_DEVICE_ID_AMD_1AH_M00H_DF_F3 0x12c3 #define PCI_DEVICE_ID_AMD_1AH_M20H_DF_F3 0x16fb + #define PCI_DEVICE_ID_AMD_1AH_M60H_DF_F3 0x124b #define PCI_DEVICE_ID_AMD_1AH_M70H_DF_F3 0x12bb #define PCI_DEVICE_ID_AMD_MI200_DF_F3 0x14d3 #define PCI_DEVICE_ID_AMD_MI300_DF_F3 0x152b @@@ -2661,8 -2662,6 +2662,8 @@@ #define PCI_DEVICE_ID_DCI_PCCOM8 0x0002 #define PCI_DEVICE_ID_DCI_PCCOM2 0x0004
+#define PCI_VENDOR_ID_GLENFLY 0x6766 + #define PCI_VENDOR_ID_INTEL 0x8086 #define PCI_DEVICE_ID_INTEL_EESSC 0x0008 #define PCI_DEVICE_ID_INTEL_HDA_CML_LP 0x02c8 diff --combined include/linux/perf_event.h index e336306b8c08e,794f660578780..fb908843f2092 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@@ -168,6 -168,9 +168,9 @@@ struct hw_perf_event struct hw_perf_event_extra extra_reg; struct hw_perf_event_extra branch_reg; }; + struct { /* aux / Intel-PT */ + u64 aux_config; + }; struct { /* software */ struct hrtimer hrtimer; }; @@@ -292,6 -295,19 +295,19 @@@ struct perf_event_pmu_context #define PERF_PMU_CAP_AUX_OUTPUT 0x0080 #define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
+ /** + * pmu::scope + */ + enum perf_pmu_scope { + PERF_PMU_SCOPE_NONE = 0, + PERF_PMU_SCOPE_CORE, + PERF_PMU_SCOPE_DIE, + PERF_PMU_SCOPE_CLUSTER, + PERF_PMU_SCOPE_PKG, + PERF_PMU_SCOPE_SYS_WIDE, + PERF_PMU_MAX_SCOPE, + }; + struct perf_output_handle;
#define PMU_NULL_DEV ((void *)(~0UL)) @@@ -315,6 -331,11 +331,11 @@@ struct pmu */ int capabilities;
+ /* + * PMU scope + */ + unsigned int scope; + int __percpu *pmu_disable_count; struct perf_cpu_pmu_context __percpu *cpu_pmu_context; atomic_t exclusive_cnt; /* < 0: cpu; > 0: tsk */ @@@ -615,10 -636,13 +636,13 @@@ typedef void (*perf_overflow_handler_t) * PERF_EV_CAP_SIBLING: An event with this flag must be a group sibling and * cannot be a group leader. If an event with this flag is detached from the * group it is scheduled out and moved into an unrecoverable ERROR state. + * PERF_EV_CAP_READ_SCOPE: A CPU event that can be read from any CPU of the + * PMU scope where it is active. */ #define PERF_EV_CAP_SOFTWARE BIT(0) #define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1) #define PERF_EV_CAP_SIBLING BIT(2) + #define PERF_EV_CAP_READ_SCOPE BIT(3)
#define SWEVENT_HLIST_BITS 8 #define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS) @@@ -963,12 -987,16 +987,16 @@@ struct perf_event_context struct rcu_head rcu_head;
/* - * Sum (event->pending_work + event->pending_work) + * The count of events for which using the switch-out fast path + * should be avoided. + * + * Sum (event->pending_work + events with + * (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ))) * * The SIGTRAP is targeted at ctx->task, as such it won't do changing * that until the signal is delivered. */ - local_t nr_pending; + local_t nr_no_switch_fast; };
struct perf_cpu_pmu_context { @@@ -1602,7 -1630,13 +1630,7 @@@ static inline int perf_is_paranoid(void return sysctl_perf_event_paranoid > -1; }
-static inline int perf_allow_kernel(struct perf_event_attr *attr) -{ - if (sysctl_perf_event_paranoid > 1 && !perfmon_capable()) - return -EACCES; - - return security_perf_event_open(attr, PERF_SECURITY_KERNEL); -} +int perf_allow_kernel(struct perf_event_attr *attr);
static inline int perf_allow_cpu(struct perf_event_attr *attr) { diff --combined include/linux/uprobes.h index 493dc95d912c9,2b294bf1881fe..e6f4e73125ffa --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@@ -16,6 -16,7 +16,7 @@@ #include <linux/types.h> #include <linux/wait.h>
+ struct uprobe; struct vm_area_struct; struct mm_struct; struct inode; @@@ -27,22 -28,22 +28,22 @@@ struct page
#define MAX_URETPROBE_DEPTH 64
- enum uprobe_filter_ctx { - UPROBE_FILTER_REGISTER, - UPROBE_FILTER_UNREGISTER, - UPROBE_FILTER_MMAP, - }; - struct uprobe_consumer { + /* + * handler() can return UPROBE_HANDLER_REMOVE to signal the need to + * unregister uprobe for current process. If UPROBE_HANDLER_REMOVE is + * returned, filter() callback has to be implemented as well and it + * should return false to "confirm" the decision to uninstall uprobe + * for the current process. If filter() is omitted or returns true, + * UPROBE_HANDLER_REMOVE is effectively ignored. + */ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); int (*ret_handler)(struct uprobe_consumer *self, unsigned long func, struct pt_regs *regs); - bool (*filter)(struct uprobe_consumer *self, - enum uprobe_filter_ctx ctx, - struct mm_struct *mm); + bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
- struct uprobe_consumer *next; + struct list_head cons_node; };
#ifdef CONFIG_UPROBES @@@ -76,6 -77,8 +77,8 @@@ struct uprobe_task struct uprobe *active_uprobe; unsigned long xol_vaddr;
+ struct arch_uprobe *auprobe; + struct return_instance *return_instances; unsigned int depth; }; @@@ -110,10 -113,10 +113,10 @@@ extern bool is_trap_insn(uprobe_opcode_ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); - extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); - extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc); - extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); - extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); + extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc); + extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool); + extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc); + extern void uprobe_unregister_sync(void); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); @@@ -126,6 -129,7 +129,6 @@@ extern int uprobe_pre_sstep_notifier(st extern void uprobe_notify_resume(struct pt_regs *regs); extern bool uprobe_deny_signal(void); extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs); -extern void uprobe_clear_state(struct mm_struct *mm); extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr); extern int arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs); extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs); @@@ -150,22 -154,21 +153,21 @@@ static inline void uprobes_init(void
#define uprobe_get_trap_addr(regs) instruction_pointer(regs)
- static inline int - uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) - { - return -ENOSYS; - } - static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) + static inline struct uprobe * + uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) { - return -ENOSYS; + return ERR_PTR(-ENOSYS); } static inline int - uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add) + uprobe_apply(struct uprobe* uprobe, struct uprobe_consumer *uc, bool add) { return -ENOSYS; } static inline void - uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) + uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc) + { + } + static inline void uprobe_unregister_sync(void) { } static inline int uprobe_mmap(struct vm_area_struct *vma) diff --combined include/uapi/linux/elf.h index 81762ff3c99e1,e30a9b47dc87f..b9935988da5cf --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@@ -411,6 -411,7 +411,7 @@@ typedef struct elf64_shdr #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */ /* Old binutils treats 0x203 as a CET state */ #define NT_X86_SHSTK 0x204 /* x86 SHSTK state */ + #define NT_X86_XSAVE_LAYOUT 0x205 /* XSAVE layout description */ #define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */ #define NT_S390_TIMER 0x301 /* s390 timer register */ #define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */ @@@ -441,7 -442,6 +442,7 @@@ #define NT_ARM_ZA 0x40c /* ARM SME ZA registers */ #define NT_ARM_ZT 0x40d /* ARM SME ZT registers */ #define NT_ARM_FPMR 0x40e /* ARM floating point mode register */ +#define NT_ARM_POE 0x40f /* ARM POE registers */ #define NT_ARC_V2 0x600 /* ARCv2 accumulator/extra registers */ #define NT_VMCOREDD 0x700 /* Vmcore Device Dump Note */ #define NT_MIPS_DSP 0x800 /* MIPS DSP ASE registers */ diff --combined kernel/events/core.c index d932557d5664b,2766090de84e4..bfd5553e53b28 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@@ -155,20 -155,55 +155,55 @@@ static int cpu_function_call(int cpu, r return data.ret; }
+ enum event_type_t { + EVENT_FLEXIBLE = 0x01, + EVENT_PINNED = 0x02, + EVENT_TIME = 0x04, + EVENT_FROZEN = 0x08, + /* see ctx_resched() for details */ + EVENT_CPU = 0x10, + EVENT_CGROUP = 0x20, + + /* compound helpers */ + EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED, + EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN, + }; + + static inline void __perf_ctx_lock(struct perf_event_context *ctx) + { + raw_spin_lock(&ctx->lock); + WARN_ON_ONCE(ctx->is_active & EVENT_FROZEN); + } + static void perf_ctx_lock(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) { - raw_spin_lock(&cpuctx->ctx.lock); + __perf_ctx_lock(&cpuctx->ctx); if (ctx) - raw_spin_lock(&ctx->lock); + __perf_ctx_lock(ctx); + } + + static inline void __perf_ctx_unlock(struct perf_event_context *ctx) + { + /* + * If ctx_sched_in() didn't again set any ALL flags, clean up + * after ctx_sched_out() by clearing is_active. + */ + if (ctx->is_active & EVENT_FROZEN) { + if (!(ctx->is_active & EVENT_ALL)) + ctx->is_active = 0; + else + ctx->is_active &= ~EVENT_FROZEN; + } + raw_spin_unlock(&ctx->lock); }
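EVENT_FROZEN turns ctx->is_active into a small state machine: the freeze side (ctx_time_freeze() and ctx_sched_out() further down) marks a context whose time has been stopped, __perf_ctx_lock() asserts the flag is never seen on entry, and __perf_ctx_unlock() above either clears is_active completely (nothing was rescheduled) or merely drops EVENT_FROZEN (pinned/flexible events were put back). A toy sketch of those two transitions on a plain flags word, with the locking itself left out:

#include <stdio.h>

#define EVENT_FLEXIBLE 0x01
#define EVENT_PINNED   0x02
#define EVENT_TIME     0x04
#define EVENT_FROZEN   0x08
#define EVENT_ALL      (EVENT_FLEXIBLE | EVENT_PINNED)

/* what the freeze side does to is_active */
static unsigned int freeze(unsigned int is_active)
{
    if (is_active & EVENT_TIME)
        is_active |= EVENT_FROZEN;
    return is_active;
}

/* what __perf_ctx_unlock() does on the way out */
static unsigned int unlock_cleanup(unsigned int is_active)
{
    if (is_active & EVENT_FROZEN) {
        if (!(is_active & EVENT_ALL))
            is_active = 0;                  /* nothing rescheduled: fully idle */
        else
            is_active &= ~EVENT_FROZEN;     /* events re-added: keep TIME running */
    }
    return is_active;
}

int main(void)
{
    unsigned int a = freeze(EVENT_TIME);                  /* frozen, no events left */
    unsigned int b = freeze(EVENT_TIME | EVENT_PINNED);   /* frozen, pinned re-added */

    printf("%#x -> %#x\n", a, unlock_cleanup(a));
    printf("%#x -> %#x\n", b, unlock_cleanup(b));
    return 0;
}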
static void perf_ctx_unlock(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) { if (ctx) - raw_spin_unlock(&ctx->lock); - raw_spin_unlock(&cpuctx->ctx.lock); + __perf_ctx_unlock(ctx); + __perf_ctx_unlock(&cpuctx->ctx); }
#define TASK_TOMBSTONE ((void *)-1L) @@@ -264,6 -299,7 +299,7 @@@ static void event_function_call(struct { struct perf_event_context *ctx = event->ctx; struct task_struct *task = READ_ONCE(ctx->task); /* verified in event_function */ + struct perf_cpu_context *cpuctx; struct event_function_struct efs = { .event = event, .func = func, @@@ -291,22 -327,25 +327,25 @@@ again if (!task_function_call(task, event_function, &efs)) return;
- raw_spin_lock_irq(&ctx->lock); + local_irq_disable(); + cpuctx = this_cpu_ptr(&perf_cpu_context); + perf_ctx_lock(cpuctx, ctx); /* * Reload the task pointer, it might have been changed by * a concurrent perf_event_context_sched_out(). */ task = ctx->task; - if (task == TASK_TOMBSTONE) { - raw_spin_unlock_irq(&ctx->lock); - return; - } + if (task == TASK_TOMBSTONE) + goto unlock; if (ctx->is_active) { - raw_spin_unlock_irq(&ctx->lock); + perf_ctx_unlock(cpuctx, ctx); + local_irq_enable(); goto again; } func(event, NULL, ctx, data); - raw_spin_unlock_irq(&ctx->lock); + unlock: + perf_ctx_unlock(cpuctx, ctx); + local_irq_enable(); }
/* @@@ -369,16 -408,6 +408,6 @@@ unlock (PERF_SAMPLE_BRANCH_KERNEL |\ PERF_SAMPLE_BRANCH_HV)
- enum event_type_t { - EVENT_FLEXIBLE = 0x1, - EVENT_PINNED = 0x2, - EVENT_TIME = 0x4, - /* see ctx_resched() for details */ - EVENT_CPU = 0x8, - EVENT_CGROUP = 0x10, - EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED, - }; - /* * perf_sched_events : >0 events exist */ @@@ -407,6 -436,11 +436,11 @@@ static LIST_HEAD(pmus) static DEFINE_MUTEX(pmus_lock); static struct srcu_struct pmus_srcu; static cpumask_var_t perf_online_mask; + static cpumask_var_t perf_online_core_mask; + static cpumask_var_t perf_online_die_mask; + static cpumask_var_t perf_online_cluster_mask; + static cpumask_var_t perf_online_pkg_mask; + static cpumask_var_t perf_online_sys_mask; static struct kmem_cache *perf_event_cache;
/* @@@ -685,30 -719,32 +719,32 @@@ do { ___p; \ })
+ #define for_each_epc(_epc, _ctx, _pmu, _cgroup) \ + list_for_each_entry(_epc, &((_ctx)->pmu_ctx_list), pmu_ctx_entry) \ + if (_cgroup && !_epc->nr_cgroups) \ + continue; \ + else if (_pmu && _epc->pmu != _pmu) \ + continue; \ + else + static void perf_ctx_disable(struct perf_event_context *ctx, bool cgroup) { struct perf_event_pmu_context *pmu_ctx;
- list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) { - if (cgroup && !pmu_ctx->nr_cgroups) - continue; + for_each_epc(pmu_ctx, ctx, NULL, cgroup) perf_pmu_disable(pmu_ctx->pmu); - } }
static void perf_ctx_enable(struct perf_event_context *ctx, bool cgroup) { struct perf_event_pmu_context *pmu_ctx;
- list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) { - if (cgroup && !pmu_ctx->nr_cgroups) - continue; + for_each_epc(pmu_ctx, ctx, NULL, cgroup) perf_pmu_enable(pmu_ctx->pmu); - } }
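for_each_epc() above uses the "continue inside the macro, caller's statement binds to the trailing else" idiom, which lets the wrapper filter entries by pmu and cgroup without requiring braces at the call site. The same shape in a trivial standalone form (the even-number filter is purely illustrative):

#include <stdio.h>

/*
 * The caller's loop body attaches to the final "else", so filtered-out
 * iterations fall through the "continue" branch, just as in for_each_epc().
 */
#define for_each_even(i, n)               \
    for (int i = 0; i < (n); i++)         \
        if ((i) & 1)                      \
            continue;                     \
        else

int main(void)
{
    for_each_even(i, 8)
        printf("%d ", i);   /* prints 0 2 4 6 */
    printf("\n");
    return 0;
}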
- static void ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type); - static void ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type); + static void ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type); + static void ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type);
#ifdef CONFIG_CGROUP_PERF
@@@ -865,7 -901,7 +901,7 @@@ static void perf_cgroup_switch(struct t perf_ctx_lock(cpuctx, cpuctx->task_ctx); perf_ctx_disable(&cpuctx->ctx, true);
- ctx_sched_out(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP); + ctx_sched_out(&cpuctx->ctx, NULL, EVENT_ALL|EVENT_CGROUP); /* * must not be done before ctxswout due * to update_cgrp_time_from_cpuctx() in @@@ -877,7 -913,7 +913,7 @@@ * perf_cgroup_set_timestamp() in ctx_sched_in() * to not have to pass task around */ - ctx_sched_in(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP); + ctx_sched_in(&cpuctx->ctx, NULL, EVENT_ALL|EVENT_CGROUP);
perf_ctx_enable(&cpuctx->ctx, true); perf_ctx_unlock(cpuctx, cpuctx->task_ctx); @@@ -933,10 -969,10 +969,10 @@@ static inline int perf_cgroup_connect(i struct fd f = fdget(fd); int ret = 0;
- if (!f.file) + if (!fd_file(f)) return -EBADF;
- css = css_tryget_online_from_dir(f.file->f_path.dentry, + css = css_tryget_online_from_dir(fd_file(f)->f_path.dentry, &perf_event_cgrp_subsys); if (IS_ERR(css)) { ret = PTR_ERR(css); @@@ -1768,6 -1804,14 +1804,14 @@@ perf_event_groups_next(struct perf_even event = rb_entry_safe(rb_next(&event->group_node), \ typeof(*event), group_node))
+ /* + * Does the event attribute request inherit with PERF_SAMPLE_READ + */ + static inline bool has_inherit_and_sample_read(struct perf_event_attr *attr) + { + return attr->inherit && (attr->sample_type & PERF_SAMPLE_READ); + } + /* * Add an event from the lists for its context. * Must be called with ctx->mutex and ctx->lock held. @@@ -1798,6 -1842,8 +1842,8 @@@ list_add_event(struct perf_event *event ctx->nr_user++; if (event->attr.inherit_stat) ctx->nr_stat++; + if (has_inherit_and_sample_read(&event->attr)) + local_inc(&ctx->nr_no_switch_fast);
if (event->state > PERF_EVENT_STATE_OFF) perf_cgroup_event_enable(event, ctx); @@@ -2022,6 -2068,8 +2068,8 @@@ list_del_event(struct perf_event *event ctx->nr_user--; if (event->attr.inherit_stat) ctx->nr_stat--; + if (has_inherit_and_sample_read(&event->attr)) + local_dec(&ctx->nr_no_switch_fast);
list_del_rcu(&event->event_entry);
@@@ -2317,6 -2365,45 +2365,45 @@@ group_sched_out(struct perf_event *grou event_sched_out(event, ctx); }
+ static inline void + __ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx, bool final) + { + if (ctx->is_active & EVENT_TIME) { + if (ctx->is_active & EVENT_FROZEN) + return; + update_context_time(ctx); + update_cgrp_time_from_cpuctx(cpuctx, final); + } + } + + static inline void + ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) + { + __ctx_time_update(cpuctx, ctx, false); + } + + /* + * To be used inside perf_ctx_lock() / perf_ctx_unlock(). Lasts until perf_ctx_unlock(). + */ + static inline void + ctx_time_freeze(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) + { + ctx_time_update(cpuctx, ctx); + if (ctx->is_active & EVENT_TIME) + ctx->is_active |= EVENT_FROZEN; + } + + static inline void + ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event) + { + if (ctx->is_active & EVENT_TIME) { + if (ctx->is_active & EVENT_FROZEN) + return; + update_context_time(ctx); + update_cgrp_time_from_event(event); + } + } + #define DETACH_GROUP 0x01UL #define DETACH_CHILD 0x02UL #define DETACH_DEAD 0x04UL @@@ -2336,10 -2423,7 +2423,7 @@@ __perf_remove_from_context(struct perf_ struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx; unsigned long flags = (unsigned long)info;
- if (ctx->is_active & EVENT_TIME) { - update_context_time(ctx); - update_cgrp_time_from_cpuctx(cpuctx, false); - } + ctx_time_update(cpuctx, ctx);
/* * Ensure event_sched_out() switches to OFF, at the very least @@@ -2424,12 -2508,8 +2508,8 @@@ static void __perf_event_disable(struc if (event->state < PERF_EVENT_STATE_INACTIVE) return;
- if (ctx->is_active & EVENT_TIME) { - update_context_time(ctx); - update_cgrp_time_from_event(event); - } - perf_pmu_disable(event->pmu_ctx->pmu); + ctx_time_update_event(ctx, event);
if (event == event->group_leader) group_sched_out(event, ctx); @@@ -2645,7 -2725,8 +2725,8 @@@ static void add_event_to_ctx(struct per }
static void task_ctx_sched_out(struct perf_event_context *ctx, - enum event_type_t event_type) + struct pmu *pmu, + enum event_type_t event_type) { struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
@@@ -2655,18 -2736,19 +2736,19 @@@ if (WARN_ON_ONCE(ctx != cpuctx->task_ctx)) return;
- ctx_sched_out(ctx, event_type); + ctx_sched_out(ctx, pmu, event_type); }
static void perf_event_sched_in(struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx) + struct perf_event_context *ctx, + struct pmu *pmu) { - ctx_sched_in(&cpuctx->ctx, EVENT_PINNED); + ctx_sched_in(&cpuctx->ctx, pmu, EVENT_PINNED); if (ctx) - ctx_sched_in(ctx, EVENT_PINNED); - ctx_sched_in(&cpuctx->ctx, EVENT_FLEXIBLE); + ctx_sched_in(ctx, pmu, EVENT_PINNED); + ctx_sched_in(&cpuctx->ctx, pmu, EVENT_FLEXIBLE); if (ctx) - ctx_sched_in(ctx, EVENT_FLEXIBLE); + ctx_sched_in(ctx, pmu, EVENT_FLEXIBLE); }
/* @@@ -2684,16 -2766,12 +2766,12 @@@ * event_type is a bit mask of the types of events involved. For CPU events, * event_type is only either EVENT_PINNED or EVENT_FLEXIBLE. */ - /* - * XXX: ctx_resched() reschedule entire perf_event_context while adding new - * event to the context or enabling existing event in the context. We can - * probably optimize it by rescheduling only affected pmu_ctx. - */ static void ctx_resched(struct perf_cpu_context *cpuctx, struct perf_event_context *task_ctx, - enum event_type_t event_type) + struct pmu *pmu, enum event_type_t event_type) { bool cpu_event = !!(event_type & EVENT_CPU); + struct perf_event_pmu_context *epc;
/* * If pinned groups are involved, flexible groups also need to be @@@ -2704,10 -2782,14 +2782,14 @@@
event_type &= EVENT_ALL;
- perf_ctx_disable(&cpuctx->ctx, false); + for_each_epc(epc, &cpuctx->ctx, pmu, false) + perf_pmu_disable(epc->pmu); + if (task_ctx) { - perf_ctx_disable(task_ctx, false); - task_ctx_sched_out(task_ctx, event_type); + for_each_epc(epc, task_ctx, pmu, false) + perf_pmu_disable(epc->pmu); + + task_ctx_sched_out(task_ctx, pmu, event_type); }
/* @@@ -2718,15 -2800,19 +2800,19 @@@ * - otherwise, do nothing more. */ if (cpu_event) - ctx_sched_out(&cpuctx->ctx, event_type); + ctx_sched_out(&cpuctx->ctx, pmu, event_type); else if (event_type & EVENT_PINNED) - ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE); + ctx_sched_out(&cpuctx->ctx, pmu, EVENT_FLEXIBLE); + + perf_event_sched_in(cpuctx, task_ctx, pmu);
- perf_event_sched_in(cpuctx, task_ctx); + for_each_epc(epc, &cpuctx->ctx, pmu, false) + perf_pmu_enable(epc->pmu);
- perf_ctx_enable(&cpuctx->ctx, false); - if (task_ctx) - perf_ctx_enable(task_ctx, false); + if (task_ctx) { + for_each_epc(epc, task_ctx, pmu, false) + perf_pmu_enable(epc->pmu); + } }
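The for_each_epc() iterator that ctx_resched() now leans on is defined outside the hunks shown here. A minimal sketch of its assumed shape, inferred purely from the call sites above (walk every perf_event_pmu_context of @ctx, optionally restricted to one @pmu, and skip EPCs without cgroup events when @cgroup is set), would be:

#define for_each_epc(_epc, _ctx, _pmu, _cgroup)				\
	list_for_each_entry(_epc, &((_ctx)->pmu_ctx_list), pmu_ctx_entry) \
		if ((_cgroup) && !(_epc)->nr_cgroups)			\
			continue;					\
		else if ((_pmu) && (_epc)->pmu != (_pmu))		\
			continue;					\
		else

With a NULL @pmu (as in task_ctx_sched_out(ctx, NULL, EVENT_ALL) further down) every EPC is visited, which preserves the old whole-context behaviour.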
void perf_pmu_resched(struct pmu *pmu) @@@ -2735,7 -2821,7 +2821,7 @@@ struct perf_event_context *task_ctx = cpuctx->task_ctx;
perf_ctx_lock(cpuctx, task_ctx); - ctx_resched(cpuctx, task_ctx, EVENT_ALL|EVENT_CPU); + ctx_resched(cpuctx, task_ctx, pmu, EVENT_ALL|EVENT_CPU); perf_ctx_unlock(cpuctx, task_ctx); }
@@@ -2791,9 -2877,10 +2877,10 @@@ static int __perf_install_in_context(v #endif
if (reprogram) { - ctx_sched_out(ctx, EVENT_TIME); + ctx_time_freeze(cpuctx, ctx); add_event_to_ctx(event, ctx); - ctx_resched(cpuctx, task_ctx, get_event_type(event)); + ctx_resched(cpuctx, task_ctx, event->pmu_ctx->pmu, + get_event_type(event)); } else { add_event_to_ctx(event, ctx); } @@@ -2936,8 -3023,7 +3023,7 @@@ static void __perf_event_enable(struct event->state <= PERF_EVENT_STATE_ERROR) return;
- if (ctx->is_active) - ctx_sched_out(ctx, EVENT_TIME); + ctx_time_freeze(cpuctx, ctx);
perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); perf_cgroup_event_enable(event, ctx); @@@ -2945,25 -3031,21 +3031,21 @@@ if (!ctx->is_active) return;
- if (!event_filter_match(event)) { - ctx_sched_in(ctx, EVENT_TIME); + if (!event_filter_match(event)) return; - }
/* * If the event is in a group and isn't the group leader, * then don't put it on unless the group is on. */ - if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) { - ctx_sched_in(ctx, EVENT_TIME); + if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) return; - }
task_ctx = cpuctx->task_ctx; if (ctx->task) WARN_ON_ONCE(task_ctx != ctx);
- ctx_resched(cpuctx, task_ctx, get_event_type(event)); + ctx_resched(cpuctx, task_ctx, event->pmu_ctx->pmu, get_event_type(event)); }
/* @@@ -3231,7 -3313,7 +3313,7 @@@ static void __pmu_ctx_sched_out(struct struct perf_event *event, *tmp; struct pmu *pmu = pmu_ctx->pmu;
- if (ctx->task && !ctx->is_active) { + if (ctx->task && !(ctx->is_active & EVENT_ALL)) { struct perf_cpu_pmu_context *cpc;
cpc = this_cpu_ptr(pmu->cpu_pmu_context); @@@ -3239,7 -3321,7 +3321,7 @@@ cpc->task_epc = NULL; }
- if (!event_type) + if (!(event_type & EVENT_ALL)) return;
perf_pmu_disable(pmu); @@@ -3265,8 -3347,17 +3347,17 @@@ perf_pmu_enable(pmu); }
+ /* + * Be very careful with the @pmu argument since this will change ctx state. + * The @pmu argument works for ctx_resched(), because that is symmetric in + * ctx_sched_out() / ctx_sched_in() usage and the ctx state ends up invariant. + * + * However, if you were to be asymmetrical, you could end up with messed up + * state, eg. ctx->is_active cleared even though most EPCs would still actually + * be active. + */ static void - ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type) + ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type) { struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context); struct perf_event_pmu_context *pmu_ctx; @@@ -3297,34 -3388,36 +3388,36 @@@ * * would only update time for the pinned events. */ - if (is_active & EVENT_TIME) { - /* update (and stop) ctx time */ - update_context_time(ctx); - update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx); + __ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx); + + /* + * CPU-release for the below ->is_active store, + * see __load_acquire() in perf_event_time_now() + */ + barrier(); + ctx->is_active &= ~event_type; + + if (!(ctx->is_active & EVENT_ALL)) { /* - * CPU-release for the below ->is_active store, - * see __load_acquire() in perf_event_time_now() + * For FROZEN, preserve TIME|FROZEN such that perf_event_time_now() + * does not observe a hole. perf_ctx_unlock() will clean up. */ - barrier(); + if (ctx->is_active & EVENT_FROZEN) + ctx->is_active &= EVENT_TIME_FROZEN; + else + ctx->is_active = 0; }
- ctx->is_active &= ~event_type; - if (!(ctx->is_active & EVENT_ALL)) - ctx->is_active = 0; - if (ctx->task) { WARN_ON_ONCE(cpuctx->task_ctx != ctx); - if (!ctx->is_active) + if (!(ctx->is_active & EVENT_ALL)) cpuctx->task_ctx = NULL; }
is_active ^= ctx->is_active; /* changed bits */
- list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) { - if (cgroup && !pmu_ctx->nr_cgroups) - continue; + for_each_epc(pmu_ctx, ctx, pmu, cgroup) __pmu_ctx_sched_out(pmu_ctx, is_active); - } }
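The clean-up that the comments above defer to perf_ctx_unlock() is also outside these hunks. Assuming the unlock side only has to drop EVENT_FROZEN (and clear is_active entirely when no event bits were re-set by a later ctx_sched_in()), a sketch of that path would be:

/* Illustrative sketch only; the real helper lives elsewhere in kernel/events/core.c. */
static void __perf_ctx_unlock(struct perf_event_context *ctx)
{
	/*
	 * If ctx_sched_in() did not re-activate anything, finish what
	 * ctx_sched_out() left pending and drop the frozen time state.
	 */
	if (ctx->is_active & EVENT_FROZEN) {
		if (!(ctx->is_active & EVENT_ALL))
			ctx->is_active = 0;
		else
			ctx->is_active &= ~EVENT_FROZEN;
	}
	raw_spin_unlock(&ctx->lock);
}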
/* @@@ -3517,12 -3610,17 +3610,17 @@@ perf_event_context_sched_out(struct tas
perf_ctx_disable(ctx, false);
- /* PMIs are disabled; ctx->nr_pending is stable. */ - if (local_read(&ctx->nr_pending) || - local_read(&next_ctx->nr_pending)) { + /* PMIs are disabled; ctx->nr_no_switch_fast is stable. */ + if (local_read(&ctx->nr_no_switch_fast) || + local_read(&next_ctx->nr_no_switch_fast)) { /* * Must not swap out ctx when there's pending * events that rely on the ctx->task relation. + * + * Likewise, when a context contains inherit + + * SAMPLE_READ events they should be switched + * out using the slow path so that they are + * treated as if they were distinct contexts. */ raw_spin_unlock(&next_ctx->lock); rcu_read_unlock(); @@@ -3563,7 -3661,7 +3661,7 @@@ unlock
inside_switch: perf_ctx_sched_task_cb(ctx, false); - task_ctx_sched_out(ctx, EVENT_ALL); + task_ctx_sched_out(ctx, NULL, EVENT_ALL);
perf_ctx_enable(ctx, false); raw_spin_unlock(&ctx->lock); @@@ -3861,29 -3959,22 +3959,22 @@@ static void pmu_groups_sched_in(struct merge_sched_in, &can_add_hw); }
- static void ctx_groups_sched_in(struct perf_event_context *ctx, - struct perf_event_groups *groups, - bool cgroup) + static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx, + enum event_type_t event_type) { - struct perf_event_pmu_context *pmu_ctx; - - list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) { - if (cgroup && !pmu_ctx->nr_cgroups) - continue; - pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu); - } - } + struct perf_event_context *ctx = pmu_ctx->ctx;
- static void __pmu_ctx_sched_in(struct perf_event_context *ctx, - struct pmu *pmu) - { - pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu); + if (event_type & EVENT_PINNED) + pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu); + if (event_type & EVENT_FLEXIBLE) + pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu); }
static void - ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type) + ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type) { struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context); + struct perf_event_pmu_context *pmu_ctx; int is_active = ctx->is_active; bool cgroup = event_type & EVENT_CGROUP;
@@@ -3907,7 -3998,7 +3998,7 @@@
ctx->is_active |= (event_type | EVENT_TIME); if (ctx->task) { - if (!is_active) + if (!(is_active & EVENT_ALL)) cpuctx->task_ctx = ctx; else WARN_ON_ONCE(cpuctx->task_ctx != ctx); @@@ -3919,12 -4010,16 +4010,16 @@@ * First go through the list and put on any pinned groups * in order to give them the best chance of going on. */ - if (is_active & EVENT_PINNED) - ctx_groups_sched_in(ctx, &ctx->pinned_groups, cgroup); + if (is_active & EVENT_PINNED) { + for_each_epc(pmu_ctx, ctx, pmu, cgroup) + __pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED); + }
/* Then walk through the lower prio flexible groups */ - if (is_active & EVENT_FLEXIBLE) - ctx_groups_sched_in(ctx, &ctx->flexible_groups, cgroup); + if (is_active & EVENT_FLEXIBLE) { + for_each_epc(pmu_ctx, ctx, pmu, cgroup) + __pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE); + } }
static void perf_event_context_sched_in(struct task_struct *task) @@@ -3967,10 -4062,10 +4062,10 @@@ */ if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) { perf_ctx_disable(&cpuctx->ctx, false); - ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE); + ctx_sched_out(&cpuctx->ctx, NULL, EVENT_FLEXIBLE); }
- perf_event_sched_in(cpuctx, ctx); + perf_event_sched_in(cpuctx, ctx, NULL);
perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
@@@ -4093,7 -4188,11 +4188,11 @@@ static void perf_adjust_period(struct p period = perf_calculate_period(event, nsec, count);
delta = (s64)(period - hwc->sample_period); - delta = (delta + 7) / 8; /* low pass filter */ + if (delta >= 0) + delta += 7; + else + delta -= 7; + delta /= 8; /* low pass filter */
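	/*
	 * Worked example of the rounding above (illustration, not taken
	 * from the patch): the old (delta + 7) / 8 truncates toward zero,
	 * so delta = +9 became 16/8 = 2 while delta = -9 became -2/8 = 0
	 * and small negative corrections were lost.  Rounding away from
	 * zero first gives 2 and -2 respectively, damping equal-magnitude
	 * adjustments equally in both directions.
	 */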
sample_period = hwc->sample_period + delta;
@@@ -4311,14 -4410,14 +4410,14 @@@ static bool perf_rotate_context(struct update_context_time(&cpuctx->ctx); __pmu_ctx_sched_out(cpu_epc, EVENT_FLEXIBLE); rotate_ctx(&cpuctx->ctx, cpu_event); - __pmu_ctx_sched_in(&cpuctx->ctx, pmu); + __pmu_ctx_sched_in(cpu_epc, EVENT_FLEXIBLE); }
if (task_event) rotate_ctx(task_epc->ctx, task_event);
if (task_event || (task_epc && cpu_event)) - __pmu_ctx_sched_in(task_epc->ctx, pmu); + __pmu_ctx_sched_in(task_epc, EVENT_FLEXIBLE);
perf_pmu_enable(pmu); perf_ctx_unlock(cpuctx, cpuctx->task_ctx); @@@ -4384,7 -4483,7 +4483,7 @@@ static void perf_event_enable_on_exec(s
cpuctx = this_cpu_ptr(&perf_cpu_context); perf_ctx_lock(cpuctx, ctx); - ctx_sched_out(ctx, EVENT_TIME); + ctx_time_freeze(cpuctx, ctx);
list_for_each_entry(event, &ctx->event_list, event_entry) { enabled |= event_enable_on_exec(event, ctx); @@@ -4396,9 -4495,7 +4495,7 @@@ */ if (enabled) { clone_ctx = unclone_ctx(ctx); - ctx_resched(cpuctx, ctx, event_type); - } else { - ctx_sched_in(ctx, EVENT_TIME); + ctx_resched(cpuctx, ctx, NULL, event_type); } perf_ctx_unlock(cpuctx, ctx);
@@@ -4459,16 -4556,24 +4556,24 @@@ struct perf_read_data int ret; };
+ static inline const struct cpumask *perf_scope_cpu_topology_cpumask(unsigned int scope, int cpu); + static int __perf_event_read_cpu(struct perf_event *event, int event_cpu) { + int local_cpu = smp_processor_id(); u16 local_pkg, event_pkg;
if ((unsigned)event_cpu >= nr_cpu_ids) return event_cpu;
- if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) { - int local_cpu = smp_processor_id(); + if (event->group_caps & PERF_EV_CAP_READ_SCOPE) { + const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(event->pmu->scope, event_cpu);
+ if (cpumask && cpumask_test_cpu(local_cpu, cpumask)) + return local_cpu; + } + + if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) { event_pkg = topology_physical_package_id(event_cpu); local_pkg = topology_physical_package_id(local_cpu);
@@@ -4501,10 -4606,7 +4606,7 @@@ static void __perf_event_read(void *inf return;
raw_spin_lock(&ctx->lock); - if (ctx->is_active & EVENT_TIME) { - update_context_time(ctx); - update_cgrp_time_from_event(event); - } + ctx_time_update_event(ctx, event);
perf_event_update_time(event); if (data->group) @@@ -4539,8 -4641,11 +4641,11 @@@ unlock raw_spin_unlock(&ctx->lock); }
- static inline u64 perf_event_count(struct perf_event *event) + static inline u64 perf_event_count(struct perf_event *event, bool self) { + if (self) + return local64_read(&event->count); + return local64_read(&event->count) + atomic64_read(&event->child_count); }
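The has_inherit_and_sample_read() predicate used by the read and output paths below is not visible in these hunks; given how it gates the new self argument, a matching definition (presumably in the perf header) would simply be:

static inline bool has_inherit_and_sample_read(struct perf_event_attr *attr)
{
	return !!(attr->inherit && (attr->sample_type & PERF_SAMPLE_READ));
}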
@@@ -4701,10 -4806,7 +4806,7 @@@ again * May read while context is not active (e.g., thread is * blocked), in that case we cannot update context time */ - if (ctx->is_active & EVENT_TIME) { - update_context_time(ctx); - update_cgrp_time_from_event(event); - } + ctx_time_update_event(ctx, event);
perf_event_update_time(event); if (group) @@@ -5205,7 -5307,7 +5307,7 @@@ static void perf_pending_task_sync(stru */ if (task_work_cancel(current, head)) { event->pending_work = 0; - local_dec(&event->ctx->nr_pending); + local_dec(&event->ctx->nr_no_switch_fast); return; }
@@@ -5499,7 -5601,7 +5601,7 @@@ static u64 __perf_event_read_value(stru mutex_lock(&event->child_mutex);
(void)perf_event_read(event, false); - total += perf_event_count(event); + total += perf_event_count(event, false);
*enabled += event->total_time_enabled + atomic64_read(&event->child_total_time_enabled); @@@ -5508,7 -5610,7 +5610,7 @@@
list_for_each_entry(child, &event->child_list, child_list) { (void)perf_event_read(child, false); - total += perf_event_count(child); + total += perf_event_count(child, false); *enabled += child->total_time_enabled; *running += child->total_time_running; } @@@ -5590,14 -5692,14 +5692,14 @@@ static int __perf_read_group_add(struc /* * Write {count,id} tuples for every sibling. */ - values[n++] += perf_event_count(leader); + values[n++] += perf_event_count(leader, false); if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(leader); if (read_format & PERF_FORMAT_LOST) values[n++] = atomic64_read(&leader->lost_samples);
for_each_sibling_event(sub, leader) { - values[n++] += perf_event_count(sub); + values[n++] += perf_event_count(sub, false); if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(sub); if (read_format & PERF_FORMAT_LOST) @@@ -5899,10 -6001,10 +6001,10 @@@ static const struct file_operations per static inline int perf_fget_light(int fd, struct fd *p) { struct fd f = fdget(fd); - if (!f.file) + if (!fd_file(f)) return -EBADF;
- if (f.file->f_op != &perf_fops) { + if (fd_file(f)->f_op != &perf_fops) { fdput(f); return -EBADF; } @@@ -5962,7 -6064,7 +6064,7 @@@ static long _perf_ioctl(struct perf_eve ret = perf_fget_light(arg, &output); if (ret) return ret; - output_event = output.file->private_data; + output_event = fd_file(output)->private_data; ret = perf_event_set_output(event, output_event); fdput(output); } else { @@@ -6177,7 -6279,7 +6279,7 @@@ void perf_event_update_userpage(struct ++userpg->lock; barrier(); userpg->index = perf_event_index(event); - userpg->offset = perf_event_count(event); + userpg->offset = perf_event_count(event, false); if (userpg->index) userpg->offset -= local64_read(&event->hw.prev_count);
@@@ -6874,7 -6976,7 +6976,7 @@@ static void perf_pending_task(struct ca if (event->pending_work) { event->pending_work = 0; perf_sigtrap(event); - local_dec(&event->ctx->nr_pending); + local_dec(&event->ctx->nr_no_switch_fast); rcuwait_wake_up(&event->pending_work_wait); } rcu_read_unlock(); @@@ -7256,7 -7358,7 +7358,7 @@@ static void perf_output_read_one(struc u64 values[5]; int n = 0;
- values[n++] = perf_event_count(event); + values[n++] = perf_event_count(event, has_inherit_and_sample_read(&event->attr)); if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) { values[n++] = enabled + atomic64_read(&event->child_total_time_enabled); @@@ -7274,14 -7376,15 +7376,15 @@@ }
static void perf_output_read_group(struct perf_output_handle *handle, - struct perf_event *event, - u64 enabled, u64 running) + struct perf_event *event, + u64 enabled, u64 running) { struct perf_event *leader = event->group_leader, *sub; u64 read_format = event->attr.read_format; unsigned long flags; u64 values[6]; int n = 0; + bool self = has_inherit_and_sample_read(&event->attr);
/* * Disabling interrupts avoids all counter scheduling @@@ -7301,7 -7404,7 +7404,7 @@@ (leader->state == PERF_EVENT_STATE_ACTIVE)) leader->pmu->read(leader);
- values[n++] = perf_event_count(leader); + values[n++] = perf_event_count(leader, self); if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(leader); if (read_format & PERF_FORMAT_LOST) @@@ -7316,7 -7419,7 +7419,7 @@@ (sub->state == PERF_EVENT_STATE_ACTIVE)) sub->pmu->read(sub);
- values[n++] = perf_event_count(sub); + values[n++] = perf_event_count(sub, self); if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(sub); if (read_format & PERF_FORMAT_LOST) @@@ -7337,6 -7440,10 +7440,10 @@@ * The problem is that its both hard and excessively expensive to iterate the * child list, not to mention that its impossible to IPI the children running * on another CPU, from interrupt/NMI context. + * + * Instead the combination of PERF_SAMPLE_READ and inherit will track per-thread + * counts rather than attempting to accumulate some value across all children on + * all cores. */ static void perf_output_read(struct perf_output_handle *handle, struct perf_event *event) @@@ -8857,7 -8964,7 +8964,7 @@@ got_name mmap_event->event_id.header.size = sizeof(mmap_event->event_id) + size;
if (atomic_read(&nr_build_id_events)) - build_id_parse(vma, mmap_event->build_id, &mmap_event->build_id_size); + build_id_parse_nofault(vma, mmap_event->build_id, &mmap_event->build_id_size);
perf_iterate_sb(perf_event_mmap_output, mmap_event, @@@ -9747,7 -9854,7 +9854,7 @@@ static int __perf_event_overflow(struc if (!event->pending_work && !task_work_add(current, &event->pending_task, notify_mode)) { event->pending_work = pending_id; - local_inc(&event->ctx->nr_pending); + local_inc(&event->ctx->nr_no_switch_fast);
event->pending_addr = 0; if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR)) @@@ -11484,10 -11591,60 +11591,60 @@@ perf_event_mux_interval_ms_store(struc } static DEVICE_ATTR_RW(perf_event_mux_interval_ms);
+ static inline const struct cpumask *perf_scope_cpu_topology_cpumask(unsigned int scope, int cpu) + { + switch (scope) { + case PERF_PMU_SCOPE_CORE: + return topology_sibling_cpumask(cpu); + case PERF_PMU_SCOPE_DIE: + return topology_die_cpumask(cpu); + case PERF_PMU_SCOPE_CLUSTER: + return topology_cluster_cpumask(cpu); + case PERF_PMU_SCOPE_PKG: + return topology_core_cpumask(cpu); + case PERF_PMU_SCOPE_SYS_WIDE: + return cpu_online_mask; + } + + return NULL; + } + + static inline struct cpumask *perf_scope_cpumask(unsigned int scope) + { + switch (scope) { + case PERF_PMU_SCOPE_CORE: + return perf_online_core_mask; + case PERF_PMU_SCOPE_DIE: + return perf_online_die_mask; + case PERF_PMU_SCOPE_CLUSTER: + return perf_online_cluster_mask; + case PERF_PMU_SCOPE_PKG: + return perf_online_pkg_mask; + case PERF_PMU_SCOPE_SYS_WIDE: + return perf_online_sys_mask; + } + + return NULL; + } + + static ssize_t cpumask_show(struct device *dev, struct device_attribute *attr, + char *buf) + { + struct pmu *pmu = dev_get_drvdata(dev); + struct cpumask *mask = perf_scope_cpumask(pmu->scope); + + if (mask) + return cpumap_print_to_pagebuf(true, buf, mask); + return 0; + } + + static DEVICE_ATTR_RO(cpumask); + static struct attribute *pmu_dev_attrs[] = { &dev_attr_type.attr, &dev_attr_perf_event_mux_interval_ms.attr, &dev_attr_nr_addr_filters.attr, + &dev_attr_cpumask.attr, NULL, };
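On the driver side the new scope support is opt-in. A hedged sketch of how an uncore-style PMU might use it (the driver, its callbacks and its name are hypothetical; only the .scope field, the scope constants and perf_pmu_register() come from the code above):

static struct pmu hypothetical_uncore_pmu = {
	.task_ctx_nr	= perf_invalid_context,
	.scope		= PERF_PMU_SCOPE_DIE,	/* counters are per die */
	.event_init	= hypothetical_uncore_event_init,
	.add		= hypothetical_uncore_add,
	.del		= hypothetical_uncore_del,
	.start		= hypothetical_uncore_start,
	.stop		= hypothetical_uncore_stop,
	.read		= hypothetical_uncore_read,
};

/*
 * Registering with a scope exposes the new "cpumask" sysfs attribute and
 * lets perf_try_init_event() grant PERF_EV_CAP_READ_SCOPE to events whose
 * CPU falls inside that scope.
 */
ret = perf_pmu_register(&hypothetical_uncore_pmu, "hypothetical_uncore", -1);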
@@@ -11499,6 -11656,10 +11656,10 @@@ static umode_t pmu_dev_is_visible(struc if (n == 2 && !pmu->nr_addr_filters) return 0;
+ /* cpumask */ + if (n == 3 && pmu->scope == PERF_PMU_SCOPE_NONE) + return 0; + return a->mode; }
@@@ -11583,6 -11744,11 +11744,11 @@@ int perf_pmu_register(struct pmu *pmu, goto free_pdc; }
+ if (WARN_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE, "Can not register a pmu with an invalid scope.\n")) { + ret = -EINVAL; + goto free_pdc; + } + pmu->name = name;
if (type >= 0) @@@ -11737,6 -11903,22 +11903,22 @@@ static int perf_try_init_event(struct p event_has_any_exclude_flag(event)) ret = -EINVAL;
+ if (pmu->scope != PERF_PMU_SCOPE_NONE && event->cpu >= 0) { + const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(pmu->scope, event->cpu); + struct cpumask *pmu_cpumask = perf_scope_cpumask(pmu->scope); + int cpu; + + if (pmu_cpumask && cpumask) { + cpu = cpumask_any_and(pmu_cpumask, cpumask); + if (cpu >= nr_cpu_ids) + ret = -ENODEV; + else + event->event_caps |= PERF_EV_CAP_READ_SCOPE; + } else { + ret = -ENODEV; + } + } + if (ret && event->destroy) event->destroy(event); } @@@ -12064,10 -12246,12 +12246,12 @@@ perf_event_alloc(struct perf_event_att local64_set(&hwc->period_left, hwc->sample_period);
/* - * We currently do not support PERF_SAMPLE_READ on inherited events. + * We do not support PERF_SAMPLE_READ on inherited events unless + * PERF_SAMPLE_TID is also selected, which allows inherited events to + * collect per-thread samples. * See perf_output_read(). */ - if (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ)) + if (has_inherit_and_sample_read(attr) && !(attr->sample_type & PERF_SAMPLE_TID)) goto err_ns;
if (!has_branch_stack(event)) @@@ -12481,7 -12665,7 +12665,7 @@@ SYSCALL_DEFINE5(perf_event_open struct perf_event_attr attr; struct perf_event_context *ctx; struct file *event_file = NULL; - struct fd group = {NULL, 0}; + struct fd group = EMPTY_FD; struct task_struct *task = NULL; struct pmu *pmu; int event_fd; @@@ -12556,7 -12740,7 +12740,7 @@@ err = perf_fget_light(group_fd, &group); if (err) goto err_fd; - group_leader = group.file->private_data; + group_leader = fd_file(group)->private_data; if (flags & PERF_FLAG_FD_OUTPUT) output_event = group_leader; if (flags & PERF_FLAG_FD_NO_GROUP) @@@ -13091,7 -13275,7 +13275,7 @@@ static void sync_child_event(struct per perf_event_read_event(child_event, task); }
- child_val = perf_event_count(child_event); + child_val = perf_event_count(child_event, false);
/* * Add back the child's count to the parent's count: @@@ -13182,7 -13366,7 +13366,7 @@@ static void perf_event_exit_task_contex * in. */ raw_spin_lock_irq(&child_ctx->lock); - task_ctx_sched_out(child_ctx, EVENT_ALL); + task_ctx_sched_out(child_ctx, NULL, EVENT_ALL);
/* * Now that the context is inactive, destroy the task <-> ctx relation @@@ -13358,15 -13542,6 +13542,15 @@@ const struct perf_event_attr *perf_even return &event->attr; }
+int perf_allow_kernel(struct perf_event_attr *attr) +{ + if (sysctl_perf_event_paranoid > 1 && !perfmon_capable()) + return -EACCES; + + return security_perf_event_open(attr, PERF_SECURITY_KERNEL); +} +EXPORT_SYMBOL_GPL(perf_allow_kernel); + /* * Inherit an event from parent task to child task. * @@@ -13697,6 -13872,12 +13881,12 @@@ static void __init perf_event_init_all_ int cpu;
zalloc_cpumask_var(&perf_online_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_core_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_die_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_cluster_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_pkg_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_sys_mask, GFP_KERNEL); +
for_each_possible_cpu(cpu) { swhash = &per_cpu(swevent_htable, cpu); @@@ -13740,12 -13921,46 +13930,46 @@@ static void __perf_event_exit_context(v struct perf_event *event;
raw_spin_lock(&ctx->lock); - ctx_sched_out(ctx, EVENT_TIME); + ctx_sched_out(ctx, NULL, EVENT_TIME); list_for_each_entry(event, &ctx->event_list, event_entry) __perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP); raw_spin_unlock(&ctx->lock); }
+ static void perf_event_clear_cpumask(unsigned int cpu) + { + int target[PERF_PMU_MAX_SCOPE]; + unsigned int scope; + struct pmu *pmu; + + cpumask_clear_cpu(cpu, perf_online_mask); + + for (scope = PERF_PMU_SCOPE_NONE + 1; scope < PERF_PMU_MAX_SCOPE; scope++) { + const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(scope, cpu); + struct cpumask *pmu_cpumask = perf_scope_cpumask(scope); + + target[scope] = -1; + if (WARN_ON_ONCE(!pmu_cpumask || !cpumask)) + continue; + + if (!cpumask_test_and_clear_cpu(cpu, pmu_cpumask)) + continue; + target[scope] = cpumask_any_but(cpumask, cpu); + if (target[scope] < nr_cpu_ids) + cpumask_set_cpu(target[scope], pmu_cpumask); + } + + /* migrate */ + list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) { + if (pmu->scope == PERF_PMU_SCOPE_NONE || + WARN_ON_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE)) + continue; + + if (target[pmu->scope] >= 0 && target[pmu->scope] < nr_cpu_ids) + perf_pmu_migrate_context(pmu, cpu, target[pmu->scope]); + } + } + static void perf_event_exit_cpu_context(int cpu) { struct perf_cpu_context *cpuctx; @@@ -13753,6 -13968,11 +13977,11 @@@
// XXX simplify cpuctx->online mutex_lock(&pmus_lock); + /* + * Clear the cpumasks, and migrate to other CPUs if possible. + * Must be invoked before the __perf_event_exit_context. + */ + perf_event_clear_cpumask(cpu); cpuctx = per_cpu_ptr(&perf_cpu_context, cpu); ctx = &cpuctx->ctx;
@@@ -13760,7 -13980,6 +13989,6 @@@ smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1); cpuctx->online = 0; mutex_unlock(&ctx->mutex); - cpumask_clear_cpu(cpu, perf_online_mask); mutex_unlock(&pmus_lock); } #else @@@ -13769,6 -13988,42 +13997,42 @@@ static void perf_event_exit_cpu_context
#endif
+ static void perf_event_setup_cpumask(unsigned int cpu) + { + struct cpumask *pmu_cpumask; + unsigned int scope; + + cpumask_set_cpu(cpu, perf_online_mask); + + /* + * Early boot stage, the cpumask hasn't been set yet. + * The perf_online_<domain>_masks include the first CPU of each domain. + * Always unconditionally set the boot CPU for the perf_online_<domain>_masks. + */ + if (!topology_sibling_cpumask(cpu)) { + for (scope = PERF_PMU_SCOPE_NONE + 1; scope < PERF_PMU_MAX_SCOPE; scope++) { + pmu_cpumask = perf_scope_cpumask(scope); + if (WARN_ON_ONCE(!pmu_cpumask)) + continue; + cpumask_set_cpu(cpu, pmu_cpumask); + } + return; + } + + for (scope = PERF_PMU_SCOPE_NONE + 1; scope < PERF_PMU_MAX_SCOPE; scope++) { + const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(scope, cpu); + + pmu_cpumask = perf_scope_cpumask(scope); + + if (WARN_ON_ONCE(!pmu_cpumask || !cpumask)) + continue; + + if (!cpumask_empty(cpumask) && + cpumask_any_and(pmu_cpumask, cpumask) >= nr_cpu_ids) + cpumask_set_cpu(cpu, pmu_cpumask); + } + } + int perf_event_init_cpu(unsigned int cpu) { struct perf_cpu_context *cpuctx; @@@ -13777,7 -14032,7 +14041,7 @@@ perf_swevent_init_cpu(cpu);
mutex_lock(&pmus_lock); - cpumask_set_cpu(cpu, perf_online_mask); + perf_event_setup_cpumask(cpu); cpuctx = per_cpu_ptr(&perf_cpu_context, cpu); ctx = &cpuctx->ctx;
diff --combined kernel/events/uprobes.c index 5afd00f264314,4b7e590dc428e..86fcb2386ea2f --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@@ -40,6 -40,9 +40,9 @@@ static struct rb_root uprobes_tree = RB #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ + static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock); + + DEFINE_STATIC_SRCU(uprobes_srcu);
#define UPROBES_HASH_SZ 13 /* serialize uprobe->pending_list */ @@@ -57,8 -60,9 +60,9 @@@ struct uprobe struct rw_semaphore register_rwsem; struct rw_semaphore consumer_rwsem; struct list_head pending_list; - struct uprobe_consumer *consumers; + struct list_head consumers; struct inode *inode; /* Also hold a ref to inode */ + struct rcu_head rcu; loff_t offset; loff_t ref_ctr_offset; unsigned long flags; @@@ -109,6 -113,11 +113,11 @@@ struct xol_area unsigned long vaddr; /* Page(s) of instruction slots */ };
+ static void uprobe_warn(struct task_struct *t, const char *msg) + { + pr_warn("uprobe: %s:%d failed to %s\n", current->comm, current->pid, msg); + } + /* * valid_vma: Verify if the specified vma is an executable vma * Relax restrictions while unregistering: vm_flags might have @@@ -453,7 -462,7 +462,7 @@@ static int update_ref_ctr(struct uprob * @vaddr: the virtual address to store the opcode. * @opcode: opcode to be written at @vaddr. * - * Called with mm->mmap_lock held for write. + * Called with mm->mmap_lock held for read or write. * Return 0 (success) or a negative errno. */ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, @@@ -587,25 -596,63 +596,63 @@@ set_orig_insn(struct arch_uprobe *aupro *(uprobe_opcode_t *)&auprobe->insn); }
+ /* uprobe should have guaranteed positive refcount */ static struct uprobe *get_uprobe(struct uprobe *uprobe) { refcount_inc(&uprobe->ref); return uprobe; }
+ /* + * uprobe should have guaranteed lifetime, which can be either of: + * - caller already has refcount taken (and wants an extra one); + * - uprobe is RCU protected and won't be freed until after grace period; + * - we are holding uprobes_treelock (for read or write, doesn't matter). + */ + static struct uprobe *try_get_uprobe(struct uprobe *uprobe) + { + if (refcount_inc_not_zero(&uprobe->ref)) + return uprobe; + return NULL; + } + + static inline bool uprobe_is_active(struct uprobe *uprobe) + { + return !RB_EMPTY_NODE(&uprobe->rb_node); + } + + static void uprobe_free_rcu(struct rcu_head *rcu) + { + struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu); + + kfree(uprobe); + } + static void put_uprobe(struct uprobe *uprobe) { - if (refcount_dec_and_test(&uprobe->ref)) { - /* - * If application munmap(exec_vma) before uprobe_unregister() - * gets called, we don't get a chance to remove uprobe from - * delayed_uprobe_list from remove_breakpoint(). Do it here. - */ - mutex_lock(&delayed_uprobe_lock); - delayed_uprobe_remove(uprobe, NULL); - mutex_unlock(&delayed_uprobe_lock); - kfree(uprobe); + if (!refcount_dec_and_test(&uprobe->ref)) + return; + + write_lock(&uprobes_treelock); + + if (uprobe_is_active(uprobe)) { + write_seqcount_begin(&uprobes_seqcount); + rb_erase(&uprobe->rb_node, &uprobes_tree); + write_seqcount_end(&uprobes_seqcount); } + + write_unlock(&uprobes_treelock); + + /* + * If application munmap(exec_vma) before uprobe_unregister() + * gets called, we don't get a chance to remove uprobe from + * delayed_uprobe_list from remove_breakpoint(). Do it here. + */ + mutex_lock(&delayed_uprobe_lock); + delayed_uprobe_remove(uprobe, NULL); + mutex_unlock(&delayed_uprobe_lock); + + call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu); }
static __always_inline @@@ -647,62 -694,86 +694,86 @@@ static inline int __uprobe_cmp(struct r return uprobe_cmp(u->inode, u->offset, __node_2_uprobe(b)); }
- static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset) + /* + * Assumes being inside RCU protected region. + * No refcount is taken on returned uprobe. + */ + static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset) { struct __uprobe_key key = { .inode = inode, .offset = offset, }; - struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key); + struct rb_node *node; + unsigned int seq;
- if (node) - return get_uprobe(__node_2_uprobe(node)); + lockdep_assert(srcu_read_lock_held(&uprobes_srcu)); + + do { + seq = read_seqcount_begin(&uprobes_seqcount); + node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key); + /* + * Lockless RB-tree lookups can result only in false negatives. + * If the element is found, it is correct and can be returned + * under RCU protection. If we find nothing, we need to + * validate that seqcount didn't change. If it did, we have to + * try again as we might have missed the element (false + * negative). If seqcount is unchanged, search truly failed. + */ + if (node) + return __node_2_uprobe(node); + } while (read_seqcount_retry(&uprobes_seqcount, seq));
return NULL; }
/* - * Find a uprobe corresponding to a given inode:offset - * Acquires uprobes_treelock + * Attempt to insert a new uprobe into uprobes_tree. + * + * If uprobe already exists (for given inode+offset), we just increment + * refcount of previously existing uprobe. + * + * If not, a provided new instance of uprobe is inserted into the tree (with + * assumed initial refcount == 1). + * + * In any case, we return a uprobe instance that ends up being in uprobes_tree. + * Caller has to clean up new uprobe instance, if it ended up not being + * inserted into the tree. + * + * We assume that uprobes_treelock is held for writing. */ - static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) - { - struct uprobe *uprobe; - - read_lock(&uprobes_treelock); - uprobe = __find_uprobe(inode, offset); - read_unlock(&uprobes_treelock); - - return uprobe; - } - static struct uprobe *__insert_uprobe(struct uprobe *uprobe) { struct rb_node *node; + again: + node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp); + if (node) { + struct uprobe *u = __node_2_uprobe(node); + + if (!try_get_uprobe(u)) { + rb_erase(node, &uprobes_tree); + RB_CLEAR_NODE(&u->rb_node); + goto again; + }
- node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp); - if (node) - return get_uprobe(__node_2_uprobe(node)); + return u; + }
- /* get access + creation ref */ - refcount_set(&uprobe->ref, 2); - return NULL; + return uprobe; }
/* - * Acquire uprobes_treelock. - * Matching uprobe already exists in rbtree; - * increment (access refcount) and return the matching uprobe. - * - * No matching uprobe; insert the uprobe in rb_tree; - * get a double refcount (access + creation) and return NULL. + * Acquire uprobes_treelock and insert uprobe into uprobes_tree + * (or reuse existing one, see __insert_uprobe() comments above). */ static struct uprobe *insert_uprobe(struct uprobe *uprobe) { struct uprobe *u;
write_lock(&uprobes_treelock); + write_seqcount_begin(&uprobes_seqcount); u = __insert_uprobe(uprobe); + write_seqcount_end(&uprobes_seqcount); write_unlock(&uprobes_treelock);
return u; @@@ -725,18 -796,21 +796,21 @@@ static struct uprobe *alloc_uprobe(stru
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); if (!uprobe) - return NULL; + return ERR_PTR(-ENOMEM);
uprobe->inode = inode; uprobe->offset = offset; uprobe->ref_ctr_offset = ref_ctr_offset; + INIT_LIST_HEAD(&uprobe->consumers); init_rwsem(&uprobe->register_rwsem); init_rwsem(&uprobe->consumer_rwsem); + RB_CLEAR_NODE(&uprobe->rb_node); + refcount_set(&uprobe->ref, 1);
/* add to uprobes_tree, sorted on inode:offset */ cur_uprobe = insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ - if (cur_uprobe) { + if (cur_uprobe != uprobe) { if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { ref_ctr_mismatch_warn(cur_uprobe, uprobe); put_uprobe(cur_uprobe); @@@ -753,32 -827,19 +827,19 @@@ static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc) { down_write(&uprobe->consumer_rwsem); - uc->next = uprobe->consumers; - uprobe->consumers = uc; + list_add_rcu(&uc->cons_node, &uprobe->consumers); up_write(&uprobe->consumer_rwsem); }
/* * For uprobe @uprobe, delete the consumer @uc. - * Return true if the @uc is deleted successfully - * or return false. + * Should never be called with consumer that's not part of @uprobe->consumers. */ - static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc) + static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc) { - struct uprobe_consumer **con; - bool ret = false; - down_write(&uprobe->consumer_rwsem); - for (con = &uprobe->consumers; *con; con = &(*con)->next) { - if (*con == uc) { - *con = uc->next; - ret = true; - break; - } - } + list_del_rcu(&uc->cons_node); up_write(&uprobe->consumer_rwsem); - - return ret; }
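consumer_add()/consumer_del() now link consumers through a list node instead of the old intrusive ->next pointer, and the filter callback (see consumer_filter() below) loses its uprobe_filter_ctx argument. The reworked struct uprobe_consumer itself is not part of these hunks; its assumed shape, matching the fields used here, is:

struct uprobe_consumer {
	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
	int (*ret_handler)(struct uprobe_consumer *self,
			   unsigned long func, struct pt_regs *regs);
	bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);

	struct list_head cons_node;
};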
static int __copy_insn(struct address_space *mapping, struct file *filp, @@@ -863,21 -924,20 +924,20 @@@ static int prepare_uprobe(struct uprob return ret; }
- static inline bool consumer_filter(struct uprobe_consumer *uc, - enum uprobe_filter_ctx ctx, struct mm_struct *mm) + static inline bool consumer_filter(struct uprobe_consumer *uc, struct mm_struct *mm) { - return !uc->filter || uc->filter(uc, ctx, mm); + return !uc->filter || uc->filter(uc, mm); }
- static bool filter_chain(struct uprobe *uprobe, - enum uprobe_filter_ctx ctx, struct mm_struct *mm) + static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm) { struct uprobe_consumer *uc; bool ret = false;
down_read(&uprobe->consumer_rwsem); - for (uc = uprobe->consumers; uc; uc = uc->next) { - ret = consumer_filter(uc, ctx, mm); + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node, + srcu_read_lock_held(&uprobes_srcu)) { + ret = consumer_filter(uc, mm); if (ret) break; } @@@ -921,27 -981,6 +981,6 @@@ remove_breakpoint(struct uprobe *uprobe return set_orig_insn(&uprobe->arch, mm, vaddr); }
- static inline bool uprobe_is_active(struct uprobe *uprobe) - { - return !RB_EMPTY_NODE(&uprobe->rb_node); - } - /* - * There could be threads that have already hit the breakpoint. They - * will recheck the current insn and restart if find_uprobe() fails. - * See find_active_uprobe(). - */ - static void delete_uprobe(struct uprobe *uprobe) - { - if (WARN_ON(!uprobe_is_active(uprobe))) - return; - - write_lock(&uprobes_treelock); - rb_erase(&uprobe->rb_node, &uprobes_tree); - write_unlock(&uprobes_treelock); - RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ - put_uprobe(uprobe); - } - struct map_info { struct map_info *next; struct mm_struct *mm; @@@ -1046,7 -1085,13 +1085,13 @@@ register_for_each_vma(struct uprobe *up
if (err && is_register) goto free; - + /* + * We take mmap_lock for writing to avoid the race with + * find_active_uprobe_rcu() which takes mmap_lock for reading. + * Thus this install_breakpoint() can not make + * is_trap_at_addr() true right after find_uprobe_rcu() + * returns NULL in find_active_uprobe_rcu(). + */ mmap_write_lock(mm); vma = find_vma(mm, info->vaddr); if (!vma || !valid_vma(vma, is_register) || @@@ -1059,12 -1104,10 +1104,10 @@@
if (is_register) { /* consult only the "caller", new consumer. */ - if (consumer_filter(new, - UPROBE_FILTER_REGISTER, mm)) + if (consumer_filter(new, mm)) err = install_breakpoint(uprobe, mm, vma, info->vaddr); } else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) { - if (!filter_chain(uprobe, - UPROBE_FILTER_UNREGISTER, mm)) + if (!filter_chain(uprobe, mm)) err |= remove_breakpoint(uprobe, mm, info->vaddr); }
@@@ -1079,152 -1122,140 +1122,140 @@@ return err; }
- static void - __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) + /** + * uprobe_unregister_nosync - unregister an already registered probe. + * @uprobe: uprobe to remove + * @uc: identify which probe if multiple probes are colocated. + */ + void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc) { int err;
- if (WARN_ON(!consumer_del(uprobe, uc))) - return; - + down_write(&uprobe->register_rwsem); + consumer_del(uprobe, uc); err = register_for_each_vma(uprobe, NULL); - /* TODO : cant unregister? schedule a worker thread */ - if (!uprobe->consumers && !err) - delete_uprobe(uprobe); - } - - /* - * uprobe_unregister - unregister an already registered probe. - * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. - */ - void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) - { - struct uprobe *uprobe; + up_write(&uprobe->register_rwsem);
- uprobe = find_uprobe(inode, offset); - if (WARN_ON(!uprobe)) + /* TODO : can't unregister? schedule a worker thread */ + if (unlikely(err)) { + uprobe_warn(current, "unregister, leaking uprobe"); return; + }
- down_write(&uprobe->register_rwsem); - __uprobe_unregister(uprobe, uc); - up_write(&uprobe->register_rwsem); put_uprobe(uprobe); } - EXPORT_SYMBOL_GPL(uprobe_unregister); + EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
- /* - * __uprobe_register - register a probe + void uprobe_unregister_sync(void) + { + /* + * Now that handler_chain() and handle_uretprobe_chain() iterate over + * uprobe->consumers list under RCU protection without holding + * uprobe->register_rwsem, we need to wait for RCU grace period to + * make sure that we can't call into just unregistered + * uprobe_consumer's callbacks anymore. If we don't do that, fast and + * unlucky enough caller can free consumer's memory and cause + * handler_chain() or handle_uretprobe_chain() to do an use-after-free. + */ + synchronize_srcu(&uprobes_srcu); + } + EXPORT_SYMBOL_GPL(uprobe_unregister_sync); + + /** + * uprobe_register - register a probe * @inode: the file in which the probe has to be placed. * @offset: offset from the start of the file. + * @ref_ctr_offset: offset of SDT marker / reference counter * @uc: information on howto handle the probe.. * - * Apart from the access refcount, __uprobe_register() takes a creation + * Apart from the access refcount, uprobe_register() takes a creation * refcount (thro alloc_uprobe) if and only if this @uprobe is getting * inserted into the rbtree (i.e first consumer for a @inode:@offset * tuple). Creation refcount stops uprobe_unregister from freeing the * @uprobe even before the register operation is complete. Creation * refcount is released when the last @uc for the @uprobe - * unregisters. Caller of __uprobe_register() is required to keep @inode + * unregisters. Caller of uprobe_register() is required to keep @inode * (and the containing mount) referenced. * - * Return errno if it cannot successully install probes - * else return 0 (success) + * Return: pointer to the new uprobe on success or an ERR_PTR on failure. */ - static int __uprobe_register(struct inode *inode, loff_t offset, - loff_t ref_ctr_offset, struct uprobe_consumer *uc) + struct uprobe *uprobe_register(struct inode *inode, + loff_t offset, loff_t ref_ctr_offset, + struct uprobe_consumer *uc) { struct uprobe *uprobe; int ret;
/* Uprobe must have at least one set consumer */ if (!uc->handler && !uc->ret_handler) - return -EINVAL; + return ERR_PTR(-EINVAL);
/* copy_insn() uses read_mapping_page() or shmem_read_mapping_page() */ if (!inode->i_mapping->a_ops->read_folio && !shmem_mapping(inode->i_mapping)) - return -EIO; + return ERR_PTR(-EIO); /* Racy, just to catch the obvious mistakes */ if (offset > i_size_read(inode)) - return -EINVAL; + return ERR_PTR(-EINVAL);
/* * This ensures that copy_from_page(), copy_to_page() and * __update_ref_ctr() can't cross page boundary. */ if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE)) - return -EINVAL; + return ERR_PTR(-EINVAL); if (!IS_ALIGNED(ref_ctr_offset, sizeof(short))) - return -EINVAL; + return ERR_PTR(-EINVAL);
- retry: uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); - if (!uprobe) - return -ENOMEM; if (IS_ERR(uprobe)) - return PTR_ERR(uprobe); + return uprobe;
- /* - * We can race with uprobe_unregister()->delete_uprobe(). - * Check uprobe_is_active() and retry if it is false. - */ down_write(&uprobe->register_rwsem); - ret = -EAGAIN; - if (likely(uprobe_is_active(uprobe))) { - consumer_add(uprobe, uc); - ret = register_for_each_vma(uprobe, uc); - if (ret) - __uprobe_unregister(uprobe, uc); - } + consumer_add(uprobe, uc); + ret = register_for_each_vma(uprobe, uc); up_write(&uprobe->register_rwsem); - put_uprobe(uprobe);
- if (unlikely(ret == -EAGAIN)) - goto retry; - return ret; - } + if (ret) { + uprobe_unregister_nosync(uprobe, uc); + /* + * Registration might have partially succeeded, so we can have + * this consumer being called right at this time. We need to + * sync here. It's ok, it's unlikely slow path. + */ + uprobe_unregister_sync(); + return ERR_PTR(ret); + }
- int uprobe_register(struct inode *inode, loff_t offset, - struct uprobe_consumer *uc) - { - return __uprobe_register(inode, offset, 0, uc); + return uprobe; } EXPORT_SYMBOL_GPL(uprobe_register);
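Putting the reworked registration API together, a hedged caller-side sketch (the consumer, handler and attach/detach helpers are hypothetical; the uprobe_* functions and UPROBE_HANDLER_REMOVE are the ones used in this file):

static int my_handler(struct uprobe_consumer *uc, struct pt_regs *regs)
{
	return 0;	/* UPROBE_HANDLER_REMOVE instead would request unapply */
}

static struct uprobe_consumer my_consumer = { .handler = my_handler };
static struct uprobe *my_uprobe;

static int my_attach(struct inode *inode, loff_t offset)
{
	my_uprobe = uprobe_register(inode, offset, 0, &my_consumer);
	return PTR_ERR_OR_ZERO(my_uprobe);
}

static void my_detach(void)
{
	uprobe_unregister_nosync(my_uprobe, &my_consumer);
	/* wait out the SRCU grace period before freeing my_consumer */
	uprobe_unregister_sync();
}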
- int uprobe_register_refctr(struct inode *inode, loff_t offset, - loff_t ref_ctr_offset, struct uprobe_consumer *uc) - { - return __uprobe_register(inode, offset, ref_ctr_offset, uc); - } - EXPORT_SYMBOL_GPL(uprobe_register_refctr); - - /* - * uprobe_apply - unregister an already registered probe. - * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. + /** + * uprobe_apply - add or remove the breakpoints according to @uc->filter + * @uprobe: uprobe which "owns" the breakpoint * @uc: consumer which wants to add more or remove some breakpoints * @add: add or remove the breakpoints + * Return: 0 on success or negative error code. */ - int uprobe_apply(struct inode *inode, loff_t offset, - struct uprobe_consumer *uc, bool add) + int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add) { - struct uprobe *uprobe; struct uprobe_consumer *con; - int ret = -ENOENT; - - uprobe = find_uprobe(inode, offset); - if (WARN_ON(!uprobe)) - return ret; + int ret = -ENOENT, srcu_idx;
down_write(&uprobe->register_rwsem); - for (con = uprobe->consumers; con && con != uc ; con = con->next) - ; - if (con) - ret = register_for_each_vma(uprobe, add ? uc : NULL); + + srcu_idx = srcu_read_lock(&uprobes_srcu); + list_for_each_entry_srcu(con, &uprobe->consumers, cons_node, + srcu_read_lock_held(&uprobes_srcu)) { + if (con == uc) { + ret = register_for_each_vma(uprobe, add ? uc : NULL); + break; + } + } + srcu_read_unlock(&uprobes_srcu, srcu_idx); + up_write(&uprobe->register_rwsem); - put_uprobe(uprobe);
return ret; } @@@ -1305,15 -1336,17 +1336,17 @@@ static void build_probe_list(struct ino u = rb_entry(t, struct uprobe, rb_node); if (u->inode != inode || u->offset < min) break; - list_add(&u->pending_list, head); - get_uprobe(u); + /* if uprobe went away, it's safe to ignore it */ + if (try_get_uprobe(u)) + list_add(&u->pending_list, head); } for (t = n; (t = rb_next(t)); ) { u = rb_entry(t, struct uprobe, rb_node); if (u->inode != inode || u->offset > max) break; - list_add(&u->pending_list, head); - get_uprobe(u); + /* if uprobe went away, it's safe to ignore it */ + if (try_get_uprobe(u)) + list_add(&u->pending_list, head); } } read_unlock(&uprobes_treelock); @@@ -1384,7 -1417,7 +1417,7 @@@ int uprobe_mmap(struct vm_area_struct * */ list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) { if (!fatal_signal_pending(current) && - filter_chain(uprobe, UPROBE_FILTER_MMAP, vma->vm_mm)) { + filter_chain(uprobe, vma->vm_mm)) { unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset); install_breakpoint(uprobe, vma->vm_mm, vma, vaddr); } @@@ -1482,22 -1515,6 +1515,22 @@@ void * __weak arch_uprobe_trampoline(un return &insn; }
+/* + * uprobe_clear_state - Free the area allocated for slots. + */ +static void uprobe_clear_state(const struct vm_special_mapping *sm, struct vm_area_struct *vma) +{ + struct xol_area *area = container_of(vma->vm_private_data, struct xol_area, xol_mapping); + + mutex_lock(&delayed_uprobe_lock); + delayed_uprobe_remove(NULL, vma->vm_mm); + mutex_unlock(&delayed_uprobe_lock); + + put_page(area->pages[0]); + kfree(area->bitmap); + kfree(area); +} + static struct xol_area *__create_xol_area(unsigned long vaddr) { struct mm_struct *mm = current->mm; @@@ -1515,7 -1532,6 +1548,7 @@@ goto free_area;
area->xol_mapping.name = "[uprobes]"; + area->xol_mapping.close = uprobe_clear_state; area->xol_mapping.pages = area->pages; area->pages[0] = alloc_page(GFP_HIGHUSER); if (!area->pages[0]) @@@ -1561,6 -1577,25 +1594,6 @@@ static struct xol_area *get_xol_area(vo return area; }
-/* - * uprobe_clear_state - Free the area allocated for slots. - */ -void uprobe_clear_state(struct mm_struct *mm) -{ - struct xol_area *area = mm->uprobes_state.xol_area; - - mutex_lock(&delayed_uprobe_lock); - delayed_uprobe_remove(NULL, mm); - mutex_unlock(&delayed_uprobe_lock); - - if (!area) - return; - - put_page(area->pages[0]); - kfree(area->bitmap); - kfree(area); -} - void uprobe_start_dup_mmap(void) { percpu_down_read(&dup_mmap_sem); @@@ -1768,6 -1803,12 +1801,12 @@@ static int dup_utask(struct task_struc return -ENOMEM;
*n = *o; + /* + * uprobe's refcnt has to be positive at this point, kept by + * utask->return_instances items; return_instances can't be + * removed right now, as task is blocked due to duping; so + * get_uprobe() is safe to use here. + */ get_uprobe(n->uprobe); n->next = NULL;
@@@ -1779,12 -1820,6 +1818,6 @@@ return 0; }
- static void uprobe_warn(struct task_struct *t, const char *msg) - { - pr_warn("uprobe: %s:%d failed to %s\n", - current->comm, current->pid, msg); - } - static void dup_xol_work(struct callback_head *work) { if (current->flags & PF_EXITING) @@@ -1881,9 -1916,13 +1914,13 @@@ static void prepare_uretprobe(struct up return; }
+ /* we need to bump refcount to store uprobe in utask */ + if (!try_get_uprobe(uprobe)) + return; + ri = kmalloc(sizeof(struct return_instance), GFP_KERNEL); if (!ri) - return; + goto fail;
trampoline_vaddr = uprobe_get_trampoline_vaddr(); orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs); @@@ -1910,8 -1949,7 +1947,7 @@@ } orig_ret_vaddr = utask->return_instances->orig_ret_vaddr; } - - ri->uprobe = get_uprobe(uprobe); + ri->uprobe = uprobe; ri->func = instruction_pointer(regs); ri->stack = user_stack_pointer(regs); ri->orig_ret_vaddr = orig_ret_vaddr; @@@ -1922,8 -1960,9 +1958,9 @@@ utask->return_instances = ri;
return; - fail: + fail: kfree(ri); + put_uprobe(uprobe); }
/* Prepare to single-step probed instruction out of line. */ @@@ -1938,9 -1977,14 +1975,14 @@@ pre_ssout(struct uprobe *uprobe, struc if (!utask) return -ENOMEM;
+ if (!try_get_uprobe(uprobe)) + return -EINVAL; + xol_vaddr = xol_get_insn_slot(uprobe); - if (!xol_vaddr) - return -ENOMEM; + if (!xol_vaddr) { + err = -ENOMEM; + goto err_out; + }
utask->xol_vaddr = xol_vaddr; utask->vaddr = bp_vaddr; @@@ -1948,12 -1992,15 +1990,15 @@@ err = arch_uprobe_pre_xol(&uprobe->arch, regs); if (unlikely(err)) { xol_free_insn_slot(current); - return err; + goto err_out; }
utask->active_uprobe = uprobe; utask->state = UTASK_SSTEP; return 0; + err_out: + put_uprobe(uprobe); + return err; }
/* @@@ -2026,13 -2073,7 +2071,7 @@@ static int is_trap_at_addr(struct mm_st if (likely(result == 0)) goto out;
- /* - * The NULL 'tsk' here ensures that any faults that occur here - * will not be accounted to the task. 'mm' *is* current->mm, - * but we treat this as a 'remote' access since it is - * essentially a kernel access to the memory. - */ - result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL); + result = get_user_pages(vaddr, 1, FOLL_FORCE, &page); if (result < 0) return result;
@@@ -2043,7 -2084,8 +2082,8 @@@ return is_trap_insn(&opcode); }
- static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp) + /* assumes being inside RCU protected region */ + static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp) { struct mm_struct *mm = current->mm; struct uprobe *uprobe = NULL; @@@ -2056,7 -2098,7 +2096,7 @@@ struct inode *inode = file_inode(vma->vm_file); loff_t offset = vaddr_to_offset(vma, bp_vaddr);
- uprobe = find_uprobe(inode, offset); + uprobe = find_uprobe_rcu(inode, offset); }
if (!uprobe) @@@ -2077,9 -2119,12 +2117,12 @@@ static void handler_chain(struct uprob struct uprobe_consumer *uc; int remove = UPROBE_HANDLER_REMOVE; bool need_prep = false; /* prepare return uprobe, when needed */ + bool has_consumers = false; + + current->utask->auprobe = &uprobe->arch;
- down_read(&uprobe->register_rwsem); - for (uc = uprobe->consumers; uc; uc = uc->next) { + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node, + srcu_read_lock_held(&uprobes_srcu)) { int rc = 0;
if (uc->handler) { @@@ -2092,16 -2137,24 +2135,24 @@@ need_prep = true;
remove &= rc; + has_consumers = true; } + current->utask->auprobe = NULL;
if (need_prep && !remove) prepare_uretprobe(uprobe, regs); /* put bp at return */
- if (remove && uprobe->consumers) { - WARN_ON(!uprobe_is_active(uprobe)); - unapply_uprobe(uprobe, current->mm); + if (remove && has_consumers) { + down_read(&uprobe->register_rwsem); + + /* re-check that removal is still required, this time under lock */ + if (!filter_chain(uprobe, current->mm)) { + WARN_ON(!uprobe_is_active(uprobe)); + unapply_uprobe(uprobe, current->mm); + } + + up_read(&uprobe->register_rwsem); } - up_read(&uprobe->register_rwsem); }
static void @@@ -2109,13 -2162,15 +2160,15 @@@ handle_uretprobe_chain(struct return_in { struct uprobe *uprobe = ri->uprobe; struct uprobe_consumer *uc; + int srcu_idx;
- down_read(&uprobe->register_rwsem); - for (uc = uprobe->consumers; uc; uc = uc->next) { + srcu_idx = srcu_read_lock(&uprobes_srcu); + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node, + srcu_read_lock_held(&uprobes_srcu)) { if (uc->ret_handler) uc->ret_handler(uc, ri->func, regs); } - up_read(&uprobe->register_rwsem); + srcu_read_unlock(&uprobes_srcu, srcu_idx); }
static struct return_instance *find_next_ret_chain(struct return_instance *ri) @@@ -2200,13 -2255,15 +2253,15 @@@ static void handle_swbp(struct pt_regs { struct uprobe *uprobe; unsigned long bp_vaddr; - int is_swbp; + int is_swbp, srcu_idx;
bp_vaddr = uprobe_get_swbp_addr(regs); if (bp_vaddr == uprobe_get_trampoline_vaddr()) return uprobe_handle_trampoline(regs);
- uprobe = find_active_uprobe(bp_vaddr, &is_swbp); + srcu_idx = srcu_read_lock(&uprobes_srcu); + + uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp); if (!uprobe) { if (is_swbp > 0) { /* No matching uprobe; signal SIGTRAP. */ @@@ -2222,7 -2279,7 +2277,7 @@@ */ instruction_pointer_set(regs, bp_vaddr); } - return; + goto out; }
/* change it in advance for ->handler() and restart */ @@@ -2257,12 -2314,12 +2312,12 @@@ if (arch_uprobe_skip_sstep(&uprobe->arch, regs)) goto out;
- if (!pre_ssout(uprobe, regs, bp_vaddr)) - return; + if (pre_ssout(uprobe, regs, bp_vaddr)) + goto out;
- /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */ out: - put_uprobe(uprobe); + /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */ + srcu_read_unlock(&uprobes_srcu, srcu_idx); }
/* diff --combined kernel/fork.c index 0241a2ff1d336,0b71fc9fa750d..cd3f92f0a13d6 --- a/kernel/fork.c +++ b/kernel/fork.c @@@ -832,7 -832,7 +832,7 @@@ static void check_mm(struct mm_struct * pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n", mm_pgtables_bytes(mm));
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS) VM_BUG_ON_MM(mm->pmd_huge_pte, mm); #endif } @@@ -1182,7 -1182,7 +1182,7 @@@ static struct task_struct *dup_task_str tsk->active_memcg = NULL; #endif
- #ifdef CONFIG_CPU_SUP_INTEL + #ifdef CONFIG_X86_BUS_LOCK_DETECT tsk->reported_split_lock = 0; #endif
@@@ -1276,7 -1276,7 +1276,7 @@@ static struct mm_struct *mm_init(struc RCU_INIT_POINTER(mm->exe_file, NULL); mmu_notifier_subscriptions_init(mm); init_tlb_flush_pending(mm); -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS) mm->pmd_huge_pte = NULL; #endif mm_init_uprobes_state(mm); @@@ -1338,6 -1338,7 +1338,6 @@@ static inline void __mmput(struct mm_st { VM_BUG_ON(atomic_read(&mm->mm_users));
- uprobe_clear_state(mm); exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ @@@ -1753,30 -1754,33 +1753,30 @@@ static int copy_files(unsigned long clo int no_files) { struct files_struct *oldf, *newf; - int error = 0;
/* * A background process may not have any files ... */ oldf = current->files; if (!oldf) - goto out; + return 0;
if (no_files) { tsk->files = NULL; - goto out; + return 0; }
if (clone_flags & CLONE_FILES) { atomic_inc(&oldf->count); - goto out; + return 0; }
- newf = dup_fd(oldf, NR_OPEN_MAX, &error); - if (!newf) - goto out; + newf = dup_fd(oldf, NULL); + if (IS_ERR(newf)) + return PTR_ERR(newf);
tsk->files = newf; - error = 0; -out: - return error; + return 0; }
static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk) @@@ -1857,7 -1861,7 +1857,7 @@@ static int copy_signal(unsigned long cl prev_cputime_init(&sig->prev_cputime);
#ifdef CONFIG_POSIX_TIMERS - INIT_LIST_HEAD(&sig->posix_timers); + INIT_HLIST_HEAD(&sig->posix_timers); hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); sig->real_timer.function = it_real_fn; #endif @@@ -3228,16 -3232,17 +3228,16 @@@ static int unshare_fs(unsigned long uns /* * Unshare file descriptor table if it is being shared */ -int unshare_fd(unsigned long unshare_flags, unsigned int max_fds, - struct files_struct **new_fdp) +static int unshare_fd(unsigned long unshare_flags, struct files_struct **new_fdp) { struct files_struct *fd = current->files; - int error = 0;
if ((unshare_flags & CLONE_FILES) && (fd && atomic_read(&fd->count) > 1)) { - *new_fdp = dup_fd(fd, max_fds, &error); - if (!*new_fdp) - return error; + fd = dup_fd(fd, NULL); + if (IS_ERR(fd)) + return PTR_ERR(fd); + *new_fdp = fd; }
return 0; @@@ -3295,7 -3300,7 +3295,7 @@@ int ksys_unshare(unsigned long unshare_ err = unshare_fs(unshare_flags, &new_fs); if (err) goto bad_unshare_out; - err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd); + err = unshare_fd(unshare_flags, &new_fd); if (err) goto bad_unshare_cleanup_fs; err = unshare_userns(unshare_flags, &new_cred); @@@ -3387,7 -3392,7 +3387,7 @@@ int unshare_files(void struct files_struct *old, *copy = NULL; int error;
- error = unshare_fd(CLONE_FILES, NR_OPEN_MAX, ©); + error = unshare_fd(CLONE_FILES, ©); if (error || !copy) return error;
diff --combined kernel/irq/msi.c index ca6e2ae6d6fc0,1c7e5159064cc..3a24d6b5f559c --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@@ -82,7 -82,7 +82,7 @@@ static struct msi_desc *msi_alloc_desc( desc->dev = dev; desc->nvec_used = nvec; if (affinity) { - desc->affinity = kmemdup(affinity, nvec * sizeof(*desc->affinity), GFP_KERNEL); + desc->affinity = kmemdup_array(affinity, nvec, sizeof(*desc->affinity), GFP_KERNEL); if (!desc->affinity) { kfree(desc); return NULL; @@@ -832,7 -832,7 +832,7 @@@ static void msi_domain_update_chip_ops( struct irq_chip *chip = info->chip;
BUG_ON(!chip || !chip->irq_mask || !chip->irq_unmask); - if (!chip->irq_set_affinity) + if (!chip->irq_set_affinity && !(info->flags & MSI_FLAG_NO_AFFINITY)) chip->irq_set_affinity = msi_domain_set_affinity; }
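With MSI_FLAG_NO_AFFINITY a controller whose vectors cannot be retargeted no longer needs a stub irq_set_affinity callback that returns -EINVAL; the core simply skips installing msi_domain_set_affinity. A hedged sketch of the intended driver side (the chip and domain info below are hypothetical):

static struct irq_chip hypothetical_msi_chip = {
	.name		= "hypothetical-msi",
	.irq_mask	= irq_chip_mask_parent,
	.irq_unmask	= irq_chip_unmask_parent,
	/* deliberately no .irq_set_affinity */
};

static struct msi_domain_info hypothetical_msi_domain_info = {
	.flags	= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS |
		  MSI_FLAG_NO_AFFINITY,
	.chip	= &hypothetical_msi_chip,
};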
diff --combined kernel/sched/core.c index 43e701f540130,b4c5d83e54d48..2d35fd27d3dfd --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@@ -163,7 -163,10 +163,10 @@@ static inline int __task_prio(const str if (p->sched_class == &stop_sched_class) /* trumps deadline */ return -2;
- if (rt_prio(p->prio)) /* includes deadline */ + if (p->dl_server) + return -1; /* deadline */ + + if (rt_or_dl_prio(p->prio)) return p->prio; /* [-1, 99] */
if (p->sched_class == &idle_sched_class) @@@ -192,8 -195,24 +195,24 @@@ static inline bool prio_less(const stru if (-pb < -pa) return false;
- if (pa == -1) /* dl_prio() doesn't work because of stop_class above */ - return !dl_time_before(a->dl.deadline, b->dl.deadline); + if (pa == -1) { /* dl_prio() doesn't work because of stop_class above */ + const struct sched_dl_entity *a_dl, *b_dl; + + a_dl = &a->dl; + /* + * Since,'a' and 'b' can be CFS tasks served by DL server, + * __task_prio() can return -1 (for DL) even for those. In that + * case, get to the dl_server's DL entity. + */ + if (a->dl_server) + a_dl = a->dl_server; + + b_dl = &b->dl; + if (b->dl_server) + b_dl = b->dl_server; + + return !dl_time_before(a_dl->deadline, b_dl->deadline); + }
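prio_less() above ends up comparing two raw DL deadlines, taken from the dl_server entity when the task is a CFS task being served. The ordering relies on the wraparound-safe comparison that dl_time_before() provides; below is a standalone model of that signed-difference trick (a sketch mirroring the usual kernel idiom, not the kernel source itself):

#include <stdint.h>
#include <stdio.h>

/* "a is earlier than b", even when the 64-bit clock has wrapped. */
static int dl_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

int main(void)
{
	uint64_t almost_wrap = UINT64_MAX - 10;
	uint64_t wrapped     = 5;	/* numerically smaller, logically later */

	printf("%d\n", dl_time_before(almost_wrap, wrapped));	/* 1 */
	printf("%d\n", dl_time_before(wrapped, almost_wrap));	/* 0 */
	return 0;
}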
if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */ return cfs_prio_less(a, b, in_fi); @@@ -240,6 -259,9 +259,9 @@@ static inline int rb_sched_core_cmp(con
void sched_core_enqueue(struct rq *rq, struct task_struct *p) { + if (p->se.sched_delayed) + return; + rq->core->core_task_seq++;
if (!p->core_cookie) @@@ -250,6 -272,9 +272,9 @@@
void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { + if (p->se.sched_delayed) + return; + rq->core->core_task_seq++;
if (sched_core_enqueued(p)) { @@@ -1269,7 -1294,7 +1294,7 @@@ bool sched_can_stop_tick(struct rq *rq * dequeued by migrating while the constrained task continues to run. * E.g. going from 2->1 without going through pick_next_task(). */ - if (sched_feat(HZ_BW) && __need_bw_check(rq, rq->curr)) { + if (__need_bw_check(rq, rq->curr)) { if (cfs_task_bw_constrained(rq->curr)) return false; } @@@ -1672,6 -1697,9 +1697,9 @@@ static inline void uclamp_rq_inc(struc if (unlikely(!p->sched_class->uclamp_enabled)) return;
+ if (p->se.sched_delayed) + return; + for_each_clamp_id(clamp_id) uclamp_rq_inc_id(rq, p, clamp_id);
@@@ -1696,6 -1724,9 +1724,9 @@@ static inline void uclamp_rq_dec(struc if (unlikely(!p->sched_class->uclamp_enabled)) return;
+ if (p->se.sched_delayed) + return; + for_each_clamp_id(clamp_id) uclamp_rq_dec_id(rq, p, clamp_id); } @@@ -1975,14 -2006,21 +2006,21 @@@ void enqueue_task(struct rq *rq, struc psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)); }
- uclamp_rq_inc(rq, p); p->sched_class->enqueue_task(rq, p, flags); + /* + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear + * ->sched_delayed. + */ + uclamp_rq_inc(rq, p);
if (sched_core_enabled(rq)) sched_core_enqueue(rq, p); }
- void dequeue_task(struct rq *rq, struct task_struct *p, int flags) + /* + * Must only return false when DEQUEUE_SLEEP. + */ + inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags) { if (sched_core_enabled(rq)) sched_core_dequeue(rq, p, flags); @@@ -1995,8 -2033,12 +2033,12 @@@ psi_dequeue(p, flags & DEQUEUE_SLEEP); }
+ /* + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail' + * and mark the task ->sched_delayed. + */ uclamp_rq_dec(rq, p); - p->sched_class->dequeue_task(rq, p, flags); + return p->sched_class->dequeue_task(rq, p, flags); }
void activate_task(struct rq *rq, struct task_struct *p, int flags) @@@ -2014,12 -2056,25 +2056,25 @@@
void deactivate_task(struct rq *rq, struct task_struct *p, int flags) { - WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING); + SCHED_WARN_ON(flags & DEQUEUE_SLEEP); + + WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING); ASSERT_EXCLUSIVE_WRITER(p->on_rq);
+ /* + * Code explicitly relies on TASK_ON_RQ_MIGRATING being set *before* + * dequeue_task() and cleared *after* enqueue_task(). + */ + dequeue_task(rq, p, flags); }
+ static void block_task(struct rq *rq, struct task_struct *p, int flags) + { + if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags)) + __block_task(rq, p); + } + /** * task_curr - is this task currently executing on a CPU? * @p: the task in question. @@@ -2233,6 -2288,12 +2288,12 @@@ void migrate_disable(void struct task_struct *p = current;
if (p->migration_disabled) { + #ifdef CONFIG_DEBUG_PREEMPT + /* + * Warn about overflow half-way through the range. + */ + WARN_ON_ONCE((s16)p->migration_disabled < 0); + #endif p->migration_disabled++; return; } @@@ -2251,14 -2312,20 +2312,20 @@@ void migrate_enable(void .flags = SCA_MIGRATE_ENABLE, };
+ #ifdef CONFIG_DEBUG_PREEMPT + /* + * Check both overflow from migrate_disable() and superfluous + * migrate_enable(). + */ + if (WARN_ON_ONCE((s16)p->migration_disabled <= 0)) + return; + #endif + if (p->migration_disabled > 1) { p->migration_disabled--; return; }
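Both hunks above guard the migrate_disable() nesting depth with a cast to s16, so the warning fires once the counter crosses half of its range rather than only after it silently wraps. A userspace sketch of why that works, assuming the counter is a 16-bit unsigned field as the cast suggests:

#include <stdio.h>

typedef unsigned short u16;
typedef short s16;

int main(void)
{
	u16 depth = 0x7fff;	/* deepest "sane" nesting */

	depth++;		/* 0x8000: top bit set */
	if ((s16)depth < 0)	/* reinterpreted as signed, this is now negative */
		printf("overflow warning would fire at depth %u\n", (unsigned)depth);
	return 0;
}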
- if (WARN_ON_ONCE(!p->migration_disabled)) - return; - /* * Ensure stop_task runs either before or after this, and that * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule(). @@@ -3607,8 -3674,6 +3674,6 @@@ ttwu_do_activate(struct rq *rq, struct rq->idle_stamp = 0; } #endif - - p->dl_server = NULL; }
/* @@@ -3644,12 -3709,14 +3709,14 @@@ static int ttwu_runnable(struct task_st
rq = __task_rq_lock(p, &rf); if (task_on_rq_queued(p)) { + update_rq_clock(rq); + if (p->se.sched_delayed) + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED); if (!task_on_cpu(rq, p)) { /* * When on_rq && !on_cpu the task is preempted, see if * it should preempt the task that is current now. */ - update_rq_clock(rq); wakeup_preempt(rq, p, wake_flags); } ttwu_do_wakeup(p); @@@ -4029,11 -4096,16 +4096,16 @@@ int try_to_wake_up(struct task_struct * * case the whole 'p->on_rq && ttwu_runnable()' case below * without taking any locks. * + * Specifically, given current runs ttwu() we must be before + * schedule()'s block_task(), as such this must not observe + * sched_delayed. + * * In particular: * - we rely on Program-Order guarantees for all the ordering, * - we're serialized against set_special_state() by virtue of * it disabling IRQs (this allows not taking ->pi_lock). */ + SCHED_WARN_ON(p->se.sched_delayed); if (!ttwu_state_match(p, state, &success)) goto out;
@@@ -4322,9 -4394,11 +4394,11 @@@ static void __sched_fork(unsigned long p->se.nr_migrations = 0; p->se.vruntime = 0; p->se.vlag = 0; - p->se.slice = sysctl_sched_base_slice; INIT_LIST_HEAD(&p->se.group_node);
+ /* A delayed task cannot be in clone(). */ + SCHED_WARN_ON(p->se.sched_delayed); + #ifdef CONFIG_FAIR_GROUP_SCHED p->se.cfs_rq = NULL; #endif @@@ -4572,6 -4646,8 +4646,8 @@@ int sched_fork(unsigned long clone_flag
p->prio = p->normal_prio = p->static_prio; set_load_weight(p, false); + p->se.custom_slice = 0; + p->se.slice = sysctl_sched_base_slice;
/* * We don't need the reset flag anymore after the fork. It has @@@ -4686,7 -4762,7 +4762,7 @@@ void wake_up_new_task(struct task_struc update_rq_clock(rq); post_init_entity_util_avg(p);
- activate_task(rq, p, ENQUEUE_NOCLOCK); + activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_INITIAL); trace_sched_wakeup_new(p); wakeup_preempt(rq, p, WF_FORK); #ifdef CONFIG_SMP @@@ -5769,8 -5845,8 +5845,8 @@@ static inline void schedule_debug(struc schedstat_inc(this_rq()->sched_count); }
- static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, - struct rq_flags *rf) + static void prev_balance(struct rq *rq, struct task_struct *prev, + struct rq_flags *rf) { #ifdef CONFIG_SMP const struct sched_class *class; @@@ -5787,8 -5863,6 +5863,6 @@@ break; } #endif - - put_prev_task(rq, prev); }
/* @@@ -5800,6 -5874,8 +5874,8 @@@ __pick_next_task(struct rq *rq, struct const struct sched_class *class; struct task_struct *p;
+ rq->dl_server = NULL; + /* * Optimization: we know that if all tasks are in the fair class we can * call that function directly, but only if the @prev task wasn't of a @@@ -5815,35 -5891,28 +5891,28 @@@
/* Assume the next prioritized class is idle_sched_class */ if (!p) { - put_prev_task(rq, prev); - p = pick_next_task_idle(rq); + p = pick_task_idle(rq); + put_prev_set_next_task(rq, prev, p); }
- /* - * This is the fast path; it cannot be a DL server pick; - * therefore even if @p == @prev, ->dl_server must be NULL. - */ - if (p->dl_server) - p->dl_server = NULL; - return p; }
restart: - put_prev_task_balance(rq, prev, rf); - - /* - * We've updated @prev and no longer need the server link, clear it. - * Must be done before ->pick_next_task() because that can (re)set - * ->dl_server. - */ - if (prev->dl_server) - prev->dl_server = NULL; + prev_balance(rq, prev, rf);
for_each_class(class) { - p = class->pick_next_task(rq); - if (p) - return p; + if (class->pick_next_task) { + p = class->pick_next_task(rq, prev); + if (p) + return p; + } else { + p = class->pick_task(rq); + if (p) { + put_prev_set_next_task(rq, prev, p); + return p; + } + } }
BUG(); /* The idle class should always have a runnable task. */ @@@ -5873,6 -5942,8 +5942,8 @@@ static inline struct task_struct *pick_ const struct sched_class *class; struct task_struct *p;
+ rq->dl_server = NULL; + for_each_class(class) { p = class->pick_task(rq); if (p) @@@ -5911,6 -5982,7 +5982,7 @@@ pick_next_task(struct rq *rq, struct ta * another cpu during offline. */ rq->core_pick = NULL; + rq->core_dl_server = NULL; return __pick_next_task(rq, prev, rf); }
@@@ -5929,16 -6001,13 +6001,13 @@@ WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
next = rq->core_pick; - if (next != prev) { - put_prev_task(rq, prev); - set_next_task(rq, next); - } - + rq->dl_server = rq->core_dl_server; rq->core_pick = NULL; - goto out; + rq->core_dl_server = NULL; + goto out_set_next; }
- put_prev_task_balance(rq, prev, rf); + prev_balance(rq, prev, rf);
smt_mask = cpu_smt_mask(cpu); need_sync = !!rq->core->core_cookie; @@@ -5979,6 -6048,7 +6048,7 @@@ next = pick_task(rq); if (!next->core_cookie) { rq->core_pick = NULL; + rq->core_dl_server = NULL; /* * For robustness, update the min_vruntime_fi for * unconstrained picks as well. @@@ -6006,7 -6076,9 +6076,9 @@@ if (i != cpu && (rq_i != rq->core || !core_clock_updated)) update_rq_clock(rq_i);
- p = rq_i->core_pick = pick_task(rq_i); + rq_i->core_pick = p = pick_task(rq_i); + rq_i->core_dl_server = rq_i->dl_server; + if (!max || prio_less(max, p, fi_before)) max = p; } @@@ -6030,6 -6102,7 +6102,7 @@@ }
rq_i->core_pick = p; + rq_i->core_dl_server = NULL;
if (p == rq_i->idle) { if (rq_i->nr_running) { @@@ -6090,6 -6163,7 +6163,7 @@@
if (i == cpu) { rq_i->core_pick = NULL; + rq_i->core_dl_server = NULL; continue; }
@@@ -6098,6 -6172,7 +6172,7 @@@
if (rq_i->curr == rq_i->core_pick) { rq_i->core_pick = NULL; + rq_i->core_dl_server = NULL; continue; }
@@@ -6105,8 -6180,7 +6180,7 @@@ }
out_set_next: - set_next_task(rq, next); - out: + put_prev_set_next_task(rq, prev, next); if (rq->core->core_forceidle_count && next == rq->idle) queue_core_balance(rq);
@@@ -6342,19 -6416,12 +6416,12 @@@ pick_next_task(struct rq *rq, struct ta * Constants for the sched_mode argument of __schedule(). * * The mode argument allows RT enabled kernels to differentiate a - * preemption from blocking on an 'sleeping' spin/rwlock. Note that - * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to - * optimize the AND operation out and just check for zero. + * preemption from blocking on an 'sleeping' spin/rwlock. */ - #define SM_NONE 0x0 - #define SM_PREEMPT 0x1 - #define SM_RTLOCK_WAIT 0x2 - - #ifndef CONFIG_PREEMPT_RT - # define SM_MASK_PREEMPT (~0U) - #else - # define SM_MASK_PREEMPT SM_PREEMPT - #endif + #define SM_IDLE (-1) + #define SM_NONE 0 + #define SM_PREEMPT 1 + #define SM_RTLOCK_WAIT 2
/* * __schedule() is the main scheduler function. @@@ -6395,9 -6462,14 +6462,14 @@@ * * WARNING: must be called with preemption disabled! */ - static void __sched notrace __schedule(unsigned int sched_mode) + static void __sched notrace __schedule(int sched_mode) { struct task_struct *prev, *next; + /* + * On PREEMPT_RT kernel, SM_RTLOCK_WAIT is noted + * as a preemption by schedule_debug() and RCU. + */ + bool preempt = sched_mode > SM_NONE; unsigned long *switch_count; unsigned long prev_state; struct rq_flags rf; @@@ -6408,13 -6480,13 +6480,13 @@@ rq = cpu_rq(cpu); prev = rq->curr;
- schedule_debug(prev, !!sched_mode); + schedule_debug(prev, preempt);
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL)) hrtick_clear(rq);
local_irq_disable(); - rcu_note_context_switch(!!sched_mode); + rcu_note_context_switch(preempt);
/* * Make sure that signal_pending_state()->signal_pending() below @@@ -6443,22 -6515,32 +6515,32 @@@
switch_count = &prev->nivcsw;
+ /* Task state changes only consider SM_PREEMPT as preemption */ + preempt = sched_mode == SM_PREEMPT; + /* * We must load prev->state once (task_struct::state is volatile), such * that we form a control dependency vs deactivate_task() below. */ prev_state = READ_ONCE(prev->__state); - if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) { + if (sched_mode == SM_IDLE) { + if (!rq->nr_running) { + next = prev; + goto picked; + } + } else if (!preempt && prev_state) { if (signal_pending_state(prev_state, prev)) { WRITE_ONCE(prev->__state, TASK_RUNNING); } else { + int flags = DEQUEUE_NOCLOCK; + prev->sched_contributes_to_load = (prev_state & TASK_UNINTERRUPTIBLE) && !(prev_state & TASK_NOLOAD) && !(prev_state & TASK_FROZEN);
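With SM_MASK_PREEMPT gone, __schedule() above derives two different notions of 'preempt' from sched_mode: schedule_debug() and RCU treat everything above SM_NONE as a preemption (so SM_RTLOCK_WAIT counts), while the task-state handling only counts SM_PREEMPT. A tiny sketch of that mapping, reusing the constant values from the hunk above:

#include <stdbool.h>
#include <stdio.h>

enum { SM_IDLE = -1, SM_NONE = 0, SM_PREEMPT = 1, SM_RTLOCK_WAIT = 2 };

int main(void)
{
	for (int mode = SM_IDLE; mode <= SM_RTLOCK_WAIT; mode++) {
		bool rcu_preempt   = mode > SM_NONE;	 /* schedule_debug()/RCU view */
		bool state_preempt = mode == SM_PREEMPT; /* task-state view */

		printf("mode %2d: rcu=%d state=%d\n", mode, rcu_preempt, state_preempt);
	}
	return 0;
}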
- if (prev->sched_contributes_to_load) - rq->nr_uninterruptible++; + if (unlikely(is_special_task_state(prev_state))) + flags |= DEQUEUE_SPECIAL;
/* * __schedule() ttwu() @@@ -6471,17 -6553,13 +6553,13 @@@ * * After this, schedule() must not care about p->state any more. */ - deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); - - if (prev->in_iowait) { - atomic_inc(&rq->nr_iowait); - delayacct_blkio_start(); - } + block_task(rq, prev, flags); } switch_count = &prev->nvcsw; }
next = pick_next_task(rq, prev, &rf); + picked: clear_tsk_need_resched(prev); clear_preempt_need_resched(); #ifdef CONFIG_SCHED_DEBUG @@@ -6523,7 -6601,7 +6601,7 @@@ psi_account_irqtime(rq, prev, next); psi_sched_switch(prev, next, !task_on_rq_queued(prev));
- trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state); + trace_sched_switch(preempt, prev, next, prev_state);
/* Also unlocks the rq: */ rq = context_switch(rq, prev, next, &rf); @@@ -6599,7 -6677,7 +6677,7 @@@ static void sched_update_worker(struct } }
- static __always_inline void __schedule_loop(unsigned int sched_mode) + static __always_inline void __schedule_loop(int sched_mode) { do { preempt_disable(); @@@ -6644,7 -6722,7 +6722,7 @@@ void __sched schedule_idle(void */ WARN_ON_ONCE(current->__state); do { - __schedule(SM_NONE); + __schedule(SM_IDLE); } while (need_resched()); }
@@@ -7405,7 -7483,7 +7483,7 @@@ EXPORT_SYMBOL(io_schedule)
void sched_show_task(struct task_struct *p) { - unsigned long free = 0; + unsigned long free; int ppid;
if (!try_get_task_stack(p)) @@@ -7415,7 -7493,9 +7493,7 @@@
if (task_is_running(p)) pr_cont(" running task "); -#ifdef CONFIG_DEBUG_STACK_USAGE free = stack_not_used(p); -#endif ppid = 0; rcu_read_lock(); if (pid_alive(p)) @@@ -8226,8 -8306,6 +8304,6 @@@ void __init sched_init(void #endif /* CONFIG_RT_GROUP_SCHED */ }
- init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime()); - #ifdef CONFIG_SMP init_defrootdomain(); #endif @@@ -8282,8 -8360,13 +8358,13 @@@ init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL); #endif /* CONFIG_FAIR_GROUP_SCHED */
- rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime; #ifdef CONFIG_RT_GROUP_SCHED + /* + * This is required for init cpu because rt.c:__enable_runtime() + * starts working after scheduler_running, which is not the case + * yet. + */ + rq->rt.rt_runtime = global_rt_runtime(); init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL); #endif #ifdef CONFIG_SMP @@@ -8315,10 -8398,12 +8396,12 @@@ #endif /* CONFIG_SMP */ hrtick_rq_init(rq); atomic_set(&rq->nr_iowait, 0); + fair_server_init(rq);
#ifdef CONFIG_SCHED_CORE rq->core = rq; rq->core_pick = NULL; + rq->core_dl_server = NULL; rq->core_enabled = 0; rq->core_tree = RB_ROOT; rq->core_forceidle_count = 0; @@@ -8331,6 -8416,7 +8414,7 @@@ }
set_load_weight(&init_task, false); + init_task.se.slice = sysctl_sched_base_slice;
/* * The boot idle thread does lazy MMU switching as well: @@@ -8546,7 -8632,7 +8630,7 @@@ void normalize_rt_tasks(void schedstat_set(p->stats.sleep_start, 0); schedstat_set(p->stats.block_start, 0);
- if (!dl_task(p) && !rt_task(p)) { + if (!rt_or_dl_task(p)) { /* * Renice negative nice level userspace * tasks back to 0: diff --combined kernel/sched/fair.c index a1b756f927b23,b9784e13e6b6e..03d7c5e59a182 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@@ -511,7 -511,7 +511,7 @@@ static int cfs_rq_is_idle(struct cfs_r
static int se_is_idle(struct sched_entity *se) { - return 0; + return task_has_idle_policy(task_of(se)); }
#endif /* CONFIG_FAIR_GROUP_SCHED */ @@@ -779,8 -779,22 +779,22 @@@ static void update_min_vruntime(struct }
/* ensure we never gain time by being placed backwards. */ - u64_u32_store(cfs_rq->min_vruntime, - __update_min_vruntime(cfs_rq, vruntime)); + cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime); + } + + static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq) + { + struct sched_entity *root = __pick_root_entity(cfs_rq); + struct sched_entity *curr = cfs_rq->curr; + u64 min_slice = ~0ULL; + + if (curr && curr->on_rq) + min_slice = curr->slice; + + if (root) + min_slice = min(min_slice, root->min_slice); + + return min_slice; }
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b) @@@ -799,19 -813,34 +813,34 @@@ static inline void __min_vruntime_updat } }
+ static inline void __min_slice_update(struct sched_entity *se, struct rb_node *node) + { + if (node) { + struct sched_entity *rse = __node_2_se(node); + if (rse->min_slice < se->min_slice) + se->min_slice = rse->min_slice; + } + } + /* * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime) */ static inline bool min_vruntime_update(struct sched_entity *se, bool exit) { u64 old_min_vruntime = se->min_vruntime; + u64 old_min_slice = se->min_slice; struct rb_node *node = &se->run_node;
se->min_vruntime = se->vruntime; __min_vruntime_update(se, node->rb_right); __min_vruntime_update(se, node->rb_left);
- return se->min_vruntime == old_min_vruntime; + se->min_slice = se->slice; + __min_slice_update(se, node->rb_right); + __min_slice_update(se, node->rb_left); + + return se->min_vruntime == old_min_vruntime && + se->min_slice == old_min_slice; }
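min_vruntime_update() above now also keeps a per-node min_slice, so the smallest request size of a whole subtree is available at the root without walking the tree (cfs_rq_min_slice() reads it in O(1)). A plain recursive model of that invariant, deliberately ignoring the augmented-rbtree callback machinery, so it is only a sketch:

#include <stdint.h>
#include <stdio.h>

struct node {
	uint64_t slice;		/* this entity's own request size */
	uint64_t min_slice;	/* minimum over the whole subtree */
	struct node *left, *right;
};

static uint64_t update_min_slice(struct node *n)
{
	if (!n)
		return UINT64_MAX;

	n->min_slice = n->slice;
	uint64_t l = update_min_slice(n->left);
	uint64_t r = update_min_slice(n->right);

	if (l < n->min_slice)
		n->min_slice = l;
	if (r < n->min_slice)
		n->min_slice = r;
	return n->min_slice;
}

int main(void)
{
	struct node left  = { .slice = 1000000 };	/* 1ms request */
	struct node right = { .slice = 3000000 };
	struct node root  = { .slice = 4000000, .left = &left, .right = &right };

	printf("root min_slice = %llu\n",
	       (unsigned long long)update_min_slice(&root));	/* 1000000 */
	return 0;
}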
RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity, @@@ -824,6 -853,7 +853,7 @@@ static void __enqueue_entity(struct cfs { avg_vruntime_add(cfs_rq, se); se->min_vruntime = se->vruntime; + se->min_slice = se->slice; rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less, &min_vruntime_cb); } @@@ -974,17 -1004,18 +1004,18 @@@ static void clear_buddies(struct cfs_r * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i * this is probably good enough. */ - static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) + static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) { if ((s64)(se->vruntime - se->deadline) < 0) - return; + return false;
/* * For EEVDF the virtual time slope is determined by w_i (iow. * nice) while the request time r_i is determined by * sysctl_sched_base_slice. */ - se->slice = sysctl_sched_base_slice; + if (!se->custom_slice) + se->slice = sysctl_sched_base_slice;
/* * EEVDF: vd_i = ve_i + r_i / w_i @@@ -994,10 -1025,7 +1025,7 @@@ /* * The task has consumed its request, reschedule. */ - if (cfs_rq->nr_running > 1) { - resched_curr(rq_of(cfs_rq)); - clear_buddies(cfs_rq, se); - } + return true; }
#include "pelt.h" @@@ -1135,6 -1163,38 +1163,38 @@@ static inline void update_curr_task(str dl_server_update(p->dl_server, delta_exec); }
+ static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr) + { + if (!sched_feat(PREEMPT_SHORT)) + return false; + + if (curr->vlag == curr->deadline) + return false; + + return !entity_eligible(cfs_rq, curr); + } + + static inline bool do_preempt_short(struct cfs_rq *cfs_rq, + struct sched_entity *pse, struct sched_entity *se) + { + if (!sched_feat(PREEMPT_SHORT)) + return false; + + if (pse->slice >= se->slice) + return false; + + if (!entity_eligible(cfs_rq, pse)) + return false; + + if (entity_before(pse, se)) + return true; + + if (!entity_eligible(cfs_rq, se)) + return true; + + return false; + } + /* * Used by other classes to account runtime. */ @@@ -1156,23 -1216,44 +1216,44 @@@ s64 update_curr_common(struct rq *rq static void update_curr(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; + struct rq *rq = rq_of(cfs_rq); s64 delta_exec; + bool resched;
if (unlikely(!curr)) return;
- delta_exec = update_curr_se(rq_of(cfs_rq), curr); + delta_exec = update_curr_se(rq, curr); if (unlikely(delta_exec <= 0)) return;
curr->vruntime += calc_delta_fair(delta_exec, curr); - update_deadline(cfs_rq, curr); + resched = update_deadline(cfs_rq, curr); update_min_vruntime(cfs_rq);
- if (entity_is_task(curr)) - update_curr_task(task_of(curr), delta_exec); + if (entity_is_task(curr)) { + struct task_struct *p = task_of(curr); + + update_curr_task(p, delta_exec); + + /* + * Any fair task that runs outside of fair_server should + * account against fair_server such that it can account for + * this time and possibly avoid running this period. + */ + if (p->dl_server != &rq->fair_server) + dl_server_update(&rq->fair_server, delta_exec); + }
account_cfs_rq_runtime(cfs_rq, delta_exec); + + if (rq->nr_running == 1) + return; + + if (resched || did_preempt_short(cfs_rq, curr)) { + resched_curr(rq); + clear_buddies(cfs_rq, curr); + } }
static void update_curr_fair(struct rq *rq) @@@ -1742,7 -1823,7 +1823,7 @@@ static bool pgdat_free_space_enough(str continue;
if (zone_watermark_ok(zone, 0, - wmark_pages(zone, WMARK_PROMO) + enough_wmark, + promo_wmark_pages(zone) + enough_wmark, ZONE_MOVABLE, 0)) return true; } @@@ -1840,7 -1921,8 +1921,7 @@@ bool should_numa_migrate_memory(struct * The pages in slow memory node should be migrated according * to hot/cold instead of private/shared. */ - if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && - !node_is_toptier(src_nid)) { + if (folio_use_access_time(folio)) { struct pglist_data *pgdat; unsigned long rate_limit; unsigned int latency, th, def_th; @@@ -3187,15 -3269,6 +3268,15 @@@ static bool vma_is_accessed(struct mm_s return true; }
+ /* + * This vma has not been accessed for a while, and if the number + * of threads in the same process is low, which means no other + * threads can help scan this vma, force a vma scan. + */ + if (READ_ONCE(mm->numa_scan_seq) > + (vma->numab_state->prev_scan_seq + get_nr_threads(current))) + return true; + return false; }
@@@ -5186,7 -5259,8 +5267,8 @@@ place_entity(struct cfs_rq *cfs_rq, str u64 vslice, vruntime = avg_vruntime(cfs_rq); s64 lag = 0;
- se->slice = sysctl_sched_base_slice; + if (!se->custom_slice) + se->slice = sysctl_sched_base_slice; vslice = calc_delta_fair(se->slice, se);
/* @@@ -5267,6 -5341,12 +5349,12 @@@
se->vruntime = vruntime - lag;
+ if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) { + se->deadline += se->vruntime; + se->rel_deadline = 0; + return; + } + /* * When joining the competition; the existing tasks will be, * on average, halfway through their slice, as such start tasks @@@ -5286,6 -5366,9 +5374,9 @@@ static inline int cfs_rq_throttled(stru
static inline bool cfs_bandwidth_used(void);
+ static void + requeue_delayed_entity(struct sched_entity *se); + static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { @@@ -5373,19 -5456,47 +5464,47 @@@ static void clear_buddies(struct cfs_r
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
- static void + static inline void finish_delayed_dequeue_entity(struct sched_entity *se) + { + se->sched_delayed = 0; + if (sched_feat(DELAY_ZERO) && se->vlag > 0) + se->vlag = 0; + } + + static bool dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { - int action = UPDATE_TG; + bool sleep = flags & DEQUEUE_SLEEP;
+ update_curr(cfs_rq); + + if (flags & DEQUEUE_DELAYED) { + SCHED_WARN_ON(!se->sched_delayed); + } else { + bool delay = sleep; + /* + * DELAY_DEQUEUE relies on spurious wakeups, special task + * states must not suffer spurious wakeups, exempt them. + */ + if (flags & DEQUEUE_SPECIAL) + delay = false; + + SCHED_WARN_ON(delay && se->sched_delayed); + + if (sched_feat(DELAY_DEQUEUE) && delay && + !entity_eligible(cfs_rq, se)) { + if (cfs_rq->next == se) + cfs_rq->next = NULL; + update_load_avg(cfs_rq, se, 0); + se->sched_delayed = 1; + return false; + } + } + + int action = UPDATE_TG; if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) action |= DO_DETACH;
- /* - * Update run-time statistics of the 'current'. - */ - update_curr(cfs_rq); - /* * When dequeuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. @@@ -5403,6 -5514,11 +5522,11 @@@ clear_buddies(cfs_rq, se);
update_entity_lag(cfs_rq, se); + if (sched_feat(PLACE_REL_DEADLINE) && !sleep) { + se->deadline -= se->vruntime; + se->rel_deadline = 1; + } + if (se != cfs_rq->curr) __dequeue_entity(cfs_rq, se); se->on_rq = 0; @@@ -5422,8 -5538,13 +5546,13 @@@ if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE) update_min_vruntime(cfs_rq);
+ if (flags & DEQUEUE_DELAYED) + finish_delayed_dequeue_entity(se); + if (cfs_rq->nr_running == 0) update_idle_cfs_rq_clock_pelt(cfs_rq); + + return true; }
static void @@@ -5449,6 -5570,7 +5578,7 @@@ set_next_entity(struct cfs_rq *cfs_rq, }
update_stats_curr_start(cfs_rq, se); + SCHED_WARN_ON(cfs_rq->curr); cfs_rq->curr = se;
/* @@@ -5469,6 -5591,8 +5599,8 @@@ se->prev_sum_exec_runtime = se->sum_exec_runtime; }
+ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags); + /* * Pick the next process, keeping these things in mind, in this order: * 1) keep things fair between processes/task groups @@@ -5477,16 -5601,26 +5609,26 @@@ * 4) do not run the "skip" process, if something else is available */ static struct sched_entity * - pick_next_entity(struct cfs_rq *cfs_rq) + pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) { /* * Enabling NEXT_BUDDY will affect latency but not fairness. */ if (sched_feat(NEXT_BUDDY) && - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { + /* ->next will never be delayed */ + SCHED_WARN_ON(cfs_rq->next->sched_delayed); return cfs_rq->next; + }
- return pick_eevdf(cfs_rq); + struct sched_entity *se = pick_eevdf(cfs_rq); + if (se->sched_delayed) { + dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); + SCHED_WARN_ON(se->sched_delayed); + SCHED_WARN_ON(se->on_rq); + return NULL; + } + return se; }
static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq); @@@ -5510,6 -5644,7 +5652,7 @@@ static void put_prev_entity(struct cfs_ /* in !on_rq case, update occurred at dequeue */ update_load_avg(cfs_rq, prev, 0); } + SCHED_WARN_ON(cfs_rq->curr != prev); cfs_rq->curr = NULL; }
@@@ -5773,6 -5908,7 +5916,7 @@@ static bool throttle_cfs_rq(struct cfs_ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; long task_delta, idle_task_delta, dequeue = 1; + long rq_h_nr_running = rq->cfs.h_nr_running;
raw_spin_lock(&cfs_b->lock); /* This will start the period timer if necessary */ @@@ -5806,11 -5942,21 +5950,21 @@@ idle_task_delta = cfs_rq->idle_h_nr_running; for_each_sched_entity(se) { struct cfs_rq *qcfs_rq = cfs_rq_of(se); + int flags; + /* throttled entity or throttle-on-deactivate */ if (!se->on_rq) goto done;
- dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP); + /* + * Abuse SPECIAL to avoid delayed dequeue in this instance. + * This avoids teaching dequeue_entities() about throttled + * entities and keeps things relatively simple. + */ + flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL; + if (se->sched_delayed) + flags |= DEQUEUE_DELAYED; + dequeue_entity(qcfs_rq, se, flags);
if (cfs_rq_is_idle(group_cfs_rq(se))) idle_task_delta = cfs_rq->h_nr_running; @@@ -5844,6 -5990,9 +5998,9 @@@ /* At this point se is NULL and we are at root level*/ sub_nr_running(rq, task_delta);
+ /* Stop the fair server if throttling resulted in no runnable tasks */ + if (rq_h_nr_running && !rq->cfs.h_nr_running) + dl_server_stop(&rq->fair_server); done: /* * Note: distribution will already see us throttled via the @@@ -5862,6 -6011,7 +6019,7 @@@ void unthrottle_cfs_rq(struct cfs_rq *c struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; long task_delta, idle_task_delta; + long rq_h_nr_running = rq->cfs.h_nr_running;
se = cfs_rq->tg->se[cpu_of(rq)];
@@@ -5899,8 -6049,10 +6057,10 @@@ for_each_sched_entity(se) { struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- if (se->on_rq) + if (se->on_rq) { + SCHED_WARN_ON(se->sched_delayed); break; + } enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
if (cfs_rq_is_idle(group_cfs_rq(se))) @@@ -5931,6 -6083,10 +6091,10 @@@ goto unthrottle_throttle; }
+ /* Start the fair server if un-throttling resulted in new runnable tasks */ + if (!rq_h_nr_running && rq->cfs.h_nr_running) + dl_server_start(&rq->fair_server); + /* At this point se is NULL and we are at root level*/ add_nr_running(rq, task_delta);
@@@ -6563,7 -6719,7 +6727,7 @@@ static void sched_fair_update_stop_tick { int cpu = cpu_of(rq);
- if (!sched_feat(HZ_BW) || !cfs_bandwidth_used()) + if (!cfs_bandwidth_used()) return;
if (!tick_nohz_full_cpu(cpu)) @@@ -6746,6 -6902,37 +6910,37 @@@ static int sched_idle_cpu(int cpu } #endif
+ static void + requeue_delayed_entity(struct sched_entity *se) + { + struct cfs_rq *cfs_rq = cfs_rq_of(se); + + /* + * se->sched_delayed should imply: se->on_rq == 1. + * Because a delayed entity is one that is still on + * the runqueue competing until eligibility. + */ + SCHED_WARN_ON(!se->sched_delayed); + SCHED_WARN_ON(!se->on_rq); + + if (sched_feat(DELAY_ZERO)) { + update_entity_lag(cfs_rq, se); + if (se->vlag > 0) { + cfs_rq->nr_running--; + if (se != cfs_rq->curr) + __dequeue_entity(cfs_rq, se); + se->vlag = 0; + place_entity(cfs_rq, se, 0); + if (se != cfs_rq->curr) + __enqueue_entity(cfs_rq, se); + cfs_rq->nr_running++; + } + } + + update_load_avg(cfs_rq, se, 0); + se->sched_delayed = 0; + } + /* * The enqueue_task method is called before nr_running is * increased. Here we update the fair scheduling stats and @@@ -6758,6 -6945,8 +6953,8 @@@ enqueue_task_fair(struct rq *rq, struc struct sched_entity *se = &p->se; int idle_h_nr_running = task_has_idle_policy(p); int task_new = !(flags & ENQUEUE_WAKEUP); + int rq_h_nr_running = rq->cfs.h_nr_running; + u64 slice = 0;
/* * The code below (indirectly) updates schedutil which looks at @@@ -6765,7 -6954,13 +6962,13 @@@ * Let's add the task's estimated utilization to the cfs_rq's * estimated utilization, before we update schedutil. */ - util_est_enqueue(&rq->cfs, p); + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE)))) + util_est_enqueue(&rq->cfs, p); + + if (flags & ENQUEUE_DELAYED) { + requeue_delayed_entity(se); + return; + }
/* * If in_iowait is set, the code below may not trigger any cpufreq @@@ -6776,10 -6971,24 +6979,24 @@@ cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
for_each_sched_entity(se) { - if (se->on_rq) + if (se->on_rq) { + if (se->sched_delayed) + requeue_delayed_entity(se); break; + } cfs_rq = cfs_rq_of(se); + + /* + * Basically set the slice of group entries to the min_slice of + * their respective cfs_rq. This ensures the group can service + * its entities in the desired time-frame. + */ + if (slice) { + se->slice = slice; + se->custom_slice = 1; + } enqueue_entity(cfs_rq, se, flags); + slice = cfs_rq_min_slice(cfs_rq);
cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running; @@@ -6801,6 -7010,9 +7018,9 @@@ se_update_runnable(se); update_cfs_group(se);
+ se->slice = slice; + slice = cfs_rq_min_slice(cfs_rq); + cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@@ -6812,6 -7024,13 +7032,13 @@@ goto enqueue_throttle; }
+ if (!rq_h_nr_running && rq->cfs.h_nr_running) { + /* Account for idle runtime */ + if (!rq->nr_running) + dl_server_update_idle_time(rq, rq->curr); + dl_server_start(&rq->fair_server); + } + /* At this point se is NULL and we are at root level*/ add_nr_running(rq, 1);
@@@ -6841,36 -7060,59 +7068,59 @@@ enqueue_throttle static void set_next_buddy(struct sched_entity *se);
/* - * The dequeue_task method is called before nr_running is - * decreased. We remove the task from the rbtree and - * update the fair scheduling stats: + * Basically dequeue_task_fair(), except it can deal with dequeue_entity() + * failing half-way through and resume the dequeue later. + * + * Returns: + * -1 - dequeue delayed + * 0 - dequeue throttled + * 1 - dequeue complete */ - static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) + static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) { - struct cfs_rq *cfs_rq; - struct sched_entity *se = &p->se; - int task_sleep = flags & DEQUEUE_SLEEP; - int idle_h_nr_running = task_has_idle_policy(p); bool was_sched_idle = sched_idle_rq(rq); + int rq_h_nr_running = rq->cfs.h_nr_running; + bool task_sleep = flags & DEQUEUE_SLEEP; + bool task_delayed = flags & DEQUEUE_DELAYED; + struct task_struct *p = NULL; + int idle_h_nr_running = 0; + int h_nr_running = 0; + struct cfs_rq *cfs_rq; + u64 slice = 0;
- util_est_dequeue(&rq->cfs, p); + if (entity_is_task(se)) { + p = task_of(se); + h_nr_running = 1; + idle_h_nr_running = task_has_idle_policy(p); + } else { + cfs_rq = group_cfs_rq(se); + slice = cfs_rq_min_slice(cfs_rq); + }
for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - dequeue_entity(cfs_rq, se, flags);
- cfs_rq->h_nr_running--; + if (!dequeue_entity(cfs_rq, se, flags)) { + if (p && &p->se == se) + return -1; + + break; + } + + cfs_rq->h_nr_running -= h_nr_running; cfs_rq->idle_h_nr_running -= idle_h_nr_running;
if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running = 1; + idle_h_nr_running = h_nr_running;
/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - goto dequeue_throttle; + return 0;
/* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { + slice = cfs_rq_min_slice(cfs_rq); + /* Avoid re-evaluating load for this entity: */ se = parent_entity(se); /* @@@ -6882,6 -7124,7 +7132,7 @@@ break; } flags |= DEQUEUE_SLEEP; + flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL); }
for_each_sched_entity(se) { @@@ -6891,28 -7134,61 +7142,61 @@@ se_update_runnable(se); update_cfs_group(se);
- cfs_rq->h_nr_running--; + se->slice = slice; + slice = cfs_rq_min_slice(cfs_rq); + + cfs_rq->h_nr_running -= h_nr_running; cfs_rq->idle_h_nr_running -= idle_h_nr_running;
if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running = 1; + idle_h_nr_running = h_nr_running;
/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - goto dequeue_throttle; - + return 0; }
- /* At this point se is NULL and we are at root level*/ - sub_nr_running(rq, 1); + sub_nr_running(rq, h_nr_running); + + if (rq_h_nr_running && !rq->cfs.h_nr_running) + dl_server_stop(&rq->fair_server);
/* balance early to pull high priority tasks */ if (unlikely(!was_sched_idle && sched_idle_rq(rq))) rq->next_balance = jiffies;
- dequeue_throttle: - util_est_update(&rq->cfs, p, task_sleep); + if (p && task_delayed) { + SCHED_WARN_ON(!task_sleep); + SCHED_WARN_ON(p->on_rq != 1); + + /* Fix-up what dequeue_task_fair() skipped */ + hrtick_update(rq); + + /* Fix-up what block_task() skipped. */ + __block_task(rq, p); + } + + return 1; + } + + /* + * The dequeue_task method is called before nr_running is + * decreased. We remove the task from the rbtree and + * update the fair scheduling stats: + */ + static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) + { + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE)))) + util_est_dequeue(&rq->cfs, p); + + if (dequeue_entities(rq, &p->se, flags) < 0) { + util_est_update(&rq->cfs, p, DEQUEUE_SLEEP); + return false; + } + + util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); hrtick_update(rq); + return true; }
#ifdef CONFIG_SMP @@@ -7810,6 -8086,105 +8094,105 @@@ static unsigned long cpu_util_without(i return cpu_util(cpu, p, -1, 0); }
+ /* + * This function computes an effective utilization for the given CPU, to be + * used for frequency selection given the linear relation: f = u * f_max. + * + * The scheduler tracks the following metrics: + * + * cpu_util_{cfs,rt,dl,irq}() + * cpu_bw_dl() + * + * Where the cfs,rt and dl util numbers are tracked with the same metric and + * synchronized windows and are thus directly comparable. + * + * The cfs,rt,dl utilization are the running times measured with rq->clock_task + * which excludes things like IRQ and steal-time. These latter are then accrued + * in the IRQ utilization. + * + * The DL bandwidth number OTOH is not a measured metric but a value computed + * based on the task model parameters and gives the minimal utilization + * required to meet deadlines. + */ + unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, + unsigned long *min, + unsigned long *max) + { + unsigned long util, irq, scale; + struct rq *rq = cpu_rq(cpu); + + scale = arch_scale_cpu_capacity(cpu); + + /* + * Early check to see if IRQ/steal time saturates the CPU, can be + * because of inaccuracies in how we track these -- see + * update_irq_load_avg(). + */ + irq = cpu_util_irq(rq); + if (unlikely(irq >= scale)) { + if (min) + *min = scale; + if (max) + *max = scale; + return scale; + } + + if (min) { + /* + * The minimum utilization returns the highest level between: + * - the computed DL bandwidth needed with the IRQ pressure which + * steals time to the deadline task. + * - The minimum performance requirement for CFS and/or RT. + */ + *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN)); + + /* + * When an RT task is runnable and uclamp is not used, we must + * ensure that the task will run at maximum compute capacity. + */ + if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt)) + *min = max(*min, scale); + } + + /* + * Because the time spend on RT/DL tasks is visible as 'lost' time to + * CFS tasks and we use the same metric to track the effective + * utilization (PELT windows are synchronized) we can directly add them + * to obtain the CPU's actual utilization. + */ + util = util_cfs + cpu_util_rt(rq); + util += cpu_util_dl(rq); + + /* + * The maximum hint is a soft bandwidth requirement, which can be lower + * than the actual utilization because of uclamp_max requirements. + */ + if (max) + *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX)); + + if (util >= scale) + return scale; + + /* + * There is still idle time; further improve the number by using the + * IRQ metric. Because IRQ/steal time is hidden from the task clock we + * need to scale the task numbers: + * + * max - irq + * U' = irq + --------- * U + * max + */ + util = scale_irq_capacity(util, irq, scale); + util += irq; + + return min(scale, util); + } + + unsigned long sched_cpu_util(int cpu) + { + return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL); + } + /* * energy_env - Utilization landscape for energy estimation. * @task_busy_time: Utilization contribution by the task for which we test the @@@ -8294,7 -8669,21 +8677,21 @@@ static void migrate_task_rq_fair(struc
static void task_dead_fair(struct task_struct *p) { - remove_entity_load_avg(&p->se); + struct sched_entity *se = &p->se; + + if (se->sched_delayed) { + struct rq_flags rf; + struct rq *rq; + + rq = task_rq_lock(p, &rf); + if (se->sched_delayed) { + update_rq_clock(rq); + dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); + } + task_rq_unlock(rq, p, &rf); + } + + remove_entity_load_avg(se); }
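The effective_cpu_util() block added a little further up documents the IRQ correction as U' = irq + (max - irq) / max * U: time stolen by IRQ/steal is invisible to the task clock, so the task utilization is first compressed into the remaining capacity and the IRQ share is then added back. A small numeric sketch of just that step, with made-up values and without the kernel's scale_irq_capacity() helper:

#include <stdio.h>

static unsigned long scale_irq(unsigned long util, unsigned long irq,
			       unsigned long max)
{
	util = util * (max - irq) / max;	/* squeeze into the non-IRQ share */
	return util + irq;			/* then account the IRQ time itself */
}

int main(void)
{
	unsigned long scale     = 1024;	/* CPU capacity */
	unsigned long irq       = 256;	/* 25% lost to IRQ/steal time */
	unsigned long cfs_rt_dl = 512;	/* combined task utilization */

	/* 512 * (1024 - 256) / 1024 + 256 = 384 + 256 = 640 */
	printf("effective util = %lu\n", scale_irq(cfs_rt_dl, irq, scale));
	return 0;
}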
/* @@@ -8330,7 -8719,7 +8727,7 @@@ static void set_cpus_allowed_fair(struc static int balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { - if (rq->nr_running) + if (sched_fair_runnable(rq)) return 1;
return sched_balance_newidle(rq, rf) != 0; @@@ -8389,16 -8778,7 +8786,7 @@@ static void check_preempt_wakeup_fair(s if (test_tsk_need_resched(curr)) return;
- /* Idle tasks are by definition preempted by non-idle tasks. */ - if (unlikely(task_has_idle_policy(curr)) && - likely(!task_has_idle_policy(p))) - goto preempt; - - /* - * Batch and idle tasks do not preempt non-idle tasks (their preemption - * is driven by the tick): - */ - if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) + if (!sched_feat(WAKEUP_PREEMPTION)) return;
find_matching_se(&se, &pse); @@@ -8408,7 -8788,7 +8796,7 @@@ pse_is_idle = se_is_idle(pse);
/* - * Preempt an idle group in favor of a non-idle group (and don't preempt + * Preempt an idle entity in favor of a non-idle entity (and don't preempt * in the inverse case). */ if (cse_is_idle && !pse_is_idle) @@@ -8416,11 -8796,26 +8804,26 @@@ if (cse_is_idle != pse_is_idle) return;
+ /* + * BATCH and IDLE tasks do not preempt others. + */ + if (unlikely(p->policy != SCHED_NORMAL)) + return; + cfs_rq = cfs_rq_of(se); update_curr(cfs_rq); + /* + * If @p has a shorter slice than current and @p is eligible, override + * current's slice protection in order to allow preemption. + * + * Note that even if @p does not turn out to be the most eligible + * task at this moment, current's slice protection will be lost. + */ + if (do_preempt_short(cfs_rq, pse, se) && se->vlag == se->deadline) + se->vlag = se->deadline + 1;
/* - * XXX pick_eevdf(cfs_rq) != se ? + * If @p has become the most eligible task, force preemption. */ if (pick_eevdf(cfs_rq) == pse) goto preempt; @@@ -8431,7 -8826,6 +8834,6 @@@ preempt resched_curr(rq); }
- #ifdef CONFIG_SMP static struct task_struct *pick_task_fair(struct rq *rq) { struct sched_entity *se; @@@ -8443,95 -8837,58 +8845,58 @@@ again return NULL;
do { - struct sched_entity *curr = cfs_rq->curr; - - /* When we pick for a remote RQ, we'll not have done put_prev_entity() */ - if (curr) { - if (curr->on_rq) - update_curr(cfs_rq); - else - curr = NULL; + /* Might not have done put_prev_entity() */ + if (cfs_rq->curr && cfs_rq->curr->on_rq) + update_curr(cfs_rq);
- if (unlikely(check_cfs_rq_runtime(cfs_rq))) - goto again; - } + if (unlikely(check_cfs_rq_runtime(cfs_rq))) + goto again;
- se = pick_next_entity(cfs_rq); + se = pick_next_entity(rq, cfs_rq); + if (!se) + goto again; cfs_rq = group_cfs_rq(se); } while (cfs_rq);
return task_of(se); } - #endif + + static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first); + static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { - struct cfs_rq *cfs_rq = &rq->cfs; struct sched_entity *se; struct task_struct *p; int new_tasks;
again: - if (!sched_fair_runnable(rq)) + p = pick_task_fair(rq); + if (!p) goto idle; + se = &p->se;
#ifdef CONFIG_FAIR_GROUP_SCHED - if (!prev || prev->sched_class != &fair_sched_class) + if (prev->sched_class != &fair_sched_class) goto simple;
+ __put_prev_set_next_dl_server(rq, prev, p); + /* * Because of the set_next_buddy() in dequeue_task_fair() it is rather * likely that a next task is from the same cgroup as the current. * * Therefore attempt to avoid putting and setting the entire cgroup * hierarchy, only change the part that actually changes. - */ - - do { - struct sched_entity *curr = cfs_rq->curr; - - /* - * Since we got here without doing put_prev_entity() we also - * have to consider cfs_rq->curr. If it is still a runnable - * entity, update_curr() will update its vruntime, otherwise - * forget we've ever seen it. - */ - if (curr) { - if (curr->on_rq) - update_curr(cfs_rq); - else - curr = NULL; - - /* - * This call to check_cfs_rq_runtime() will do the - * throttle and dequeue its entity in the parent(s). - * Therefore the nr_running test will indeed - * be correct. - */ - if (unlikely(check_cfs_rq_runtime(cfs_rq))) { - cfs_rq = &rq->cfs; - - if (!cfs_rq->nr_running) - goto idle; - - goto simple; - } - } - - se = pick_next_entity(cfs_rq); - cfs_rq = group_cfs_rq(se); - } while (cfs_rq); - - p = task_of(se); - - /* + * * Since we haven't yet done put_prev_entity and if the selected task * is a different task than we started out with, try and touch the * least amount of cfs_rqs. */ if (prev != p) { struct sched_entity *pse = &prev->se; + struct cfs_rq *cfs_rq;
while (!(cfs_rq = is_same_group(se, pse))) { int se_depth = se->depth; @@@ -8549,38 -8906,15 +8914,15 @@@
put_prev_entity(cfs_rq, pse); set_next_entity(cfs_rq, se); - } - - goto done; - simple: - #endif - if (prev) - put_prev_task(rq, prev);
- do { - se = pick_next_entity(cfs_rq); - set_next_entity(cfs_rq, se); - cfs_rq = group_cfs_rq(se); - } while (cfs_rq); + __set_next_task_fair(rq, p, true); + }
- p = task_of(se); + return p;
- done: __maybe_unused; - #ifdef CONFIG_SMP - /* - * Move the next running task to the front of - * the list, so our cfs_tasks list becomes MRU - * one. - */ - list_move(&p->se.group_node, &rq->cfs_tasks); + simple: #endif - - if (hrtick_enabled_fair(rq)) - hrtick_start_fair(rq, p); - - update_misfit_status(p, rq); - sched_fair_update_stop_tick(rq, p); - + put_prev_set_next_task(rq, prev, p); return p;
idle: @@@ -8609,15 -8943,34 +8951,34 @@@ return NULL; }
- static struct task_struct *__pick_next_task_fair(struct rq *rq) + static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev) { - return pick_next_task_fair(rq, NULL, NULL); + return pick_next_task_fair(rq, prev, NULL); + } + + static bool fair_server_has_tasks(struct sched_dl_entity *dl_se) + { + return !!dl_se->rq->cfs.nr_running; + } + + static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se) + { + return pick_task_fair(dl_se->rq); + } + + void fair_server_init(struct rq *rq) + { + struct sched_dl_entity *dl_se = &rq->fair_server; + + init_dl_entity(dl_se); + + dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task); }
/* * Account for a descheduled task: */ - static void put_prev_task_fair(struct rq *rq, struct task_struct *prev) + static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct task_struct *next) { struct sched_entity *se = &prev->se; struct cfs_rq *cfs_rq; @@@ -9368,9 -9721,10 +9729,10 @@@ static bool __update_blocked_others(str
hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
+ /* hw_pressure doesn't care about invariance */ decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | - update_hw_load_avg(now, rq, hw_pressure) | + update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) | update_irq_load_avg(rq, 0);
if (others_have_blocked(rq)) @@@ -12491,7 -12845,7 +12853,7 @@@ out * - indirectly from a remote scheduler_tick() for NOHZ idle balancing * through the SMP cross-call nohz_csd_func() */ - static __latent_entropy void sched_balance_softirq(struct softirq_action *h) + static __latent_entropy void sched_balance_softirq(void) { struct rq *this_rq = this_rq(); enum cpu_idle_type idle = this_rq->idle_balance; @@@ -12710,22 -13064,7 +13072,7 @@@ static void task_tick_fair(struct rq *r */ static void task_fork_fair(struct task_struct *p) { - struct sched_entity *se = &p->se, *curr; - struct cfs_rq *cfs_rq; - struct rq *rq = this_rq(); - struct rq_flags rf; - - rq_lock(rq, &rf); - update_rq_clock(rq); - set_task_max_allowed_capacity(p); - - cfs_rq = task_cfs_rq(current); - curr = cfs_rq->curr; - if (curr) - update_curr(cfs_rq); - place_entity(cfs_rq, se, ENQUEUE_INITIAL); - rq_unlock(rq, &rf); }
/* @@@ -12837,10 -13176,28 +13184,28 @@@ static void attach_task_cfs_rq(struct t static void switched_from_fair(struct rq *rq, struct task_struct *p) { detach_task_cfs_rq(p); + /* + * Since this is called after changing class, this is a little weird + * and we cannot use DEQUEUE_DELAYED. + */ + if (p->se.sched_delayed) { + /* First, dequeue it from its new class' structures */ + dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP); + /* + * Now, clean up the fair_sched_class side of things + * related to sched_delayed being true and that wasn't done + * due to the generic dequeue not using DEQUEUE_DELAYED. + */ + finish_delayed_dequeue_entity(&p->se); + p->se.rel_deadline = 0; + __block_task(rq, p); + } }
static void switched_to_fair(struct rq *rq, struct task_struct *p) { + SCHED_WARN_ON(p->se.sched_delayed); + attach_task_cfs_rq(p);
set_task_max_allowed_capacity(p); @@@ -12858,12 -13215,7 +13223,7 @@@ } }
- /* Account for a task changing its policy or group. - * - * This routine is mostly called to set cfs_rq->curr field when a task - * migrates between groups/classes. - */ - static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) + static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) { struct sched_entity *se = &p->se;
@@@ -12876,6 -13228,27 +13236,27 @@@ list_move(&se->group_node, &rq->cfs_tasks); } #endif + if (!first) + return; + + SCHED_WARN_ON(se->sched_delayed); + + if (hrtick_enabled_fair(rq)) + hrtick_start_fair(rq, p); + + update_misfit_status(p, rq); + sched_fair_update_stop_tick(rq, p); + } + + /* + * Account for a task changing its policy or group. + * + * This routine is mostly called to set cfs_rq->curr field when a task + * migrates between groups/classes. + */ + static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) + { + struct sched_entity *se = &p->se;
for_each_sched_entity(se) { struct cfs_rq *cfs_rq = cfs_rq_of(se); @@@ -12884,12 -13257,14 +13265,14 @@@ /* ensure bandwidth has been allocated on our new cfs_rq */ account_cfs_rq_runtime(cfs_rq, 0); } + + __set_next_task_fair(rq, p, first); }
void init_cfs_rq(struct cfs_rq *cfs_rq) { cfs_rq->tasks_timeline = RB_ROOT_CACHED; - u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20))); + cfs_rq->min_vruntime = (u64)(-(1LL << 20)); #ifdef CONFIG_SMP raw_spin_lock_init(&cfs_rq->removed.lock); #endif @@@ -12991,28 -13366,35 +13374,35 @@@ void online_fair_sched_group(struct tas
void unregister_fair_sched_group(struct task_group *tg) { - unsigned long flags; - struct rq *rq; int cpu;
destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
for_each_possible_cpu(cpu) { - if (tg->se[cpu]) - remove_entity_load_avg(tg->se[cpu]); + struct cfs_rq *cfs_rq = tg->cfs_rq[cpu]; + struct sched_entity *se = tg->se[cpu]; + struct rq *rq = cpu_rq(cpu); + + if (se) { + if (se->sched_delayed) { + guard(rq_lock_irqsave)(rq); + if (se->sched_delayed) { + update_rq_clock(rq); + dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); + } + list_del_leaf_cfs_rq(cfs_rq); + } + remove_entity_load_avg(se); + }
/* * Only empty task groups can be destroyed; so we can speculatively * check on_list without danger of it being re-added. */ - if (!tg->cfs_rq[cpu]->on_list) - continue; - - rq = cpu_rq(cpu); - - raw_spin_rq_lock_irqsave(rq, flags); - list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); - raw_spin_rq_unlock_irqrestore(rq, flags); + if (cfs_rq->on_list) { + guard(rq_lock_irqsave)(rq); + list_del_leaf_cfs_rq(cfs_rq); + } } }
@@@ -13202,13 -13584,13 +13592,13 @@@ DEFINE_SCHED_CLASS(fair) =
.wakeup_preempt = check_preempt_wakeup_fair,
+ .pick_task = pick_task_fair, .pick_next_task = __pick_next_task_fair, .put_prev_task = put_prev_task_fair, .set_next_task = set_next_task_fair,
#ifdef CONFIG_SMP .balance = balance_fair, - .pick_task = pick_task_fair, .select_task_rq = select_task_rq_fair, .migrate_task_rq = migrate_task_rq_fair,
diff --combined kernel/signal.c index cc5d87cfa7c0a,6f3a5aa39b091..4344860ffcac7 --- a/kernel/signal.c +++ b/kernel/signal.c @@@ -618,20 -618,18 +618,18 @@@ static int __dequeue_signal(struct sigp }
/* - * Dequeue a signal and return the element to the caller, which is - * expected to free it. - * - * All callers have to hold the siglock. + * Try to dequeue a signal. If a deliverable signal is found fill in the + * caller provided siginfo and return the signal number. Otherwise return + * 0. */ - int dequeue_signal(struct task_struct *tsk, sigset_t *mask, - kernel_siginfo_t *info, enum pid_type *type) + int dequeue_signal(sigset_t *mask, kernel_siginfo_t *info, enum pid_type *type) { + struct task_struct *tsk = current; bool resched_timer = false; int signr;
- /* We only dequeue private signals from ourselves, we don't let - * signalfd steal them - */ + lockdep_assert_held(&tsk->sighand->siglock); + *type = PIDTYPE_PID; signr = __dequeue_signal(&tsk->pending, mask, info, &resched_timer); if (!signr) { @@@ -1940,10 -1938,11 +1938,11 @@@ struct sigqueue *sigqueue_alloc(void
void sigqueue_free(struct sigqueue *q) { - unsigned long flags; spinlock_t *lock = ¤t->sighand->siglock; + unsigned long flags;
- BUG_ON(!(q->flags & SIGQUEUE_PREALLOC)); + if (WARN_ON_ONCE(!(q->flags & SIGQUEUE_PREALLOC))) + return; /* * We must hold ->siglock while testing q->list * to serialize with collect_signal() or with @@@ -1971,7 -1970,10 +1970,10 @@@ int send_sigqueue(struct sigqueue *q, s unsigned long flags; int ret, result;
- BUG_ON(!(q->flags & SIGQUEUE_PREALLOC)); + if (WARN_ON_ONCE(!(q->flags & SIGQUEUE_PREALLOC))) + return 0; + if (WARN_ON_ONCE(q->info.si_code != SI_TIMER)) + return 0;
ret = -1; rcu_read_lock(); @@@ -2006,7 -2008,6 +2008,6 @@@ * If an SI_TIMER entry is already queue just increment * the overrun count. */ - BUG_ON(q->info.si_code != SI_TIMER); q->info.si_overrun++; result = TRACE_SIGNAL_ALREADY_PENDING; goto out; @@@ -2793,8 -2794,7 +2794,7 @@@ relock type = PIDTYPE_PID; signr = dequeue_synchronous_signal(&ksig->info); if (!signr) - signr = dequeue_signal(current, ¤t->blocked, - &ksig->info, &type); + signr = dequeue_signal(¤t->blocked, &ksig->info, &type);
if (!signr) break; /* will return 0 */ @@@ -3648,7 -3648,7 +3648,7 @@@ static int do_sigtimedwait(const sigset signotset(&mask);
spin_lock_irq(&tsk->sighand->siglock); - sig = dequeue_signal(tsk, &mask, info, &type); + sig = dequeue_signal(&mask, info, &type); if (!sig && timeout) { /* * None ready, temporarily unblock those we're interested @@@ -3667,7 -3667,7 +3667,7 @@@ spin_lock_irq(&tsk->sighand->siglock); __set_task_blocked(tsk, &tsk->real_blocked); sigemptyset(&tsk->real_blocked); - sig = dequeue_signal(tsk, &mask, info, &type); + sig = dequeue_signal(&mask, info, &type); } spin_unlock_irq(&tsk->sighand->siglock);
@@@ -3922,11 -3922,11 +3922,11 @@@ SYSCALL_DEFINE4(pidfd_send_signal, int return -EINVAL;
f = fdget(pidfd); - if (!f.file) + if (!fd_file(f)) return -EBADF;
/* Is this a pidfd? */ - pid = pidfd_to_pid(f.file); + pid = pidfd_to_pid(fd_file(f)); if (IS_ERR(pid)) { ret = PTR_ERR(pid); goto err; @@@ -3939,7 -3939,7 +3939,7 @@@ switch (flags) { case 0: /* Infer scope from the type of pidfd. */ - if (f.file->f_flags & PIDFD_THREAD) + if (fd_file(f)->f_flags & PIDFD_THREAD) type = PIDTYPE_PID; else type = PIDTYPE_TGID; diff --combined kernel/sys.c index a4be1e568ff5c,b7e096e1c3a13..4da31f28fda81 --- a/kernel/sys.c +++ b/kernel/sys.c @@@ -1916,10 -1916,10 +1916,10 @@@ static int prctl_set_mm_exe_file(struc int err;
exe = fdget(fd); - if (!exe.file) + if (!fd_file(exe)) return -EBADF;
- inode = file_inode(exe.file); + inode = file_inode(fd_file(exe));
/* * Because the original mm->exe_file points to executable file, make @@@ -1927,14 -1927,14 +1927,14 @@@ * overall picture. */ err = -EACCES; - if (!S_ISREG(inode->i_mode) || path_noexec(&exe.file->f_path)) + if (!S_ISREG(inode->i_mode) || path_noexec(&fd_file(exe)->f_path)) goto exit;
- err = file_permission(exe.file, MAY_EXEC); + err = file_permission(fd_file(exe), MAY_EXEC); if (err) goto exit;
- err = replace_mm_exe_file(mm, exe.file); + err = replace_mm_exe_file(mm, fd_file(exe)); exit: fdput(exe); return err; @@@ -2557,6 -2557,8 +2557,8 @@@ SYSCALL_DEFINE5(prctl, int, option, uns error = current->timer_slack_ns; break; case PR_SET_TIMERSLACK: + if (rt_or_dl_task_policy(current)) + break; if (arg2 <= 0) current->timer_slack_ns = current->default_timer_slack_ns; diff --combined kernel/trace/bpf_trace.c index b3283c250e273,ac0a01cc8634a..a582cd25ca876 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@@ -24,6 -24,7 +24,6 @@@ #include <linux/key.h> #include <linux/verification.h> #include <linux/namei.h> -#include <linux/fileattr.h>
#include <net/bpf_sk_storage.h>
@@@ -797,6 -798,29 +797,6 @@@ const struct bpf_func_proto bpf_task_pt .ret_btf_id = &bpf_task_pt_regs_ids[0], };
-BPF_CALL_2(bpf_current_task_under_cgroup, struct bpf_map *, map, u32, idx) -{ - struct bpf_array *array = container_of(map, struct bpf_array, map); - struct cgroup *cgrp; - - if (unlikely(idx >= array->map.max_entries)) - return -E2BIG; - - cgrp = READ_ONCE(array->ptrs[idx]); - if (unlikely(!cgrp)) - return -EAGAIN; - - return task_under_cgroup_hierarchy(current, cgrp); -} - -static const struct bpf_func_proto bpf_current_task_under_cgroup_proto = { - .func = bpf_current_task_under_cgroup, - .gpl_only = false, - .ret_type = RET_INTEGER, - .arg1_type = ARG_CONST_MAP_PTR, - .arg2_type = ARG_ANYTHING, -}; - struct send_signal_irq_work { struct irq_work irq_work; struct task_struct *task; @@@ -1202,8 -1226,7 +1202,8 @@@ static const struct bpf_func_proto bpf_ .ret_type = RET_INTEGER, .arg1_type = ARG_PTR_TO_CTX, .arg2_type = ARG_ANYTHING, - .arg3_type = ARG_PTR_TO_LONG, + .arg3_type = ARG_PTR_TO_FIXED_SIZE_MEM | MEM_UNINIT | MEM_ALIGNED, + .arg3_size = sizeof(u64), };
BPF_CALL_2(get_func_ret, void *, ctx, u64 *, value) @@@ -1219,8 -1242,7 +1219,8 @@@ static const struct bpf_func_proto bpf_ .func = get_func_ret, .ret_type = RET_INTEGER, .arg1_type = ARG_PTR_TO_CTX, - .arg2_type = ARG_PTR_TO_LONG, + .arg2_type = ARG_PTR_TO_FIXED_SIZE_MEM | MEM_UNINIT | MEM_ALIGNED, + .arg2_size = sizeof(u64), };
BPF_CALL_1(get_func_arg_cnt, void *, ctx) @@@ -1417,6 -1439,73 +1417,6 @@@ static int __init bpf_key_sig_kfuncs_in late_initcall(bpf_key_sig_kfuncs_init); #endif /* CONFIG_KEYS */
-/* filesystem kfuncs */ -__bpf_kfunc_start_defs(); - -/** - * bpf_get_file_xattr - get xattr of a file - * @file: file to get xattr from - * @name__str: name of the xattr - * @value_p: output buffer of the xattr value - * - * Get xattr *name__str* of *file* and store the output in *value_ptr*. - * - * For security reasons, only *name__str* with prefix "user." is allowed. - * - * Return: 0 on success, a negative value on error. - */ -__bpf_kfunc int bpf_get_file_xattr(struct file *file, const char *name__str, - struct bpf_dynptr *value_p) -{ - struct bpf_dynptr_kern *value_ptr = (struct bpf_dynptr_kern *)value_p; - struct dentry *dentry; - u32 value_len; - void *value; - int ret; - - if (strncmp(name__str, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN)) - return -EPERM; - - value_len = __bpf_dynptr_size(value_ptr); - value = __bpf_dynptr_data_rw(value_ptr, value_len); - if (!value) - return -EINVAL; - - dentry = file_dentry(file); - ret = inode_permission(&nop_mnt_idmap, dentry->d_inode, MAY_READ); - if (ret) - return ret; - return __vfs_getxattr(dentry, dentry->d_inode, name__str, value, value_len); -} - -__bpf_kfunc_end_defs(); - -BTF_KFUNCS_START(fs_kfunc_set_ids) -BTF_ID_FLAGS(func, bpf_get_file_xattr, KF_SLEEPABLE | KF_TRUSTED_ARGS) -BTF_KFUNCS_END(fs_kfunc_set_ids) - -static int bpf_get_file_xattr_filter(const struct bpf_prog *prog, u32 kfunc_id) -{ - if (!btf_id_set8_contains(&fs_kfunc_set_ids, kfunc_id)) - return 0; - - /* Only allow to attach from LSM hooks, to avoid recursion */ - return prog->type != BPF_PROG_TYPE_LSM ? -EACCES : 0; -} - -static const struct btf_kfunc_id_set bpf_fs_kfunc_set = { - .owner = THIS_MODULE, - .set = &fs_kfunc_set_ids, - .filter = bpf_get_file_xattr_filter, -}; - -static int __init bpf_fs_kfuncs_init(void) -{ - return register_btf_kfunc_id_set(BPF_PROG_TYPE_LSM, &bpf_fs_kfunc_set); -} - -late_initcall(bpf_fs_kfuncs_init); - static const struct bpf_func_proto * bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@@ -1459,6 -1548,8 +1459,6 @@@ return &bpf_get_numa_node_id_proto; case BPF_FUNC_perf_event_read: return &bpf_perf_event_read_proto; - case BPF_FUNC_current_task_under_cgroup: - return &bpf_current_task_under_cgroup_proto; case BPF_FUNC_get_prandom_u32: return &bpf_get_prandom_u32_proto; case BPF_FUNC_probe_write_user: @@@ -1487,8 -1578,6 +1487,8 @@@ return &bpf_cgrp_storage_get_proto; case BPF_FUNC_cgrp_storage_delete: return &bpf_cgrp_storage_delete_proto; + case BPF_FUNC_current_task_under_cgroup: + return &bpf_current_task_under_cgroup_proto; #endif case BPF_FUNC_send_signal: return &bpf_send_signal_proto; @@@ -1509,8 -1598,7 +1509,8 @@@ case BPF_FUNC_jiffies64: return &bpf_jiffies64_proto; case BPF_FUNC_get_task_stack: - return &bpf_get_task_stack_proto; + return prog->sleepable ? &bpf_get_task_stack_sleepable_proto + : &bpf_get_task_stack_proto; case BPF_FUNC_copy_from_user: return &bpf_copy_from_user_proto; case BPF_FUNC_copy_from_user_task: @@@ -1566,7 -1654,7 +1566,7 @@@ kprobe_prog_func_proto(enum bpf_func_i case BPF_FUNC_get_stackid: return &bpf_get_stackid_proto; case BPF_FUNC_get_stack: - return &bpf_get_stack_proto; + return prog->sleepable ? &bpf_get_stack_sleepable_proto : &bpf_get_stack_proto; #ifdef CONFIG_BPF_KPROBE_OVERRIDE case BPF_FUNC_override_return: return &bpf_override_return_proto; @@@ -3072,6 -3160,7 +3072,7 @@@ struct bpf_uprobe loff_t offset; unsigned long ref_ctr_offset; u64 cookie; + struct uprobe *uprobe; struct uprobe_consumer consumer; };
@@@ -3090,15 -3179,15 +3091,15 @@@ struct bpf_uprobe_multi_run_ctx struct bpf_uprobe *uprobe; };
- static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes, - u32 cnt) + static void bpf_uprobe_unregister(struct bpf_uprobe *uprobes, u32 cnt) { u32 i;
- for (i = 0; i < cnt; i++) { - uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset, - &uprobes[i].consumer); - } + for (i = 0; i < cnt; i++) + uprobe_unregister_nosync(uprobes[i].uprobe, &uprobes[i].consumer); + + if (cnt) + uprobe_unregister_sync(); }
static void bpf_uprobe_multi_link_release(struct bpf_link *link) @@@ -3106,7 -3195,7 +3107,7 @@@ struct bpf_uprobe_multi_link *umulti_link;
umulti_link = container_of(link, struct bpf_uprobe_multi_link, link); - bpf_uprobe_unregister(&umulti_link->path, umulti_link->uprobes, umulti_link->cnt); + bpf_uprobe_unregister(umulti_link->uprobes, umulti_link->cnt); if (umulti_link->task) put_task_struct(umulti_link->task); path_put(&umulti_link->path); @@@ -3210,7 -3299,7 +3211,7 @@@ static int uprobe_prog_run(struct bpf_u struct bpf_run_ctx *old_run_ctx; int err = 0;
- if (link->task && current->mm != link->task->mm) + if (link->task && !same_thread_group(current, link->task)) return 0;
if (sleepable) @@@ -3234,8 -3323,7 +3235,7 @@@ }
static bool - uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx, - struct mm_struct *mm) + uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm) { struct bpf_uprobe *uprobe;
@@@ -3392,22 -3480,26 +3392,26 @@@ int bpf_uprobe_multi_link_attach(const &bpf_uprobe_multi_link_lops, prog);
for (i = 0; i < cnt; i++) { - err = uprobe_register_refctr(d_real_inode(link->path.dentry), - uprobes[i].offset, - uprobes[i].ref_ctr_offset, - &uprobes[i].consumer); - if (err) { - bpf_uprobe_unregister(&path, uprobes, i); - goto error_free; + uprobes[i].uprobe = uprobe_register(d_real_inode(link->path.dentry), + uprobes[i].offset, + uprobes[i].ref_ctr_offset, + &uprobes[i].consumer); + if (IS_ERR(uprobes[i].uprobe)) { + err = PTR_ERR(uprobes[i].uprobe); + link->cnt = i; + goto error_unregister; } }
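For reference, a minimal sketch of the handle-based uprobe API that the hunks above switch to (illustrative only, not code from this commit; the inode, offsets and consumer below are placeholders):

static int attach_probe(struct inode *inode, loff_t offset, loff_t ref_ctr_offset,
                        struct uprobe_consumer *consumer, struct uprobe **handle)
{
        struct uprobe *u;

        /* uprobe_register() now returns the uprobe handle, or an ERR_PTR on failure. */
        u = uprobe_register(inode, offset, ref_ctr_offset, consumer);
        if (IS_ERR(u))
                return PTR_ERR(u);

        *handle = u;
        return 0;
}

static void detach_probe(struct uprobe *u, struct uprobe_consumer *consumer)
{
        /* Detach first; one synchronization then covers all consumers detached before it. */
        uprobe_unregister_nosync(u, consumer);
        uprobe_unregister_sync();
}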
err = bpf_link_prime(&link->link, &link_primer); if (err) - goto error_free; + goto error_unregister;
return bpf_link_settle(&link_primer);
+ error_unregister: + bpf_uprobe_unregister(uprobes, link->cnt); + error_free: kvfree(uprobes); kfree(link); diff --combined lib/Kconfig.debug index 59200dc5504cc,26354671b37df..29bcf83ae865e --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@@ -97,7 -97,7 +97,7 @@@ config BOOT_PRINTK_DELA using "boot_delay=N".
It is likely that you would also need to use "lpj=M" to preset - the "loops per jiffie" value. + the "loops per jiffy" value. See a previous boot log for the "lpj" value to use for your system, and then set "lpj=M" before setting "boot_delay=N". NOTE: Using this option may adversely affect SMP systems. @@@ -379,15 -379,13 +379,15 @@@ config DEBUG_INFO_BT depends on !DEBUG_INFO_SPLIT && !DEBUG_INFO_REDUCED depends on !GCC_PLUGIN_RANDSTRUCT || COMPILE_TEST depends on BPF_SYSCALL - depends on !DEBUG_INFO_DWARF5 || PAHOLE_VERSION >= 121 + depends on PAHOLE_VERSION >= 116 + depends on DEBUG_INFO_DWARF4 || PAHOLE_VERSION >= 121 # pahole uses elfutils, which does not have support for Hexagon relocations depends on !HEXAGON help Generate deduplicated BTF type information from DWARF debug info. - Turning this on expects presence of pahole tool, which will convert - DWARF type info into equivalent deduplicated BTF type info. + Turning this on requires pahole v1.16 or later (v1.21 or later to + support DWARF 5), which will convert DWARF type info into equivalent + deduplicated BTF type info.
config PAHOLE_HAS_SPLIT_BTF def_bool PAHOLE_VERSION >= 119 @@@ -573,21 -571,6 +573,21 @@@ config VMLINUX_MA pieces of code get eliminated with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION.
+config BUILTIN_MODULE_RANGES + bool "Generate address range information for builtin modules" + depends on !LTO + depends on VMLINUX_MAP + help + When modules are built into the kernel, there will be no module name + associated with its symbols in /proc/kallsyms. Tracers may want to + identify symbols by module name and symbol name regardless of whether + the module is configured as loadable or not. + + This option generates modules.builtin.ranges in the build tree with + offset ranges (per ELF section) for the module(s) they belong to. + It also records an anchor symbol to determine the load address of the + section. + config DEBUG_FORCE_WEAK_PER_CPU bool "Force weak per-cpu definitions" depends on DEBUG_KERNEL @@@ -1532,7 -1515,7 +1532,7 @@@ config LOCKDEP_BIT config LOCKDEP_CHAINS_BITS int "Bitsize for MAX_LOCKDEP_CHAINS" depends on LOCKDEP && !LOCKDEP_SMALL - range 10 30 + range 10 21 default 16 help Try increasing this value if you hit "BUG: MAX_LOCKDEP_CHAINS too low!" message. @@@ -2036,7 -2019,7 +2036,7 @@@ config FAULT_INJECTIO depends on DEBUG_KERNEL help Provide fault-injection framework. - For more details, see Documentation/fault-injection/. + For more details, see Documentation/dev-tools/fault-injection/.
config FAILSLAB bool "Fault-injection capability for kmalloc" @@@ -2190,6 -2173,14 +2190,14 @@@ config KCOV_IRQ_AREA_SIZ soft interrupts. This specifies the size of those areas in the number of unsigned long words.
+ config KCOV_SELFTEST + bool "Perform short selftests on boot" + depends on KCOV + help + Run short KCOV coverage collection selftests on boot. + On test failure, causes the kernel to panic. Recommended to be + enabled, ensuring critical functionality works as intended. + menuconfig RUNTIME_TESTING_MENU bool "Runtime Testing" default y @@@ -2242,7 -2233,7 +2250,7 @@@ config LKDT called lkdtm.
Documentation on how to use the module can be found in - Documentation/fault-injection/provoke-crashes.rst + Documentation/dev-tools/fault-injection/provoke-crashes.rst
config CPUMASK_KUNIT_TEST tristate "KUnit test for cpumask" if !KUNIT_ALL_TESTS @@@ -2297,16 -2288,6 +2305,16 @@@ config TEST_DIV6
If unsure, say N.
+config TEST_MULDIV64 + tristate "mul_u64_u64_div_u64() test" + depends on DEBUG_KERNEL || m + help + Enable this to turn on 'mul_u64_u64_div_u64()' function test. + This test is executed only once during system boot (so affects + only boot time), or at module load time. + + If unsure, say N. + config TEST_IOV_ITER tristate "Test iov_iter operation" if !KUNIT_ALL_TESTS depends on KUNIT @@@ -2643,7 -2624,6 +2651,7 @@@ config RESOURCE_KUNIT_TES tristate "KUnit test for resource API" if !KUNIT_ALL_TESTS depends on KUNIT default KUNIT_ALL_TESTS + select GET_FREE_REGION help This builds the resource API unit test. Tests the logic of API provided by resource.c and ioport.h. diff --combined mm/page-writeback.c index f5448311c89eb,7a04cb1918fd5..fcd4c1439cb9c --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@@ -418,7 -418,7 +418,7 @@@ static void domain_dirty_limits(struct bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
tsk = current; - if (rt_task(tsk)) { + if (rt_or_dl_task(tsk)) { bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; } @@@ -477,7 -477,7 +477,7 @@@ static unsigned long node_dirty_limit(s else dirty = vm_dirty_ratio * node_memory / 100;
- if (rt_task(tsk)) + if (rt_or_dl_task(tsk)) dirty += dirty / 4;
/* @@@ -2612,7 -2612,7 +2612,7 @@@ struct folio *writeback_iter(struct add
done: if (wbc->range_cyclic) - mapping->writeback_index = folio->index + folio_nr_pages(folio); + mapping->writeback_index = folio_next_index(folio); folio_batch_release(&wbc->fbatch); return NULL; } diff --combined mm/page_alloc.c index aebc4529d5fcd,0aefae4a26b20..0f33dab6d344f --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@@ -286,7 -286,9 +286,7 @@@ EXPORT_SYMBOL(nr_online_nodes) #endif
static bool page_contains_unaccepted(struct page *page, unsigned int order); -static void accept_page(struct page *page, unsigned int order); static bool cond_accept_memory(struct zone *zone, unsigned int order); -static inline bool has_unaccepted_memory(void); static bool __free_unaccepted(struct page *page);
int page_group_by_mobility_disabled __read_mostly; @@@ -320,11 -322,6 +320,11 @@@ static inline bool deferred_pages_enabl { return false; } + +static inline bool _deferred_grow_zone(struct zone *zone, unsigned int order) +{ + return false; +} #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
/* Return a pointer to the bitmap storing bits affecting a block of pages */ @@@ -961,9 -958,8 +961,9 @@@ static int free_tail_page_prepare(struc break; case 2: /* the second tail page: deferred_list overlaps ->mapping */ - if (unlikely(!list_empty(&folio->_deferred_list))) { - bad_page(page, "on deferred list"); + if (unlikely(!list_empty(&folio->_deferred_list) && + folio_test_partially_mapped(folio))) { + bad_page(page, "partially mapped folio on deferred list"); goto out; } break; @@@ -1091,11 -1087,8 +1091,11 @@@ __always_inline bool free_pages_prepare (page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; } } - if (PageMappingFlags(page)) + if (PageMappingFlags(page)) { + if (PageAnon(page)) + mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1); page->mapping = NULL; + } if (is_check_pages_enabled()) { if (free_page_is_bad(page)) bad++; @@@ -1206,39 -1199,17 +1206,39 @@@ static void free_pcppages_bulk(struct z spin_unlock_irqrestore(&zone->lock, flags); }
+/* Split a multi-block free page into its individual pageblocks. */ +static void split_large_buddy(struct zone *zone, struct page *page, + unsigned long pfn, int order, fpi_t fpi) +{ + unsigned long end = pfn + (1 << order); + + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order)); + /* Caller removed page from freelist, buddy info cleared! */ + VM_WARN_ON_ONCE(PageBuddy(page)); + + if (order > pageblock_order) + order = pageblock_order; + + while (pfn != end) { + int mt = get_pfnblock_migratetype(page, pfn); + + __free_one_page(page, pfn, zone, order, mt, fpi); + pfn += 1 << order; + page = pfn_to_page(pfn); + } +} + static void free_one_page(struct zone *zone, struct page *page, unsigned long pfn, unsigned int order, fpi_t fpi_flags) { unsigned long flags; - int migratetype;
spin_lock_irqsave(&zone->lock, flags); - migratetype = get_pfnblock_migratetype(page, pfn); - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); + split_large_buddy(zone, page, pfn, order, fpi_flags); spin_unlock_irqrestore(&zone->lock, flags); + + __count_vm_events(PGFREE, 1 << order); }
static void __free_pages_ok(struct page *page, unsigned int order, @@@ -1247,8 -1218,12 +1247,8 @@@ unsigned long pfn = page_to_pfn(page); struct zone *zone = page_zone(page);
- if (!free_pages_prepare(page, order)) - return; - - free_one_page(zone, page, pfn, order, fpi_flags); - - __count_vm_events(PGFREE, 1 << order); + if (free_pages_prepare(page, order)) + free_one_page(zone, page, pfn, order, fpi_flags); }
void __meminit __free_pages_core(struct page *page, unsigned int order, @@@ -1295,7 -1270,7 +1295,7 @@@ if (order == MAX_PAGE_ORDER && __free_unaccepted(page)) return;
- accept_page(page, order); + accept_memory(page_to_phys(page), PAGE_SIZE << order); }
/* @@@ -1371,11 -1346,11 +1371,11 @@@ struct page *__pageblock_pfn_to_page(un * * -- nyc */ -static inline void expand(struct zone *zone, struct page *page, - int low, int high, int migratetype) +static inline unsigned int expand(struct zone *zone, struct page *page, int low, + int high, int migratetype) { - unsigned long size = 1 << high; - unsigned long nr_added = 0; + unsigned int size = 1 << high; + unsigned int nr_added = 0;
while (high > low) { high--; @@@ -1395,19 -1370,7 +1395,19 @@@ set_buddy_order(&page[size], high); nr_added += size; } - account_freepages(zone, nr_added, migratetype); + + return nr_added; +} + +static __always_inline void page_del_and_expand(struct zone *zone, + struct page *page, int low, + int high, int migratetype) +{ + int nr_pages = 1 << high; + + __del_page_from_free_list(page, zone, high, migratetype); + nr_pages -= expand(zone, page, low, high, migratetype); + account_freepages(zone, -nr_pages, migratetype); }
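A worked example of the accounting that the new page_del_and_expand() helper above collapses into a single adjustment (numbers illustrative, not from the diff):

/*
 * page_del_and_expand(zone, page, 0, 3, mt) removes an order-3 block
 * (8 pages) from the free list; expand() splits off order-2, order-1 and
 * order-0 buddies and reports 4 + 2 + 1 = 7 pages put back, so the single
 * account_freepages() call adjusts the counters by -(8 - 7) = -1 page,
 * i.e. exactly the one order-0 page handed to the caller.
 */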
static void check_new_page_bad(struct page *page) @@@ -1577,9 -1540,8 +1577,9 @@@ struct page *__rmqueue_smallest(struct page = get_page_from_free_area(area, migratetype); if (!page) continue; - del_page_from_free_list(page, zone, current_order, migratetype); - expand(zone, page, order, current_order, migratetype); + + page_del_and_expand(zone, page, order, current_order, + migratetype); trace_mm_page_alloc_zone_locked(page, order, migratetype, pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES); @@@ -1738,6 -1700,27 +1738,6 @@@ static unsigned long find_large_buddy(u return start_pfn; }
-/* Split a multi-block free page into its individual pageblocks */ -static void split_large_buddy(struct zone *zone, struct page *page, - unsigned long pfn, int order) -{ - unsigned long end_pfn = pfn + (1 << order); - - VM_WARN_ON_ONCE(order <= pageblock_order); - VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1)); - - /* Caller removed page from freelist, buddy info cleared! */ - VM_WARN_ON_ONCE(PageBuddy(page)); - - while (pfn != end_pfn) { - int mt = get_pfnblock_migratetype(page, pfn); - - __free_one_page(page, pfn, zone, pageblock_order, mt, FPI_NONE); - pfn += pageblock_nr_pages; - page = pfn_to_page(pfn); - } -} - /** * move_freepages_block_isolate - move free pages in block for page isolation * @zone: the zone @@@ -1778,7 -1761,7 +1778,7 @@@ bool move_freepages_block_isolate(struc del_page_from_free_list(buddy, zone, order, get_pfnblock_migratetype(buddy, pfn)); set_pageblock_migratetype(page, migratetype); - split_large_buddy(zone, buddy, pfn, order); + split_large_buddy(zone, buddy, pfn, order, FPI_NONE); return true; }
@@@ -1789,7 -1772,7 +1789,7 @@@ del_page_from_free_list(page, zone, order, get_pfnblock_migratetype(page, pfn)); set_pageblock_migratetype(page, migratetype); - split_large_buddy(zone, page, pfn, order); + split_large_buddy(zone, page, pfn, order, FPI_NONE); return true; } move: @@@ -1909,12 -1892,9 +1909,12 @@@ steal_suitable_fallback(struct zone *zo
/* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { + unsigned int nr_added; + del_page_from_free_list(page, zone, current_order, block_type); change_pageblock_range(page, current_order, start_type); - expand(zone, page, order, current_order, start_type); + nr_added = expand(zone, page, order, current_order, start_type); + account_freepages(zone, nr_added, start_type); return page; }
@@@ -1967,7 -1947,8 +1967,7 @@@ }
single_page: - del_page_from_free_list(page, zone, current_order, block_type); - expand(zone, page, order, current_order, block_type); + page_del_and_expand(zone, page, order, current_order, block_type); return page; }
@@@ -2236,43 -2217,6 +2236,43 @@@ do_steal return page; }
+#ifdef CONFIG_CMA +/* + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok + * again without ALLOC_CMA to see if to use CMA first. + */ +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags) +{ + unsigned long watermark; + bool cma_first = false; + + watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); + /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */ + if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) { + /* + * Balance movable allocations between regular and CMA areas by + * allocating from CMA when over half of the zone's free memory + * is in the CMA area. + */ + cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) > + zone_page_state(zone, NR_FREE_PAGES) / 2); + } else { + /* + * watermark failed means UNMOVABLE & RECLAIMBLE is not enough + * now, we should use cma first to keep them stay around the + * corresponding watermark + */ + cma_first = true; + } + return cma_first; +} +#else +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags) +{ + return false; +} +#endif /* * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. @@@ -2286,11 -2230,12 +2286,11 @@@ __rmqueue(struct zone *zone, unsigned i if (IS_ENABLED(CONFIG_CMA)) { /* * Balance movable allocations between regular and CMA areas by - * allocating from CMA when over half of the zone's free memory - * is in the CMA area. + * allocating from CMA base on judging zone_watermark_ok again + * to see if the latest check got pass via the help of CMA */ if (alloc_flags & ALLOC_CMA && - zone_page_state(zone, NR_FREE_CMA_PAGES) > - zone_page_state(zone, NR_FREE_PAGES) / 2) { + use_cma_first(zone, order, alloc_flags)) { page = __rmqueue_cma_fallback(zone, order); if (page) return page; @@@ -2819,7 -2764,7 +2819,7 @@@ void split_page(struct page *page, unsi for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i); split_page_owner(page, order, 0); - pgalloc_tag_split(page, 1 << order); + pgalloc_tag_split(page_folio(page), order, 0); split_page_memcg(page, order, 0); } EXPORT_SYMBOL_GPL(split_page); @@@ -3088,6 -3033,12 +3088,6 @@@ struct page *rmqueue(struct zone *prefe { struct page *page;
- /* - * We most definitely don't want callers attempting to - * allocate greater than order-1 page units with __GFP_NOFAIL. - */ - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); - if (likely(pcp_allowed_order(order))) { page = rmqueue_pcplist(preferred_zone, zone, order, migratetype, alloc_flags); @@@ -3406,7 -3357,7 +3406,7 @@@ retry }
if (no_fallback && nr_online_nodes > 1 && - zone != ac->preferred_zoneref->zone) { + zone != zonelist_zone(ac->preferred_zoneref)) { int local_nid;
/* @@@ -3414,7 -3365,7 +3414,7 @@@ * fragmenting fallbacks. Locality is more important * than fragmentation avoidance. */ - local_nid = zone_to_nid(ac->preferred_zoneref->zone); + local_nid = zonelist_node_idx(ac->preferred_zoneref); if (zone_to_nid(zone) != local_nid) { alloc_flags &= ~ALLOC_NOFRAGMENT; goto retry; @@@ -3451,6 -3402,7 +3451,6 @@@ check_alloc_wmark if (cond_accept_memory(zone, order)) goto try_this_zone;
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* * Watermark failed for this zone, but see if we can * grow this zone if it contains deferred pages. @@@ -3459,13 -3411,14 +3459,13 @@@ if (_deferred_grow_zone(zone, order)) goto try_this_zone; } -#endif /* Checked here to keep the fast path fast */ BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); if (alloc_flags & ALLOC_NO_WATERMARKS) goto try_this_zone;
if (!node_reclaim_enabled() || - !zone_allows_reclaim(ac->preferred_zoneref->zone, zone)) + !zone_allows_reclaim(zonelist_zone(ac->preferred_zoneref), zone)) continue;
ret = node_reclaim(zone->zone_pgdat, gfp_mask, order); @@@ -3487,7 -3440,7 +3487,7 @@@ }
try_this_zone: - page = rmqueue(ac->preferred_zoneref->zone, zone, order, + page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order, gfp_mask, alloc_flags, ac->migratetype); if (page) { prep_new_page(page, order, gfp_mask, alloc_flags); @@@ -3504,11 -3457,13 +3504,11 @@@ if (cond_accept_memory(zone, order)) goto try_this_zone;
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* Try again if zone has deferred pages */ if (deferred_pages_enabled()) { if (_deferred_grow_zone(zone, order)) goto try_this_zone; } -#endif } }
@@@ -4049,7 -4004,7 +4049,7 @@@ gfp_to_alloc_flags(gfp_t gfp_mask, unsi */ if (alloc_flags & ALLOC_MIN_RESERVE) alloc_flags &= ~ALLOC_CPUSET; - } else if (unlikely(rt_task(current)) && in_task()) + } else if (unlikely(rt_or_dl_task(current)) && in_task()) alloc_flags |= ALLOC_MIN_RESERVE;
alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags); @@@ -4145,11 -4100,6 +4145,11 @@@ should_reclaim_retry(gfp_t gfp_mask, un unsigned long min_wmark = min_wmark_pages(zone); bool wmark;
+ if (cpusets_enabled() && + (alloc_flags & ALLOC_CPUSET) && + !__cpuset_zone_allowed(zone, gfp_mask)) + continue; + available = reclaimable = zone_reclaimable_pages(zone); available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@@ -4225,7 -4175,6 +4225,7 @@@ __alloc_pages_slowpath(gfp_t gfp_mask, { bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM; bool can_compact = gfp_compaction_allowed(gfp_mask); + bool nofail = gfp_mask & __GFP_NOFAIL; const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER; struct page *page = NULL; unsigned int alloc_flags; @@@ -4238,25 -4187,6 +4238,25 @@@ unsigned int zonelist_iter_cookie; int reserve_flags;
+ if (unlikely(nofail)) { + /* + * We most definitely don't want callers attempting to + * allocate greater than order-1 page units with __GFP_NOFAIL. + */ + WARN_ON_ONCE(order > 1); + /* + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, + * otherwise, we may result in lockup. + */ + WARN_ON_ONCE(!can_direct_reclaim); + /* + * PF_MEMALLOC request from this context is rather bizarre + * because we cannot reclaim anything and only can loop waiting + * for somebody to do a work for us. + */ + WARN_ON_ONCE(current->flags & PF_MEMALLOC); + } + restart: compaction_retries = 0; no_progress_loops = 0; @@@ -4279,7 -4209,7 +4279,7 @@@ */ ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, ac->nodemask); - if (!ac->preferred_zoneref->zone) + if (!zonelist_zone(ac->preferred_zoneref)) goto nopage;
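A minimal sketch of what the relocated __GFP_NOFAIL checks above expect from callers (illustrative only, not code from this commit): the request should be order <= 1, must allow direct reclaim, and should not originate from a PF_MEMALLOC context.

static struct page *grab_page_nofail(void)
{
        /* Conforms to the checks: blockable and order <= 1. */
        return alloc_pages(GFP_KERNEL | __GFP_NOFAIL, 0);

        /*
         * By contrast, each of these would now trip a WARN_ON_ONCE() in
         * __alloc_pages_slowpath():
         *
         *   alloc_pages(GFP_NOWAIT | __GFP_NOFAIL, 0);   no __GFP_DIRECT_RECLAIM
         *   alloc_pages(GFP_KERNEL | __GFP_NOFAIL, 3);   order > 1
         */
}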
/* @@@ -4291,7 -4221,7 +4291,7 @@@ struct zoneref *z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, &cpuset_current_mems_allowed); - if (!z->zone) + if (!zonelist_zone(z)) goto nopage; }
@@@ -4474,15 -4404,29 +4474,15 @@@ nopage * Make sure that __GFP_NOFAIL request doesn't leak out and make sure * we always retry */ - if (gfp_mask & __GFP_NOFAIL) { + if (unlikely(nofail)) { /* - * All existing users of the __GFP_NOFAIL are blockable, so warn - * of any new users that actually require GFP_NOWAIT + * Lacking direct_reclaim we can't do anything to reclaim memory, + * we disregard these unreasonable nofail requests and still + * return NULL */ - if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) + if (!can_direct_reclaim) goto fail;
- /* - * PF_MEMALLOC request from this context is rather bizarre - * because we cannot reclaim anything and only can loop waiting - * for somebody to do a work for us - */ - WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask); - - /* - * non failing costly orders are a hard requirement which we - * are not prepared for much so let's warn about these users - * so that we can identify them and convert them to something - * else. - */ - WARN_ON_ONCE_GFP(costly_order, gfp_mask); - /* * Help non-failing allocations by giving some access to memory * reserves normally used for high priority non-blocking @@@ -4634,28 -4578,17 +4634,28 @@@ unsigned long alloc_pages_bulk_noprof(g continue; }
- if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone && - zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) { + if (nr_online_nodes > 1 && zone != zonelist_zone(ac.preferred_zoneref) && + zone_to_nid(zone) != zonelist_node_idx(ac.preferred_zoneref)) { goto failed; }
+ cond_accept_memory(zone, 0); +retry_this_zone: mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages; if (zone_watermark_fast(zone, 0, mark, zonelist_zone_idx(ac.preferred_zoneref), alloc_flags, gfp)) { break; } + + if (cond_accept_memory(zone, 0)) + goto retry_this_zone; + + /* Try again if zone has deferred pages */ + if (deferred_pages_enabled()) { + if (_deferred_grow_zone(zone, 0)) + goto retry_this_zone; + } }
/* @@@ -4705,7 -4638,7 +4705,7 @@@ pcp_trylock_finish(UP_flags);
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account); - zone_statistics(ac.preferred_zoneref->zone, zone, nr_account); + zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
out: return nr_populated; @@@ -4763,7 -4696,7 +4763,7 @@@ struct page *__alloc_pages_noprof(gfp_ * Forbid the first pass from falling back to types that fragment * memory until all local zones are considered. */ - alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp); + alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);
/* First allocation attempt */ page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac); @@@ -5017,7 -4950,7 +5017,7 @@@ static void *make_alloc_exact(unsigned struct page *last = page + nr;
split_page_owner(page, order, 0); - pgalloc_tag_split(page, 1 << order); + pgalloc_tag_split(page_folio(page), order, 0); split_page_memcg(page, order, 0); while (page < --last) set_page_refcounted(last); @@@ -5368,7 -5301,7 +5368,7 @@@ int local_memory_node(int node z = first_zones_zonelist(node_zonelist(node, GFP_KERNEL), gfp_zone(GFP_KERNEL), NULL); - return zone_to_nid(z->zone); + return zonelist_node_idx(z); } #endif
@@@ -6500,31 -6433,6 +6500,31 @@@ int __alloc_contig_migrate_range(struc return (ret < 0) ? ret : 0; }
+static void split_free_pages(struct list_head *list) +{ + int order; + + for (order = 0; order < NR_PAGE_ORDERS; order++) { + struct page *page, *next; + int nr_pages = 1 << order; + + list_for_each_entry_safe(page, next, &list[order], lru) { + int i; + + post_alloc_hook(page, order, __GFP_MOVABLE); + if (!order) + continue; + + split_page(page, order); + + /* Add all subpages to the order-0 head, in sequence. */ + list_del(&page->lru); + for (i = 0; i < nr_pages; i++) + list_add_tail(&page[i].lru, &list[0]); + } + } +} + /** * alloc_contig_range() -- tries to allocate given range of pages * @start: start PFN to allocate @@@ -6637,25 -6545,12 +6637,25 @@@ int alloc_contig_range_noprof(unsigned goto done; }
- /* Free head and tail (if any) */ - if (start != outer_start) - free_contig_range(outer_start, start - outer_start); - if (end != outer_end) - free_contig_range(end, outer_end - end); + if (!(gfp_mask & __GFP_COMP)) { + split_free_pages(cc.freepages);
+ /* Free head and tail (if any) */ + if (start != outer_start) + free_contig_range(outer_start, start - outer_start); + if (end != outer_end) + free_contig_range(end, outer_end - end); + } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { + struct page *head = pfn_to_page(start); + int order = ilog2(end - start); + + check_new_pages(head, order); + prep_new_page(head, order, gfp_mask, 0); + } else { + ret = -EINVAL; + WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n", + start, end, outer_start, outer_end); + } done: undo_isolate_page_range(start, end, migratetype); return ret; @@@ -6764,18 -6659,6 +6764,18 @@@ struct page *alloc_contig_pages_noprof( void free_contig_range(unsigned long pfn, unsigned long nr_pages) { unsigned long count = 0; + struct folio *folio = pfn_folio(pfn); + + if (folio_test_large(folio)) { + int expected = folio_nr_pages(folio); + + if (nr_pages == expected) + folio_put(folio); + else + WARN(true, "PFN %lu: nr_pages %lu != expected %d\n", + pfn, nr_pages, expected); + return; + }
for (; nr_pages--; pfn++) { struct page *page = pfn_to_page(pfn); @@@ -7044,50 -6927,23 +7044,50 @@@ early_param("accept_memory", accept_mem static bool page_contains_unaccepted(struct page *page, unsigned int order) { phys_addr_t start = page_to_phys(page); - phys_addr_t end = start + (PAGE_SIZE << order);
- return range_contains_unaccepted_memory(start, end); + return range_contains_unaccepted_memory(start, PAGE_SIZE << order); }
-static void accept_page(struct page *page, unsigned int order) +static void __accept_page(struct zone *zone, unsigned long *flags, + struct page *page) { - phys_addr_t start = page_to_phys(page); + bool last; + + list_del(&page->lru); + last = list_empty(&zone->unaccepted_pages); + + account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); + __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES); + __ClearPageUnaccepted(page); + spin_unlock_irqrestore(&zone->lock, *flags); + + accept_memory(page_to_phys(page), PAGE_SIZE << MAX_PAGE_ORDER); + + __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL);
- accept_memory(start, start + (PAGE_SIZE << order)); + if (last) + static_branch_dec(&zones_with_unaccepted_pages); +} + +void accept_page(struct page *page) +{ + struct zone *zone = page_zone(page); + unsigned long flags; + + spin_lock_irqsave(&zone->lock, flags); + if (!PageUnaccepted(page)) { + spin_unlock_irqrestore(&zone->lock, flags); + return; + } + + /* Unlocks zone->lock */ + __accept_page(zone, &flags, page); }
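The rework above gives accept_page() a page-based signature that other mm code can call directly. A minimal sketch of a caller, assuming the matching declaration lands in a shared header (not shown in this hunk):

static void ensure_block_accepted(struct page *page)
{
        /*
         * accept_page() takes zone->lock itself and returns early when the
         * PageUnaccepted flag is already clear, so the caller needs no
         * locking or flag check of its own.
         */
        accept_page(page);
}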
static bool try_to_accept_memory_one(struct zone *zone) { unsigned long flags; struct page *page; - bool last;
spin_lock_irqsave(&zone->lock, flags); page = list_first_entry_or_null(&zone->unaccepted_pages, @@@ -7097,17 -6953,23 +7097,17 @@@ return false; }
- list_del(&page->lru); - last = list_empty(&zone->unaccepted_pages); - - account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); - __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES); - spin_unlock_irqrestore(&zone->lock, flags); - - accept_page(page, MAX_PAGE_ORDER); - - __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL); - - if (last) - static_branch_dec(&zones_with_unaccepted_pages); + /* Unlocks zone->lock */ + __accept_page(zone, &flags, page);
return true; }
+static inline bool has_unaccepted_memory(void) +{ + return static_branch_unlikely(&zones_with_unaccepted_pages); +} + static bool cond_accept_memory(struct zone *zone, unsigned int order) { long to_accept; @@@ -7119,8 -6981,8 +7119,8 @@@ if (list_empty(&zone->unaccepted_pages)) return false;
- /* How much to accept to get to high watermark? */ - to_accept = high_wmark_pages(zone) - + /* How much to accept to get to promo watermark? */ + to_accept = promo_wmark_pages(zone) - (zone_page_state(zone, NR_FREE_PAGES) - __zone_watermark_unusable_free(zone, order, 0) - zone_page_state(zone, NR_UNACCEPTED)); @@@ -7135,6 -6997,11 +7135,6 @@@ return ret; }
-static inline bool has_unaccepted_memory(void) -{ - return static_branch_unlikely(&zones_with_unaccepted_pages); -} - static bool __free_unaccepted(struct page *page) { struct zone *zone = page_zone(page); @@@ -7149,7 -7016,6 +7149,7 @@@ list_add_tail(&page->lru, &zone->unaccepted_pages); account_freepages(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); __mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES); + __SetPageUnaccepted(page); spin_unlock_irqrestore(&zone->lock, flags);
if (first) @@@ -7165,11 -7031,20 +7165,11 @@@ static bool page_contains_unaccepted(st return false; }
-static void accept_page(struct page *page, unsigned int order) -{ -} - static bool cond_accept_memory(struct zone *zone, unsigned int order) { return false; }
-static inline bool has_unaccepted_memory(void) -{ - return false; -} - static bool __free_unaccepted(struct page *page) { BUILD_BUG(); diff --combined net/core/dev.c index 1e740faf9e783,e4d5e9bdd09e2..cd479f5f22f61 --- a/net/core/dev.c +++ b/net/core/dev.c @@@ -158,10 -158,8 +158,10 @@@ #include <net/page_pool/types.h> #include <net/page_pool/helpers.h> #include <net/rps.h> +#include <linux/phy_link_topology.h>
#include "dev.h" +#include "devmem.h" #include "net-sysfs.h"
static DEFINE_SPINLOCK(ptype_lock); @@@ -3312,10 -3310,6 +3312,10 @@@ int skb_checksum_help(struct sk_buff *s return -EINVAL; }
+ if (!skb_frags_readable(skb)) { + return -EFAULT; + } + /* Before computing a checksum, we should make sure no frag could * be modified by an external entity : checksum could be wrong. */ @@@ -3392,7 -3386,6 +3392,7 @@@ int skb_crc32c_csum_help(struct sk_buf out: return ret; } +EXPORT_SYMBOL(skb_crc32c_csum_help);
__be16 skb_network_protocol(struct sk_buff *skb, int *depth) { @@@ -3438,9 -3431,8 +3438,9 @@@ static int illegal_highdma(struct net_d if (!(dev->features & NETIF_F_HIGHDMA)) { for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = skb_frag_page(frag);
- if (PageHighMem(skb_frag_page(frag))) + if (page && PageHighMem(page)) return 1; } } @@@ -3713,7 -3705,7 +3713,7 @@@ struct sk_buff *validate_xmit_skb_list( next = skb->next; skb_mark_not_on_list(skb);
- /* in case skb wont be segmented, point to itself */ + /* in case skb won't be segmented, point to itself */ skb->prev = skb;
skb = validate_xmit_skb(skb, dev, again); @@@ -4253,6 -4245,13 +4253,6 @@@ u16 dev_pick_tx_zero(struct net_device } EXPORT_SYMBOL(dev_pick_tx_zero);
-u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb, - struct net_device *sb_dev) -{ - return (u16)raw_smp_processor_id() % dev->real_num_tx_queues; -} -EXPORT_SYMBOL(dev_pick_tx_cpu_id); - u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb, struct net_device *sb_dev) { @@@ -5249,7 -5248,7 +5249,7 @@@ int netif_rx(struct sk_buff *skb } EXPORT_SYMBOL(netif_rx);
- static __latent_entropy void net_tx_action(struct softirq_action *h) + static __latent_entropy void net_tx_action(void) { struct softnet_data *sd = this_cpu_ptr(&softnet_data);
@@@ -5726,9 -5725,10 +5726,9 @@@ static void __netif_receive_skb_list_co struct packet_type *pt_curr = NULL; /* Current (common) orig_dev of sublist */ struct net_device *od_curr = NULL; - struct list_head sublist; struct sk_buff *skb, *next; + LIST_HEAD(sublist);
- INIT_LIST_HEAD(&sublist); list_for_each_entry_safe(skb, next, head, list) { struct net_device *orig_dev = skb->dev; struct packet_type *pt_prev = NULL; @@@ -5866,8 -5866,9 +5866,8 @@@ static int netif_receive_skb_internal(s void netif_receive_skb_list_internal(struct list_head *head) { struct sk_buff *skb, *next; - struct list_head sublist; + LIST_HEAD(sublist);
- INIT_LIST_HEAD(&sublist); list_for_each_entry_safe(skb, next, head, list) { net_timestamp_check(READ_ONCE(net_hotdata.tstamp_prequeue), skb); @@@ -6920,7 -6921,7 +6920,7 @@@ static int napi_threaded_poll(void *dat return 0; }
- static __latent_entropy void net_rx_action(struct softirq_action *h) + static __latent_entropy void net_rx_action(void) { struct softnet_data *sd = this_cpu_ptr(&softnet_data); unsigned long time_limit = jiffies + @@@ -9271,7 -9272,7 +9271,7 @@@ EXPORT_SYMBOL(netdev_port_same_parent_i */ int dev_change_proto_down(struct net_device *dev, bool proto_down) { - if (!(dev->priv_flags & IFF_CHANGE_PROTO_DOWN)) + if (!dev->change_proto_down) return -EOPNOTSUPP; if (!netif_device_present(dev)) return -ENODEV; @@@ -9368,20 -9369,6 +9368,20 @@@ u8 dev_xdp_prog_count(struct net_devic } EXPORT_SYMBOL_GPL(dev_xdp_prog_count);
+int dev_xdp_propagate(struct net_device *dev, struct netdev_bpf *bpf) +{ + if (!dev->netdev_ops->ndo_bpf) + return -EOPNOTSUPP; + + if (dev_get_min_mp_channel_count(dev)) { + NL_SET_ERR_MSG(bpf->extack, "unable to propagate XDP to device using memory provider"); + return -EBUSY; + } + + return dev->netdev_ops->ndo_bpf(dev, bpf); +} +EXPORT_SYMBOL_GPL(dev_xdp_propagate); + u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode) { struct bpf_prog *prog = dev_xdp_prog(dev, mode); @@@ -9410,11 -9397,6 +9410,11 @@@ static int dev_xdp_install(struct net_d struct netdev_bpf xdp; int err;
+ if (dev_get_min_mp_channel_count(dev)) { + NL_SET_ERR_MSG(extack, "unable to install XDP to device using memory provider"); + return -EBUSY; + } + memset(&xdp, 0, sizeof(xdp)); xdp.command = mode == XDP_MODE_HW ? XDP_SETUP_PROG_HW : XDP_SETUP_PROG; xdp.extack = extack; @@@ -9839,20 -9821,6 +9839,20 @@@ err_out return err; }
+u32 dev_get_min_mp_channel_count(const struct net_device *dev) +{ + int i; + + ASSERT_RTNL(); + + for (i = dev->real_num_rx_queues - 1; i >= 0; i--) + if (dev->_rx[i].mp_params.mp_priv) + /* The channel count is the idx plus 1. */ + return i + 1; + + return 0; +} + /** * dev_index_reserve() - allocate an ifindex in a namespace * @net: the applicable net namespace @@@ -10353,17 -10321,6 +10353,17 @@@ static void netdev_do_free_pcpu_stats(s } }
+static void netdev_free_phy_link_topology(struct net_device *dev) +{ + struct phy_link_topology *topo = dev->link_topo; + + if (IS_ENABLED(CONFIG_PHYLIB) && topo) { + xa_destroy(&topo->phys); + kfree(topo); + dev->link_topo = NULL; + } +} + /** * register_netdevice() - register a network device * @dev: device to register @@@ -10911,7 -10868,7 +10911,7 @@@ noinline void netdev_core_stats_inc(str return; }
- field = (__force unsigned long __percpu *)((__force void *)p + offset); + field = (unsigned long __percpu *)((void __percpu *)p + offset); this_cpu_inc(*field); } EXPORT_SYMBOL_GPL(netdev_core_stats_inc); @@@ -11142,7 -11099,6 +11142,7 @@@ struct net_device *alloc_netdev_mqs(in #ifdef CONFIG_NET_SCHED hash_init(dev->qdisc_hash); #endif + dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM; setup(dev);
@@@ -11164,7 -11120,7 +11164,7 @@@ if (!dev->ethtool) goto free_all;
- strcpy(dev->name, name); + strscpy(dev->name, name); dev->name_assign_type = name_assign_type; dev->group = INIT_NETDEV_GROUP; if (!dev->ethtool_ops) @@@ -11235,8 -11191,6 +11235,8 @@@ void free_netdev(struct net_device *dev free_percpu(dev->xdp_bulkq); dev->xdp_bulkq = NULL;
+ netdev_free_phy_link_topology(dev); + /* Compatibility with error handling in drivers */ if (dev->reg_state == NETREG_UNINITIALIZED || dev->reg_state == NETREG_DUMMY) { @@@ -11389,7 -11343,6 +11389,7 @@@ void unregister_netdevice_many_notify(s dev_tcx_uninstall(dev); dev_xdp_uninstall(dev); bpf_dev_bound_netdev_unregister(dev); + dev_dmabuf_uninstall(dev);
netdev_offload_xstats_disable_all(dev);
@@@ -11454,7 -11407,7 +11454,7 @@@ * @head: list of devices * * Note: As most callers use a stack allocated list_head, - * we force a list_del() to make sure stack wont be corrupted later. + * we force a list_del() to make sure stack won't be corrupted later. */ void unregister_netdevice_many(struct list_head *head) { @@@ -11509,10 -11462,10 +11509,10 @@@ int __dev_change_net_namespace(struct n
/* Don't allow namespace local devices to be moved. */ err = -EINVAL; - if (dev->features & NETIF_F_NETNS_LOCAL) + if (dev->netns_local) goto out;
- /* Ensure the device has been registrered */ + /* Ensure the device has been registered */ if (dev->reg_state != NETREG_REGISTERED) goto out;
@@@ -11891,7 -11844,7 +11891,7 @@@ static void __net_exit default_device_e char fb_name[IFNAMSIZ];
/* Ignore unmoveable devices (i.e. loopback) */ - if (dev->features & NETIF_F_NETNS_LOCAL) + if (dev->netns_local) continue;
/* Leave virtual devices for the generic cleanup */ @@@ -11952,7 -11905,7 +11952,7 @@@ static struct pernet_operations __net_i static void __init net_dev_struct_check(void) { /* TX read-mostly hotpath */ - CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, priv_flags); + CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, priv_flags_fast); CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, netdev_ops); CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, header_ops); CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, _tx); diff --combined tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 9649e7f09fc90,1fc16657cf425..8835761d9a126 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@@ -17,7 -17,6 +17,7 @@@ #include <linux/in.h> #include <linux/in6.h> #include <linux/un.h> +#include <linux/filter.h> #include <net/sock.h> #include <linux/namei.h> #include "bpf_testmod.h" @@@ -142,12 -141,13 +142,12 @@@ bpf_testmod_test_mod_kfunc(int i
__bpf_kfunc int bpf_iter_testmod_seq_new(struct bpf_iter_testmod_seq *it, s64 value, int cnt) { - if (cnt < 0) { - it->cnt = 0; + it->cnt = cnt; + + if (cnt < 0) return -EINVAL; - }
it->value = value; - it->cnt = cnt;
return 0; } @@@ -162,14 -162,6 +162,14 @@@ __bpf_kfunc s64 *bpf_iter_testmod_seq_n return &it->value; }
+__bpf_kfunc s64 bpf_iter_testmod_seq_value(int val, struct bpf_iter_testmod_seq* it__iter) +{ + if (it__iter->cnt < 0) + return 0; + + return val + it__iter->value; +} + __bpf_kfunc void bpf_iter_testmod_seq_destroy(struct bpf_iter_testmod_seq *it) { it->cnt = 0; @@@ -184,36 -176,6 +184,36 @@@ __bpf_kfunc void bpf_kfunc_dynptr_test( { }
+__bpf_kfunc struct sk_buff *bpf_kfunc_nested_acquire_nonzero_offset_test(struct sk_buff_head *ptr) +{ + return NULL; +} + +__bpf_kfunc struct sk_buff *bpf_kfunc_nested_acquire_zero_offset_test(struct sock_common *ptr) +{ + return NULL; +} + +__bpf_kfunc void bpf_kfunc_nested_release_test(struct sk_buff *ptr) +{ +} + +__bpf_kfunc void bpf_kfunc_trusted_vma_test(struct vm_area_struct *ptr) +{ +} + +__bpf_kfunc void bpf_kfunc_trusted_task_test(struct task_struct *ptr) +{ +} + +__bpf_kfunc void bpf_kfunc_trusted_num_test(int *ptr) +{ +} + +__bpf_kfunc void bpf_kfunc_rcu_task_test(struct task_struct *ptr) +{ +} + __bpf_kfunc struct bpf_testmod_ctx * bpf_testmod_ctx_create(int *err) { @@@ -394,8 -356,6 +394,8 @@@ bpf_testmod_test_read(struct file *file if (bpf_testmod_loop_test(101) > 100) trace_bpf_testmod_test_read(current, &ctx);
+ trace_bpf_testmod_test_nullable_bare(NULL); + /* Magic number to enable writable tp */ if (len == 64) { struct bpf_testmod_test_writable_ctx writable = { @@@ -472,7 -432,7 +472,7 @@@ uprobe_ret_handler(struct uprobe_consum
struct testmod_uprobe { struct path path; - loff_t offset; + struct uprobe *uprobe; struct uprobe_consumer consumer; };
@@@ -486,25 -446,25 +486,25 @@@ static int testmod_register_uprobe(loff { int err = -EBUSY;
- if (uprobe.offset) + if (uprobe.uprobe) return -EBUSY;
mutex_lock(&testmod_uprobe_mutex);
- if (uprobe.offset) + if (uprobe.uprobe) goto out;
err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, &uprobe.path); if (err) goto out;
- err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry), - offset, 0, &uprobe.consumer); - if (err) + uprobe.uprobe = uprobe_register(d_real_inode(uprobe.path.dentry), + offset, 0, &uprobe.consumer); + if (IS_ERR(uprobe.uprobe)) { + err = PTR_ERR(uprobe.uprobe); path_put(&uprobe.path); - else - uprobe.offset = offset; - + uprobe.uprobe = NULL; + } out: mutex_unlock(&testmod_uprobe_mutex); return err; @@@ -514,10 -474,11 +514,11 @@@ static void testmod_unregister_uprobe(v { mutex_lock(&testmod_uprobe_mutex);
- if (uprobe.offset) { - uprobe_unregister(d_real_inode(uprobe.path.dentry), - uprobe.offset, &uprobe.consumer); - uprobe.offset = 0; + if (uprobe.uprobe) { + uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer); + uprobe_unregister_sync(); + path_put(&uprobe.path); + uprobe.uprobe = NULL; }
mutex_unlock(&testmod_uprobe_mutex); @@@ -571,16 -532,8 +572,16 @@@ BTF_KFUNCS_START(bpf_testmod_common_kfu BTF_ID_FLAGS(func, bpf_iter_testmod_seq_new, KF_ITER_NEW) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_destroy, KF_ITER_DESTROY) +BTF_ID_FLAGS(func, bpf_iter_testmod_seq_value) BTF_ID_FLAGS(func, bpf_kfunc_common_test) BTF_ID_FLAGS(func, bpf_kfunc_dynptr_test) +BTF_ID_FLAGS(func, bpf_kfunc_nested_acquire_nonzero_offset_test, KF_ACQUIRE) +BTF_ID_FLAGS(func, bpf_kfunc_nested_acquire_zero_offset_test, KF_ACQUIRE) +BTF_ID_FLAGS(func, bpf_kfunc_nested_release_test, KF_RELEASE) +BTF_ID_FLAGS(func, bpf_kfunc_trusted_vma_test, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_kfunc_trusted_task_test, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_kfunc_trusted_num_test, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_kfunc_rcu_task_test, KF_RCU) BTF_ID_FLAGS(func, bpf_testmod_ctx_create, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_testmod_ctx_release, KF_RELEASE) BTF_KFUNCS_END(bpf_testmod_common_kfunc_ids) @@@ -968,51 -921,6 +969,51 @@@ out return err; }
+static DEFINE_MUTEX(st_ops_mutex); +static struct bpf_testmod_st_ops *st_ops; + +__bpf_kfunc int bpf_kfunc_st_ops_test_prologue(struct st_ops_args *args) +{ + int ret = -1; + + mutex_lock(&st_ops_mutex); + if (st_ops && st_ops->test_prologue) + ret = st_ops->test_prologue(args); + mutex_unlock(&st_ops_mutex); + + return ret; +} + +__bpf_kfunc int bpf_kfunc_st_ops_test_epilogue(struct st_ops_args *args) +{ + int ret = -1; + + mutex_lock(&st_ops_mutex); + if (st_ops && st_ops->test_epilogue) + ret = st_ops->test_epilogue(args); + mutex_unlock(&st_ops_mutex); + + return ret; +} + +__bpf_kfunc int bpf_kfunc_st_ops_test_pro_epilogue(struct st_ops_args *args) +{ + int ret = -1; + + mutex_lock(&st_ops_mutex); + if (st_ops && st_ops->test_pro_epilogue) + ret = st_ops->test_pro_epilogue(args); + mutex_unlock(&st_ops_mutex); + + return ret; +} + +__bpf_kfunc int bpf_kfunc_st_ops_inc10(struct st_ops_args *args) +{ + args->a += 10; + return args->a; +} + BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids) BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc) BTF_ID_FLAGS(func, bpf_kfunc_call_test1) @@@ -1049,10 -957,6 +1050,10 @@@ BTF_ID_FLAGS(func, bpf_kfunc_call_kerne BTF_ID_FLAGS(func, bpf_kfunc_call_sock_sendmsg, KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_kfunc_call_kernel_getsockname, KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_kfunc_call_kernel_getpeername, KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_prologue, KF_TRUSTED_ARGS | KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_pro_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_kfunc_st_ops_inc10, KF_TRUSTED_ARGS) BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids)
static int bpf_testmod_ops_init(struct btf *btf) @@@ -1121,11 -1025,6 +1122,11 @@@ static void bpf_testmod_test_2(int a, i { }
+static int bpf_testmod_tramp(int value) +{ + return 0; +} + static int bpf_testmod_ops__test_maybe_null(int dummy, struct task_struct *task__nullable) { @@@ -1172,144 -1071,6 +1173,144 @@@ struct bpf_struct_ops bpf_testmod_ops2 .owner = THIS_MODULE, };
+static int bpf_test_mod_st_ops__test_prologue(struct st_ops_args *args) +{ + return 0; +} + +static int bpf_test_mod_st_ops__test_epilogue(struct st_ops_args *args) +{ + return 0; +} + +static int bpf_test_mod_st_ops__test_pro_epilogue(struct st_ops_args *args) +{ + return 0; +} + +static int st_ops_gen_prologue(struct bpf_insn *insn_buf, bool direct_write, + const struct bpf_prog *prog) +{ + struct bpf_insn *insn = insn_buf; + + if (strcmp(prog->aux->attach_func_name, "test_prologue") && + strcmp(prog->aux->attach_func_name, "test_pro_epilogue")) + return 0; + + /* r6 = r1[0]; // r6 will be "struct st_ops *args". r1 is "u64 *ctx". + * r7 = r6->a; + * r7 += 1000; + * r6->a = r7; + */ + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0); + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_6, offsetof(struct st_ops_args, a)); + *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 1000); + *insn++ = BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_7, offsetof(struct st_ops_args, a)); + *insn++ = prog->insnsi[0]; + + return insn - insn_buf; +} + +static int st_ops_gen_epilogue(struct bpf_insn *insn_buf, const struct bpf_prog *prog, + s16 ctx_stack_off) +{ + struct bpf_insn *insn = insn_buf; + + if (strcmp(prog->aux->attach_func_name, "test_epilogue") && + strcmp(prog->aux->attach_func_name, "test_pro_epilogue")) + return 0; + + /* r1 = stack[ctx_stack_off]; // r1 will be "u64 *ctx" + * r1 = r1[0]; // r1 will be "struct st_ops *args" + * r6 = r1->a; + * r6 += 10000; + * r1->a = r6; + * r0 = r6; + * r0 *= 2; + * BPF_EXIT; + */ + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_FP, ctx_stack_off); + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0); + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, offsetof(struct st_ops_args, a)); + *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 10000); + *insn++ = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, offsetof(struct st_ops_args, a)); + *insn++ = BPF_MOV64_REG(BPF_REG_0, BPF_REG_6); + *insn++ = BPF_ALU64_IMM(BPF_MUL, BPF_REG_0, 2); + *insn++ = BPF_EXIT_INSN(); + + return insn - insn_buf; +} + +static int st_ops_btf_struct_access(struct bpf_verifier_log *log, + const struct bpf_reg_state *reg, + int off, int size) +{ + if (off < 0 || off + size > sizeof(struct st_ops_args)) + return -EACCES; + return 0; +} + +static const struct bpf_verifier_ops st_ops_verifier_ops = { + .is_valid_access = bpf_testmod_ops_is_valid_access, + .btf_struct_access = st_ops_btf_struct_access, + .gen_prologue = st_ops_gen_prologue, + .gen_epilogue = st_ops_gen_epilogue, + .get_func_proto = bpf_base_func_proto, +}; + +static struct bpf_testmod_st_ops st_ops_cfi_stubs = { + .test_prologue = bpf_test_mod_st_ops__test_prologue, + .test_epilogue = bpf_test_mod_st_ops__test_epilogue, + .test_pro_epilogue = bpf_test_mod_st_ops__test_pro_epilogue, +}; + +static int st_ops_reg(void *kdata, struct bpf_link *link) +{ + int err = 0; + + mutex_lock(&st_ops_mutex); + if (st_ops) { + pr_err("st_ops has already been registered\n"); + err = -EEXIST; + goto unlock; + } + st_ops = kdata; + +unlock: + mutex_unlock(&st_ops_mutex); + return err; +} + +static void st_ops_unreg(void *kdata, struct bpf_link *link) +{ + mutex_lock(&st_ops_mutex); + st_ops = NULL; + mutex_unlock(&st_ops_mutex); +} + +static int st_ops_init(struct btf *btf) +{ + return 0; +} + +static int st_ops_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + return 0; +} + +static struct bpf_struct_ops testmod_st_ops = { + .verifier_ops = &st_ops_verifier_ops, + .init = st_ops_init, + 
.init_member = st_ops_init_member, + .reg = st_ops_reg, + .unreg = st_ops_unreg, + .cfi_stubs = &st_ops_cfi_stubs, + .name = "bpf_testmod_st_ops", + .owner = THIS_MODULE, +}; + extern int bpf_fentry_test1(int a);
static int bpf_testmod_init(void) @@@ -1320,17 -1081,14 +1321,17 @@@ .kfunc_btf_id = bpf_testmod_dtor_ids[1] }, }; + void **tramp; int ret;
ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &bpf_testmod_common_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_testmod_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_testmod_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_testmod_kfunc_set); + ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_testmod_kfunc_set); ret = ret ?: register_bpf_struct_ops(&bpf_bpf_testmod_ops, bpf_testmod_ops); ret = ret ?: register_bpf_struct_ops(&bpf_testmod_ops2, bpf_testmod_ops2); + ret = ret ?: register_bpf_struct_ops(&testmod_st_ops, bpf_testmod_st_ops); ret = ret ?: register_btf_id_dtor_kfuncs(bpf_testmod_dtors, ARRAY_SIZE(bpf_testmod_dtors), THIS_MODULE); @@@ -1346,14 -1104,6 +1347,14 @@@ ret = register_bpf_testmod_uprobe(); if (ret < 0) return ret; + + /* Ensure nothing is between tramp_1..tramp_40 */ + BUILD_BUG_ON(offsetof(struct bpf_testmod_ops, tramp_1) + 40 * sizeof(long) != + offsetofend(struct bpf_testmod_ops, tramp_40)); + tramp = (void **)&__bpf_testmod_ops.tramp_1; + while (tramp <= (void **)&__bpf_testmod_ops.tramp_40) + *tramp++ = bpf_testmod_tramp; + return 0; }
diff --combined tools/testing/selftests/mm/Makefile index d7a85059c27bd,4ea188be0588a..02e1204971b0a --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@@ -90,6 -90,7 +90,7 @@@ CAN_BUILD_X86_64 := $(shell ./../x86/ch CAN_BUILD_WITH_NOPIE := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_program.c -no-pie)
VMTARGETS := protection_keys + VMTARGETS += pkey_sighandler_tests BINARIES_32 := $(VMTARGETS:%=%_32) BINARIES_64 := $(VMTARGETS:%=%_64)
@@@ -106,13 -107,13 +107,13 @@@ TEST_GEN_FILES += $(BINARIES_64 endif else
-ifneq (,$(findstring $(ARCH),powerpc)) +ifneq (,$(filter $(ARCH),arm64 powerpc)) TEST_GEN_FILES += protection_keys endif
endif
-ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390)) +ifneq (,$(filter $(ARCH),arm64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390)) TEST_GEN_FILES += va_high_addr_switch TEST_GEN_FILES += virtual_address_range TEST_GEN_FILES += write_to_hugetlbfs diff --combined tools/testing/selftests/mm/pkey-helpers.h index 15608350fc017,4d31a309a46b5..9ab6a3ee153b5 --- a/tools/testing/selftests/mm/pkey-helpers.h +++ b/tools/testing/selftests/mm/pkey-helpers.h @@@ -79,7 -79,18 +79,18 @@@ extern void abort_hooks(void) } \ } while (0)
- __attribute__((noinline)) int read_ptr(int *ptr); + #define barrier() __asm__ __volatile__("": : :"memory") + #ifndef noinline + # define noinline __attribute__((noinline)) + #endif + + noinline int read_ptr(int *ptr) + { + /* Keep GCC from optimizing this away somehow */ + barrier(); + return *ptr; + } + void expected_pkey_fault(int pkey); int sys_pkey_alloc(unsigned long flags, unsigned long init_val); int sys_pkey_free(unsigned long pkey); @@@ -91,17 -102,12 +102,17 @@@ void record_pkey_malloc(void *ptr, lon #include "pkey-x86.h" #elif defined(__powerpc64__) /* arch */ #include "pkey-powerpc.h" +#elif defined(__aarch64__) /* arch */ +#include "pkey-arm64.h" #else /* arch */ #error Architecture not supported #endif /* arch */
+#ifndef PKEY_MASK #define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) +#endif
+#ifndef set_pkey_bits static inline u64 set_pkey_bits(u64 reg, int pkey, u64 flags) { u32 shift = pkey_bit_position(pkey); @@@ -111,9 -117,7 +122,9 @@@ reg |= (flags & PKEY_MASK) << shift; return reg; } +#endif
+#ifndef get_pkey_bits static inline u64 get_pkey_bits(u64 reg, int pkey) { u32 shift = pkey_bit_position(pkey); @@@ -123,7 -127,6 +134,7 @@@ */ return ((reg >> shift) & PKEY_MASK); } +#endif
extern u64 shadow_pkey_reg;
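Editor's note: set_pkey_bits()/get_pkey_bits() are now guarded by #ifndef so an architecture header (pkey-arm64.h here) can supply its own packing; the generic versions just shift a PKEY_MASK-wide flag field to the key's bit position. A self-contained sketch of the same pack/unpack logic, assuming an x86-style layout of two bits per key with PKEY_DISABLE_ACCESS = 0x1 and PKEY_DISABLE_WRITE = 0x2; the field width and position are exactly what pkey_bit_position() abstracts per architecture:

#include <stdint.h>
#include <stdio.h>

#define PKEY_DISABLE_ACCESS	0x1
#define PKEY_DISABLE_WRITE	0x2
#define PKEY_MASK		(PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE)
#define PKEY_BITS_PER_PKEY	2	/* x86 PKRU-style assumption */

static inline uint32_t pkey_bit_position(int pkey)
{
	return pkey * PKEY_BITS_PER_PKEY;
}

static inline uint64_t set_pkey_bits(uint64_t reg, int pkey, uint64_t flags)
{
	uint32_t shift = pkey_bit_position(pkey);

	reg &= ~((uint64_t)PKEY_MASK << shift);	/* clear the key's field */
	reg |= (flags & PKEY_MASK) << shift;	/* install the new flags */
	return reg;
}

static inline uint64_t get_pkey_bits(uint64_t reg, int pkey)
{
	return (reg >> pkey_bit_position(pkey)) & PKEY_MASK;
}

int main(void)
{
	uint64_t reg = 0;

	reg = set_pkey_bits(reg, 3, PKEY_DISABLE_WRITE);
	printf("reg=0x%llx key3=0x%llx\n",
	       (unsigned long long)reg,
	       (unsigned long long)get_pkey_bits(reg, 3));
	return 0;
}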
diff --combined tools/testing/selftests/mm/protection_keys.c index 0789981b72b95,cc6de1644360a..4990f7ab4cb72 --- a/tools/testing/selftests/mm/protection_keys.c +++ b/tools/testing/selftests/mm/protection_keys.c @@@ -147,7 -147,7 +147,7 @@@ void abort_hooks(void * will then fault, which makes sure that the fault code handles * execute-only memory properly. */ -#ifdef __powerpc64__ +#if defined(__powerpc64__) || defined(__aarch64__) /* This way, both 4K and 64K alignment are maintained */ __attribute__((__aligned__(65536))) #else @@@ -212,6 -212,7 +212,6 @@@ void pkey_disable_set(int pkey, int fla unsigned long syscall_flags = 0; int ret; int pkey_rights; - u64 orig_pkey_reg = read_pkey_reg();
dprintf1("START->%s(%d, 0x%x)\n", __func__, pkey, flags); @@@ -241,6 -242,8 +241,6 @@@
dprintf1("%s(%d) pkey_reg: 0x%016llx\n", __func__, pkey, read_pkey_reg()); - if (flags) - pkey_assert(read_pkey_reg() >= orig_pkey_reg); dprintf1("END<---%s(%d, 0x%x)\n", __func__, pkey, flags); } @@@ -250,6 -253,7 +250,6 @@@ void pkey_disable_clear(int pkey, int f unsigned long syscall_flags = 0; int ret; int pkey_rights = hw_pkey_get(pkey, syscall_flags); - u64 orig_pkey_reg = read_pkey_reg();
pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
@@@ -269,6 -273,8 +269,6 @@@
dprintf1("%s(%d) pkey_reg: 0x%016llx\n", __func__, pkey, read_pkey_reg()); - if (flags) - assert(read_pkey_reg() <= orig_pkey_reg); }
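Editor's note: the dropped assertions compared the whole pkey register numerically against its previous value, which presumes an encoding where setting a bit always removes a right (x86 PKRU style), so restricting can only make the value grow. With a permission-style encoding such as arm64's POR_EL0, restricting a key clears permission bits, so the register may get numerically smaller instead. A toy illustration under two simplified, assumed encodings; the field layouts below are illustrative and not the real PKRU or POR_EL0 formats:

#include <stdint.h>
#include <stdio.h>

/* x86 PKRU style: a set bit removes a right, so restricting grows the value. */
static uint64_t pkru_disable_write(uint64_t reg, int pkey)
{
	return reg | (0x2ULL << (pkey * 2));	/* set a "write disabled" bit */
}

/* Permission style (POE-like, simplified): a set bit grants a right,
 * so restricting shrinks the value. */
static uint64_t poe_disable_write(uint64_t reg, int pkey)
{
	return reg & ~(0x2ULL << (pkey * 4));	/* clear a "write allowed" bit */
}

int main(void)
{
	uint64_t pkru = 0, poe = ~0ULL;

	printf("pkru: 0x%llx -> 0x%llx (grows when restricted)\n",
	       (unsigned long long)pkru,
	       (unsigned long long)pkru_disable_write(pkru, 1));
	printf("poe:  0x%llx -> 0x%llx (shrinks when restricted)\n",
	       (unsigned long long)poe,
	       (unsigned long long)poe_disable_write(poe, 1));
	return 0;
}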
void pkey_write_allow(int pkey) @@@ -308,9 -314,7 +308,9 @@@ void signal_handler(int signum, siginfo ucontext_t *uctxt = vucontext; int trapno; unsigned long ip; +#ifdef MCONTEXT_FPREGS char *fpregs; +#endif #if defined(__i386__) || defined(__x86_64__) /* arch */ u32 *pkey_reg_ptr; int pkey_reg_offset; @@@ -324,11 -328,9 +324,11 @@@ __func__, __LINE__, __read_pkey_reg(), shadow_pkey_reg);
- trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO]; - ip = uctxt->uc_mcontext.gregs[REG_IP_IDX]; + trapno = MCONTEXT_TRAPNO(uctxt->uc_mcontext); + ip = MCONTEXT_IP(uctxt->uc_mcontext); +#ifdef MCONTEXT_FPREGS fpregs = (char *) uctxt->uc_mcontext.fpregs; +#endif
dprintf2("%s() trapno: %d ip: 0x%016lx info->si_code: %s/%d\n", __func__, trapno, ip, si_code_str(si->si_code), @@@ -357,9 -359,7 +357,9 @@@ #endif /* arch */
dprintf1("siginfo: %p\n", si); +#ifdef MCONTEXT_FPREGS dprintf1(" fpregs: %p\n", fpregs); +#endif
if ((si->si_code == SEGV_MAPERR) || (si->si_code == SEGV_ACCERR) || @@@ -389,8 -389,6 +389,8 @@@ #elif defined(__powerpc64__) /* arch */ /* restore access and let the faulting instruction continue */ pkey_access_allow(siginfo_pkey); +#elif defined(__aarch64__) + aarch64_write_signal_pkey(uctxt, PKEY_ALLOW_ALL); #endif /* arch */ pkey_faults++; dprintf1("<<<<==================================================\n"); @@@ -904,9 -902,7 +904,9 @@@ void expected_pkey_fault(int pkey * test program continue. We now have to restore it. */ if (__read_pkey_reg() != 0) -#else /* arch */ +#elif defined(__aarch64__) + if (__read_pkey_reg() != PKEY_ALLOW_ALL) +#else if (__read_pkey_reg() != shadow_pkey_reg) #endif /* arch */ pkey_assert(0); @@@ -954,16 -950,6 +954,6 @@@ void close_test_fds(void nr_test_fds = 0; }
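Editor's note: the handler now reaches the trap number and faulting IP through MCONTEXT_TRAPNO()/MCONTEXT_IP() (with fpregs behind MCONTEXT_FPREGS) so each pkey-*.h header decides where those live in ucontext, and on aarch64 it patches the POR_EL0 value saved in the signal frame so the faulting access retries with PKEY_ALLOW_ALL after sigreturn. A hedged userspace sketch of recovering from a pkey fault follows; it deliberately uses a simpler strategy than the selftest (re-tagging the VMA with pkey 0 via glibc's pkey_mprotect() instead of patching the signal frame), the globals are hypothetical, and error handling is trimmed:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static int *page_addr;			/* hypothetical test mapping */
static long page_size;

static void segv_handler(int sig, siginfo_t *si, void *vucontext)
{
	(void)sig; (void)vucontext;
	if (si->si_code != SEGV_PKUERR)
		_exit(1);
	/*
	 * Detach the restrictive pkey from the VMA so the faulting access
	 * succeeds when the instruction retries.  The selftest instead
	 * restores the per-thread pkey register (or its signal-frame copy).
	 */
	pkey_mprotect(page_addr, page_size, PROT_READ | PROT_WRITE, 0);
}

int main(void)
{
	struct sigaction sa = { 0 };
	int pkey;

	page_size = sysconf(_SC_PAGESIZE);
	page_addr = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);

	pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);	/* all access denied */
	pkey_mprotect(page_addr, page_size, PROT_READ | PROT_WRITE, pkey);

	/* Faults once with SEGV_PKUERR, the handler re-opens the page,
	 * and the read then completes. */
	printf("read after recovery: %d\n", *page_addr);
	pkey_free(pkey);
	return 0;
}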
- #define barrier() __asm__ __volatile__("": : :"memory") - __attribute__((noinline)) int read_ptr(int *ptr) - { - /* - * Keep GCC from optimizing this away somehow - */ - barrier(); - return *ptr; - } - void test_pkey_alloc_free_attach_pkey0(int *ptr, u16 pkey) { int i, err; @@@ -1496,11 -1482,6 +1486,11 @@@ void test_executing_on_unreadable_memor lots_o_noops_around_write(&scratch); do_not_expect_pkey_fault("executing on PROT_EXEC memory"); expect_fault_on_read_execonly_key(p1, pkey); + + // Reset back to PROT_EXEC | PROT_READ for architectures that support + // non-PKEY execute-only permissions. + ret = mprotect_pkey(p1, PAGE_SIZE, PROT_EXEC | PROT_READ, (u64)pkey); + pkey_assert(!ret); }
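Editor's note: the new tail of test_executing_on_unreadable_memory() puts the page back to PROT_EXEC | PROT_READ because on architectures with native execute-only support a bare PROT_EXEC mapping really is unreadable and would trip later checks. A small sketch of that distinction using plain mprotect() rather than the selftest's mprotect_pkey() wrapper, observing the mapping's rwxp flags in /proc/self/maps (error handling trimmed):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Print the /proc/self/maps line covering addr, to show its rwxp flags. */
static void show_perms(void *addr)
{
	char line[256];
	unsigned long a = (unsigned long)addr;
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f)) {
		unsigned long start, end;

		if (sscanf(line, "%lx-%lx", &start, &end) == 2 &&
		    a >= start && a < end)
			fputs(line, stdout);
	}
	fclose(f);
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	mprotect(p, page, PROT_EXEC);		/* execute-only where supported */
	show_perms(p);				/* typically shown as "--xp" */

	mprotect(p, page, PROT_EXEC | PROT_READ); /* the reset the test now does */
	show_perms(p);				/* back to "r-xp" */
	return 0;
}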
void test_implicit_mprotect_exec_only_memory(int *ptr, u16 pkey) @@@ -1674,84 -1655,6 +1664,84 @@@ void test_ptrace_modifies_pkru(int *ptr } #endif
+#if defined(__aarch64__) +void test_ptrace_modifies_pkru(int *ptr, u16 pkey) +{ + pid_t child; + int status, ret; + struct iovec iov; + u64 trace_pkey; + /* Just a random pkey value.. */ + u64 new_pkey = (POE_X << PKEY_BITS_PER_PKEY * 2) | + (POE_NONE << PKEY_BITS_PER_PKEY) | + POE_RWX; + + child = fork(); + pkey_assert(child >= 0); + dprintf3("[%d] fork() ret: %d\n", getpid(), child); + if (!child) { + ptrace(PTRACE_TRACEME, 0, 0, 0); + + /* Stop and allow the tracer to modify PKRU directly */ + raise(SIGSTOP); + + /* + * need __read_pkey_reg() version so we do not do shadow_pkey_reg + * checking + */ + if (__read_pkey_reg() != new_pkey) + exit(1); + + raise(SIGSTOP); + + exit(0); + } + + pkey_assert(child == waitpid(child, &status, 0)); + dprintf3("[%d] waitpid(%d) status: %x\n", getpid(), child, status); + pkey_assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP); + + iov.iov_base = &trace_pkey; + iov.iov_len = 8; + ret = ptrace(PTRACE_GETREGSET, child, (void *)NT_ARM_POE, &iov); + pkey_assert(ret == 0); + pkey_assert(trace_pkey == read_pkey_reg()); + + trace_pkey = new_pkey; + + ret = ptrace(PTRACE_SETREGSET, child, (void *)NT_ARM_POE, &iov); + pkey_assert(ret == 0); + + /* Test that the modification is visible in ptrace before any execution */ + memset(&trace_pkey, 0, sizeof(trace_pkey)); + ret = ptrace(PTRACE_GETREGSET, child, (void *)NT_ARM_POE, &iov); + pkey_assert(ret == 0); + pkey_assert(trace_pkey == new_pkey); + + /* Execute the tracee */ + ret = ptrace(PTRACE_CONT, child, 0, 0); + pkey_assert(ret == 0); + + /* Test that the tracee saw the PKRU value change */ + pkey_assert(child == waitpid(child, &status, 0)); + dprintf3("[%d] waitpid(%d) status: %x\n", getpid(), child, status); + pkey_assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP); + + /* Test that the modification is visible in ptrace after execution */ + memset(&trace_pkey, 0, sizeof(trace_pkey)); + ret = ptrace(PTRACE_GETREGSET, child, (void *)NT_ARM_POE, &iov); + pkey_assert(ret == 0); + pkey_assert(trace_pkey == new_pkey); + + ret = ptrace(PTRACE_CONT, child, 0, 0); + pkey_assert(ret == 0); + pkey_assert(child == waitpid(child, &status, 0)); + dprintf3("[%d] waitpid(%d) status: %x\n", getpid(), child, status); + pkey_assert(WIFEXITED(status)); + pkey_assert(WEXITSTATUS(status) == 0); +} +#endif + void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey) { int size = PAGE_SIZE; @@@ -1787,7 -1690,7 +1777,7 @@@ void (*pkey_tests[])(int *ptr, u16 pkey test_pkey_syscalls_bad_args, test_pkey_alloc_exhaust, test_pkey_alloc_free_attach_pkey0, -#if defined(__i386__) || defined(__x86_64__) +#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__) test_ptrace_modifies_pkru, #endif };
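Editor's note: the aarch64 variant of test_ptrace_modifies_pkru() drives the child's POR_EL0 through PTRACE_GETREGSET/PTRACE_SETREGSET with the NT_ARM_POE regset, passing an 8-byte u64 via struct iovec. The tracer-side pattern is sketched below with the generic NT_PRSTATUS regset and struct user_regs_struct so it builds on common toolchains; substituting NT_ARM_POE and a u64 buffer gives the test's usage (error handling trimmed):

#include <elf.h>		/* NT_PRSTATUS */
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>		/* struct iovec */
#include <sys/user.h>		/* struct user_regs_struct */
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	pid_t child = fork();

	if (!child) {
		ptrace(PTRACE_TRACEME, 0, 0, 0);
		raise(SIGSTOP);		/* let the tracer inspect us */
		_exit(0);
	}

	waitpid(child, NULL, 0);	/* child is now stopped */

	struct user_regs_struct regs;
	struct iovec iov = { .iov_base = &regs, .iov_len = sizeof(regs) };

	/* Read a register set; the regset id selects what iov_base receives. */
	if (ptrace(PTRACE_GETREGSET, child, (void *)NT_PRSTATUS, &iov) == 0)
		printf("read %zu bytes of NT_PRSTATUS\n", iov.iov_len);

	/* A tracer could modify 'regs' here and write it back the same way,
	 * which is how the test injects a new POR_EL0 value. */
	ptrace(PTRACE_SETREGSET, child, (void *)NT_PRSTATUS, &iov);

	ptrace(PTRACE_CONT, child, 0, 0);
	waitpid(child, NULL, 0);
	return 0;
}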