The following commit has been merged in the master branch:

commit 37855b042c032bbcd1ec025dacba7bfd8d58c12c
Merge: 7d4502a36f95ffa3f9d805ea2ec56f6440bd5001 4c293d0fa3159f5062483ef981d77f455589cfda
Author: Stephen Rothwell <sfr@canb.auug.org.au>
Date:   Fri Sep 13 15:51:15 2024 +1000
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
# Conflicts:
#	arch/arm64/Kconfig
diff --combined Documentation/admin-guide/media/vivid.rst
index ac233b142a279,c9d301ab46a38..034ca7c77fb97
--- a/Documentation/admin-guide/media/vivid.rst
+++ b/Documentation/admin-guide/media/vivid.rst
@@@ -328,7 -328,7 +328,7 @@@ and an HDMI input, one input for each i
  detail below.
Special attention has been given to the rate at which new frames become - available. The jitter will be around 1 jiffie (that depends on the HZ + available. The jitter will be around 1 jiffy (that depends on the HZ configuration of your kernel, so usually 1/100, 1/250 or 1/1000 of a second), but the long-term behavior is exactly following the framerate. So a framerate of 59.94 Hz is really different from 60 Hz. If the framerate @@@ -1343,7 -1343,7 +1343,7 @@@ Some Future Improvement Just as a reminder and in no particular order:
- Add a virtual alsa driver to test audio -- Add virtual sub-devices and media controller support +- Add virtual sub-devices - Some support for testing compressed video - Add support to loop raw VBI output to raw VBI input - Add support to loop teletext sliced VBI output to VBI input @@@ -1358,4 -1358,4 +1358,4 @@@ - Make a thread for the RDS generation, that would help in particular for the "Controls" RDS Rx I/O Mode as the read-only RDS controls could be updated in real-time. -- Changing the EDID should cause hotplug detect emulation to happen. +- Changing the EDID doesn't wait 100 ms before setting the HPD signal. diff --combined Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst index c146e5bba8818,731c266beb1a1..dc728c739e28d --- a/Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst +++ b/Documentation/translations/sp_SP/scheduler/sched-design-CFS.rst @@@ -14,10 -14,10 +14,10 @@@ Gestor de tareas CF
CFS viene de las siglas en inglés de "Gestor de tareas totalmente justo" ("Completely Fair Scheduler"), y es el nuevo gestor de tareas de escritorio -implementado por Ingo Molnar e integrado en Linux 2.6.23. Es el sustituto de -el previo gestor de tareas SCHED_OTHER. - -Nota: El planificador EEVDF fue incorporado más recientemente al kernel. +implementado por Ingo Molnar e integrado en Linux 2.6.23. Es el sustituto +del previo gestor de tareas SCHED_OTHER. Hoy en día se está abriendo camino +para el gestor de tareas EEVDF, cuya documentación se puede ver en +Documentation/scheduler/sched-eevdf.rst
El 80% del diseño de CFS puede ser resumido en una única frase: CFS básicamente modela una "CPU ideal, precisa y multi-tarea" sobre hardware @@@ -109,7 -109,7 +109,7 @@@ para que se ejecute, y la tarea en ejec ==================================
CFS usa una granularidad de nanosegundos y no depende de ningún - jiffie o detalles como HZ. De este modo, el gestor de tareas CFS no tiene + jiffy o detalles como HZ. De este modo, el gestor de tareas CFS no tiene noción de "ventanas de tiempo" de la forma en que tenía el gestor de tareas previo, y tampoco tiene heurísticos. Únicamente hay un parámetro central ajustable (se ha de cambiar en CONFIG_SCHED_DEBUG): diff --combined MAINTAINERS index 694fc003d13e6,352ec3c2d7142..d894832bdb3ba --- a/MAINTAINERS +++ b/MAINTAINERS @@@ -334,7 -334,6 +334,7 @@@ L: linux-acpi@vger.kernel.or L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) S: Maintained F: drivers/acpi/arm64 +F: include/linux/acpi_iort.h
ACPI FOR RISC-V (ACPI/riscv) M: Sunil V L sunilvl@ventanamicro.com @@@ -538,17 -537,6 +538,17 @@@ F: drivers/leds/leds-adp5520. F: drivers/mfd/adp5520.c F: drivers/video/backlight/adp5520_bl.c
+ADP5585 GPIO EXPANDER, PWM AND KEYPAD CONTROLLER DRIVER +M: Laurent Pinchart laurent.pinchart@ideasonboard.com +L: linux-gpio@vger.kernel.org +L: linux-pwm@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/*/adi,adp5585*.yaml +F: drivers/gpio/gpio-adp5585.c +F: drivers/mfd/adp5585.c +F: drivers/pwm/pwm-adp5585.c +F: include/linux/mfd/adp5585.h + ADP5588 QWERTY KEYPAD AND IO EXPANDER DRIVER (ADP5588/ADP5587) M: Michael Hennerich michael.hennerich@analog.com S: Supported @@@ -1025,13 -1013,6 +1025,13 @@@ S: Supporte T: git https://gitlab.freedesktop.org/agd5f/linux.git F: drivers/gpu/drm/amd/display/
+AMD DISPLAY CORE - DML +M: Chaitanya Dhere chaitanya.dhere@amd.com +M: Jun Lei jun.lei@amd.com +S: Supported +F: drivers/gpu/drm/amd/display/dc/dml/ +F: drivers/gpu/drm/amd/display/dc/dml2/ + AMD FAM15H PROCESSOR POWER MONITORING DRIVER M: Huang Rui ray.huang@amd.com L: linux-hwmon@vger.kernel.org @@@ -1172,13 -1153,6 +1172,13 @@@ S: Supporte F: arch/arm64/boot/dts/amd/amd-seattle-xgbe*.dtsi F: drivers/net/ethernet/amd/xgbe/
+AMLOGIC BLUETOOTH DRIVER +M: Yang Li yang.li@amlogic.com +L: linux-bluetooth@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/net/bluetooth/amlogic,w155s2-bt.yaml +F: drivers/bluetooth/hci_aml.c + AMLOGIC DDR PMU DRIVER M: Jiucheng Xu jiucheng.xu@amlogic.com L: linux-amlogic@lists.infradead.org @@@ -1228,13 -1202,6 +1228,13 @@@ W: https://ez.analog.com/linux-software F: Documentation/devicetree/bindings/iio/dac/adi,ad3552r.yaml F: drivers/iio/dac/ad3552r.c
+ANALOG DEVICES INC AD4000 DRIVER +M: Marcelo Schmitt marcelo.schmitt@analog.com +L: linux-iio@vger.kernel.org +S: Supported +W: https://ez.analog.com/linux-software-drivers +F: Documentation/devicetree/bindings/iio/adc/adi,ad4000.yaml + ANALOG DEVICES INC AD4130 DRIVER M: Cosmin Tanislav cosmin.tanislav@analog.com L: linux-iio@vger.kernel.org @@@ -1642,14 -1609,6 +1642,14 @@@ F: Documentation/admin-guide/perf/xgene F: Documentation/devicetree/bindings/perf/apm-xgene-pmu.txt F: drivers/perf/xgene_pmu.c
+APPLIED MICRO QT2025 PHY DRIVER +M: FUJITA Tomonori fujita.tomonori@gmail.com +R: Trevor Gross tmgross@umich.edu +L: netdev@vger.kernel.org +L: rust-for-linux@vger.kernel.org +S: Maintained +F: drivers/net/phy/qt2025.rs + APTINA CAMERA SENSOR PLL M: Laurent Pinchart Laurent.pinchart@ideasonboard.com L: linux-media@vger.kernel.org @@@ -1778,17 -1737,6 +1778,17 @@@ F: drivers/mtd/maps/physmap-versatile. F: drivers/power/reset/arm-versatile-reboot.c F: drivers/soc/versatile/
+ARM INTERCONNECT PMU DRIVERS +M: Robin Murphy robin.murphy@arm.com +S: Supported +F: Documentation/admin-guide/perf/arm-cmn.rst +F: Documentation/admin-guide/perf/arm-ni.rst +F: Documentation/devicetree/bindings/perf/arm,cmn.yaml +F: Documentation/devicetree/bindings/perf/arm,ni.yaml +F: drivers/perf/arm-cmn.c +F: drivers/perf/arm-ni.c +F: tools/perf/pmu-events/arch/arm64/arm/cmn/ + ARM KOMEDA DRM-KMS DRIVER M: Liviu Dudau liviu.dudau@arm.com S: Supported @@@ -1806,7 -1754,6 +1806,7 @@@ L: dri-devel@lists.freedesktop.or S: Supported T: git https://gitlab.freedesktop.org/drm/misc/kernel.git F: Documentation/gpu/panfrost.rst +F: drivers/gpu/drm/ci/xfails/panfrost* F: drivers/gpu/drm/panfrost/ F: include/uapi/drm/panfrost_drm.h
@@@ -2485,7 -2432,6 +2485,7 @@@ N: lpc18x
ARM/LPC32XX SOC SUPPORT M: Vladimir Zapolskiy vz@mleia.com +M: Piotr Wojtaszczyk piotr.wojtaszczyk@timesys.com L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) S: Maintained T: git git://github.com/vzapolskiy/linux-lpc32xx.git @@@ -2498,14 -2444,6 +2498,14 @@@ F: drivers/usb/host/ohci-nxp. F: drivers/watchdog/pnx4008_wdt.c N: lpc32xx
+LPC32XX DMAMUX SUPPORT +M: J.M.B. Downing jonathan.downing@nautel.com +M: Piotr Wojtaszczyk piotr.wojtaszczyk@timesys.com +R: Vladimir Zapolskiy vz@mleia.com +L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) +S: Maintained +F: Documentation/devicetree/bindings/dma/nxp,lpc3220-dmamux.yaml + ARM/Marvell Dove/MV78xx0/Orion SOC support M: Andrew Lunn andrew@lunn.ch M: Sebastian Hesselbarth sebastian.hesselbarth@gmail.com @@@ -3848,9 -3786,10 +3848,9 @@@ F: Documentation/filesystems/befs.rs F: fs/befs/
BFQ I/O SCHEDULER -M: Paolo Valente paolo.valente@unimore.it -M: Jens Axboe axboe@kernel.dk +M: Yu Kuai yukuai3@huawei.com L: linux-block@vger.kernel.org -S: Maintained +S: Odd Fixes F: Documentation/block/bfq-iosched.rst F: block/bfq-*
@@@ -3997,7 -3936,7 +3997,7 @@@ F: Documentation/devicetree/bindings/ii F: drivers/iio/imu/bmi323/
BPF JIT for ARC -M: Shahab Vahedi shahab@synopsys.com +M: Shahab Vahedi list+bpf@vahedi.org L: bpf@vger.kernel.org S: Maintained F: arch/arc/net/ @@@ -4164,7 -4103,6 +4164,7 @@@ F: include/uapi/linux/btf F: include/uapi/linux/filter.h F: kernel/bpf/ F: kernel/trace/bpf_trace.c +F: lib/buildid.c F: lib/test_bpf.c F: net/bpf/ F: net/core/filter.c @@@ -4285,7 -4223,6 +4285,7 @@@ L: bpf@vger.kernel.or S: Maintained F: kernel/bpf/stackmap.c F: kernel/trace/bpf_trace.c +F: lib/buildid.c
BROADCOM ASP 2.0 ETHERNET DRIVER M: Justin Chen justin.chen@broadcom.com @@@ -5164,8 -5101,10 +5164,8 @@@ F: Documentation/devicetree/bindings/me F: drivers/media/cec/platform/cec-gpio/
CELL BROADBAND ENGINE ARCHITECTURE -M: Arnd Bergmann arnd@arndb.de L: linuxppc-dev@lists.ozlabs.org -S: Supported -W: http://www.ibm.com/developerworks/power/cell/ +S: Orphan F: arch/powerpc/include/asm/cell*.h F: arch/powerpc/include/asm/spu*.h F: arch/powerpc/include/uapi/asm/spu*.h @@@ -5258,7 -5197,7 +5258,7 @@@ F: Documentation/dev-tools/checkpatch.r
CHINESE DOCUMENTATION M: Alex Shi alexs@kernel.org -M: Yanteng Si siyanteng@loongson.cn +M: Yanteng Si si.yanteng@linux.dev S: Maintained F: Documentation/translations/zh_CN/
@@@ -5885,9 -5824,6 +5885,9 @@@ CPU POWER MONITORING SUBSYSTE M: Thomas Renninger trenn@suse.com M: Shuah Khan shuah@kernel.org M: Shuah Khan skhan@linuxfoundation.org +M: John B. Wyatt IV jwyatt@redhat.com +M: John B. Wyatt IV sageofredondo@gmail.com +M: John Kacur jkacur@redhat.com L: linux-pm@vger.kernel.org S: Maintained F: tools/power/cpupower/ @@@ -6570,7 -6506,6 +6570,7 @@@ F: Documentation/devicetree/bindings/re F: Documentation/devicetree/bindings/regulator/dlg,da9*.yaml F: Documentation/devicetree/bindings/regulator/dlg,slg51000.yaml F: Documentation/devicetree/bindings/sound/da[79]*.txt +F: Documentation/devicetree/bindings/sound/dlg,da7213.yaml F: Documentation/devicetree/bindings/thermal/dlg,da9062-thermal.yaml F: Documentation/devicetree/bindings/watchdog/dlg,da9062-watchdog.yaml F: Documentation/hwmon/da90??.rst @@@ -6731,7 -6666,6 +6731,7 @@@ F: drivers/dma-buf/dma-heap. F: drivers/dma-buf/heaps/* F: include/linux/dma-heap.h F: include/uapi/linux/dma-heap.h +F: tools/testing/selftests/dmabuf-heaps/
DMC FREQUENCY DRIVER FOR SAMSUNG EXYNOS5422 M: Lukasz Luba lukasz.luba@arm.com @@@ -6783,7 -6717,6 +6783,7 @@@ DOCUMENTATION PROCES M: Jonathan Corbet corbet@lwn.net L: workflows@vger.kernel.org S: Maintained +F: Documentation/dev-tools/ F: Documentation/maintainer/ F: Documentation/process/
@@@ -6791,7 -6724,6 +6791,7 @@@ DOCUMENTATION REPORTING ISSUE M: Thorsten Leemhuis linux@leemhuis.info L: linux-doc@vger.kernel.org S: Maintained +F: Documentation/admin-guide/bug-bisect.rst F: Documentation/admin-guide/quickly-build-trimmed-linux.rst F: Documentation/admin-guide/reporting-issues.rst F: Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst @@@ -7406,10 -7338,10 +7406,10 @@@ F: drivers/gpu/drm/udl
DRM DRIVER FOR VIRTUAL KERNEL MODESETTING (VKMS) M: Rodrigo Siqueira rodrigosiqueiramelo@gmail.com -M: Melissa Wen melissa.srw@gmail.com M: Maíra Canal mairacanal@riseup.net R: Haneen Mohammed hamohammed.sa@gmail.com -R: Daniel Vetter daniel@ffwll.ch +R: Simona Vetter simona@ffwll.ch +R: Melissa Wen melissa.srw@gmail.com L: dri-devel@lists.freedesktop.org S: Maintained T: git https://gitlab.freedesktop.org/drm/misc/kernel.git @@@ -7442,7 -7374,7 +7442,7 @@@ F: drivers/gpu/drm/panel/panel-widechip
DRM DRIVERS M: David Airlie airlied@gmail.com -M: Daniel Vetter daniel@ffwll.ch +M: Simona Vetter simona@ffwll.ch L: dri-devel@lists.freedesktop.org S: Maintained B: https://gitlab.freedesktop.org/drm @@@ -7538,6 -7470,7 +7538,6 @@@ M: Kyungmin Park <kyungmin.park@samsung L: dri-devel@lists.freedesktop.org S: Supported T: git git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos.git -F: Documentation/devicetree/bindings/display/exynos/ F: Documentation/devicetree/bindings/display/samsung/ F: drivers/gpu/drm/exynos/ F: include/uapi/drm/exynos_drm.h @@@ -8412,7 -8345,6 +8412,7 @@@ F: include/linux/mii. F: include/linux/of_net.h F: include/linux/phy.h F: include/linux/phy_fixed.h +F: include/linux/phy_link_topology.h F: include/linux/phylib_stubs.h F: include/linux/platform_data/mdio-bcm-unimac.h F: include/linux/platform_data/mdio-gpio.h @@@ -8428,7 -8360,6 +8428,7 @@@ L: netdev@vger.kernel.or L: rust-for-linux@vger.kernel.org S: Maintained F: rust/kernel/net/phy.rs +F: rust/kernel/net/phy/reg.rs
EXEC & BINFMT API, ELF R: Eric Biederman ebiederm@xmission.com @@@ -8536,13 -8467,6 +8536,13 @@@ F: lib/bootconfig. F: tools/bootconfig/* F: tools/bootconfig/scripts/*
+EXTRON DA HD 4K PLUS CEC DRIVER +M: Hans Verkuil hverkuil@xs4all.nl +L: linux-media@vger.kernel.org +S: Maintained +T: git git://linuxtv.org/media_tree.git +F: drivers/media/cec/usb/extron-da-hd-4k-plus/ + EXYNOS DP DRIVER M: Jingoo Han jingoohan1@gmail.com L: dri-devel@lists.freedesktop.org @@@ -8616,9 -8540,8 +8616,9 @@@ F: drivers/net/wan/farsync. FAULT INJECTION SUPPORT M: Akinobu Mita akinobu.mita@gmail.com S: Supported -F: Documentation/fault-injection/ +F: Documentation/dev-tools/fault-injection/ F: lib/fault-inject.c +F: tools/testing/fault-injection/
FBTFT Framebuffer drivers L: dri-devel@lists.freedesktop.org @@@ -8680,7 -8603,6 +8680,7 @@@ M: Christian Brauner <brauner@kernel.or R: Jan Kara jack@suse.cz L: linux-fsdevel@vger.kernel.org S: Maintained +T: git https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git F: fs/* F: include/linux/fs.h F: include/linux/fs_types.h @@@ -8893,7 -8815,7 +8893,7 @@@ W: https://floatingpoint.billm.au F: arch/x86/math-emu/
FRAMEBUFFER CORE -M: Daniel Vetter daniel@ffwll.ch +M: Simona Vetter simona@ffwll.ch S: Odd Fixes T: git https://gitlab.freedesktop.org/drm/misc/kernel.git F: drivers/video/fbdev/core/ @@@ -9090,7 -9012,6 +9090,7 @@@ M: Herve Codina <herve.codina@bootlin.c L: linuxppc-dev@lists.ozlabs.org S: Maintained F: Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-scc-qmc.yaml +F: Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,qe-ucc-qmc.yaml F: drivers/soc/fsl/qe/qmc.c F: include/soc/fsl/qe/qmc.h
@@@ -9106,11 -9027,9 +9106,11 @@@ M: Herve Codina <herve.codina@bootlin.c L: linuxppc-dev@lists.ozlabs.org S: Maintained F: Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,cpm1-tsa.yaml +F: Documentation/devicetree/bindings/soc/fsl/cpm_qe/fsl,qe-tsa.yaml F: drivers/soc/fsl/qe/tsa.c F: drivers/soc/fsl/qe/tsa.h F: include/dt-bindings/soc/cpm1-fsl,tsa.h +F: include/dt-bindings/soc/qe-fsl,tsa.h
FREESCALE QUICC ENGINE UCC ETHERNET DRIVER L: netdev@vger.kernel.org @@@ -11060,7 -10979,6 +11060,7 @@@ T: git https://gitlab.freedesktop.org/d F: Documentation/devicetree/bindings/gpu/img,powervr-rogue.yaml F: Documentation/devicetree/bindings/gpu/img,powervr-sgx.yaml F: Documentation/gpu/imagination/ +F: drivers/gpu/drm/ci/xfails/powervr* F: drivers/gpu/drm/imagination/ F: include/uapi/drm/pvr_drm.h
@@@ -11186,17 -11104,10 +11186,17 @@@ F: Documentation/devicetree/bindings/se F: Documentation/input/ F: drivers/input/ F: include/dt-bindings/input/ +F: include/linux/gameport.h +F: include/linux/i8042.h F: include/linux/input.h F: include/linux/input/ +F: include/linux/libps2.h +F: include/linux/serio.h +F: include/uapi/linux/gameport.h F: include/uapi/linux/input-event-codes.h F: include/uapi/linux/input.h +F: include/uapi/linux/serio.h +F: include/uapi/linux/uinput.h
INPUT MULTITOUCH (MT) PROTOCOL M: Henrik Rydberg rydberg@bitmath.org @@@ -11223,16 -11134,6 +11223,16 @@@ T: git git://git.kernel.org/pub/scm/lin F: security/integrity/ F: security/integrity/ima/
+INTEGRITY POLICY ENFORCEMENT (IPE) +M: Fan Wu wufan@linux.microsoft.com +L: linux-security-module@vger.kernel.org +S: Supported +T: git https://github.com/microsoft/ipe.git +F: Documentation/admin-guide/LSM/ipe.rst +F: Documentation/security/ipe.rst +F: scripts/ipe/ +F: security/ipe/ + INTEL 810/815 FRAMEBUFFER DRIVER M: Antonino Daplas adaplas@gmail.com L: linux-fbdev@vger.kernel.org @@@ -11255,12 -11156,12 +11255,12 @@@ F: drivers/gpio/gpio-i8255.
INTEL ASoC DRIVERS M: Cezary Rojewski cezary.rojewski@intel.com -M: Pierre-Louis Bossart pierre-louis.bossart@linux.intel.com M: Liam Girdwood liam.r.girdwood@linux.intel.com M: Peter Ujfalusi peter.ujfalusi@linux.intel.com M: Bard Liao yung-chuan.liao@linux.intel.com M: Ranjani Sridharan ranjani.sridharan@linux.intel.com M: Kai Vehmanen kai.vehmanen@linux.intel.com +R: Pierre-Louis Bossart pierre-louis.bossart@linux.dev L: alsa-devel@alsa-project.org (moderated for non-subscribers) S: Supported F: sound/soc/intel/ @@@ -11837,7 -11738,6 +11837,7 @@@ T: git git://git.kernel.org/pub/scm/lin F: drivers/iommu/dma-iommu.c F: drivers/iommu/dma-iommu.h F: drivers/iommu/iova.c +F: include/linux/iommu-dma.h F: include/linux/iova.h
IOMMU SUBSYSTEM @@@ -12408,7 -12308,6 +12408,7 @@@ L: kvm@vger.kernel.or L: loongarch@lists.linux.dev S: Maintained T: git git://git.kernel.org/pub/scm/virt/kvm/kvm.git +F: Documentation/virt/kvm/loongarch/ F: arch/loongarch/include/asm/kvm* F: arch/loongarch/include/uapi/asm/kvm* F: arch/loongarch/kvm/ @@@ -13618,7 -13517,7 +13618,7 @@@ S: Maintaine F: Documentation/devicetree/bindings/mfd/marvell,88pm886-a1.yaml F: drivers/input/misc/88pm886-onkey.c F: drivers/mfd/88pm886.c -F: drivers/regulators/88pm886-regulator.c +F: drivers/regulator/88pm886-regulator.c F: include/linux/mfd/88pm886.h
MARVELL ARMADA 3700 PHY DRIVERS @@@ -14329,8 -14228,8 +14329,8 @@@ M: Sean Wang <sean.wang@mediatek.com L: linux-bluetooth@vger.kernel.org L: linux-mediatek@lists.infradead.org (moderated for non-subscribers) S: Maintained +F: Documentation/devicetree/bindings/net/bluetooth/mediatek,bluetooth.txt F: Documentation/devicetree/bindings/net/bluetooth/mediatek,mt7921s-bluetooth.yaml -F: Documentation/devicetree/bindings/net/mediatek-bluetooth.txt F: drivers/bluetooth/btmtkuart.c
MEDIATEK BOARD LEVEL SHUTDOWN DRIVERS @@@ -14608,7 -14507,7 +14608,7 @@@ MELLANOX ETHERNET DRIVER (mlx4_en M: Tariq Toukan tariqt@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlx4/en_*
@@@ -14617,7 -14516,7 +14617,7 @@@ M: Saeed Mahameed <saeedm@nvidia.com M: Tariq Toukan tariqt@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlx5/core/en_*
@@@ -14625,7 -14524,7 +14625,7 @@@ MELLANOX ETHERNET INNOVA DRIVER R: Boris Pismenny borisp@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlx5/core/en_accel/* F: drivers/net/ethernet/mellanox/mlx5/core/fpga/* @@@ -14636,7 -14535,7 +14636,7 @@@ M: Ido Schimmel <idosch@nvidia.com M: Petr Machata petrm@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlxsw/ F: tools/testing/selftests/drivers/net/mlxsw/ @@@ -14645,7 -14544,7 +14645,7 @@@ MELLANOX FIRMWARE FLASH LIBRARY (mlxfw M: mlxsw@nvidia.com L: netdev@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlxfw/
@@@ -14664,7 -14563,7 +14664,7 @@@ M: Tariq Toukan <tariqt@nvidia.com L: netdev@vger.kernel.org L: linux-rdma@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: drivers/net/ethernet/mellanox/mlx4/ F: include/linux/mlx4/ @@@ -14673,7 -14572,7 +14673,7 @@@ MELLANOX MLX4 IB drive M: Yishai Hadas yishaih@nvidia.com L: linux-rdma@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: http://patchwork.kernel.org/project/linux-rdma/list/ F: drivers/infiniband/hw/mlx4/ F: include/linux/mlx4/ @@@ -14686,7 -14585,7 +14686,7 @@@ M: Tariq Toukan <tariqt@nvidia.com L: netdev@vger.kernel.org L: linux-rdma@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: https://patchwork.kernel.org/project/netdevbpf/list/ F: Documentation/networking/device_drivers/ethernet/mellanox/ F: drivers/net/ethernet/mellanox/mlx5/core/ @@@ -14696,7 -14595,7 +14696,7 @@@ MELLANOX MLX5 IB drive M: Leon Romanovsky leonro@nvidia.com L: linux-rdma@vger.kernel.org S: Supported -W: http://www.mellanox.com +W: https://www.nvidia.com/networking/ Q: http://patchwork.kernel.org/project/linux-rdma/list/ F: drivers/infiniband/hw/mlx5/ F: include/linux/mlx5/ @@@ -14924,7 -14823,6 +14924,7 @@@ M: Alexander Duyck <alexanderduyck@fb.c M: Jakub Kicinski kuba@kernel.org R: kernel-team@meta.com S: Supported +F: Documentation/networking/device_drivers/ethernet/meta/ F: drivers/net/ethernet/meta/
METHODE UDPU SUPPORT @@@ -15071,13 -14969,6 +15071,13 @@@ L: netdev@vger.kernel.or S: Maintained F: drivers/net/ethernet/microchip/lan743x_*
+MICROCHIP LAN8650/1 10BASE-T1S MACPHY ETHERNET DRIVER +M: Parthiban Veerasooran parthiban.veerasooran@microchip.com +L: netdev@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/net/microchip,lan8650.yaml +F: drivers/net/ethernet/microchip/lan865x/lan865x.c + MICROCHIP LAN87xx/LAN937x T1 PHY DRIVER M: Arun Ramadoss arun.ramadoss@microchip.com R: UNGLinuxDriver@microchip.com @@@ -15327,12 -15218,6 +15327,12 @@@ S: Maintaine F: Documentation/hwmon/surface_fan.rst F: drivers/hwmon/surface_fan.c
+MICROSOFT SURFACE SENSOR THERMAL DRIVER +M: Maximilian Luz luzmaximilian@gmail.com +L: linux-hwmon@vger.kernel.org +S: Maintained +F: drivers/hwmon/surface_temp.c + MICROSOFT SURFACE GPE LID SUPPORT DRIVER M: Maximilian Luz luzmaximilian@gmail.com L: platform-driver-x86@vger.kernel.org @@@ -15585,9 -15470,6 +15585,9 @@@ F: include/dt-bindings/clock/mobileye,e
MODULE SUPPORT M: Luis Chamberlain mcgrof@kernel.org +R: Petr Pavlu petr.pavlu@suse.com +R: Sami Tolvanen samitolvanen@google.com +R: Daniel Gomez da.gomez@samsung.com L: linux-modules@vger.kernel.org L: linux-kernel@vger.kernel.org S: Maintained @@@ -15906,7 -15788,6 +15906,7 @@@ M: Breno Leitao <leitao@debian.org S: Maintained F: Documentation/networking/netconsole.rst F: drivers/net/netconsole.c +F: tools/testing/selftests/drivers/net/netcons_basic.sh
NETDEVSIM M: Jakub Kicinski kuba@kernel.org @@@ -16950,7 -16831,6 +16950,7 @@@ OMNIVISION OG01A1B SENSOR DRIVE M: Sakari Ailus sakari.ailus@linux.intel.com L: linux-media@vger.kernel.org S: Maintained +F: Documentation/devicetree/bindings/media/i2c/ovti,og01a1b.yaml F: drivers/media/i2c/og01a1b.c
OMNIVISION OV01A10 SENSOR DRIVER @@@ -17221,14 -17101,6 +17221,14 @@@ L: linux-rdma@vger.kernel.or S: Supported F: drivers/infiniband/ulp/opa_vnic
+OPEN ALLIANCE 10BASE-T1S MACPHY SERIAL INTERFACE FRAMEWORK +M: Parthiban Veerasooran parthiban.veerasooran@microchip.com +L: netdev@vger.kernel.org +S: Maintained +F: Documentation/networking/oa-tc6-framework.rst +F: drivers/include/linux/oa_tc6.h +F: drivers/net/ethernet/oa_tc6.c + OPEN FIRMWARE AND FLATTENED DEVICE TREE M: Rob Herring robh@kernel.org M: Saravana Kannan saravanak@google.com @@@ -17540,7 -17412,7 +17540,7 @@@ PCI DRIVER FOR ALTERA PCIE I M: Joyce Ooi joyce.ooi@intel.com L: linux-pci@vger.kernel.org S: Supported -F: Documentation/devicetree/bindings/pci/altera-pcie.txt +F: Documentation/devicetree/bindings/pci/altr,pcie-root-port.yaml F: drivers/pci/controller/pcie-altera.c
PCI DRIVER FOR APPLIEDMICRO XGENE @@@ -17772,7 -17644,7 +17772,7 @@@ PCI MSI DRIVER FOR ALTERA MSI I M: Joyce Ooi joyce.ooi@intel.com L: linux-pci@vger.kernel.org S: Supported -F: Documentation/devicetree/bindings/pci/altera-pcie-msi.txt +F: Documentation/devicetree/bindings/pci/altr,msi-controller.yaml F: drivers/pci/controller/pcie-altera-msi.c
PCI MSI DRIVER FOR APPLIEDMICRO XGENE @@@ -18535,7 -18407,6 +18535,7 @@@ L: netdev@vger.kernel.or S: Maintained F: Documentation/devicetree/bindings/net/pse-pd/ F: drivers/net/pse-pd/ +F: net/ethtool/pse-pd.c
PSTORE FILESYSTEM M: Kees Cook kees@kernel.org @@@ -18916,7 -18787,7 +18916,7 @@@ M: Bryan O'Donoghue <bryan.odonoghue@li L: linux-media@vger.kernel.org S: Maintained F: Documentation/admin-guide/media/qcom_camss.rst -F: Documentation/devicetree/bindings/media/*camss* +F: Documentation/devicetree/bindings/media/qcom,*camss* F: drivers/media/platform/qcom/camss/
QUALCOMM CLOCK DRIVERS @@@ -18931,6 -18802,7 +18931,6 @@@ F: include/dt-bindings/clock/qcom, QUALCOMM CLOUD AI (QAIC) DRIVER M: Jeffrey Hugo quic_jhugo@quicinc.com R: Carl Vanderlip quic_carlv@quicinc.com -R: Pranjal Ramajor Asha Kanojiya quic_pkanojiy@quicinc.com L: linux-arm-msm@vger.kernel.org L: dri-devel@lists.freedesktop.org S: Supported @@@ -19025,7 -18897,6 +19025,7 @@@ L: linux-arm-msm@vger.kernel.or S: Maintained F: Documentation/devicetree/bindings/interconnect/qcom,msm8998-bwmon.yaml F: drivers/soc/qcom/icc-bwmon.c +F: drivers/soc/qcom/trace_icc-bwmon.h
QUALCOMM IOMMU M: Rob Clark robdclark@gmail.com @@@ -19365,7 -19236,6 +19365,7 @@@ S: Supporte W: https://oss.oracle.com/projects/rds/ F: Documentation/networking/rds.rst F: net/rds/ +F: tools/testing/selftests/net/rds/
RDT - RESOURCE ALLOCATION M: Fenghua Yu fenghua.yu@intel.com @@@ -19825,10 -19695,12 +19825,10 @@@ L: linux-riscv@lists.infradead.or S: Maintained Q: https://patchwork.kernel.org/project/linux-riscv/list/ T: git https://git.kernel.org/pub/scm/linux/kernel/git/conor/linux.git/ -F: Documentation/devicetree/bindings/riscv/ -F: arch/riscv/boot/dts/ -X: arch/riscv/boot/dts/allwinner/ -X: arch/riscv/boot/dts/renesas/ -X: arch/riscv/boot/dts/sophgo/ -X: arch/riscv/boot/dts/thead/ +F: arch/riscv/boot/dts/canaan/ +F: arch/riscv/boot/dts/microchip/ +F: arch/riscv/boot/dts/sifive/ +F: arch/riscv/boot/dts/starfive/
RISC-V PMU DRIVERS M: Atish Patra atishp@atishpatra.org @@@ -19866,14 -19738,6 +19866,14 @@@ F: Documentation/ABI/*/sysfs-driver-hid F: drivers/hid/hid-roccat* F: include/linux/hid-roccat*
+ROCKCHIP CAN-FD DRIVER +M: Marc Kleine-Budde mkl@pengutronix.de +R: kernel@pengutronix.de +L: linux-can@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/net/can/rockchip,rk3568v2-canfd.yaml +F: drivers/net/can/rockchip/ + ROCKCHIP CRYPTO DRIVERS M: Corentin Labbe clabbe@baylibre.com L: linux-crypto@vger.kernel.org @@@ -19900,13 -19764,6 +19900,13 @@@ F: Documentation/userspace-api/media/v4 F: drivers/media/platform/rockchip/rkisp1 F: include/uapi/linux/rkisp1-config.h
+ROCKCHIP RK3568 RANDOM NUMBER GENERATOR SUPPORT +M: Daniel Golle daniel@makrotopia.org +M: Aurelien Jarno aurelien@aurel32.net +S: Maintained +F: Documentation/devicetree/bindings/rng/rockchip,rk3568-rng.yaml +F: drivers/char/hw_random/rockchip-rng.c + ROCKCHIP RASTER 2D GRAPHIC ACCELERATION UNIT DRIVER M: Jacob Chen jacob-chen@iotwrt.com M: Ezequiel Garcia ezequiel@vanguardiasur.com.ar @@@ -20023,26 -19880,12 +20023,26 @@@ T: git git://linuxtv.org/media_tree.gi F: Documentation/devicetree/bindings/media/allwinner,sun8i-a83t-de2-rotate.yaml F: drivers/media/platform/sunxi/sun8i-rotate/
+RPMB SUBSYSTEM +M: Jens Wiklander jens.wiklander@linaro.org +L: linux-kernel@vger.kernel.org +S: Supported +F: drivers/misc/rpmb-core.c +F: include/linux/rpmb.h + RPMSG TTY DRIVER M: Arnaud Pouliquen arnaud.pouliquen@foss.st.com L: linux-remoteproc@vger.kernel.org S: Maintained F: drivers/tty/rpmsg_tty.c
+RTASE ETHERNET DRIVER +M: Justin Lai justinlai0215@realtek.com +M: Larry Chiu larry.chiu@realtek.com +L: netdev@vger.kernel.org +S: Maintained +F: drivers/net/ethernet/realtek/rtase/ + RTL2830 MEDIA DRIVER L: linux-media@vger.kernel.org S: Orphan @@@ -20307,16 -20150,6 +20307,16 @@@ B: mailto:linux-samsung-soc@vger.kernel F: Documentation/devicetree/bindings/sound/samsung* F: sound/soc/samsung/
+SAMSUNG EXYNOS850 SoC SUPPORT +M: Sam Protsenko semen.protsenko@linaro.org +L: linux-arm-kernel@lists.infradead.org (moderated for non-subscribers) +L: linux-samsung-soc@vger.kernel.org +S: Maintained +F: Documentation/devicetree/bindings/clock/samsung,exynos850-clock.yaml +F: arch/arm64/boot/dts/exynos/exynos850* +F: drivers/clk/samsung/clk-exynos850.c +F: include/dt-bindings/clock/exynos850.h + SAMSUNG EXYNOS PSEUDO RANDOM NUMBER GENERATOR (RNG) DRIVER M: Krzysztof Kozlowski krzk@kernel.org L: linux-crypto@vger.kernel.org @@@ -21557,13 -21390,13 +21557,13 @@@ S: Maintaine F: tools/sound/dapm-graph
SOUND - SOUND OPEN FIRMWARE (SOF) DRIVERS -M: Pierre-Louis Bossart pierre-louis.bossart@linux.intel.com M: Liam Girdwood lgirdwood@gmail.com M: Peter Ujfalusi peter.ujfalusi@linux.intel.com M: Bard Liao yung-chuan.liao@linux.intel.com M: Ranjani Sridharan ranjani.sridharan@linux.intel.com M: Daniel Baluta daniel.baluta@nxp.com R: Kai Vehmanen kai.vehmanen@linux.intel.com +R: Pierre-Louis Bossart pierre-louis.bossart@linux.dev L: sound-open-firmware@alsa-project.org (moderated for non-subscribers) S: Supported W: https://github.com/thesofproject/linux/ @@@ -21572,7 -21405,7 +21572,7 @@@ F: sound/soc/sof SOUNDWIRE SUBSYSTEM M: Vinod Koul vkoul@kernel.org M: Bard Liao yung-chuan.liao@linux.intel.com -R: Pierre-Louis Bossart pierre-louis.bossart@linux.intel.com +R: Pierre-Louis Bossart pierre-louis.bossart@linux.dev R: Sanyog Kale sanyog.r.kale@intel.com L: alsa-devel@alsa-project.org (moderated for non-subscribers) S: Supported @@@ -21704,8 -21537,10 +21704,8 @@@ F: include/linux/spmi. F: include/trace/events/spmi.h
SPU FILE SYSTEM -M: Jeremy Kerr jk@ozlabs.org L: linuxppc-dev@lists.ozlabs.org -S: Supported -W: http://www.ibm.com/developerworks/power/cell/ +S: Orphan F: Documentation/filesystems/spufs/spufs.rst F: arch/powerpc/platforms/cell/spufs/
@@@ -22634,7 -22469,6 +22634,7 @@@ M: Jens Wiklander <jens.wiklander@linar R: Sumit Garg sumit.garg@linaro.org L: op-tee@lists.trustedfirmware.org S: Maintained +F: Documentation/ABI/testing/sysfs-class-tee F: Documentation/driver-api/tee.rst F: Documentation/tee/ F: Documentation/userspace-api/tee.rst @@@ -22680,7 -22514,6 +22680,7 @@@ M: Thierry Reding <thierry.reding@gmail R: Krishna Reddy vdumpa@nvidia.com L: linux-tegra@vger.kernel.org S: Supported +F: drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c F: drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c F: drivers/iommu/tegra*
@@@ -22786,11 -22619,12 +22786,11 @@@ F: Documentation/devicetree/bindings/so F: Documentation/devicetree/bindings/sound/ti,tas2562.yaml F: Documentation/devicetree/bindings/sound/ti,tas2770.yaml F: Documentation/devicetree/bindings/sound/ti,tas27xx.yaml +F: Documentation/devicetree/bindings/sound/ti,tpa6130a2.yaml F: Documentation/devicetree/bindings/sound/ti,pcm1681.yaml F: Documentation/devicetree/bindings/sound/ti,pcm3168a.yaml F: Documentation/devicetree/bindings/sound/ti,tlv320*.yaml F: Documentation/devicetree/bindings/sound/ti,tlv320adcx140.yaml -F: Documentation/devicetree/bindings/sound/tlv320aic31xx.txt -F: Documentation/devicetree/bindings/sound/tpa6130a2.txt F: include/sound/tas2*.h F: include/sound/tlv320*.h F: include/sound/tpa6130a2-plat.h @@@ -23368,7 -23202,6 +23368,7 @@@ Q: https://patchwork.kernel.org/project T: git git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd.git F: Documentation/devicetree/bindings/tpm/ F: drivers/char/tpm/ +F: tools/testing/selftests/tpm2/
TPS546D24 DRIVER M: Duke Du dukedu83@gmail.com @@@ -23381,8 -23214,9 +23381,8 @@@ TQ SYSTEMS BOARD & DRIVER SUPPOR L: linux@ew.tq-group.com S: Supported W: https://www.tq-group.com/en/products/tq-embedded/ -F: arch/arm/boot/dts/imx*mba*.dts* -F: arch/arm/boot/dts/imx*tqma*.dts* -F: arch/arm/boot/dts/mba*.dtsi +F: arch/arm/boot/dts/nxp/imx/*mba*.dts* +F: arch/arm/boot/dts/nxp/imx/*tqma*.dts* F: arch/arm64/boot/dts/freescale/fsl-*tqml*.dts* F: arch/arm64/boot/dts/freescale/imx*mba*.dts* F: arch/arm64/boot/dts/freescale/imx*tqma*.dts* @@@ -24590,20 -24424,6 +24590,20 @@@ F: include/uapi/linux/vsockmon. F: net/vmw_vsock/ F: tools/testing/vsock/
+VMA +M: Andrew Morton akpm@linux-foundation.org +R: Liam R. Howlett Liam.Howlett@oracle.com +R: Vlastimil Babka vbabka@suse.cz +R: Lorenzo Stoakes lorenzo.stoakes@oracle.com +L: linux-mm@kvack.org +S: Maintained +W: https://www.linux-mm.org +T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm +F: mm/vma.c +F: mm/vma.h +F: mm/vma_internal.h +F: tools/testing/vma/ + VMALLOC M: Andrew Morton akpm@linux-foundation.org R: Uladzislau Rezki urezki@gmail.com @@@ -24993,6 -24813,17 +24993,17 @@@ T: git git://git.kernel.org/pub/scm/lin F: Documentation/arch/x86/ F: Documentation/devicetree/bindings/x86/ F: arch/x86/ + F: tools/testing/selftests/x86 + + X86 CPUID DATABASE + M: Borislav Petkov bp@alien8.de + M: Thomas Gleixner tglx@linutronix.de + M: x86@kernel.org + R: Ahmed S. Darwish darwi@linutronix.de + L: x86-cpuid@lists.linux.dev + S: Maintained + W: https://x86-cpuid.org + F: tools/arch/x86/kcpuid/cpuid.csv
X86 ENTRY CODE M: Andy Lutomirski luto@kernel.org @@@ -25435,19 -25266,6 +25446,19 @@@ S: Maintaine F: drivers/spi/spi-xtensa-xtfpga.c F: sound/soc/xtensa/xtfpga-i2s.c
+XZ EMBEDDED +M: Lasse Collin lasse.collin@tukaani.org +S: Maintained +W: https://tukaani.org/xz/embedded.html +B: https://github.com/tukaani-project/xz-embedded/issues +C: irc://irc.libera.chat/tukaani +F: Documentation/staging/xz.rst +F: include/linux/decompress/unxz.h +F: include/linux/xz.h +F: lib/decompress_unxz.c +F: lib/xz/ +F: scripts/xz_wrap.sh + YAM DRIVER FOR AX.25 M: Jean-Paul Roubelat jpr@f6fbb.org L: linux-hams@vger.kernel.org @@@ -25472,6 -25290,7 +25483,6 @@@ F: tools/net/ynl
YEALINK PHONE DRIVER M: Henk Vergonet Henk.Vergonet@gmail.com -L: usbb2k-api-dev@nongnu.org S: Maintained F: Documentation/input/devices/yealink.rst F: drivers/input/misc/yealink.* diff --combined arch/arm64/Kconfig index 3943898f62c93,e68ea648e085b..a77453de94bc5 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@@ -24,7 -24,6 +24,7 @@@ config ARM6 select ARCH_HAS_CURRENT_STACK_POINTER select ARCH_HAS_DEBUG_VIRTUAL select ARCH_HAS_DEBUG_VM_PGTABLE + select ARCH_HAS_DMA_OPS if XEN select ARCH_HAS_DMA_PREP_COHERENT select ARCH_HAS_ACPI_TABLE_UPGRADE if ACPI select ARCH_HAS_FAST_MULTIPLIER @@@ -35,7 -34,6 +35,7 @@@ select ARCH_HAS_KERNEL_FPU_SUPPORT if KERNEL_MODE_NEON select ARCH_HAS_KEEPINITRD select ARCH_HAS_MEMBARRIER_SYNC_CORE + select ARCH_HAS_MEM_ENCRYPT select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_PTE_DEVMAP @@@ -101,7 -99,7 +101,8 @@@ select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_SUPPORTS_PAGE_TABLE_CHECK select ARCH_SUPPORTS_PER_VMA_LOCK + select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE + select ARCH_SUPPORTS_RT select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT select ARCH_WANT_DEFAULT_BPF_JIT @@@ -426,7 -424,7 +427,7 @@@ config AMPERE_ERRATUM_AC03_CPU_3 default y help This option adds an alternative code sequence to work around Ampere - erratum AC03_CPU_38 on AmpereOne. + errata AC03_CPU_38 and AC04_CPU_10 on AmpereOne.
The affected design reports FEAT_HAFDBS as not implemented in ID_AA64MMFR1_EL1.HAFDBS, but (V)TCR_ELx.{HA,HD} are not RES0 @@@ -2103,8 -2101,7 +2104,8 @@@ config ARM64_MT depends on ARM64_PAN select ARCH_HAS_SUBPAGE_FAULTS select ARCH_USES_HIGH_VMA_FLAGS - select ARCH_USES_PG_ARCH_X + select ARCH_USES_PG_ARCH_2 + select ARCH_USES_PG_ARCH_3 help Memory Tagging (part of the ARMv8.5 Extensions) provides architectural support for run-time, always-on detection of @@@ -2141,29 -2138,6 +2142,29 @@@ config ARM64_EPA if the cpu does not implement the feature. endmenu # "ARMv8.7 architectural features"
+menu "ARMv8.9 architectural features" + +config ARM64_POE + prompt "Permission Overlay Extension" + def_bool y + select ARCH_USES_HIGH_VMA_FLAGS + select ARCH_HAS_PKEYS + help + The Permission Overlay Extension is used to implement Memory + Protection Keys. Memory Protection Keys provides a mechanism for + enforcing page-based protections, but without requiring modification + of the page tables when an application changes protection domains. + + For details, see Documentation/core-api/protection-keys.rst + + If unsure, say y. + +config ARCH_PKEY_BITS + int + default 3 + +endmenu # "ARMv8.9 architectural features" + config ARM64_SVE bool "ARM Scalable Vector Extension support" default y diff --combined arch/loongarch/include/asm/irq.h index ce85d4c7d225d,9c2ca785faa9b..a0ca84da8541d --- a/arch/loongarch/include/asm/irq.h +++ b/arch/loongarch/include/asm/irq.h @@@ -39,11 -39,22 +39,22 @@@ void spurious_interrupt(void)
#define NR_IRQS_LEGACY 16
+ /* + * 256 Vectors Mapping for AVECINTC: + * + * 0 - 15: Mapping classic IPs, e.g. IP0-12. + * 16 - 255: Mapping vectors for external IRQ. + * + */ + #define NR_VECTORS 256 + #define NR_LEGACY_VECTORS 16 + #define IRQ_MATRIX_BITS NR_VECTORS + #define arch_trigger_cpumask_backtrace arch_trigger_cpumask_backtrace void arch_trigger_cpumask_backtrace(const struct cpumask *mask, int exclude_cpu);
#define MAX_IO_PICS 2 - #define NR_IRQS (64 + (256 * MAX_IO_PICS)) + #define NR_IRQS (64 + NR_VECTORS * (NR_CPUS + MAX_IO_PICS))
struct acpi_vector_group { int node; @@@ -54,7 -65,6 +65,7 @@@ extern struct acpi_vector_group pch_gro extern struct acpi_vector_group msi_group[MAX_IO_PICS];
#define CORES_PER_EIO_NODE 4 +#define CORES_PER_VEIO_NODE 256
#define LOONGSON_CPU_UART0_VEC 10 /* CPU UART0 */ #define LOONGSON_CPU_THSENS_VEC 14 /* CPU Thsens */ @@@ -66,7 -76,7 +77,7 @@@ #define LOONGSON_LPC_LAST_IRQ (LOONGSON_LPC_IRQ_BASE + 15)
#define LOONGSON_CPU_IRQ_BASE 16 - #define LOONGSON_CPU_LAST_IRQ (LOONGSON_CPU_IRQ_BASE + 14) + #define LOONGSON_CPU_LAST_IRQ (LOONGSON_CPU_IRQ_BASE + 15)
#define LOONGSON_PCH_IRQ_BASE 64 #define LOONGSON_PCH_ACPI_IRQ (LOONGSON_PCH_IRQ_BASE + 47) @@@ -89,20 -99,8 +100,8 @@@ struct acpi_madt_bio_pic struct acpi_madt_msi_pic; struct acpi_madt_lpc_pic;
- int liointc_acpi_init(struct irq_domain *parent, - struct acpi_madt_lio_pic *acpi_liointc); - int eiointc_acpi_init(struct irq_domain *parent, - struct acpi_madt_eio_pic *acpi_eiointc); - - int htvec_acpi_init(struct irq_domain *parent, - struct acpi_madt_ht_pic *acpi_htvec); - int pch_lpc_acpi_init(struct irq_domain *parent, - struct acpi_madt_lpc_pic *acpi_pchlpc); - int pch_msi_acpi_init(struct irq_domain *parent, - struct acpi_madt_msi_pic *acpi_pchmsi); - int pch_pic_acpi_init(struct irq_domain *parent, - struct acpi_madt_bio_pic *acpi_pchpic); - int find_pch_pic(u32 gsi); + void complete_irq_moving(void); + struct fwnode_handle *get_pch_msi_handle(int pci_segment);
extern struct acpi_madt_lio_pic *acpi_liointc; diff --combined arch/loongarch/include/asm/loongarch.h index 24a3f4925cfb2,631d249b3ef26..04bf1a7f903a2 --- a/arch/loongarch/include/asm/loongarch.h +++ b/arch/loongarch/include/asm/loongarch.h @@@ -119,7 -119,6 +119,7 @@@ #define CPUCFG6_PMP BIT(0) #define CPUCFG6_PAMVER GENMASK(3, 1) #define CPUCFG6_PMNUM GENMASK(7, 4) +#define CPUCFG6_PMNUM_SHIFT 4 #define CPUCFG6_PMBITS GENMASK(13, 8) #define CPUCFG6_UPM BIT(14)
@@@ -161,8 -160,16 +161,8 @@@
/* * CPUCFG index area: 0x40000000 -- 0x400000ff - * SW emulation for KVM hypervirsor + * SW emulation for KVM hypervirsor, see arch/loongarch/include/uapi/asm/kvm_para.h */ -#define CPUCFG_KVM_BASE 0x40000000 -#define CPUCFG_KVM_SIZE 0x100 - -#define CPUCFG_KVM_SIG (CPUCFG_KVM_BASE + 0) -#define KVM_SIGNATURE "KVM\0" -#define CPUCFG_KVM_FEATURE (CPUCFG_KVM_BASE + 4) -#define KVM_FEATURE_IPI BIT(1) -#define KVM_FEATURE_STEAL_TIME BIT(2)
#ifndef __ASSEMBLY__
@@@ -246,8 -253,8 +246,8 @@@ #define CSR_ESTAT_EXC_WIDTH 6 #define CSR_ESTAT_EXC (_ULCAST_(0x3f) << CSR_ESTAT_EXC_SHIFT) #define CSR_ESTAT_IS_SHIFT 0 - #define CSR_ESTAT_IS_WIDTH 14 - #define CSR_ESTAT_IS (_ULCAST_(0x3fff) << CSR_ESTAT_IS_SHIFT) + #define CSR_ESTAT_IS_WIDTH 15 + #define CSR_ESTAT_IS (_ULCAST_(0x7fff) << CSR_ESTAT_IS_SHIFT)
#define LOONGARCH_CSR_ERA 0x6 /* ERA */
@@@ -642,6 -649,13 +642,13 @@@
#define LOONGARCH_CSR_CTAG 0x98 /* TagLo + TagHi */
+ #define LOONGARCH_CSR_ISR0 0xa0 + #define LOONGARCH_CSR_ISR1 0xa1 + #define LOONGARCH_CSR_ISR2 0xa2 + #define LOONGARCH_CSR_ISR3 0xa3 + + #define LOONGARCH_CSR_IRR 0xa4 + #define LOONGARCH_CSR_PRID 0xc0
/* Shadow MCSR : 0xc0 ~ 0xff */ @@@ -1004,7 -1018,7 +1011,7 @@@ /* * CSR_ECFG IM */ - #define ECFG0_IM 0x00001fff + #define ECFG0_IM 0x00005fff #define ECFGB_SIP0 0 #define ECFGF_SIP0 (_ULCAST_(1) << ECFGB_SIP0) #define ECFGB_SIP1 1 @@@ -1047,6 -1061,7 +1054,7 @@@ #define IOCSRF_EIODECODE BIT_ULL(9) #define IOCSRF_FLATMODE BIT_ULL(10) #define IOCSRF_VM BIT_ULL(11) + #define IOCSRF_AVEC BIT_ULL(15)
#define LOONGARCH_IOCSR_VENDOR 0x10
@@@ -1058,6 -1073,7 +1066,7 @@@ #define IOCSR_MISC_FUNC_SOFT_INT BIT_ULL(10) #define IOCSR_MISC_FUNC_TIMER_RESET BIT_ULL(21) #define IOCSR_MISC_FUNC_EXT_IOI_EN BIT_ULL(48) + #define IOCSR_MISC_FUNC_AVEC_EN BIT_ULL(51)
#define LOONGARCH_IOCSR_CPUTEMP 0x428
@@@ -1380,9 -1396,10 +1389,10 @@@ __BUILD_CSR_OP(tlbidx #define INT_TI 11 /* Timer */ #define INT_IPI 12 #define INT_NMI 13 + #define INT_AVEC 14
/* ExcCodes corresponding to interrupts */ - #define EXCCODE_INT_NUM (INT_NMI + 1) + #define EXCCODE_INT_NUM (INT_AVEC + 1) #define EXCCODE_INT_START 64 #define EXCCODE_INT_END (EXCCODE_INT_START + EXCCODE_INT_NUM - 1)
diff --combined arch/loongarch/kernel/paravirt.c index 708eda025ed88,4d736a4e488dd..a5fc61f8b3482 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@@ -13,7 -13,6 +13,7 @@@ static int has_steal_clock struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); +DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
static u64 native_steal_clock(int cpu) { @@@ -135,6 -134,11 +135,11 @@@ static irqreturn_t pv_ipi_interrupt(in info->ipi_irqs[IPI_IRQ_WORK]++; }
+ if (action & SMP_CLEAR_VECTOR) { + complete_irq_moving(); + info->ipi_irqs[IPI_CLEAR_VECTOR]++; + } + return IRQ_HANDLED; }
@@@ -152,14 -156,11 +157,14 @@@ static void pv_init_ipi(void } #endif
-static bool kvm_para_available(void) +bool kvm_para_available(void) { int config; static int hypervisor_type;
+ if (!cpu_has_hypervisor) + return false; + if (!hypervisor_type) { config = read_cpucfg(CPUCFG_KVM_SIG); if (!memcmp(&config, KVM_SIGNATURE, 4)) @@@ -169,22 -170,17 +174,22 @@@ return hypervisor_type == HYPERVISOR_KVM; }
-int __init pv_ipi_init(void) +unsigned int kvm_arch_para_features(void) { - int feature; + static unsigned int feature;
- if (!cpu_has_hypervisor) - return 0; if (!kvm_para_available()) return 0;
- feature = read_cpucfg(CPUCFG_KVM_FEATURE); - if (!(feature & KVM_FEATURE_IPI)) + if (!feature) + feature = read_cpucfg(CPUCFG_KVM_FEATURE); + + return feature; +} + +int __init pv_ipi_init(void) +{ + if (!kvm_para_has_feature(KVM_FEATURE_IPI)) return 0;
#ifdef CONFIG_SMP @@@ -215,7 -211,7 +220,7 @@@ static int pv_enable_steal_time(void }
addr |= KVM_STEAL_PHYS_VALID; - kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, addr); + kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, BIT(KVM_FEATURE_STEAL_TIME), addr);
return 0; } @@@ -223,7 -219,7 +228,7 @@@ static void pv_disable_steal_time(void) { if (has_steal_clock) - kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, 0); + kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, BIT(KVM_FEATURE_STEAL_TIME), 0); }
#ifdef CONFIG_SMP @@@ -267,9 -263,15 +272,9 @@@ static struct notifier_block pv_reboot_
int __init pv_time_init(void) { - int r, feature; + int r;
- if (!cpu_has_hypervisor) - return 0; - if (!kvm_para_available()) - return 0; - - feature = read_cpucfg(CPUCFG_KVM_FEATURE); - if (!(feature & KVM_FEATURE_STEAL_TIME)) + if (!kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) return 0;
has_steal_clock = 1; @@@ -303,13 -305,3 +308,13 @@@
return 0; } + +int __init pv_spinlock_init(void) +{ + if (!cpu_has_hypervisor) + return 0; + + static_branch_enable(&virt_spin_lock_key); + + return 0; +} diff --combined arch/loongarch/kernel/smp.c index 482b3c7e3042d,9787871fffa08..9afc2d8b34141 --- a/arch/loongarch/kernel/smp.c +++ b/arch/loongarch/kernel/smp.c @@@ -72,6 -72,7 +72,7 @@@ static const char *ipi_types[NR_IPI] __ [IPI_RESCHEDULE] = "Rescheduling interrupts", [IPI_CALL_FUNCTION] = "Function call interrupts", [IPI_IRQ_WORK] = "IRQ work interrupts", + [IPI_CLEAR_VECTOR] = "Clear vector interrupts", };
void show_ipi_list(struct seq_file *p, int prec) @@@ -248,6 -249,11 +249,11 @@@ static irqreturn_t loongson_ipi_interru per_cpu(irq_stat, cpu).ipi_irqs[IPI_IRQ_WORK]++; }
+ if (action & SMP_CLEAR_VECTOR) { + complete_irq_moving(); + per_cpu(irq_stat, cpu).ipi_irqs[IPI_CLEAR_VECTOR]++; + } + return IRQ_HANDLED; }
@@@ -509,8 -515,6 +515,8 @@@ void __init smp_prepare_boot_cpu(void rr_node = next_node_in(rr_node, node_online_map); } } + + pv_spinlock_init(); }
/* called from main before smp_init() */ diff --combined arch/riscv/Kconfig index 801ee681059ca,3c78edf8e5b90..a9be4dca380dc --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@@ -13,7 -13,6 +13,7 @@@ config 32BI config RISCV def_bool y select ACPI_GENERIC_GSI if ACPI + select ACPI_MCFG if (ACPI && PCI) select ACPI_PPTT if ACPI select ACPI_REDUCED_HARDWARE_ONLY if ACPI select ACPI_SPCR_TABLE if ACPI @@@ -65,6 -64,7 +65,7 @@@ select ARCH_SUPPORTS_LTO_CLANG_THIN if LLD_VERSION >= 140000 select ARCH_SUPPORTS_PAGE_TABLE_CHECK if MMU select ARCH_SUPPORTS_PER_VMA_LOCK if MMU + select ARCH_SUPPORTS_RT select ARCH_SUPPORTS_SHADOW_CALL_STACK if HAVE_SHADOW_CALL_STACK select ARCH_USE_CMPXCHG_LOCKREF if 64BIT select ARCH_USE_MEMTEST @@@ -93,7 -93,6 +94,7 @@@ select GENERIC_ATOMIC64 if !64BIT select GENERIC_CLOCKEVENTS_BROADCAST if SMP select GENERIC_CPU_DEVICES + select GENERIC_CPU_VULNERABILITIES select GENERIC_EARLY_IOREMAP select GENERIC_ENTRY select GENERIC_GETTIMEOFDAY if HAVE_GENERIC_VDSO @@@ -158,7 -157,6 +159,7 @@@ select HAVE_KERNEL_LZO if !XIP_KERNEL && !EFI_ZBOOT select HAVE_KERNEL_UNCOMPRESSED if !XIP_KERNEL && !EFI_ZBOOT select HAVE_KERNEL_ZSTD if !XIP_KERNEL && !EFI_ZBOOT + select HAVE_KERNEL_XZ if !XIP_KERNEL && !EFI_ZBOOT select HAVE_KPROBES if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 @@@ -191,7 -189,6 +192,7 @@@ select OF_EARLY_FLATTREE select OF_IRQ select PCI_DOMAINS_GENERIC if PCI + select PCI_ECAM if (ACPI && PCI) select PCI_MSI if PCI select RISCV_ALTERNATIVE if !XIP_KERNEL select RISCV_APLIC diff --combined arch/x86/Kconfig index 47a2ff9096dad,d45d22fa83f7c..a8cf61c52b063 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@@ -28,7 -28,6 +28,7 @@@ config X86_6 select ARCH_HAS_GIGANTIC_PAGE select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 select ARCH_SUPPORTS_PER_VMA_LOCK + select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE select HAVE_ARCH_SOFT_DIRTY select MODULES_USE_ELF_RELA select NEED_DMA_MAP_STATE @@@ -80,7 -79,6 +80,7 @@@ config X8 select ARCH_HAS_DEBUG_VIRTUAL select ARCH_HAS_DEBUG_VM_PGTABLE if !X86_PAE select ARCH_HAS_DEVMEM_IS_ALLOWED + select ARCH_HAS_DMA_OPS if GART_IOMMU || XEN select ARCH_HAS_EARLY_DEBUG if KGDB select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_FAST_MULTIPLIER @@@ -109,6 -107,7 +109,7 @@@ select ARCH_HAS_DEBUG_WX select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_NMI_SAFE_CMPXCHG + select ARCH_HAVE_EXTRA_ELF_NOTES select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT @@@ -124,6 -123,7 +125,7 @@@ select ARCH_USES_CFI_TRAPS if X86_64 && CFI_CLANG select ARCH_SUPPORTS_LTO_CLANG select ARCH_SUPPORTS_LTO_CLANG_THIN + select ARCH_SUPPORTS_RT select ARCH_USE_BUILTIN_BSWAP select ARCH_USE_CMPXCHG_LOCKREF if X86_CMPXCHG64 select ARCH_USE_MEMTEST @@@ -298,7 -298,6 +300,7 @@@ select NEED_PER_CPU_EMBED_FIRST_CHUNK select NEED_PER_CPU_PAGE_FIRST_CHUNK select NEED_SG_DMA_LENGTH + select NUMA_MEMBLKS if NUMA select PCI_DOMAINS if PCI select PCI_LOCKLESS_CONFIG if PCI select PERF_EVENTS @@@ -946,6 -945,7 +948,6 @@@ config DM
config GART_IOMMU bool "Old AMD GART IOMMU support" - select DMA_OPS select IOMMU_HELPER select SWIOTLB depends on X86_64 && PCI && AMD_NB @@@ -1601,6 -1601,14 +1603,6 @@@ config X86_64_ACPI_NUM help Enable ACPI SRAT based node topology detection.
-config NUMA_EMU - bool "NUMA emulation" - depends on NUMA - help - Enable NUMA emulation. A flat machine will be split - into virtual nodes when booted with "numa=fake=N", where N is the - number of nodes. This is only useful for debugging. - config NODES_SHIFT int "Maximum NUMA Nodes (as a power of 2)" if !MAXSMP range 1 10 @@@ -1800,7 -1808,6 +1802,7 @@@ config X86_PA def_bool y prompt "x86 PAT support" if EXPERT depends on MTRR + select ARCH_USES_PG_ARCH_2 help Use PAT attributes to setup page level cache control.
@@@ -1812,6 -1819,10 +1814,6 @@@
If unsure, say Y.
-config ARCH_USES_PG_UNCACHED - def_bool y - depends on X86_PAT - config X86_UMIP def_bool y prompt "User Mode Instruction Prevention" if EXPERT @@@ -1880,10 -1891,6 +1882,10 @@@ config X86_INTEL_MEMORY_PROTECTION_KEY
If unsure, say y.
+config ARCH_PKEY_BITS + int + default 4 + choice prompt "TSX enable mode" depends on CPU_SUP_INTEL @@@ -2421,6 -2428,14 +2423,14 @@@ config CFI_AUTO_DEFAUL
source "kernel/livepatch/Kconfig"
+ config X86_BUS_LOCK_DETECT + bool "Split Lock Detect and Bus Lock Detect support" + depends on CPU_SUP_INTEL || CPU_SUP_AMD + default y + help + Enable Split Lock Detect and Bus Lock Detect functionalities. + See file:Documentation/arch/x86/buslock.rst for more information. + endmenu
config CC_HAS_NAMED_AS @@@ -2605,24 -2620,15 +2615,15 @@@ config MITIGATION_SL against straight line speculation. The kernel image might be slightly larger.
- config MITIGATION_GDS_FORCE - bool "Force GDS Mitigation" + config MITIGATION_GDS + bool "Mitigate Gather Data Sampling" depends on CPU_SUP_INTEL - default n + default y help - Gather Data Sampling (GDS) is a hardware vulnerability which allows - unprivileged speculative access to data which was previously stored in - vector registers. - - This option is equivalent to setting gather_data_sampling=force on the - command line. The microcode mitigation is used if present, otherwise - AVX is disabled as a mitigation. On affected systems that are missing - the microcode any userspace code that unconditionally uses AVX will - break with this option set. - - Setting this option on systems not vulnerable to GDS has no effect. - - If in doubt, say N. + Enable mitigation for Gather Data Sampling (GDS). GDS is a hardware + vulnerability which allows unprivileged speculative access to data + which was previously stored in vector registers. The attacker uses gather + instructions to infer the stale vector register data.
config MITIGATION_RFDS bool "RFDS Mitigation" @@@ -2645,6 -2651,107 +2646,107 @@@ config MITIGATION_SPECTRE_BH indirect branches. See file:Documentation/admin-guide/hw-vuln/spectre.rst
+ config MITIGATION_MDS + bool "Mitigate Microarchitectural Data Sampling (MDS) hardware bug" + depends on CPU_SUP_INTEL + default y + help + Enable mitigation for Microarchitectural Data Sampling (MDS). MDS is + a hardware vulnerability which allows unprivileged speculative access + to data which is available in various CPU internal buffers. + See also file:Documentation/admin-guide/hw-vuln/mds.rst + + config MITIGATION_TAA + bool "Mitigate TSX Asynchronous Abort (TAA) hardware bug" + depends on CPU_SUP_INTEL + default y + help + Enable mitigation for TSX Asynchronous Abort (TAA). TAA is a hardware + vulnerability that allows unprivileged speculative access to data + which is available in various CPU internal buffers by using + asynchronous aborts within an Intel TSX transactional region. + See also file:Documentation/admin-guide/hw-vuln/tsx_async_abort.rst + + config MITIGATION_MMIO_STALE_DATA + bool "Mitigate MMIO Stale Data hardware bug" + depends on CPU_SUP_INTEL + default y + help + Enable mitigation for MMIO Stale Data hardware bugs. Processor MMIO + Stale Data Vulnerabilities are a class of memory-mapped I/O (MMIO) + vulnerabilities that can expose data. The vulnerabilities require the + attacker to have access to MMIO. + See also + file:Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst + + config MITIGATION_L1TF + bool "Mitigate L1 Terminal Fault (L1TF) hardware bug" + depends on CPU_SUP_INTEL + default y + help + Mitigate L1 Terminal Fault (L1TF) hardware bug. L1 Terminal Fault is a + hardware vulnerability which allows unprivileged speculative access to data + available in the Level 1 Data Cache. + See <file:Documentation/admin-guide/hw-vuln/l1tf.rst + + config MITIGATION_RETBLEED + bool "Mitigate RETBleed hardware bug" + depends on (CPU_SUP_INTEL && MITIGATION_SPECTRE_V2) || MITIGATION_UNRET_ENTRY || MITIGATION_IBPB_ENTRY + default y + help + Enable mitigation for RETBleed (Arbitrary Speculative Code Execution + with Return Instructions) vulnerability. RETBleed is a speculative + execution attack which takes advantage of microarchitectural behavior + in many modern microprocessors, similar to Spectre v2. An + unprivileged attacker can use these flaws to bypass conventional + memory security restrictions to gain read access to privileged memory + that would otherwise be inaccessible. + + config MITIGATION_SPECTRE_V1 + bool "Mitigate SPECTRE V1 hardware bug" + default y + help + Enable mitigation for Spectre V1 (Bounds Check Bypass). Spectre V1 is a + class of side channel attacks that takes advantage of speculative + execution that bypasses conditional branch instructions used for + memory access bounds check. + See also file:Documentation/admin-guide/hw-vuln/spectre.rst + + config MITIGATION_SPECTRE_V2 + bool "Mitigate SPECTRE V2 hardware bug" + default y + help + Enable mitigation for Spectre V2 (Branch Target Injection). Spectre + V2 is a class of side channel attacks that takes advantage of + indirect branch predictors inside the processor. In Spectre variant 2 + attacks, the attacker can steer speculative indirect branches in the + victim to gadget code by poisoning the branch target buffer of a CPU + used for predicting indirect branch addresses. + See also file:Documentation/admin-guide/hw-vuln/spectre.rst + + config MITIGATION_SRBDS + bool "Mitigate Special Register Buffer Data Sampling (SRBDS) hardware bug" + depends on CPU_SUP_INTEL + default y + help + Enable mitigation for Special Register Buffer Data Sampling (SRBDS). 
+ SRBDS is a hardware vulnerability that allows Microarchitectural Data + Sampling (MDS) techniques to infer values returned from special + register accesses. An unprivileged user can extract values returned + from RDRAND and RDSEED executed on another core or sibling thread + using MDS techniques. + See also + file:Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst + + config MITIGATION_SSB + bool "Mitigate Speculative Store Bypass (SSB) hardware bug" + default y + help + Enable mitigation for Speculative Store Bypass (SSB). SSB is a + hardware security vulnerability and its exploitation takes advantage + of speculative execution in a similar way to the Meltdown and Spectre + security vulnerabilities. + endif
config ARCH_HAS_ADD_PAGES diff --combined arch/x86/include/asm/mmu_context.h index 80f2a3187aa66,19091ebb86338..2886cb668d7fa --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@@ -88,7 -88,13 +88,13 @@@ static inline void switch_ldt(struct mm #ifdef CONFIG_ADDRESS_MASKING static inline unsigned long mm_lam_cr3_mask(struct mm_struct *mm) { - return mm->context.lam_cr3_mask; + /* + * When switch_mm_irqs_off() is called for a kthread, it may race with + * LAM enablement. switch_mm_irqs_off() uses the LAM mask to do two + * things: populate CR3 and populate 'cpu_tlbstate.lam'. Make sure it + * reads a single value for both. + */ + return READ_ONCE(mm->context.lam_cr3_mask); }
static inline void dup_lam(struct mm_struct *oldmm, struct mm_struct *mm) @@@ -232,6 -238,11 +238,6 @@@ static inline bool is_64bit_mm(struct m } #endif
-static inline void arch_unmap(struct mm_struct *mm, unsigned long start, - unsigned long end) -{ -} - /* * We only want to enforce protection keys on the current process * because we effectively have no access to PKRU for other diff --combined arch/x86/include/asm/processor.h index 775acbdea1a96,399f7d1c4c61f..4a686f0e5dbf6 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@@ -582,7 -582,8 +582,8 @@@ extern void switch_gdt_and_percpu_base( extern void load_direct_gdt(int); extern void load_fixmap_gdt(int); extern void cpu_init(void); - extern void cpu_init_exception_handling(void); + extern void cpu_init_exception_handling(bool boot_cpu); + extern void cpu_init_replace_early_idt(void); extern void cr4_init(void);
extern void set_task_blockstep(struct task_struct *task, bool on); @@@ -691,6 -692,8 +692,6 @@@ static inline u32 per_cpu_l2c_id(unsign }
#ifdef CONFIG_CPU_SUP_AMD -extern u32 amd_get_highest_perf(void); - /* * Issue a DIV 0/1 insn to clear any division data from previous DIV * operations. @@@ -703,6 -706,7 +704,6 @@@ static __always_inline void amd_clear_d
extern void amd_check_microcode(void); #else -static inline u32 amd_get_highest_perf(void) { return 0; } static inline void amd_clear_divider(void) { } static inline void amd_check_microcode(void) { } #endif diff --combined arch/x86/kernel/cpu/sgx/main.c index d01deb3863955,3a79105455f1d..9ace84486499b --- a/arch/x86/kernel/cpu/sgx/main.c +++ b/arch/x86/kernel/cpu/sgx/main.c @@@ -475,24 -475,25 +475,25 @@@ struct sgx_epc_page *__sgx_alloc_epc_pa { struct sgx_epc_page *page; int nid_of_current = numa_node_id(); - int nid = nid_of_current; + int nid_start, nid;
- if (node_isset(nid_of_current, sgx_numa_mask)) { - page = __sgx_alloc_epc_page_from_node(nid_of_current); - if (page) - return page; - } - - /* Fall back to the non-local NUMA nodes: */ - while (true) { - nid = next_node_in(nid, sgx_numa_mask); - if (nid == nid_of_current) - break; + /* + * Try local node first. If it doesn't have an EPC section, + * fall back to the non-local NUMA nodes. + */ + if (node_isset(nid_of_current, sgx_numa_mask)) + nid_start = nid_of_current; + else + nid_start = next_node_in(nid_of_current, sgx_numa_mask);
+ nid = nid_start; + do { page = __sgx_alloc_epc_page_from_node(nid); if (page) return page; - } + + nid = next_node_in(nid, sgx_numa_mask); + } while (nid != nid_start);
return ERR_PTR(-ENOMEM); } @@@ -732,7 -733,7 +733,7 @@@ out return 0; }
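[Aside] The reworked EPC allocation above reads as "start at the local node if it has EPC, otherwise at the next node in the mask, then walk the mask round-robin until we are back at the start". A small userspace sketch of that traversal, assuming a plain boolean array in place of sgx_numa_mask (hypothetical helpers, not kernel code):

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4

/* Stand-in for sgx_numa_mask: which nodes have an EPC section. */
static const bool node_has_epc[NR_NODES] = { false, true, false, true };

/* Stand-in for next_node_in(): next set node after @nid, wrapping around. */
static int next_node(int nid)
{
	for (int i = 1; i <= NR_NODES; i++) {
		int cand = (nid + i) % NR_NODES;

		if (node_has_epc[cand])
			return cand;
	}
	return nid;	/* only one candidate node in the mask */
}

static void alloc_sketch(int nid_of_current)
{
	int nid_start = node_has_epc[nid_of_current] ?
			nid_of_current : next_node(nid_of_current);
	int nid = nid_start;

	do {
		printf("trying node %d\n", nid);	/* allocation attempt */
		nid = next_node(nid);
	} while (nid != nid_start);			/* each node tried once */
}

int main(void)
{
	alloc_sketch(0);	/* prints: trying node 1, trying node 3 */
	return 0;
}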
- /** + /* * A section metric is concatenated in a way that @low bits 12-31 define the * bits 12-31 of the metric and @high bits 0-19 define the bits 32-51 of the * metric. @@@ -847,6 -848,13 +848,13 @@@ static bool __init sgx_page_cache_init( return false; }
+ for_each_online_node(nid) { + if (!node_isset(nid, sgx_numa_mask) && + node_state(nid, N_MEMORY) && node_state(nid, N_CPU)) + pr_info("node%d has both CPUs and memory but doesn't have an EPC section\n", + nid); + } + return true; }
@@@ -895,10 -903,10 +903,10 @@@ int sgx_set_attribute(unsigned long *al { struct fd f = fdget(attribute_fd);
- if (!f.file) + if (!fd_file(f)) return -EINVAL;
- if (f.file->f_op != &sgx_provision_fops) { + if (fd_file(f)->f_op != &sgx_provision_fops) { fdput(f); return -EINVAL; } diff --combined block/blk-mq.c index 3f1f7d0b3ff35,aa28157b1aafc..831c5cf5d8740 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@@ -1128,7 -1128,7 +1128,7 @@@ static void blk_complete_reqs(struct ll rq->q->mq_ops->complete(rq); }
- static __latent_entropy void blk_done_softirq(struct softirq_action *h) + static __latent_entropy void blk_done_softirq(void) { blk_complete_reqs(this_cpu_ptr(&blk_cpu_done)); } @@@ -2753,7 -2753,6 +2753,7 @@@ static void blk_mq_dispatch_plug_list(s void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) { struct request *rq; + unsigned int depth;
/* * We may have been called recursively midway through handling @@@ -2764,7 -2763,6 +2764,7 @@@ */ if (plug->rq_count == 0) return; + depth = plug->rq_count; plug->rq_count = 0;
if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) { @@@ -2772,7 -2770,6 +2772,7 @@@
rq = rq_list_peek(&plug->mq_list); q = rq->q; + trace_block_unplug(q, depth, true);
/* * Peek first request and see if we have a ->queue_rqs() hook. @@@ -2942,7 -2939,7 +2942,7 @@@ void blk_mq_submit_bio(struct bio *bio struct blk_plug *plug = current->plug; const int is_sync = op_is_sync(bio->bi_opf); struct blk_mq_hw_ctx *hctx; - unsigned int nr_segs = 1; + unsigned int nr_segs; struct request *rq; blk_status_t ret;
@@@ -2984,10 -2981,11 +2984,10 @@@ goto queue_exit; }
- if (unlikely(bio_may_exceed_limits(bio, &q->limits))) { - bio = __bio_split_to_limits(bio, &q->limits, &nr_segs); - if (!bio) - goto queue_exit; - } + bio = __bio_split_to_limits(bio, &q->limits, &nr_segs); + if (!bio) + goto queue_exit; + if (!bio_integrity_prep(bio)) goto queue_exit;
diff --combined drivers/gpu/drm/i915/i915_utils.c index b34a2d3d331d6,f2ba51c20e975..2576f8f6c0f69 --- a/drivers/gpu/drm/i915/i915_utils.c +++ b/drivers/gpu/drm/i915/i915_utils.c @@@ -11,10 -11,51 +11,10 @@@ #include "i915_reg.h" #include "i915_utils.h"
-#define FDO_BUG_MSG "Please file a bug on drm/i915; see " FDO_BUG_URL " for details." - -void -__i915_printk(struct drm_i915_private *dev_priv, const char *level, - const char *fmt, ...) -{ - static bool shown_bug_once; - struct device *kdev = dev_priv->drm.dev; - bool is_error = level[1] <= KERN_ERR[1]; - bool is_debug = level[1] == KERN_DEBUG[1]; - struct va_format vaf; - va_list args; - - if (is_debug && !drm_debug_enabled(DRM_UT_DRIVER)) - return; - - va_start(args, fmt); - - vaf.fmt = fmt; - vaf.va = &args; - - if (is_error) - dev_printk(level, kdev, "%pV", &vaf); - else - dev_printk(level, kdev, "[" DRM_NAME ":%ps] %pV", - __builtin_return_address(0), &vaf); - - va_end(args); - - if (is_error && !shown_bug_once) { - /* - * Ask the user to file a bug report for the error, except - * if they may have caused the bug by fiddling with unsafe - * module parameters. - */ - if (!test_taint(TAINT_USER)) - dev_notice(kdev, "%s", FDO_BUG_MSG); - shown_bug_once = true; - } -} - void add_taint_for_CI(struct drm_i915_private *i915, unsigned int taint) { - __i915_printk(i915, KERN_NOTICE, "CI tainted:%#x by %pS\n", - taint, (void *)_RET_IP_); + drm_notice(&i915->drm, "CI tainted: %#x by %pS\n", + taint, __builtin_return_address(0));
/* Failures that occur during fault injection testing are expected */ if (!i915_error_injected()) @@@ -33,9 -74,9 +33,9 @@@ int __i915_inject_probe_error(struct dr if (++i915_probe_fail_count < i915_modparams.inject_probe_failure) return 0;
- __i915_printk(i915, KERN_INFO, - "Injecting failure %d at checkpoint %u [%s:%d]\n", - err, i915_modparams.inject_probe_failure, func, line); + drm_info(&i915->drm, "Injecting failure %d at checkpoint %u [%s:%d]\n", + err, i915_modparams.inject_probe_failure, func, line); + i915_modparams.inject_probe_failure = 0; return err; } @@@ -69,7 -110,7 +69,7 @@@ void set_timer_ms(struct timer_list *t * Paranoia to make sure the compiler computes the timeout before * loading 'jiffies' as jiffies is volatile and may be updated in * the background by a timer tick. All to reduce the complexity - * of the addition and reduce the risk of losing a jiffie. + * of the addition and reduce the risk of losing a jiffy. */ barrier();
diff --combined drivers/gpu/drm/v3d/v3d_bo.c index ecb80fd75b1a0,9eafe53a8f41a..ebe52bef4ffb8 --- a/drivers/gpu/drm/v3d/v3d_bo.c +++ b/drivers/gpu/drm/v3d/v3d_bo.c @@@ -26,17 -26,6 +26,17 @@@ #include "v3d_drv.h" #include "uapi/drm/v3d_drm.h"
+static enum drm_gem_object_status v3d_gem_status(struct drm_gem_object *obj) +{ + struct v3d_bo *bo = to_v3d_bo(obj); + enum drm_gem_object_status res = 0; + + if (bo->base.pages) + res |= DRM_GEM_OBJECT_RESIDENT; + + return res; +} + /* Called DRM core on the last userspace/kernel unreference of the * BO. */ @@@ -74,7 -63,6 +74,7 @@@ static const struct drm_gem_object_func .vmap = drm_gem_shmem_object_vmap, .vunmap = drm_gem_shmem_object_vunmap, .mmap = drm_gem_shmem_object_mmap, + .status = v3d_gem_status, .vm_ops = &drm_gem_shmem_vm_ops, };
@@@ -291,7 -279,7 +291,7 @@@ v3d_wait_bo_ioctl(struct drm_device *de else args->timeout_ns = 0;
- /* Asked to wait beyond the jiffie/scheduler precision? */ + /* Asked to wait beyond the jiffy/scheduler precision? */ if (ret == -ETIME && args->timeout_ns) ret = -EAGAIN;
diff --combined drivers/hwmon/k10temp.c index 85a7632f3b50a,f96b91e433126..7dc19c5d62ac3 --- a/drivers/hwmon/k10temp.c +++ b/drivers/hwmon/k10temp.c @@@ -438,21 -438,16 +438,21 @@@ static int k10temp_probe(struct pci_de data->disp_negative = true; }
- if (boot_cpu_data.x86 == 0x15 && + data->is_zen = cpu_feature_enabled(X86_FEATURE_ZEN); + if (data->is_zen) { + data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; + data->read_tempreg = read_tempreg_nb_zen; + } else if (boot_cpu_data.x86 == 0x15 && ((boot_cpu_data.x86_model & 0xf0) == 0x60 || (boot_cpu_data.x86_model & 0xf0) == 0x70)) { data->read_htcreg = read_htcreg_nb_f15; data->read_tempreg = read_tempreg_nb_f15; - } else if (boot_cpu_data.x86 == 0x17 || boot_cpu_data.x86 == 0x18) { - data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; - data->read_tempreg = read_tempreg_nb_zen; - data->is_zen = true; + } else { + data->read_htcreg = read_htcreg_pci; + data->read_tempreg = read_tempreg_pci; + }
+ if (boot_cpu_data.x86 == 0x17 || boot_cpu_data.x86 == 0x18) { switch (boot_cpu_data.x86_model) { case 0x1: /* Zen */ case 0x8: /* Zen+ */ @@@ -474,6 -469,10 +474,6 @@@ break; } } else if (boot_cpu_data.x86 == 0x19) { - data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; - data->read_tempreg = read_tempreg_nb_zen; - data->is_zen = true; - switch (boot_cpu_data.x86_model) { case 0x0 ... 0x1: /* Zen3 SP3/TR */ case 0x8: /* Zen3 TR Chagall */ @@@ -497,6 -496,13 +497,6 @@@ k10temp_get_ccd_support(data, 12); break; } - } else if (boot_cpu_data.x86 == 0x1a) { - data->temp_adjust_mask = ZEN_CUR_TEMP_RANGE_SEL_MASK; - data->read_tempreg = read_tempreg_nb_zen; - data->is_zen = true; - } else { - data->read_htcreg = read_htcreg_pci; - data->read_tempreg = read_tempreg_pci; }
for (i = 0; i < ARRAY_SIZE(tctl_offset_table); i++) { @@@ -542,6 -548,7 +542,7 @@@ static const struct pci_device_id k10te { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_19H_M78H_DF_F3) }, { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_1AH_M00H_DF_F3) }, { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_1AH_M20H_DF_F3) }, + { PCI_VDEVICE(AMD, PCI_DEVICE_ID_AMD_1AH_M60H_DF_F3) }, { PCI_VDEVICE(HYGON, PCI_DEVICE_ID_AMD_17H_DF_F3) }, {} }; diff --combined drivers/iommu/intel/iommu.h index 428d253f13484,5c0e93042f7a2..1497f3112b12c --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@@ -584,23 -584,11 +584,23 @@@ struct iommu_domain_info * to VT-d spec, section 9.3 */ };
+/* + * We start simply by using a fixed size for the batched descriptors. This + * size is currently sufficient for our needs. Future improvements could + * involve dynamically allocating the batch buffer based on actual demand, + * allowing us to adjust the batch size for optimal performance in different + * scenarios. + */ +#define QI_MAX_BATCHED_DESC_COUNT 16 +struct qi_batch { + struct qi_desc descs[QI_MAX_BATCHED_DESC_COUNT]; + unsigned int index; +}; + struct dmar_domain { int nid; /* node id */ struct xarray iommu_array; /* Attached IOMMU array */
- u8 has_iotlb_device: 1; u8 iommu_coherency: 1; /* indicate coherency of iommu access */ u8 force_snooping : 1; /* Create IOPTEs with snoop control */ u8 set_pte_snp:1; @@@ -621,7 -609,6 +621,7 @@@
spinlock_t cache_lock; /* Protect the cache tag list */ struct list_head cache_tags; /* Cache tag list */ + struct qi_batch *qi_batch; /* Batched QI descriptors */
int iommu_superpage;/* Level of superpages supported: 0 == 4KiB (no superpages), 1 == 2MiB, @@@ -700,8 -687,6 +700,6 @@@ struct iommu_pmu DECLARE_BITMAP(used_mask, IOMMU_PMU_IDX_MAX); struct perf_event *event_list[IOMMU_PMU_IDX_MAX]; unsigned char irq_name[16]; - struct hlist_node cpuhp_node; - int cpu; };
#define IOMMU_IRQ_ID_OFFSET_PRQ (DMAR_UNITS_SUPPORTED) @@@ -1080,115 -1065,6 +1078,115 @@@ static inline unsigned long nrpages_to_ return npages << VTD_PAGE_SHIFT; }
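[Aside] struct qi_batch introduced above is the simplest possible batching: accumulate descriptors in a fixed 16-entry array and submit the whole array when it fills up or when the caller flushes. A userspace sketch of that shape, with printf() standing in for the submit step (hypothetical names, not kernel code):

#include <stdio.h>

#define MAX_BATCHED 16

struct batch {
	int desc[MAX_BATCHED];
	unsigned int index;
};

static void batch_flush(struct batch *b)
{
	if (!b->index)
		return;
	printf("submitting %u descriptors\n", b->index);	/* "qi_submit_sync()" */
	b->index = 0;
}

static void batch_add(struct batch *b, int desc)
{
	b->desc[b->index++] = desc;
	if (b->index == MAX_BATCHED)	/* full: push to hardware now */
		batch_flush(b);
}

int main(void)
{
	struct batch b = { .index = 0 };

	for (int i = 0; i < 40; i++)
		batch_add(&b, i);	/* flushes at 16 and 32 */
	batch_flush(&b);		/* flush the 8-descriptor tail */
	return 0;
}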
+static inline void qi_desc_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, + unsigned int size_order, u64 type, + struct qi_desc *desc) +{ + u8 dw = 0, dr = 0; + int ih = 0; + + if (cap_write_drain(iommu->cap)) + dw = 1; + + if (cap_read_drain(iommu->cap)) + dr = 1; + + desc->qw0 = QI_IOTLB_DID(did) | QI_IOTLB_DR(dr) | QI_IOTLB_DW(dw) + | QI_IOTLB_GRAN(type) | QI_IOTLB_TYPE; + desc->qw1 = QI_IOTLB_ADDR(addr) | QI_IOTLB_IH(ih) + | QI_IOTLB_AM(size_order); + desc->qw2 = 0; + desc->qw3 = 0; +} + +static inline void qi_desc_dev_iotlb(u16 sid, u16 pfsid, u16 qdep, u64 addr, + unsigned int mask, struct qi_desc *desc) +{ + if (mask) { + addr |= (1ULL << (VTD_PAGE_SHIFT + mask - 1)) - 1; + desc->qw1 = QI_DEV_IOTLB_ADDR(addr) | QI_DEV_IOTLB_SIZE; + } else { + desc->qw1 = QI_DEV_IOTLB_ADDR(addr); + } + + if (qdep >= QI_DEV_IOTLB_MAX_INVS) + qdep = 0; + + desc->qw0 = QI_DEV_IOTLB_SID(sid) | QI_DEV_IOTLB_QDEP(qdep) | + QI_DIOTLB_TYPE | QI_DEV_IOTLB_PFSID(pfsid); + desc->qw2 = 0; + desc->qw3 = 0; +} + +static inline void qi_desc_piotlb(u16 did, u32 pasid, u64 addr, + unsigned long npages, bool ih, + struct qi_desc *desc) +{ + if (npages == -1) { + desc->qw0 = QI_EIOTLB_PASID(pasid) | + QI_EIOTLB_DID(did) | + QI_EIOTLB_GRAN(QI_GRAN_NONG_PASID) | + QI_EIOTLB_TYPE; + desc->qw1 = 0; + } else { + int mask = ilog2(__roundup_pow_of_two(npages)); + unsigned long align = (1ULL << (VTD_PAGE_SHIFT + mask)); + + if (WARN_ON_ONCE(!IS_ALIGNED(addr, align))) + addr = ALIGN_DOWN(addr, align); + + desc->qw0 = QI_EIOTLB_PASID(pasid) | + QI_EIOTLB_DID(did) | + QI_EIOTLB_GRAN(QI_GRAN_PSI_PASID) | + QI_EIOTLB_TYPE; + desc->qw1 = QI_EIOTLB_ADDR(addr) | + QI_EIOTLB_IH(ih) | + QI_EIOTLB_AM(mask); + } +} + +static inline void qi_desc_dev_iotlb_pasid(u16 sid, u16 pfsid, u32 pasid, + u16 qdep, u64 addr, + unsigned int size_order, + struct qi_desc *desc) +{ + unsigned long mask = 1UL << (VTD_PAGE_SHIFT + size_order - 1); + + desc->qw0 = QI_DEV_EIOTLB_PASID(pasid) | QI_DEV_EIOTLB_SID(sid) | + QI_DEV_EIOTLB_QDEP(qdep) | QI_DEIOTLB_TYPE | + QI_DEV_IOTLB_PFSID(pfsid); + + /* + * If S bit is 0, we only flush a single page. If S bit is set, + * The least significant zero bit indicates the invalidation address + * range. VT-d spec 6.5.2.6. + * e.g. address bit 12[0] indicates 8KB, 13[0] indicates 16KB. + * size order = 0 is PAGE_SIZE 4KB + * Max Invs Pending (MIP) is set to 0 for now until we have DIT in + * ECAP. + */ + if (!IS_ALIGNED(addr, VTD_PAGE_SIZE << size_order)) + pr_warn_ratelimited("Invalidate non-aligned address %llx, order %d\n", + addr, size_order); + + /* Take page address */ + desc->qw1 = QI_DEV_EIOTLB_ADDR(addr); + + if (size_order) { + /* + * Existing 0s in address below size_order may be the least + * significant bit, we must set them to 1s to avoid having + * smaller size than desired. + */ + desc->qw1 |= GENMASK_ULL(size_order + VTD_PAGE_SHIFT - 1, + VTD_PAGE_SHIFT); + /* Clear size_order bit to indicate size */ + desc->qw1 &= ~mask; + /* Set the S bit to indicate flushing more than 1 page */ + desc->qw1 |= QI_DEV_EIOTLB_SIZE; + } +} + /* Convert value to context PASID directory size field coding. */ #define context_pdts(pds) (((pds) & 0x7) << 9)
@@@ -1220,15 -1096,13 +1218,15 @@@ void qi_flush_pasid_cache(struct intel_
int qi_submit_sync(struct intel_iommu *iommu, struct qi_desc *desc, unsigned int count, unsigned long options); + +void __iommu_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr, + unsigned int size_order, u64 type); /* * Options used in qi_submit_sync: * QI_OPT_WAIT_DRAIN - Wait for PRQ drain completion, spec 6.5.2.8. */ #define QI_OPT_WAIT_DRAIN BIT(0)
-void domain_update_iotlb(struct dmar_domain *domain); int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu); void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu); void device_block_translation(struct device *dev); diff --combined drivers/irqchip/irq-loongson-eiointc.c index 2057388470660,e24db71a8783c..9a46af3083d61 --- a/drivers/irqchip/irq-loongson-eiointc.c +++ b/drivers/irqchip/irq-loongson-eiointc.c @@@ -14,10 -14,11 +14,12 @@@ #include <linux/irqdomain.h> #include <linux/irqchip/chained_irq.h> #include <linux/kernel.h> +#include <linux/kvm_para.h> #include <linux/syscore_ops.h> #include <asm/numa.h>
+ #include "irq-loongson.h" + #define EIOINTC_REG_NODEMAP 0x14a0 #define EIOINTC_REG_IPMAP 0x14c0 #define EIOINTC_REG_ENABLE 0x1600 @@@ -25,37 -26,15 +27,37 @@@ #define EIOINTC_REG_ISR 0x1800 #define EIOINTC_REG_ROUTE 0x1c00
+#define EXTIOI_VIRT_FEATURES 0x40000000 +#define EXTIOI_HAS_VIRT_EXTENSION BIT(0) +#define EXTIOI_HAS_ENABLE_OPTION BIT(1) +#define EXTIOI_HAS_INT_ENCODE BIT(2) +#define EXTIOI_HAS_CPU_ENCODE BIT(3) +#define EXTIOI_VIRT_CONFIG 0x40000004 +#define EXTIOI_ENABLE BIT(1) +#define EXTIOI_ENABLE_INT_ENCODE BIT(2) +#define EXTIOI_ENABLE_CPU_ENCODE BIT(3) + #define VEC_REG_COUNT 4 #define VEC_COUNT_PER_REG 64 #define VEC_COUNT (VEC_REG_COUNT * VEC_COUNT_PER_REG) #define VEC_REG_IDX(irq_id) ((irq_id) / VEC_COUNT_PER_REG) #define VEC_REG_BIT(irq_id) ((irq_id) % VEC_COUNT_PER_REG) #define EIOINTC_ALL_ENABLE 0xffffffff +#define EIOINTC_ALL_ENABLE_VEC_MASK(vector) (EIOINTC_ALL_ENABLE & ~BIT(vector & 0x1f)) +#define EIOINTC_REG_ENABLE_VEC(vector) (EIOINTC_REG_ENABLE + ((vector >> 5) << 2)) +#define EIOINTC_USE_CPU_ENCODE BIT(0)
#define MAX_EIO_NODES (NR_CPUS / CORES_PER_EIO_NODE)
+/* + * Routing registers are 32bit, and there is 8-bit route setting for every + * interrupt vector. So one Route register contains four vectors routing + * information. + */ +#define EIOINTC_REG_ROUTE_VEC(vector) (EIOINTC_REG_ROUTE + (vector & ~0x03)) +#define EIOINTC_REG_ROUTE_VEC_SHIFT(vector) ((vector & 0x03) << 3) +#define EIOINTC_REG_ROUTE_VEC_MASK(vector) (0xff << EIOINTC_REG_ROUTE_VEC_SHIFT(vector)) + static int nr_pics;
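[Aside] Per the comment above, one 32-bit route register carries four 8-bit per-vector route fields, so updating vector N is a read-modify-write of byte (N & 3) inside register (N & ~3). A userspace sketch of exactly that arithmetic (fake register array, not kernel code):

#include <stdint.h>
#include <stdio.h>

static uint32_t route_regs[64];		/* fake EIOINTC route register space */

static void set_route(unsigned int vector, uint8_t cpu)
{
	uint32_t *reg = &route_regs[vector >> 2];	/* four vectors per register */
	unsigned int shift = (vector & 0x3) << 3;	/* 8-bit field offset */

	*reg &= ~(0xffu << shift);			/* clear the old route */
	*reg |= (uint32_t)cpu << shift;			/* write the new one */
}

int main(void)
{
	set_route(5, 3);	/* vector 5: register 1, second byte */
	printf("route_regs[1] = %#x\n", route_regs[1]);	/* 0x300 */
	return 0;
}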
struct eiointc_priv { @@@ -65,7 -44,6 +67,7 @@@ cpumask_t cpuspan_map; struct fwnode_handle *domain_handle; struct irq_domain *eiointc_domain; + int flags; };
static struct eiointc_priv *eiointc_priv[MAX_IO_PICS]; @@@ -81,10 -59,7 +83,10 @@@ static void eiointc_enable(void
static int cpu_to_eio_node(int cpu) { - return cpu_logical_map(cpu) / CORES_PER_EIO_NODE; + if (!kvm_para_has_feature(KVM_FEATURE_VIRT_EXTIOI)) + return cpu_logical_map(cpu) / CORES_PER_EIO_NODE; + else + return cpu_logical_map(cpu) / CORES_PER_VEIO_NODE; }
#ifdef CONFIG_SMP @@@ -116,17 -91,6 +118,17 @@@ static void eiointc_set_irq_route(int p
static DEFINE_RAW_SPINLOCK(affinity_lock);
+static void veiointc_set_irq_route(unsigned int vector, unsigned int cpu) +{ + unsigned long reg = EIOINTC_REG_ROUTE_VEC(vector); + unsigned int data; + + data = iocsr_read32(reg); + data &= ~EIOINTC_REG_ROUTE_VEC_MASK(vector); + data |= cpu_logical_map(cpu) << EIOINTC_REG_ROUTE_VEC_SHIFT(vector); + iocsr_write32(data, reg); +} + static int eiointc_set_irq_affinity(struct irq_data *d, const struct cpumask *affinity, bool force) { unsigned int cpu; @@@ -143,24 -107,18 +145,24 @@@ }
vector = d->hwirq; - regaddr = EIOINTC_REG_ENABLE + ((vector >> 5) << 2); - - /* Mask target vector */ - csr_any_send(regaddr, EIOINTC_ALL_ENABLE & (~BIT(vector & 0x1F)), - 0x0, priv->node * CORES_PER_EIO_NODE); - - /* Set route for target vector */ - eiointc_set_irq_route(vector, cpu, priv->node, &priv->node_map); - - /* Unmask target vector */ - csr_any_send(regaddr, EIOINTC_ALL_ENABLE, - 0x0, priv->node * CORES_PER_EIO_NODE); + regaddr = EIOINTC_REG_ENABLE_VEC(vector); + + if (priv->flags & EIOINTC_USE_CPU_ENCODE) { + iocsr_write32(EIOINTC_ALL_ENABLE_VEC_MASK(vector), regaddr); + veiointc_set_irq_route(vector, cpu); + iocsr_write32(EIOINTC_ALL_ENABLE, regaddr); + } else { + /* Mask target vector */ + csr_any_send(regaddr, EIOINTC_ALL_ENABLE_VEC_MASK(vector), + 0x0, priv->node * CORES_PER_EIO_NODE); + + /* Set route for target vector */ + eiointc_set_irq_route(vector, cpu, priv->node, &priv->node_map); + + /* Unmask target vector */ + csr_any_send(regaddr, EIOINTC_ALL_ENABLE, + 0x0, priv->node * CORES_PER_EIO_NODE); + }
irq_data_update_effective_affinity(d, cpumask_of(cpu));
@@@ -184,23 -142,17 +186,23 @@@ static int eiointc_index(int node
static int eiointc_router_init(unsigned int cpu) { - int i, bit; - uint32_t data; - uint32_t node = cpu_to_eio_node(cpu); - int index = eiointc_index(node); + int i, bit, cores, index, node; + unsigned int data; + + node = cpu_to_eio_node(cpu); + index = eiointc_index(node);
if (index < 0) { pr_err("Error: invalid nodemap!\n"); - return -1; + return -EINVAL; }
- if ((cpu_logical_map(cpu) % CORES_PER_EIO_NODE) == 0) { + if (!(eiointc_priv[index]->flags & EIOINTC_USE_CPU_ENCODE)) + cores = CORES_PER_EIO_NODE; + else + cores = CORES_PER_VEIO_NODE; + + if ((cpu_logical_map(cpu) % cores) == 0) { eiointc_enable();
for (i = 0; i < eiointc_priv[0]->vec_count / 32; i++) { @@@ -216,9 -168,7 +218,9 @@@
for (i = 0; i < eiointc_priv[0]->vec_count / 4; i++) { /* Route to Node-0 Core-0 */ - if (index == 0) + if (eiointc_priv[index]->flags & EIOINTC_USE_CPU_ENCODE) + bit = cpu_logical_map(0); + else if (index == 0) bit = BIT(cpu_logical_map(0)); else bit = (eiointc_priv[index]->node << 4) | 1; @@@ -412,6 -362,9 +414,9 @@@ static int __init acpi_cascade_irqdomai if (r < 0) return r;
+ if (cpu_has_avecint) + return 0; + r = acpi_table_parse_madt(ACPI_MADT_TYPE_MSI_PIC, pch_msi_parse_madt, 1); if (r < 0) return r; @@@ -422,7 -375,7 +427,7 @@@ static int __init eiointc_init(struct eiointc_priv *priv, int parent_irq, u64 node_map) { - int i; + int i, val;
node_map = node_map ? node_map : -1ULL; for_each_possible_cpu(i) { @@@ -442,28 -395,14 +447,28 @@@ return -ENOMEM; }
+ if (kvm_para_has_feature(KVM_FEATURE_VIRT_EXTIOI)) { + val = iocsr_read32(EXTIOI_VIRT_FEATURES); + /* + * With EXTIOI_ENABLE_CPU_ENCODE set + * interrupts can route to 256 vCPUs. + */ + if (val & EXTIOI_HAS_CPU_ENCODE) { + val = iocsr_read32(EXTIOI_VIRT_CONFIG); + val |= EXTIOI_ENABLE_CPU_ENCODE; + iocsr_write32(val, EXTIOI_VIRT_CONFIG); + priv->flags = EIOINTC_USE_CPU_ENCODE; + } + } + eiointc_priv[nr_pics++] = priv; eiointc_router_init(0); irq_set_chained_handler_and_data(parent_irq, eiointc_irq_dispatch, priv);
if (nr_pics == 1) { register_syscore_ops(&eiointc_syscore_ops); - cpuhp_setup_state_nocalls(CPUHP_AP_IRQ_LOONGARCH_STARTING, - "irqchip/loongarch/intc:starting", + cpuhp_setup_state_nocalls(CPUHP_AP_IRQ_EIOINTC_STARTING, + "irqchip/loongarch/eiointc:starting", eiointc_router_init, NULL); }
diff --combined fs/proc/base.c index 1ad51858528f7,632cf1fc8f8c1..b31283d81c52e --- a/fs/proc/base.c +++ b/fs/proc/base.c @@@ -85,7 -85,6 +85,7 @@@ #include <linux/elf.h> #include <linux/pid_namespace.h> #include <linux/user_namespace.h> +#include <linux/fs_parser.h> #include <linux/fs_struct.h> #include <linux/slab.h> #include <linux/sched/autogroup.h> @@@ -118,40 -117,6 +118,40 @@@ static u8 nlink_tid __ro_after_init; static u8 nlink_tgid __ro_after_init;
+enum proc_mem_force { + PROC_MEM_FORCE_ALWAYS, + PROC_MEM_FORCE_PTRACE, + PROC_MEM_FORCE_NEVER +}; + +static enum proc_mem_force proc_mem_force_override __ro_after_init = + IS_ENABLED(CONFIG_PROC_MEM_NO_FORCE) ? PROC_MEM_FORCE_NEVER : + IS_ENABLED(CONFIG_PROC_MEM_FORCE_PTRACE) ? PROC_MEM_FORCE_PTRACE : + PROC_MEM_FORCE_ALWAYS; + +static const struct constant_table proc_mem_force_table[] __initconst = { + { "always", PROC_MEM_FORCE_ALWAYS }, + { "ptrace", PROC_MEM_FORCE_PTRACE }, + { "never", PROC_MEM_FORCE_NEVER }, + { } +}; + +static int __init early_proc_mem_force_override(char *buf) +{ + if (!buf) + return -EINVAL; + + /* + * lookup_constant() defaults to proc_mem_force_override to preseve + * the initial Kconfig choice in case an invalid param gets passed. + */ + proc_mem_force_override = lookup_constant(proc_mem_force_table, + buf, proc_mem_force_override); + + return 0; +} +early_param("proc_mem.force_override", early_proc_mem_force_override); + struct pid_entry { const char *name; unsigned int len; @@@ -862,31 -827,12 +862,31 @@@ static int __mem_open(struct inode *ino
static int mem_open(struct inode *inode, struct file *file) { - int ret = __mem_open(inode, file, PTRACE_MODE_ATTACH); - - /* OK to pass negative loff_t, we can catch out-of-range */ - file->f_mode |= FMODE_UNSIGNED_OFFSET; + if (WARN_ON_ONCE(!(file->f_op->fop_flags & FOP_UNSIGNED_OFFSET))) + return -EINVAL; + return __mem_open(inode, file, PTRACE_MODE_ATTACH); +}
- return ret; +static bool proc_mem_foll_force(struct file *file, struct mm_struct *mm) +{ + struct task_struct *task; + bool ptrace_active = false; + + switch (proc_mem_force_override) { + case PROC_MEM_FORCE_NEVER: + return false; + case PROC_MEM_FORCE_PTRACE: + task = get_proc_task(file_inode(file)); + if (task) { + ptrace_active = READ_ONCE(task->ptrace) && + READ_ONCE(task->mm) == mm && + READ_ONCE(task->parent) == current; + put_task_struct(task); + } + return ptrace_active; + default: + return true; + } }
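[Aside] The proc_mem.force_override parsing above leans on lookup_constant()'s third argument: an unrecognised string simply leaves the Kconfig-selected default in place rather than failing the boot parameter. A userspace sketch of that lookup-with-fallback, using hypothetical names (not kernel code):

#include <stdio.h>
#include <string.h>

enum force_mode { FORCE_ALWAYS, FORCE_PTRACE, FORCE_NEVER };

static const struct { const char *name; enum force_mode val; } table[] = {
	{ "always", FORCE_ALWAYS },
	{ "ptrace", FORCE_PTRACE },
	{ "never",  FORCE_NEVER  },
};

static enum force_mode force_override = FORCE_PTRACE;	/* "Kconfig" default */

static void parse_override(const char *buf)
{
	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (!strcmp(buf, table[i].name)) {
			force_override = table[i].val;
			return;
		}
	}
	/* Unknown value: keep the previous setting, like lookup_constant(). */
}

int main(void)
{
	parse_override("never");
	printf("mode = %d\n", force_override);	/* 2 (FORCE_NEVER)   */
	parse_override("bogus");
	printf("mode = %d\n", force_override);	/* still 2, unchanged */
	return 0;
}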
static ssize_t mem_rw(struct file *file, char __user *buf, @@@ -909,9 -855,7 +909,9 @@@ if (!mmget_not_zero(mm)) goto free;
- flags = FOLL_FORCE | (write ? FOLL_WRITE : 0); + flags = write ? FOLL_WRITE : 0; + if (proc_mem_foll_force(file, mm)) + flags |= FOLL_FORCE;
while (count > 0) { size_t this_len = min_t(size_t, count, PAGE_SIZE); @@@ -988,7 -932,6 +988,7 @@@ static const struct file_operations pro .write = mem_write, .open = mem_open, .release = mem_release, + .fop_flags = FOP_UNSIGNED_OFFSET, };
static int environ_open(struct inode *inode, struct file *file) @@@ -2333,8 -2276,8 +2333,8 @@@ proc_map_files_instantiate(struct dentr inode->i_op = &proc_map_files_link_inode_operations; inode->i_size = 64;
- d_set_d_op(dentry, &tid_map_files_dentry_operations); - return d_splice_alias(inode, dentry); + return proc_splice_unmountable(inode, dentry, + &tid_map_files_dentry_operations); }
static struct dentry *proc_map_files_lookup(struct inode *dir, @@@ -2513,13 -2456,13 +2513,13 @@@ static void *timers_start(struct seq_fi if (!tp->sighand) return ERR_PTR(-ESRCH);
- return seq_list_start(&tp->task->signal->posix_timers, *pos); + return seq_hlist_start(&tp->task->signal->posix_timers, *pos); }
static void *timers_next(struct seq_file *m, void *v, loff_t *pos) { struct timers_private *tp = m->private; - return seq_list_next(v, &tp->task->signal->posix_timers, pos); + return seq_hlist_next(v, &tp->task->signal->posix_timers, pos); }
static void timers_stop(struct seq_file *m, void *v) @@@ -2548,7 -2491,7 +2548,7 @@@ static int show_timer(struct seq_file * [SIGEV_THREAD] = "thread", };
- timer = list_entry((struct list_head *)v, struct k_itimer, list); + timer = hlist_entry((struct hlist_node *)v, struct k_itimer, list); notify = timer->it_sigev_notify;
seq_printf(m, "ID: %d\n", timer->it_id); @@@ -2626,10 -2569,11 +2626,11 @@@ static ssize_t timerslack_ns_write(stru }
task_lock(p); - if (slack_ns == 0) - p->timer_slack_ns = p->default_timer_slack_ns; - else - p->timer_slack_ns = slack_ns; + if (rt_or_dl_task_policy(p)) + slack_ns = 0; + else if (slack_ns == 0) + slack_ns = p->default_timer_slack_ns; + p->timer_slack_ns = slack_ns; task_unlock(p);
out: @@@ -3927,12 -3871,12 +3928,12 @@@ static int proc_task_readdir(struct fil if (!dir_emit_dots(file, ctx)) return 0;
- /* f_version caches the tgid value that the last readdir call couldn't - * return. lseek aka telldir automagically resets f_version to 0. + /* We cache the tgid value that the last readdir call couldn't + * return and lseek resets it to 0. */ ns = proc_pid_ns(inode->i_sb); - tid = (int)file->f_version; - file->f_version = 0; + tid = (int)(intptr_t)file->private_data; + file->private_data = NULL; for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns); task; task = next_tid(task), ctx->pos++) { @@@ -3947,7 -3891,7 +3948,7 @@@ proc_task_instantiate, task, NULL)) { /* returning this tgid failed, save it as the first * pid for the next readir call */ - file->f_version = (u64)tid; + file->private_data = (void *)(intptr_t)tid; put_task_struct(task); break; } @@@ -3972,24 -3916,6 +3973,24 @@@ static int proc_task_getattr(struct mnt return 0; }
+/* + * proc_task_readdir() set @file->private_data to a positive integer + * value, so casting that to u64 is safe. generic_llseek_cookie() will + * set @cookie to 0, so casting to an int is safe. The WARN_ON_ONCE() is + * here to catch any unexpected change in behavior either in + * proc_task_readdir() or generic_llseek_cookie(). + */ +static loff_t proc_dir_llseek(struct file *file, loff_t offset, int whence) +{ + u64 cookie = (u64)(intptr_t)file->private_data; + loff_t off; + + off = generic_llseek_cookie(file, offset, whence, &cookie); + WARN_ON_ONCE(cookie > INT_MAX); + file->private_data = (void *)(intptr_t)cookie; /* serialized by f_pos_lock */ + return off; +} + static const struct inode_operations proc_task_inode_operations = { .lookup = proc_task_lookup, .getattr = proc_task_getattr, @@@ -4000,7 -3926,7 +4001,7 @@@ static const struct file_operations proc_task_operations = { .read = generic_read_dir, .iterate_shared = proc_task_readdir, - .llseek = generic_file_llseek, + .llseek = proc_dir_llseek, };
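[Aside] proc_dir_llseek() above round-trips a small positive integer through the void *private_data slot via intptr_t casts; the WARN_ON_ONCE() only documents that the cookie must stay in int range. The cast pattern itself, as a standalone userspace sketch (not kernel code):

#include <stdint.h>
#include <stdio.h>

static void *private_data;	/* stand-in for file->private_data */

static void save_tid(int tid)
{
	private_data = (void *)(intptr_t)tid;	/* small positive int -> pointer */
}

static int load_tid(void)
{
	return (int)(intptr_t)private_data;	/* round-trips exactly */
}

int main(void)
{
	save_tid(4242);
	printf("resume readdir at tid %d\n", load_tid());
	return 0;
}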
void __init set_proc_pid_nlink(void) diff --combined fs/select.c index f8880a6c069cf,ad171b7a5c11f..6a179c4c2632e --- a/fs/select.c +++ b/fs/select.c @@@ -77,19 -77,16 +77,16 @@@ u64 select_estimate_accuracy(struct tim { u64 ret; struct timespec64 now; + u64 slack = current->timer_slack_ns;
- /* - * Realtime tasks get a slack of 0 for obvious reasons. - */ - - if (rt_task(current)) + if (slack == 0) return 0;
ktime_get_ts64(&now); now = timespec64_sub(*tv, now); ret = __estimate_accuracy(&now); - if (ret < current->timer_slack_ns) - return current->timer_slack_ns; + if (ret < slack) + return slack; return ret; }
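[Aside] With the hunks above, "this task gets no slack" is now encoded as timer_slack_ns == 0 (forced for RT/DL tasks at write time) instead of being re-derived from rt_task() on every poll/select, while the accuracy estimate is still clamped to never drop below the slack. A compact userspace sketch of both halves (hypothetical names, not kernel code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors the timerslack_ns_write() logic above. */
static uint64_t new_slack(bool rt_or_dl, uint64_t dflt, uint64_t written)
{
	if (rt_or_dl)
		return 0;		/* RT/DL tasks never get slack */
	if (written == 0)
		return dflt;		/* writing 0 restores the default */
	return written;
}

/* Mirrors select_estimate_accuracy(): clamp to the slack, 0 means none. */
static uint64_t accuracy(uint64_t estimated, uint64_t slack)
{
	if (slack == 0)
		return 0;
	return estimated < slack ? slack : estimated;
}

int main(void)
{
	uint64_t slack = new_slack(false, 50000, 0);

	printf("slack    = %llu\n", (unsigned long long)slack);			    /* 50000 */
	printf("accuracy = %llu\n", (unsigned long long)accuracy(9000, slack));	    /* 50000 */
	printf("rt slack = %llu\n", (unsigned long long)new_slack(true, 50000, 7)); /* 0 */
	return 0;
}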
@@@ -532,10 -529,10 +529,10 @@@ static noinline_for_stack int do_select continue; mask = EPOLLNVAL; f = fdget(i); - if (f.file) { + if (fd_file(f)) { wait_key_set(wait, in, out, bit, busy_flag); - mask = vfs_poll(f.file, wait); + mask = vfs_poll(fd_file(f), wait);
fdput(f); } @@@ -840,7 -837,7 +837,7 @@@ SYSCALL_DEFINE1(old_select, struct sel_ struct poll_list { struct poll_list *next; unsigned int len; - struct pollfd entries[]; + struct pollfd entries[] __counted_by(len); };
#define POLLFD_PER_PAGE ((PAGE_SIZE-sizeof(struct poll_list)) / sizeof(struct pollfd)) @@@ -864,13 -861,13 +861,13 @@@ static inline __poll_t do_pollfd(struc goto out; mask = EPOLLNVAL; f = fdget(fd); - if (!f.file) + if (!fd_file(f)) goto out;
/* userland u16 ->events contains POLL... bitmap */ filter = demangle_poll(pollfd->events) | EPOLLERR | EPOLLHUP; pwait->_key = filter | busy_flag; - mask = vfs_poll(f.file, pwait); + mask = vfs_poll(fd_file(f), pwait); if (mask & busy_flag) *can_busy_poll = true; mask &= filter; /* Mask out unneeded events. */ diff --combined fs/signalfd.c index 777e889ab0e80,d0333bce015e9..736bebf935918 --- a/fs/signalfd.c +++ b/fs/signalfd.c @@@ -159,7 -159,7 +159,7 @@@ static ssize_t signalfd_dequeue(struct DECLARE_WAITQUEUE(wait, current);
spin_lock_irq(¤t->sighand->siglock); - ret = dequeue_signal(current, &ctx->sigmask, info, &type); + ret = dequeue_signal(&ctx->sigmask, info, &type); switch (ret) { case 0: if (!nonblock) @@@ -174,7 -174,7 +174,7 @@@ add_wait_queue(¤t->sighand->signalfd_wqh, &wait); for (;;) { set_current_state(TASK_INTERRUPTIBLE); - ret = dequeue_signal(current, &ctx->sigmask, info, &type); + ret = dequeue_signal(&ctx->sigmask, info, &type); if (ret != 0) break; if (signal_pending(current)) { @@@ -289,10 -289,10 +289,10 @@@ static int do_signalfd4(int ufd, sigset fd_install(ufd, file); } else { struct fd f = fdget(ufd); - if (!f.file) + if (!fd_file(f)) return -EBADF; - ctx = f.file->private_data; - if (f.file->f_op != &signalfd_fops) { + ctx = fd_file(f)->private_data; + if (fd_file(f)->f_op != &signalfd_fops) { fdput(f); return -EINVAL; } diff --combined include/linux/cleanup.h index a3d3e888cf1f3,9c6b4f2c01765..038b2d523bf88 --- a/include/linux/cleanup.h +++ b/include/linux/cleanup.h @@@ -4,6 -4,142 +4,142 @@@
#include <linux/compiler.h>
+ /** + * DOC: scope-based cleanup helpers + * + * The "goto error" pattern is notorious for introducing subtle resource + * leaks. It is tedious and error prone to add new resource acquisition + * constraints into code paths that already have several unwind + * conditions. The "cleanup" helpers enable the compiler to help with + * this tedium and can aid in maintaining LIFO (last in first out) + * unwind ordering to avoid unintentional leaks. + * + * As drivers make up the majority of the kernel code base, here is an + * example of using these helpers to clean up PCI drivers. The target of + * the cleanups are occasions where a goto is used to unwind a device + * reference (pci_dev_put()), or unlock the device (pci_dev_unlock()) + * before returning. + * + * The DEFINE_FREE() macro can arrange for PCI device references to be + * dropped when the associated variable goes out of scope:: + * + * DEFINE_FREE(pci_dev_put, struct pci_dev *, if (_T) pci_dev_put(_T)) + * ... + * struct pci_dev *dev __free(pci_dev_put) = + * pci_get_slot(parent, PCI_DEVFN(0, 0)); + * + * The above will automatically call pci_dev_put() if @dev is non-NULL + * when @dev goes out of scope (automatic variable scope). If a function + * wants to invoke pci_dev_put() on error, but return @dev (i.e. without + * freeing it) on success, it can do:: + * + * return no_free_ptr(dev); + * + * ...or:: + * + * return_ptr(dev); + * + * The DEFINE_GUARD() macro can arrange for the PCI device lock to be + * dropped when the scope where guard() is invoked ends:: + * + * DEFINE_GUARD(pci_dev, struct pci_dev *, pci_dev_lock(_T), pci_dev_unlock(_T)) + * ... + * guard(pci_dev)(dev); + * + * The lifetime of the lock obtained by the guard() helper follows the + * scope of automatic variable declaration. Take the following example:: + * + * func(...) + * { + * if (...) { + * ... + * guard(pci_dev)(dev); // pci_dev_lock() invoked here + * ... + * } // <- implied pci_dev_unlock() triggered here + * } + * + * Observe the lock is held for the remainder of the "if ()" block not + * the remainder of "func()". + * + * Now, when a function uses both __free() and guard(), or multiple + * instances of __free(), the LIFO order of variable definition order + * matters. GCC documentation says: + * + * "When multiple variables in the same scope have cleanup attributes, + * at exit from the scope their associated cleanup functions are run in + * reverse order of definition (last defined, first cleanup)." + * + * When the unwind order matters it requires that variables be defined + * mid-function scope rather than at the top of the file. 
Take the + * following example and notice the bug highlighted by "!!":: + * + * LIST_HEAD(list); + * DEFINE_MUTEX(lock); + * + * struct object { + * struct list_head node; + * }; + * + * static struct object *alloc_add(void) + * { + * struct object *obj; + * + * lockdep_assert_held(&lock); + * obj = kzalloc(sizeof(*obj), GFP_KERNEL); + * if (obj) { + * LIST_HEAD_INIT(&obj->node); + * list_add(obj->node, &list): + * } + * return obj; + * } + * + * static void remove_free(struct object *obj) + * { + * lockdep_assert_held(&lock); + * list_del(&obj->node); + * kfree(obj); + * } + * + * DEFINE_FREE(remove_free, struct object *, if (_T) remove_free(_T)) + * static int init(void) + * { + * struct object *obj __free(remove_free) = NULL; + * int err; + * + * guard(mutex)(&lock); + * obj = alloc_add(); + * + * if (!obj) + * return -ENOMEM; + * + * err = other_init(obj); + * if (err) + * return err; // remove_free() called without the lock!! + * + * no_free_ptr(obj); + * return 0; + * } + * + * That bug is fixed by changing init() to call guard() and define + + * initialize @obj in this order:: + * + * guard(mutex)(&lock); + * struct object *obj __free(remove_free) = alloc_add(); + * + * Given that the "__free(...) = NULL" pattern for variables defined at + * the top of the function poses this potential interdependency problem + * the recommendation is to always define and assign variables in one + * statement and not group variable definitions at the top of the + * function when __free() is used. + * + * Lastly, given that the benefit of cleanup helpers is removal of + * "goto", and that the "goto" statement can jump between scopes, the + * expectation is that usage of "goto" and cleanup helpers is never + * mixed in the same function. I.e. for a given routine, convert all + * resources that need a "goto" cleanup to scope-based cleanup, or + * convert none of them. + */ + /* * DEFINE_FREE(name, type, free): * simple helper macro that defines the required wrapper for a __free() @@@ -98,7 -234,7 +234,7 @@@ const volatile void * __must_check_fn(c * DEFINE_CLASS(fdget, struct fd, fdput(_T), fdget(fd), int fd) * * CLASS(fdget, f)(fd); - * if (!f.file) + * if (!fd_file(f)) * return -EBADF; * * // use 'f' without concern diff --combined include/linux/pci_ids.h index 2c94d4004dd50,91182aa1d2ec5..e4bddb9277956 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@@ -580,6 -580,7 +580,7 @@@ #define PCI_DEVICE_ID_AMD_19H_M78H_DF_F3 0x12fb #define PCI_DEVICE_ID_AMD_1AH_M00H_DF_F3 0x12c3 #define PCI_DEVICE_ID_AMD_1AH_M20H_DF_F3 0x16fb + #define PCI_DEVICE_ID_AMD_1AH_M60H_DF_F3 0x124b #define PCI_DEVICE_ID_AMD_1AH_M70H_DF_F3 0x12bb #define PCI_DEVICE_ID_AMD_MI200_DF_F3 0x14d3 #define PCI_DEVICE_ID_AMD_MI300_DF_F3 0x152b @@@ -2661,8 -2662,6 +2662,8 @@@ #define PCI_DEVICE_ID_DCI_PCCOM8 0x0002 #define PCI_DEVICE_ID_DCI_PCCOM2 0x0004
+#define PCI_VENDOR_ID_GLENFLY 0x6766 + #define PCI_VENDOR_ID_INTEL 0x8086 #define PCI_DEVICE_ID_INTEL_EESSC 0x0008 #define PCI_DEVICE_ID_INTEL_HDA_CML_LP 0x02c8 diff --combined include/linux/perf_event.h index e336306b8c08e,794f660578780..fb908843f2092 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@@ -168,6 -168,9 +168,9 @@@ struct hw_perf_event struct hw_perf_event_extra extra_reg; struct hw_perf_event_extra branch_reg; }; + struct { /* aux / Intel-PT */ + u64 aux_config; + }; struct { /* software */ struct hrtimer hrtimer; }; @@@ -292,6 -295,19 +295,19 @@@ struct perf_event_pmu_context #define PERF_PMU_CAP_AUX_OUTPUT 0x0080 #define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
+ /** + * pmu::scope + */ + enum perf_pmu_scope { + PERF_PMU_SCOPE_NONE = 0, + PERF_PMU_SCOPE_CORE, + PERF_PMU_SCOPE_DIE, + PERF_PMU_SCOPE_CLUSTER, + PERF_PMU_SCOPE_PKG, + PERF_PMU_SCOPE_SYS_WIDE, + PERF_PMU_MAX_SCOPE, + }; + struct perf_output_handle;
#define PMU_NULL_DEV ((void *)(~0UL)) @@@ -315,6 -331,11 +331,11 @@@ struct pmu */ int capabilities;
+ /* + * PMU scope + */ + unsigned int scope; + int __percpu *pmu_disable_count; struct perf_cpu_pmu_context __percpu *cpu_pmu_context; atomic_t exclusive_cnt; /* < 0: cpu; > 0: tsk */ @@@ -615,10 -636,13 +636,13 @@@ typedef void (*perf_overflow_handler_t) * PERF_EV_CAP_SIBLING: An event with this flag must be a group sibling and * cannot be a group leader. If an event with this flag is detached from the * group it is scheduled out and moved into an unrecoverable ERROR state. + * PERF_EV_CAP_READ_SCOPE: A CPU event that can be read from any CPU of the + * PMU scope where it is active. */ #define PERF_EV_CAP_SOFTWARE BIT(0) #define PERF_EV_CAP_READ_ACTIVE_PKG BIT(1) #define PERF_EV_CAP_SIBLING BIT(2) + #define PERF_EV_CAP_READ_SCOPE BIT(3)
#define SWEVENT_HLIST_BITS 8 #define SWEVENT_HLIST_SIZE (1 << SWEVENT_HLIST_BITS) @@@ -963,12 -987,16 +987,16 @@@ struct perf_event_context struct rcu_head rcu_head;
/* - * Sum (event->pending_work + event->pending_work) + * The count of events for which using the switch-out fast path + * should be avoided. + * + * Sum (event->pending_work + events with + * (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ))) * * The SIGTRAP is targeted at ctx->task, as such it won't do changing * that until the signal is delivered. */ - local_t nr_pending; + local_t nr_no_switch_fast; };
struct perf_cpu_pmu_context { @@@ -1602,7 -1630,13 +1630,7 @@@ static inline int perf_is_paranoid(void return sysctl_perf_event_paranoid > -1; }
-static inline int perf_allow_kernel(struct perf_event_attr *attr) -{ - if (sysctl_perf_event_paranoid > 1 && !perfmon_capable()) - return -EACCES; - - return security_perf_event_open(attr, PERF_SECURITY_KERNEL); -} +int perf_allow_kernel(struct perf_event_attr *attr);
static inline int perf_allow_cpu(struct perf_event_attr *attr) { diff --combined include/linux/uprobes.h index 493dc95d912c9,2b294bf1881fe..e6f4e73125ffa --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@@ -16,6 -16,7 +16,7 @@@ #include <linux/types.h> #include <linux/wait.h>
+ struct uprobe; struct vm_area_struct; struct mm_struct; struct inode; @@@ -27,22 -28,22 +28,22 @@@ struct page
#define MAX_URETPROBE_DEPTH 64
- enum uprobe_filter_ctx { - UPROBE_FILTER_REGISTER, - UPROBE_FILTER_UNREGISTER, - UPROBE_FILTER_MMAP, - }; - struct uprobe_consumer { + /* + * handler() can return UPROBE_HANDLER_REMOVE to signal the need to + * unregister uprobe for current process. If UPROBE_HANDLER_REMOVE is + * returned, filter() callback has to be implemented as well and it + * should return false to "confirm" the decision to uninstall uprobe + * for the current process. If filter() is omitted or returns true, + * UPROBE_HANDLER_REMOVE is effectively ignored. + */ int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); int (*ret_handler)(struct uprobe_consumer *self, unsigned long func, struct pt_regs *regs); - bool (*filter)(struct uprobe_consumer *self, - enum uprobe_filter_ctx ctx, - struct mm_struct *mm); + bool (*filter)(struct uprobe_consumer *self, struct mm_struct *mm);
- struct uprobe_consumer *next; + struct list_head cons_node; };
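[Aside] The comment above describes a two-step contract: handler() may ask for removal with UPROBE_HANDLER_REMOVE, but removal only happens if a filter() exists and declines the current process; otherwise the return value is effectively ignored. A small userspace sketch of that dispatch rule (hypothetical names, not kernel code):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define HANDLER_REMOVE 1

struct consumer {
	int  (*handler)(struct consumer *c);
	bool (*filter)(struct consumer *c);	/* optional */
};

static void dispatch(struct consumer *c)
{
	int rc = c->handler(c);

	/* Removal only when the filter confirms it by returning false. */
	if (rc == HANDLER_REMOVE && c->filter && !c->filter(c))
		printf("unregister consumer for this process\n");
	else
		printf("keep consumer installed\n");
}

static int  want_remove(struct consumer *c) { (void)c; return HANDLER_REMOVE; }
static bool reject_proc(struct consumer *c) { (void)c; return false; }

int main(void)
{
	struct consumer no_filter = { .handler = want_remove, .filter = NULL };
	struct consumer filtered  = { .handler = want_remove, .filter = reject_proc };

	dispatch(&no_filter);	/* REMOVE ignored: keep consumer installed  */
	dispatch(&filtered);	/* filter() said false: unregister consumer */
	return 0;
}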
#ifdef CONFIG_UPROBES @@@ -76,6 -77,8 +77,8 @@@ struct uprobe_task struct uprobe *active_uprobe; unsigned long xol_vaddr;
+ struct arch_uprobe *auprobe; + struct return_instance *return_instances; unsigned int depth; }; @@@ -110,10 -113,10 +113,10 @@@ extern bool is_trap_insn(uprobe_opcode_ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); - extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); - extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc); - extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); - extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); + extern struct uprobe *uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc); + extern int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool); + extern void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc); + extern void uprobe_unregister_sync(void); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); @@@ -126,6 -129,7 +129,6 @@@ extern int uprobe_pre_sstep_notifier(st extern void uprobe_notify_resume(struct pt_regs *regs); extern bool uprobe_deny_signal(void); extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs); -extern void uprobe_clear_state(struct mm_struct *mm); extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr); extern int arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs); extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs); @@@ -150,22 -154,21 +153,21 @@@ static inline void uprobes_init(void
#define uprobe_get_trap_addr(regs) instruction_pointer(regs)
- static inline int - uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) - { - return -ENOSYS; - } - static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) + static inline struct uprobe * + uprobe_register(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) { - return -ENOSYS; + return ERR_PTR(-ENOSYS); } static inline int - uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add) + uprobe_apply(struct uprobe* uprobe, struct uprobe_consumer *uc, bool add) { return -ENOSYS; } static inline void - uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) + uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc) + { + } + static inline void uprobe_unregister_sync(void) { } static inline int uprobe_mmap(struct vm_area_struct *vma) diff --combined include/uapi/linux/elf.h index 81762ff3c99e1,e30a9b47dc87f..b9935988da5cf --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@@ -411,6 -411,7 +411,7 @@@ typedef struct elf64_shdr #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */ /* Old binutils treats 0x203 as a CET state */ #define NT_X86_SHSTK 0x204 /* x86 SHSTK state */ + #define NT_X86_XSAVE_LAYOUT 0x205 /* XSAVE layout description */ #define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */ #define NT_S390_TIMER 0x301 /* s390 timer register */ #define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */ @@@ -441,7 -442,6 +442,7 @@@ #define NT_ARM_ZA 0x40c /* ARM SME ZA registers */ #define NT_ARM_ZT 0x40d /* ARM SME ZT registers */ #define NT_ARM_FPMR 0x40e /* ARM floating point mode register */ +#define NT_ARM_POE 0x40f /* ARM POE registers */ #define NT_ARC_V2 0x600 /* ARCv2 accumulator/extra registers */ #define NT_VMCOREDD 0x700 /* Vmcore Device Dump Note */ #define NT_MIPS_DSP 0x800 /* MIPS DSP ASE registers */ diff --combined kernel/events/core.c index d932557d5664b,2766090de84e4..bfd5553e53b28 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@@ -155,20 -155,55 +155,55 @@@ static int cpu_function_call(int cpu, r return data.ret; }
+ enum event_type_t { + EVENT_FLEXIBLE = 0x01, + EVENT_PINNED = 0x02, + EVENT_TIME = 0x04, + EVENT_FROZEN = 0x08, + /* see ctx_resched() for details */ + EVENT_CPU = 0x10, + EVENT_CGROUP = 0x20, + + /* compound helpers */ + EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED, + EVENT_TIME_FROZEN = EVENT_TIME | EVENT_FROZEN, + }; + + static inline void __perf_ctx_lock(struct perf_event_context *ctx) + { + raw_spin_lock(&ctx->lock); + WARN_ON_ONCE(ctx->is_active & EVENT_FROZEN); + } + static void perf_ctx_lock(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) { - raw_spin_lock(&cpuctx->ctx.lock); + __perf_ctx_lock(&cpuctx->ctx); if (ctx) - raw_spin_lock(&ctx->lock); + __perf_ctx_lock(ctx); + } + + static inline void __perf_ctx_unlock(struct perf_event_context *ctx) + { + /* + * If ctx_sched_in() didn't again set any ALL flags, clean up + * after ctx_sched_out() by clearing is_active. + */ + if (ctx->is_active & EVENT_FROZEN) { + if (!(ctx->is_active & EVENT_ALL)) + ctx->is_active = 0; + else + ctx->is_active &= ~EVENT_FROZEN; + } + raw_spin_unlock(&ctx->lock); }
static void perf_ctx_unlock(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) { if (ctx) - raw_spin_unlock(&ctx->lock); - raw_spin_unlock(&cpuctx->ctx.lock); + __perf_ctx_unlock(ctx); + __perf_ctx_unlock(&cpuctx->ctx); }
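[Aside] __perf_ctx_unlock() above is pure flag bookkeeping: if nothing was scheduled back in while the context time was frozen, is_active collapses to 0; otherwise only the FROZEN marker is dropped. The same decision, isolated in a tiny userspace sketch (not kernel code):

#include <stdio.h>

#define EVENT_FLEXIBLE	0x01
#define EVENT_PINNED	0x02
#define EVENT_TIME	0x04
#define EVENT_FROZEN	0x08
#define EVENT_ALL	(EVENT_FLEXIBLE | EVENT_PINNED)

static unsigned int unlock_cleanup(unsigned int is_active)
{
	if (is_active & EVENT_FROZEN) {
		if (!(is_active & EVENT_ALL))
			is_active = 0;			/* nothing re-enabled */
		else
			is_active &= ~EVENT_FROZEN;	/* back to normal */
	}
	return is_active;
}

int main(void)
{
	/* Frozen and never rescheduled: fully inactive. */
	printf("%#x\n", unlock_cleanup(EVENT_TIME | EVENT_FROZEN));			/* 0   */
	/* Frozen, but pinned events were scheduled back in: keep TIME|PINNED. */
	printf("%#x\n", unlock_cleanup(EVENT_TIME | EVENT_FROZEN | EVENT_PINNED));	/* 0x6 */
	return 0;
}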
#define TASK_TOMBSTONE ((void *)-1L) @@@ -264,6 -299,7 +299,7 @@@ static void event_function_call(struct { struct perf_event_context *ctx = event->ctx; struct task_struct *task = READ_ONCE(ctx->task); /* verified in event_function */ + struct perf_cpu_context *cpuctx; struct event_function_struct efs = { .event = event, .func = func, @@@ -291,22 -327,25 +327,25 @@@ again if (!task_function_call(task, event_function, &efs)) return;
- raw_spin_lock_irq(&ctx->lock); + local_irq_disable(); + cpuctx = this_cpu_ptr(&perf_cpu_context); + perf_ctx_lock(cpuctx, ctx); /* * Reload the task pointer, it might have been changed by * a concurrent perf_event_context_sched_out(). */ task = ctx->task; - if (task == TASK_TOMBSTONE) { - raw_spin_unlock_irq(&ctx->lock); - return; - } + if (task == TASK_TOMBSTONE) + goto unlock; if (ctx->is_active) { - raw_spin_unlock_irq(&ctx->lock); + perf_ctx_unlock(cpuctx, ctx); + local_irq_enable(); goto again; } func(event, NULL, ctx, data); - raw_spin_unlock_irq(&ctx->lock); + unlock: + perf_ctx_unlock(cpuctx, ctx); + local_irq_enable(); }
/* @@@ -369,16 -408,6 +408,6 @@@ unlock (PERF_SAMPLE_BRANCH_KERNEL |\ PERF_SAMPLE_BRANCH_HV)
- enum event_type_t { - EVENT_FLEXIBLE = 0x1, - EVENT_PINNED = 0x2, - EVENT_TIME = 0x4, - /* see ctx_resched() for details */ - EVENT_CPU = 0x8, - EVENT_CGROUP = 0x10, - EVENT_ALL = EVENT_FLEXIBLE | EVENT_PINNED, - }; - /* * perf_sched_events : >0 events exist */ @@@ -407,6 -436,11 +436,11 @@@ static LIST_HEAD(pmus) static DEFINE_MUTEX(pmus_lock); static struct srcu_struct pmus_srcu; static cpumask_var_t perf_online_mask; + static cpumask_var_t perf_online_core_mask; + static cpumask_var_t perf_online_die_mask; + static cpumask_var_t perf_online_cluster_mask; + static cpumask_var_t perf_online_pkg_mask; + static cpumask_var_t perf_online_sys_mask; static struct kmem_cache *perf_event_cache;
/* @@@ -685,30 -719,32 +719,32 @@@ do { ___p; \ })
+ #define for_each_epc(_epc, _ctx, _pmu, _cgroup) \ + list_for_each_entry(_epc, &((_ctx)->pmu_ctx_list), pmu_ctx_entry) \ + if (_cgroup && !_epc->nr_cgroups) \ + continue; \ + else if (_pmu && _epc->pmu != _pmu) \ + continue; \ + else + static void perf_ctx_disable(struct perf_event_context *ctx, bool cgroup) { struct perf_event_pmu_context *pmu_ctx;
- list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) { - if (cgroup && !pmu_ctx->nr_cgroups) - continue; + for_each_epc(pmu_ctx, ctx, NULL, cgroup) perf_pmu_disable(pmu_ctx->pmu); - } }
static void perf_ctx_enable(struct perf_event_context *ctx, bool cgroup) { struct perf_event_pmu_context *pmu_ctx;
- list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) { - if (cgroup && !pmu_ctx->nr_cgroups) - continue; + for_each_epc(pmu_ctx, ctx, NULL, cgroup) perf_pmu_enable(pmu_ctx->pmu); - } }
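[Aside] for_each_epc() above filters inside the iterator by chaining "if (...) continue; else", so the macro still takes a single statement or block as its body, exactly like a plain for loop. The same trick in a self-contained userspace sketch (hypothetical macro, not kernel code):

#include <stdio.h>

static const int values[] = { 1, 2, 1, 3 };

/*
 * Iterate indexes of @arr whose value matches @want; mirrors the
 * "continue; else" filtering used by for_each_epc().
 */
#define for_each_match(i, arr, n, want)			\
	for (int i = 0; i < (n); i++)			\
		if ((arr)[i] != (want))			\
			continue;			\
		else

int main(void)
{
	for_each_match(i, values, 4, 1)
		printf("index %d matches\n", i);	/* prints 0 and 2 */
	return 0;
}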
- static void ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type); - static void ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type); + static void ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type); + static void ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type);
#ifdef CONFIG_CGROUP_PERF
@@@ -865,7 -901,7 +901,7 @@@ static void perf_cgroup_switch(struct t perf_ctx_lock(cpuctx, cpuctx->task_ctx); perf_ctx_disable(&cpuctx->ctx, true);
- ctx_sched_out(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP); + ctx_sched_out(&cpuctx->ctx, NULL, EVENT_ALL|EVENT_CGROUP); /* * must not be done before ctxswout due * to update_cgrp_time_from_cpuctx() in @@@ -877,7 -913,7 +913,7 @@@ * perf_cgroup_set_timestamp() in ctx_sched_in() * to not have to pass task around */ - ctx_sched_in(&cpuctx->ctx, EVENT_ALL|EVENT_CGROUP); + ctx_sched_in(&cpuctx->ctx, NULL, EVENT_ALL|EVENT_CGROUP);
perf_ctx_enable(&cpuctx->ctx, true); perf_ctx_unlock(cpuctx, cpuctx->task_ctx); @@@ -933,10 -969,10 +969,10 @@@ static inline int perf_cgroup_connect(i struct fd f = fdget(fd); int ret = 0;
- if (!f.file) + if (!fd_file(f)) return -EBADF;
- css = css_tryget_online_from_dir(f.file->f_path.dentry, + css = css_tryget_online_from_dir(fd_file(f)->f_path.dentry, &perf_event_cgrp_subsys); if (IS_ERR(css)) { ret = PTR_ERR(css); @@@ -1768,6 -1804,14 +1804,14 @@@ perf_event_groups_next(struct perf_even event = rb_entry_safe(rb_next(&event->group_node), \ typeof(*event), group_node))
+ /* + * Does the event attribute request inherit with PERF_SAMPLE_READ + */ + static inline bool has_inherit_and_sample_read(struct perf_event_attr *attr) + { + return attr->inherit && (attr->sample_type & PERF_SAMPLE_READ); + } + /* * Add an event from the lists for its context. * Must be called with ctx->mutex and ctx->lock held. @@@ -1798,6 -1842,8 +1842,8 @@@ list_add_event(struct perf_event *event ctx->nr_user++; if (event->attr.inherit_stat) ctx->nr_stat++; + if (has_inherit_and_sample_read(&event->attr)) + local_inc(&ctx->nr_no_switch_fast);
if (event->state > PERF_EVENT_STATE_OFF) perf_cgroup_event_enable(event, ctx); @@@ -2022,6 -2068,8 +2068,8 @@@ list_del_event(struct perf_event *event ctx->nr_user--; if (event->attr.inherit_stat) ctx->nr_stat--; + if (has_inherit_and_sample_read(&event->attr)) + local_dec(&ctx->nr_no_switch_fast);
list_del_rcu(&event->event_entry);
@@@ -2317,6 -2365,45 +2365,45 @@@ group_sched_out(struct perf_event *grou event_sched_out(event, ctx); }
+ static inline void + __ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx, bool final) + { + if (ctx->is_active & EVENT_TIME) { + if (ctx->is_active & EVENT_FROZEN) + return; + update_context_time(ctx); + update_cgrp_time_from_cpuctx(cpuctx, final); + } + } + + static inline void + ctx_time_update(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) + { + __ctx_time_update(cpuctx, ctx, false); + } + + /* + * To be used inside perf_ctx_lock() / perf_ctx_unlock(). Lasts until perf_ctx_unlock(). + */ + static inline void + ctx_time_freeze(struct perf_cpu_context *cpuctx, struct perf_event_context *ctx) + { + ctx_time_update(cpuctx, ctx); + if (ctx->is_active & EVENT_TIME) + ctx->is_active |= EVENT_FROZEN; + } + + static inline void + ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event) + { + if (ctx->is_active & EVENT_TIME) { + if (ctx->is_active & EVENT_FROZEN) + return; + update_context_time(ctx); + update_cgrp_time_from_event(event); + } + } + #define DETACH_GROUP 0x01UL #define DETACH_CHILD 0x02UL #define DETACH_DEAD 0x04UL @@@ -2336,10 -2423,7 +2423,7 @@@ __perf_remove_from_context(struct perf_ struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx; unsigned long flags = (unsigned long)info;
- if (ctx->is_active & EVENT_TIME) { - update_context_time(ctx); - update_cgrp_time_from_cpuctx(cpuctx, false); - } + ctx_time_update(cpuctx, ctx);
/* * Ensure event_sched_out() switches to OFF, at the very least @@@ -2424,12 -2508,8 +2508,8 @@@ static void __perf_event_disable(struc if (event->state < PERF_EVENT_STATE_INACTIVE) return;
- if (ctx->is_active & EVENT_TIME) { - update_context_time(ctx); - update_cgrp_time_from_event(event); - } - perf_pmu_disable(event->pmu_ctx->pmu); + ctx_time_update_event(ctx, event);
if (event == event->group_leader) group_sched_out(event, ctx); @@@ -2645,7 -2725,8 +2725,8 @@@ static void add_event_to_ctx(struct per }
static void task_ctx_sched_out(struct perf_event_context *ctx, - enum event_type_t event_type) + struct pmu *pmu, + enum event_type_t event_type) { struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
@@@ -2655,18 -2736,19 +2736,19 @@@ if (WARN_ON_ONCE(ctx != cpuctx->task_ctx)) return;
- ctx_sched_out(ctx, event_type); + ctx_sched_out(ctx, pmu, event_type); }
static void perf_event_sched_in(struct perf_cpu_context *cpuctx, - struct perf_event_context *ctx) + struct perf_event_context *ctx, + struct pmu *pmu) { - ctx_sched_in(&cpuctx->ctx, EVENT_PINNED); + ctx_sched_in(&cpuctx->ctx, pmu, EVENT_PINNED); if (ctx) - ctx_sched_in(ctx, EVENT_PINNED); - ctx_sched_in(&cpuctx->ctx, EVENT_FLEXIBLE); + ctx_sched_in(ctx, pmu, EVENT_PINNED); + ctx_sched_in(&cpuctx->ctx, pmu, EVENT_FLEXIBLE); if (ctx) - ctx_sched_in(ctx, EVENT_FLEXIBLE); + ctx_sched_in(ctx, pmu, EVENT_FLEXIBLE); }
/* @@@ -2684,16 -2766,12 +2766,12 @@@ * event_type is a bit mask of the types of events involved. For CPU events, * event_type is only either EVENT_PINNED or EVENT_FLEXIBLE. */ - /* - * XXX: ctx_resched() reschedule entire perf_event_context while adding new - * event to the context or enabling existing event in the context. We can - * probably optimize it by rescheduling only affected pmu_ctx. - */ static void ctx_resched(struct perf_cpu_context *cpuctx, struct perf_event_context *task_ctx, - enum event_type_t event_type) + struct pmu *pmu, enum event_type_t event_type) { bool cpu_event = !!(event_type & EVENT_CPU); + struct perf_event_pmu_context *epc;
/* * If pinned groups are involved, flexible groups also need to be @@@ -2704,10 -2782,14 +2782,14 @@@
event_type &= EVENT_ALL;
- perf_ctx_disable(&cpuctx->ctx, false); + for_each_epc(epc, &cpuctx->ctx, pmu, false) + perf_pmu_disable(epc->pmu); + if (task_ctx) { - perf_ctx_disable(task_ctx, false); - task_ctx_sched_out(task_ctx, event_type); + for_each_epc(epc, task_ctx, pmu, false) + perf_pmu_disable(epc->pmu); + + task_ctx_sched_out(task_ctx, pmu, event_type); }
/* @@@ -2718,15 -2800,19 +2800,19 @@@ * - otherwise, do nothing more. */ if (cpu_event) - ctx_sched_out(&cpuctx->ctx, event_type); + ctx_sched_out(&cpuctx->ctx, pmu, event_type); else if (event_type & EVENT_PINNED) - ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE); + ctx_sched_out(&cpuctx->ctx, pmu, EVENT_FLEXIBLE); + + perf_event_sched_in(cpuctx, task_ctx, pmu);
- perf_event_sched_in(cpuctx, task_ctx); + for_each_epc(epc, &cpuctx->ctx, pmu, false) + perf_pmu_enable(epc->pmu);
- perf_ctx_enable(&cpuctx->ctx, false); - if (task_ctx) - perf_ctx_enable(task_ctx, false); + if (task_ctx) { + for_each_epc(epc, task_ctx, pmu, false) + perf_pmu_enable(epc->pmu); + } }
void perf_pmu_resched(struct pmu *pmu) @@@ -2735,7 -2821,7 +2821,7 @@@ struct perf_event_context *task_ctx = cpuctx->task_ctx;
perf_ctx_lock(cpuctx, task_ctx); - ctx_resched(cpuctx, task_ctx, EVENT_ALL|EVENT_CPU); + ctx_resched(cpuctx, task_ctx, pmu, EVENT_ALL|EVENT_CPU); perf_ctx_unlock(cpuctx, task_ctx); }
@@@ -2791,9 -2877,10 +2877,10 @@@ static int __perf_install_in_context(v #endif
if (reprogram) { - ctx_sched_out(ctx, EVENT_TIME); + ctx_time_freeze(cpuctx, ctx); add_event_to_ctx(event, ctx); - ctx_resched(cpuctx, task_ctx, get_event_type(event)); + ctx_resched(cpuctx, task_ctx, event->pmu_ctx->pmu, + get_event_type(event)); } else { add_event_to_ctx(event, ctx); } @@@ -2936,8 -3023,7 +3023,7 @@@ static void __perf_event_enable(struct event->state <= PERF_EVENT_STATE_ERROR) return;
- if (ctx->is_active) - ctx_sched_out(ctx, EVENT_TIME); + ctx_time_freeze(cpuctx, ctx);
perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); perf_cgroup_event_enable(event, ctx); @@@ -2945,25 -3031,21 +3031,21 @@@ if (!ctx->is_active) return;
- if (!event_filter_match(event)) { - ctx_sched_in(ctx, EVENT_TIME); + if (!event_filter_match(event)) return; - }
/* * If the event is in a group and isn't the group leader, * then don't put it on unless the group is on. */ - if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) { - ctx_sched_in(ctx, EVENT_TIME); + if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) return; - }
task_ctx = cpuctx->task_ctx; if (ctx->task) WARN_ON_ONCE(task_ctx != ctx);
- ctx_resched(cpuctx, task_ctx, get_event_type(event)); + ctx_resched(cpuctx, task_ctx, event->pmu_ctx->pmu, get_event_type(event)); }
/* @@@ -3231,7 -3313,7 +3313,7 @@@ static void __pmu_ctx_sched_out(struct struct perf_event *event, *tmp; struct pmu *pmu = pmu_ctx->pmu;
- if (ctx->task && !ctx->is_active) { + if (ctx->task && !(ctx->is_active & EVENT_ALL)) { struct perf_cpu_pmu_context *cpc;
cpc = this_cpu_ptr(pmu->cpu_pmu_context); @@@ -3239,7 -3321,7 +3321,7 @@@ cpc->task_epc = NULL; }
- if (!event_type) + if (!(event_type & EVENT_ALL)) return;
perf_pmu_disable(pmu); @@@ -3265,8 -3347,17 +3347,17 @@@ perf_pmu_enable(pmu); }
+ /* + * Be very careful with the @pmu argument since this will change ctx state. + * The @pmu argument works for ctx_resched(), because that is symmetric in + * ctx_sched_out() / ctx_sched_in() usage and the ctx state ends up invariant. + * + * However, if you were to be asymmetrical, you could end up with messed up + * state, eg. ctx->is_active cleared even though most EPCs would still actually + * be active. + */ static void - ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type) + ctx_sched_out(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type) { struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context); struct perf_event_pmu_context *pmu_ctx; @@@ -3297,34 -3388,36 +3388,36 @@@ * * would only update time for the pinned events. */ - if (is_active & EVENT_TIME) { - /* update (and stop) ctx time */ - update_context_time(ctx); - update_cgrp_time_from_cpuctx(cpuctx, ctx == &cpuctx->ctx); + __ctx_time_update(cpuctx, ctx, ctx == &cpuctx->ctx); + + /* + * CPU-release for the below ->is_active store, + * see __load_acquire() in perf_event_time_now() + */ + barrier(); + ctx->is_active &= ~event_type; + + if (!(ctx->is_active & EVENT_ALL)) { /* - * CPU-release for the below ->is_active store, - * see __load_acquire() in perf_event_time_now() + * For FROZEN, preserve TIME|FROZEN such that perf_event_time_now() + * does not observe a hole. perf_ctx_unlock() will clean up. */ - barrier(); + if (ctx->is_active & EVENT_FROZEN) + ctx->is_active &= EVENT_TIME_FROZEN; + else + ctx->is_active = 0; }
- ctx->is_active &= ~event_type; - if (!(ctx->is_active & EVENT_ALL)) - ctx->is_active = 0; - if (ctx->task) { WARN_ON_ONCE(cpuctx->task_ctx != ctx); - if (!ctx->is_active) + if (!(ctx->is_active & EVENT_ALL)) cpuctx->task_ctx = NULL; }
is_active ^= ctx->is_active; /* changed bits */
- list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) { - if (cgroup && !pmu_ctx->nr_cgroups) - continue; + for_each_epc(pmu_ctx, ctx, pmu, cgroup) __pmu_ctx_sched_out(pmu_ctx, is_active); - } }
/* @@@ -3517,12 -3610,17 +3610,17 @@@ perf_event_context_sched_out(struct tas
perf_ctx_disable(ctx, false);
- /* PMIs are disabled; ctx->nr_pending is stable. */ - if (local_read(&ctx->nr_pending) || - local_read(&next_ctx->nr_pending)) { + /* PMIs are disabled; ctx->nr_no_switch_fast is stable. */ + if (local_read(&ctx->nr_no_switch_fast) || + local_read(&next_ctx->nr_no_switch_fast)) { /* * Must not swap out ctx when there's pending * events that rely on the ctx->task relation. + * + * Likewise, when a context contains inherit + + * SAMPLE_READ events they should be switched + * out using the slow path so that they are + * treated as if they were distinct contexts. */ raw_spin_unlock(&next_ctx->lock); rcu_read_unlock(); @@@ -3563,7 -3661,7 +3661,7 @@@ unlock
inside_switch: perf_ctx_sched_task_cb(ctx, false); - task_ctx_sched_out(ctx, EVENT_ALL); + task_ctx_sched_out(ctx, NULL, EVENT_ALL);
perf_ctx_enable(ctx, false); raw_spin_unlock(&ctx->lock); @@@ -3861,29 -3959,22 +3959,22 @@@ static void pmu_groups_sched_in(struct merge_sched_in, &can_add_hw); }
- static void ctx_groups_sched_in(struct perf_event_context *ctx, - struct perf_event_groups *groups, - bool cgroup) + static void __pmu_ctx_sched_in(struct perf_event_pmu_context *pmu_ctx, + enum event_type_t event_type) { - struct perf_event_pmu_context *pmu_ctx; - - list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) { - if (cgroup && !pmu_ctx->nr_cgroups) - continue; - pmu_groups_sched_in(ctx, groups, pmu_ctx->pmu); - } - } + struct perf_event_context *ctx = pmu_ctx->ctx;
- static void __pmu_ctx_sched_in(struct perf_event_context *ctx, - struct pmu *pmu) - { - pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu); + if (event_type & EVENT_PINNED) + pmu_groups_sched_in(ctx, &ctx->pinned_groups, pmu_ctx->pmu); + if (event_type & EVENT_FLEXIBLE) + pmu_groups_sched_in(ctx, &ctx->flexible_groups, pmu_ctx->pmu); }
static void - ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type) + ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu, enum event_type_t event_type) { struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context); + struct perf_event_pmu_context *pmu_ctx; int is_active = ctx->is_active; bool cgroup = event_type & EVENT_CGROUP;
@@@ -3907,7 -3998,7 +3998,7 @@@
ctx->is_active |= (event_type | EVENT_TIME); if (ctx->task) { - if (!is_active) + if (!(is_active & EVENT_ALL)) cpuctx->task_ctx = ctx; else WARN_ON_ONCE(cpuctx->task_ctx != ctx); @@@ -3919,12 -4010,16 +4010,16 @@@ * First go through the list and put on any pinned groups * in order to give them the best chance of going on. */ - if (is_active & EVENT_PINNED) - ctx_groups_sched_in(ctx, &ctx->pinned_groups, cgroup); + if (is_active & EVENT_PINNED) { + for_each_epc(pmu_ctx, ctx, pmu, cgroup) + __pmu_ctx_sched_in(pmu_ctx, EVENT_PINNED); + }
/* Then walk through the lower prio flexible groups */ - if (is_active & EVENT_FLEXIBLE) - ctx_groups_sched_in(ctx, &ctx->flexible_groups, cgroup); + if (is_active & EVENT_FLEXIBLE) { + for_each_epc(pmu_ctx, ctx, pmu, cgroup) + __pmu_ctx_sched_in(pmu_ctx, EVENT_FLEXIBLE); + } }
static void perf_event_context_sched_in(struct task_struct *task) @@@ -3967,10 -4062,10 +4062,10 @@@ */ if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) { perf_ctx_disable(&cpuctx->ctx, false); - ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE); + ctx_sched_out(&cpuctx->ctx, NULL, EVENT_FLEXIBLE); }
- perf_event_sched_in(cpuctx, ctx); + perf_event_sched_in(cpuctx, ctx, NULL);
perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
@@@ -4093,7 -4188,11 +4188,11 @@@ static void perf_adjust_period(struct p period = perf_calculate_period(event, nsec, count);
delta = (s64)(period - hwc->sample_period); - delta = (delta + 7) / 8; /* low pass filter */ + if (delta >= 0) + delta += 7; + else + delta -= 7; + delta /= 8; /* low pass filter */
sample_period = hwc->sample_period + delta;
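An aside on the perf_adjust_period() hunk above: C integer division truncates toward zero, so the old (delta + 7) / 8 form rounded positive deltas up while collapsing modest negative deltas to zero; the rewritten branch applies the same magnitude rounding in both directions. The standalone sketch below is not part of the patch, it only prints the two behaviours side by side for a few sample deltas:

/* Illustration only: compare the old and new "low pass filter" rounding. */
#include <stdio.h>

static long old_filter(long delta)
{
	return (delta + 7) / 8;
}

static long new_filter(long delta)
{
	if (delta >= 0)
		delta += 7;
	else
		delta -= 7;
	return delta / 8;
}

int main(void)
{
	long samples[] = { 40, 9, 1, 0, -1, -9, -40 };
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("delta=%4ld  old=%3ld  new=%3ld\n",
		       samples[i], old_filter(samples[i]), new_filter(samples[i]));
	return 0;
}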
@@@ -4311,14 -4410,14 +4410,14 @@@ static bool perf_rotate_context(struct update_context_time(&cpuctx->ctx); __pmu_ctx_sched_out(cpu_epc, EVENT_FLEXIBLE); rotate_ctx(&cpuctx->ctx, cpu_event); - __pmu_ctx_sched_in(&cpuctx->ctx, pmu); + __pmu_ctx_sched_in(cpu_epc, EVENT_FLEXIBLE); }
if (task_event) rotate_ctx(task_epc->ctx, task_event);
if (task_event || (task_epc && cpu_event)) - __pmu_ctx_sched_in(task_epc->ctx, pmu); + __pmu_ctx_sched_in(task_epc, EVENT_FLEXIBLE);
perf_pmu_enable(pmu); perf_ctx_unlock(cpuctx, cpuctx->task_ctx); @@@ -4384,7 -4483,7 +4483,7 @@@ static void perf_event_enable_on_exec(s
cpuctx = this_cpu_ptr(&perf_cpu_context); perf_ctx_lock(cpuctx, ctx); - ctx_sched_out(ctx, EVENT_TIME); + ctx_time_freeze(cpuctx, ctx);
list_for_each_entry(event, &ctx->event_list, event_entry) { enabled |= event_enable_on_exec(event, ctx); @@@ -4396,9 -4495,7 +4495,7 @@@ */ if (enabled) { clone_ctx = unclone_ctx(ctx); - ctx_resched(cpuctx, ctx, event_type); - } else { - ctx_sched_in(ctx, EVENT_TIME); + ctx_resched(cpuctx, ctx, NULL, event_type); } perf_ctx_unlock(cpuctx, ctx);
@@@ -4459,16 -4556,24 +4556,24 @@@ struct perf_read_data int ret; };
+ static inline const struct cpumask *perf_scope_cpu_topology_cpumask(unsigned int scope, int cpu); + static int __perf_event_read_cpu(struct perf_event *event, int event_cpu) { + int local_cpu = smp_processor_id(); u16 local_pkg, event_pkg;
if ((unsigned)event_cpu >= nr_cpu_ids) return event_cpu;
- if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) { - int local_cpu = smp_processor_id(); + if (event->group_caps & PERF_EV_CAP_READ_SCOPE) { + const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(event->pmu->scope, event_cpu);
+ if (cpumask && cpumask_test_cpu(local_cpu, cpumask)) + return local_cpu; + } + + if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) { event_pkg = topology_physical_package_id(event_cpu); local_pkg = topology_physical_package_id(local_cpu);
@@@ -4501,10 -4606,7 +4606,7 @@@ static void __perf_event_read(void *inf return;
raw_spin_lock(&ctx->lock); - if (ctx->is_active & EVENT_TIME) { - update_context_time(ctx); - update_cgrp_time_from_event(event); - } + ctx_time_update_event(ctx, event);
perf_event_update_time(event); if (data->group) @@@ -4539,8 -4641,11 +4641,11 @@@ unlock raw_spin_unlock(&ctx->lock); }
- static inline u64 perf_event_count(struct perf_event *event) + static inline u64 perf_event_count(struct perf_event *event, bool self) { + if (self) + return local64_read(&event->count); + return local64_read(&event->count) + atomic64_read(&event->child_count); }
@@@ -4701,10 -4806,7 +4806,7 @@@ again * May read while context is not active (e.g., thread is * blocked), in that case we cannot update context time */ - if (ctx->is_active & EVENT_TIME) { - update_context_time(ctx); - update_cgrp_time_from_event(event); - } + ctx_time_update_event(ctx, event);
perf_event_update_time(event); if (group) @@@ -5205,7 -5307,7 +5307,7 @@@ static void perf_pending_task_sync(stru */ if (task_work_cancel(current, head)) { event->pending_work = 0; - local_dec(&event->ctx->nr_pending); + local_dec(&event->ctx->nr_no_switch_fast); return; }
@@@ -5499,7 -5601,7 +5601,7 @@@ static u64 __perf_event_read_value(stru mutex_lock(&event->child_mutex);
(void)perf_event_read(event, false); - total += perf_event_count(event); + total += perf_event_count(event, false);
*enabled += event->total_time_enabled + atomic64_read(&event->child_total_time_enabled); @@@ -5508,7 -5610,7 +5610,7 @@@
list_for_each_entry(child, &event->child_list, child_list) { (void)perf_event_read(child, false); - total += perf_event_count(child); + total += perf_event_count(child, false); *enabled += child->total_time_enabled; *running += child->total_time_running; } @@@ -5590,14 -5692,14 +5692,14 @@@ static int __perf_read_group_add(struc /* * Write {count,id} tuples for every sibling. */ - values[n++] += perf_event_count(leader); + values[n++] += perf_event_count(leader, false); if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(leader); if (read_format & PERF_FORMAT_LOST) values[n++] = atomic64_read(&leader->lost_samples);
for_each_sibling_event(sub, leader) { - values[n++] += perf_event_count(sub); + values[n++] += perf_event_count(sub, false); if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(sub); if (read_format & PERF_FORMAT_LOST) @@@ -5899,10 -6001,10 +6001,10 @@@ static const struct file_operations per static inline int perf_fget_light(int fd, struct fd *p) { struct fd f = fdget(fd); - if (!f.file) + if (!fd_file(f)) return -EBADF;
- if (f.file->f_op != &perf_fops) { + if (fd_file(f)->f_op != &perf_fops) { fdput(f); return -EBADF; } @@@ -5962,7 -6064,7 +6064,7 @@@ static long _perf_ioctl(struct perf_eve ret = perf_fget_light(arg, &output); if (ret) return ret; - output_event = output.file->private_data; + output_event = fd_file(output)->private_data; ret = perf_event_set_output(event, output_event); fdput(output); } else { @@@ -6177,7 -6279,7 +6279,7 @@@ void perf_event_update_userpage(struct ++userpg->lock; barrier(); userpg->index = perf_event_index(event); - userpg->offset = perf_event_count(event); + userpg->offset = perf_event_count(event, false); if (userpg->index) userpg->offset -= local64_read(&event->hw.prev_count);
@@@ -6874,7 -6976,7 +6976,7 @@@ static void perf_pending_task(struct ca if (event->pending_work) { event->pending_work = 0; perf_sigtrap(event); - local_dec(&event->ctx->nr_pending); + local_dec(&event->ctx->nr_no_switch_fast); rcuwait_wake_up(&event->pending_work_wait); } rcu_read_unlock(); @@@ -7256,7 -7358,7 +7358,7 @@@ static void perf_output_read_one(struc u64 values[5]; int n = 0;
- values[n++] = perf_event_count(event); + values[n++] = perf_event_count(event, has_inherit_and_sample_read(&event->attr)); if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) { values[n++] = enabled + atomic64_read(&event->child_total_time_enabled); @@@ -7274,14 -7376,15 +7376,15 @@@ }
static void perf_output_read_group(struct perf_output_handle *handle, - struct perf_event *event, - u64 enabled, u64 running) + struct perf_event *event, + u64 enabled, u64 running) { struct perf_event *leader = event->group_leader, *sub; u64 read_format = event->attr.read_format; unsigned long flags; u64 values[6]; int n = 0; + bool self = has_inherit_and_sample_read(&event->attr);
/* * Disabling interrupts avoids all counter scheduling @@@ -7301,7 -7404,7 +7404,7 @@@ (leader->state == PERF_EVENT_STATE_ACTIVE)) leader->pmu->read(leader);
- values[n++] = perf_event_count(leader); + values[n++] = perf_event_count(leader, self); if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(leader); if (read_format & PERF_FORMAT_LOST) @@@ -7316,7 -7419,7 +7419,7 @@@ (sub->state == PERF_EVENT_STATE_ACTIVE)) sub->pmu->read(sub);
- values[n++] = perf_event_count(sub); + values[n++] = perf_event_count(sub, self); if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(sub); if (read_format & PERF_FORMAT_LOST) @@@ -7337,6 -7440,10 +7440,10 @@@ * The problem is that its both hard and excessively expensive to iterate the * child list, not to mention that its impossible to IPI the children running * on another CPU, from interrupt/NMI context. + * + * Instead the combination of PERF_SAMPLE_READ and inherit will track per-thread + * counts rather than attempting to accumulate some value across all children on + * all cores. */ static void perf_output_read(struct perf_output_handle *handle, struct perf_event *event) @@@ -8857,7 -8964,7 +8964,7 @@@ got_name mmap_event->event_id.header.size = sizeof(mmap_event->event_id) + size;
if (atomic_read(&nr_build_id_events)) - build_id_parse(vma, mmap_event->build_id, &mmap_event->build_id_size); + build_id_parse_nofault(vma, mmap_event->build_id, &mmap_event->build_id_size);
perf_iterate_sb(perf_event_mmap_output, mmap_event, @@@ -9747,7 -9854,7 +9854,7 @@@ static int __perf_event_overflow(struc if (!event->pending_work && !task_work_add(current, &event->pending_task, notify_mode)) { event->pending_work = pending_id; - local_inc(&event->ctx->nr_pending); + local_inc(&event->ctx->nr_no_switch_fast);
event->pending_addr = 0; if (valid_sample && (data->sample_flags & PERF_SAMPLE_ADDR)) @@@ -11484,10 -11591,60 +11591,60 @@@ perf_event_mux_interval_ms_store(struc } static DEVICE_ATTR_RW(perf_event_mux_interval_ms);
+ static inline const struct cpumask *perf_scope_cpu_topology_cpumask(unsigned int scope, int cpu) + { + switch (scope) { + case PERF_PMU_SCOPE_CORE: + return topology_sibling_cpumask(cpu); + case PERF_PMU_SCOPE_DIE: + return topology_die_cpumask(cpu); + case PERF_PMU_SCOPE_CLUSTER: + return topology_cluster_cpumask(cpu); + case PERF_PMU_SCOPE_PKG: + return topology_core_cpumask(cpu); + case PERF_PMU_SCOPE_SYS_WIDE: + return cpu_online_mask; + } + + return NULL; + } + + static inline struct cpumask *perf_scope_cpumask(unsigned int scope) + { + switch (scope) { + case PERF_PMU_SCOPE_CORE: + return perf_online_core_mask; + case PERF_PMU_SCOPE_DIE: + return perf_online_die_mask; + case PERF_PMU_SCOPE_CLUSTER: + return perf_online_cluster_mask; + case PERF_PMU_SCOPE_PKG: + return perf_online_pkg_mask; + case PERF_PMU_SCOPE_SYS_WIDE: + return perf_online_sys_mask; + } + + return NULL; + } + + static ssize_t cpumask_show(struct device *dev, struct device_attribute *attr, + char *buf) + { + struct pmu *pmu = dev_get_drvdata(dev); + struct cpumask *mask = perf_scope_cpumask(pmu->scope); + + if (mask) + return cpumap_print_to_pagebuf(true, buf, mask); + return 0; + } + + static DEVICE_ATTR_RO(cpumask); + static struct attribute *pmu_dev_attrs[] = { &dev_attr_type.attr, &dev_attr_perf_event_mux_interval_ms.attr, &dev_attr_nr_addr_filters.attr, + &dev_attr_cpumask.attr, NULL, };
@@@ -11499,6 -11656,10 +11656,10 @@@ static umode_t pmu_dev_is_visible(struc if (n == 2 && !pmu->nr_addr_filters) return 0;
+ /* cpumask */ + if (n == 3 && pmu->scope == PERF_PMU_SCOPE_NONE) + return 0; + return a->mode; }
@@@ -11583,6 -11744,11 +11744,11 @@@ int perf_pmu_register(struct pmu *pmu, goto free_pdc; }
+ if (WARN_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE, "Can not register a pmu with an invalid scope.\n")) { + ret = -EINVAL; + goto free_pdc; + } + pmu->name = name;
if (type >= 0) @@@ -11737,6 -11903,22 +11903,22 @@@ static int perf_try_init_event(struct p event_has_any_exclude_flag(event)) ret = -EINVAL;
+ if (pmu->scope != PERF_PMU_SCOPE_NONE && event->cpu >= 0) { + const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(pmu->scope, event->cpu); + struct cpumask *pmu_cpumask = perf_scope_cpumask(pmu->scope); + int cpu; + + if (pmu_cpumask && cpumask) { + cpu = cpumask_any_and(pmu_cpumask, cpumask); + if (cpu >= nr_cpu_ids) + ret = -ENODEV; + else + event->event_caps |= PERF_EV_CAP_READ_SCOPE; + } else { + ret = -ENODEV; + } + } + if (ret && event->destroy) event->destroy(event); } @@@ -12064,10 -12246,12 +12246,12 @@@ perf_event_alloc(struct perf_event_att local64_set(&hwc->period_left, hwc->sample_period);
/* - * We currently do not support PERF_SAMPLE_READ on inherited events. + * We do not support PERF_SAMPLE_READ on inherited events unless + * PERF_SAMPLE_TID is also selected, which allows inherited events to + * collect per-thread samples. * See perf_output_read(). */ - if (attr->inherit && (attr->sample_type & PERF_SAMPLE_READ)) + if (has_inherit_and_sample_read(attr) && !(attr->sample_type & PERF_SAMPLE_TID)) goto err_ns;
if (!has_branch_stack(event)) @@@ -12481,7 -12665,7 +12665,7 @@@ SYSCALL_DEFINE5(perf_event_open struct perf_event_attr attr; struct perf_event_context *ctx; struct file *event_file = NULL; - struct fd group = {NULL, 0}; + struct fd group = EMPTY_FD; struct task_struct *task = NULL; struct pmu *pmu; int event_fd; @@@ -12556,7 -12740,7 +12740,7 @@@ err = perf_fget_light(group_fd, &group); if (err) goto err_fd; - group_leader = group.file->private_data; + group_leader = fd_file(group)->private_data; if (flags & PERF_FLAG_FD_OUTPUT) output_event = group_leader; if (flags & PERF_FLAG_FD_NO_GROUP) @@@ -13091,7 -13275,7 +13275,7 @@@ static void sync_child_event(struct per perf_event_read_event(child_event, task); }
- child_val = perf_event_count(child_event); + child_val = perf_event_count(child_event, false);
/* * Add back the child's count to the parent's count: @@@ -13182,7 -13366,7 +13366,7 @@@ static void perf_event_exit_task_contex * in. */ raw_spin_lock_irq(&child_ctx->lock); - task_ctx_sched_out(child_ctx, EVENT_ALL); + task_ctx_sched_out(child_ctx, NULL, EVENT_ALL);
/* * Now that the context is inactive, destroy the task <-> ctx relation @@@ -13358,15 -13542,6 +13542,15 @@@ const struct perf_event_attr *perf_even return &event->attr; }
+int perf_allow_kernel(struct perf_event_attr *attr) +{ + if (sysctl_perf_event_paranoid > 1 && !perfmon_capable()) + return -EACCES; + + return security_perf_event_open(attr, PERF_SECURITY_KERNEL); +} +EXPORT_SYMBOL_GPL(perf_allow_kernel); + /* * Inherit an event from parent task to child task. * @@@ -13697,6 -13872,12 +13881,12 @@@ static void __init perf_event_init_all_ int cpu;
zalloc_cpumask_var(&perf_online_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_core_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_die_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_cluster_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_pkg_mask, GFP_KERNEL); + zalloc_cpumask_var(&perf_online_sys_mask, GFP_KERNEL); +
for_each_possible_cpu(cpu) { swhash = &per_cpu(swevent_htable, cpu); @@@ -13740,12 -13921,46 +13930,46 @@@ static void __perf_event_exit_context(v struct perf_event *event;
raw_spin_lock(&ctx->lock); - ctx_sched_out(ctx, EVENT_TIME); + ctx_sched_out(ctx, NULL, EVENT_TIME); list_for_each_entry(event, &ctx->event_list, event_entry) __perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP); raw_spin_unlock(&ctx->lock); }
+ static void perf_event_clear_cpumask(unsigned int cpu) + { + int target[PERF_PMU_MAX_SCOPE]; + unsigned int scope; + struct pmu *pmu; + + cpumask_clear_cpu(cpu, perf_online_mask); + + for (scope = PERF_PMU_SCOPE_NONE + 1; scope < PERF_PMU_MAX_SCOPE; scope++) { + const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(scope, cpu); + struct cpumask *pmu_cpumask = perf_scope_cpumask(scope); + + target[scope] = -1; + if (WARN_ON_ONCE(!pmu_cpumask || !cpumask)) + continue; + + if (!cpumask_test_and_clear_cpu(cpu, pmu_cpumask)) + continue; + target[scope] = cpumask_any_but(cpumask, cpu); + if (target[scope] < nr_cpu_ids) + cpumask_set_cpu(target[scope], pmu_cpumask); + } + + /* migrate */ + list_for_each_entry_rcu(pmu, &pmus, entry, lockdep_is_held(&pmus_srcu)) { + if (pmu->scope == PERF_PMU_SCOPE_NONE || + WARN_ON_ONCE(pmu->scope >= PERF_PMU_MAX_SCOPE)) + continue; + + if (target[pmu->scope] >= 0 && target[pmu->scope] < nr_cpu_ids) + perf_pmu_migrate_context(pmu, cpu, target[pmu->scope]); + } + } + static void perf_event_exit_cpu_context(int cpu) { struct perf_cpu_context *cpuctx; @@@ -13753,6 -13968,11 +13977,11 @@@
// XXX simplify cpuctx->online mutex_lock(&pmus_lock); + /* + * Clear the cpumasks, and migrate to other CPUs if possible. + * Must be invoked before the __perf_event_exit_context. + */ + perf_event_clear_cpumask(cpu); cpuctx = per_cpu_ptr(&perf_cpu_context, cpu); ctx = &cpuctx->ctx;
@@@ -13760,7 -13980,6 +13989,6 @@@ smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1); cpuctx->online = 0; mutex_unlock(&ctx->mutex); - cpumask_clear_cpu(cpu, perf_online_mask); mutex_unlock(&pmus_lock); } #else @@@ -13769,6 -13988,42 +13997,42 @@@ static void perf_event_exit_cpu_context
#endif
+ static void perf_event_setup_cpumask(unsigned int cpu)
+ {
+ 	struct cpumask *pmu_cpumask;
+ 	unsigned int scope;
+ 
+ 	cpumask_set_cpu(cpu, perf_online_mask);
+ 
+ 	/*
+ 	 * Early boot stage, the cpumask hasn't been set yet.
+ 	 * The perf_online_<domain>_masks includes the first CPU of each domain.
+ 	 * Always unconditionally set the boot CPU for the perf_online_<domain>_masks.
+ 	 */
+ 	if (!topology_sibling_cpumask(cpu)) {
+ 		for (scope = PERF_PMU_SCOPE_NONE + 1; scope < PERF_PMU_MAX_SCOPE; scope++) {
+ 			pmu_cpumask = perf_scope_cpumask(scope);
+ 			if (WARN_ON_ONCE(!pmu_cpumask))
+ 				continue;
+ 			cpumask_set_cpu(cpu, pmu_cpumask);
+ 		}
+ 		return;
+ 	}
+ 
+ 	for (scope = PERF_PMU_SCOPE_NONE + 1; scope < PERF_PMU_MAX_SCOPE; scope++) {
+ 		const struct cpumask *cpumask = perf_scope_cpu_topology_cpumask(scope, cpu);
+ 
+ 		pmu_cpumask = perf_scope_cpumask(scope);
+ 
+ 		if (WARN_ON_ONCE(!pmu_cpumask || !cpumask))
+ 			continue;
+ 
+ 		if (!cpumask_empty(cpumask) &&
+ 		    cpumask_any_and(pmu_cpumask, cpumask) >= nr_cpu_ids)
+ 			cpumask_set_cpu(cpu, pmu_cpumask);
+ 	}
+ }
+ 
 int perf_event_init_cpu(unsigned int cpu)
 {
 	struct perf_cpu_context *cpuctx;
@@@ -13777,7 -14032,7 +14041,7 @@@
 	perf_swevent_init_cpu(cpu);
 
mutex_lock(&pmus_lock); - cpumask_set_cpu(cpu, perf_online_mask); + perf_event_setup_cpumask(cpu); cpuctx = per_cpu_ptr(&perf_cpu_context, cpu); ctx = &cpuctx->ctx;
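For context on the PERF_SAMPLE_READ handling in the perf_event_alloc() and perf_output_read() hunks above: an inherited event may now carry PERF_SAMPLE_READ provided PERF_SAMPLE_TID is selected as well, so samples report per-thread counts instead of an accumulated child total. The sketch below is not part of the patch; it is one plausible way to request such an event from userspace, and the event type, config and sample period are arbitrary assumptions:

/* Illustration only: open an inherited sampling event whose samples carry
 * per-thread PERF_SAMPLE_READ counts (valid only together with PERF_SAMPLE_TID). */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	long fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_READ; /* TID makes READ acceptable with inherit */
	attr.read_format = PERF_FORMAT_ID;
	attr.inherit = 1;		/* follow child threads/processes */
	attr.exclude_kernel = 1;
	attr.disabled = 1;

	fd = perf_event_open(&attr, 0 /* self */, -1 /* any cpu */, -1, 0);
	if (fd < 0)
		perror("perf_event_open");
	else
		printf("opened inherited sampling event, fd=%ld\n", fd);
	return fd < 0;
}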
diff --combined kernel/events/uprobes.c index 5afd00f264314,4b7e590dc428e..86fcb2386ea2f --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@@ -40,6 -40,9 +40,9 @@@ static struct rb_root uprobes_tree = RB #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ + static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock); + + DEFINE_STATIC_SRCU(uprobes_srcu);
#define UPROBES_HASH_SZ 13 /* serialize uprobe->pending_list */ @@@ -57,8 -60,9 +60,9 @@@ struct uprobe struct rw_semaphore register_rwsem; struct rw_semaphore consumer_rwsem; struct list_head pending_list; - struct uprobe_consumer *consumers; + struct list_head consumers; struct inode *inode; /* Also hold a ref to inode */ + struct rcu_head rcu; loff_t offset; loff_t ref_ctr_offset; unsigned long flags; @@@ -109,6 -113,11 +113,11 @@@ struct xol_area unsigned long vaddr; /* Page(s) of instruction slots */ };
+ static void uprobe_warn(struct task_struct *t, const char *msg) + { + pr_warn("uprobe: %s:%d failed to %s\n", current->comm, current->pid, msg); + } + /* * valid_vma: Verify if the specified vma is an executable vma * Relax restrictions while unregistering: vm_flags might have @@@ -453,7 -462,7 +462,7 @@@ static int update_ref_ctr(struct uprob * @vaddr: the virtual address to store the opcode. * @opcode: opcode to be written at @vaddr. * - * Called with mm->mmap_lock held for write. + * Called with mm->mmap_lock held for read or write. * Return 0 (success) or a negative errno. */ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, @@@ -587,25 -596,63 +596,63 @@@ set_orig_insn(struct arch_uprobe *aupro *(uprobe_opcode_t *)&auprobe->insn); }
+ /* uprobe should have guaranteed positive refcount */ static struct uprobe *get_uprobe(struct uprobe *uprobe) { refcount_inc(&uprobe->ref); return uprobe; }
+ /* + * uprobe should have guaranteed lifetime, which can be either of: + * - caller already has refcount taken (and wants an extra one); + * - uprobe is RCU protected and won't be freed until after grace period; + * - we are holding uprobes_treelock (for read or write, doesn't matter). + */ + static struct uprobe *try_get_uprobe(struct uprobe *uprobe) + { + if (refcount_inc_not_zero(&uprobe->ref)) + return uprobe; + return NULL; + } + + static inline bool uprobe_is_active(struct uprobe *uprobe) + { + return !RB_EMPTY_NODE(&uprobe->rb_node); + } + + static void uprobe_free_rcu(struct rcu_head *rcu) + { + struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu); + + kfree(uprobe); + } + static void put_uprobe(struct uprobe *uprobe) { - if (refcount_dec_and_test(&uprobe->ref)) { - /* - * If application munmap(exec_vma) before uprobe_unregister() - * gets called, we don't get a chance to remove uprobe from - * delayed_uprobe_list from remove_breakpoint(). Do it here. - */ - mutex_lock(&delayed_uprobe_lock); - delayed_uprobe_remove(uprobe, NULL); - mutex_unlock(&delayed_uprobe_lock); - kfree(uprobe); + if (!refcount_dec_and_test(&uprobe->ref)) + return; + + write_lock(&uprobes_treelock); + + if (uprobe_is_active(uprobe)) { + write_seqcount_begin(&uprobes_seqcount); + rb_erase(&uprobe->rb_node, &uprobes_tree); + write_seqcount_end(&uprobes_seqcount); } + + write_unlock(&uprobes_treelock); + + /* + * If application munmap(exec_vma) before uprobe_unregister() + * gets called, we don't get a chance to remove uprobe from + * delayed_uprobe_list from remove_breakpoint(). Do it here. + */ + mutex_lock(&delayed_uprobe_lock); + delayed_uprobe_remove(uprobe, NULL); + mutex_unlock(&delayed_uprobe_lock); + + call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu); }
static __always_inline @@@ -647,62 -694,86 +694,86 @@@ static inline int __uprobe_cmp(struct r return uprobe_cmp(u->inode, u->offset, __node_2_uprobe(b)); }
- static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset) + /* + * Assumes being inside RCU protected region. + * No refcount is taken on returned uprobe. + */ + static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset) { struct __uprobe_key key = { .inode = inode, .offset = offset, }; - struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key); + struct rb_node *node; + unsigned int seq;
- if (node) - return get_uprobe(__node_2_uprobe(node)); + lockdep_assert(srcu_read_lock_held(&uprobes_srcu)); + + do { + seq = read_seqcount_begin(&uprobes_seqcount); + node = rb_find_rcu(&key, &uprobes_tree, __uprobe_cmp_key); + /* + * Lockless RB-tree lookups can result only in false negatives. + * If the element is found, it is correct and can be returned + * under RCU protection. If we find nothing, we need to + * validate that seqcount didn't change. If it did, we have to + * try again as we might have missed the element (false + * negative). If seqcount is unchanged, search truly failed. + */ + if (node) + return __node_2_uprobe(node); + } while (read_seqcount_retry(&uprobes_seqcount, seq));
return NULL; }
/* - * Find a uprobe corresponding to a given inode:offset - * Acquires uprobes_treelock + * Attempt to insert a new uprobe into uprobes_tree. + * + * If uprobe already exists (for given inode+offset), we just increment + * refcount of previously existing uprobe. + * + * If not, a provided new instance of uprobe is inserted into the tree (with + * assumed initial refcount == 1). + * + * In any case, we return a uprobe instance that ends up being in uprobes_tree. + * Caller has to clean up new uprobe instance, if it ended up not being + * inserted into the tree. + * + * We assume that uprobes_treelock is held for writing. */ - static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) - { - struct uprobe *uprobe; - - read_lock(&uprobes_treelock); - uprobe = __find_uprobe(inode, offset); - read_unlock(&uprobes_treelock); - - return uprobe; - } - static struct uprobe *__insert_uprobe(struct uprobe *uprobe) { struct rb_node *node; + again: + node = rb_find_add_rcu(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp); + if (node) { + struct uprobe *u = __node_2_uprobe(node); + + if (!try_get_uprobe(u)) { + rb_erase(node, &uprobes_tree); + RB_CLEAR_NODE(&u->rb_node); + goto again; + }
- node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp); - if (node) - return get_uprobe(__node_2_uprobe(node)); + return u; + }
- /* get access + creation ref */ - refcount_set(&uprobe->ref, 2); - return NULL; + return uprobe; }
/* - * Acquire uprobes_treelock. - * Matching uprobe already exists in rbtree; - * increment (access refcount) and return the matching uprobe. - * - * No matching uprobe; insert the uprobe in rb_tree; - * get a double refcount (access + creation) and return NULL. + * Acquire uprobes_treelock and insert uprobe into uprobes_tree + * (or reuse existing one, see __insert_uprobe() comments above). */ static struct uprobe *insert_uprobe(struct uprobe *uprobe) { struct uprobe *u;
write_lock(&uprobes_treelock); + write_seqcount_begin(&uprobes_seqcount); u = __insert_uprobe(uprobe); + write_seqcount_end(&uprobes_seqcount); write_unlock(&uprobes_treelock);
return u; @@@ -725,18 -796,21 +796,21 @@@ static struct uprobe *alloc_uprobe(stru
uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); if (!uprobe) - return NULL; + return ERR_PTR(-ENOMEM);
uprobe->inode = inode; uprobe->offset = offset; uprobe->ref_ctr_offset = ref_ctr_offset; + INIT_LIST_HEAD(&uprobe->consumers); init_rwsem(&uprobe->register_rwsem); init_rwsem(&uprobe->consumer_rwsem); + RB_CLEAR_NODE(&uprobe->rb_node); + refcount_set(&uprobe->ref, 1);
/* add to uprobes_tree, sorted on inode:offset */ cur_uprobe = insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ - if (cur_uprobe) { + if (cur_uprobe != uprobe) { if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { ref_ctr_mismatch_warn(cur_uprobe, uprobe); put_uprobe(cur_uprobe); @@@ -753,32 -827,19 +827,19 @@@ static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc) { down_write(&uprobe->consumer_rwsem); - uc->next = uprobe->consumers; - uprobe->consumers = uc; + list_add_rcu(&uc->cons_node, &uprobe->consumers); up_write(&uprobe->consumer_rwsem); }
/* * For uprobe @uprobe, delete the consumer @uc. - * Return true if the @uc is deleted successfully - * or return false. + * Should never be called with consumer that's not part of @uprobe->consumers. */ - static bool consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc) + static void consumer_del(struct uprobe *uprobe, struct uprobe_consumer *uc) { - struct uprobe_consumer **con; - bool ret = false; - down_write(&uprobe->consumer_rwsem); - for (con = &uprobe->consumers; *con; con = &(*con)->next) { - if (*con == uc) { - *con = uc->next; - ret = true; - break; - } - } + list_del_rcu(&uc->cons_node); up_write(&uprobe->consumer_rwsem); - - return ret; }
static int __copy_insn(struct address_space *mapping, struct file *filp, @@@ -863,21 -924,20 +924,20 @@@ static int prepare_uprobe(struct uprob return ret; }
- static inline bool consumer_filter(struct uprobe_consumer *uc, - enum uprobe_filter_ctx ctx, struct mm_struct *mm) + static inline bool consumer_filter(struct uprobe_consumer *uc, struct mm_struct *mm) { - return !uc->filter || uc->filter(uc, ctx, mm); + return !uc->filter || uc->filter(uc, mm); }
- static bool filter_chain(struct uprobe *uprobe, - enum uprobe_filter_ctx ctx, struct mm_struct *mm) + static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm) { struct uprobe_consumer *uc; bool ret = false;
down_read(&uprobe->consumer_rwsem); - for (uc = uprobe->consumers; uc; uc = uc->next) { - ret = consumer_filter(uc, ctx, mm); + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node, + srcu_read_lock_held(&uprobes_srcu)) { + ret = consumer_filter(uc, mm); if (ret) break; } @@@ -921,27 -981,6 +981,6 @@@ remove_breakpoint(struct uprobe *uprobe return set_orig_insn(&uprobe->arch, mm, vaddr); }
- static inline bool uprobe_is_active(struct uprobe *uprobe) - { - return !RB_EMPTY_NODE(&uprobe->rb_node); - } - /* - * There could be threads that have already hit the breakpoint. They - * will recheck the current insn and restart if find_uprobe() fails. - * See find_active_uprobe(). - */ - static void delete_uprobe(struct uprobe *uprobe) - { - if (WARN_ON(!uprobe_is_active(uprobe))) - return; - - write_lock(&uprobes_treelock); - rb_erase(&uprobe->rb_node, &uprobes_tree); - write_unlock(&uprobes_treelock); - RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ - put_uprobe(uprobe); - } - struct map_info { struct map_info *next; struct mm_struct *mm; @@@ -1046,7 -1085,13 +1085,13 @@@ register_for_each_vma(struct uprobe *up
if (err && is_register) goto free; - + /* + * We take mmap_lock for writing to avoid the race with + * find_active_uprobe_rcu() which takes mmap_lock for reading. + * Thus this install_breakpoint() can not make + * is_trap_at_addr() true right after find_uprobe_rcu() + * returns NULL in find_active_uprobe_rcu(). + */ mmap_write_lock(mm); vma = find_vma(mm, info->vaddr); if (!vma || !valid_vma(vma, is_register) || @@@ -1059,12 -1104,10 +1104,10 @@@
if (is_register) { /* consult only the "caller", new consumer. */ - if (consumer_filter(new, - UPROBE_FILTER_REGISTER, mm)) + if (consumer_filter(new, mm)) err = install_breakpoint(uprobe, mm, vma, info->vaddr); } else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) { - if (!filter_chain(uprobe, - UPROBE_FILTER_UNREGISTER, mm)) + if (!filter_chain(uprobe, mm)) err |= remove_breakpoint(uprobe, mm, info->vaddr); }
@@@ -1079,152 -1122,140 +1122,140 @@@ return err; }
- static void - __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) + /** + * uprobe_unregister_nosync - unregister an already registered probe. + * @uprobe: uprobe to remove + * @uc: identify which probe if multiple probes are colocated. + */ + void uprobe_unregister_nosync(struct uprobe *uprobe, struct uprobe_consumer *uc) { int err;
- if (WARN_ON(!consumer_del(uprobe, uc))) - return; - + down_write(&uprobe->register_rwsem); + consumer_del(uprobe, uc); err = register_for_each_vma(uprobe, NULL); - /* TODO : cant unregister? schedule a worker thread */ - if (!uprobe->consumers && !err) - delete_uprobe(uprobe); - } - - /* - * uprobe_unregister - unregister an already registered probe. - * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. - */ - void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) - { - struct uprobe *uprobe; + up_write(&uprobe->register_rwsem);
- uprobe = find_uprobe(inode, offset); - if (WARN_ON(!uprobe)) + /* TODO : cant unregister? schedule a worker thread */ + if (unlikely(err)) { + uprobe_warn(current, "unregister, leaking uprobe"); return; + }
- down_write(&uprobe->register_rwsem); - __uprobe_unregister(uprobe, uc); - up_write(&uprobe->register_rwsem); put_uprobe(uprobe); } - EXPORT_SYMBOL_GPL(uprobe_unregister); + EXPORT_SYMBOL_GPL(uprobe_unregister_nosync);
- /* - * __uprobe_register - register a probe + void uprobe_unregister_sync(void) + { + /* + * Now that handler_chain() and handle_uretprobe_chain() iterate over + * uprobe->consumers list under RCU protection without holding + * uprobe->register_rwsem, we need to wait for RCU grace period to + * make sure that we can't call into just unregistered + * uprobe_consumer's callbacks anymore. If we don't do that, fast and + * unlucky enough caller can free consumer's memory and cause + * handler_chain() or handle_uretprobe_chain() to do an use-after-free. + */ + synchronize_srcu(&uprobes_srcu); + } + EXPORT_SYMBOL_GPL(uprobe_unregister_sync); + + /** + * uprobe_register - register a probe * @inode: the file in which the probe has to be placed. * @offset: offset from the start of the file. + * @ref_ctr_offset: offset of SDT marker / reference counter * @uc: information on howto handle the probe.. * - * Apart from the access refcount, __uprobe_register() takes a creation + * Apart from the access refcount, uprobe_register() takes a creation * refcount (thro alloc_uprobe) if and only if this @uprobe is getting * inserted into the rbtree (i.e first consumer for a @inode:@offset * tuple). Creation refcount stops uprobe_unregister from freeing the * @uprobe even before the register operation is complete. Creation * refcount is released when the last @uc for the @uprobe - * unregisters. Caller of __uprobe_register() is required to keep @inode + * unregisters. Caller of uprobe_register() is required to keep @inode * (and the containing mount) referenced. * - * Return errno if it cannot successully install probes - * else return 0 (success) + * Return: pointer to the new uprobe on success or an ERR_PTR on failure. */ - static int __uprobe_register(struct inode *inode, loff_t offset, - loff_t ref_ctr_offset, struct uprobe_consumer *uc) + struct uprobe *uprobe_register(struct inode *inode, + loff_t offset, loff_t ref_ctr_offset, + struct uprobe_consumer *uc) { struct uprobe *uprobe; int ret;
/* Uprobe must have at least one set consumer */ if (!uc->handler && !uc->ret_handler) - return -EINVAL; + return ERR_PTR(-EINVAL);
/* copy_insn() uses read_mapping_page() or shmem_read_mapping_page() */ if (!inode->i_mapping->a_ops->read_folio && !shmem_mapping(inode->i_mapping)) - return -EIO; + return ERR_PTR(-EIO); /* Racy, just to catch the obvious mistakes */ if (offset > i_size_read(inode)) - return -EINVAL; + return ERR_PTR(-EINVAL);
/* * This ensures that copy_from_page(), copy_to_page() and * __update_ref_ctr() can't cross page boundary. */ if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE)) - return -EINVAL; + return ERR_PTR(-EINVAL); if (!IS_ALIGNED(ref_ctr_offset, sizeof(short))) - return -EINVAL; + return ERR_PTR(-EINVAL);
- retry: uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); - if (!uprobe) - return -ENOMEM; if (IS_ERR(uprobe)) - return PTR_ERR(uprobe); + return uprobe;
- /* - * We can race with uprobe_unregister()->delete_uprobe(). - * Check uprobe_is_active() and retry if it is false. - */ down_write(&uprobe->register_rwsem); - ret = -EAGAIN; - if (likely(uprobe_is_active(uprobe))) { - consumer_add(uprobe, uc); - ret = register_for_each_vma(uprobe, uc); - if (ret) - __uprobe_unregister(uprobe, uc); - } + consumer_add(uprobe, uc); + ret = register_for_each_vma(uprobe, uc); up_write(&uprobe->register_rwsem); - put_uprobe(uprobe);
- if (unlikely(ret == -EAGAIN)) - goto retry; - return ret; - } + if (ret) { + uprobe_unregister_nosync(uprobe, uc); + /* + * Registration might have partially succeeded, so we can have + * this consumer being called right at this time. We need to + * sync here. It's ok, it's unlikely slow path. + */ + uprobe_unregister_sync(); + return ERR_PTR(ret); + }
- int uprobe_register(struct inode *inode, loff_t offset, - struct uprobe_consumer *uc) - { - return __uprobe_register(inode, offset, 0, uc); + return uprobe; } EXPORT_SYMBOL_GPL(uprobe_register);
- int uprobe_register_refctr(struct inode *inode, loff_t offset, - loff_t ref_ctr_offset, struct uprobe_consumer *uc) - { - return __uprobe_register(inode, offset, ref_ctr_offset, uc); - } - EXPORT_SYMBOL_GPL(uprobe_register_refctr); - - /* - * uprobe_apply - unregister an already registered probe. - * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. + /** + * uprobe_apply - add or remove the breakpoints according to @uc->filter + * @uprobe: uprobe which "owns" the breakpoint * @uc: consumer which wants to add more or remove some breakpoints * @add: add or remove the breakpoints + * Return: 0 on success or negative error code. */ - int uprobe_apply(struct inode *inode, loff_t offset, - struct uprobe_consumer *uc, bool add) + int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add) { - struct uprobe *uprobe; struct uprobe_consumer *con; - int ret = -ENOENT; - - uprobe = find_uprobe(inode, offset); - if (WARN_ON(!uprobe)) - return ret; + int ret = -ENOENT, srcu_idx;
down_write(&uprobe->register_rwsem); - for (con = uprobe->consumers; con && con != uc ; con = con->next) - ; - if (con) - ret = register_for_each_vma(uprobe, add ? uc : NULL); + + srcu_idx = srcu_read_lock(&uprobes_srcu); + list_for_each_entry_srcu(con, &uprobe->consumers, cons_node, + srcu_read_lock_held(&uprobes_srcu)) { + if (con == uc) { + ret = register_for_each_vma(uprobe, add ? uc : NULL); + break; + } + } + srcu_read_unlock(&uprobes_srcu, srcu_idx); + up_write(&uprobe->register_rwsem); - put_uprobe(uprobe);
return ret; } @@@ -1305,15 -1336,17 +1336,17 @@@ static void build_probe_list(struct ino u = rb_entry(t, struct uprobe, rb_node); if (u->inode != inode || u->offset < min) break; - list_add(&u->pending_list, head); - get_uprobe(u); + /* if uprobe went away, it's safe to ignore it */ + if (try_get_uprobe(u)) + list_add(&u->pending_list, head); } for (t = n; (t = rb_next(t)); ) { u = rb_entry(t, struct uprobe, rb_node); if (u->inode != inode || u->offset > max) break; - list_add(&u->pending_list, head); - get_uprobe(u); + /* if uprobe went away, it's safe to ignore it */ + if (try_get_uprobe(u)) + list_add(&u->pending_list, head); } } read_unlock(&uprobes_treelock); @@@ -1384,7 -1417,7 +1417,7 @@@ int uprobe_mmap(struct vm_area_struct * */ list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) { if (!fatal_signal_pending(current) && - filter_chain(uprobe, UPROBE_FILTER_MMAP, vma->vm_mm)) { + filter_chain(uprobe, vma->vm_mm)) { unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset); install_breakpoint(uprobe, vma->vm_mm, vma, vaddr); } @@@ -1482,22 -1515,6 +1515,22 @@@ void * __weak arch_uprobe_trampoline(un return &insn; }
+/* + * uprobe_clear_state - Free the area allocated for slots. + */ +static void uprobe_clear_state(const struct vm_special_mapping *sm, struct vm_area_struct *vma) +{ + struct xol_area *area = container_of(vma->vm_private_data, struct xol_area, xol_mapping); + + mutex_lock(&delayed_uprobe_lock); + delayed_uprobe_remove(NULL, vma->vm_mm); + mutex_unlock(&delayed_uprobe_lock); + + put_page(area->pages[0]); + kfree(area->bitmap); + kfree(area); +} + static struct xol_area *__create_xol_area(unsigned long vaddr) { struct mm_struct *mm = current->mm; @@@ -1515,7 -1532,6 +1548,7 @@@ goto free_area;
area->xol_mapping.name = "[uprobes]"; + area->xol_mapping.close = uprobe_clear_state; area->xol_mapping.pages = area->pages; area->pages[0] = alloc_page(GFP_HIGHUSER); if (!area->pages[0]) @@@ -1561,6 -1577,25 +1594,6 @@@ static struct xol_area *get_xol_area(vo return area; }
-/* - * uprobe_clear_state - Free the area allocated for slots. - */ -void uprobe_clear_state(struct mm_struct *mm) -{ - struct xol_area *area = mm->uprobes_state.xol_area; - - mutex_lock(&delayed_uprobe_lock); - delayed_uprobe_remove(NULL, mm); - mutex_unlock(&delayed_uprobe_lock); - - if (!area) - return; - - put_page(area->pages[0]); - kfree(area->bitmap); - kfree(area); -} - void uprobe_start_dup_mmap(void) { percpu_down_read(&dup_mmap_sem); @@@ -1768,6 -1803,12 +1801,12 @@@ static int dup_utask(struct task_struc return -ENOMEM;
*n = *o; + /* + * uprobe's refcnt has to be positive at this point, kept by + * utask->return_instances items; return_instances can't be + * removed right now, as task is blocked due to duping; so + * get_uprobe() is safe to use here. + */ get_uprobe(n->uprobe); n->next = NULL;
@@@ -1779,12 -1820,6 +1818,6 @@@ return 0; }
- static void uprobe_warn(struct task_struct *t, const char *msg) - { - pr_warn("uprobe: %s:%d failed to %s\n", - current->comm, current->pid, msg); - } - static void dup_xol_work(struct callback_head *work) { if (current->flags & PF_EXITING) @@@ -1881,9 -1916,13 +1914,13 @@@ static void prepare_uretprobe(struct up return; }
+ /* we need to bump refcount to store uprobe in utask */ + if (!try_get_uprobe(uprobe)) + return; + ri = kmalloc(sizeof(struct return_instance), GFP_KERNEL); if (!ri) - return; + goto fail;
trampoline_vaddr = uprobe_get_trampoline_vaddr(); orig_ret_vaddr = arch_uretprobe_hijack_return_addr(trampoline_vaddr, regs); @@@ -1910,8 -1949,7 +1947,7 @@@ } orig_ret_vaddr = utask->return_instances->orig_ret_vaddr; } - - ri->uprobe = get_uprobe(uprobe); + ri->uprobe = uprobe; ri->func = instruction_pointer(regs); ri->stack = user_stack_pointer(regs); ri->orig_ret_vaddr = orig_ret_vaddr; @@@ -1922,8 -1960,9 +1958,9 @@@ utask->return_instances = ri;
return; - fail: + fail: kfree(ri); + put_uprobe(uprobe); }
/* Prepare to single-step probed instruction out of line. */ @@@ -1938,9 -1977,14 +1975,14 @@@ pre_ssout(struct uprobe *uprobe, struc if (!utask) return -ENOMEM;
+ if (!try_get_uprobe(uprobe)) + return -EINVAL; + xol_vaddr = xol_get_insn_slot(uprobe); - if (!xol_vaddr) - return -ENOMEM; + if (!xol_vaddr) { + err = -ENOMEM; + goto err_out; + }
utask->xol_vaddr = xol_vaddr; utask->vaddr = bp_vaddr; @@@ -1948,12 -1992,15 +1990,15 @@@ err = arch_uprobe_pre_xol(&uprobe->arch, regs); if (unlikely(err)) { xol_free_insn_slot(current); - return err; + goto err_out; }
utask->active_uprobe = uprobe; utask->state = UTASK_SSTEP; return 0; + err_out: + put_uprobe(uprobe); + return err; }
/* @@@ -2026,13 -2073,7 +2071,7 @@@ static int is_trap_at_addr(struct mm_st if (likely(result == 0)) goto out;
- /* - * The NULL 'tsk' here ensures that any faults that occur here - * will not be accounted to the task. 'mm' *is* current->mm, - * but we treat this as a 'remote' access since it is - * essentially a kernel access to the memory. - */ - result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL); + result = get_user_pages(vaddr, 1, FOLL_FORCE, &page); if (result < 0) return result;
@@@ -2043,7 -2084,8 +2082,8 @@@ return is_trap_insn(&opcode); }
- static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp) + /* assumes being inside RCU protected region */ + static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp) { struct mm_struct *mm = current->mm; struct uprobe *uprobe = NULL; @@@ -2056,7 -2098,7 +2096,7 @@@ struct inode *inode = file_inode(vma->vm_file); loff_t offset = vaddr_to_offset(vma, bp_vaddr);
- uprobe = find_uprobe(inode, offset); + uprobe = find_uprobe_rcu(inode, offset); }
if (!uprobe) @@@ -2077,9 -2119,12 +2117,12 @@@ static void handler_chain(struct uprob struct uprobe_consumer *uc; int remove = UPROBE_HANDLER_REMOVE; bool need_prep = false; /* prepare return uprobe, when needed */ + bool has_consumers = false; + + current->utask->auprobe = &uprobe->arch;
- down_read(&uprobe->register_rwsem); - for (uc = uprobe->consumers; uc; uc = uc->next) { + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node, + srcu_read_lock_held(&uprobes_srcu)) { int rc = 0;
if (uc->handler) { @@@ -2092,16 -2137,24 +2135,24 @@@ need_prep = true;
remove &= rc; + has_consumers = true; } + current->utask->auprobe = NULL;
if (need_prep && !remove) prepare_uretprobe(uprobe, regs); /* put bp at return */
- if (remove && uprobe->consumers) { - WARN_ON(!uprobe_is_active(uprobe)); - unapply_uprobe(uprobe, current->mm); + if (remove && has_consumers) { + down_read(&uprobe->register_rwsem); + + /* re-check that removal is still required, this time under lock */ + if (!filter_chain(uprobe, current->mm)) { + WARN_ON(!uprobe_is_active(uprobe)); + unapply_uprobe(uprobe, current->mm); + } + + up_read(&uprobe->register_rwsem); } - up_read(&uprobe->register_rwsem); }
static void @@@ -2109,13 -2162,15 +2160,15 @@@ handle_uretprobe_chain(struct return_in { struct uprobe *uprobe = ri->uprobe; struct uprobe_consumer *uc; + int srcu_idx;
- down_read(&uprobe->register_rwsem); - for (uc = uprobe->consumers; uc; uc = uc->next) { + srcu_idx = srcu_read_lock(&uprobes_srcu); + list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node, + srcu_read_lock_held(&uprobes_srcu)) { if (uc->ret_handler) uc->ret_handler(uc, ri->func, regs); } - up_read(&uprobe->register_rwsem); + srcu_read_unlock(&uprobes_srcu, srcu_idx); }
static struct return_instance *find_next_ret_chain(struct return_instance *ri) @@@ -2200,13 -2255,15 +2253,15 @@@ static void handle_swbp(struct pt_regs { struct uprobe *uprobe; unsigned long bp_vaddr; - int is_swbp; + int is_swbp, srcu_idx;
bp_vaddr = uprobe_get_swbp_addr(regs); if (bp_vaddr == uprobe_get_trampoline_vaddr()) return uprobe_handle_trampoline(regs);
- uprobe = find_active_uprobe(bp_vaddr, &is_swbp); + srcu_idx = srcu_read_lock(&uprobes_srcu); + + uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp); if (!uprobe) { if (is_swbp > 0) { /* No matching uprobe; signal SIGTRAP. */ @@@ -2222,7 -2279,7 +2277,7 @@@ */ instruction_pointer_set(regs, bp_vaddr); } - return; + goto out; }
/* change it in advance for ->handler() and restart */ @@@ -2257,12 -2314,12 +2312,12 @@@ if (arch_uprobe_skip_sstep(&uprobe->arch, regs)) goto out;
- if (!pre_ssout(uprobe, regs, bp_vaddr)) - return; + if (pre_ssout(uprobe, regs, bp_vaddr)) + goto out;
- /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */ out: - put_uprobe(uprobe); + /* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */ + srcu_read_unlock(&uprobes_srcu, srcu_idx); }
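The find_uprobe_rcu() hunk above pairs rb_find_rcu() with uprobes_seqcount so that a hit can be returned immediately while a miss is only trusted if no writer ran concurrently. A rough userspace analogue of that retry-on-miss pattern follows; it uses plain C11 atomics with default (sequentially consistent) ordering rather than the kernel's seqcount API, and the single slot is a stand-in for the RB-tree, purely for illustration:

/* Illustration only: validate a negative lookup against a sequence counter. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static _Atomic unsigned int seq;	/* even: stable, odd: write in progress */
static _Atomic int the_slot = -1;	/* stand-in for the shared search structure */

static void writer_insert(int val)
{
	atomic_fetch_add(&seq, 1);	/* begin write, seq becomes odd */
	atomic_store(&the_slot, val);
	atomic_fetch_add(&seq, 1);	/* end write, seq becomes even again */
}

static bool lockless_find(int key, int *out)
{
	for (;;) {
		unsigned int s = atomic_load(&seq);
		int v;

		if (s & 1)
			continue;	/* writer active, try again */

		v = atomic_load(&the_slot);
		if (v == key) {		/* a hit is always trustworthy */
			*out = v;
			return true;
		}

		/* a miss is only trusted if no write happened meanwhile */
		if (atomic_load(&seq) == s)
			return false;
	}
}

int main(void)
{
	int v;

	writer_insert(42);
	printf("found 42: %d\n", lockless_find(42, &v));
	printf("found  7: %d\n", lockless_find(7, &v));
	return 0;
}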
/* diff --combined kernel/fork.c index 0241a2ff1d336,0b71fc9fa750d..cd3f92f0a13d6 --- a/kernel/fork.c +++ b/kernel/fork.c @@@ -832,7 -832,7 +832,7 @@@ static void check_mm(struct mm_struct * pr_alert("BUG: non-zero pgtables_bytes on freeing mm: %ld\n", mm_pgtables_bytes(mm));
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS) VM_BUG_ON_MM(mm->pmd_huge_pte, mm); #endif } @@@ -1182,7 -1182,7 +1182,7 @@@ static struct task_struct *dup_task_str tsk->active_memcg = NULL; #endif
- #ifdef CONFIG_CPU_SUP_INTEL + #ifdef CONFIG_X86_BUS_LOCK_DETECT tsk->reported_split_lock = 0; #endif
@@@ -1276,7 -1276,7 +1276,7 @@@ static struct mm_struct *mm_init(struc RCU_INIT_POINTER(mm->exe_file, NULL); mmu_notifier_subscriptions_init(mm); init_tlb_flush_pending(mm); -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !defined(CONFIG_SPLIT_PMD_PTLOCKS) mm->pmd_huge_pte = NULL; #endif mm_init_uprobes_state(mm); @@@ -1338,6 -1338,7 +1338,6 @@@ static inline void __mmput(struct mm_st { VM_BUG_ON(atomic_read(&mm->mm_users));
- uprobe_clear_state(mm); exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ @@@ -1753,30 -1754,33 +1753,30 @@@ static int copy_files(unsigned long clo int no_files) { struct files_struct *oldf, *newf; - int error = 0;
/* * A background process may not have any files ... */ oldf = current->files; if (!oldf) - goto out; + return 0;
if (no_files) { tsk->files = NULL; - goto out; + return 0; }
if (clone_flags & CLONE_FILES) { atomic_inc(&oldf->count); - goto out; + return 0; }
- newf = dup_fd(oldf, NR_OPEN_MAX, &error); - if (!newf) - goto out; + newf = dup_fd(oldf, NULL); + if (IS_ERR(newf)) + return PTR_ERR(newf);
tsk->files = newf; - error = 0; -out: - return error; + return 0; }
static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk) @@@ -1857,7 -1861,7 +1857,7 @@@ static int copy_signal(unsigned long cl prev_cputime_init(&sig->prev_cputime);
#ifdef CONFIG_POSIX_TIMERS - INIT_LIST_HEAD(&sig->posix_timers); + INIT_HLIST_HEAD(&sig->posix_timers); hrtimer_init(&sig->real_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); sig->real_timer.function = it_real_fn; #endif @@@ -3228,16 -3232,17 +3228,16 @@@ static int unshare_fs(unsigned long uns /* * Unshare file descriptor table if it is being shared */ -int unshare_fd(unsigned long unshare_flags, unsigned int max_fds, - struct files_struct **new_fdp) +static int unshare_fd(unsigned long unshare_flags, struct files_struct **new_fdp) { struct files_struct *fd = current->files; - int error = 0;
if ((unshare_flags & CLONE_FILES) && (fd && atomic_read(&fd->count) > 1)) { - *new_fdp = dup_fd(fd, max_fds, &error); - if (!*new_fdp) - return error; + fd = dup_fd(fd, NULL); + if (IS_ERR(fd)) + return PTR_ERR(fd); + *new_fdp = fd; }
return 0; @@@ -3295,7 -3300,7 +3295,7 @@@ int ksys_unshare(unsigned long unshare_ err = unshare_fs(unshare_flags, &new_fs); if (err) goto bad_unshare_out; - err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd); + err = unshare_fd(unshare_flags, &new_fd); if (err) goto bad_unshare_cleanup_fs; err = unshare_userns(unshare_flags, &new_cred); @@@ -3387,7 -3392,7 +3387,7 @@@ int unshare_files(void struct files_struct *old, *copy = NULL; int error;
-	error = unshare_fd(CLONE_FILES, NR_OPEN_MAX, &copy);
+	error = unshare_fd(CLONE_FILES, &copy);
 	if (error || !copy)
 		return error;
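The dup_fd()/unshare_fd() rework above drops the int out-parameter in favour of the usual encode-the-error-in-the-pointer convention. Below is a self-contained userspace sketch of that idiom; ERR_PTR/IS_ERR/PTR_ERR are re-implemented locally so the example compiles on its own, and dup_table() is a made-up stand-in for dup_fd():

/* Illustration only: the error-pointer calling convention. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_ERRNO	4095

static inline void *ERR_PTR(long error)	{ return (void *)error; }
static inline long PTR_ERR(const void *ptr)	{ return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

struct files_table { int nr; };

/* hypothetical stand-in for dup_fd(): valid pointer or encoded -errno */
static struct files_table *dup_table(int simulate_failure)
{
	struct files_table *t;

	if (simulate_failure)
		return ERR_PTR(-ENOMEM);

	t = malloc(sizeof(*t));
	if (!t)
		return ERR_PTR(-ENOMEM);
	t->nr = 0;
	return t;
}

int main(void)
{
	struct files_table *t = dup_table(1);

	if (IS_ERR(t)) {
		printf("dup_table failed: %ld\n", PTR_ERR(t));
		t = dup_table(0);
	}
	if (!IS_ERR(t)) {
		printf("dup_table succeeded\n");
		free(t);
	}
	return 0;
}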
diff --combined kernel/irq/msi.c index ca6e2ae6d6fc0,1c7e5159064cc..3a24d6b5f559c --- a/kernel/irq/msi.c +++ b/kernel/irq/msi.c @@@ -82,7 -82,7 +82,7 @@@ static struct msi_desc *msi_alloc_desc( desc->dev = dev; desc->nvec_used = nvec; if (affinity) { - desc->affinity = kmemdup(affinity, nvec * sizeof(*desc->affinity), GFP_KERNEL); + desc->affinity = kmemdup_array(affinity, nvec, sizeof(*desc->affinity), GFP_KERNEL); if (!desc->affinity) { kfree(desc); return NULL; @@@ -832,7 -832,7 +832,7 @@@ static void msi_domain_update_chip_ops( struct irq_chip *chip = info->chip;
BUG_ON(!chip || !chip->irq_mask || !chip->irq_unmask); - if (!chip->irq_set_affinity) + if (!chip->irq_set_affinity && !(info->flags & MSI_FLAG_NO_AFFINITY)) chip->irq_set_affinity = msi_domain_set_affinity; }
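The msi_alloc_desc() hunk above replaces an open-coded nvec * sizeof(*desc->affinity) copy with kmemdup_array(), whose point is overflow-safe sizing of the duplicated array. A userspace sketch of the same idea follows; duplicate_array() is a hypothetical helper invented for this illustration (the kernel helper additionally takes a gfp_t):

/* Illustration only: duplicate an array with an explicit overflow check. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void *duplicate_array(const void *src, size_t count, size_t elem_size)
{
	void *dst;

	/* refuse count * elem_size computations that would wrap around */
	if (elem_size && count > SIZE_MAX / elem_size)
		return NULL;

	dst = malloc(count * elem_size);
	if (dst)
		memcpy(dst, src, count * elem_size);
	return dst;
}

int main(void)
{
	int src[4] = { 1, 2, 3, 4 };
	int *copy = duplicate_array(src, 4, sizeof(*src));

	if (copy) {
		printf("copy[3] = %d\n", copy[3]);
		free(copy);
	}
	return 0;
}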
diff --combined kernel/sched/core.c index 43e701f540130,b4c5d83e54d48..2d35fd27d3dfd --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@@ -163,7 -163,10 +163,10 @@@ static inline int __task_prio(const str if (p->sched_class == &stop_sched_class) /* trumps deadline */ return -2;
- if (rt_prio(p->prio)) /* includes deadline */ + if (p->dl_server) + return -1; /* deadline */ + + if (rt_or_dl_prio(p->prio)) return p->prio; /* [-1, 99] */
if (p->sched_class == &idle_sched_class) @@@ -192,8 -195,24 +195,24 @@@ static inline bool prio_less(const stru if (-pb < -pa) return false;
- if (pa == -1) /* dl_prio() doesn't work because of stop_class above */ - return !dl_time_before(a->dl.deadline, b->dl.deadline); + if (pa == -1) { /* dl_prio() doesn't work because of stop_class above */ + const struct sched_dl_entity *a_dl, *b_dl; + + a_dl = &a->dl; + /* + * Since,'a' and 'b' can be CFS tasks served by DL server, + * __task_prio() can return -1 (for DL) even for those. In that + * case, get to the dl_server's DL entity. + */ + if (a->dl_server) + a_dl = a->dl_server; + + b_dl = &b->dl; + if (b->dl_server) + b_dl = b->dl_server; + + return !dl_time_before(a_dl->deadline, b_dl->deadline); + }
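The prio_less() hunk above falls back to comparing absolute deadlines, via dl_time_before(), both for real deadline tasks and for CFS tasks that are currently served by a DL server entity. dl_time_before() is the usual wraparound-safe "earlier than" test on u64 clock values; a short sketch of that comparison with made-up deadlines:

#include <stdint.h>
#include <stdio.h>

/* Earliest-deadline-first comparison that survives u64 clock wraparound. */
static int dl_time_before(uint64_t a, uint64_t b)
{
        return (int64_t)(a - b) < 0;
}

int main(void)
{
        uint64_t near_wrap = UINT64_MAX - 100;  /* deadline just before wraparound */
        uint64_t wrapped   = 50;                /* deadline just after wraparound */

        printf("%d\n", dl_time_before(near_wrap, wrapped));    /* 1: still earlier */
        printf("%d\n", dl_time_before(wrapped, near_wrap));    /* 0 */
        return 0;
}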
if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */ return cfs_prio_less(a, b, in_fi); @@@ -240,6 -259,9 +259,9 @@@ static inline int rb_sched_core_cmp(con
void sched_core_enqueue(struct rq *rq, struct task_struct *p) { + if (p->se.sched_delayed) + return; + rq->core->core_task_seq++;
if (!p->core_cookie) @@@ -250,6 -272,9 +272,9 @@@
void sched_core_dequeue(struct rq *rq, struct task_struct *p, int flags) { + if (p->se.sched_delayed) + return; + rq->core->core_task_seq++;
if (sched_core_enqueued(p)) { @@@ -1269,7 -1294,7 +1294,7 @@@ bool sched_can_stop_tick(struct rq *rq * dequeued by migrating while the constrained task continues to run. * E.g. going from 2->1 without going through pick_next_task(). */ - if (sched_feat(HZ_BW) && __need_bw_check(rq, rq->curr)) { + if (__need_bw_check(rq, rq->curr)) { if (cfs_task_bw_constrained(rq->curr)) return false; } @@@ -1672,6 -1697,9 +1697,9 @@@ static inline void uclamp_rq_inc(struc if (unlikely(!p->sched_class->uclamp_enabled)) return;
+ if (p->se.sched_delayed) + return; + for_each_clamp_id(clamp_id) uclamp_rq_inc_id(rq, p, clamp_id);
@@@ -1696,6 -1724,9 +1724,9 @@@ static inline void uclamp_rq_dec(struc if (unlikely(!p->sched_class->uclamp_enabled)) return;
+ if (p->se.sched_delayed) + return; + for_each_clamp_id(clamp_id) uclamp_rq_dec_id(rq, p, clamp_id); } @@@ -1975,14 -2006,21 +2006,21 @@@ void enqueue_task(struct rq *rq, struc psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED)); }
- uclamp_rq_inc(rq, p); p->sched_class->enqueue_task(rq, p, flags); + /* + * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear + * ->sched_delayed. + */ + uclamp_rq_inc(rq, p);
if (sched_core_enabled(rq)) sched_core_enqueue(rq, p); }
- void dequeue_task(struct rq *rq, struct task_struct *p, int flags) + /* + * Must only return false when DEQUEUE_SLEEP. + */ + inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags) { if (sched_core_enabled(rq)) sched_core_dequeue(rq, p, flags); @@@ -1995,8 -2033,12 +2033,12 @@@ psi_dequeue(p, flags & DEQUEUE_SLEEP); }
+ /* + * Must be before ->dequeue_task() because ->dequeue_task() can 'fail' + * and mark the task ->sched_delayed. + */ uclamp_rq_dec(rq, p); - p->sched_class->dequeue_task(rq, p, flags); + return p->sched_class->dequeue_task(rq, p, flags); }
void activate_task(struct rq *rq, struct task_struct *p, int flags) @@@ -2014,12 -2056,25 +2056,25 @@@
void deactivate_task(struct rq *rq, struct task_struct *p, int flags) { - WRITE_ONCE(p->on_rq, (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING); + SCHED_WARN_ON(flags & DEQUEUE_SLEEP); + + WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING); ASSERT_EXCLUSIVE_WRITER(p->on_rq);
+ /* + * Code explicitly relies on TASK_ON_RQ_MIGRATING being set *before* + * dequeue_task() and cleared *after* enqueue_task(). + */ + dequeue_task(rq, p, flags); }
+ static void block_task(struct rq *rq, struct task_struct *p, int flags) + { + if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags)) + __block_task(rq, p); + } + /** * task_curr - is this task currently executing on a CPU? * @p: the task in question. @@@ -2233,6 -2288,12 +2288,12 @@@ void migrate_disable(void struct task_struct *p = current;
if (p->migration_disabled) { + #ifdef CONFIG_DEBUG_PREEMPT + /* + * Warn about overflow half-way through the range. + */ + WARN_ON_ONCE((s16)p->migration_disabled < 0); + #endif p->migration_disabled++; return; } @@@ -2251,14 -2312,20 +2312,20 @@@ void migrate_enable(void .flags = SCA_MIGRATE_ENABLE, };
+ #ifdef CONFIG_DEBUG_PREEMPT + /* + * Check both overflow from migrate_disable() and superfluous + * migrate_enable(). + */ + if (WARN_ON_ONCE((s16)p->migration_disabled <= 0)) + return; + #endif + if (p->migration_disabled > 1) { p->migration_disabled--; return; }
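The migrate_disable()/migrate_enable() hunks above add CONFIG_DEBUG_PREEMPT sanity checks on the nesting counter: casting the unsigned count to s16 warns once an imbalance has pushed it half-way through the 16-bit range, and a non-positive value on enable catches a superfluous migrate_enable(). A tiny user-space sketch of the half-range trick, assuming a two's-complement cast as the kernel does:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        unsigned short depth = 0;       /* like task_struct::migration_disabled */

        /* Simulate far too many nested migrate_disable() calls. */
        for (int i = 0; i < 40000; i++) {
                if ((int16_t)depth < 0) {
                        /* same test as WARN_ON_ONCE((s16)p->migration_disabled < 0) */
                        printf("imbalance suspected at depth %u\n", depth);
                        break;
                }
                depth++;
        }
        return 0;
}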
- if (WARN_ON_ONCE(!p->migration_disabled)) - return; - /* * Ensure stop_task runs either before or after this, and that * __set_cpus_allowed_ptr(SCA_MIGRATE_ENABLE) doesn't schedule(). @@@ -3607,8 -3674,6 +3674,6 @@@ ttwu_do_activate(struct rq *rq, struct rq->idle_stamp = 0; } #endif - - p->dl_server = NULL; }
/* @@@ -3644,12 -3709,14 +3709,14 @@@ static int ttwu_runnable(struct task_st
rq = __task_rq_lock(p, &rf); if (task_on_rq_queued(p)) { + update_rq_clock(rq); + if (p->se.sched_delayed) + enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED); if (!task_on_cpu(rq, p)) { /* * When on_rq && !on_cpu the task is preempted, see if * it should preempt the task that is current now. */ - update_rq_clock(rq); wakeup_preempt(rq, p, wake_flags); } ttwu_do_wakeup(p); @@@ -4029,11 -4096,16 +4096,16 @@@ int try_to_wake_up(struct task_struct * * case the whole 'p->on_rq && ttwu_runnable()' case below * without taking any locks. * + * Specifically, given current runs ttwu() we must be before + * schedule()'s block_task(), as such this must not observe + * sched_delayed. + * * In particular: * - we rely on Program-Order guarantees for all the ordering, * - we're serialized against set_special_state() by virtue of * it disabling IRQs (this allows not taking ->pi_lock). */ + SCHED_WARN_ON(p->se.sched_delayed); if (!ttwu_state_match(p, state, &success)) goto out;
@@@ -4322,9 -4394,11 +4394,11 @@@ static void __sched_fork(unsigned long p->se.nr_migrations = 0; p->se.vruntime = 0; p->se.vlag = 0; - p->se.slice = sysctl_sched_base_slice; INIT_LIST_HEAD(&p->se.group_node);
+ /* A delayed task cannot be in clone(). */ + SCHED_WARN_ON(p->se.sched_delayed); + #ifdef CONFIG_FAIR_GROUP_SCHED p->se.cfs_rq = NULL; #endif @@@ -4572,6 -4646,8 +4646,8 @@@ int sched_fork(unsigned long clone_flag
p->prio = p->normal_prio = p->static_prio; set_load_weight(p, false); + p->se.custom_slice = 0; + p->se.slice = sysctl_sched_base_slice;
/* * We don't need the reset flag anymore after the fork. It has @@@ -4686,7 -4762,7 +4762,7 @@@ void wake_up_new_task(struct task_struc update_rq_clock(rq); post_init_entity_util_avg(p);
- activate_task(rq, p, ENQUEUE_NOCLOCK); + activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_INITIAL); trace_sched_wakeup_new(p); wakeup_preempt(rq, p, WF_FORK); #ifdef CONFIG_SMP @@@ -5769,8 -5845,8 +5845,8 @@@ static inline void schedule_debug(struc schedstat_inc(this_rq()->sched_count); }
- static void put_prev_task_balance(struct rq *rq, struct task_struct *prev, - struct rq_flags *rf) + static void prev_balance(struct rq *rq, struct task_struct *prev, + struct rq_flags *rf) { #ifdef CONFIG_SMP const struct sched_class *class; @@@ -5787,8 -5863,6 +5863,6 @@@ break; } #endif - - put_prev_task(rq, prev); }
/* @@@ -5800,6 -5874,8 +5874,8 @@@ __pick_next_task(struct rq *rq, struct const struct sched_class *class; struct task_struct *p;
+ rq->dl_server = NULL; + /* * Optimization: we know that if all tasks are in the fair class we can * call that function directly, but only if the @prev task wasn't of a @@@ -5815,35 -5891,28 +5891,28 @@@
/* Assume the next prioritized class is idle_sched_class */ if (!p) { - put_prev_task(rq, prev); - p = pick_next_task_idle(rq); + p = pick_task_idle(rq); + put_prev_set_next_task(rq, prev, p); }
- /* - * This is the fast path; it cannot be a DL server pick; - * therefore even if @p == @prev, ->dl_server must be NULL. - */ - if (p->dl_server) - p->dl_server = NULL; - return p; }
restart: - put_prev_task_balance(rq, prev, rf); - - /* - * We've updated @prev and no longer need the server link, clear it. - * Must be done before ->pick_next_task() because that can (re)set - * ->dl_server. - */ - if (prev->dl_server) - prev->dl_server = NULL; + prev_balance(rq, prev, rf);
for_each_class(class) { - p = class->pick_next_task(rq); - if (p) - return p; + if (class->pick_next_task) { + p = class->pick_next_task(rq, prev); + if (p) + return p; + } else { + p = class->pick_task(rq); + if (p) { + put_prev_set_next_task(rq, prev, p); + return p; + } + } }
BUG(); /* The idle class should always have a runnable task. */ @@@ -5873,6 -5942,8 +5942,8 @@@ static inline struct task_struct *pick_ const struct sched_class *class; struct task_struct *p;
+ rq->dl_server = NULL; + for_each_class(class) { p = class->pick_task(rq); if (p) @@@ -5911,6 -5982,7 +5982,7 @@@ pick_next_task(struct rq *rq, struct ta * another cpu during offline. */ rq->core_pick = NULL; + rq->core_dl_server = NULL; return __pick_next_task(rq, prev, rf); }
@@@ -5929,16 -6001,13 +6001,13 @@@ WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
next = rq->core_pick; - if (next != prev) { - put_prev_task(rq, prev); - set_next_task(rq, next); - } - + rq->dl_server = rq->core_dl_server; rq->core_pick = NULL; - goto out; + rq->core_dl_server = NULL; + goto out_set_next; }
- put_prev_task_balance(rq, prev, rf); + prev_balance(rq, prev, rf);
smt_mask = cpu_smt_mask(cpu); need_sync = !!rq->core->core_cookie; @@@ -5979,6 -6048,7 +6048,7 @@@ next = pick_task(rq); if (!next->core_cookie) { rq->core_pick = NULL; + rq->core_dl_server = NULL; /* * For robustness, update the min_vruntime_fi for * unconstrained picks as well. @@@ -6006,7 -6076,9 +6076,9 @@@ if (i != cpu && (rq_i != rq->core || !core_clock_updated)) update_rq_clock(rq_i);
- p = rq_i->core_pick = pick_task(rq_i); + rq_i->core_pick = p = pick_task(rq_i); + rq_i->core_dl_server = rq_i->dl_server; + if (!max || prio_less(max, p, fi_before)) max = p; } @@@ -6030,6 -6102,7 +6102,7 @@@ }
rq_i->core_pick = p; + rq_i->core_dl_server = NULL;
if (p == rq_i->idle) { if (rq_i->nr_running) { @@@ -6090,6 -6163,7 +6163,7 @@@
if (i == cpu) { rq_i->core_pick = NULL; + rq_i->core_dl_server = NULL; continue; }
@@@ -6098,6 -6172,7 +6172,7 @@@
if (rq_i->curr == rq_i->core_pick) { rq_i->core_pick = NULL; + rq_i->core_dl_server = NULL; continue; }
@@@ -6105,8 -6180,7 +6180,7 @@@ }
out_set_next: - set_next_task(rq, next); - out: + put_prev_set_next_task(rq, prev, next); if (rq->core->core_forceidle_count && next == rq->idle) queue_core_balance(rq);
@@@ -6342,19 -6416,12 +6416,12 @@@ pick_next_task(struct rq *rq, struct ta * Constants for the sched_mode argument of __schedule(). * * The mode argument allows RT enabled kernels to differentiate a - * preemption from blocking on an 'sleeping' spin/rwlock. Note that - * SM_MASK_PREEMPT for !RT has all bits set, which allows the compiler to - * optimize the AND operation out and just check for zero. + * preemption from blocking on an 'sleeping' spin/rwlock. */ - #define SM_NONE 0x0 - #define SM_PREEMPT 0x1 - #define SM_RTLOCK_WAIT 0x2 - - #ifndef CONFIG_PREEMPT_RT - # define SM_MASK_PREEMPT (~0U) - #else - # define SM_MASK_PREEMPT SM_PREEMPT - #endif + #define SM_IDLE (-1) + #define SM_NONE 0 + #define SM_PREEMPT 1 + #define SM_RTLOCK_WAIT 2
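The rewritten mode constants above retire the SM_MASK_PREEMPT bitmask: SM_IDLE is a negative value reserved for the idle fast path, and __schedule() now derives two booleans from the plain integer, sched_mode > SM_NONE for the RCU/debug notion of preemption and sched_mode == SM_PREEMPT for task-state handling. A purely illustrative sketch of that classification:

#include <stdbool.h>
#include <stdio.h>

#define SM_IDLE        (-1)
#define SM_NONE        0
#define SM_PREEMPT     1
#define SM_RTLOCK_WAIT 2

static void classify(int sched_mode)
{
        bool rcu_preempt   = sched_mode > SM_NONE;      /* SM_PREEMPT, SM_RTLOCK_WAIT */
        bool state_preempt = sched_mode == SM_PREEMPT;  /* task-state handling only */

        printf("mode %2d: rcu/debug preempt=%d, task-state preempt=%d\n",
               sched_mode, rcu_preempt, state_preempt);
}

int main(void)
{
        classify(SM_IDLE);
        classify(SM_NONE);
        classify(SM_PREEMPT);
        classify(SM_RTLOCK_WAIT);
        return 0;
}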
/* * __schedule() is the main scheduler function. @@@ -6395,9 -6462,14 +6462,14 @@@ * * WARNING: must be called with preemption disabled! */ - static void __sched notrace __schedule(unsigned int sched_mode) + static void __sched notrace __schedule(int sched_mode) { struct task_struct *prev, *next; + /* + * On PREEMPT_RT kernel, SM_RTLOCK_WAIT is noted + * as a preemption by schedule_debug() and RCU. + */ + bool preempt = sched_mode > SM_NONE; unsigned long *switch_count; unsigned long prev_state; struct rq_flags rf; @@@ -6408,13 -6480,13 +6480,13 @@@ rq = cpu_rq(cpu); prev = rq->curr;
- schedule_debug(prev, !!sched_mode); + schedule_debug(prev, preempt);
if (sched_feat(HRTICK) || sched_feat(HRTICK_DL)) hrtick_clear(rq);
local_irq_disable(); - rcu_note_context_switch(!!sched_mode); + rcu_note_context_switch(preempt);
/* * Make sure that signal_pending_state()->signal_pending() below @@@ -6443,22 -6515,32 +6515,32 @@@
switch_count = &prev->nivcsw;
+ /* Task state changes only considers SM_PREEMPT as preemption */ + preempt = sched_mode == SM_PREEMPT; + /* * We must load prev->state once (task_struct::state is volatile), such * that we form a control dependency vs deactivate_task() below. */ prev_state = READ_ONCE(prev->__state); - if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) { + if (sched_mode == SM_IDLE) { + if (!rq->nr_running) { + next = prev; + goto picked; + } + } else if (!preempt && prev_state) { if (signal_pending_state(prev_state, prev)) { WRITE_ONCE(prev->__state, TASK_RUNNING); } else { + int flags = DEQUEUE_NOCLOCK; + prev->sched_contributes_to_load = (prev_state & TASK_UNINTERRUPTIBLE) && !(prev_state & TASK_NOLOAD) && !(prev_state & TASK_FROZEN);
- if (prev->sched_contributes_to_load) - rq->nr_uninterruptible++; + if (unlikely(is_special_task_state(prev_state))) + flags |= DEQUEUE_SPECIAL;
/* * __schedule() ttwu() @@@ -6471,17 -6553,13 +6553,13 @@@ * * After this, schedule() must not care about p->state any more. */ - deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); - - if (prev->in_iowait) { - atomic_inc(&rq->nr_iowait); - delayacct_blkio_start(); - } + block_task(rq, prev, flags); } switch_count = &prev->nvcsw; }
next = pick_next_task(rq, prev, &rf); + picked: clear_tsk_need_resched(prev); clear_preempt_need_resched(); #ifdef CONFIG_SCHED_DEBUG @@@ -6523,7 -6601,7 +6601,7 @@@ psi_account_irqtime(rq, prev, next); psi_sched_switch(prev, next, !task_on_rq_queued(prev));
- trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state); + trace_sched_switch(preempt, prev, next, prev_state);
/* Also unlocks the rq: */ rq = context_switch(rq, prev, next, &rf); @@@ -6599,7 -6677,7 +6677,7 @@@ static void sched_update_worker(struct } }
- static __always_inline void __schedule_loop(unsigned int sched_mode) + static __always_inline void __schedule_loop(int sched_mode) { do { preempt_disable(); @@@ -6644,7 -6722,7 +6722,7 @@@ void __sched schedule_idle(void */ WARN_ON_ONCE(current->__state); do { - __schedule(SM_NONE); + __schedule(SM_IDLE); } while (need_resched()); }
@@@ -7405,7 -7483,7 +7483,7 @@@ EXPORT_SYMBOL(io_schedule)
void sched_show_task(struct task_struct *p) { - unsigned long free = 0; + unsigned long free; int ppid;
if (!try_get_task_stack(p)) @@@ -7415,7 -7493,9 +7493,7 @@@
if (task_is_running(p)) pr_cont(" running task "); -#ifdef CONFIG_DEBUG_STACK_USAGE free = stack_not_used(p); -#endif ppid = 0; rcu_read_lock(); if (pid_alive(p)) @@@ -8226,8 -8306,6 +8304,6 @@@ void __init sched_init(void #endif /* CONFIG_RT_GROUP_SCHED */ }
- init_rt_bandwidth(&def_rt_bandwidth, global_rt_period(), global_rt_runtime()); - #ifdef CONFIG_SMP init_defrootdomain(); #endif @@@ -8282,8 -8360,13 +8358,13 @@@ init_tg_cfs_entry(&root_task_group, &rq->cfs, NULL, i, NULL); #endif /* CONFIG_FAIR_GROUP_SCHED */
- rq->rt.rt_runtime = def_rt_bandwidth.rt_runtime; #ifdef CONFIG_RT_GROUP_SCHED + /* + * This is required for init cpu because rt.c:__enable_runtime() + * starts working after scheduler_running, which is not the case + * yet. + */ + rq->rt.rt_runtime = global_rt_runtime(); init_tg_rt_entry(&root_task_group, &rq->rt, NULL, i, NULL); #endif #ifdef CONFIG_SMP @@@ -8315,10 -8398,12 +8396,12 @@@ #endif /* CONFIG_SMP */ hrtick_rq_init(rq); atomic_set(&rq->nr_iowait, 0); + fair_server_init(rq);
#ifdef CONFIG_SCHED_CORE rq->core = rq; rq->core_pick = NULL; + rq->core_dl_server = NULL; rq->core_enabled = 0; rq->core_tree = RB_ROOT; rq->core_forceidle_count = 0; @@@ -8331,6 -8416,7 +8414,7 @@@ }
set_load_weight(&init_task, false); + init_task.se.slice = sysctl_sched_base_slice;
/* * The boot idle thread does lazy MMU switching as well: @@@ -8546,7 -8632,7 +8630,7 @@@ void normalize_rt_tasks(void schedstat_set(p->stats.sleep_start, 0); schedstat_set(p->stats.block_start, 0);
- if (!dl_task(p) && !rt_task(p)) { + if (!rt_or_dl_task(p)) { /* * Renice negative nice level userspace * tasks back to 0: diff --combined kernel/sched/fair.c index a1b756f927b23,b9784e13e6b6e..03d7c5e59a182 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@@ -511,7 -511,7 +511,7 @@@ static int cfs_rq_is_idle(struct cfs_r
static int se_is_idle(struct sched_entity *se) { - return 0; + return task_has_idle_policy(task_of(se)); }
#endif /* CONFIG_FAIR_GROUP_SCHED */ @@@ -779,8 -779,22 +779,22 @@@ static void update_min_vruntime(struct }
/* ensure we never gain time by being placed backwards. */ - u64_u32_store(cfs_rq->min_vruntime, - __update_min_vruntime(cfs_rq, vruntime)); + cfs_rq->min_vruntime = __update_min_vruntime(cfs_rq, vruntime); + } + + static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq) + { + struct sched_entity *root = __pick_root_entity(cfs_rq); + struct sched_entity *curr = cfs_rq->curr; + u64 min_slice = ~0ULL; + + if (curr && curr->on_rq) + min_slice = curr->slice; + + if (root) + min_slice = min(min_slice, root->min_slice); + + return min_slice; }
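cfs_rq_min_slice() above returns the shortest request among the running entity and everything queued, using the rbtree-augmented min_slice cached at the root; later hunks in enqueue_task_fair() copy that minimum into group entities so a cgroup hierarchy can serve its shortest-slice member in time. A compact sketch of the same "min of curr and subtree minimum" computation with made-up entity structures (the kernel additionally checks curr->on_rq):

#include <stdint.h>
#include <stdio.h>

struct entity {
        uint64_t slice;         /* this entity's own request size */
        uint64_t min_slice;     /* cached minimum over its rbtree subtree */
};

/* Mirrors the shape of cfs_rq_min_slice(): min of curr and the tree root. */
static uint64_t queue_min_slice(const struct entity *curr, const struct entity *root)
{
        uint64_t min_slice = UINT64_MAX;

        if (curr)
                min_slice = curr->slice;
        if (root && root->min_slice < min_slice)
                min_slice = root->min_slice;
        return min_slice;
}

int main(void)
{
        struct entity curr = { .slice = 3000000, .min_slice = 3000000 };
        struct entity root = { .slice = 4000000, .min_slice =  700000 };

        printf("min slice: %llu ns\n",
               (unsigned long long)queue_min_slice(&curr, &root)); /* 700000 */
        return 0;
}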
static inline bool __entity_less(struct rb_node *a, const struct rb_node *b) @@@ -799,19 -813,34 +813,34 @@@ static inline void __min_vruntime_updat } }
+ static inline void __min_slice_update(struct sched_entity *se, struct rb_node *node) + { + if (node) { + struct sched_entity *rse = __node_2_se(node); + if (rse->min_slice < se->min_slice) + se->min_slice = rse->min_slice; + } + } + /* * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime) */ static inline bool min_vruntime_update(struct sched_entity *se, bool exit) { u64 old_min_vruntime = se->min_vruntime; + u64 old_min_slice = se->min_slice; struct rb_node *node = &se->run_node;
se->min_vruntime = se->vruntime; __min_vruntime_update(se, node->rb_right); __min_vruntime_update(se, node->rb_left);
- return se->min_vruntime == old_min_vruntime; + se->min_slice = se->slice; + __min_slice_update(se, node->rb_right); + __min_slice_update(se, node->rb_left); + + return se->min_vruntime == old_min_vruntime && + se->min_slice == old_min_slice; }
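min_vruntime_update() above now maintains two augmented values per node, the subtree-minimum vruntime and the subtree-minimum slice, and reports whether both are unchanged so the generated rbtree callbacks can stop propagating toward the root early. A small sketch of that bottom-up maintenance on a plain binary-tree node, with illustrative field names:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct node {
        uint64_t slice;         /* this node's own value */
        uint64_t min_slice;     /* minimum over this node and its subtree */
        struct node *left, *right;
};

/*
 * Recompute one node's cached minimum; return true if it did NOT change,
 * which tells the caller it may stop walking toward the root.
 */
static bool min_slice_update(struct node *n)
{
        uint64_t old = n->min_slice;

        n->min_slice = n->slice;
        if (n->left && n->left->min_slice < n->min_slice)
                n->min_slice = n->left->min_slice;
        if (n->right && n->right->min_slice < n->min_slice)
                n->min_slice = n->right->min_slice;

        return n->min_slice == old;
}

int main(void)
{
        struct node leaf = { .slice = 500, .min_slice = 500 };
        struct node root = { .slice = 900, .min_slice = 900, .left = &leaf };
        bool unchanged = min_slice_update(&root);

        printf("unchanged=%d min=%llu\n", unchanged,
               (unsigned long long)root.min_slice);     /* unchanged=0 min=500 */
        return 0;
}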
RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity, @@@ -824,6 -853,7 +853,7 @@@ static void __enqueue_entity(struct cfs { avg_vruntime_add(cfs_rq, se); se->min_vruntime = se->vruntime; + se->min_slice = se->slice; rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline, __entity_less, &min_vruntime_cb); } @@@ -974,17 -1004,18 +1004,18 @@@ static void clear_buddies(struct cfs_r * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i * this is probably good enough. */ - static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) + static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se) { if ((s64)(se->vruntime - se->deadline) < 0) - return; + return false;
/* * For EEVDF the virtual time slope is determined by w_i (iow. * nice) while the request time r_i is determined by * sysctl_sched_base_slice. */ - se->slice = sysctl_sched_base_slice; + if (!se->custom_slice) + se->slice = sysctl_sched_base_slice;
/* * EEVDF: vd_i = ve_i + r_i / w_i @@@ -994,10 -1025,7 +1025,7 @@@ /* * The task has consumed its request, reschedule. */ - if (cfs_rq->nr_running > 1) { - resched_curr(rq_of(cfs_rq)); - clear_buddies(cfs_rq, se); - } + return true; }
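update_deadline() above is the EEVDF relation vd_i = ve_i + r_i / w_i: once an entity has run past its previous virtual deadline, a new request of se->slice is granted (left untouched when the task supplied a custom slice) and the deadline moves out by the slice scaled into virtual time; the function now only reports that a reschedule is due rather than calling resched_curr() itself. A simplified sketch of the arithmetic, assuming calc_delta_fair() reduces to slice * NICE_0_WEIGHT / weight (the kernel uses fixed-point inverse weights, so this is an approximation):

#include <stdint.h>
#include <stdio.h>

#define NICE_0_WEIGHT 1024ULL

struct entity {
        uint64_t vruntime;      /* ve_i: virtual runtime */
        uint64_t deadline;      /* vd_i: virtual deadline */
        uint64_t slice;         /* r_i: request size in real time (ns) */
        uint64_t weight;        /* w_i: load weight */
};

/* Scale a real-time delta into virtual time for this entity's weight. */
static uint64_t calc_delta_fair_approx(uint64_t delta, const struct entity *se)
{
        return delta * NICE_0_WEIGHT / se->weight;
}

/* vd_i = ve_i + r_i / w_i: returns 1 when the old request has been used up. */
static int update_deadline_approx(struct entity *se)
{
        if ((int64_t)(se->vruntime - se->deadline) < 0)
                return 0;       /* still inside the current request */

        se->deadline = se->vruntime + calc_delta_fair_approx(se->slice, se);
        return 1;               /* request consumed: caller should reschedule */
}

int main(void)
{
        /* a nice-0 task (weight 1024) with a ~3ms slice */
        struct entity se = { .vruntime = 10000000, .deadline = 10000000,
                             .slice = 3000000, .weight = NICE_0_WEIGHT };

        if (update_deadline_approx(&se))
                printf("new virtual deadline: %llu\n",
                       (unsigned long long)se.deadline);        /* 13000000 */
        return 0;
}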
#include "pelt.h" @@@ -1135,6 -1163,38 +1163,38 @@@ static inline void update_curr_task(str dl_server_update(p->dl_server, delta_exec); }
+ static inline bool did_preempt_short(struct cfs_rq *cfs_rq, struct sched_entity *curr) + { + if (!sched_feat(PREEMPT_SHORT)) + return false; + + if (curr->vlag == curr->deadline) + return false; + + return !entity_eligible(cfs_rq, curr); + } + + static inline bool do_preempt_short(struct cfs_rq *cfs_rq, + struct sched_entity *pse, struct sched_entity *se) + { + if (!sched_feat(PREEMPT_SHORT)) + return false; + + if (pse->slice >= se->slice) + return false; + + if (!entity_eligible(cfs_rq, pse)) + return false; + + if (entity_before(pse, se)) + return true; + + if (!entity_eligible(cfs_rq, se)) + return true; + + return false; + } + /* * Used by other classes to account runtime. */ @@@ -1156,23 -1216,44 +1216,44 @@@ s64 update_curr_common(struct rq *rq static void update_curr(struct cfs_rq *cfs_rq) { struct sched_entity *curr = cfs_rq->curr; + struct rq *rq = rq_of(cfs_rq); s64 delta_exec; + bool resched;
if (unlikely(!curr)) return;
- delta_exec = update_curr_se(rq_of(cfs_rq), curr); + delta_exec = update_curr_se(rq, curr); if (unlikely(delta_exec <= 0)) return;
curr->vruntime += calc_delta_fair(delta_exec, curr); - update_deadline(cfs_rq, curr); + resched = update_deadline(cfs_rq, curr); update_min_vruntime(cfs_rq);
- if (entity_is_task(curr)) - update_curr_task(task_of(curr), delta_exec); + if (entity_is_task(curr)) { + struct task_struct *p = task_of(curr); + + update_curr_task(p, delta_exec); + + /* + * Any fair task that runs outside of fair_server should + * account against fair_server such that it can account for + * this time and possibly avoid running this period. + */ + if (p->dl_server != &rq->fair_server) + dl_server_update(&rq->fair_server, delta_exec); + }
account_cfs_rq_runtime(cfs_rq, delta_exec); + + if (rq->nr_running == 1) + return; + + if (resched || did_preempt_short(cfs_rq, curr)) { + resched_curr(rq); + clear_buddies(cfs_rq, curr); + } }
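update_curr() above now defers the preemption decision: it reschedules when the deadline was refreshed or when PREEMPT_SHORT notices, via did_preempt_short(), that the running entity is no longer eligible. Eligibility is the EEVDF condition that an entity's vruntime has not run ahead of the load-weighted average vruntime of the queue. The following is a deliberately naive sketch of that test over an array of entities; the kernel tracks the average incrementally and relative to min_vruntime, so this only illustrates the definition:

#include <stdint.h>
#include <stdio.h>

struct entity {
        int64_t vruntime;
        int64_t weight;
};

/* Load-weighted average vruntime of the queue: V = sum(w_i * v_i) / sum(w_i). */
static int64_t avg_vruntime(const struct entity *q, int n)
{
        int64_t num = 0, den = 0;

        for (int i = 0; i < n; i++) {
                num += q[i].weight * q[i].vruntime;
                den += q[i].weight;
        }
        return den ? num / den : 0;
}

/* EEVDF: an entity is eligible while its vruntime is not past the average. */
static int entity_eligible(const struct entity *q, int n, const struct entity *se)
{
        return se->vruntime <= avg_vruntime(q, n);
}

int main(void)
{
        struct entity q[] = {
                { .vruntime = 100, .weight = 1024 },    /* current, has run ahead */
                { .vruntime =  40, .weight = 1024 },
                { .vruntime =  10, .weight = 1024 },
        };

        printf("curr eligible: %d\n", entity_eligible(q, 3, &q[0]));    /* 0 */
        return 0;
}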
static void update_curr_fair(struct rq *rq) @@@ -1742,7 -1823,7 +1823,7 @@@ static bool pgdat_free_space_enough(str continue;
if (zone_watermark_ok(zone, 0, - wmark_pages(zone, WMARK_PROMO) + enough_wmark, + promo_wmark_pages(zone) + enough_wmark, ZONE_MOVABLE, 0)) return true; } @@@ -1840,7 -1921,8 +1921,7 @@@ bool should_numa_migrate_memory(struct * The pages in slow memory node should be migrated according * to hot/cold instead of private/shared. */ - if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING && - !node_is_toptier(src_nid)) { + if (folio_use_access_time(folio)) { struct pglist_data *pgdat; unsigned long rate_limit; unsigned int latency, th, def_th; @@@ -3187,15 -3269,6 +3268,15 @@@ static bool vma_is_accessed(struct mm_s return true; }
+ /* + * This vma has not been accessed for a while, and if the number + * of threads in the same process is low, which means no other + * threads can help scan this vma, force a vma scan. + */ + if (READ_ONCE(mm->numa_scan_seq) > + (vma->numab_state->prev_scan_seq + get_nr_threads(current))) + return true; + return false; }
@@@ -5186,7 -5259,8 +5267,8 @@@ place_entity(struct cfs_rq *cfs_rq, str u64 vslice, vruntime = avg_vruntime(cfs_rq); s64 lag = 0;
- se->slice = sysctl_sched_base_slice; + if (!se->custom_slice) + se->slice = sysctl_sched_base_slice; vslice = calc_delta_fair(se->slice, se);
/* @@@ -5267,6 -5341,12 +5349,12 @@@
se->vruntime = vruntime - lag;
+ if (sched_feat(PLACE_REL_DEADLINE) && se->rel_deadline) { + se->deadline += se->vruntime; + se->rel_deadline = 0; + return; + } + /* * When joining the competition; the existing tasks will be, * on average, halfway through their slice, as such start tasks @@@ -5286,6 -5366,9 +5374,9 @@@ static inline int cfs_rq_throttled(stru
static inline bool cfs_bandwidth_used(void);
+ static void + requeue_delayed_entity(struct sched_entity *se); + static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { @@@ -5373,19 -5456,47 +5464,47 @@@ static void clear_buddies(struct cfs_r
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
- static void + static inline void finish_delayed_dequeue_entity(struct sched_entity *se) + { + se->sched_delayed = 0; + if (sched_feat(DELAY_ZERO) && se->vlag > 0) + se->vlag = 0; + } + + static bool dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) { - int action = UPDATE_TG; + bool sleep = flags & DEQUEUE_SLEEP;
+ update_curr(cfs_rq); + + if (flags & DEQUEUE_DELAYED) { + SCHED_WARN_ON(!se->sched_delayed); + } else { + bool delay = sleep; + /* + * DELAY_DEQUEUE relies on spurious wakeups, special task + * states must not suffer spurious wakeups, exempt them. + */ + if (flags & DEQUEUE_SPECIAL) + delay = false; + + SCHED_WARN_ON(delay && se->sched_delayed); + + if (sched_feat(DELAY_DEQUEUE) && delay && + !entity_eligible(cfs_rq, se)) { + if (cfs_rq->next == se) + cfs_rq->next = NULL; + update_load_avg(cfs_rq, se, 0); + se->sched_delayed = 1; + return false; + } + } + + int action = UPDATE_TG; if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) action |= DO_DETACH;
- /* - * Update run-time statistics of the 'current'. - */ - update_curr(cfs_rq); - /* * When dequeuing a sched_entity, we must: * - Update loads to have both entity and cfs_rq synced with now. @@@ -5403,6 -5514,11 +5522,11 @@@ clear_buddies(cfs_rq, se);
update_entity_lag(cfs_rq, se); + if (sched_feat(PLACE_REL_DEADLINE) && !sleep) { + se->deadline -= se->vruntime; + se->rel_deadline = 1; + } + if (se != cfs_rq->curr) __dequeue_entity(cfs_rq, se); se->on_rq = 0; @@@ -5422,8 -5538,13 +5546,13 @@@ if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE) update_min_vruntime(cfs_rq);
+ if (flags & DEQUEUE_DELAYED) + finish_delayed_dequeue_entity(se); + if (cfs_rq->nr_running == 0) update_idle_cfs_rq_clock_pelt(cfs_rq); + + return true; }
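dequeue_entity() above is the core of DELAY_DEQUEUE: a sleeping entity that is still ineligible (it owes vruntime to the queue) is not removed; it is only marked sched_delayed and the function returns false, so block_task()/dequeue_task() leave it on the runqueue to keep competing until its lag is paid off, and the dequeue is finished later with DEQUEUE_DELAYED. A toy model of that contract between the bool-returning dequeue and its caller; every name here is illustrative, not kernel API:

#include <stdbool.h>
#include <stdio.h>

struct toy_entity {
        bool on_rq;
        bool sched_delayed;
        long lag;       /* negative: has run ahead, i.e. not eligible */
};

/* Returns false when the dequeue is deferred (entity stays queued, delayed). */
static bool toy_dequeue(struct toy_entity *se, bool sleep)
{
        if (sleep && se->lag < 0) {
                se->sched_delayed = true;
                return false;
        }
        se->on_rq = false;
        se->sched_delayed = false;
        return true;
}

/* Caller mirrors block_task(): only block the task if the dequeue completed. */
static void toy_block(struct toy_entity *se)
{
        if (toy_dequeue(se, true))
                printf("blocked immediately\n");
        else
                printf("delayed: on_rq=%d sched_delayed=%d\n",
                       se->on_rq, se->sched_delayed);
}

int main(void)
{
        struct toy_entity se = { .on_rq = true, .lag = -500 };

        toy_block(&se);         /* delayed: on_rq=1 sched_delayed=1 */

        /* Later, once the entity is eligible again, the dequeue completes. */
        se.lag = 0;
        if (toy_dequeue(&se, true))
                printf("dequeue completed\n");
        return 0;
}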
static void @@@ -5449,6 -5570,7 +5578,7 @@@ set_next_entity(struct cfs_rq *cfs_rq, }
update_stats_curr_start(cfs_rq, se); + SCHED_WARN_ON(cfs_rq->curr); cfs_rq->curr = se;
/* @@@ -5469,6 -5591,8 +5599,8 @@@ se->prev_sum_exec_runtime = se->sum_exec_runtime; }
+ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags); + /* * Pick the next process, keeping these things in mind, in this order: * 1) keep things fair between processes/task groups @@@ -5477,16 -5601,26 +5609,26 @@@ * 4) do not run the "skip" process, if something else is available */ static struct sched_entity * - pick_next_entity(struct cfs_rq *cfs_rq) + pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq) { /* * Enabling NEXT_BUDDY will affect latency but not fairness. */ if (sched_feat(NEXT_BUDDY) && - cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) + cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next)) { + /* ->next will never be delayed */ + SCHED_WARN_ON(cfs_rq->next->sched_delayed); return cfs_rq->next; + }
- return pick_eevdf(cfs_rq); + struct sched_entity *se = pick_eevdf(cfs_rq); + if (se->sched_delayed) { + dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); + SCHED_WARN_ON(se->sched_delayed); + SCHED_WARN_ON(se->on_rq); + return NULL; + } + return se; }
static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq); @@@ -5510,6 -5644,7 +5652,7 @@@ static void put_prev_entity(struct cfs_ /* in !on_rq case, update occurred at dequeue */ update_load_avg(cfs_rq, prev, 0); } + SCHED_WARN_ON(cfs_rq->curr != prev); cfs_rq->curr = NULL; }
@@@ -5773,6 -5908,7 +5916,7 @@@ static bool throttle_cfs_rq(struct cfs_ struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; long task_delta, idle_task_delta, dequeue = 1; + long rq_h_nr_running = rq->cfs.h_nr_running;
raw_spin_lock(&cfs_b->lock); /* This will start the period timer if necessary */ @@@ -5806,11 -5942,21 +5950,21 @@@ idle_task_delta = cfs_rq->idle_h_nr_running; for_each_sched_entity(se) { struct cfs_rq *qcfs_rq = cfs_rq_of(se); + int flags; + /* throttled entity or throttle-on-deactivate */ if (!se->on_rq) goto done;
- dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP); + /* + * Abuse SPECIAL to avoid delayed dequeue in this instance. + * This avoids teaching dequeue_entities() about throttled + * entities and keeps things relatively simple. + */ + flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL; + if (se->sched_delayed) + flags |= DEQUEUE_DELAYED; + dequeue_entity(qcfs_rq, se, flags);
if (cfs_rq_is_idle(group_cfs_rq(se))) idle_task_delta = cfs_rq->h_nr_running; @@@ -5844,6 -5990,9 +5998,9 @@@ /* At this point se is NULL and we are at root level*/ sub_nr_running(rq, task_delta);
+ /* Stop the fair server if throttling resulted in no runnable tasks */ + if (rq_h_nr_running && !rq->cfs.h_nr_running) + dl_server_stop(&rq->fair_server); done: /* * Note: distribution will already see us throttled via the @@@ -5862,6 -6011,7 +6019,7 @@@ void unthrottle_cfs_rq(struct cfs_rq *c struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg); struct sched_entity *se; long task_delta, idle_task_delta; + long rq_h_nr_running = rq->cfs.h_nr_running;
se = cfs_rq->tg->se[cpu_of(rq)];
@@@ -5899,8 -6049,10 +6057,10 @@@ for_each_sched_entity(se) { struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- if (se->on_rq) + if (se->on_rq) { + SCHED_WARN_ON(se->sched_delayed); break; + } enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
if (cfs_rq_is_idle(group_cfs_rq(se))) @@@ -5931,6 -6083,10 +6091,10 @@@ goto unthrottle_throttle; }
+ /* Start the fair server if un-throttling resulted in new runnable tasks */ + if (!rq_h_nr_running && rq->cfs.h_nr_running) + dl_server_start(&rq->fair_server); + /* At this point se is NULL and we are at root level*/ add_nr_running(rq, task_delta);
@@@ -6563,7 -6719,7 +6727,7 @@@ static void sched_fair_update_stop_tick { int cpu = cpu_of(rq);
- if (!sched_feat(HZ_BW) || !cfs_bandwidth_used()) + if (!cfs_bandwidth_used()) return;
if (!tick_nohz_full_cpu(cpu)) @@@ -6746,6 -6902,37 +6910,37 @@@ static int sched_idle_cpu(int cpu } #endif
+ static void + requeue_delayed_entity(struct sched_entity *se) + { + struct cfs_rq *cfs_rq = cfs_rq_of(se); + + /* + * se->sched_delayed should imply: se->on_rq == 1. + * Because a delayed entity is one that is still on + * the runqueue competing until eligibility. + */ + SCHED_WARN_ON(!se->sched_delayed); + SCHED_WARN_ON(!se->on_rq); + + if (sched_feat(DELAY_ZERO)) { + update_entity_lag(cfs_rq, se); + if (se->vlag > 0) { + cfs_rq->nr_running--; + if (se != cfs_rq->curr) + __dequeue_entity(cfs_rq, se); + se->vlag = 0; + place_entity(cfs_rq, se, 0); + if (se != cfs_rq->curr) + __enqueue_entity(cfs_rq, se); + cfs_rq->nr_running++; + } + } + + update_load_avg(cfs_rq, se, 0); + se->sched_delayed = 0; + } + /* * The enqueue_task method is called before nr_running is * increased. Here we update the fair scheduling stats and @@@ -6758,6 -6945,8 +6953,8 @@@ enqueue_task_fair(struct rq *rq, struc struct sched_entity *se = &p->se; int idle_h_nr_running = task_has_idle_policy(p); int task_new = !(flags & ENQUEUE_WAKEUP); + int rq_h_nr_running = rq->cfs.h_nr_running; + u64 slice = 0;
/* * The code below (indirectly) updates schedutil which looks at @@@ -6765,7 -6954,13 +6962,13 @@@ * Let's add the task's estimated utilization to the cfs_rq's * estimated utilization, before we update schedutil. */ - util_est_enqueue(&rq->cfs, p); + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & ENQUEUE_RESTORE)))) + util_est_enqueue(&rq->cfs, p); + + if (flags & ENQUEUE_DELAYED) { + requeue_delayed_entity(se); + return; + }
/* * If in_iowait is set, the code below may not trigger any cpufreq @@@ -6776,10 -6971,24 +6979,24 @@@ cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
for_each_sched_entity(se) { - if (se->on_rq) + if (se->on_rq) { + if (se->sched_delayed) + requeue_delayed_entity(se); break; + } cfs_rq = cfs_rq_of(se); + + /* + * Basically set the slice of group entries to the min_slice of + * their respective cfs_rq. This ensures the group can service + * its entities in the desired time-frame. + */ + if (slice) { + se->slice = slice; + se->custom_slice = 1; + } enqueue_entity(cfs_rq, se, flags); + slice = cfs_rq_min_slice(cfs_rq);
cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running; @@@ -6801,6 -7010,9 +7018,9 @@@ se_update_runnable(se); update_cfs_group(se);
+ se->slice = slice; + slice = cfs_rq_min_slice(cfs_rq); + cfs_rq->h_nr_running++; cfs_rq->idle_h_nr_running += idle_h_nr_running;
@@@ -6812,6 -7024,13 +7032,13 @@@ goto enqueue_throttle; }
+ if (!rq_h_nr_running && rq->cfs.h_nr_running) { + /* Account for idle runtime */ + if (!rq->nr_running) + dl_server_update_idle_time(rq, rq->curr); + dl_server_start(&rq->fair_server); + } + /* At this point se is NULL and we are at root level*/ add_nr_running(rq, 1);
@@@ -6841,36 -7060,59 +7068,59 @@@ enqueue_throttle static void set_next_buddy(struct sched_entity *se);
/* - * The dequeue_task method is called before nr_running is - * decreased. We remove the task from the rbtree and - * update the fair scheduling stats: + * Basically dequeue_task_fair(), except it can deal with dequeue_entity() + * failing half-way through and resume the dequeue later. + * + * Returns: + * -1 - dequeue delayed + * 0 - dequeue throttled + * 1 - dequeue complete */ - static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) + static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags) { - struct cfs_rq *cfs_rq; - struct sched_entity *se = &p->se; - int task_sleep = flags & DEQUEUE_SLEEP; - int idle_h_nr_running = task_has_idle_policy(p); bool was_sched_idle = sched_idle_rq(rq); + int rq_h_nr_running = rq->cfs.h_nr_running; + bool task_sleep = flags & DEQUEUE_SLEEP; + bool task_delayed = flags & DEQUEUE_DELAYED; + struct task_struct *p = NULL; + int idle_h_nr_running = 0; + int h_nr_running = 0; + struct cfs_rq *cfs_rq; + u64 slice = 0;
- util_est_dequeue(&rq->cfs, p); + if (entity_is_task(se)) { + p = task_of(se); + h_nr_running = 1; + idle_h_nr_running = task_has_idle_policy(p); + } else { + cfs_rq = group_cfs_rq(se); + slice = cfs_rq_min_slice(cfs_rq); + }
for_each_sched_entity(se) { cfs_rq = cfs_rq_of(se); - dequeue_entity(cfs_rq, se, flags);
- cfs_rq->h_nr_running--; + if (!dequeue_entity(cfs_rq, se, flags)) { + if (p && &p->se == se) + return -1; + + break; + } + + cfs_rq->h_nr_running -= h_nr_running; cfs_rq->idle_h_nr_running -= idle_h_nr_running;
if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running = 1; + idle_h_nr_running = h_nr_running;
/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - goto dequeue_throttle; + return 0;
/* Don't dequeue parent if it has other entities besides us */ if (cfs_rq->load.weight) { + slice = cfs_rq_min_slice(cfs_rq); + /* Avoid re-evaluating load for this entity: */ se = parent_entity(se); /* @@@ -6882,6 -7124,7 +7132,7 @@@ break; } flags |= DEQUEUE_SLEEP; + flags &= ~(DEQUEUE_DELAYED | DEQUEUE_SPECIAL); }
for_each_sched_entity(se) { @@@ -6891,28 -7134,61 +7142,61 @@@ se_update_runnable(se); update_cfs_group(se);
- cfs_rq->h_nr_running--; + se->slice = slice; + slice = cfs_rq_min_slice(cfs_rq); + + cfs_rq->h_nr_running -= h_nr_running; cfs_rq->idle_h_nr_running -= idle_h_nr_running;
if (cfs_rq_is_idle(cfs_rq)) - idle_h_nr_running = 1; + idle_h_nr_running = h_nr_running;
/* end evaluation on encountering a throttled cfs_rq */ if (cfs_rq_throttled(cfs_rq)) - goto dequeue_throttle; - + return 0; }
- /* At this point se is NULL and we are at root level*/ - sub_nr_running(rq, 1); + sub_nr_running(rq, h_nr_running); + + if (rq_h_nr_running && !rq->cfs.h_nr_running) + dl_server_stop(&rq->fair_server);
/* balance early to pull high priority tasks */ if (unlikely(!was_sched_idle && sched_idle_rq(rq))) rq->next_balance = jiffies;
- dequeue_throttle: - util_est_update(&rq->cfs, p, task_sleep); + if (p && task_delayed) { + SCHED_WARN_ON(!task_sleep); + SCHED_WARN_ON(p->on_rq != 1); + + /* Fix-up what dequeue_task_fair() skipped */ + hrtick_update(rq); + + /* Fix-up what block_task() skipped. */ + __block_task(rq, p); + } + + return 1; + } + + /* + * The dequeue_task method is called before nr_running is + * decreased. We remove the task from the rbtree and + * update the fair scheduling stats: + */ + static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) + { + if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE)))) + util_est_dequeue(&rq->cfs, p); + + if (dequeue_entities(rq, &p->se, flags) < 0) { + util_est_update(&rq->cfs, p, DEQUEUE_SLEEP); + return false; + } + + util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP); hrtick_update(rq); + return true; }
#ifdef CONFIG_SMP @@@ -7810,6 -8086,105 +8094,105 @@@ static unsigned long cpu_util_without(i return cpu_util(cpu, p, -1, 0); }
+ /* + * This function computes an effective utilization for the given CPU, to be + * used for frequency selection given the linear relation: f = u * f_max. + * + * The scheduler tracks the following metrics: + * + * cpu_util_{cfs,rt,dl,irq}() + * cpu_bw_dl() + * + * Where the cfs,rt and dl util numbers are tracked with the same metric and + * synchronized windows and are thus directly comparable. + * + * The cfs,rt,dl utilization are the running times measured with rq->clock_task + * which excludes things like IRQ and steal-time. These latter are then accrued + * in the IRQ utilization. + * + * The DL bandwidth number OTOH is not a measured metric but a value computed + * based on the task model parameters and gives the minimal utilization + * required to meet deadlines. + */ + unsigned long effective_cpu_util(int cpu, unsigned long util_cfs, + unsigned long *min, + unsigned long *max) + { + unsigned long util, irq, scale; + struct rq *rq = cpu_rq(cpu); + + scale = arch_scale_cpu_capacity(cpu); + + /* + * Early check to see if IRQ/steal time saturates the CPU, can be + * because of inaccuracies in how we track these -- see + * update_irq_load_avg(). + */ + irq = cpu_util_irq(rq); + if (unlikely(irq >= scale)) { + if (min) + *min = scale; + if (max) + *max = scale; + return scale; + } + + if (min) { + /* + * The minimum utilization returns the highest level between: + * - the computed DL bandwidth needed with the IRQ pressure which + * steals time to the deadline task. + * - The minimum performance requirement for CFS and/or RT. + */ + *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN)); + + /* + * When an RT task is runnable and uclamp is not used, we must + * ensure that the task will run at maximum compute capacity. + */ + if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt)) + *min = max(*min, scale); + } + + /* + * Because the time spend on RT/DL tasks is visible as 'lost' time to + * CFS tasks and we use the same metric to track the effective + * utilization (PELT windows are synchronized) we can directly add them + * to obtain the CPU's actual utilization. + */ + util = util_cfs + cpu_util_rt(rq); + util += cpu_util_dl(rq); + + /* + * The maximum hint is a soft bandwidth requirement, which can be lower + * than the actual utilization because of uclamp_max requirements. + */ + if (max) + *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX)); + + if (util >= scale) + return scale; + + /* + * There is still idle time; further improve the number by using the + * IRQ metric. Because IRQ/steal time is hidden from the task clock we + * need to scale the task numbers: + * + * max - irq + * U' = irq + --------- * U + * max + */ + util = scale_irq_capacity(util, irq, scale); + util += irq; + + return min(scale, util); + } + + unsigned long sched_cpu_util(int cpu) + { + return effective_cpu_util(cpu, cpu_util_cfs(cpu), NULL, NULL); + } + /* * energy_env - Utilization landscape for energy estimation. * @task_busy_time: Utilization contribution by the task for which we test the @@@ -8294,7 -8669,21 +8677,21 @@@ static void migrate_task_rq_fair(struc
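The effective_cpu_util() block above ends with the correction U' = irq + U * (max - irq) / max, which squeezes the CFS/RT/DL utilization into the share of the CPU not already consumed by IRQ or steal time before clamping to the capacity. A short numeric sketch of just that formula, with made-up values:

#include <stdio.h>

/* U' = irq + U * (max - irq) / max, then clamp to the CPU capacity. */
static unsigned long effective_util(unsigned long util, unsigned long irq,
                                    unsigned long max)
{
        if (irq >= max)
                return max;

        util = util * (max - irq) / max;
        util += irq;
        return util < max ? util : max;
}

int main(void)
{
        /* capacity 1024, 256 units eaten by IRQ, 512 units of task utilization */
        printf("%lu\n", effective_util(512, 256, 1024));        /* 640 */
        return 0;
}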
static void task_dead_fair(struct task_struct *p) { - remove_entity_load_avg(&p->se); + struct sched_entity *se = &p->se; + + if (se->sched_delayed) { + struct rq_flags rf; + struct rq *rq; + + rq = task_rq_lock(p, &rf); + if (se->sched_delayed) { + update_rq_clock(rq); + dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); + } + task_rq_unlock(rq, p, &rf); + } + + remove_entity_load_avg(se); }
/* @@@ -8330,7 -8719,7 +8727,7 @@@ static void set_cpus_allowed_fair(struc static int balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { - if (rq->nr_running) + if (sched_fair_runnable(rq)) return 1;
return sched_balance_newidle(rq, rf) != 0; @@@ -8389,16 -8778,7 +8786,7 @@@ static void check_preempt_wakeup_fair(s if (test_tsk_need_resched(curr)) return;
- /* Idle tasks are by definition preempted by non-idle tasks. */ - if (unlikely(task_has_idle_policy(curr)) && - likely(!task_has_idle_policy(p))) - goto preempt; - - /* - * Batch and idle tasks do not preempt non-idle tasks (their preemption - * is driven by the tick): - */ - if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION)) + if (!sched_feat(WAKEUP_PREEMPTION)) return;
find_matching_se(&se, &pse); @@@ -8408,7 -8788,7 +8796,7 @@@ pse_is_idle = se_is_idle(pse);
/* - * Preempt an idle group in favor of a non-idle group (and don't preempt + * Preempt an idle entity in favor of a non-idle entity (and don't preempt * in the inverse case). */ if (cse_is_idle && !pse_is_idle) @@@ -8416,11 -8796,26 +8804,26 @@@ if (cse_is_idle != pse_is_idle) return;
+ /* + * BATCH and IDLE tasks do not preempt others. + */ + if (unlikely(p->policy != SCHED_NORMAL)) + return; + cfs_rq = cfs_rq_of(se); update_curr(cfs_rq); + /* + * If @p has a shorter slice than current and @p is eligible, override + * current's slice protection in order to allow preemption. + * + * Note that even if @p does not turn out to be the most eligible + * task at this moment, current's slice protection will be lost. + */ + if (do_preempt_short(cfs_rq, pse, se) && se->vlag == se->deadline) + se->vlag = se->deadline + 1;
/* - * XXX pick_eevdf(cfs_rq) != se ? + * If @p has become the most eligible task, force preemption. */ if (pick_eevdf(cfs_rq) == pse) goto preempt; @@@ -8431,7 -8826,6 +8834,6 @@@ preempt resched_curr(rq); }
- #ifdef CONFIG_SMP static struct task_struct *pick_task_fair(struct rq *rq) { struct sched_entity *se; @@@ -8443,95 -8837,58 +8845,58 @@@ again return NULL;
do { - struct sched_entity *curr = cfs_rq->curr; - - /* When we pick for a remote RQ, we'll not have done put_prev_entity() */ - if (curr) { - if (curr->on_rq) - update_curr(cfs_rq); - else - curr = NULL; + /* Might not have done put_prev_entity() */ + if (cfs_rq->curr && cfs_rq->curr->on_rq) + update_curr(cfs_rq);
- if (unlikely(check_cfs_rq_runtime(cfs_rq))) - goto again; - } + if (unlikely(check_cfs_rq_runtime(cfs_rq))) + goto again;
- se = pick_next_entity(cfs_rq); + se = pick_next_entity(rq, cfs_rq); + if (!se) + goto again; cfs_rq = group_cfs_rq(se); } while (cfs_rq);
return task_of(se); } - #endif + + static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first); + static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
struct task_struct * pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { - struct cfs_rq *cfs_rq = &rq->cfs; struct sched_entity *se; struct task_struct *p; int new_tasks;
again: - if (!sched_fair_runnable(rq)) + p = pick_task_fair(rq); + if (!p) goto idle; + se = &p->se;
#ifdef CONFIG_FAIR_GROUP_SCHED - if (!prev || prev->sched_class != &fair_sched_class) + if (prev->sched_class != &fair_sched_class) goto simple;
+ __put_prev_set_next_dl_server(rq, prev, p); + /* * Because of the set_next_buddy() in dequeue_task_fair() it is rather * likely that a next task is from the same cgroup as the current. * * Therefore attempt to avoid putting and setting the entire cgroup * hierarchy, only change the part that actually changes. - */ - - do { - struct sched_entity *curr = cfs_rq->curr; - - /* - * Since we got here without doing put_prev_entity() we also - * have to consider cfs_rq->curr. If it is still a runnable - * entity, update_curr() will update its vruntime, otherwise - * forget we've ever seen it. - */ - if (curr) { - if (curr->on_rq) - update_curr(cfs_rq); - else - curr = NULL; - - /* - * This call to check_cfs_rq_runtime() will do the - * throttle and dequeue its entity in the parent(s). - * Therefore the nr_running test will indeed - * be correct. - */ - if (unlikely(check_cfs_rq_runtime(cfs_rq))) { - cfs_rq = &rq->cfs; - - if (!cfs_rq->nr_running) - goto idle; - - goto simple; - } - } - - se = pick_next_entity(cfs_rq); - cfs_rq = group_cfs_rq(se); - } while (cfs_rq); - - p = task_of(se); - - /* + * * Since we haven't yet done put_prev_entity and if the selected task * is a different task than we started out with, try and touch the * least amount of cfs_rqs. */ if (prev != p) { struct sched_entity *pse = &prev->se; + struct cfs_rq *cfs_rq;
while (!(cfs_rq = is_same_group(se, pse))) { int se_depth = se->depth; @@@ -8549,38 -8906,15 +8914,15 @@@
put_prev_entity(cfs_rq, pse); set_next_entity(cfs_rq, se); - } - - goto done; - simple: - #endif - if (prev) - put_prev_task(rq, prev);
- do { - se = pick_next_entity(cfs_rq); - set_next_entity(cfs_rq, se); - cfs_rq = group_cfs_rq(se); - } while (cfs_rq); + __set_next_task_fair(rq, p, true); + }
- p = task_of(se); + return p;
- done: __maybe_unused; - #ifdef CONFIG_SMP - /* - * Move the next running task to the front of - * the list, so our cfs_tasks list becomes MRU - * one. - */ - list_move(&p->se.group_node, &rq->cfs_tasks); + simple: #endif - - if (hrtick_enabled_fair(rq)) - hrtick_start_fair(rq, p); - - update_misfit_status(p, rq); - sched_fair_update_stop_tick(rq, p); - + put_prev_set_next_task(rq, prev, p); return p;
idle: @@@ -8609,15 -8943,34 +8951,34 @@@ return NULL; }
- static struct task_struct *__pick_next_task_fair(struct rq *rq) + static struct task_struct *__pick_next_task_fair(struct rq *rq, struct task_struct *prev) { - return pick_next_task_fair(rq, NULL, NULL); + return pick_next_task_fair(rq, prev, NULL); + } + + static bool fair_server_has_tasks(struct sched_dl_entity *dl_se) + { + return !!dl_se->rq->cfs.nr_running; + } + + static struct task_struct *fair_server_pick_task(struct sched_dl_entity *dl_se) + { + return pick_task_fair(dl_se->rq); + } + + void fair_server_init(struct rq *rq) + { + struct sched_dl_entity *dl_se = &rq->fair_server; + + init_dl_entity(dl_se); + + dl_server_init(dl_se, rq, fair_server_has_tasks, fair_server_pick_task); }
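fair_server_init() above attaches a deadline "server" entity to each runqueue on behalf of the fair class: dl_server_init() is handed two callbacks, one reporting whether the served class has runnable work and one picking a task from it, and other hunks in this diff start or stop the server as the first fair task arrives or the last one leaves. A structural sketch of that callback registration in plain C; the types and names are stand-ins, not the kernel's:

#include <stdbool.h>
#include <stdio.h>

struct toy_rq { int nr_fair_running; };

struct toy_server {
        struct toy_rq *rq;
        bool (*has_tasks)(struct toy_server *);
        const char *(*pick_task)(struct toy_server *);
};

static bool fair_has_tasks(struct toy_server *srv)
{
        return srv->rq->nr_fair_running > 0;
}

static const char *fair_pick_task(struct toy_server *srv)
{
        return srv->rq->nr_fair_running ? "some-fair-task" : NULL;
}

/* Mirrors the shape of dl_server_init(dl_se, rq, has_tasks, pick_task). */
static void server_init(struct toy_server *srv, struct toy_rq *rq,
                        bool (*has_tasks)(struct toy_server *),
                        const char *(*pick_task)(struct toy_server *))
{
        srv->rq = rq;
        srv->has_tasks = has_tasks;
        srv->pick_task = pick_task;
}

int main(void)
{
        struct toy_rq rq = { .nr_fair_running = 1 };
        struct toy_server srv;

        server_init(&srv, &rq, fair_has_tasks, fair_pick_task);
        if (srv.has_tasks(&srv))
                printf("server picks: %s\n", srv.pick_task(&srv));
        return 0;
}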
/* * Account for a descheduled task: */ - static void put_prev_task_fair(struct rq *rq, struct task_struct *prev) + static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, struct task_struct *next) { struct sched_entity *se = &prev->se; struct cfs_rq *cfs_rq; @@@ -9368,9 -9721,10 +9729,10 @@@ static bool __update_blocked_others(str
hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
+ /* hw_pressure doesn't care about invariance */ decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) | update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) | - update_hw_load_avg(now, rq, hw_pressure) | + update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure) | update_irq_load_avg(rq, 0);
if (others_have_blocked(rq)) @@@ -12491,7 -12845,7 +12853,7 @@@ out * - indirectly from a remote scheduler_tick() for NOHZ idle balancing * through the SMP cross-call nohz_csd_func() */ - static __latent_entropy void sched_balance_softirq(struct softirq_action *h) + static __latent_entropy void sched_balance_softirq(void) { struct rq *this_rq = this_rq(); enum cpu_idle_type idle = this_rq->idle_balance; @@@ -12710,22 -13064,7 +13072,7 @@@ static void task_tick_fair(struct rq *r */ static void task_fork_fair(struct task_struct *p) { - struct sched_entity *se = &p->se, *curr; - struct cfs_rq *cfs_rq; - struct rq *rq = this_rq(); - struct rq_flags rf; - - rq_lock(rq, &rf); - update_rq_clock(rq); - set_task_max_allowed_capacity(p); - - cfs_rq = task_cfs_rq(current); - curr = cfs_rq->curr; - if (curr) - update_curr(cfs_rq); - place_entity(cfs_rq, se, ENQUEUE_INITIAL); - rq_unlock(rq, &rf); }
/* @@@ -12837,10 -13176,28 +13184,28 @@@ static void attach_task_cfs_rq(struct t static void switched_from_fair(struct rq *rq, struct task_struct *p) { detach_task_cfs_rq(p); + /* + * Since this is called after changing class, this is a little weird + * and we cannot use DEQUEUE_DELAYED. + */ + if (p->se.sched_delayed) { + /* First, dequeue it from its new class' structures */ + dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP); + /* + * Now, clean up the fair_sched_class side of things + * related to sched_delayed being true and that wasn't done + * due to the generic dequeue not using DEQUEUE_DELAYED. + */ + finish_delayed_dequeue_entity(&p->se); + p->se.rel_deadline = 0; + __block_task(rq, p); + } }
static void switched_to_fair(struct rq *rq, struct task_struct *p) { + SCHED_WARN_ON(p->se.sched_delayed); + attach_task_cfs_rq(p);
set_task_max_allowed_capacity(p); @@@ -12858,12 -13215,7 +13223,7 @@@ } }
- /* Account for a task changing its policy or group. - * - * This routine is mostly called to set cfs_rq->curr field when a task - * migrates between groups/classes. - */ - static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) + static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) { struct sched_entity *se = &p->se;
@@@ -12876,6 -13228,27 +13236,27 @@@ list_move(&se->group_node, &rq->cfs_tasks); } #endif + if (!first) + return; + + SCHED_WARN_ON(se->sched_delayed); + + if (hrtick_enabled_fair(rq)) + hrtick_start_fair(rq, p); + + update_misfit_status(p, rq); + sched_fair_update_stop_tick(rq, p); + } + + /* + * Account for a task changing its policy or group. + * + * This routine is mostly called to set cfs_rq->curr field when a task + * migrates between groups/classes. + */ + static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first) + { + struct sched_entity *se = &p->se;
for_each_sched_entity(se) { struct cfs_rq *cfs_rq = cfs_rq_of(se); @@@ -12884,12 -13257,14 +13265,14 @@@ /* ensure bandwidth has been allocated on our new cfs_rq */ account_cfs_rq_runtime(cfs_rq, 0); } + + __set_next_task_fair(rq, p, first); }
void init_cfs_rq(struct cfs_rq *cfs_rq) { cfs_rq->tasks_timeline = RB_ROOT_CACHED; - u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20))); + cfs_rq->min_vruntime = (u64)(-(1LL << 20)); #ifdef CONFIG_SMP raw_spin_lock_init(&cfs_rq->removed.lock); #endif @@@ -12991,28 -13366,35 +13374,35 @@@ void online_fair_sched_group(struct tas
void unregister_fair_sched_group(struct task_group *tg) { - unsigned long flags; - struct rq *rq; int cpu;
destroy_cfs_bandwidth(tg_cfs_bandwidth(tg));
for_each_possible_cpu(cpu) { - if (tg->se[cpu]) - remove_entity_load_avg(tg->se[cpu]); + struct cfs_rq *cfs_rq = tg->cfs_rq[cpu]; + struct sched_entity *se = tg->se[cpu]; + struct rq *rq = cpu_rq(cpu); + + if (se) { + if (se->sched_delayed) { + guard(rq_lock_irqsave)(rq); + if (se->sched_delayed) { + update_rq_clock(rq); + dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED); + } + list_del_leaf_cfs_rq(cfs_rq); + } + remove_entity_load_avg(se); + }
/* * Only empty task groups can be destroyed; so we can speculatively * check on_list without danger of it being re-added. */ - if (!tg->cfs_rq[cpu]->on_list) - continue; - - rq = cpu_rq(cpu); - - raw_spin_rq_lock_irqsave(rq, flags); - list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); - raw_spin_rq_unlock_irqrestore(rq, flags); + if (cfs_rq->on_list) { + guard(rq_lock_irqsave)(rq); + list_del_leaf_cfs_rq(cfs_rq); + } } }
@@@ -13202,13 -13584,13 +13592,13 @@@ DEFINE_SCHED_CLASS(fair) =
.wakeup_preempt = check_preempt_wakeup_fair,
+ .pick_task = pick_task_fair, .pick_next_task = __pick_next_task_fair, .put_prev_task = put_prev_task_fair, .set_next_task = set_next_task_fair,
#ifdef CONFIG_SMP .balance = balance_fair, - .pick_task = pick_task_fair, .select_task_rq = select_task_rq_fair, .migrate_task_rq = migrate_task_rq_fair,
diff --combined kernel/signal.c index cc5d87cfa7c0a,6f3a5aa39b091..4344860ffcac7 --- a/kernel/signal.c +++ b/kernel/signal.c @@@ -618,20 -618,18 +618,18 @@@ static int __dequeue_signal(struct sigp }
/* - * Dequeue a signal and return the element to the caller, which is - * expected to free it. - * - * All callers have to hold the siglock. + * Try to dequeue a signal. If a deliverable signal is found fill in the + * caller provided siginfo and return the signal number. Otherwise return + * 0. */ - int dequeue_signal(struct task_struct *tsk, sigset_t *mask, - kernel_siginfo_t *info, enum pid_type *type) + int dequeue_signal(sigset_t *mask, kernel_siginfo_t *info, enum pid_type *type) { + struct task_struct *tsk = current; bool resched_timer = false; int signr;
- /* We only dequeue private signals from ourselves, we don't let - * signalfd steal them - */ + lockdep_assert_held(&tsk->sighand->siglock); + *type = PIDTYPE_PID; signr = __dequeue_signal(&tsk->pending, mask, info, &resched_timer); if (!signr) { @@@ -1940,10 -1938,11 +1938,11 @@@ struct sigqueue *sigqueue_alloc(void
void sigqueue_free(struct sigqueue *q) { - unsigned long flags; spinlock_t *lock = ¤t->sighand->siglock; + unsigned long flags;
- BUG_ON(!(q->flags & SIGQUEUE_PREALLOC)); + if (WARN_ON_ONCE(!(q->flags & SIGQUEUE_PREALLOC))) + return; /* * We must hold ->siglock while testing q->list * to serialize with collect_signal() or with @@@ -1971,7 -1970,10 +1970,10 @@@ int send_sigqueue(struct sigqueue *q, s unsigned long flags; int ret, result;
- BUG_ON(!(q->flags & SIGQUEUE_PREALLOC)); + if (WARN_ON_ONCE(!(q->flags & SIGQUEUE_PREALLOC))) + return 0; + if (WARN_ON_ONCE(q->info.si_code != SI_TIMER)) + return 0;
ret = -1; rcu_read_lock(); @@@ -2006,7 -2008,6 +2008,6 @@@ * If an SI_TIMER entry is already queue just increment * the overrun count. */ - BUG_ON(q->info.si_code != SI_TIMER); q->info.si_overrun++; result = TRACE_SIGNAL_ALREADY_PENDING; goto out; @@@ -2793,8 -2794,7 +2794,7 @@@ relock type = PIDTYPE_PID; signr = dequeue_synchronous_signal(&ksig->info); if (!signr) - signr = dequeue_signal(current, ¤t->blocked, - &ksig->info, &type); + signr = dequeue_signal(¤t->blocked, &ksig->info, &type);
if (!signr) break; /* will return 0 */ @@@ -3648,7 -3648,7 +3648,7 @@@ static int do_sigtimedwait(const sigset signotset(&mask);
spin_lock_irq(&tsk->sighand->siglock); - sig = dequeue_signal(tsk, &mask, info, &type); + sig = dequeue_signal(&mask, info, &type); if (!sig && timeout) { /* * None ready, temporarily unblock those we're interested @@@ -3667,7 -3667,7 +3667,7 @@@ spin_lock_irq(&tsk->sighand->siglock); __set_task_blocked(tsk, &tsk->real_blocked); sigemptyset(&tsk->real_blocked); - sig = dequeue_signal(tsk, &mask, info, &type); + sig = dequeue_signal(&mask, info, &type); } spin_unlock_irq(&tsk->sighand->siglock);
@@@ -3922,11 -3922,11 +3922,11 @@@ SYSCALL_DEFINE4(pidfd_send_signal, int return -EINVAL;
f = fdget(pidfd); - if (!f.file) + if (!fd_file(f)) return -EBADF;
/* Is this a pidfd? */ - pid = pidfd_to_pid(f.file); + pid = pidfd_to_pid(fd_file(f)); if (IS_ERR(pid)) { ret = PTR_ERR(pid); goto err; @@@ -3939,7 -3939,7 +3939,7 @@@ switch (flags) { case 0: /* Infer scope from the type of pidfd. */ - if (f.file->f_flags & PIDFD_THREAD) + if (fd_file(f)->f_flags & PIDFD_THREAD) type = PIDTYPE_PID; else type = PIDTYPE_TGID; diff --combined kernel/sys.c index a4be1e568ff5c,b7e096e1c3a13..4da31f28fda81 --- a/kernel/sys.c +++ b/kernel/sys.c @@@ -1916,10 -1916,10 +1916,10 @@@ static int prctl_set_mm_exe_file(struc int err;
exe = fdget(fd); - if (!exe.file) + if (!fd_file(exe)) return -EBADF;
- inode = file_inode(exe.file); + inode = file_inode(fd_file(exe));
/* * Because the original mm->exe_file points to executable file, make @@@ -1927,14 -1927,14 +1927,14 @@@ * overall picture. */ err = -EACCES; - if (!S_ISREG(inode->i_mode) || path_noexec(&exe.file->f_path)) + if (!S_ISREG(inode->i_mode) || path_noexec(&fd_file(exe)->f_path)) goto exit;
- err = file_permission(exe.file, MAY_EXEC); + err = file_permission(fd_file(exe), MAY_EXEC); if (err) goto exit;
- err = replace_mm_exe_file(mm, exe.file); + err = replace_mm_exe_file(mm, fd_file(exe)); exit: fdput(exe); return err; @@@ -2557,6 -2557,8 +2557,8 @@@ SYSCALL_DEFINE5(prctl, int, option, uns error = current->timer_slack_ns; break; case PR_SET_TIMERSLACK: + if (rt_or_dl_task_policy(current)) + break; if (arg2 <= 0) current->timer_slack_ns = current->default_timer_slack_ns; diff --combined kernel/trace/bpf_trace.c index 68b5905c6eb52,ac0a01cc8634a..9c330e7a101a4 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@@ -24,6 -24,7 +24,6 @@@ #include <linux/key.h> #include <linux/verification.h> #include <linux/namei.h> -#include <linux/fileattr.h>
#include <net/bpf_sk_storage.h>
@@@ -797,6 -798,29 +797,6 @@@ const struct bpf_func_proto bpf_task_pt .ret_btf_id = &bpf_task_pt_regs_ids[0], };
-BPF_CALL_2(bpf_current_task_under_cgroup, struct bpf_map *, map, u32, idx) -{ - struct bpf_array *array = container_of(map, struct bpf_array, map); - struct cgroup *cgrp; - - if (unlikely(idx >= array->map.max_entries)) - return -E2BIG; - - cgrp = READ_ONCE(array->ptrs[idx]); - if (unlikely(!cgrp)) - return -EAGAIN; - - return task_under_cgroup_hierarchy(current, cgrp); -} - -static const struct bpf_func_proto bpf_current_task_under_cgroup_proto = { - .func = bpf_current_task_under_cgroup, - .gpl_only = false, - .ret_type = RET_INTEGER, - .arg1_type = ARG_CONST_MAP_PTR, - .arg2_type = ARG_ANYTHING, -}; - struct send_signal_irq_work { struct irq_work irq_work; struct task_struct *task; @@@ -1415,6 -1439,73 +1415,6 @@@ static int __init bpf_key_sig_kfuncs_in late_initcall(bpf_key_sig_kfuncs_init); #endif /* CONFIG_KEYS */
-/* filesystem kfuncs */ -__bpf_kfunc_start_defs(); - -/** - * bpf_get_file_xattr - get xattr of a file - * @file: file to get xattr from - * @name__str: name of the xattr - * @value_p: output buffer of the xattr value - * - * Get xattr *name__str* of *file* and store the output in *value_ptr*. - * - * For security reasons, only *name__str* with prefix "user." is allowed. - * - * Return: 0 on success, a negative value on error. - */ -__bpf_kfunc int bpf_get_file_xattr(struct file *file, const char *name__str, - struct bpf_dynptr *value_p) -{ - struct bpf_dynptr_kern *value_ptr = (struct bpf_dynptr_kern *)value_p; - struct dentry *dentry; - u32 value_len; - void *value; - int ret; - - if (strncmp(name__str, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN)) - return -EPERM; - - value_len = __bpf_dynptr_size(value_ptr); - value = __bpf_dynptr_data_rw(value_ptr, value_len); - if (!value) - return -EINVAL; - - dentry = file_dentry(file); - ret = inode_permission(&nop_mnt_idmap, dentry->d_inode, MAY_READ); - if (ret) - return ret; - return __vfs_getxattr(dentry, dentry->d_inode, name__str, value, value_len); -} - -__bpf_kfunc_end_defs(); - -BTF_KFUNCS_START(fs_kfunc_set_ids) -BTF_ID_FLAGS(func, bpf_get_file_xattr, KF_SLEEPABLE | KF_TRUSTED_ARGS) -BTF_KFUNCS_END(fs_kfunc_set_ids) - -static int bpf_get_file_xattr_filter(const struct bpf_prog *prog, u32 kfunc_id) -{ - if (!btf_id_set8_contains(&fs_kfunc_set_ids, kfunc_id)) - return 0; - - /* Only allow to attach from LSM hooks, to avoid recursion */ - return prog->type != BPF_PROG_TYPE_LSM ? -EACCES : 0; -} - -static const struct btf_kfunc_id_set bpf_fs_kfunc_set = { - .owner = THIS_MODULE, - .set = &fs_kfunc_set_ids, - .filter = bpf_get_file_xattr_filter, -}; - -static int __init bpf_fs_kfuncs_init(void) -{ - return register_btf_kfunc_id_set(BPF_PROG_TYPE_LSM, &bpf_fs_kfunc_set); -} - -late_initcall(bpf_fs_kfuncs_init); - static const struct bpf_func_proto * bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@@ -1457,6 -1548,8 +1457,6 @@@ return &bpf_get_numa_node_id_proto; case BPF_FUNC_perf_event_read: return &bpf_perf_event_read_proto; - case BPF_FUNC_current_task_under_cgroup: - return &bpf_current_task_under_cgroup_proto; case BPF_FUNC_get_prandom_u32: return &bpf_get_prandom_u32_proto; case BPF_FUNC_probe_write_user: @@@ -1485,8 -1578,6 +1485,8 @@@ return &bpf_cgrp_storage_get_proto; case BPF_FUNC_cgrp_storage_delete: return &bpf_cgrp_storage_delete_proto; + case BPF_FUNC_current_task_under_cgroup: + return &bpf_current_task_under_cgroup_proto; #endif case BPF_FUNC_send_signal: return &bpf_send_signal_proto; @@@ -1507,8 -1598,7 +1507,8 @@@ case BPF_FUNC_jiffies64: return &bpf_jiffies64_proto; case BPF_FUNC_get_task_stack: - return &bpf_get_task_stack_proto; + return prog->sleepable ? &bpf_get_task_stack_sleepable_proto + : &bpf_get_task_stack_proto; case BPF_FUNC_copy_from_user: return &bpf_copy_from_user_proto; case BPF_FUNC_copy_from_user_task: @@@ -1564,7 -1654,7 +1564,7 @@@ kprobe_prog_func_proto(enum bpf_func_i case BPF_FUNC_get_stackid: return &bpf_get_stackid_proto; case BPF_FUNC_get_stack: - return &bpf_get_stack_proto; + return prog->sleepable ? &bpf_get_stack_sleepable_proto : &bpf_get_stack_proto; #ifdef CONFIG_BPF_KPROBE_OVERRIDE case BPF_FUNC_override_return: return &bpf_override_return_proto; @@@ -3070,6 -3160,7 +3070,7 @@@ struct bpf_uprobe loff_t offset; unsigned long ref_ctr_offset; u64 cookie; + struct uprobe *uprobe; struct uprobe_consumer consumer; };
@@@ -3088,15 -3179,15 +3089,15 @@@ struct bpf_uprobe_multi_run_ctx struct bpf_uprobe *uprobe; };
- static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes, - u32 cnt) + static void bpf_uprobe_unregister(struct bpf_uprobe *uprobes, u32 cnt) { u32 i;
- for (i = 0; i < cnt; i++) { - uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset, - &uprobes[i].consumer); - } + for (i = 0; i < cnt; i++) + uprobe_unregister_nosync(uprobes[i].uprobe, &uprobes[i].consumer); + + if (cnt) + uprobe_unregister_sync(); }
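The helper above reflects the reworked uprobe API used throughout this file: uprobe_register() (seen in the attach path further below) now returns a struct uprobe * or an ERR_PTR instead of an int, and unregistration is split into a per-consumer uprobe_unregister_nosync() plus a single uprobe_unregister_sync() for the whole batch. A minimal sketch of the register/unregister pair, with inode, offset, ref_ctr_offset and consumer standing in for caller-provided values (illustrative only):

    struct uprobe *u;

    u = uprobe_register(inode, offset, ref_ctr_offset, &consumer);
    if (IS_ERR(u))
            return PTR_ERR(u);

    /* ... */

    uprobe_unregister_nosync(u, &consumer);  /* detach this consumer */
    uprobe_unregister_sync();                /* one sync covers all detached consumers */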
static void bpf_uprobe_multi_link_release(struct bpf_link *link) @@@ -3104,7 -3195,7 +3105,7 @@@ struct bpf_uprobe_multi_link *umulti_link;
umulti_link = container_of(link, struct bpf_uprobe_multi_link, link); - bpf_uprobe_unregister(&umulti_link->path, umulti_link->uprobes, umulti_link->cnt); + bpf_uprobe_unregister(umulti_link->uprobes, umulti_link->cnt); if (umulti_link->task) put_task_struct(umulti_link->task); path_put(&umulti_link->path); @@@ -3208,7 -3299,7 +3209,7 @@@ static int uprobe_prog_run(struct bpf_u struct bpf_run_ctx *old_run_ctx; int err = 0;
- if (link->task && current->mm != link->task->mm) + if (link->task && !same_thread_group(current, link->task)) return 0;
if (sleepable) @@@ -3232,8 -3323,7 +3233,7 @@@ }
static bool - uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx, - struct mm_struct *mm) + uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm) { struct bpf_uprobe *uprobe;
@@@ -3390,22 -3480,26 +3390,26 @@@ int bpf_uprobe_multi_link_attach(const &bpf_uprobe_multi_link_lops, prog);
for (i = 0; i < cnt; i++) { - err = uprobe_register_refctr(d_real_inode(link->path.dentry), - uprobes[i].offset, - uprobes[i].ref_ctr_offset, - &uprobes[i].consumer); - if (err) { - bpf_uprobe_unregister(&path, uprobes, i); - goto error_free; + uprobes[i].uprobe = uprobe_register(d_real_inode(link->path.dentry), + uprobes[i].offset, + uprobes[i].ref_ctr_offset, + &uprobes[i].consumer); + if (IS_ERR(uprobes[i].uprobe)) { + err = PTR_ERR(uprobes[i].uprobe); + link->cnt = i; + goto error_unregister; } }
err = bpf_link_prime(&link->link, &link_primer); if (err) - goto error_free; + goto error_unregister;
return bpf_link_settle(&link_primer);
+ error_unregister: + bpf_uprobe_unregister(uprobes, link->cnt); + error_free: kvfree(uprobes); kfree(link); diff --combined lib/Kconfig.debug index 3524b8be84029,26354671b37df..48d2a7498d14b --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@@ -97,7 -97,7 +97,7 @@@ config BOOT_PRINTK_DELA using "boot_delay=N".
It is likely that you would also need to use "lpj=M" to preset - the "loops per jiffie" value. + the "loops per jiffy" value. See a previous boot log for the "lpj" value to use for your system, and then set "lpj=M" before setting "boot_delay=N". NOTE: Using this option may adversely affect SMP systems. @@@ -571,21 -571,6 +571,21 @@@ config VMLINUX_MA pieces of code get eliminated with CONFIG_LD_DEAD_CODE_DATA_ELIMINATION.
+config BUILTIN_MODULE_RANGES + bool "Generate address range information for builtin modules" + depends on !LTO + depends on VMLINUX_MAP + help + When modules are built into the kernel, there will be no module name + associated with its symbols in /proc/kallsyms. Tracers may want to + identify symbols by module name and symbol name regardless of whether + the module is configured as loadable or not. + + This option generates modules.builtin.ranges in the build tree with + offset ranges (per ELF section) for the module(s) they belong to. + It also records an anchor symbol to determine the load address of the + section. + config DEBUG_FORCE_WEAK_PER_CPU bool "Force weak per-cpu definitions" depends on DEBUG_KERNEL @@@ -1530,7 -1515,7 +1530,7 @@@ config LOCKDEP_BIT config LOCKDEP_CHAINS_BITS int "Bitsize for MAX_LOCKDEP_CHAINS" depends on LOCKDEP && !LOCKDEP_SMALL - range 10 30 + range 10 21 default 16 help Try increasing this value if you hit "BUG: MAX_LOCKDEP_CHAINS too low!" message. @@@ -2034,7 -2019,7 +2034,7 @@@ config FAULT_INJECTIO depends on DEBUG_KERNEL help Provide fault-injection framework. - For more details, see Documentation/fault-injection/. + For more details, see Documentation/dev-tools/fault-injection/.
config FAILSLAB bool "Fault-injection capability for kmalloc" @@@ -2188,6 -2173,14 +2188,14 @@@ config KCOV_IRQ_AREA_SIZ soft interrupts. This specifies the size of those areas in the number of unsigned long words.
+ config KCOV_SELFTEST + bool "Perform short selftests on boot" + depends on KCOV + help + Run short KCOV coverage collection selftests on boot. + On test failure, causes the kernel to panic. Recommended to be + enabled, ensuring critical functionality works as intended. + menuconfig RUNTIME_TESTING_MENU bool "Runtime Testing" default y @@@ -2240,7 -2233,7 +2248,7 @@@ config LKDT called lkdtm.
Documentation on how to use the module can be found in - Documentation/fault-injection/provoke-crashes.rst + Documentation/dev-tools/fault-injection/provoke-crashes.rst
config CPUMASK_KUNIT_TEST tristate "KUnit test for cpumask" if !KUNIT_ALL_TESTS @@@ -2295,16 -2288,6 +2303,16 @@@ config TEST_DIV6
If unsure, say N.
+config TEST_MULDIV64 + tristate "mul_u64_u64_div_u64() test" + depends on DEBUG_KERNEL || m + help + Enable this to turn on 'mul_u64_u64_div_u64()' function test. + This test is executed only once during system boot (so affects + only boot time), or at module load time. + + If unsure, say N. + config TEST_IOV_ITER tristate "Test iov_iter operation" if !KUNIT_ALL_TESTS depends on KUNIT @@@ -2641,7 -2624,6 +2649,7 @@@ config RESOURCE_KUNIT_TES tristate "KUnit test for resource API" if !KUNIT_ALL_TESTS depends on KUNIT default KUNIT_ALL_TESTS + select GET_FREE_REGION help This builds the resource API unit test. Tests the logic of API provided by resource.c and ioport.h. diff --combined mm/page-writeback.c index f5448311c89eb,7a04cb1918fd5..fcd4c1439cb9c --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@@ -418,7 -418,7 +418,7 @@@ static void domain_dirty_limits(struct bg_thresh = (bg_ratio * available_memory) / PAGE_SIZE;
tsk = current; - if (rt_task(tsk)) { + if (rt_or_dl_task(tsk)) { bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32; thresh += thresh / 4 + global_wb_domain.dirty_limit / 32; } @@@ -477,7 -477,7 +477,7 @@@ static unsigned long node_dirty_limit(s else dirty = vm_dirty_ratio * node_memory / 100;
- if (rt_task(tsk)) + if (rt_or_dl_task(tsk)) dirty += dirty / 4;
/* @@@ -2612,7 -2612,7 +2612,7 @@@ struct folio *writeback_iter(struct add
done: if (wbc->range_cyclic) - mapping->writeback_index = folio->index + folio_nr_pages(folio); + mapping->writeback_index = folio_next_index(folio); folio_batch_release(&wbc->fbatch); return NULL; } diff --combined mm/page_alloc.c index aebc4529d5fcd,0aefae4a26b20..0f33dab6d344f --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@@ -286,7 -286,9 +286,7 @@@ EXPORT_SYMBOL(nr_online_nodes) #endif
static bool page_contains_unaccepted(struct page *page, unsigned int order); -static void accept_page(struct page *page, unsigned int order); static bool cond_accept_memory(struct zone *zone, unsigned int order); -static inline bool has_unaccepted_memory(void); static bool __free_unaccepted(struct page *page);
int page_group_by_mobility_disabled __read_mostly; @@@ -320,11 -322,6 +320,11 @@@ static inline bool deferred_pages_enabl { return false; } + +static inline bool _deferred_grow_zone(struct zone *zone, unsigned int order) +{ + return false; +} #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
/* Return a pointer to the bitmap storing bits affecting a block of pages */ @@@ -961,9 -958,8 +961,9 @@@ static int free_tail_page_prepare(struc break; case 2: /* the second tail page: deferred_list overlaps ->mapping */ - if (unlikely(!list_empty(&folio->_deferred_list))) { - bad_page(page, "on deferred list"); + if (unlikely(!list_empty(&folio->_deferred_list) && + folio_test_partially_mapped(folio))) { + bad_page(page, "partially mapped folio on deferred list"); goto out; } break; @@@ -1091,11 -1087,8 +1091,11 @@@ __always_inline bool free_pages_prepare (page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; } } - if (PageMappingFlags(page)) + if (PageMappingFlags(page)) { + if (PageAnon(page)) + mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1); page->mapping = NULL; + } if (is_check_pages_enabled()) { if (free_page_is_bad(page)) bad++; @@@ -1206,39 -1199,17 +1206,39 @@@ static void free_pcppages_bulk(struct z spin_unlock_irqrestore(&zone->lock, flags); }
+/* Split a multi-block free page into its individual pageblocks. */ +static void split_large_buddy(struct zone *zone, struct page *page, + unsigned long pfn, int order, fpi_t fpi) +{ + unsigned long end = pfn + (1 << order); + + VM_WARN_ON_ONCE(!IS_ALIGNED(pfn, 1 << order)); + /* Caller removed page from freelist, buddy info cleared! */ + VM_WARN_ON_ONCE(PageBuddy(page)); + + if (order > pageblock_order) + order = pageblock_order; + + while (pfn != end) { + int mt = get_pfnblock_migratetype(page, pfn); + + __free_one_page(page, pfn, zone, order, mt, fpi); + pfn += 1 << order; + page = pfn_to_page(pfn); + } +} + static void free_one_page(struct zone *zone, struct page *page, unsigned long pfn, unsigned int order, fpi_t fpi_flags) { unsigned long flags; - int migratetype;
spin_lock_irqsave(&zone->lock, flags); - migratetype = get_pfnblock_migratetype(page, pfn); - __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); + split_large_buddy(zone, page, pfn, order, fpi_flags); spin_unlock_irqrestore(&zone->lock, flags); + + __count_vm_events(PGFREE, 1 << order); }
static void __free_pages_ok(struct page *page, unsigned int order, @@@ -1247,8 -1218,12 +1247,8 @@@ unsigned long pfn = page_to_pfn(page); struct zone *zone = page_zone(page);
- if (!free_pages_prepare(page, order)) - return; - - free_one_page(zone, page, pfn, order, fpi_flags); - - __count_vm_events(PGFREE, 1 << order); + if (free_pages_prepare(page, order)) + free_one_page(zone, page, pfn, order, fpi_flags); }
void __meminit __free_pages_core(struct page *page, unsigned int order, @@@ -1295,7 -1270,7 +1295,7 @@@ if (order == MAX_PAGE_ORDER && __free_unaccepted(page)) return;
- accept_page(page, order); + accept_memory(page_to_phys(page), PAGE_SIZE << order); }
/* @@@ -1371,11 -1346,11 +1371,11 @@@ struct page *__pageblock_pfn_to_page(un * * -- nyc */ -static inline void expand(struct zone *zone, struct page *page, - int low, int high, int migratetype) +static inline unsigned int expand(struct zone *zone, struct page *page, int low, + int high, int migratetype) { - unsigned long size = 1 << high; - unsigned long nr_added = 0; + unsigned int size = 1 << high; + unsigned int nr_added = 0;
while (high > low) { high--; @@@ -1395,19 -1370,7 +1395,19 @@@ set_buddy_order(&page[size], high); nr_added += size; } - account_freepages(zone, nr_added, migratetype); + + return nr_added; +} + +static __always_inline void page_del_and_expand(struct zone *zone, + struct page *page, int low, + int high, int migratetype) +{ + int nr_pages = 1 << high; + + __del_page_from_free_list(page, zone, high, migratetype); + nr_pages -= expand(zone, page, low, high, migratetype); + account_freepages(zone, -nr_pages, migratetype); }
static void check_new_page_bad(struct page *page) @@@ -1577,9 -1540,8 +1577,9 @@@ struct page *__rmqueue_smallest(struct page = get_page_from_free_area(area, migratetype); if (!page) continue; - del_page_from_free_list(page, zone, current_order, migratetype); - expand(zone, page, order, current_order, migratetype); + + page_del_and_expand(zone, page, order, current_order, + migratetype); trace_mm_page_alloc_zone_locked(page, order, migratetype, pcp_allowed_order(order) && migratetype < MIGRATE_PCPTYPES); @@@ -1738,6 -1700,27 +1738,6 @@@ static unsigned long find_large_buddy(u return start_pfn; }
-/* Split a multi-block free page into its individual pageblocks */ -static void split_large_buddy(struct zone *zone, struct page *page, - unsigned long pfn, int order) -{ - unsigned long end_pfn = pfn + (1 << order); - - VM_WARN_ON_ONCE(order <= pageblock_order); - VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1)); - - /* Caller removed page from freelist, buddy info cleared! */ - VM_WARN_ON_ONCE(PageBuddy(page)); - - while (pfn != end_pfn) { - int mt = get_pfnblock_migratetype(page, pfn); - - __free_one_page(page, pfn, zone, pageblock_order, mt, FPI_NONE); - pfn += pageblock_nr_pages; - page = pfn_to_page(pfn); - } -} - /** * move_freepages_block_isolate - move free pages in block for page isolation * @zone: the zone @@@ -1778,7 -1761,7 +1778,7 @@@ bool move_freepages_block_isolate(struc del_page_from_free_list(buddy, zone, order, get_pfnblock_migratetype(buddy, pfn)); set_pageblock_migratetype(page, migratetype); - split_large_buddy(zone, buddy, pfn, order); + split_large_buddy(zone, buddy, pfn, order, FPI_NONE); return true; }
@@@ -1789,7 -1772,7 +1789,7 @@@ del_page_from_free_list(page, zone, order, get_pfnblock_migratetype(page, pfn)); set_pageblock_migratetype(page, migratetype); - split_large_buddy(zone, page, pfn, order); + split_large_buddy(zone, page, pfn, order, FPI_NONE); return true; } move: @@@ -1909,12 -1892,9 +1909,12 @@@ steal_suitable_fallback(struct zone *zo
/* Take ownership for orders >= pageblock_order */ if (current_order >= pageblock_order) { + unsigned int nr_added; + del_page_from_free_list(page, zone, current_order, block_type); change_pageblock_range(page, current_order, start_type); - expand(zone, page, order, current_order, start_type); + nr_added = expand(zone, page, order, current_order, start_type); + account_freepages(zone, nr_added, start_type); return page; }
@@@ -1967,7 -1947,8 +1967,7 @@@ }
single_page: - del_page_from_free_list(page, zone, current_order, block_type); - expand(zone, page, order, current_order, block_type); + page_del_and_expand(zone, page, order, current_order, block_type); return page; }
@@@ -2236,43 -2217,6 +2236,43 @@@ do_steal return page; }
+#ifdef CONFIG_CMA +/* + * GFP_MOVABLE allocation could drain UNMOVABLE & RECLAIMABLE page blocks via + * the help of CMA which makes GFP_KERNEL failed. Checking if zone_watermark_ok + * again without ALLOC_CMA to see if to use CMA first. + */ +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags) +{ + unsigned long watermark; + bool cma_first = false; + + watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); + /* check if GFP_MOVABLE pass previous zone_watermark_ok via the help of CMA */ + if (zone_watermark_ok(zone, order, watermark, 0, alloc_flags & (~ALLOC_CMA))) { + /* + * Balance movable allocations between regular and CMA areas by + * allocating from CMA when over half of the zone's free memory + * is in the CMA area. + */ + cma_first = (zone_page_state(zone, NR_FREE_CMA_PAGES) > + zone_page_state(zone, NR_FREE_PAGES) / 2); + } else { + /* + * watermark failed means UNMOVABLE & RECLAIMBLE is not enough + * now, we should use cma first to keep them stay around the + * corresponding watermark + */ + cma_first = true; + } + return cma_first; +} +#else +static bool use_cma_first(struct zone *zone, unsigned int order, unsigned int alloc_flags) +{ + return false; +} +#endif /* * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. @@@ -2286,11 -2230,12 +2286,11 @@@ __rmqueue(struct zone *zone, unsigned i if (IS_ENABLED(CONFIG_CMA)) { /* * Balance movable allocations between regular and CMA areas by - * allocating from CMA when over half of the zone's free memory - * is in the CMA area. + * allocating from CMA base on judging zone_watermark_ok again + * to see if the latest check got pass via the help of CMA */ if (alloc_flags & ALLOC_CMA && - zone_page_state(zone, NR_FREE_CMA_PAGES) > - zone_page_state(zone, NR_FREE_PAGES) / 2) { + use_cma_first(zone, order, alloc_flags)) { page = __rmqueue_cma_fallback(zone, order); if (page) return page; @@@ -2819,7 -2764,7 +2819,7 @@@ void split_page(struct page *page, unsi for (i = 1; i < (1 << order); i++) set_page_refcounted(page + i); split_page_owner(page, order, 0); - pgalloc_tag_split(page, 1 << order); + pgalloc_tag_split(page_folio(page), order, 0); split_page_memcg(page, order, 0); } EXPORT_SYMBOL_GPL(split_page); @@@ -3088,6 -3033,12 +3088,6 @@@ struct page *rmqueue(struct zone *prefe { struct page *page;
- /* - * We most definitely don't want callers attempting to - * allocate greater than order-1 page units with __GFP_NOFAIL. - */ - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); - if (likely(pcp_allowed_order(order))) { page = rmqueue_pcplist(preferred_zone, zone, order, migratetype, alloc_flags); @@@ -3406,7 -3357,7 +3406,7 @@@ retry }
if (no_fallback && nr_online_nodes > 1 && - zone != ac->preferred_zoneref->zone) { + zone != zonelist_zone(ac->preferred_zoneref)) { int local_nid;
/* @@@ -3414,7 -3365,7 +3414,7 @@@ * fragmenting fallbacks. Locality is more important * than fragmentation avoidance. */ - local_nid = zone_to_nid(ac->preferred_zoneref->zone); + local_nid = zonelist_node_idx(ac->preferred_zoneref); if (zone_to_nid(zone) != local_nid) { alloc_flags &= ~ALLOC_NOFRAGMENT; goto retry; @@@ -3451,6 -3402,7 +3451,6 @@@ check_alloc_wmark if (cond_accept_memory(zone, order)) goto try_this_zone;
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* * Watermark failed for this zone, but see if we can * grow this zone if it contains deferred pages. @@@ -3459,13 -3411,14 +3459,13 @@@ if (_deferred_grow_zone(zone, order)) goto try_this_zone; } -#endif /* Checked here to keep the fast path fast */ BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); if (alloc_flags & ALLOC_NO_WATERMARKS) goto try_this_zone;
if (!node_reclaim_enabled() || - !zone_allows_reclaim(ac->preferred_zoneref->zone, zone)) + !zone_allows_reclaim(zonelist_zone(ac->preferred_zoneref), zone)) continue;
ret = node_reclaim(zone->zone_pgdat, gfp_mask, order); @@@ -3487,7 -3440,7 +3487,7 @@@ }
try_this_zone: - page = rmqueue(ac->preferred_zoneref->zone, zone, order, + page = rmqueue(zonelist_zone(ac->preferred_zoneref), zone, order, gfp_mask, alloc_flags, ac->migratetype); if (page) { prep_new_page(page, order, gfp_mask, alloc_flags); @@@ -3504,11 -3457,13 +3504,11 @@@ if (cond_accept_memory(zone, order)) goto try_this_zone;
-#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT /* Try again if zone has deferred pages */ if (deferred_pages_enabled()) { if (_deferred_grow_zone(zone, order)) goto try_this_zone; } -#endif } }
@@@ -4049,7 -4004,7 +4049,7 @@@ gfp_to_alloc_flags(gfp_t gfp_mask, unsi */ if (alloc_flags & ALLOC_MIN_RESERVE) alloc_flags &= ~ALLOC_CPUSET; - } else if (unlikely(rt_task(current)) && in_task()) + } else if (unlikely(rt_or_dl_task(current)) && in_task()) alloc_flags |= ALLOC_MIN_RESERVE;
alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, alloc_flags); @@@ -4145,11 -4100,6 +4145,11 @@@ should_reclaim_retry(gfp_t gfp_mask, un unsigned long min_wmark = min_wmark_pages(zone); bool wmark;
+ if (cpusets_enabled() && + (alloc_flags & ALLOC_CPUSET) && + !__cpuset_zone_allowed(zone, gfp_mask)) + continue; + available = reclaimable = zone_reclaimable_pages(zone); available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@@ -4225,7 -4175,6 +4225,7 @@@ __alloc_pages_slowpath(gfp_t gfp_mask, { bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM; bool can_compact = gfp_compaction_allowed(gfp_mask); + bool nofail = gfp_mask & __GFP_NOFAIL; const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER; struct page *page = NULL; unsigned int alloc_flags; @@@ -4238,25 -4187,6 +4238,25 @@@ unsigned int zonelist_iter_cookie; int reserve_flags;
+ if (unlikely(nofail)) { + /* + * We most definitely don't want callers attempting to + * allocate greater than order-1 page units with __GFP_NOFAIL. + */ + WARN_ON_ONCE(order > 1); + /* + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, + * otherwise, we may result in lockup. + */ + WARN_ON_ONCE(!can_direct_reclaim); + /* + * PF_MEMALLOC request from this context is rather bizarre + * because we cannot reclaim anything and only can loop waiting + * for somebody to do a work for us. + */ + WARN_ON_ONCE(current->flags & PF_MEMALLOC); + } + restart: compaction_retries = 0; no_progress_loops = 0; @@@ -4279,7 -4209,7 +4279,7 @@@ */ ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, ac->nodemask); - if (!ac->preferred_zoneref->zone) + if (!zonelist_zone(ac->preferred_zoneref)) goto nopage;
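The new sanity checks at the top of __alloc_pages_slowpath() spell out the __GFP_NOFAIL contract: order must not exceed 1, the request must be blockable (__GFP_DIRECT_RECLAIM set), and it must not come from a PF_MEMALLOC context. A conforming caller therefore looks roughly like this (illustrative only, not taken from the patch):

    /* order-0; GFP_KERNEL implies __GFP_DIRECT_RECLAIM, so this may block
     * for a long time but will not return NULL. */
    struct page *page = alloc_pages(GFP_KERNEL | __GFP_NOFAIL, 0);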
/* @@@ -4291,7 -4221,7 +4291,7 @@@ struct zoneref *z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx, &cpuset_current_mems_allowed); - if (!z->zone) + if (!zonelist_zone(z)) goto nopage; }
@@@ -4474,15 -4404,29 +4474,15 @@@ nopage * Make sure that __GFP_NOFAIL request doesn't leak out and make sure * we always retry */ - if (gfp_mask & __GFP_NOFAIL) { + if (unlikely(nofail)) { /* - * All existing users of the __GFP_NOFAIL are blockable, so warn - * of any new users that actually require GFP_NOWAIT + * Lacking direct_reclaim we can't do anything to reclaim memory, + * we disregard these unreasonable nofail requests and still + * return NULL */ - if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) + if (!can_direct_reclaim) goto fail;
- /* - * PF_MEMALLOC request from this context is rather bizarre - * because we cannot reclaim anything and only can loop waiting - * for somebody to do a work for us - */ - WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask); - - /* - * non failing costly orders are a hard requirement which we - * are not prepared for much so let's warn about these users - * so that we can identify them and convert them to something - * else. - */ - WARN_ON_ONCE_GFP(costly_order, gfp_mask); - /* * Help non-failing allocations by giving some access to memory * reserves normally used for high priority non-blocking @@@ -4634,28 -4578,17 +4634,28 @@@ unsigned long alloc_pages_bulk_noprof(g continue; }
- if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone && - zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) { + if (nr_online_nodes > 1 && zone != zonelist_zone(ac.preferred_zoneref) && + zone_to_nid(zone) != zonelist_node_idx(ac.preferred_zoneref)) { goto failed; }
+ cond_accept_memory(zone, 0); +retry_this_zone: mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages; if (zone_watermark_fast(zone, 0, mark, zonelist_zone_idx(ac.preferred_zoneref), alloc_flags, gfp)) { break; } + + if (cond_accept_memory(zone, 0)) + goto retry_this_zone; + + /* Try again if zone has deferred pages */ + if (deferred_pages_enabled()) { + if (_deferred_grow_zone(zone, 0)) + goto retry_this_zone; + } }
/* @@@ -4705,7 -4638,7 +4705,7 @@@ pcp_trylock_finish(UP_flags);
__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account); - zone_statistics(ac.preferred_zoneref->zone, zone, nr_account); + zone_statistics(zonelist_zone(ac.preferred_zoneref), zone, nr_account);
out: return nr_populated; @@@ -4763,7 -4696,7 +4763,7 @@@ struct page *__alloc_pages_noprof(gfp_ * Forbid the first pass from falling back to types that fragment * memory until all local zones are considered. */ - alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp); + alloc_flags |= alloc_flags_nofragment(zonelist_zone(ac.preferred_zoneref), gfp);
/* First allocation attempt */ page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac); @@@ -5017,7 -4950,7 +5017,7 @@@ static void *make_alloc_exact(unsigned struct page *last = page + nr;
split_page_owner(page, order, 0); - pgalloc_tag_split(page, 1 << order); + pgalloc_tag_split(page_folio(page), order, 0); split_page_memcg(page, order, 0); while (page < --last) set_page_refcounted(last); @@@ -5368,7 -5301,7 +5368,7 @@@ int local_memory_node(int node z = first_zones_zonelist(node_zonelist(node, GFP_KERNEL), gfp_zone(GFP_KERNEL), NULL); - return zone_to_nid(z->zone); + return zonelist_node_idx(z); } #endif
@@@ -6500,31 -6433,6 +6500,31 @@@ int __alloc_contig_migrate_range(struc return (ret < 0) ? ret : 0; }
+static void split_free_pages(struct list_head *list) +{ + int order; + + for (order = 0; order < NR_PAGE_ORDERS; order++) { + struct page *page, *next; + int nr_pages = 1 << order; + + list_for_each_entry_safe(page, next, &list[order], lru) { + int i; + + post_alloc_hook(page, order, __GFP_MOVABLE); + if (!order) + continue; + + split_page(page, order); + + /* Add all subpages to the order-0 head, in sequence. */ + list_del(&page->lru); + for (i = 0; i < nr_pages; i++) + list_add_tail(&page[i].lru, &list[0]); + } + } +} + /** * alloc_contig_range() -- tries to allocate given range of pages * @start: start PFN to allocate @@@ -6637,25 -6545,12 +6637,25 @@@ int alloc_contig_range_noprof(unsigned goto done; }
- /* Free head and tail (if any) */ - if (start != outer_start) - free_contig_range(outer_start, start - outer_start); - if (end != outer_end) - free_contig_range(end, outer_end - end); + if (!(gfp_mask & __GFP_COMP)) { + split_free_pages(cc.freepages);
+ /* Free head and tail (if any) */ + if (start != outer_start) + free_contig_range(outer_start, start - outer_start); + if (end != outer_end) + free_contig_range(end, outer_end - end); + } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { + struct page *head = pfn_to_page(start); + int order = ilog2(end - start); + + check_new_pages(head, order); + prep_new_page(head, order, gfp_mask, 0); + } else { + ret = -EINVAL; + WARN(true, "PFN range: requested [%lu, %lu), allocated [%lu, %lu)\n", + start, end, outer_start, outer_end); + } done: undo_isolate_page_range(start, end, migratetype); return ret; @@@ -6764,18 -6659,6 +6764,18 @@@ struct page *alloc_contig_pages_noprof( void free_contig_range(unsigned long pfn, unsigned long nr_pages) { unsigned long count = 0; + struct folio *folio = pfn_folio(pfn); + + if (folio_test_large(folio)) { + int expected = folio_nr_pages(folio); + + if (nr_pages == expected) + folio_put(folio); + else + WARN(true, "PFN %lu: nr_pages %lu != expected %d\n", + pfn, nr_pages, expected); + return; + }
for (; nr_pages--; pfn++) { struct page *page = pfn_to_page(pfn); @@@ -7044,50 -6927,23 +7044,50 @@@ early_param("accept_memory", accept_mem static bool page_contains_unaccepted(struct page *page, unsigned int order) { phys_addr_t start = page_to_phys(page); - phys_addr_t end = start + (PAGE_SIZE << order);
- return range_contains_unaccepted_memory(start, end); + return range_contains_unaccepted_memory(start, PAGE_SIZE << order); }
-static void accept_page(struct page *page, unsigned int order) +static void __accept_page(struct zone *zone, unsigned long *flags, + struct page *page) { - phys_addr_t start = page_to_phys(page); + bool last; + + list_del(&page->lru); + last = list_empty(&zone->unaccepted_pages); + + account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); + __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES); + __ClearPageUnaccepted(page); + spin_unlock_irqrestore(&zone->lock, *flags); + + accept_memory(page_to_phys(page), PAGE_SIZE << MAX_PAGE_ORDER); + + __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL);
- accept_memory(start, start + (PAGE_SIZE << order)); + if (last) + static_branch_dec(&zones_with_unaccepted_pages); +} + +void accept_page(struct page *page) +{ + struct zone *zone = page_zone(page); + unsigned long flags; + + spin_lock_irqsave(&zone->lock, flags); + if (!PageUnaccepted(page)) { + spin_unlock_irqrestore(&zone->lock, flags); + return; + } + + /* Unlocks zone->lock */ + __accept_page(zone, &flags, page); }
static bool try_to_accept_memory_one(struct zone *zone) { unsigned long flags; struct page *page; - bool last;
spin_lock_irqsave(&zone->lock, flags); page = list_first_entry_or_null(&zone->unaccepted_pages, @@@ -7097,17 -6953,23 +7097,17 @@@ return false; }
- list_del(&page->lru); - last = list_empty(&zone->unaccepted_pages); - - account_freepages(zone, -MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); - __mod_zone_page_state(zone, NR_UNACCEPTED, -MAX_ORDER_NR_PAGES); - spin_unlock_irqrestore(&zone->lock, flags); - - accept_page(page, MAX_PAGE_ORDER); - - __free_pages_ok(page, MAX_PAGE_ORDER, FPI_TO_TAIL); - - if (last) - static_branch_dec(&zones_with_unaccepted_pages); + /* Unlocks zone->lock */ + __accept_page(zone, &flags, page);
return true; }
+static inline bool has_unaccepted_memory(void) +{ + return static_branch_unlikely(&zones_with_unaccepted_pages); +} + static bool cond_accept_memory(struct zone *zone, unsigned int order) { long to_accept; @@@ -7119,8 -6981,8 +7119,8 @@@ if (list_empty(&zone->unaccepted_pages)) return false;
- /* How much to accept to get to high watermark? */ - to_accept = high_wmark_pages(zone) - + /* How much to accept to get to promo watermark? */ + to_accept = promo_wmark_pages(zone) - (zone_page_state(zone, NR_FREE_PAGES) - __zone_watermark_unusable_free(zone, order, 0) - zone_page_state(zone, NR_UNACCEPTED)); @@@ -7135,6 -6997,11 +7135,6 @@@ return ret; }
-static inline bool has_unaccepted_memory(void) -{ - return static_branch_unlikely(&zones_with_unaccepted_pages); -} - static bool __free_unaccepted(struct page *page) { struct zone *zone = page_zone(page); @@@ -7149,7 -7016,6 +7149,7 @@@ list_add_tail(&page->lru, &zone->unaccepted_pages); account_freepages(zone, MAX_ORDER_NR_PAGES, MIGRATE_MOVABLE); __mod_zone_page_state(zone, NR_UNACCEPTED, MAX_ORDER_NR_PAGES); + __SetPageUnaccepted(page); spin_unlock_irqrestore(&zone->lock, flags);
if (first) @@@ -7165,11 -7031,20 +7165,11 @@@ static bool page_contains_unaccepted(st return false; }
-static void accept_page(struct page *page, unsigned int order) -{ -} - static bool cond_accept_memory(struct zone *zone, unsigned int order) { return false; }
-static inline bool has_unaccepted_memory(void) -{ - return false; -} - static bool __free_unaccepted(struct page *page) { BUILD_BUG(); diff --combined net/core/dev.c index 8f4dead64284f,e4d5e9bdd09e2..91e8f5d0d3b16 --- a/net/core/dev.c +++ b/net/core/dev.c @@@ -158,7 -158,6 +158,7 @@@ #include <net/page_pool/types.h> #include <net/page_pool/helpers.h> #include <net/rps.h> +#include <linux/phy_link_topology.h>
#include "dev.h" #include "net-sysfs.h" @@@ -3387,7 -3386,6 +3387,7 @@@ int skb_crc32c_csum_help(struct sk_buf out: return ret; } +EXPORT_SYMBOL(skb_crc32c_csum_help);
__be16 skb_network_protocol(struct sk_buff *skb, int *depth) { @@@ -3707,7 -3705,7 +3707,7 @@@ struct sk_buff *validate_xmit_skb_list( next = skb->next; skb_mark_not_on_list(skb);
- /* in case skb wont be segmented, point to itself */ + /* in case skb won't be segmented, point to itself */ skb->prev = skb;
skb = validate_xmit_skb(skb, dev, again); @@@ -4247,6 -4245,13 +4247,6 @@@ u16 dev_pick_tx_zero(struct net_device } EXPORT_SYMBOL(dev_pick_tx_zero);
-u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb, - struct net_device *sb_dev) -{ - return (u16)raw_smp_processor_id() % dev->real_num_tx_queues; -} -EXPORT_SYMBOL(dev_pick_tx_cpu_id); - u16 netdev_pick_tx(struct net_device *dev, struct sk_buff *skb, struct net_device *sb_dev) { @@@ -5243,7 -5248,7 +5243,7 @@@ int netif_rx(struct sk_buff *skb } EXPORT_SYMBOL(netif_rx);
- static __latent_entropy void net_tx_action(struct softirq_action *h) + static __latent_entropy void net_tx_action(void) { struct softnet_data *sd = this_cpu_ptr(&softnet_data);
@@@ -5720,9 -5725,10 +5720,9 @@@ static void __netif_receive_skb_list_co struct packet_type *pt_curr = NULL; /* Current (common) orig_dev of sublist */ struct net_device *od_curr = NULL; - struct list_head sublist; struct sk_buff *skb, *next; + LIST_HEAD(sublist);
- INIT_LIST_HEAD(&sublist); list_for_each_entry_safe(skb, next, head, list) { struct net_device *orig_dev = skb->dev; struct packet_type *pt_prev = NULL; @@@ -5860,8 -5866,9 +5860,8 @@@ static int netif_receive_skb_internal(s void netif_receive_skb_list_internal(struct list_head *head) { struct sk_buff *skb, *next; - struct list_head sublist; + LIST_HEAD(sublist);
- INIT_LIST_HEAD(&sublist); list_for_each_entry_safe(skb, next, head, list) { net_timestamp_check(READ_ONCE(net_hotdata.tstamp_prequeue), skb); @@@ -6914,7 -6921,7 +6914,7 @@@ static int napi_threaded_poll(void *dat return 0; }
- static __latent_entropy void net_rx_action(struct softirq_action *h) + static __latent_entropy void net_rx_action(void) { struct softnet_data *sd = this_cpu_ptr(&softnet_data); unsigned long time_limit = jiffies + @@@ -9265,7 -9272,7 +9265,7 @@@ EXPORT_SYMBOL(netdev_port_same_parent_i */ int dev_change_proto_down(struct net_device *dev, bool proto_down) { - if (!(dev->priv_flags & IFF_CHANGE_PROTO_DOWN)) + if (!dev->change_proto_down) return -EOPNOTSUPP; if (!netif_device_present(dev)) return -ENODEV; @@@ -9362,15 -9369,6 +9362,15 @@@ u8 dev_xdp_prog_count(struct net_devic } EXPORT_SYMBOL_GPL(dev_xdp_prog_count);
+int dev_xdp_propagate(struct net_device *dev, struct netdev_bpf *bpf) +{ + if (!dev->netdev_ops->ndo_bpf) + return -EOPNOTSUPP; + + return dev->netdev_ops->ndo_bpf(dev, bpf); +} +EXPORT_SYMBOL_GPL(dev_xdp_propagate); + u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode) { struct bpf_prog *prog = dev_xdp_prog(dev, mode); @@@ -10323,17 -10321,6 +10323,17 @@@ static void netdev_do_free_pcpu_stats(s } }
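dev_xdp_propagate(), added above, is a thin exported wrapper around the device's ndo_bpf callback. A rough sketch of how a caller might hand an XDP setup request down to a lower device through it (the surrounding driver context, lower_dev and prog are hypothetical):

    struct netdev_bpf xdp = {};
    int err;

    xdp.command = XDP_SETUP_PROG;
    xdp.prog = prog;                          /* struct bpf_prog * obtained elsewhere */

    err = dev_xdp_propagate(lower_dev, &xdp); /* -EOPNOTSUPP if no ndo_bpf */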
+static void netdev_free_phy_link_topology(struct net_device *dev) +{ + struct phy_link_topology *topo = dev->link_topo; + + if (IS_ENABLED(CONFIG_PHYLIB) && topo) { + xa_destroy(&topo->phys); + kfree(topo); + dev->link_topo = NULL; + } +} + /** * register_netdevice() - register a network device * @dev: device to register @@@ -10881,7 -10868,7 +10881,7 @@@ noinline void netdev_core_stats_inc(str return; }
- field = (__force unsigned long __percpu *)((__force void *)p + offset); + field = (unsigned long __percpu *)((void __percpu *)p + offset); this_cpu_inc(*field); } EXPORT_SYMBOL_GPL(netdev_core_stats_inc); @@@ -11112,7 -11099,6 +11112,7 @@@ struct net_device *alloc_netdev_mqs(in #ifdef CONFIG_NET_SCHED hash_init(dev->qdisc_hash); #endif + dev->priv_flags = IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM; setup(dev);
@@@ -11134,7 -11120,7 +11134,7 @@@ if (!dev->ethtool) goto free_all;
- strcpy(dev->name, name); + strscpy(dev->name, name); dev->name_assign_type = name_assign_type; dev->group = INIT_NETDEV_GROUP; if (!dev->ethtool_ops) @@@ -11205,8 -11191,6 +11205,8 @@@ void free_netdev(struct net_device *dev free_percpu(dev->xdp_bulkq); dev->xdp_bulkq = NULL;
+ netdev_free_phy_link_topology(dev); + /* Compatibility with error handling in drivers */ if (dev->reg_state == NETREG_UNINITIALIZED || dev->reg_state == NETREG_DUMMY) { @@@ -11423,7 -11407,7 +11423,7 @@@ void unregister_netdevice_many_notify(s * @head: list of devices * * Note: As most callers use a stack allocated list_head, - * we force a list_del() to make sure stack wont be corrupted later. + * we force a list_del() to make sure stack won't be corrupted later. */ void unregister_netdevice_many(struct list_head *head) { @@@ -11478,10 -11462,10 +11478,10 @@@ int __dev_change_net_namespace(struct n
/* Don't allow namespace local devices to be moved. */ err = -EINVAL; - if (dev->features & NETIF_F_NETNS_LOCAL) + if (dev->netns_local) goto out;
- /* Ensure the device has been registrered */ + /* Ensure the device has been registered */ if (dev->reg_state != NETREG_REGISTERED) goto out;
@@@ -11860,7 -11844,7 +11860,7 @@@ static void __net_exit default_device_e char fb_name[IFNAMSIZ];
/* Ignore unmoveable devices (i.e. loopback) */ - if (dev->features & NETIF_F_NETNS_LOCAL) + if (dev->netns_local) continue;
/* Leave virtual devices for the generic cleanup */ @@@ -11921,7 -11905,7 +11921,7 @@@ static struct pernet_operations __net_i static void __init net_dev_struct_check(void) { /* TX read-mostly hotpath */ - CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, priv_flags); + CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, priv_flags_fast); CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, netdev_ops); CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, header_ops); CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_tx, _tx); diff --combined tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 9649e7f09fc90,1fc16657cf425..8835761d9a126 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@@ -17,7 -17,6 +17,7 @@@ #include <linux/in.h> #include <linux/in6.h> #include <linux/un.h> +#include <linux/filter.h> #include <net/sock.h> #include <linux/namei.h> #include "bpf_testmod.h" @@@ -142,12 -141,13 +142,12 @@@ bpf_testmod_test_mod_kfunc(int i
__bpf_kfunc int bpf_iter_testmod_seq_new(struct bpf_iter_testmod_seq *it, s64 value, int cnt) { - if (cnt < 0) { - it->cnt = 0; + it->cnt = cnt; + + if (cnt < 0) return -EINVAL; - }
it->value = value; - it->cnt = cnt;
return 0; } @@@ -162,14 -162,6 +162,14 @@@ __bpf_kfunc s64 *bpf_iter_testmod_seq_n return &it->value; }
+__bpf_kfunc s64 bpf_iter_testmod_seq_value(int val, struct bpf_iter_testmod_seq* it__iter) +{ + if (it__iter->cnt < 0) + return 0; + + return val + it__iter->value; +} + __bpf_kfunc void bpf_iter_testmod_seq_destroy(struct bpf_iter_testmod_seq *it) { it->cnt = 0; @@@ -184,36 -176,6 +184,36 @@@ __bpf_kfunc void bpf_kfunc_dynptr_test( { }
+__bpf_kfunc struct sk_buff *bpf_kfunc_nested_acquire_nonzero_offset_test(struct sk_buff_head *ptr) +{ + return NULL; +} + +__bpf_kfunc struct sk_buff *bpf_kfunc_nested_acquire_zero_offset_test(struct sock_common *ptr) +{ + return NULL; +} + +__bpf_kfunc void bpf_kfunc_nested_release_test(struct sk_buff *ptr) +{ +} + +__bpf_kfunc void bpf_kfunc_trusted_vma_test(struct vm_area_struct *ptr) +{ +} + +__bpf_kfunc void bpf_kfunc_trusted_task_test(struct task_struct *ptr) +{ +} + +__bpf_kfunc void bpf_kfunc_trusted_num_test(int *ptr) +{ +} + +__bpf_kfunc void bpf_kfunc_rcu_task_test(struct task_struct *ptr) +{ +} + __bpf_kfunc struct bpf_testmod_ctx * bpf_testmod_ctx_create(int *err) { @@@ -394,8 -356,6 +394,8 @@@ bpf_testmod_test_read(struct file *file if (bpf_testmod_loop_test(101) > 100) trace_bpf_testmod_test_read(current, &ctx);
+ trace_bpf_testmod_test_nullable_bare(NULL); + /* Magic number to enable writable tp */ if (len == 64) { struct bpf_testmod_test_writable_ctx writable = { @@@ -472,7 -432,7 +472,7 @@@ uprobe_ret_handler(struct uprobe_consum
struct testmod_uprobe { struct path path; - loff_t offset; + struct uprobe *uprobe; struct uprobe_consumer consumer; };
@@@ -486,25 -446,25 +486,25 @@@ static int testmod_register_uprobe(loff { int err = -EBUSY;
- if (uprobe.offset) + if (uprobe.uprobe) return -EBUSY;
mutex_lock(&testmod_uprobe_mutex);
- if (uprobe.offset) + if (uprobe.uprobe) goto out;
err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, &uprobe.path); if (err) goto out;
- err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry), - offset, 0, &uprobe.consumer); - if (err) + uprobe.uprobe = uprobe_register(d_real_inode(uprobe.path.dentry), + offset, 0, &uprobe.consumer); + if (IS_ERR(uprobe.uprobe)) { + err = PTR_ERR(uprobe.uprobe); path_put(&uprobe.path); - else - uprobe.offset = offset; - + uprobe.uprobe = NULL; + } out: mutex_unlock(&testmod_uprobe_mutex); return err; @@@ -514,10 -474,11 +514,11 @@@ static void testmod_unregister_uprobe(v { mutex_lock(&testmod_uprobe_mutex);
- if (uprobe.offset) { - uprobe_unregister(d_real_inode(uprobe.path.dentry), - uprobe.offset, &uprobe.consumer); - uprobe.offset = 0; + if (uprobe.uprobe) { + uprobe_unregister_nosync(uprobe.uprobe, &uprobe.consumer); + uprobe_unregister_sync(); + path_put(&uprobe.path); + uprobe.uprobe = NULL; }
mutex_unlock(&testmod_uprobe_mutex); @@@ -571,16 -532,8 +572,16 @@@ BTF_KFUNCS_START(bpf_testmod_common_kfu BTF_ID_FLAGS(func, bpf_iter_testmod_seq_new, KF_ITER_NEW) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_next, KF_ITER_NEXT | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_destroy, KF_ITER_DESTROY) +BTF_ID_FLAGS(func, bpf_iter_testmod_seq_value) BTF_ID_FLAGS(func, bpf_kfunc_common_test) BTF_ID_FLAGS(func, bpf_kfunc_dynptr_test) +BTF_ID_FLAGS(func, bpf_kfunc_nested_acquire_nonzero_offset_test, KF_ACQUIRE) +BTF_ID_FLAGS(func, bpf_kfunc_nested_acquire_zero_offset_test, KF_ACQUIRE) +BTF_ID_FLAGS(func, bpf_kfunc_nested_release_test, KF_RELEASE) +BTF_ID_FLAGS(func, bpf_kfunc_trusted_vma_test, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_kfunc_trusted_task_test, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_kfunc_trusted_num_test, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_kfunc_rcu_task_test, KF_RCU) BTF_ID_FLAGS(func, bpf_testmod_ctx_create, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_testmod_ctx_release, KF_RELEASE) BTF_KFUNCS_END(bpf_testmod_common_kfunc_ids) @@@ -968,51 -921,6 +969,51 @@@ out return err; }
+static DEFINE_MUTEX(st_ops_mutex); +static struct bpf_testmod_st_ops *st_ops; + +__bpf_kfunc int bpf_kfunc_st_ops_test_prologue(struct st_ops_args *args) +{ + int ret = -1; + + mutex_lock(&st_ops_mutex); + if (st_ops && st_ops->test_prologue) + ret = st_ops->test_prologue(args); + mutex_unlock(&st_ops_mutex); + + return ret; +} + +__bpf_kfunc int bpf_kfunc_st_ops_test_epilogue(struct st_ops_args *args) +{ + int ret = -1; + + mutex_lock(&st_ops_mutex); + if (st_ops && st_ops->test_epilogue) + ret = st_ops->test_epilogue(args); + mutex_unlock(&st_ops_mutex); + + return ret; +} + +__bpf_kfunc int bpf_kfunc_st_ops_test_pro_epilogue(struct st_ops_args *args) +{ + int ret = -1; + + mutex_lock(&st_ops_mutex); + if (st_ops && st_ops->test_pro_epilogue) + ret = st_ops->test_pro_epilogue(args); + mutex_unlock(&st_ops_mutex); + + return ret; +} + +__bpf_kfunc int bpf_kfunc_st_ops_inc10(struct st_ops_args *args) +{ + args->a += 10; + return args->a; +} + BTF_KFUNCS_START(bpf_testmod_check_kfunc_ids) BTF_ID_FLAGS(func, bpf_testmod_test_mod_kfunc) BTF_ID_FLAGS(func, bpf_kfunc_call_test1) @@@ -1049,10 -957,6 +1050,10 @@@ BTF_ID_FLAGS(func, bpf_kfunc_call_kerne BTF_ID_FLAGS(func, bpf_kfunc_call_sock_sendmsg, KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_kfunc_call_kernel_getsockname, KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_kfunc_call_kernel_getpeername, KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_prologue, KF_TRUSTED_ARGS | KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_kfunc_st_ops_test_pro_epilogue, KF_TRUSTED_ARGS | KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_kfunc_st_ops_inc10, KF_TRUSTED_ARGS) BTF_KFUNCS_END(bpf_testmod_check_kfunc_ids)
static int bpf_testmod_ops_init(struct btf *btf) @@@ -1121,11 -1025,6 +1122,11 @@@ static void bpf_testmod_test_2(int a, i { }
+static int bpf_testmod_tramp(int value) +{ + return 0; +} + static int bpf_testmod_ops__test_maybe_null(int dummy, struct task_struct *task__nullable) { @@@ -1172,144 -1071,6 +1173,144 @@@ struct bpf_struct_ops bpf_testmod_ops2 .owner = THIS_MODULE, };
+static int bpf_test_mod_st_ops__test_prologue(struct st_ops_args *args) +{ + return 0; +} + +static int bpf_test_mod_st_ops__test_epilogue(struct st_ops_args *args) +{ + return 0; +} + +static int bpf_test_mod_st_ops__test_pro_epilogue(struct st_ops_args *args) +{ + return 0; +} + +static int st_ops_gen_prologue(struct bpf_insn *insn_buf, bool direct_write, + const struct bpf_prog *prog) +{ + struct bpf_insn *insn = insn_buf; + + if (strcmp(prog->aux->attach_func_name, "test_prologue") && + strcmp(prog->aux->attach_func_name, "test_pro_epilogue")) + return 0; + + /* r6 = r1[0]; // r6 will be "struct st_ops *args". r1 is "u64 *ctx". + * r7 = r6->a; + * r7 += 1000; + * r6->a = r7; + */ + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0); + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_7, BPF_REG_6, offsetof(struct st_ops_args, a)); + *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, 1000); + *insn++ = BPF_STX_MEM(BPF_DW, BPF_REG_6, BPF_REG_7, offsetof(struct st_ops_args, a)); + *insn++ = prog->insnsi[0]; + + return insn - insn_buf; +} + +static int st_ops_gen_epilogue(struct bpf_insn *insn_buf, const struct bpf_prog *prog, + s16 ctx_stack_off) +{ + struct bpf_insn *insn = insn_buf; + + if (strcmp(prog->aux->attach_func_name, "test_epilogue") && + strcmp(prog->aux->attach_func_name, "test_pro_epilogue")) + return 0; + + /* r1 = stack[ctx_stack_off]; // r1 will be "u64 *ctx" + * r1 = r1[0]; // r1 will be "struct st_ops *args" + * r6 = r1->a; + * r6 += 10000; + * r1->a = r6; + * r0 = r6; + * r0 *= 2; + * BPF_EXIT; + */ + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_FP, ctx_stack_off); + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0); + *insn++ = BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, offsetof(struct st_ops_args, a)); + *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_6, 10000); + *insn++ = BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_6, offsetof(struct st_ops_args, a)); + *insn++ = BPF_MOV64_REG(BPF_REG_0, BPF_REG_6); + *insn++ = BPF_ALU64_IMM(BPF_MUL, BPF_REG_0, 2); + *insn++ = BPF_EXIT_INSN(); + + return insn - insn_buf; +} + +static int st_ops_btf_struct_access(struct bpf_verifier_log *log, + const struct bpf_reg_state *reg, + int off, int size) +{ + if (off < 0 || off + size > sizeof(struct st_ops_args)) + return -EACCES; + return 0; +} + +static const struct bpf_verifier_ops st_ops_verifier_ops = { + .is_valid_access = bpf_testmod_ops_is_valid_access, + .btf_struct_access = st_ops_btf_struct_access, + .gen_prologue = st_ops_gen_prologue, + .gen_epilogue = st_ops_gen_epilogue, + .get_func_proto = bpf_base_func_proto, +}; + +static struct bpf_testmod_st_ops st_ops_cfi_stubs = { + .test_prologue = bpf_test_mod_st_ops__test_prologue, + .test_epilogue = bpf_test_mod_st_ops__test_epilogue, + .test_pro_epilogue = bpf_test_mod_st_ops__test_pro_epilogue, +}; + +static int st_ops_reg(void *kdata, struct bpf_link *link) +{ + int err = 0; + + mutex_lock(&st_ops_mutex); + if (st_ops) { + pr_err("st_ops has already been registered\n"); + err = -EEXIST; + goto unlock; + } + st_ops = kdata; + +unlock: + mutex_unlock(&st_ops_mutex); + return err; +} + +static void st_ops_unreg(void *kdata, struct bpf_link *link) +{ + mutex_lock(&st_ops_mutex); + st_ops = NULL; + mutex_unlock(&st_ops_mutex); +} + +static int st_ops_init(struct btf *btf) +{ + return 0; +} + +static int st_ops_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + return 0; +} + +static struct bpf_struct_ops testmod_st_ops = { + .verifier_ops = &st_ops_verifier_ops, + .init = st_ops_init, + 
.init_member = st_ops_init_member, + .reg = st_ops_reg, + .unreg = st_ops_unreg, + .cfi_stubs = &st_ops_cfi_stubs, + .name = "bpf_testmod_st_ops", + .owner = THIS_MODULE, +}; + extern int bpf_fentry_test1(int a);
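For readers of the generated instruction sequences above, the code that st_ops_gen_prologue() and st_ops_gen_epilogue() inject around a test_pro_epilogue program is roughly equivalent to the following C (an illustrative pseudo-equivalent, not part of the module; run_attached_bpf_prog() is a hypothetical stand-in for the attached program body):

    int test_pro_epilogue_equivalent(struct st_ops_args *args)
    {
            args->a += 1000;              /* injected prologue, runs before the BPF body */
            run_attached_bpf_prog(args);  /* the struct_ops program itself */
            args->a += 10000;             /* injected epilogue */
            return args->a * 2;           /* epilogue also replaces the return value */
    }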
static int bpf_testmod_init(void) @@@ -1320,17 -1081,14 +1321,17 @@@ .kfunc_btf_id = bpf_testmod_dtor_ids[1] }, }; + void **tramp; int ret;
ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &bpf_testmod_common_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_testmod_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_testmod_kfunc_set); ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_testmod_kfunc_set); + ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_testmod_kfunc_set); ret = ret ?: register_bpf_struct_ops(&bpf_bpf_testmod_ops, bpf_testmod_ops); ret = ret ?: register_bpf_struct_ops(&bpf_testmod_ops2, bpf_testmod_ops2); + ret = ret ?: register_bpf_struct_ops(&testmod_st_ops, bpf_testmod_st_ops); ret = ret ?: register_btf_id_dtor_kfuncs(bpf_testmod_dtors, ARRAY_SIZE(bpf_testmod_dtors), THIS_MODULE); @@@ -1346,14 -1104,6 +1347,14 @@@ ret = register_bpf_testmod_uprobe(); if (ret < 0) return ret; + + /* Ensure nothing is between tramp_1..tramp_40 */ + BUILD_BUG_ON(offsetof(struct bpf_testmod_ops, tramp_1) + 40 * sizeof(long) != + offsetofend(struct bpf_testmod_ops, tramp_40)); + tramp = (void **)&__bpf_testmod_ops.tramp_1; + while (tramp <= (void **)&__bpf_testmod_ops.tramp_40) + *tramp++ = bpf_testmod_tramp; + return 0; }
diff --combined tools/testing/selftests/mm/Makefile index d7a85059c27bd,4ea188be0588a..02e1204971b0a --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@@ -90,6 -90,7 +90,7 @@@ CAN_BUILD_X86_64 := $(shell ./../x86/ch CAN_BUILD_WITH_NOPIE := $(shell ./../x86/check_cc.sh "$(CC)" ../x86/trivial_program.c -no-pie)
VMTARGETS := protection_keys + VMTARGETS += pkey_sighandler_tests BINARIES_32 := $(VMTARGETS:%=%_32) BINARIES_64 := $(VMTARGETS:%=%_64)
@@@ -106,13 -107,13 +107,13 @@@ TEST_GEN_FILES += $(BINARIES_64 endif else
-ifneq (,$(findstring $(ARCH),powerpc)) +ifneq (,$(filter $(ARCH),arm64 powerpc)) TEST_GEN_FILES += protection_keys endif
endif
-ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390)) +ifneq (,$(filter $(ARCH),arm64 mips64 parisc64 powerpc riscv64 s390x sparc64 x86_64 s390)) TEST_GEN_FILES += va_high_addr_switch TEST_GEN_FILES += virtual_address_range TEST_GEN_FILES += write_to_hugetlbfs diff --combined tools/testing/selftests/mm/pkey-helpers.h index 15608350fc017,4d31a309a46b5..9ab6a3ee153b5 --- a/tools/testing/selftests/mm/pkey-helpers.h +++ b/tools/testing/selftests/mm/pkey-helpers.h @@@ -79,7 -79,18 +79,18 @@@ extern void abort_hooks(void) } \ } while (0)
- __attribute__((noinline)) int read_ptr(int *ptr); + #define barrier() __asm__ __volatile__("": : :"memory") + #ifndef noinline + # define noinline __attribute__((noinline)) + #endif + + noinline int read_ptr(int *ptr) + { + /* Keep GCC from optimizing this away somehow */ + barrier(); + return *ptr; + } + void expected_pkey_fault(int pkey); int sys_pkey_alloc(unsigned long flags, unsigned long init_val); int sys_pkey_free(unsigned long pkey); @@@ -91,17 -102,12 +102,17 @@@ void record_pkey_malloc(void *ptr, lon #include "pkey-x86.h" #elif defined(__powerpc64__) /* arch */ #include "pkey-powerpc.h" +#elif defined(__aarch64__) /* arch */ +#include "pkey-arm64.h" #else /* arch */ #error Architecture not supported #endif /* arch */
+#ifndef PKEY_MASK #define PKEY_MASK (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE) +#endif
+#ifndef set_pkey_bits static inline u64 set_pkey_bits(u64 reg, int pkey, u64 flags) { u32 shift = pkey_bit_position(pkey); @@@ -111,9 -117,7 +122,9 @@@ reg |= (flags & PKEY_MASK) << shift; return reg; } +#endif
+#ifndef get_pkey_bits static inline u64 get_pkey_bits(u64 reg, int pkey) { u32 shift = pkey_bit_position(pkey); @@@ -123,7 -127,6 +134,7 @@@ */ return ((reg >> shift) & PKEY_MASK); } +#endif
extern u64 shadow_pkey_reg;
diff --combined tools/testing/selftests/mm/protection_keys.c index 0789981b72b95,cc6de1644360a..4990f7ab4cb72 --- a/tools/testing/selftests/mm/protection_keys.c +++ b/tools/testing/selftests/mm/protection_keys.c @@@ -147,7 -147,7 +147,7 @@@ void abort_hooks(void * will then fault, which makes sure that the fault code handles * execute-only memory properly. */ -#ifdef __powerpc64__ +#if defined(__powerpc64__) || defined(__aarch64__) /* This way, both 4K and 64K alignment are maintained */ __attribute__((__aligned__(65536))) #else @@@ -212,6 -212,7 +212,6 @@@ void pkey_disable_set(int pkey, int fla unsigned long syscall_flags = 0; int ret; int pkey_rights; - u64 orig_pkey_reg = read_pkey_reg();
dprintf1("START->%s(%d, 0x%x)\n", __func__, pkey, flags); @@@ -241,6 -242,8 +241,6 @@@
dprintf1("%s(%d) pkey_reg: 0x%016llx\n", __func__, pkey, read_pkey_reg()); - if (flags) - pkey_assert(read_pkey_reg() >= orig_pkey_reg); dprintf1("END<---%s(%d, 0x%x)\n", __func__, pkey, flags); } @@@ -250,6 -253,7 +250,6 @@@ void pkey_disable_clear(int pkey, int f unsigned long syscall_flags = 0; int ret; int pkey_rights = hw_pkey_get(pkey, syscall_flags); - u64 orig_pkey_reg = read_pkey_reg();
pkey_assert(flags & (PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE));
@@@ -269,6 -273,8 +269,6 @@@
dprintf1("%s(%d) pkey_reg: 0x%016llx\n", __func__, pkey, read_pkey_reg()); - if (flags) - assert(read_pkey_reg() <= orig_pkey_reg); }
void pkey_write_allow(int pkey) @@@ -308,9 -314,7 +308,9 @@@ void signal_handler(int signum, siginfo ucontext_t *uctxt = vucontext; int trapno; unsigned long ip; +#ifdef MCONTEXT_FPREGS char *fpregs; +#endif #if defined(__i386__) || defined(__x86_64__) /* arch */ u32 *pkey_reg_ptr; int pkey_reg_offset; @@@ -324,11 -328,9 +324,11 @@@ __func__, __LINE__, __read_pkey_reg(), shadow_pkey_reg);
- trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO]; - ip = uctxt->uc_mcontext.gregs[REG_IP_IDX]; + trapno = MCONTEXT_TRAPNO(uctxt->uc_mcontext); + ip = MCONTEXT_IP(uctxt->uc_mcontext); +#ifdef MCONTEXT_FPREGS fpregs = (char *) uctxt->uc_mcontext.fpregs; +#endif
dprintf2("%s() trapno: %d ip: 0x%016lx info->si_code: %s/%d\n", __func__, trapno, ip, si_code_str(si->si_code), @@@ -357,9 -359,7 +357,9 @@@ #endif /* arch */
dprintf1("siginfo: %p\n", si); +#ifdef MCONTEXT_FPREGS dprintf1(" fpregs: %p\n", fpregs); +#endif
if ((si->si_code == SEGV_MAPERR) || (si->si_code == SEGV_ACCERR) || @@@ -389,8 -389,6 +389,8 @@@ #elif defined(__powerpc64__) /* arch */ /* restore access and let the faulting instruction continue */ pkey_access_allow(siginfo_pkey); +#elif defined(__aarch64__) + aarch64_write_signal_pkey(uctxt, PKEY_ALLOW_ALL); #endif /* arch */ pkey_faults++; dprintf1("<<<<==================================================\n"); @@@ -904,9 -902,7 +904,9 @@@ void expected_pkey_fault(int pkey * test program continue. We now have to restore it. */ if (__read_pkey_reg() != 0) -#else /* arch */ +#elif defined(__aarch64__) + if (__read_pkey_reg() != PKEY_ALLOW_ALL) +#else if (__read_pkey_reg() != shadow_pkey_reg) #endif /* arch */ pkey_assert(0); @@@ -954,16 -950,6 +954,6 @@@ void close_test_fds(void nr_test_fds = 0; }
- #define barrier() __asm__ __volatile__("": : :"memory") - __attribute__((noinline)) int read_ptr(int *ptr) - { - /* - * Keep GCC from optimizing this away somehow - */ - barrier(); - return *ptr; - } - void test_pkey_alloc_free_attach_pkey0(int *ptr, u16 pkey) { int i, err; @@@ -1496,11 -1482,6 +1486,11 @@@ void test_executing_on_unreadable_memor lots_o_noops_around_write(&scratch); do_not_expect_pkey_fault("executing on PROT_EXEC memory"); expect_fault_on_read_execonly_key(p1, pkey); + + // Reset back to PROT_EXEC | PROT_READ for architectures that support + // non-PKEY execute-only permissions. + ret = mprotect_pkey(p1, PAGE_SIZE, PROT_EXEC | PROT_READ, (u64)pkey); + pkey_assert(!ret); }
void test_implicit_mprotect_exec_only_memory(int *ptr, u16 pkey) @@@ -1674,84 -1655,6 +1664,84 @@@ void test_ptrace_modifies_pkru(int *ptr } #endif
+#if defined(__aarch64__)
+void test_ptrace_modifies_pkru(int *ptr, u16 pkey)
+{
+	pid_t child;
+	int status, ret;
+	struct iovec iov;
+	u64 trace_pkey;
+	/* Just a random pkey value.. */
+	u64 new_pkey = (POE_X << PKEY_BITS_PER_PKEY * 2) |
+		       (POE_NONE << PKEY_BITS_PER_PKEY) |
+		       POE_RWX;
+
+	child = fork();
+	pkey_assert(child >= 0);
+	dprintf3("[%d] fork() ret: %d\n", getpid(), child);
+	if (!child) {
+		ptrace(PTRACE_TRACEME, 0, 0, 0);
+
+		/* Stop and allow the tracer to modify PKRU directly */
+		raise(SIGSTOP);
+
+		/*
+		 * need __read_pkey_reg() version so we do not do shadow_pkey_reg
+		 * checking
+		 */
+		if (__read_pkey_reg() != new_pkey)
+			exit(1);
+
+		raise(SIGSTOP);
+
+		exit(0);
+	}
+
+	pkey_assert(child == waitpid(child, &status, 0));
+	dprintf3("[%d] waitpid(%d) status: %x\n", getpid(), child, status);
+	pkey_assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);
+
+	iov.iov_base = &trace_pkey;
+	iov.iov_len = 8;
+	ret = ptrace(PTRACE_GETREGSET, child, (void *)NT_ARM_POE, &iov);
+	pkey_assert(ret == 0);
+	pkey_assert(trace_pkey == read_pkey_reg());
+
+	trace_pkey = new_pkey;
+
+	ret = ptrace(PTRACE_SETREGSET, child, (void *)NT_ARM_POE, &iov);
+	pkey_assert(ret == 0);
+
+	/* Test that the modification is visible in ptrace before any execution */
+	memset(&trace_pkey, 0, sizeof(trace_pkey));
+	ret = ptrace(PTRACE_GETREGSET, child, (void *)NT_ARM_POE, &iov);
+	pkey_assert(ret == 0);
+	pkey_assert(trace_pkey == new_pkey);
+
+	/* Execute the tracee */
+	ret = ptrace(PTRACE_CONT, child, 0, 0);
+	pkey_assert(ret == 0);
+
+	/* Test that the tracee saw the PKRU value change */
+	pkey_assert(child == waitpid(child, &status, 0));
+	dprintf3("[%d] waitpid(%d) status: %x\n", getpid(), child, status);
+	pkey_assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);
+
+	/* Test that the modification is visible in ptrace after execution */
+	memset(&trace_pkey, 0, sizeof(trace_pkey));
+	ret = ptrace(PTRACE_GETREGSET, child, (void *)NT_ARM_POE, &iov);
+	pkey_assert(ret == 0);
+	pkey_assert(trace_pkey == new_pkey);
+
+	ret = ptrace(PTRACE_CONT, child, 0, 0);
+	pkey_assert(ret == 0);
+	pkey_assert(child == waitpid(child, &status, 0));
+	dprintf3("[%d] waitpid(%d) status: %x\n", getpid(), child, status);
+	pkey_assert(WIFEXITED(status));
+	pkey_assert(WEXITSTATUS(status) == 0);
+}
+#endif
+
 void test_mprotect_pkey_on_unsupported_cpu(int *ptr, u16 pkey)
 {
 	int size = PAGE_SIZE;
@@@ -1787,7 -1690,7 +1777,7 @@@ void (*pkey_tests[])(int *ptr, u16 pkey
 	test_pkey_syscalls_bad_args,
 	test_pkey_alloc_exhaust,
 	test_pkey_alloc_free_attach_pkey0,
-#if defined(__i386__) || defined(__x86_64__)
+#if defined(__i386__) || defined(__x86_64__) || defined(__aarch64__)
 	test_ptrace_modifies_pkru,
 #endif
 };
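Background note on the new aarch64 test (not part of the patch): PTRACE_GETREGSET and PTRACE_SETREGSET take a struct iovec describing the tracer's buffer, the regset type (NT_ARM_POE above) is passed where the address argument normally goes, and on a successful get the kernel updates iov_len to the number of bytes actually transferred. The following is a minimal, generic sketch of that calling pattern using the long-standing NT_PRSTATUS regset as a substitute, with error handling omitted.

/*
 * Sketch: the GETREGSET calling pattern used by the test above,
 * demonstrated with the generic NT_PRSTATUS regset.
 */
#include <elf.h>		/* NT_PRSTATUS */
#include <signal.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/uio.h>		/* struct iovec */
#include <sys/user.h>		/* struct user_regs_struct */
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	struct user_regs_struct regs;
	struct iovec iov = { .iov_base = &regs, .iov_len = sizeof(regs) };
	int status;
	pid_t child = fork();

	if (child == 0) {
		ptrace(PTRACE_TRACEME, 0, 0, 0);
		raise(SIGSTOP);		/* let the parent inspect us */
		_exit(0);
	}

	waitpid(child, &status, 0);	/* child is now stopped */
	/* The regset id goes where the "addr" argument normally goes. */
	if (ptrace(PTRACE_GETREGSET, child, (void *)NT_PRSTATUS, &iov) == 0)
		printf("fetched %zu bytes of NT_PRSTATUS\n", iov.iov_len);

	ptrace(PTRACE_CONT, child, 0, 0);
	waitpid(child, &status, 0);
	return 0;
}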
linux-merge@lists.open-mesh.org