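Use a window partitioned by the grouping columns and let a random number do the shuffling for you: number the rows of each group in random order, then keep only the first rows of each group.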

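import pyspark.sql.functions as F
from pyspark.sql import Window

# Order the rows of each (level0, level1, level2) group by a random number.
random_order = (
	Window.partitionBy("level0", "level1", "level2")
	.orderBy(F.col("random_number").desc())
)

(
	df
	# One uniform random number per row; sorting by it shuffles each group.
	.withColumn("random_number", F.rand())
	# row_number() starts at 1 within each partition.
	.withColumn("row_id", F.row_number().over(random_order))
	# Keep at most 10 rows per group (row_id 1 to 10).
	.where(F.col("row_id") <= 10)
	.orderBy("level0", "level1", "level2")
	.show()
)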
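
To make the logic of the window easier to follow, here is a minimal pure-Python sketch of the same idea, with no Spark involved. It is only an illustration: the function `sample_per_group`, its parameters, and the toy `rows` data are hypothetical names introduced here, not part of PySpark. Each row gets a random number, rows are ranked inside their group by that number in descending order, and only the top `n` are kept.

```python
import random
from collections import defaultdict


def sample_per_group(rows, keys, n, seed=None):
    """Keep at most n randomly chosen rows per group (illustrative helper)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        # Equivalent of partitionBy: bucket rows by their grouping key.
        group_key = tuple(row[k] for k in keys)
        # Equivalent of F.rand(): attach one random number to each row.
        groups[group_key].append((rng.random(), row))

    sampled = []
    for group_key in sorted(groups):
        # Equivalent of orderBy(random_number.desc()) plus row_number() <= n.
        ranked = sorted(groups[group_key], key=lambda pair: pair[0], reverse=True)
        sampled.extend(row for _, row in ranked[:n])
    return sampled


# Toy data (assumed for the example): 4 groups of 25 rows each.
rows = [
    {"level0": a, "level1": b, "level2": "m", "value": i}
    for a in ("x", "y")
    for b in ("p", "q")
    for i in range(25)
]

sample = sample_per_group(rows, ["level0", "level1", "level2"], n=10, seed=0)
print(len(sample))
```

Groups with fewer than `n` rows are returned whole, which matches what the window filter does in Spark.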

I'm Carlo Nicolini. I am interested in the reliability of AI reasoning systems (interpretability, inference-time methods, probabilistic language programming) and in quantitative portfolio optimization (I am a maintainer of skfolio). If you're working on something in these areas and think we might collaborate, chat, or discuss, I'm happy to talk about it!

The best way to reach me is via DM on LinkedIn.

Sampling 10 rows per group in PySpark

Let's talk!
