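Use a window partitioned by the grouping columns, order each partition by a random number, and let that random ordering do the shuffle for you. Numbering the rows within each partition and filtering on that number then yields a random sample per group.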

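from pyspark.sql import Window
from pyspark.sql import functions as F

# Rank the rows of each (level0, level1, level2) group in random order.
window = (
    Window.partitionBy("level0", "level1", "level2")
    .orderBy(F.col("random_number").desc())
)

(
    df
    # One uniform random number per row; pass a seed to F.rand() for repeatable draws.
    .withColumn("random_number", F.rand())
    .withColumn("row_id", F.row_number().over(window))
    # row_number() starts at 1, so "< 10" keeps at most 9 rows per group.
    .where(F.col("row_id") < 10)
    .orderBy("level0", "level1", "level2")
    .show()
)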
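
To see what the window is doing without a Spark session, here is a minimal pure-Python sketch of the same idea: give each row a random number, sort each group by it in descending order, number the rows from 1, and keep the first few. The helper name `sample_per_group` and its parameters `keys`, `max_rows` and `seed` are hypothetical and only for illustration; they are not part of PySpark.

```python
import random
from collections import defaultdict


def sample_per_group(rows, keys, max_rows, seed=None):
    """Keep up to `max_rows` randomly chosen rows from each group of `rows`."""
    rng = random.Random(seed)

    # Equivalent of partitionBy: bucket rows by their key columns,
    # attaching one random number to every row.
    groups = defaultdict(list)
    for row in rows:
        group_key = tuple(row[k] for k in keys)
        groups[group_key].append((rng.random(), row))

    sampled = []
    for group_key in sorted(groups):
        # Equivalent of orderBy(random_number.desc()) inside the window.
        ranked = sorted(groups[group_key], key=lambda pair: pair[0], reverse=True)
        # Equivalent of row_number(): ids start at 1 within each group.
        for row_id, (_, row) in enumerate(ranked, start=1):
            if row_id <= max_rows:
                sampled.append(row)
    return sampled


rows = [{"level0": g, "value": i} for g in ("a", "b") for i in range(5)]
result = sample_per_group(rows, keys=["level0"], max_rows=2, seed=0)
```

Because every row in a group gets an independent random number, each row is equally likely to land in the first `max_rows` positions, which is why the filter on the row number amounts to sampling without replacement within the group.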


Let's talk!

I'm Carlo Nicolini. I am interested in the reliability of AI reasoning systems (interpretability, inference-time methods, probabilistic language programming) and in quantitative portfolio optimization (I am a maintainer of skfolio). If you're working on something in these areas and think we might collaborate or just chat, I'm happy to talk about it!

The best way to reach me is via DM on LinkedIn.