TransformAccessArrayDemo
What's the fastest way to move a lot of transforms in Unity?
First we use a naive Transform.SetPositionAndRotation plus some raycasts to find each new position. That takes around 32 msec per Update: 22.4 msec to move 40k objects and 9.6 msec to cast 20k rays.
Then we do the same work (40k rotations/movements + 20k raycasts) in 0.120 msec on the main thread (2.5 msec in background jobs).

DecalMovement
This project demonstrates a few ways to implement moving 3D objects (casters) that project a decal (see Decal Documentation in URP) onto the ground below them.
Naive
A naive implementation could involve having an agent do the following for each Update:
- Move and apply a new position/rotation using Transform.SetPositionAndRotation. Code lives here
- Perform a Physics.RaycastNonAlloc from the new position to find where the decal must be positioned. Code lives here
- If there is a hit, position the decal with the same Transform.SetPositionAndRotation. Code lives here
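The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual agent: the class name `NaiveAgent`, its fields, the movement formula and the decal orientation are all assumptions made for the example.

```csharp
using UnityEngine;

// Hypothetical naive agent: one MonoBehaviour per caster/decal pair.
public class NaiveAgent : MonoBehaviour
{
    public Transform decal;               // assumed: a separate object, assigned in the Inspector
    public float speed = 2f;              // assumed movement parameters
    public float turnSpeed = 45f;
    public float maxRayDistance = 100f;
    public LayerMask groundMask = ~0;

    // Reused buffer so that RaycastNonAlloc allocates nothing per frame.
    private readonly RaycastHit[] hits = new RaycastHit[1];

    void Update()
    {
        // 1. Compute and apply the new position/rotation in a single call.
        Vector3 position = transform.position + transform.forward * (speed * Time.deltaTime);
        Quaternion rotation = transform.rotation * Quaternion.Euler(0f, turnSpeed * Time.deltaTime, 0f);
        transform.SetPositionAndRotation(position, rotation);

        // 2. Cast a ray straight down from the new position.
        //    With a buffer of one element, whichever hit is reported is accepted.
        int hitCount = Physics.RaycastNonAlloc(position, Vector3.down, hits, maxRayDistance, groundMask);

        // 3. On a hit, place the decal at the hit point, facing into the surface
        //    (assumed orientation convention for the decal).
        if (hitCount > 0)
        {
            decal.SetPositionAndRotation(hits[0].point, Quaternion.LookRotation(-hits[0].normal));
        }
    }
}
```

With 20k such agents, every call here runs on the main thread, which is what the timings below measure.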
To control how many agents exist and how they are spawned, we have a manager that uses an ObjectPool of agents.
This way we're not calling Instantiate/Destroy: objects are just disabled on destroy and enabled on create. The expensive call to Instantiate is made only when the pool is empty.
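A pooling manager along these lines might look like the sketch below. It assumes Unity's built-in `UnityEngine.Pool.ObjectPool<T>` (available from Unity 2021.1); the project may well use its own pool, and `AgentPoolManager`, `agentPrefab`, `Spawn` and `Despawn` are hypothetical names.

```csharp
using UnityEngine;
using UnityEngine.Pool;

// Hypothetical manager that recycles agents instead of destroying them.
public class AgentPoolManager : MonoBehaviour
{
    public GameObject agentPrefab;        // assumed: prefab assigned in the Inspector

    private ObjectPool<GameObject> pool;

    void Awake()
    {
        pool = new ObjectPool<GameObject>(
            createFunc: () => Instantiate(agentPrefab),        // only runs when the pool is empty
            actionOnGet: agent => agent.SetActive(true),       // "create" = enable
            actionOnRelease: agent => agent.SetActive(false),  // "destroy" = disable
            actionOnDestroy: agent => Destroy(agent));         // pool overflow or disposal
    }

    public GameObject Spawn() => pool.Get();

    public void Despawn(GameObject agent) => pool.Release(agent);
}
```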

32 msec to move 20k casters + 20k decals… Raycasts are 9.6 msec, movement code is 22.4 msec. But is it possible to make it faster?
TransformAccessArray - what's that?
Each hierarchy in the root is backed by a special object called TransformHierarchy that holds an array of transforms. You can even control its capacity via Transform.hierarchyCapacity; that doc also has a few technical details.
TransformAccess identifies a single transform: basically a TransformHierarchy pointer plus an index into it.
And TransformAccessArray is an array of those TransformAccess objects that is ready to be processed in a multithreaded way.
❗ Yes, that's right - we can modify GameObject's Transforms via jobs. And that's insanely fast!
The hierarchy is critical: it controls how jobs can be scheduled. Only one thread can modify one TransformHierarchy. Read-only jobs don't have this limitation, see Additional note on ReadOnly transform jobs.
Also take a look at https://www.youtube.com/watch?v=W45-fsnPhJY&t=798 from the amazing Ian Dundore (all his talks are great and must-see!).
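To make the idea concrete, here is a minimal sketch of modifying transforms from a job through a TransformAccessArray. `MoveForwardJob`, `TransformJobExample` and the `targets` field are hypothetical names for this illustration; the movement logic is an arbitrary placeholder.

```csharp
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

// Runs once per transform in the array, on worker threads.
public struct MoveForwardJob : IJobParallelForTransform
{
    public float deltaTime;
    public float speed;

    public void Execute(int index, TransformAccess transform)
    {
        // TransformAccess exposes position/rotation without touching the GameObject API.
        transform.position += transform.rotation * Vector3.forward * (speed * deltaTime);
    }
}

public class TransformJobExample : MonoBehaviour
{
    public Transform[] targets;           // assumed: filled in the Inspector

    private TransformAccessArray accessArray;
    private JobHandle handle;

    void OnEnable()
    {
        accessArray = new TransformAccessArray(targets);
    }

    void Update()
    {
        // Make sure the previous frame's job is finished before scheduling a new one.
        handle.Complete();

        var job = new MoveForwardJob { deltaTime = Time.deltaTime, speed = 2f };
        handle = job.Schedule(accessArray);
    }

    void OnDisable()
    {
        handle.Complete();
        accessArray.Dispose();            // native container: must be disposed manually
    }
}
```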
Implementation
So, how?
Let's say we split 'agent' into 'caster' and 'decal'. In that case the logic is different: the manager controls a list of 'casters' and a list of 'decals'.
Also, we need to use jobs (multithreading) and Burst (a special compiler for a C# subset).
To do the raycasts we'd use the multithreaded RaycastCommand.ScheduleBatch.
On every Update, the manager schedules a chain of jobs that depend on each other:
- a job that changes the position of all casters
- a job that prepares the RaycastCommands
- a job that actually performs the raycasts in parallel
- a job that positions the decals at the raycast hit positions
- the job handle of the whole chain is stored, so that its Complete can be called at the beginning of the next Update
Therefore, other than scheduling the jobs, Update takes almost no time on the main thread. In addition, the stages of the chain MoveAgentsJob -> CommandsCreationJob -> RaycastCommand.ScheduleBatch -> SetPositionsJob are each executed in parallel on worker threads.
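The chain described above could be wired together roughly like this. The job names come from the text, but their bodies, the `JobChainManager` class, the batch size of 64 and the "zero normal means no hit" check are assumptions for illustration only. The `RaycastCommand` constructor used here is the older one; newer Unity versions prefer an overload that takes `QueryParameters`.

```csharp
using Unity.Collections;
using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

public struct MoveAgentsJob : IJobParallelForTransform
{
    public float deltaTime;
    [WriteOnly] public NativeArray<Vector3> positions;

    public void Execute(int index, TransformAccess transform)
    {
        Vector3 p = transform.position + transform.rotation * Vector3.forward * deltaTime;
        transform.position = p;
        positions[index] = p;             // hand the new position to the next job
    }
}

public struct CommandsCreationJob : IJobParallelFor
{
    [ReadOnly] public NativeArray<Vector3> positions;
    [WriteOnly] public NativeArray<RaycastCommand> commands;

    public void Execute(int index)
    {
        // One downward ray per caster; 100 units is an assumed maximum distance.
        commands[index] = new RaycastCommand(positions[index], Vector3.down, 100f);
    }
}

public struct SetPositionsJob : IJobParallelForTransform
{
    [ReadOnly] public NativeArray<RaycastHit> hits;

    public void Execute(int index, TransformAccess transform)
    {
        // Assumption: a miss leaves a zeroed RaycastHit, so a zero normal means "no hit".
        if (hits[index].normal != Vector3.zero)
            transform.position = hits[index].point;
    }
}

public class JobChainManager : MonoBehaviour
{
    public Transform[] casters;
    public Transform[] decals;            // assumed: same length as casters

    TransformAccessArray casterAccess, decalAccess;
    NativeArray<Vector3> positions;
    NativeArray<RaycastCommand> commands;
    NativeArray<RaycastHit> hits;
    JobHandle chain;

    void OnEnable()
    {
        int n = casters.Length;
        casterAccess = new TransformAccessArray(casters);
        decalAccess = new TransformAccessArray(decals);
        positions = new NativeArray<Vector3>(n, Allocator.Persistent);
        commands = new NativeArray<RaycastCommand>(n, Allocator.Persistent);
        hits = new NativeArray<RaycastHit>(n, Allocator.Persistent);
    }

    void Update()
    {
        chain.Complete();                 // finish the chain scheduled last frame

        JobHandle move = new MoveAgentsJob { deltaTime = Time.deltaTime, positions = positions }
            .Schedule(casterAccess);
        JobHandle prepare = new CommandsCreationJob { positions = positions, commands = commands }
            .Schedule(casters.Length, 64, move);
        JobHandle raycast = RaycastCommand.ScheduleBatch(commands, hits, 64, prepare);
        chain = new SetPositionsJob { hits = hits }.Schedule(decalAccess, raycast);
    }

    void OnDisable()
    {
        chain.Complete();
        casterAccess.Dispose();
        decalAccess.Dispose();
        positions.Dispose();
        commands.Dispose();
        hits.Dispose();
    }
}
```

Each job takes the previous handle as its dependency, so the main thread only schedules work and returns immediately.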

Just 2.37 msec spent on the main thread! Around 4 msec total. That's much faster.
Let's try enabling Burst:

Unfortunately, this didn't change much. Apparently, the movement code is no longer on our application's critical path, and something else is limiting performance, perhaps rendering. On the other hand, the overhead of invoking Burst-compiled jobs is almost the same as the gain we get from the faster compiled code. Remember, we are competing against il2cpp and not mono!
Ok, let's try to make this even faster.
TransformAccessArray + correctly organized hierarchy
As I said, the hierarchy is critical because it controls how jobs are scheduled.
Only one thread can process one TransformHierarchy. To demonstrate this, let's try the worst possible case, which unfortunately is also a common one.
Wrong hierarchy: all under one parent GameObject
If we create all agents under some parent GameObject like this:
Scene
- Casters
  - caster1
  - caster2
  - caster...
  - caster342
- Decals
  - decal1
  - decal2
  - decal...
  - decal342
We would kill all the performance.

There is just one job that does all the read-write work, because there is only one TransformHierarchy for all casters and one for all decals.
Note: Read-only jobs don't have this limitation, see Additional note on ReadOnly transform jobs.
Better hierarchy: all in the root
Currently we're creating all casters and decals in the root, like:
Scene
- caster1
- caster2
- caster...
- caster342
- decal1
- decal2
- decal...
- decal342

As you can see, the picture is now much better; however, a lot of time is still spent just on scheduling.
To understand why, I had to dig into Unity's source code. But since you don't have access to it, I'll try to explain it here:
Basically, with 40k objects in the root (20k casters, 20k decals) we have 40k TransformHierarchy objects, and each of them causes some internal work. For every TransformHierarchy used when scheduling a transform job (IJobParallelForTransform), the Unity engine marks it as 'potentially changed' by calling DidScheduleTransformJob and adding it to a special list, on the main thread. That's not really expensive for a few TransformHierarchy objects, but for 40k that's 2 msec on my machine!
Optimal hierarchy: root buckets of 256
So, to improve performance we need to reduce the number of TransformHierarchy objects, for instance by creating one TransformHierarchy per 256 objects, like so:
Scene
- parentCaster1
  - caster1
  - caster2
  - caster...
  - caster256
- parentCaster2
  - caster257
  - caster258
  - caster...
  - caster342
- parentDecal1
  - decal1
  - decal2
  - decal...
  - decal256
- parentDecal2
  - decal257
  - decal258
  - decal...
  - decal342
This reduces the number of TransformHierarchy objects from 40k to ~157, which cuts the cost of the internal algorithm that is O(TransformHierarchy count). Take a look:
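Spawning objects into root buckets of 256 could be done along these lines. `BucketSpawner`, its method names and the parent naming scheme are hypothetical; setting `hierarchyCapacity` up front is an optional assumption that merely avoids regrowing the hierarchy while children are added.

```csharp
using UnityEngine;

// Hypothetical helper that groups spawned objects under root parents of 256 children each.
public static class BucketSpawner
{
    public const int BucketSize = 256;

    // Number of root parents needed for `count` objects (ceiling division).
    public static int BucketCount(int count) => (count + BucketSize - 1) / BucketSize;

    public static Transform[] Spawn(GameObject prefab, int count, string name)
    {
        var result = new Transform[count];
        Transform bucket = null;

        for (int i = 0; i < count; i++)
        {
            // Start a new root parent every 256 objects.
            if (i % BucketSize == 0)
            {
                bucket = new GameObject($"parent{name}{i / BucketSize + 1}").transform;
                // Reserve room for the parent plus its children (assumes childless prefabs).
                bucket.hierarchyCapacity = BucketSize + 1;
            }

            result[i] = Object.Instantiate(prefab, bucket).transform;
        }

        return result;
    }
}
```

For 20k objects, `BucketCount` gives 79 parents, so casters plus decals need 158 root hierarchies, in line with the rough ~157 figure above. The returned arrays can be passed directly to the TransformAccessArray constructor.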

We can now see that scheduling takes almost no time on the main thread. Completing all jobs takes ~2.5 msec in total.
That's for 20k objects and 20k decals!

Additional note on ReadOnly transform jobs
There is an addition to Unity 20
