FireSim
This project aims to render fire in real time using physics-based methods. All particle computation is done on the GPU. Rendering is also done on the GPU through a graphics API (potentially OpenGL).
Link to Poster
https://docs.google.com/presentation/d/1Esl31lVrWi-zJ_zSCG8Znf92NNgdNU2k_t7QZnSOuoA/edit#slide=id.g4a050b83a1_0_0
Fire!

Summary
This project aims to render fire in real time by solving the Navier-Stokes equations. We first implemented particle state simulation and rendering on the CPU using Processing and Java, then switched to Direct3D to run both simulation and rendering on the GPU via compute shaders.
Background
Simulation of Navier-Stokes
The Navier-Stokes equations are widely used for simulating fluids. The equations specify the relationship between the mass and momentum of a fluid, and can be used to simulate phenomena including water, clouds, smoke, and fire. Our project simulates fire.
If we define $\mathbf{u}$ as the velocity field, $p$ as the pressure field, and $\rho$ as the density field, the Navier-Stokes equations for incompressible flow can be expressed as
$$\frac{\partial \mathbf{u}}{\partial t} = -(\mathbf{u} \cdot \nabla)\mathbf{u} - \frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u} + \mathbf{F}, \qquad \nabla \cdot \mathbf{u} = 0,$$
where $\nu$ is the kinematic viscosity and $\mathbf{F}$ is the external force.
A common algorithm for simulating Navier-Stokes is the "stable fluids" method proposed by Stam [Stam 1999]. The simulation divides the space being rendered into a 3D array of cubical cells, each representing the state of the fluid particle at a particular location. Each grid cell stores physics state such as temperature, pressure, and velocity. Each simulation time step is broken down into four operations: Advect, Add Forces, Diffuse, and Project. The four operations are applied in turn to each particle in the grid. Cells at the boundary must be handled specially.
To obtain a realistic simulation of fluids, a large number of particles and a small time step are desired. However, even if we simulate each dimension with only 100 particles, there are $100^3$ particles in total that we need to keep track of.
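To make the scale concrete, the following minimal C++ sketch counts cells and estimates the storage for one state field. The names `cellCount` and `fieldBytes` are hypothetical, and the four-floats-per-cell layout in the usage note is an assumption for illustration, not something the project states.

```cpp
#include <cstdint>

// Number of cells in an n x n x n grid.
constexpr std::uint64_t cellCount(std::uint64_t n) {
    return n * n * n;
}

// Bytes needed for one state field, assuming (hypothetically) that every
// cell stores `floatsPerCell` 32-bit floats.
constexpr std::uint64_t fieldBytes(std::uint64_t n, std::uint64_t floatsPerCell) {
    return cellCount(n) * floatsPerCell * sizeof(float);
}
```

For example, `cellCount(100)` is one million cells, and a field with four floats per cell at that resolution would take `fieldBytes(100, 4)` bytes, about 16 MB on platforms with 4-byte floats.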
Fortunately, each grid cell is updated using the same scheme (except cells at the boundary) and only references nearby cells (if any). This highly parallel, coherent nature of the simulation algorithm makes the GPU a good candidate for this application.
There are some algorithmic constraints on how much the computation can be optimized. For example, each of the four operations (advection, force addition, diffusion, and projection) depends on the output of the previous stage, so the stages cannot run in parallel with each other. Each time step also depends on the previous time step.
Environment
We used two laptops running Windows 10 for our development. The first laptop features a 6-core i7 CPU with an NVIDIA GeForce GTX 1070 GPU. The second laptop features a 4-core i7 CPU with an NVIDIA GTX 970M GPU.
Approach
We use the Direct3D API and C++ for our CPU implementation and HLSL (High-Level Shading Language) for our GPU shader programs. We are targeting an x64 machine with an NVIDIA GeForce GPU.
References
Our implementation draws on a few sources listed in the References section.
CPU
We simulate fire by dividing the space into $n \times n \times n$ cubical cells, where $n$ is chosen to be 200 in our implementation. Each cell is simulated using one particle. Each state associated with the particles is stored as an array of $n$ 2D textures of size $n \times n$. Specifically, the states we simulate are direction of velocity, speed, divergence, and pressure. In each time step, the values in the textures are updated in four stages according to the Navier-Stokes equations.
The program first initializes all resources (constant buffers, textures) via Direct3D API calls, then enters the main loop. In the main loop, particle states are first updated on the GPU, then the results are rendered and presented.
GPU Shaders
All shaders are implemented as compute shaders, since our application doesn't require the standard graphics pipeline. There are five shaders (advect, addForce, divergence, Jacobi, project) for the Navier-Stokes simulation, corresponding to the four stages of simulation, and one shader used for rendering.
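As a rough illustration of how these passes might be sequenced in one time step, here is a minimal C++ sketch. It assumes the passes run in the order the shaders are listed and that the Jacobi pass repeats 10 times per step, as described in the Diffusion section. The function `passesPerTimeStep` is hypothetical and only builds a list of pass names; it does not call Direct3D.

```cpp
#include <string>
#include <vector>

// Names of the simulation passes for one time step, in an assumed dispatch
// order. The default of 10 Jacobi iterations follows the text.
std::vector<std::string> passesPerTimeStep(int jacobiIterations = 10) {
    std::vector<std::string> passes = {"advect", "addForce", "divergence"};
    for (int k = 0; k < jacobiIterations; ++k) {
        passes.push_back("Jacobi");  // one pressure-solve sweep per entry
    }
    passes.push_back("project");
    return passes;
}
```

Because each pass reads the output of the one before it, the passes within a step are issued one after another; the parallelism lies inside each pass.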

Advection
Advection simulates the process of the fluid transporting itself in a field. This is simulated by first tracing each particle backward along its velocity to find where it came from, then updating its velocity with the quantity sampled at that position. This compute shader is executed in groups of $16 \times 4 \times 4$ threads. The code snippet for the advect shader is shown below.
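// Trace backward along the velocity to find where this cell's fluid came from.
float3 newPos = i - velocity[i];
// Convert the cell position to normalized texture coordinates (texel centers).
newPos = (newPos + 0.5) / dim;
// Sample the old velocity there with linear filtering.
velocityRW[i] = velocity.SampleLevel(samLinear, newPos, 0);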
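The same backward-tracing idea can be sketched on the CPU in one dimension. This is a simplified C++ reference, not the project's code: the helper names are hypothetical, the time step is folded into the velocity as in the snippet above, and clamping at the grid edges is an assumption about the sampler's address mode.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Linearly interpolate `field` at continuous cell coordinate x, where cell
// centers sit at integer x. Coordinates are clamped to the grid (assumed).
float sampleLinear(const std::vector<float>& field, float x) {
    const float maxX = static_cast<float>(field.size() - 1);
    x = std::max(0.0f, std::min(x, maxX));
    const std::size_t i0 = static_cast<std::size_t>(std::floor(x));
    const std::size_t i1 = std::min(i0 + 1, field.size() - 1);
    const float t = x - static_cast<float>(i0);
    return (1.0f - t) * field[i0] + t * field[i1];
}

// One advection step of a 1D velocity field: each cell looks up the velocity
// at the position its fluid came from.
std::vector<float> advect(const std::vector<float>& velocity) {
    std::vector<float> out(velocity.size());
    for (std::size_t i = 0; i < velocity.size(); ++i) {
        const float source = static_cast<float>(i) - velocity[i];  // trace backward
        out[i] = sampleLinear(velocity, source);
    }
    return out;
}
```

Note that the result is written to a separate array, just as the shader reads `velocity` and writes `velocityRW`, so every cell sees the old field.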
AddForce
The add-force stage accounts for external forces that the environment exerts on the system. In our system, the user applies an impulse with the mouse, and particles around the mouse take this force into account.
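The shader for this stage is not shown, so the following C++ sketch is only one plausible way to apply such an impulse. The Gaussian falloff, the `radius` parameter, and all names here are assumptions for illustration.

```cpp
#include <cmath>

struct Vec3 {
    float x, y, z;
};

// Velocity of one cell after an impulse centered at `center` is applied.
// The weight falls off with distance from the center (assumed Gaussian).
Vec3 addForce(Vec3 velocity, Vec3 cellPos, Vec3 center, Vec3 impulse, float radius) {
    const float dx = cellPos.x - center.x;
    const float dy = cellPos.y - center.y;
    const float dz = cellPos.z - center.z;
    const float dist2 = dx * dx + dy * dy + dz * dz;
    const float w = std::exp(-dist2 / (radius * radius));
    return {velocity.x + w * impulse.x,
            velocity.y + w * impulse.y,
            velocity.z + w * impulse.z};
}
```

A cell exactly at the impulse center receives the full impulse, while cells several radii away are left almost unchanged.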
Diffusion
Because viscous fluids resist flow, velocity diffuses as the fluid moves. The viscous equation, written in discretized form, is a Poisson equation for velocity. Jacobi iteration is an iterative technique for solving Poisson equations. It must be executed many times to converge, which can be done cheaply on the GPU. In our implementation, we run 10 Jacobi iterations in each time step. The Jacobi technique requires calculating the divergence of velocity, then using divergence and velocity to calculate the new pressure value.
The code snippet for the divergence shader is presented below. As a micro-optimization, the for loop is unrolled, which allows the system to determine the memory access pattern in advance.
float pL, pR, pF, pB, pU, pD;
float divergence[4];
// Each thread handles four cells that are adjacent along x.
[unroll]
for (int j = 0; j < 4; j++) {
    uint3 i4 = uint3(4 * i.x + j, i.y, i.z);
    // Velocity component of the two neighbors along each axis.
    pL = velocity[i4 + uint3(-1, 0, 0)].x;
    pR = velocity[i4 + uint3(1, 0, 0)].x;
    pF = velocity[i4 + uint3(0, -1, 0)].y;
    pB = velocity[i4 + uint3(0, 1, 0)].y;
    pU = velocity[i4 + uint3(0, 0, -1)].z;
    pD = velocity[i4 + uint3(0, 0, 1)].z;
    // Central difference along x, y, and z.
    divergence[j] = (pR - pL + pB - pF + pD - pU) / 2;
}
// Pack the four results into one float4 texel.
divergenceRW[i] = (float4)divergence;
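For comparison, here is a small CPU reference of the same central-difference divergence in C++, one cell at a time and without the four-cell packing. The `VelocityGrid` type is hypothetical, and treating reads outside the grid as zero velocity is an assumption of this sketch.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Dense n x n x n velocity grid stored in x-major order.
struct VelocityGrid {
    int n;
    std::vector<Vec3> cells;

    explicit VelocityGrid(int size)
        : n(size),
          cells(static_cast<std::size_t>(size) * size * size, Vec3{0.0f, 0.0f, 0.0f}) {}

    Vec3& ref(int x, int y, int z) {
        return cells[(static_cast<std::size_t>(z) * n + y) * n + x];
    }

    // Reads outside the grid yield zero velocity (assumption of this sketch).
    Vec3 at(int x, int y, int z) const {
        if (x < 0 || y < 0 || z < 0 || x >= n || y >= n || z >= n) {
            return Vec3{0.0f, 0.0f, 0.0f};
        }
        return cells[(static_cast<std::size_t>(z) * n + y) * n + x];
    }
};

// Central-difference divergence of the velocity field at one cell.
float divergence(const VelocityGrid& g, int x, int y, int z) {
    return (g.at(x + 1, y, z).x - g.at(x - 1, y, z).x +
            g.at(x, y + 1, z).y - g.at(x, y - 1, z).y +
            g.at(x, y, z + 1).z - g.at(x, y, z - 1).z) / 2.0f;
}
```

A reference like this can be handy for checking shader output on a small grid before scaling up.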
The code snippet for Jacobi iteration is presented below:
// Neighbor texel indices, clamped to the grid.
uint3 cL = uint3(max(i.x - 1, 0), i.y, i.z);
uint3 cR = uint3(min(i.x + 1, dim.x - 1), i.y, i.z);
uint3 cD = uint3(i.x, max(i.y - 1, 0), i.z);
uint3 cU = uint3(i.x, min(i.y + 1, dim.y - 1), i.z);
uint3 cF = uint3(i.x, i.y, max(i.z - 1, 0));
uint3 cB = uint3(i.x, i.y, min(i.z + 1, dim.z - 1));
// Each texel packs four x-adjacent cells, so the left and right neighbors
// are the same values shifted by one channel.
float4 pL = float4(pressure[cL].w, pressure[i + uint3(0, 0, 0)].xyz);
float4 pR = float4(pressure[i + uint3(0, 0, 0)].yzw, pressure[cR].x);
float4 pD = pressure[cD];
float4 pU = pressure[cU];
float4 pF = pressure[cF];
float4 pB = pressure[cB];
// Jacobi update: average of the six neighbors minus the divergence term.
pressureRW[i] = (pR + pL + pF + pB + pD + pU - divergence[i]) / 6;
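The update rule above can be sketched on the CPU as follows. This is a simplified C++ reference with one scalar per cell (no four-cell packing). The `ScalarGrid` type and function names are hypothetical, and starting the solve from zero pressure is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Dense n x n x n scalar grid stored in x-major order.
struct ScalarGrid {
    int n;
    std::vector<float> data;

    explicit ScalarGrid(int size)
        : n(size), data(static_cast<std::size_t>(size) * size * size, 0.0f) {}

    float& ref(int x, int y, int z) {
        return data[(static_cast<std::size_t>(z) * n + y) * n + x];
    }

    // Indices are clamped to the grid, like the min/max calls in the shader.
    float at(int x, int y, int z) const {
        x = std::max(0, std::min(x, n - 1));
        y = std::max(0, std::min(y, n - 1));
        z = std::max(0, std::min(z, n - 1));
        return data[(static_cast<std::size_t>(z) * n + y) * n + x];
    }
};

// One Jacobi sweep: reads the old pressure, writes a new grid.
ScalarGrid jacobiSweep(const ScalarGrid& p, const ScalarGrid& div) {
    ScalarGrid out(p.n);
    for (int z = 0; z < p.n; ++z)
        for (int y = 0; y < p.n; ++y)
            for (int x = 0; x < p.n; ++x)
                out.ref(x, y, z) =
                    (p.at(x - 1, y, z) + p.at(x + 1, y, z) +
                     p.at(x, y - 1, z) + p.at(x, y + 1, z) +
                     p.at(x, y, z - 1) + p.at(x, y, z + 1) -
                     div.at(x, y, z)) / 6.0f;
    return out;
}

// Repeat the sweep a fixed number of times, starting from zero pressure.
ScalarGrid solvePressure(const ScalarGrid& div, int iterations) {
    ScalarGrid p(div.n);
    for (int k = 0; k < iterations; ++k) {
        p = jacobiSweep(p, div);
    }
    return p;
}
```

Each sweep reads only the previous iterate, which is why every cell can be updated independently within a sweep while the sweeps themselves must run in sequence.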
Rendering
We render each particle by its speed, computed as the L2 norm of its velocity. Speed is linearly interpolated between two colors, representing fire of low
Projection
The projection step projects the divergent velocity field onto its divergence-free component. This gives us the final updated velocity for each particle. Projection is computationally similar to calculating divergence.
// Pressure at the six neighbors, with indices clamped to the grid.
// Each texel packs four x-adjacent cells, so pL and pR are shifted by one channel.
uint3 cL = uint3(max(i.x - 1, 0), i.y, i.z);
float4 pL = float4(pressure[cL].w, pressure[i + uint3(0, 0, 0)].xyz);
uint3 cR = uint3(min(i.x + 1, dim.x - 1), i.y, i.z);
float4 pR = float4(pressure[i + uint3(0, 0, 0)].yzw, pressure[cR].x);
uint3 cB = uint3(i.x, max(i.y - 1, 0), i.z);
float4 pB = pressure[cB];
uint3 cF = uint3(i.x, min(i.y + 1, dim.y - 1), i.z);
float4 pF = pressure[cF];
uint3 cD = uint3(i.x, i.y, max(i.z - 1, 0));
float4 pD = pressure[cD];
uint3 cU = uint3(i.x, i.y, min(i.z + 1, dim.z - 1));
float4 pU = pressure[cU];
// Pressure differences along x, y, and z.
pR -= pL;
pF -= pB;
pU -= pD;
// this is done for x, y, z, w channel ..
float4 s;
i.x *= 4;
bool borderyz = any(i.yz == 0 | i.yz == dim.yz - 1);
// Subtract the pressure gradient from the velocity.
s = velocity[i] - float4(pR.x, pF.x, pU.x, 0) / 2;
// Border cells store the negated velocity.
velocityRW[i] = (i.x == 0 || borderyz) ? -s : s;
speedRW[i] = length(s.xyz);
i.x++;
...
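The per-cell arithmetic of the projection step can be written out in C++ as below. This is a sketch for a single cell under the same conventions as the shader snippet (gradient by central difference, negated velocity on border cells). The `NeighborPressure` struct and function names are hypothetical.

```cpp
#include <cmath>

struct Vec3 {
    float x, y, z;
};

// Pressure at the six neighbors of one cell, named after the axis they lie
// on: left/right along x, back/front along y, down/up along z.
struct NeighborPressure {
    float left, right, back, front, down, up;
};

// Subtract the central-difference pressure gradient from the velocity.
// Border cells get the negated result, as in the shader snippet.
Vec3 projectCell(Vec3 v, const NeighborPressure& p, bool onBorder) {
    const Vec3 s = {v.x - (p.right - p.left) / 2.0f,
                    v.y - (p.front - p.back) / 2.0f,
                    v.z - (p.up - p.down) / 2.0f};
    if (onBorder) {
        return Vec3{-s.x, -s.y, -s.z};
    }
    return s;
}

// Speed used later for rendering: the L2 norm of the velocity.
float speed(Vec3 v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}
```

For instance, a velocity of (1, 2, 3) with pressure differences of 2, 4, and 6 across the cell projects to zero, since half of each difference exactly cancels the matching component.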
Optimization
Overview
Our project has two main parts to parallelize. The first is the calculation of each particle's state in each time step. The second is the rendering of the 2D image to be displayed. For our parallel implementation we used an NVIDIA GeForce GTX 970, a GPU with 13 multiprocessors and a maximum of 1024 threads per block. To parallelize the particle state update, we divide the box in which the fire burns into smaller blocks of $16 \times 8 \times 8$ cells and map the computation inside each of these blocks to a GPU block. Computation inside each GPU block is further parallelized across the $16 \times 8 \times 8$ threads we assign to it. We picked this thread count at first because it matches the maximum number of threads per block on the GTX 970. With many threads per block, the GPU can more easily hide latency by switching to another warp when one stalls on a memory access. To make sure we had the best configuration for our GPU, we experimented with several other configurations; this one gave the highest frames per second for our simulation, so we kept it.
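To see what this block size implies for the $n = 200$ grid, here is a minimal C++ sketch of the group-count arithmetic. It assumes one thread per cell and ignores any packing of several cells into one texel, so the project's real dispatch dimensions may differ. The names are hypothetical.

```cpp
#include <cstdint>

// Number of thread groups needed to cover `cells` cells along one axis,
// rounding up so a partially filled group still covers the remainder.
constexpr std::uint32_t groupsAlongAxis(std::uint32_t cells, std::uint32_t threadsPerGroup) {
    return (cells + threadsPerGroup - 1) / threadsPerGroup;
}

struct GroupGrid {
    std::uint32_t x, y, z;
};

// Group counts for an n x n x n grid with 16 x 8 x 8 threads per group.
constexpr GroupGrid groupsForGrid(std::uint32_t n) {
    return GroupGrid{groupsAlongAxis(n, 16), groupsAlongAxis(n, 8), groupsAlongAxis(n, 8)};
}
```

Under these assumptions, `groupsForGrid(200)` gives $13 \times 25 \times 25$ groups of 1024 threads each. Since 200 is not a multiple of 16, the last group along x is only partly used, so a shader dispatched this way would need a bounds check.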
To parallelize the rendering of the two-dimensional display image, we divide the image into tiles of $16 \times 16$ pixels and assign each tile to a GPU block, in which each pixel of the tile is rendered by one thread. Thus, we have $16 \times 16$ threads to render all pixels in a tile. We arrived at this tile configuration through a process similar to the one described above. We started by assigning tiles of $32 \times 32$ to each GPU blo