Profiling and Optimising a Rust application
This story recounts the profiling and optimisation of the Rust application presented here. Two optimisations are implemented: (a) a partial code rewrite and (b) compiler flag tuning. The optimisation work is based on the native Windows build of the Rust application. The effects of the optimisations on WebAssembly execution speed are touched on for reference.
The optimisation is evaluated on the basis of wall-clock time. Whilst it is a noisy metric, it is straightforward to assess and relevant to my objectives.
The views/opinions expressed in this story are my own. This story relates my personal experience and choices, and is provided for information in the hope that it will be useful but without any warranty.
Setup
The Rust Performance Book includes a section dedicated to profiling, which served as my guide throughout this exercise. I work on a laptop running Windows and have used it for the optimisation exercise. Using Windows unfortunately rules out the profiling tools available on Linux, so AMD μProf is the tool employed herein.
The profiling is undertaken using the command line interface, running the vortex ring example provided within the source GitHub repository. The actual command employed is as follows, where [..] denotes release or debug depending on the compilation profile employed:
.\target\[..]\cli.exe --init ..\examples\vortex-ring.json -r
Baseline
The baseline performance is established by running the solver as available at repository commit d7e248, compiled in the release configuration. It reports a run time of circa 3 seconds, with the majority of the time taken in the advect_vortons part of the calculation (some profiling has been baked into the source code). Note that the times listed after the iteration number in the output extract correspond to the simulated physical time, not the execution time.
>.\target\release\cli.exe --init ..\examples\vortex-ring.json -r
Domain min (X/Y/Z): -10, -10, -10
Domain max (X/Y/Z): 10, 10, 10
Number of vortons: 1304
Iteration 1: 0.03s [advect_vortons: 30ms; step_vorticity: 0ms]
Iteration 2: 0.06s [advect_vortons: 30ms; step_vorticity: 0ms]
Iteration 3: 0.09s [advect_vortons: 29ms; step_vorticity: 0ms]
Iteration 4: 0.12s [advect_vortons: 31ms; step_vorticity: 0ms]
Iteration 5: 0.15s [advect_vortons: 54ms; step_vorticity: 1ms]
[...]
Iteration 95: 2.85s [advect_vortons: 29ms; step_vorticity: 0ms]
Iteration 96: 2.88s [advect_vortons: 28ms; step_vorticity: 1ms]
Iteration 97: 2.91s [advect_vortons: 33ms; step_vorticity: 0ms]
Iteration 98: 2.94s [advect_vortons: 63ms; step_vorticity: 0ms]
Iteration 99: 2.97s [advect_vortons: 38ms; step_vorticity: 1ms]
Iteration 100: 3.00s [advect_vortons: 77ms; step_vorticity: 0ms]
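The per-step timings listed in brackets come from instrumentation baked into the source. A minimal sketch of this style of timing, using a hypothetical `timed` helper rather than the solver's actual code, could look as follows:

```rust
use std::time::Instant;

// Time a closure and return its result together with the elapsed milliseconds.
// `timed` is a hypothetical helper for illustration, not the solver's code.
fn timed<T>(f: impl FnOnce() -> T) -> (T, u128) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed().as_millis())
}

fn main() {
    // Stand-in workload for an advect_vortons-style step.
    let (_sum, ms) = timed(|| (0..1_000_000u64).map(|i| (i as f64).sqrt()).sum::<f64>());
    println!("advect_vortons: {}ms", ms);
}
```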
The executable is recompiled in the release configuration with the debug control option set to add line-table debug information, which is achieved by adding the following to the root Cargo.toml. This change has a marginal cost on the execution time, but allows the profiler to identify and report the lines of code where time is spent.
[profile.release]
debug = 1
AMD μProf is applied to the recompiled executable. The profiling shows that, in the release configuration, the majority of the execution time (73%) is spent in one function, velocity_contribution, which calculates the velocity contribution of a given vorton. The image below shows a screenshot of the profiling result. The function itself is not complex, but the inlined functions operating on Point3 and Vector3, such as .cross(&r), are responsible for its time-consuming nature.
This function has a high cost because it is used to accumulate the vorticity-induced velocity contributions from all vortons at every vorton position. In other words, the algorithm scales with the square of the number of vortons. Different approaches exist for calculating this term, but they are beyond the scope of this story.
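To make the quadratic structure concrete, here is a minimal sketch of this accumulation pattern. The `Vorton` struct and the Biot-Savart-style kernel below are simplified stand-ins for illustration, not the solver's actual formula:

```rust
// Sketch of the O(N²) structure of the velocity accumulation; the kernel is a
// simplified Biot-Savart-style stand-in, not the solver's exact formula.
type V3 = [f64; 3];

fn sub(a: V3, b: V3) -> V3 { [a[0] - b[0], a[1] - b[1], a[2] - b[2]] }
fn cross(a: V3, b: V3) -> V3 {
    [a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0]]
}
fn norm(a: V3) -> f64 { (a[0] * a[0] + a[1] * a[1] + a[2] * a[2]).sqrt() }

struct Vorton { position: V3, vorticity: V3 }

// Velocity induced at `at` by one vorton; this is the hot inner kernel.
fn velocity_contribution(at: V3, v: &Vorton) -> V3 {
    let r = sub(at, v.position);
    let d = norm(r).max(1e-6); // regularise the r = 0 self-interaction
    let w = cross(v.vorticity, r);
    [w[0] / (d * d * d), w[1] / (d * d * d), w[2] / (d * d * d)]
}

// Every vorton sums contributions from every vorton: cost grows as N².
fn induced_velocity(at: V3, vortons: &[Vorton]) -> V3 {
    vortons.iter().fold([0.0; 3], |acc, v| {
        let c = velocity_contribution(at, v);
        [acc[0] + c[0], acc[1] + c[1], acc[2] + c[2]]
    })
}

fn main() {
    let vortons = vec![
        Vorton { position: [0.0; 3], vorticity: [0.0, 0.0, 1.0] },
        Vorton { position: [1.0, 0.0, 0.0], vorticity: [0.0, 0.0, 1.0] },
    ];
    println!("{:?}", induced_velocity([0.0; 3], &vortons)); // prints [0.0, -1.0, 0.0]
}
```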
Profiling the Debug configuration
The exercise above is repeated in the debug configuration, with the targeted number of vortons reduced from 1,000 to 500 and the number of iterations reduced from 100 to 10, to allow the calculation to complete in an execution time of circa 19 seconds on my laptop.
.\target\debug\cli.exe --init ..\examples\vortex-ring-500.json --iteration 10 -r
Domain min (X/Y/Z): -10, -10, -10
Domain max (X/Y/Z): 10, 10, 10
Number of vortons: 548
Iteration 1: 0.03s [advect_vortons: 1287ms; step_vorticity: 2ms]
Iteration 2: 0.06s [advect_vortons: 1258ms; step_vorticity: 2ms]
Iteration 3: 0.09s [advect_vortons: 1767ms; step_vorticity: 6ms]
Iteration 4: 0.12s [advect_vortons: 2023ms; step_vorticity: 4ms]
Iteration 5: 0.15s [advect_vortons: 2152ms; step_vorticity: 3ms]
Iteration 6: 0.18s [advect_vortons: 1910ms; step_vorticity: 3ms]
Iteration 7: 0.21s [advect_vortons: 1577ms; step_vorticity: 2ms]
Iteration 8: 0.24s [advect_vortons: 2115ms; step_vorticity: 4ms]
Iteration 9: 0.27s [advect_vortons: 2120ms; step_vorticity: 4ms]
Iteration 10: 0.30s [advect_vortons: 2104ms; step_vorticity: 3ms]
The analysis is repeated in AMD μProf with the results shown below. The profiling shows the computational time spread over a number of functions, i.e. no single function dominates the computational time as in the release configuration. The most expensive function relates to a conversion from an unsigned char to an int64. The function velocity_contribution is quite a way down the list, after functions related to Point3 and Vector3 handled by the nalgebra library.
Optimisation Attempt #1
The first optimisation attempt is based on the assumption that replacing the calculations handled by the nalgebra library with a bespoke implementation could yield some speed benefits.
This attempt involves writing Rust types for Point3 and Vector3 that are specific to this solver. The rewrite is done to provide a drop-in replacement, and is developed using a compiler-led approach: (a) the types Point3&lt;T&gt; and Vector3&lt;T&gt; are declared and the references to nalgebra::Point3 and nalgebra::Vector3 replaced in the Rust source files; and (b) the missing methods are identified using the compiler and implemented.
This optimisation reduces the computational time to circa 2.5 seconds in release mode. The time taken by the advect_vortons step is consistently reported as 23ms to 25ms, as shown below, whereas it varies between 24ms and 70ms in the implementation relying on nalgebra.
>.\target\release\cli.exe --init ..\examples\vortex-ring.json -r
Domain min (X/Y/Z): -10, -10, -10
Domain max (X/Y/Z): 10, 10, 10
Number of vortons: 1304
Iteration 1: 0.03s [advect_vortons: 23ms; step_vorticity: 0ms]
Iteration 2: 0.06s [advect_vortons: 24ms; step_vorticity: 1ms]
Iteration 3: 0.09s [advect_vortons: 24ms; step_vorticity: 0ms]
Iteration 4: 0.12s [advect_vortons: 23ms; step_vorticity: 1ms]
Iteration 5: 0.15s [advect_vortons: 25ms; step_vorticity: 1ms]
[...]
Iteration 96: 2.88s [advect_vortons: 24ms; step_vorticity: 0ms]
Iteration 97: 2.91s [advect_vortons: 23ms; step_vorticity: 1ms]
Iteration 98: 2.94s [advect_vortons: 23ms; step_vorticity: 1ms]
Iteration 99: 2.97s [advect_vortons: 24ms; step_vorticity: 0ms]
Iteration 100: 3.00s [advect_vortons: 24ms; step_vorticity: 0ms]
In the debug configuration, the difference is very surprising, with the computational time reduced from 10+ seconds to under 2 seconds for 10 iterations and a targeted number of vortons of 500. This represents a reduction in computational time of over 80%; just too bad the same could not be achieved in the release configuration.
>.\target\debug\cli.exe --init ..\examples\vortex-ring-500.json --iteration 10 -r
Domain min (X/Y/Z): -10, -10, -10
Domain max (X/Y/Z): 10, 10, 10
Number of vortons: 548
Iteration 1: 0.03s [advect_vortons: 146ms; step_vorticity: 0ms]
Iteration 2: 0.06s [advect_vortons: 145ms; step_vorticity: 1ms]
Iteration 3: 0.09s [advect_vortons: 147ms; step_vorticity: 0ms]
Iteration 4: 0.12s [advect_vortons: 145ms; step_vorticity: 0ms]
Iteration 5: 0.15s [advect_vortons: 147ms; step_vorticity: 0ms]
Iteration 6: 0.18s [advect_vortons: 153ms; step_vorticity: 0ms]
Iteration 7: 0.21s [advect_vortons: 151ms; step_vorticity: 0ms]
Iteration 8: 0.24s [advect_vortons: 146ms; step_vorticity: 0ms]
Iteration 9: 0.27s [advect_vortons: 144ms; step_vorticity: 1ms]
Iteration 10: 0.30s [advect_vortons: 147ms; step_vorticity: 1ms]
This difference prompted me to repeat the analysis to confirm the results, with the same outcome: a circa 80% reduction in computational time in the debug configuration.
Optimisation Attempt #2
This optimisation attempt looks at the compiler options. The optimisation is based on the Build Configuration section of The Rust Performance Book, specifically:
- Link-time optimisation;
- Codegen units;
- CPU specific instructions; and
- Abort on panic!.
The compilation options are changed in the root Cargo.toml, with each change in configuration accompanied by the removal of the target folder to ensure that a fresh compilation is undertaken. Simulations are repeated three times and the results reported as the average over all three runs. Combinations of compilation options have also been tested to evaluate any benefits. Profile-guided optimisation has not been attempted.
The graph below shows the computational time per iteration for the baseline code, based on nalgebra, and for the code modification undertaken in optimisation attempt #1. The modified code has been tested with the different compilation optimisations, while the baseline code is only tested with the default release configuration and the "fat" link-time optimisation option, lto = true.
The lowest computational time is obtained using "fat" link-time optimisation, i.e. adding lto = true to the release profile, with the source code from optimisation attempt #1.
Notably, the comparison shows a computational time reduction of circa 15% against the baseline associated with the code changes made in optimisation attempt #1 when compiled with the default release profile. The performance improvement associated with the code changes shrinks significantly when the "fat" link-time optimisation flag is added to the compilation. This outcome makes the trade-off between the benefit of faster execution and the drawback of a larger codebase to maintain something that needs to be considered carefully.
Parting Word
The assessment showed two interesting behaviours. A code rewrite delivered a significant speed-up, an over 80% reduction, in the debug configuration, but that benefit only translated into a minor improvement, circa 15%, in the release configuration with the default compiler options.
Further speed-up could be achieved using compiler link-time optimisation, both before and after the code rewrite. With this compiler optimisation, the benefit of the code rewrite reduced further, to the point where the trade-off between using a library, as in the initial implementation, and implementing the functionality yourself, as in the code rewrite, needs to be carefully considered.
The benefits of the code rewrite and compiler link-time optimisation have been evaluated on the same code base compiled to WebAssembly. In this environment, the results are slightly different: (a) the compiler link-time optimisation yields very little speed-up; and (b) the code rewrite results in a reduction of computational time of circa 10 to 20% regardless of compiler optimisation. As identified previously, the computational time differs widely between Chrome and Firefox regardless of code rewrite or compiler optimisation.
The code rewrite is kept in the implementation due to its speed-up in WebAssembly. The link-time optimisation option is kept due to its speed-up in native execution.
An additional optimisation, based on using a different algorithm for evaluating the velocity contribution, will be considered in the future. This implementation is on my to-do list and is anticipated to reduce the computational effort significantly. I will attempt it in a couple of months, after some elementary result visualisation is implemented, hopefully in the next story.