Krita/Benchmarking
Benchmarking Krita performance
Tile engine
Data Manager
- Test image dimensions: 4096x4096, RGB
- I ran every test several times and kept the results that recurred most often
- The callgrind backend did not produce callgrind.* files, so I ran valgrind directly; the resulting profiles therefore also include the Qt test library itself
- http://lukast.mediablog.sk/callgrind/DatamanagerBenchmarks.tar.gz
| benchmark name | walltime | tickcounter | Mb/s |
|---|---|---|---|
| benchmarkWriteBytes | 38.0 msec per iteration (total: 380, iterations: 10) | 77,528,468.2 ticks per iteration (total: 775284683, iterations: 10) | 1333.3 Mb/s |
| benchmarkReadBytes | 39.3 msec per iteration (total: 394, iterations: 10) | 77,311,910.2 ticks per iteration (total: 773119103, iterations: 10) | 1628.4 Mb/s |
| benchmarkReadWriteBytes | 46.2 msec per iteration (total: 462, iterations: 10) | 91,198,881.7 ticks per iteration (total: 911988817, iterations: 10) | 1391.3 Mb/s |
| benchmarkExtent | 0.00020 msec per iteration (total: 34, iterations: 163840) | 735.0 ticks per iteration (total: 7350, iterations: 10) | N/A |
| benchmarkClear | 1.3 msec per iteration (total: 26, iterations: 20) | 2,542,070.2 ticks per iteration (total: 25420702, iterations: 10) | N/A |
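The selection rule described in the notes above (run each test several times and keep the value that recurs most often) can be sketched as follows. The helper name `most_frequent_result` and the sample timings are hypothetical and are not taken from the benchmark archive; rounding to one decimal is an assumption that mirrors the precision shown in the table.

```python
from collections import Counter

def most_frequent_result(walltimes_ms, ndigits=1):
    """Return the walltime that occurs most often after rounding."""
    rounded = [round(t, ndigits) for t in walltimes_ms]
    value, _count = Counter(rounded).most_common(1)[0]
    return value

# Hypothetical repeated runs of one benchmark (msec per iteration).
runs = [38.0, 41.2, 38.0, 39.9, 38.0]
print(most_frequent_result(runs))  # 38.0
```

Taking the most frequent value, rather than the mean, keeps a single slow outlier run from shifting the reported number.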
Iterators
Horizontal Iterator
- The image is 4096x4096, RGBA colorspace, 8 bits per channel (64 Mb)
- http://lukast.mediablog.sk/callgrind/HLineBenchmarks.tar.gz
| benchmark name | walltime | tickcounter | Mb/s |
|---|---|---|---|
| benchmarkWriteBytes | 1,383.4 msec per iteration (total: 13834, iterations: 10) | 4,389,801,089.3 ticks per iteration (total: 43898010893, iterations: 10) | 46.3 Mb/s |
| benchmarkReadBytes | 1,443.2 msec per iteration (total: 14433, iterations: 10) | 4,461,418,645.5 ticks per iteration (total: 44614186455, iterations: 10) | 44.4 Mb/s |
| benchmarkConstReadBytes | 1,380.7 msec per iteration (total: 13808, iterations: 10) | 4,501,257,062.3 ticks per iteration (total: 45012570623, iterations: 10) | 46.3 Mb/s |
| benchmarkReadWriteBytes | 2,041.7 msec per iteration (total: 20418, iterations: 10) | 5,736,531,494.3 ticks per iteration (total: 57365314943, iterations: 10) | 31.3 Mb/s |
| benchmarkNoMemCpy | 655.7 msec per iteration (total: 6557, iterations: 10) | 3,025,535,970.6 ticks per iteration (total: 30255359707, iterations: 10) | 97.7 Mb/s |
| benchmarkConstNoMemCpy | 583.7 msec per iteration (total: 5837, iterations: 10) | 2,889,942,765.8 ticks per iteration (total: 28899427658, iterations: 10) | 109.6 Mb/s |
| benchmarkTwoIteratorsNoMemCpy | 1,205.7 msec per iteration (total: 12057, iterations: 10) | 3,952,530,421.5 ticks per iteration (total: 39525304215, iterations: 10) | 53.1 Mb/s |
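The Mb/s column appears to be the image size divided by the walltime per iteration, with 1 Mb taken as 2^20 bytes (4096 x 4096 pixels x 4 bytes = 64 Mb). A minimal sketch of that calculation, under this assumption; the function name is illustrative only:

```python
def throughput_mb_per_s(width, height, bytes_per_pixel, walltime_ms):
    """Image size in Mb (2**20 bytes) divided by the walltime in seconds."""
    size_mb = width * height * bytes_per_pixel / 2**20
    return size_mb / (walltime_ms / 1000.0)

# benchmarkWriteBytes: 1,383.4 msec per iteration on a 4096x4096 RGBA image
print(round(throughput_mb_per_s(4096, 4096, 4, 1383.4), 1))  # 46.3
# benchmarkReadWriteBytes: 2,041.7 msec per iteration
print(round(throughput_mb_per_s(4096, 4096, 4, 2041.7), 1))  # 31.3
```

Both values match the table, which supports reading "Mb" here as 2^20 bytes rather than megabits.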
Update
State: trunk, 17 Feb 2010, 15:38
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkWriteBytes | 1,548.0 msec per iteration (total: 15481, iterations: 10) | 41.34 Mb/s |
| benchmarkReadBytes | 3,087.8 msec per iteration (total: 30878, iterations: 10) | 20.73 Mb/s |
| benchmarkConstReadBytes | 3,062.0 msec per iteration (total: 30620, iterations: 10) | 20.90 Mb/s |
| benchmarkReadWriteBytes | 3,725.0 msec per iteration (total: 37251, iterations: 10) | 17.18 Mb/s |
| benchmarkNoMemCpy | 2,264.4 msec per iteration (total: 22644, iterations: 10) | 28.26 Mb/s |
| benchmarkConstNoMemCpy | 2,316.8 msec per iteration (total: 23168, iterations: 10) | 27.62 Mb/s |
| benchmarkTwoIteratorsNoMemCpy | 2,950.0 msec per iteration (total: 29501, iterations: 10) | 21.69 Mb/s |
State: caching patch applied to trunk
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkWriteBytes | 1,211.4 msec per iteration (total: 12114, iterations: 10) | 52.83 Mb/s (speedup 1.28) |
| benchmarkReadBytes | 1,196.2 msec per iteration (total: 11962, iterations: 10) | 53.50 Mb/s (speedup 2.58) |
| benchmarkConstReadBytes | 1,202.2 msec per iteration (total: 12022, iterations: 10) | 53.24 Mb/s (speedup 2.55) |
| benchmarkReadWriteBytes | 1,563.0 msec per iteration (total: 15631, iterations: 10) | 40.95 Mb/s (speedup 2.38) |
| benchmarkNoMemCpy | 389.1 msec per iteration (total: 3891, iterations: 10) | 164.48 Mb/s (speedup 5.82) |
| benchmarkConstNoMemCpy | 372.5 msec per iteration (total: 3725, iterations: 10) | 171.81 Mb/s (speedup 6.21) |
| benchmarkTwoIteratorsNoMemCpy | 670.3 msec per iteration (total: 6704, iterations: 10) | 95.48 Mb/s (speedup 4.4) |
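The speedup figures can be reproduced by dividing the trunk walltime by the walltime with the caching patch applied. A short sketch, with `speedup` as an illustrative helper name:

```python
def speedup(before_ms, after_ms):
    """Ratio of the old walltime to the new walltime (higher is better)."""
    return before_ms / after_ms

# Walltimes taken from the two tables above (trunk vs. caching patch).
print(round(speedup(1548.0, 1211.4), 2))  # benchmarkWriteBytes -> 1.28
print(round(speedup(2264.4, 389.1), 2))   # benchmarkNoMemCpy   -> 5.82
```

Dividing the Mb/s values instead gives the same ratios up to rounding, since throughput is inversely proportional to walltime for a fixed image size.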
Vertical Iterator
- The image is 4096x4096, RGBA colorspace, 8 bits per channel (64 Mb)
- http://www.valdyas.org/~lukast/VLineIteratorBenchmarks.tar.gz
| benchmark name | walltime | tickcounter | Mb/s |
|---|---|---|---|
| benchmarkWriteBytes | 1,541.9 msec per iteration (total: 15419, iterations: 10) | Not measured | 41.52 Mb/s |
| benchmarkReadBytes | 1,534.4 msec per iteration (total: 15344, iterations: 10) | Not measured | 41.7 Mb/s |
| benchmarkConstReadBytes | 1,460.5 msec per iteration (total: 14606, iterations: 10) | Not measured | 43.82 Mb/s |
| benchmarkReadWriteBytes | 2,156.3 msec per iteration (total: 21563, iterations: 10) | Not measured | 29.7 Mb/s |
| benchmarkNoMemCpy | 649.0 msec per iteration (total: 6490, iterations: 10) | Not measured | 98.6 Mb/s |
| benchmarkConstNoMemCpy | 599.3 msec per iteration (total: 5994, iterations: 10) | Not measured | 106.7 Mb/s |
| benchmarkTwoIteratorsNoMemCpy | 1,231.5 msec per iteration (total: 12316, iterations: 10) | Not measured | 52 Mb/s |
Rectangular Iterator
- The image is 4096x4096, RGBA colorspace, 8 bits per channel (64 Mb)
- http://valdyas.org/~lukast/RectIteratorBenchmarks.tar.gz
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkWriteBytes | 118.2 msec per iteration (total: 1182, iterations: 10) | 541.4 Mb/s |
| benchmarkReadBytes | 121.7 msec per iteration (total: 1217, iterations: 10) | 525.9 Mb/s |
| benchmarkConstReadBytes | 120.5 msec per iteration (total: 1205, iterations: 10) | 533.3 Mb/s |
| benchmarkReadWriteBytes | 167.0 msec per iteration (total: 1670, iterations: 10) | 383.2 Mb/s |
| benchmarkNoMemCpy | 35.7 msec per iteration (total: 358, iterations: 10) | 1792.7 Mb/s |
| benchmarkConstNoMemCpy | 37.7 msec per iteration (total: 377, iterations: 10) | 1697.6 Mb/s |
| benchmarkTwoIteratorsNoMemCpy | 65.2 msec per iteration (total: 652, iterations: 10) | 981.6 Mb/s |
Random Iterator
- The image is 4096x4096, RGBA colorspace, 8 bits per channel (64 Mb)
- http://lukast.mediablog.sk/callgrind/RandomIterBenchmarks.tar.gz
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkWriteBytes | 1,641.5 msec per iteration (total: 16415, iterations: 10) | 39.0 Mb/s |
| benchmarkReadBytes | 1,598.5 msec per iteration (total: 15985, iterations: 10) | 40.0 Mb/s |
| benchmarkConstReadBytes | 1,654.5 msec per iteration (total: 16545, iterations: 10) | 38.68 Mb/s |
| benchmarkReadWriteBytes | 2,934.8 msec per iteration (total: 29348, iterations: 10) | 21.8 Mb/s |
| benchmarkNoMemCpy | 971.3 msec per iteration (total: 9714, iterations: 10) | 65.9 Mb/s |
| benchmarkConstNoMemCpy | 938.6 msec per iteration (total: 9386, iterations: 10) | 68.2 Mb/s |
| benchmarkTwoIteratorsNoMemCpy | 1,929.7 msec per iteration (total: 19298, iterations: 10) | 33.2 Mb/s |
| benchmarkTileByTileWrite | 1,310.0 msec per iteration (total: 13101, iterations: 10) | 48.9 Mb/s |
| benchmarkTotalRandom | 27,999 msec per iteration (total: 27999, iterations: 1) | 2.2 Mb/s |
| benchmarkTotalRandomConst | 29,124 msec per iteration (total: 29124, iterations: 1) | 2.2 Mb/s |
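One plausible factor behind the gap between tile-by-tile access and fully random access is how often consecutive pixel accesses land in a different tile. The sketch below only counts such tile switches for three access orders on a small image; it does not measure Krita itself. The 64x64 tile size is an assumption, and all function names are hypothetical.

```python
import random

def count_tile_switches(coords, tile_size=64):
    """Count how often consecutive (x, y) accesses fall in different tiles."""
    switches = 0
    previous = None
    for x, y in coords:
        tile = (x // tile_size, y // tile_size)
        if previous is not None and tile != previous:
            switches += 1
        previous = tile
    return switches

def row_by_row(width, height):
    """Scan the image one full row at a time."""
    return [(x, y) for y in range(height) for x in range(width)]

def tile_by_tile(width, height, tile_size=64):
    """Visit every pixel of one tile before moving to the next tile."""
    coords = []
    for ty in range(0, height, tile_size):
        for tx in range(0, width, tile_size):
            for y in range(ty, min(ty + tile_size, height)):
                for x in range(tx, min(tx + tile_size, width)):
                    coords.append((x, y))
    return coords

def shuffled(width, height, seed=0):
    """Visit every pixel exactly once in a seeded random order."""
    coords = row_by_row(width, height)
    random.Random(seed).shuffle(coords)
    return coords

print(count_tile_switches(row_by_row(128, 128)))    # 255
print(count_tile_switches(tile_by_tile(128, 128)))  # 3
print(count_tile_switches(shuffled(128, 128)))      # far more than either
```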
KisPainter
Composition (bitBlt)
- The image is 4096x4096, RGBA colorspace, 8 bits per channel (64 Mb)
- Two images are composited 20 times in a loop, with and without selections
- http://lukast.mediablog.sk/callgrind/KisPainterBenchmarks.tar.gz
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkBitBlt | 5,456.8 msec per iteration (total: 54569, iterations: 10) | 234.6 Mb/s |
| benchmarkBitBltSelection | 5,922.8 msec per iteration (total: 59228, iterations: 10) | 216.1 Mb/s |
| benchmarkFixedBitBlt | 3,635.5 msec per iteration (total: 36356, iterations: 10) | 352.1 Mb/s |
| benchmarkFixedBitBltSelection | 5,342.1 msec per iteration (total: 53421, iterations: 10) | 239.6 Mb/s |
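Because each iteration composites the image 20 times, the Mb/s values here appear to count 20 x 64 Mb of data per iteration. A sketch of that reading, assuming 1 Mb = 2^20 bytes; the function name is illustrative:

```python
def composite_throughput(width, height, bytes_per_pixel, passes, walltime_ms):
    """Mb (2**20 bytes) processed per second when the image is composited
    `passes` times within one timed iteration."""
    size_mb = width * height * bytes_per_pixel / 2**20
    return passes * size_mb / (walltime_ms / 1000.0)

# benchmarkBitBlt: 20 compositions in 5,456.8 msec
print(round(composite_throughput(4096, 4096, 4, 20, 5456.8), 1))  # 234.6
# benchmarkFixedBitBlt: 20 compositions in 3,635.5 msec
print(round(composite_throughput(4096, 4096, 4, 20, 3635.5), 1))  # 352.1
```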
Filters
Brightness/Contrast
- A random 3274x2067 RGBA 8-bit image is generated (the dimensions of the pippin test image; 25.82 Mb)
- The curve is linear (0.0 to 1.0)
- http://lukast.mediablog.sk/callgrind/BContrastBenchmark.tar.gz
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkFilter | 1,783.5 msec per iteration (total: 17835, iterations: 10) | 14.47 Mb/s |
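A linear curve from 0.0 to 1.0 is an identity mapping, so this benchmark measures the cost of pushing every pixel through the filter rather than any visible change. The sketch below shows the idea with a 256-entry lookup table for 8-bit channels. It is a hypothetical illustration, not Krita's filter code, and `build_lut` and `apply_lut` are made-up names.

```python
def build_lut(curve):
    """Sample a curve mapping [0.0, 1.0] -> [0.0, 1.0] into a 256-entry table."""
    return [min(255, max(0, round(curve(i / 255.0) * 255.0)))
            for i in range(256)]

def apply_lut(channel_values, lut):
    """Map every 8-bit channel value through the lookup table."""
    return bytes(lut[v] for v in channel_values)

def linear(t):
    """The linear curve used in the benchmark: output equals input."""
    return t

lut = build_lut(linear)
pixels = bytes([0, 17, 128, 255])
print(apply_lut(pixels, lut) == pixels)  # True
```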
Blur
- A random 3274x2067 RGBA 8-bit image is generated (the dimensions of the pippin test image; 25.82 Mb)
- Blur uses its default settings, which means a 5x5 convolution
- http://lukast.mediablog.sk/callgrind/blurBenchmark.tar.gz
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkFilter | 31,674 msec per iteration (total: 31674, iterations: 1) | 0.81 Mb/s |
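To give a sense of the work a 5x5 convolution implies, here is a naive single-channel version. It is a sketch only: the box kernel and the clamped edge handling are assumptions chosen for illustration, and Krita's convolution code is not reproduced here.

```python
def convolve(image, kernel):
    """Naive 2D convolution of a list-of-rows image with a square kernel.
    Edge pixels are clamped. The kernel is not flipped, which makes no
    difference for symmetric kernels such as the box kernel below."""
    height, width = len(image), len(image[0])
    size = len(kernel)
    radius = size // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(size):
                for kx in range(size):
                    sy = min(height - 1, max(0, y + ky - radius))
                    sx = min(width - 1, max(0, x + kx - radius))
                    acc += image[sy][sx] * kernel[ky][kx]
            out[y][x] = acc
    return out

# 5x5 box kernel: 25 equal weights that sum to 1.
box5 = [[1.0 / 25.0] * 5 for _ in range(5)]

flat = [[10.0] * 6 for _ in range(6)]
blurred = convolve(flat, box5)
print(all(abs(v - 10.0) < 1e-9 for row in blurred for v in row))  # True

# Multiply-adds for a naive 5x5 pass over a 3274x2067 image with 4 channels.
print(3274 * 2067 * 4 * 25)  # 676735800
```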
Projection
- We load a 1000x753, 100 dpi image in Krita's native format that contains all types of layers (group, effect, adjustment, ...)
- The projection is computed by refreshGraph()
- We save the image in Krita's native format again
- http://lukast.mediablog.sk/callgrind/ProjectionBenchmark.tar.gz
Loading, projection, and saving are benchmarked together in one go.
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkProjection | 834.6 msec per iteration (total: 8346, iterations: 10) | N/A |
Painting strokes
- We paint on an empty 4096x4096 paint device
- The brush is a 70px pixel brush with the autobrush (the default one)
- The benchmark can run with any paintop; only the preset needs to be changed
- The first test paints the stroke you can see in the preview box, at a different scale, on the 4096x4096px image
- The second test paints 20 random lines (the same 20 lines in every test) with pressure varying from 0.0 to 1.0
- http://lukast.mediablog.sk/callgrind/strokeBenchmarks.tar.gz [TODO: add bounds result]
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkStroke | 2,962 msec per iteration (total: 2962, iterations: 1) | N/A |
| benchmarkRandomLines | 18,576 msec per iteration (total: 18576, iterations: 1) | N/A |
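For the random-lines test to be comparable between runs, the same 20 lines have to be generated every time. One common way to get that is a fixed random seed, sketched below. The function name, the seed value, and the choice of a single pressure per line are assumptions made for illustration; the benchmark itself may generate its lines differently.

```python
import random

def random_lines(count=20, size=4096, seed=42):
    """Generate `count` reproducible lines as (start, end, pressure) tuples.
    A fixed seed makes every call return exactly the same lines."""
    rng = random.Random(seed)
    lines = []
    for _ in range(count):
        start = (rng.uniform(0, size), rng.uniform(0, size))
        end = (rng.uniform(0, size), rng.uniform(0, size))
        pressure = rng.uniform(0.0, 1.0)
        lines.append((start, end, pressure))
    return lines

first = random_lines()
second = random_lines()
print(len(first), first == second)  # 20 True
```

Using a private `random.Random` instance rather than the module-level generator keeps the sequence independent of any other code that draws random numbers.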
First results
Computer specification
- CPU: Intel(R) Core(TM)2 Duo CPU P7350 @2.00GHz ( details http://ark.intel.com/Product.aspx?id=36750&code=p7350 )
- RAM: 2 GB
- Graphics: NVidia 9200M
- Fedora 12 i686 (32 bit version), KDE4.4 RC2, Qt 4.6.1
Compiler options
gcc -Wnon-virtual-dtor -Wno-long-long -ansi -Wundef -Wcast-align -Wchar-subscripts -Wall -W -Wpointer-arith -Wformat-security -fno-exceptions -DQT_NO_EXCEPTIONS -fno-check-new -fno-common -Woverloaded-virtual -fno-threadsafe-statics -fvisibility=hidden -fvisibility-inlines-hidden -O2 -g -fPIC -Wl,--enable-new-dtags
The CMake configuration has an option called KritaDevs, which is what I used for the benchmarking. The flags above were obtained with make VERBOSE=1.
First optimizations
With performance fix + FastMath::atan2
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkStroke | 650.2 msec per iteration (total: 6503, iterations: 10) | N/A |
| benchmarkRandomLines | 4,158.8 msec per iteration (total: 41589, iterations: 10) | N/A |
Cyrille's tuning commits around lunch
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkStroke | 533.3 msec per iteration (total: 5334, iterations: 10) | N/A |
| benchmarkRandomLines | 3,555.5 msec per iteration (total: 35556, iterations: 10) | N/A |
Just with performance fix
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkStroke | 683.7 msec per iteration (total: 6838, iterations: 10) | N/A |
| benchmarkRandomLines | 4,696.3 msec per iteration (total: 46964, iterations: 10) | N/A |
Compute 1/4 for the symmetrical brushes
| benchmark name | walltime | Mb/s |
|---|---|---|
| benchmarkStroke | 257.3 msec per iteration (total: 2574, iterations: 10) | N/A |
| benchmarkRandomLines | 1,449.2 msec per iteration (total: 14492, iterations: 10) | N/A |
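To compare the optimization stages, each walltime can be set against the stroke results from the Painting strokes table. The sketch below assumes that table (2,962 and 18,576 msec) is the unoptimized baseline; note that the baseline was measured with 1 iteration and the later stages with 10. The variable and function names are illustrative.

```python
# Assumed baseline: the Painting strokes table (msec per iteration).
baseline_ms = {"benchmarkStroke": 2962.0, "benchmarkRandomLines": 18576.0}

# Walltimes from the tables in this section, in the order listed.
stages = [
    ("With performance fix + FastMath::atan2",
     {"benchmarkStroke": 650.2, "benchmarkRandomLines": 4158.8}),
    ("Cyrille's tuning commits",
     {"benchmarkStroke": 533.3, "benchmarkRandomLines": 3555.5}),
    ("Just with performance fix",
     {"benchmarkStroke": 683.7, "benchmarkRandomLines": 4696.3}),
    ("Compute 1/4 for the symmetrical brushes",
     {"benchmarkStroke": 257.3, "benchmarkRandomLines": 1449.2}),
]

def relative_speedups(baseline, stage):
    """Baseline walltime divided by stage walltime, per benchmark."""
    return {name: baseline[name] / stage[name] for name in baseline}

for label, stage in stages:
    ratios = relative_speedups(baseline_ms, stage)
    summary = ", ".join(f"{name}: {ratio:.1f}x" for name, ratio in ratios.items())
    print(f"{label}: {summary}")
```

Under that assumption, the final stage comes out at roughly 11.5x for benchmarkStroke and 12.8x for benchmarkRandomLines.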

