What is SIMD NEON?
What is sse2neon.h and why do we need it for OLC PGE 2.0 Mobile development?
- sse2neon.h is required because mobile CPUs and GPUs are nowhere near as powerful as their desktop cousins. We therefore need SIMD (Single Instruction Multiple Data) technology to make our games playable on these weaker devices.
- sse2neon.h allows us to write one set of SIMD instructions that is supported across multiple CPU types (X86/X64/ARM/ARM64) and multiple instruction sets (SSE2, SSE4.2, NEON).
What is SIMD (Wikipedia)
Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.
What is NEON (Wikipedia)
The Advanced SIMD extension (also known as Neon or "MPE" Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardised acceleration for media and signal processing applications.
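In short, SSE (the SIMD instruction set on X86/X64) and NEON (the SIMD instruction set on ARM/ARM64) do the same kind of work but use different syntax.
Therefore we use a header file, sse2neon.h, that translates our SSE intrinsics into their NEON equivalents when the code is compiled for ARM/ARM64.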
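As a minimal sketch of that idea (not code from the engine): the SSE2 intrinsics stay the same on every platform and only the included header changes. `AddFourInts` is a hypothetical helper, and the sketch assumes sse2neon.h is on the include path when building for ARM/ARM64.

```cpp
#include <cassert>
#include <cstdint>

// Pick the header for the target CPU. On ARM/ARM64, sse2neon.h maps each
// SSE intrinsic used below onto equivalent NEON intrinsics.
#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)
#include "sse2neon.h"   // assumption: available on the include path
#else
#include <emmintrin.h>  // SSE2 intrinsics on X86/X64
#endif

// Hypothetical helper: adds four 32-bit integers with a single SIMD add.
inline void AddFourInts(const int32_t* a, const int32_t* b, int32_t* out)
{
    assert(a != nullptr && b != nullptr && out != nullptr);
    // Unaligned loads, so the arrays need no special alignment.
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}
```

Calling `AddFourInts` with `{1, 2, 3, 4}` and `{10, 20, 30, 40}` writes `{11, 22, 33, 44}` to the output array, whichever header was selected.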
SIMD stands for Single Instruction Multiple Data; it is hardware circuitry embedded in the CPU (Central Processing Unit). This circuitry allows the same instruction to be executed across multiple APUs (Arithmetic Processing Units, also known as MPUs, Maths Processing Units).
How this circuitry is built and how it works is out of scope for this document. All we need to know is that we access and execute SIMD instructions using Intrinsic Functions.
Javidx9 (OneLoneCoder.com creator) has an amazing video that clearly explains how SIMD & Intrinsic Functions work here.
There are multiple versions of SIMD (see Later Versions); we are only interested in SSE4, as this is the instruction set implemented in PGE 2.0 Mobile.
Finally: you do not need to understand how intrinsic functions work, nor understand the code, to be able to use the new SIMD instruction sets. It is all built in and hidden away within the OLC PGE 2.0 Mobile engine, which will automatically apply SIMD to your graphics commands, such as olc::DrawSprite, olc::DrawLine, olc::FillRect etc.
Usage:
olc::FillRect(50,50,50,50, olc::BLUE)
Under the hood, when the OLC PGE 2.0 Mobile edition engine sees this command it uses an override method to select the SIMD execution path for it. By that point the compiler has already translated our SSE intrinsics to NEON for ARM/ARM64 processors (through sse2neon.h).
The OLC PGE 2.0 Mobile engine will override olc::DrawSprite(...)
and execute olc::DrawSprite_SIMD(...)
/// <summary>
/// Draws a sprite (SIMD)
/// </summary>
/// <param name="x">Position x</param>
/// <param name="y">Position y</param>
/// <param name="sprite">Pointer to sprite</param>
/// <param name="scale">Sprite scale (>=1)</param>
/// <param name="flip">Sprite Orientation NONE = 0, HORIZ = 1, VERT = 2</param>
/// <param name="pDrawTarget">Pointer to draw target</param>
/// <returns>FAIL = 0, OK = 1</returns>
virtual olc::rcode DrawSprite_SIMD(int32_t x, int32_t y, Sprite* sprite, uint32_t scale = 1, uint8_t flip = olc::Sprite::NONE, olc::Sprite* pDrawTarget = nullptr) = 0;
virtual olc::rcode DrawSprite_SIMD(int32_t x, int32_t y, Sprite* sprite, uint32_t scale, uint8_t flip, olc::Sprite* pDrawTarget) override
{
if (sprite == nullptr) return rcode::FAIL;
olc::vi2d vPos = { x, y };
// Let's check whether the sprite already exists
olc::vi2d vStartPos = { 0,0 };
olc::vi2d vScaleSize = { sprite->width * (int)scale, sprite->height * (int)scale };
olc::Decal* dec = (olc::Decal*)sprite->GetStoredSubDecal(vStartPos, vScaleSize, scale, (olc::Sprite::Flip)flip, pDrawTarget);
if (dec == nullptr)
{
//1: Lets flip it (if Store Sub Sprites has a copy it will be returned)
olc::Sprite* sprFlipped = sprite->Duplicate((olc::Sprite::Flip)flip);
//2: Lets scale it (if Store Sub Sprites has a copy it will be returned)
olc::Sprite* sprScaled = sprFlipped->Duplicate(scale);
//3: Store the SubSprite, a Decal will also be created
if (!sprite->StoreSubSprite(sprScaled, vStartPos, scale, (olc::Sprite::Flip)flip, pDrawTarget))
{
// OK the vector is full or sub sprites disabled
// We Cannot Store the sub sprite, lets draw it
DuplicateMerge_SIMD(vStartPos, pDrawTarget, olc::BLANK, sprScaled);
delete sprScaled;
delete sprFlipped;
return rcode::OK;
}
else
{
//4: Get the newly created Decal
dec = (olc::Decal*)sprite->GetStoredSubDecal(vStartPos, vScaleSize, scale, (olc::Sprite::Flip)flip, pDrawTarget);
//5: Clean up
delete sprFlipped;
}
}
renderer->ptrPGE->DrawDecal(vPos, dec);
return rcode::OK;
}
// Lets flip it (if Store Sub Sprites has a copy it will be returned)
// olc::Sprite* sprFlipped = sprPartial->Duplicate((olc::Sprite::Flip)flip);
virtual olc::Sprite* Duplicate_SIMD(olc::Sprite::Flip flip, olc::Sprite* pSource) override
{
if (pSource == nullptr) return nullptr;
olc::Sprite* spr = nullptr;
// Some optimisations, if we are not flipping just return a duplicate
if ((uint8_t)flip < 1)
{
spr = pSource->Duplicate();
return spr;
}
olc::vi2d vStartPos = { 0, 0 };
olc::vi2d vSize = { pSource->width, pSource->height };
spr = new olc::Sprite(pSource->width, pSource->height);
int sx = 0;
int ex = pSource->width;
int nOffSet = ex % 4;
if (nOffSet > 0)
{
// we need to work out the next multiple of 4 pixels
// Example: vSize.x = 270
nOffSet = (ex / 4) + 1; // 270 / 4 = 67. + 1 = 68
nOffSet = (nOffSet * 4); // 68 * 4 = 272
nOffSet = nOffSet - ex; // therefore the offset is 2
}
int nVecTarget = 0;
float* pTargetVector = (float*)spr->pColData.data();
size_t nVecTLen = spr->pColData.size();
size_t nVecRead = 0; // Start position of read vector
size_t nVecRLen = pSource->pColData.size();
__m128i _sx, _ex, _result, _vecRead;
_sx = _mm_set1_epi32(sx);
_ex = _mm_set1_epi32(ex);
_result = _mm_set1_epi32(0xFF); // 0xFF = -1 -> True, 0x00 = 0 -> False;
if (flip & olc::Sprite::Flip::HORIZ)
{
nVecRead = nVecRLen;
for (int y = 1; y <= pSource->height; y++)
{
if (y == 0) y = 1;
nVecRead = (y * pSource->width) - 4;
for (int x = 0; x < ex; x += 4, pTargetVector += 4, nVecRead += -4, nVecTarget += 4)
{
_vecRead = _mm_load_ps((const float*)((olc::Pixel*)pSource->pColData.data() + nVecRead));
_mm_storer_ps(pTargetVector, _vecRead);
}
pTargetVector -= nOffSet;
nVecTarget -= nOffSet;
}
}
if (flip & olc::Sprite::Flip::VERT)
{
nVecRead = nVecRLen;
for (int y = pSource->height; y > 0; y--)
{
if (nVecTarget + 4 > nVecTLen) break;
nVecRead = (y * pSource->width) + 0;
for (int x = 0; x < ex; x += 4, pTargetVector += 4, nVecRead += 4, nVecTarget += 4)
{
_vecRead = _mm_load_ps((const float*)((olc::Pixel*)pSource->pColData.data() + nVecRead));
_mm_storeu_ps(pTargetVector, _vecRead);
}
pTargetVector -= nOffSet;
nVecTarget -= nOffSet;
}
}
// There MAY be a few leftover pixels; we just clear them
// This is a little cheat: you could write some very complex code to ensure all leftover pixels are matched, but that would be very slow.
// Therefore we just hide them (alpha). At most there could be 1 line of pixels at the top/bottom; in most cases it will be a few pixels
// The end user cannot see that this 1 line of pixels is missing, it is so small
for (size_t x = nVecTarget; x < (size_t)nVecTLen; x++)
{
spr->pColData[x] = olc::BLANK;
}
return spr;
}
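The horizontal flip above comes down to one repeated step: read four pixels from the mirrored end of a row and write them out in reverse order. The sketch below shows that step for a single row. It is a simplified illustration, not the engine's code: it uses unaligned loads with `_mm_shuffle_epi32` rather than `_mm_storer_ps`, and it copies leftover pixels one at a time instead of blanking them. `FlipRowHoriz` is a hypothetical name, pixels are treated as plain 32-bit values, and sse2neon.h is assumed to be available on ARM builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)
#include "sse2neon.h"   // assumption: available on the include path
#else
#include <emmintrin.h>
#endif

// Hypothetical helper: mirrors one row of 32-bit pixels from src into dst.
inline void FlipRowHoriz(const uint32_t* src, uint32_t* dst, std::size_t width)
{
    assert(src != nullptr && dst != nullptr && src != dst);
    std::size_t x = 0;
    // SIMD part: dst[x..x+3] receives the 4 pixels at the mirrored position
    // in src, with their order reversed inside the register.
    for (; x + 4 <= width; x += 4)
    {
        __m128i v = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(src + (width - x - 4)));
        v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3)); // reverse the 4 lanes
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x), v);
    }
    // Scalar tail: at most 3 pixels when width is not a multiple of 4.
    for (; x < width; ++x)
        dst[x] = src[width - 1 - x];
}
```

With a width that is a multiple of 4 the scalar tail never runs, which is the reason for the padding arithmetic in the engine code.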
All the speed none of the headache
You can download a demo of how all the PGE SIMD instruction sets work here.
You can use this demo to learn more about how SIMD for PGE works. OLC PGE 2.0 Mobile uses the same idea but implements it differently.
Please note I do not own the copyright to the images displayed in the demo and this document; these images are used for educational purposes only and cannot be copied or redistributed.
By pressing S you can enable/disable the SIMD instruction set; compare the frame rate when SIMD is disabled to when it is enabled.
You will not always achieve double the frame rate, as in the example above; this depends on what else is set to execute during OnUserUpdate.
You can also press E to enable/disable storing of Sub Sprites (this is explained in the section “A Little More Technical”); again, compare the frame rate when SIMD is enabled and Store Sub Sprites is enabled or disabled.
Store Sub Sprites Disabled:
Store Sub Sprites Enabled:
This demo showcases all the available SIMD options, such as DrawPartialSprite_SIMD, getting Sprites from a Sprite Sheet, and the flip and resize options.
NOTE: Frame rate should not be used as an indication of how fast your game is running; a low frame rate does not mean a slow game. However, if your frame rate is below 24FPS you do need to look at your architecture. A good frame rate is anything above 30FPS. Believe it or not, most consoles are set at 30FPS; only the newer PlayStation 5 and Xbox Series X now have 60FPS. PC gamers, however, have had 60FPS for a very long time. Reference: GPU Mag
A high frame rate just means you can fit more into your OnUserUpdate without affecting the user experience. SIMD will not always increase your frame rate; however, it should stabilize it. SIMD is more likely to lessen the drop in frame rate when more work is being run in OnUserUpdate.
Example only (not realistic):
- Two Sprites being drawn with DrawSprite() lessen the frame rate by 10: 40FPS --> 30FPS
- Two Sprites being drawn with DrawSprite_SIMD() lessen the frame rate by 3: 40FPS --> 37FPS
- However, in a lot of cases SIMD will increase the frame rate, but again it all depends on what is being executed in OnUserUpdate
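One way to read the made-up numbers above is as frame time rather than frame rate. A frame at 40FPS takes 25ms, so a drop to 30FPS means the extra work costs roughly 8.3ms per frame, while a drop to 37FPS costs roughly 2ms. A small sketch of that arithmetic (both function names are hypothetical):

```cpp
#include <cassert>

// Milliseconds available per frame at a given frame rate.
inline double FrameTimeMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra milliseconds per frame implied by a drop from one frame rate to another.
inline double AddedCostMs(double fpsBefore, double fpsAfter)
{
    return FrameTimeMs(fpsAfter) - FrameTimeMs(fpsBefore);
}
```

Thinking in milliseconds makes it easier to judge how much more you can fit into OnUserUpdate before the frame rate becomes a problem.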
The SIMD code is there for you in its lowest form, it is up to your imagination to bring it to life
void PixelGameEngine::Clear_SIMD(Pixel p)
executes the very same functionality as void PixelGameEngine::Clear(Pixel p)
except it clears the Draw Target Vector using a supported SIMD Instruction set (SSE, AVX, AVX512).
Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::Clear(Pixel p)
is called.
Usage:
SetDrawTarget(nLayerPlayer);
Clear_SIMD(olc::BLANK);
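A rough sketch of the idea behind a SIMD clear, assuming the draw target is a flat array of 32-bit pixels: four pixels are written per store, then any leftover pixels are finished one at a time. `ClearPixels` is a hypothetical stand-alone function, not the engine's implementation, and sse2neon.h is assumed to be available on ARM builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)
#include "sse2neon.h"   // assumption: available on the include path
#else
#include <emmintrin.h>
#endif

// Hypothetical helper: sets every pixel in the buffer to the same colour.
inline void ClearPixels(uint32_t* pixels, std::size_t count, uint32_t colour)
{
    assert(pixels != nullptr || count == 0);
    // Broadcast the colour into all four 32-bit lanes of the register.
    const __m128i vColour = _mm_set1_epi32(static_cast<int>(colour));
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4)   // 4 pixels per store
        _mm_storeu_si128(reinterpret_cast<__m128i*>(pixels + i), vColour);
    for (; i < count; ++i)           // at most 3 leftover pixels
        pixels[i] = colour;
}
```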
void PixelGameEngine::FillCircle_SIMD(olc::vi2d pos, int32_t radius, Pixel p)
executes the very same functionality as
void PixelGameEngine::FillCircle(const olc::vi2d& pos, int32_t radius, Pixel p)
, except it draws the Circle using a supported SIMD Instruction set (SSE, AVX, AVX512).
Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::FillCircle(const olc::vi2d& pos, int32_t radius, Pixel p)
is called.
As it uses SIMD instruction sets, it can draw the Circle much faster than the standard FillCircle. To obtain the maximum speed, try to keep your radius a multiple of 4 or 8, e.g. 4, 8, 12, 16, 24, 32, 36… etc. That way the maximum number of pixels is processed via SIMD, and only a very small number of leftover pixels (if any) need to be processed using the standard method.
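To see why multiples of 4 or 8 help, a run of pixels can be split into full SIMD chunks plus a remainder that has to be drawn the standard way. The sketch below does only that bookkeeping; `SpanSplit` and `SplitSpan` are hypothetical names, and `lanes` stands for the number of pixels one SIMD instruction handles.

```cpp
#include <cassert>
#include <cstdint>

// How a run of pixels divides between SIMD and the standard (scalar) path.
struct SpanSplit
{
    int32_t simdPixels; // pixels covered by full SIMD chunks
    int32_t leftover;   // pixels left for the standard method
};

inline SpanSplit SplitSpan(int32_t width, int32_t lanes)
{
    assert(width >= 0 && lanes > 0);
    SpanSplit s;
    s.simdPixels = (width / lanes) * lanes;
    s.leftover = width % lanes;
    return s;
}
```

For example, a run of 32 pixels with 4 lanes leaves nothing over, while a run of 50 pixels leaves 2 pixels for the standard method.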
Overloads:
FillCircle_SIMD(olc::vi2d pos, int32_t radius, Pixel p = olc::WHITE);
FillCircle_SIMD(int32_t x, int32_t y, int32_t radius, Pixel p = olc::WHITE);
Usage:
FillCircle_SIMD({ 50, 110 }, 25, olc::GREEN);
Output:
void PixelGameEngine::FillRect_SIMD(const olc::vi2d& pos, const olc::vi2d& size, Pixel p)
executes the very same functionality as
void PixelGameEngine::FillRect(const olc::vi2d& pos, const olc::vi2d& size, Pixel p)
except it draws the Rectangle using a supported SIMD Instruction set (SSE, AVX, AVX512).
Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::FillRect(const olc::vi2d& pos, const olc::vi2d& size, Pixel p)
is called.
As it uses SIMD instruction sets, it can draw the Rectangle much faster than the standard FillRect. To obtain the maximum speed, try to keep your width a multiple of 4 or 8, e.g. 4, 8, 12, 16, 24, 32, 36… etc. That way the maximum number of pixels is processed via SIMD, and only a very small number of leftover pixels (if any) need to be processed using the standard method.
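A hedged sketch of how a rectangle fill can be laid out row by row, assuming a flat 32-bit pixel buffer and a rectangle that already lies fully inside it. All names are hypothetical and this is not the engine's code. The function returns how many pixels went through the scalar tail, so the effect of the width is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(_M_ARM64)
#include "sse2neon.h"   // assumption: available on the include path
#else
#include <emmintrin.h>
#endif

// Hypothetical helper: fills a w x h rectangle at (x, y) in a target that is
// targetW x targetH pixels. Returns the number of pixels written one by one.
inline std::size_t FillRectPixels(uint32_t* target, int32_t targetW, int32_t targetH,
                                  int32_t x, int32_t y, int32_t w, int32_t h,
                                  uint32_t colour)
{
    assert(target != nullptr);
    assert(x >= 0 && y >= 0 && w >= 0 && h >= 0);
    assert(x + w <= targetW && y + h <= targetH); // no clipping in this sketch

    const __m128i vColour = _mm_set1_epi32(static_cast<int>(colour));
    std::size_t scalarPixels = 0;
    for (int32_t row = y; row < y + h; ++row)
    {
        uint32_t* p = target + static_cast<std::size_t>(row) * targetW + x;
        int32_t i = 0;
        for (; i + 4 <= w; i += 4)            // 4 pixels per store
            _mm_storeu_si128(reinterpret_cast<__m128i*>(p + i), vColour);
        for (; i < w; ++i, ++scalarPixels)    // leftover pixels in this row
            p[i] = colour;
    }
    return scalarPixels;
}
```

A width of 8 leaves nothing for the scalar tail, whereas a width of 6 leaves 2 pixels per row.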
Overloads
FillRect_SIMD(const olc::vi2d& pos, const olc::vi2d& size, Pixel p = olc::WHITE);
FillRect_SIMD(int32_t x, int32_t y, int32_t w, int32_t h, Pixel p = olc::WHITE);
Usage
FillRect_SIMD({ 25, 160 }, { 50, 50 }, olc::RED);
Output:
void PixelGameEngine::FillTriangle_SIMD(const olc::vi2d& pos1, const olc::vi2d& pos2, const olc::vi2d& pos3, Pixel p)
executes the very same functionality as void PixelGameEngine::FillTriangle(const olc::vi2d& pos1, const olc::vi2d& pos2, const olc::vi2d& pos3, Pixel p)
except it draws the Triangle using a supported SIMD Instruction set (SSE, AVX, AVX512).
Should no SIMD instruction set be supported by the CPU the default void PixelGameEngine::FillTriangle(const olc::vi2d& pos1, const olc::vi2d& pos2, const olc::vi2d& pos3, Pixel p)
is called.
As it is using SIMD instruction sets it can draw the Triangle much faster than using the standard Fill Triangle.
Overloads
FillTriangle_SIMD(const olc::vi2d& pos1, const olc::vi2d& pos2, const olc::vi2d& pos3, Pixel p = olc::WHITE);
FillTriangle_SIMD(int32_t x1, int32_t y1, int32_t x2, int32_t y2, int32_t x3, int32_t y3, Pixel p = olc::WHITE);
Usage
FillTriangle_SIMD({ 50, 235 }, { 25, 285 }, { 75, 285 }, olc::YELLOW);
Output
PixelGameEngine::DrawSprite_SIMD(const olc::vi2d& pos, Sprite* sprite, uint32_t scale, uint8_t flip)
executes the very same functionality as
PixelGameEngine::DrawSprite(const olc::vi2d& pos, Sprite* sprite, uint32_t scale, uint8_t flip)
except it draws the Sprite using a supported SIMD Instruction set (SSE, AVX, AVX512).
Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::DrawSprite(const olc::vi2d& pos, Sprite* sprite, uint32_t scale, uint8_t flip)
is called.
There can be great performance gains from drawing your Sprites using SIMD. As with the other SIMD functions, try to keep the width of the Sprite a multiple of 4 or 8, e.g. 4, 8, 12, 16, 24, 32, 36… The height does not need to be a multiple of 4 or 8; it can be any value. It is the width that really matters.
SIMD draws several pixels of a row in one execution; how many depends on the instruction set, for example:
- SSE4 / NEON: Draws 4 Pixels at once (this is primarily used in the OLC PGE 2.0 Mobile Edition)
- AVX2: Draws 8 Pixels at once
- AVX512: Draws 16 Pixels at once
The width & height of the Sprite are automatically recalculated from its position on the Draw Target should the Sprite be placed, or partially placed, outside the bounds of the Draw Target.
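The bounds handling described above can be pictured as clipping the Sprite's rectangle against the Draw Target before any pixels are copied. A minimal sketch, with hypothetical names and no claim to match the engine's internals:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// The visible part of a sprite: where to start reading in the sprite,
// where to start writing in the target, and how many pixels to copy.
struct ClippedRect
{
    int32_t srcX, srcY; // first visible pixel inside the sprite
    int32_t dstX, dstY; // matching position on the draw target
    int32_t w, h;       // visible size (0 when fully off-screen)
};

inline ClippedRect ClipToTarget(int32_t x, int32_t y, int32_t sprW, int32_t sprH,
                                int32_t targetW, int32_t targetH)
{
    assert(sprW >= 0 && sprH >= 0 && targetW >= 0 && targetH >= 0);
    const int32_t left   = std::max(x, 0);
    const int32_t top    = std::max(y, 0);
    const int32_t right  = std::min(x + sprW, targetW);
    const int32_t bottom = std::min(y + sprH, targetH);

    ClippedRect r;
    r.dstX = left;
    r.dstY = top;
    r.srcX = left - x;  // pixels skipped on the sprite's left edge
    r.srcY = top - y;   // rows skipped on the sprite's top edge
    r.w = std::max(right - left, 0);
    r.h = std::max(bottom - top, 0);
    return r;
}
```

A 32 x 32 Sprite drawn at (-10, 5) on a 100 x 100 target keeps 22 x 32 pixels and starts reading at column 10 of the Sprite. The clipped width may no longer be a multiple of 4, in which case a few pixels per row fall back to the standard path.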
Overloads
DrawSprite_SIMD(const olc::vi2d& pos, Sprite* sprite, uint32_t scale = 1, uint8_t flip = olc::Sprite::NONE);
DrawSprite_SIMD(int32_t x, int32_t y, Sprite* sprite, uint32_t scale = 1, uint8_t flip = olc::Sprite::NONE);
Usage
DrawSprite_SIMD({ 100, 50 }, sprMrIcey, 1, olc::Sprite::NONE);
Output
DrawSprite_SIMD({ 400, 50 }, sprMrIcey, 2, olc::Sprite::NONE);
Output
DrawSprite_SIMD(100, 200, sprMrIcey, 1, olc::Sprite::HORIZ);
Output
Usage
DrawSprite_SIMD({ 400, 400 }, sprMrIcey, 2, olc::Sprite::HORIZ);
Output
DrawSprite_SIMD(100, 350, sprMrIcey, 1, olc::Sprite::VERT);
Output
- For Visual Studio All In One Android and iOS (Windows) Project Template: OLC Pixel Game Engine Mobile 2.2.8 Visual Studio for Android and iOS
- For Visual Studio Android Only (Windows) Use this project: OLC Pixel Game Engine Mobile 2.2.8 for Android Visual Studio
- For Android Studio (Windows/Linux/MAC) Use this project: OLC Pixel Game Engine Mobile 2.2.8 for Android Studio
- For Xcode (MAC) Use this project: OLC Pixel Game Engine Mobile 2.2.8 for Xcode