
What is SIMD NEON?

John Galvin edited this page Oct 28, 2024 · 9 revisions

What is sse2neon.h and why do we need it for OLC PGE 2.0 Mobile development?

  • sse2neon.h is required because mobile CPUs and GPUs are nowhere near as powerful as their desktop cousins.
  • Therefore we need to use SIMD (Single Instruction Multiple Data) technology to make our games playable on these weaker devices.
  • sse2neon.h allows us to write one set of SIMD instructions that is supported across multiple CPU types (X86/X64/ARM/ARM64) and multiple instruction sets (SSE2, SSE4.2, NEON).

What is SIMD? (Wikipedia)

Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.
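To make "the same operation on multiple data points simultaneously" concrete, here is a minimal sketch that adds two float arrays four elements at a time. The function name AddArrays and the SKETCH_HAS_SSE2 macro are made up for this illustration and are not part of the PGE; on a target without SSE2 the sketch falls back to a plain loop.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE/SSE2 intrinsics
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper: adds two float arrays element by element.
// 'count' must be a multiple of 4 so every element is covered by a whole vector.
inline void AddArrays(const float* a, const float* b, float* out, std::size_t count)
{
    assert(count % 4 == 0);
#if SKETCH_HAS_SSE2
    for (std::size_t i = 0; i < count; i += 4)
    {
        __m128 va = _mm_loadu_ps(a + i);            // load 4 floats from a
        __m128 vb = _mm_loadu_ps(b + i);            // load 4 floats from b
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); // one add produces 4 results
    }
#else
    for (std::size_t i = 0; i < count; ++i) out[i] = a[i] + b[i]; // scalar fallback
#endif
}
```

With SSE2 available, eight floats are added in two passes of the loop instead of eight.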

What is NEON? (Wikipedia)

The Advanced SIMD extension (also known as Neon or "MPE" Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardised acceleration for media and signal processing applications.


In short, SSE and NEON are both SIMD instruction sets: they do the same kind of work but use different intrinsic functions (syntax).

Therefore we use a header file that maps our SSE intrinsic calls to their NEON equivalents at compile time. This header file is sse2neon.h.
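As a rough sketch of how this looks in practice, the snippet below picks the header per target and then uses the same SSE2 intrinsics on both. The macro SKETCH_HAS_SSE_API and the function AddFourInts are hypothetical names for this example only, and the include guards are an assumption about project layout, not how the PGE itself is organised.

```cpp
#include <cassert>
#include <cstdint>

// Pick the header that provides the SSE-style intrinsics for this target.
#if (defined(__ARM_NEON) || defined(__aarch64__)) && defined(__has_include)
#  if __has_include("sse2neon.h")
#    include "sse2neon.h"   // ARM: SSE intrinsic names implemented with NEON
#    define SKETCH_HAS_SSE_API 1
#  endif
#elif defined(__SSE2__) || defined(_M_X64)
#  include <emmintrin.h>    // x86/x64: the native SSE2 intrinsics
#  define SKETCH_HAS_SSE_API 1
#endif
#ifndef SKETCH_HAS_SSE_API
#  define SKETCH_HAS_SSE_API 0
#endif

// Hypothetical helper: adds four 32-bit integers at once.
// The body is identical for x86 and ARM; only the included header differs.
inline void AddFourInts(const int32_t* a, const int32_t* b, int32_t* out)
{
    assert(a && b && out);
#if SKETCH_HAS_SSE_API
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
#else
    for (int i = 0; i < 4; ++i) out[i] = a[i] + b[i]; // no SIMD header found
#endif
}
```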


SIMD:

SIMD stands for Single Instruction Multiple Data; it is hardware circuitry built into the CPU (Central Processing Unit). This circuitry allows the same instruction to be executed across multiple arithmetic units (ALUs, Arithmetic Logic Units) at once, each working on a different piece of data.

How this circuitry is built and how it works is out of scope for this document. All we need to know is that we access and execute SIMD instructions using Intrinsic Functions.

Javidx9 (OneLoneCoder.com creator) has an amazing video that clearly explains how SIMD & Intrinsic Functions work here.

There are multiple versions of SIMD (see Later Versions); we are only interested in SSE4, as this is the instruction set implemented in the PGE 2.0 Mobile.

Finally: you do not need to understand how intrinsic functions work, nor understand the code, to be able to use the new SIMD instruction sets. It is all built in and hidden away within the OLC PGE 2.0 Mobile Engine, which will automatically apply SIMD to your graphics commands, such as olc::DrawSprite, olc::DrawLine, olc::FillRect etc.

Usage: olc::FillRect(50,50,50,50, olc::BLUE)

Under the hood, when the OLC PGE 2.0 Mobile engine sees this command it uses an override method to select the SIMD execution for it. For ARM/ARM64 processors, the compiler will already have mapped our SSE intrinsics to NEON via sse2neon.h.

Example: What happens when olc::DrawSprite(..) with FLIP is executed:

The OLC PGE 2.0 Mobile engine will override olc::DrawSprite(...) and execute olc::DrawSprite_SIMD(...)

/// <summary>
/// Draws a sprite (SIMD)
/// </summary>
/// <param name="x">Position x</param>
/// <param name="y">Position y</param>
/// <param name="sprite">Pointer to sprite</param>
/// <param name="scale">Sprite scale (>=1)</param>
/// <param name="flip">Sprite Orientation NONE = 0, HORIZ = 1, VERT = 2</param>
/// <param name="pDrawTarget">Pointer to draw target</param>
/// <returns>FAIL = 0, OK = 1</returns>
virtual olc::rcode DrawSprite_SIMD(int32_t x, int32_t y, Sprite* sprite, uint32_t scale = 1, uint8_t flip = olc::Sprite::NONE, olc::Sprite* pDrawTarget = nullptr) = 0;

virtual olc::rcode DrawSprite_SIMD(int32_t x, int32_t y, Sprite* sprite, uint32_t scale, uint8_t flip, olc::Sprite* pDrawTarget) override
{
	if (sprite == nullptr) return rcode::FAIL;
	olc::vi2d vPos = { x, y };

	// Let's check whether the sprite already exists
	olc::vi2d vStartPos = { 0,0 };
	olc::vi2d vScaleSize = { sprite->width * (int)scale, sprite->height * (int)scale };
	olc::Decal* dec = (olc::Decal*)sprite->GetStoredSubDecal(vStartPos, vScaleSize, scale, (olc::Sprite::Flip)flip, pDrawTarget);
	if (dec == nullptr)
	{
		//1: Lets flip it (if Store Sub Sprites has a copy it will be returned)
		olc::Sprite* sprFlipped = sprite->Duplicate((olc::Sprite::Flip)flip);

		//2: Lets scale it (if Store Sub Sprites has a copy it will be returned)
		olc::Sprite* sprScaled = sprFlipped->Duplicate(scale);

		//3: Store the SubSprite, a Decal will also be created
		if (!sprite->StoreSubSprite(sprScaled, vStartPos, scale, (olc::Sprite::Flip)flip, pDrawTarget))
		{
			// OK the vector is full or sub sprites disabled
			// We Cannot Store the sub sprite, lets draw it
			DuplicateMerge_SIMD(vStartPos, pDrawTarget, olc::BLANK, sprScaled);
			delete sprScaled;
			delete sprFlipped;
			return rcode::OK;
		}
		else
		{
			//4: Get the newly created Decal
			dec = (olc::Decal*)sprite->GetStoredSubDecal(vStartPos, vScaleSize, scale, (olc::Sprite::Flip)flip, pDrawTarget);

			//5: Clean up
			delete sprFlipped;

		}

	}

	renderer->ptrPGE->DrawDecal(vPos, dec);
	return rcode::OK;
}

// Lets flip it (if Store Sub Sprites has a copy it will be returned)
// olc::Sprite* sprFlipped = sprPartial->Duplicate((olc::Sprite::Flip)flip);

virtual olc::Sprite* Duplicate_SIMD(olc::Sprite::Flip flip, olc::Sprite* pSource) override
{
	if (pSource == nullptr) return nullptr;

	olc::Sprite* spr = nullptr;

	// Some optimisations, if we are not flipping just return a duplicate
	if ((uint8_t)flip < 1)
	{
		spr = pSource->Duplicate();
		return spr;
	}

	olc::vi2d vStartPos = { 0, 0 };
	olc::vi2d vSize = { pSource->width, pSource->height };

	spr = new olc::Sprite(pSource->width, pSource->height);

	int sx = 0;
	int ex = pSource->width;
	int nOffSet = ex % 4;

	if (nOffSet > 0)
	{
		// we need to work out what is the next multiple of 4 pixels
		// Example: vSize.x = 270
		nOffSet = (ex / 4) + 1; // 270 / 4 = 67. + 1 = 68
		nOffSet = (nOffSet * 4); // 68 * 4 = 272
		nOffSet = nOffSet - ex; // therefore the offset is 2

	}

	int nVecTarget = 0;
	float* pTargetVector = (float*)spr->pColData.data();
	size_t nVecTLen = spr->pColData.size();

	size_t nVecRead = 0; // Start position of read vector
	size_t nVecRLen = pSource->pColData.size();

	__m128i _sx, _ex, _result;
	__m128 _vecRead; // _mm_load_ps returns __m128 (float lanes), not __m128i

	_sx = _mm_set1_epi32(sx);
	_ex = _mm_set1_epi32(ex);
	_result = _mm_set1_epi32(0xFF); // Note: 0xFF is 255 in each lane; an all-ones mask (-1 -> True) would be 0xFFFFFFFF

	if (flip & olc::Sprite::Flip::HORIZ)
	{
		nVecRead = nVecRLen;
		for (int y = 1; y <= pSource->height; y++)
		{
			if (y == 0) y = 1;
			nVecRead = (y * pSource->width) - 4;
			for (int x = 0; x < ex; x += 4, pTargetVector += 4, nVecRead += -4, nVecTarget += 4)
			{

				_vecRead = _mm_load_ps((const float*)((olc::Pixel*)pSource->pColData.data() + nVecRead));
				_mm_storer_ps(pTargetVector, _vecRead);

			}

			pTargetVector -= nOffSet;
			nVecTarget -= nOffSet;

		}


	}

	if (flip & olc::Sprite::Flip::VERT)
	{
		nVecRead = nVecRLen;
		for (int y = pSource->height; y > 0; y--)
		{
			if (nVecTarget + 4 > nVecTLen) break;
			nVecRead = ((y - 1) * pSource->width) + 0; // start of row y-1; y * width would start one row past the end when y == height
			for (int x = 0; x < ex; x += 4, pTargetVector += 4, nVecRead += 4, nVecTarget += 4)
			{
				_vecRead = _mm_load_ps((const float*)((olc::Pixel*)pSource->pColData.data() + nVecRead));
				_mm_storeu_ps(pTargetVector, _vecRead);

			}
			pTargetVector -= nOffSet;
			nVecTarget -= nOffSet;

		}


	}

	// There MAY be a few leftover pixels, we just clear them
	// This is a little cheat, you could write some very complex code to ensure all leftover pixels are matched, but this will be very slow.
	// Therefore we just hide (Alpha) them, at most there could be 1 line of pixels at the top/bottom, in most cases it will be a few pixels
	// The end user cannot see that this 1 line of pixels is missing, it will be so small
	for (size_t x = nVecTarget; x < (size_t)nVecTLen; x++)
	{
		spr->pColData[x] = olc::BLANK;

	}

	return spr;

}
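For comparison, here is a small self-contained sketch of the two ideas used above: rounding a width up to the next multiple of 4, and mirroring a row of pixels four at a time. It is not the engine's implementation. RoundUpTo4 and FlipRowHoriz are hypothetical names, pixels are assumed to be plain 32-bit values, unaligned loads and stores are used so any width works, and the leftover pixels are handled by a short scalar loop instead of being blanked.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Rounds 'width' up to the next multiple of 4 (e.g. 270 -> 272).
inline int RoundUpTo4(int width) { return (width + 3) & ~3; }

// Mirrors one row of 32-bit pixels: dst[i] = src[width - 1 - i].
inline void FlipRowHoriz(const uint32_t* src, uint32_t* dst, int width)
{
    assert(src != dst && width >= 0);
    int x = 0;
#if SKETCH_HAS_SSE2
    for (; x + 4 <= width; x += 4)
    {
        // The 4 pixels that end up at dst[x..x+3] are src[width-x-4 .. width-x-1]
        const uint32_t* from = src + (width - x - 4);
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(from));
        v = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 1, 2, 3)); // reverse the lane order
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + x), v);
    }
#endif
    for (; x < width; ++x) dst[x] = src[width - 1 - x]; // 0 to 3 leftover pixels
}
```

The scalar tail runs at most three times per row, so in this sketch keeping every pixel costs very little; whether that holds inside the engine depends on how its buffers are laid out.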


All the speed none of the headache


DEMO:

You can download a demo of how all the PGE SIMD instruction sets work here.

You can use this demo to learn more about how SIMD for PGE works. OLC PGE 2.0 Mobile uses the same idea, but it is implemented differently.

Please note I do not own the copyright to the images displayed in the demo and this document. These images are used for educational purposes only and cannot be copied or redistributed.

image

By pressing S you can enable/disable the SIMD instruction set. Compare the frame rate when SIMD is disabled to when it is enabled.

SIMD Disabled:

image

SIMD Enabled:

image

You will not always achieve double the frame rate, as in the example above; this depends on what else is executed during OnUserUpdate.

You can also press E to enable/disable storing of Sub Sprites (this is explained in the section “A Little More Technical”). Again, compare the frame rate when SIMD is enabled and Store Sub Sprites is enabled/disabled.

Store Sub Sprites Disabled:

image

Store Sub Sprites Enabled:

image

This demo showcases all the available SIMD options, such as DrawPartialSprite_SIMD, getting Sprites from a Sprite Sheet, and the flip and resize options.

NOTE: Frame rate should not be used as the only indication of how fast your game is running; a low frame rate does not necessarily mean a slow game. However, if you have a frame rate below 24 FPS then you do need to look at your architecture. A good frame rate is anything above 30 FPS. Believe it or not, most consoles are set at 30 FPS; only the newer PlayStation 5 and Xbox Series X now have 60 FPS. However, PC gamers have had 60 FPS for a very long time. Reference: GPU Mag

A high frame rate just means you can fit more into your OnUserUpdate without affecting the user experience. SIMD will not always increase your frame rate; however, it should stabilize it. SIMD is more likely to lessen the drop in frame rate when more work is being executed in OnUserUpdate.

Example only (not realistic):

  • Two Sprites being drawn with DrawSprite() lessen the frame rate by 10: 40FPS --> 30FPS
  • Two Sprites being drawn with DrawSprite_SIMD() lessen the frame rate by 3: 40FPS --> 37FPS
  • However, in a lot of cases it will increase it, but again it all depends on what is being executed in the OnUserUpdate
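Frame rates are easier to compare when converted to time per frame. The sketch below does that arithmetic for the made-up numbers above; FrameTimeMs and AddedCostMs are hypothetical helpers, not engine functions. A drop from 40 FPS to 30 FPS works out to roughly 8.3 ms of extra work per frame, while a drop from 40 FPS to 37 FPS is roughly 2.0 ms.

```cpp
#include <cassert>

// Converts a frame rate into the time available per frame, in milliseconds.
inline double FrameTimeMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra time per frame implied by a drop from 'fpsBefore' to 'fpsAfter'.
inline double AddedCostMs(double fpsBefore, double fpsAfter)
{
    return FrameTimeMs(fpsAfter) - FrameTimeMs(fpsBefore);
}
```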

The SIMD code is there for you in its lowest form, it is up to your imagination to bring it to life


Clear_SIMD

void PixelGameEngine::Clear_SIMD(Pixel p) executes the very same functionality as void PixelGameEngine::Clear(Pixel p) except it clears the Draw Target Vector using a supported SIMD Instruction set (SSE, AVX, AVX512).

Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::Clear(Pixel p) is called.

Usage:

SetDrawTarget(nLayerPlayer);
Clear_SIMD(olc::BLANK);
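To give a feel for what a SIMD clear involves, here is a minimal sketch that fills a buffer of 32-bit pixels four at a time and finishes any remainder with a plain loop. ClearPixels is a hypothetical stand-alone function and the flat uint32_t buffer is an assumption; the engine's own Clear_SIMD works on its Draw Target and may be written quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper: sets 'count' 32-bit pixels to 'colour'.
inline void ClearPixels(uint32_t* pixels, std::size_t count, uint32_t colour)
{
    assert(pixels != nullptr || count == 0);
    std::size_t i = 0;
#if SKETCH_HAS_SSE2
    const __m128i fill = _mm_set1_epi32(static_cast<int>(colour)); // colour in all 4 lanes
    for (; i + 4 <= count; i += 4)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(pixels + i), fill); // write 4 pixels
#endif
    for (; i < count; ++i) pixels[i] = colour; // leftover pixels, or the no-SIMD path
}
```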

FillCircle_SIMD

void PixelGameEngine::FillCircle_SIMD(olc::vi2d pos, int32_t radius, Pixel p) executes the very same functionality as void PixelGameEngine::FillCircle(const olc::vi2d& pos, int32_t radius, Pixel p), except it draws the Circle using a supported SIMD Instruction set (SSE, AVX, AVX512).

Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::FillCircle(const olc::vi2d& pos, int32_t radius, Pixel p) is called.

As it uses SIMD instruction sets, it can draw the Circle much faster than the standard Fill Circle. To obtain the maximum speed, try to keep your radius a multiple of 4 or 8, e.g. 4, 8, 12, 16, 24, 32, 36… This way the maximum number of pixels is processed via SIMD, and only a very small number (if any) of leftover pixels need to be processed using the standard method.

Overloads:

FillCircle_SIMD(olc::vi2d pos, int32_t radius, Pixel p = olc::WHITE);
FillCircle_SIMD(int32_t x, int32_t y, int32_t radius, Pixel p = olc::WHITE);

Usage:

FillCircle_SIMD({ 50, 110 }, 25, olc::GREEN);

Output:

image


FillRect_SIMD:

void PixelGameEngine::FillRect_SIMD(const olc::vi2d& pos, const olc::vi2d& size, Pixel p) executes the very same functionality as void PixelGameEngine::FillRect(const olc::vi2d& pos, const olc::vi2d& size, Pixel p) except it draws the Rectangle using a supported SIMD Instruction set (SSE, AVX, AVX512).

Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::FillRect(const olc::vi2d& pos, const olc::vi2d& size, Pixel p) is called.

As it uses SIMD instruction sets, it can draw the Rectangle much faster than the standard Fill Rectangle. To obtain the maximum speed, try to keep your width a multiple of 4 or 8, e.g. 4, 8, 12, 16, 24, 32, 36… This way the maximum number of pixels is processed via SIMD, and only a very small number (if any) of leftover pixels need to be processed using the standard method.
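The effect of the width can be seen in a small sketch of a row-by-row rectangle fill. FillRectSketch is a hypothetical function working on a flat row-major buffer of 32-bit pixels, and it assumes the rectangle has already been clipped; it is an illustration of the idea, not the engine's FillRect_SIMD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Fills a w x h rectangle at (x, y) in a buffer that is 'targetWidth' pixels wide.
// The caller is assumed to have clipped the rectangle to the buffer already.
inline void FillRectSketch(uint32_t* target, int targetWidth,
                           int x, int y, int w, int h, uint32_t colour)
{
    assert(target && x >= 0 && y >= 0 && w >= 0 && h >= 0 && x + w <= targetWidth);
    for (int row = 0; row < h; ++row)
    {
        uint32_t* p = target + static_cast<std::size_t>(y + row) * targetWidth + x;
        int col = 0;
#if SKETCH_HAS_SSE2
        const __m128i fill = _mm_set1_epi32(static_cast<int>(colour));
        for (; col + 4 <= w; col += 4) // whole groups of 4 pixels
            _mm_storeu_si128(reinterpret_cast<__m128i*>(p + col), fill);
#endif
        for (; col < w; ++col) p[col] = colour; // 0 to 3 leftover pixels per row
    }
}
```

With a width of 50, each row takes 12 vector stores plus 2 scalar writes; with a width of 48 the scalar loop never runs.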

Overloads

FillRect_SIMD(const olc::vi2d& pos, const olc::vi2d& size, Pixel p = olc::WHITE);
FillRect_SIMD(int32_t x, int32_t y, int32_t w, int32_t h, Pixel p = olc::WHITE);

Usage

FillRect_SIMD({ 25, 160 }, { 50, 50 }, olc::RED);

Output:

image


FillTriangle_SIMD:

void PixelGameEngine::FillTriangle_SIMD(const olc::vi2d& pos1, const olc::vi2d& pos2, const olc::vi2d& pos3, Pixel p) executes the very same functionality as void PixelGameEngine::FillTriangle(const olc::vi2d& pos1, const olc::vi2d& pos2, const olc::vi2d& pos3, Pixel p) except it draws the Triangle using a supported SIMD Instruction set (SSE, AVX, AVX512).

Should no SIMD instruction set be supported by the CPU the default void PixelGameEngine::FillTriangle(const olc::vi2d& pos1, const olc::vi2d& pos2, const olc::vi2d& pos3, Pixel p) is called.

As it is using SIMD instruction sets it can draw the Triangle much faster than using the standard Fill Triangle.

Overloads

FillTriangle_SIMD(const olc::vi2d& pos1, const olc::vi2d& pos2, const olc::vi2d& pos3, Pixel p = olc::WHITE);

FillTriangle_SIMD(int32_t x1, int32_t y1, int32_t x2, int32_t y2, int32_t x3, int32_t y3, Pixel p = olc::WHITE);

Usage

FillTriangle_SIMD({ 50, 235 }, { 25, 285 }, { 75, 285 }, olc::YELLOW);

Output

image


DrawSprite_SIMD:

PixelGameEngine::DrawSprite_SIMD(const olc::vi2d& pos, Sprite* sprite, uint32_t scale, uint8_t flip) executes the very same functionality as PixelGameEngine::DrawSprite(const olc::vi2d& pos, Sprite* sprite, uint32_t scale, uint8_t flip) except it draws the Sprite using a supported SIMD Instruction set (SSE, AVX, AVX512).

Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::DrawSprite(const olc::vi2d& pos, Sprite* sprite, uint32_t scale, uint8_t flip) is called.

There can be great performance gains by drawing your Sprites using SIMD. As with other SIMD instructions it is important to try to keep the width of the Sprite a multiple of 4 or 8, e.g. 4, 8, 12, 16, 24, 32, 36… The height does not need to be a multiple of 4 or 8, it can be any figure. It is the width that really matters.

SIMD draws each row of pixels in groups of a fixed size, one group per execution, for example:

  • SSE4 / NEON: Draws 4 Pixels at once (This is primarily used in the OLC PGE 2.0 Mobile Edition)
  • AVX2: Draws 8 Pixels at once
  • AVX512: Draws 16 Pixels at once
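A quick way to see why the width matters is to split a row into full groups and leftover pixels. SplitWidth and WidthSplit below are hypothetical names used only for this illustration.

```cpp
#include <cassert>

struct WidthSplit
{
    int simdPasses; // full groups of 'lanes' pixels per row
    int leftover;   // pixels per row left for the standard (scalar) path
};

// Splits a row of 'width' pixels into full SIMD groups and a remainder.
inline WidthSplit SplitWidth(int width, int lanes)
{
    assert(width >= 0 && lanes > 0);
    return { width / lanes, width % lanes };
}
```

For a sprite 270 pixels wide, 4 lanes give 67 groups with 2 pixels left over, while 16 lanes give 16 groups with 14 left over; a width of 272 leaves nothing over in either case.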

If the Sprite is placed wholly or partially outside the bounds of the Draw Target, the engine automatically recalculates the width & height of the Sprite from its position so that only the visible part is drawn.
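The recalculation can be pictured as intersecting the sprite rectangle with the Draw Target rectangle. The sketch below shows one common way to do this; ClippedRect and ClipToTarget are hypothetical names, and the engine's own clipping code may differ.

```cpp
#include <algorithm>
#include <cassert>

struct ClippedRect
{
    int srcX, srcY; // first visible pixel inside the sprite
    int dstX, dstY; // where that pixel lands on the draw target
    int w, h;       // visible size; 0 means there is nothing to draw
};

// Intersects a sprite placed at (x, y) with a target of targetW x targetH pixels.
inline ClippedRect ClipToTarget(int x, int y, int spriteW, int spriteH,
                                int targetW, int targetH)
{
    assert(spriteW >= 0 && spriteH >= 0 && targetW >= 0 && targetH >= 0);
    const int left   = std::max(x, 0);
    const int top    = std::max(y, 0);
    const int right  = std::min(x + spriteW, targetW);
    const int bottom = std::min(y + spriteH, targetH);

    ClippedRect r;
    r.srcX = left - x;               // pixels skipped on the left of the sprite
    r.srcY = top - y;                // rows skipped at the top of the sprite
    r.dstX = left;
    r.dstY = top;
    r.w = std::max(right - left, 0); // clamp to 0 when fully off-screen
    r.h = std::max(bottom - top, 0);
    return r;
}
```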

Overloads

DrawSprite_SIMD(const olc::vi2d& pos, Sprite* sprite, uint32_t scale = 1, uint8_t flip = olc::Sprite::NONE);
DrawSprite_SIMD(int32_t x, int32_t y, Sprite* sprite, uint32_t scale = 1, uint8_t flip = olc::Sprite::NONE);

Usage

DrawSprite_SIMD({ 100, 50 }, sprMrIcey, 1, olc::Sprite::NONE);

Output

image

DrawSprite_SIMD({ 400, 50 }, sprMrIcey, 2, olc::Sprite::NONE);

Output

image

DrawSprite_SIMD(100, 200, sprMrIcey, 1, olc::Sprite::HORIZ);

Output

image

Usage

DrawSprite_SIMD({ 400, 400 }, sprMrIcey, 2, olc::Sprite::HORIZ);

Output

image

DrawSprite_SIMD(100, 350, sprMrIcey, 1, olc::Sprite::VERT);

Output

image


DrawPartialSprite_SIMD:

PixelGameEngine::DrawPartialSprite_SIMD(const olc::vi2d& pos, Sprite* sprite, const olc::vi2d& sourcepos, const olc::vi2d& size, uint32_t scale, uint8_t flip) executes the very same functionality as PixelGameEngine::DrawPartialSprite(const olc::vi2d& pos, Sprite* sprite, const olc::vi2d& sourcepos, const olc::vi2d& size, uint32_t scale, uint8_t flip) except it draws the Sprite using a supported SIMD Instruction set (SSE, AVX, AVX512).

Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::DrawPartialSprite(const olc::vi2d& pos, Sprite* sprite, const olc::vi2d& sourcepos, const olc::vi2d& size, uint32_t scale, uint8_t flip) is called.

There can be great performance gains by drawing your Sprites using SIMD. As with other SIMD instructions it is important to try to keep the width of the Sprite a multiple of 4 or 8, e.g. 4, 8, 12, 16, 24, 32, 36… The height does not need to be a multiple of 4 or 8, it can be any figure. It is the width that really matters.

SIMD draws each row of pixels in groups of a fixed size, one group per execution, for example:

  • SSE4 / NEON: Draws 4 Pixels at once (This is primarily used in the OLC PGE 2.0 Mobile Edition)
  • AVX2: Draws 8 Pixels at once
  • AVX512: Draws 16 Pixels at once

If the Sprite is placed wholly or partially outside the bounds of the Draw Target, the engine automatically recalculates the width & height of the Sprite from its position so that only the visible part is drawn.

Using DrawPartialSprite_SIMD with Sprite Sheets and Store Sub Sprites (see section “A Little More Technical”) enabled can greatly improve the speed of your game.
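When a Sprite Sheet is laid out as a regular grid, the source position for a given frame can be computed rather than typed in by hand. This is a sketch under that assumption only; FrameRect and FrameFromIndex are hypothetical helpers, and sheets with frames of different sizes need explicit coordinates instead.

```cpp
#include <cassert>

struct FrameRect { int x, y, w, h; };

// Returns the source rectangle of frame 'index' in a sheet of equally sized
// frames, counted left to right and then top to bottom.
inline FrameRect FrameFromIndex(int index, int frameW, int frameH, int sheetW)
{
    assert(index >= 0 && frameW > 0 && frameH > 0 && sheetW >= frameW);
    const int columns = sheetW / frameW; // frames per row of the sheet
    return { (index % columns) * frameW, (index / columns) * frameH, frameW, frameH };
}
```

The resulting x, y and w, h values would then be passed as the sourcepos and size arguments of DrawPartialSprite_SIMD.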

Overloads

DrawPartialSprite_SIMD(const olc::vi2d& pos, Sprite* sprite, const olc::vi2d& sourcepos, const olc::vi2d& size, uint32_t scale = 1, uint8_t flip = olc::Sprite::NONE);

DrawPartialSprite_SIMD(int32_t x, int32_t y, Sprite* sprite, int32_t ox, int32_t oy, int32_t w, int32_t h, uint32_t scale = 1, uint8_t flip = olc::Sprite::NONE);

Usage

DrawPartialSprite_SIMD({ 300, 60 }, sprMrIcey, { 50, 50 }, { 80,80 }, 1, olc::Sprite::NONE);

Output

image

DrawPartialSprite_SIMD({ 300, 200 }, sprMrIcey, { 50, 50 }, { 80,80 }, 2, olc::Sprite::HORIZ);

Output

image

Sprite Sheet Usage:

image

DrawPartialSprite_SIMD({ 700, 100 }, sprSpriteSheet, { 0, 0 }, { 150,150 }, 1, olc::Sprite::NONE);

Output

image

DrawPartialSprite_SIMD({ 850, 100 }, sprSpriteSheet, { 150, 0 }, { 100,150 }, 1, olc::Sprite::HORIZ);

Output

image

DrawPartialSprite_SIMD({ 950, 120 }, sprSpriteSheet, { 250, 0 }, { 100,150 }, 1, olc::Sprite::VERT);

Output

image


DrawMergeSprite_SIMD:

DrawMergeSprite() and DrawMergeSprite_SIMD() are two new methods introduced in the OLC PGE 2.0 Mobile Engine.

As before, both methods do exactly the same thing, except one uses SIMD instruction sets.

Draw Merge Sprite simply merges one sprite into another.

Example:

image

When merging sprites, it is important to know the layout. In normal method executions “what is on the left becomes what is on the right”

Example:

int b = 5;
int a = b;

Therefore, the above results in a becoming b, with a value of 5:

  • Side Note:
    • “=” means becomes (it does not mean equals)
    • “==” means equals
    • For fun… don’t you just love JavaScript! “===” means absolute (strict) equals: everything and the kitchen sink must be the same. 😒

The same is true for the DrawMergeSprite(). In the above example we are merging image into image

Example:

obj a = image ;

obj b = image ;

obj c = a + b;

c = image ;

Therefore, c becomes (image + image ) with a value of image

Let’s look at the command.

void PixelGameEngine::DrawMergeSprite_SIMD(const olc::vi2d& vPos, Sprite* pFromSprite, const olc::vi2d& vToSpritePos, Sprite* pToSprite, Pixel blendPixel, uint32_t scale, olc::Sprite::Flip flip) executes the very same functionality as void PixelGameEngine::DrawMergeSprite(const olc::vi2d& vPos, Sprite* pFromSprite, const olc::vi2d& vToSpritePos, Sprite* pToSprite, Pixel blendPixel, uint32_t scale, olc::Sprite::Flip flip) except it merges the Sprites using a supported SIMD Instruction set (SSE, AVX, AVX512).

Should no SIMD instruction set be supported by the CPU the default PixelGameEngine::DrawMergeSprite(const olc::vi2d& vPos, Sprite* pFromSprite, const olc::vi2d& vToSpritePos, Sprite* pToSprite, Pixel blendPixel, uint32_t scale, olc::Sprite::Flip flip) is called.

We merge from one sprite into another. It is important to have the order correct, and as always test your code.

  • vPos is the location on the screen where you want to draw the merged Sprite.
  • pFromSprite is image and pToSprite is image .
  • vToSpritePos is the position within image (pToSprite) where you want to draw image (pFromSprite).
  • blendPixel is the blend Pixel. This is the Pixel colour that is not to be drawn from image (pFromSprite).
    • The engine will not draw this pixel but will instead take the corresponding pixel from image (pToSprite) and draw it.
    • Getting this blend pixel right is the key to a successful merge.
    • The good news is that it defaults to Transparent, so merging PNGs is a breeze.
  • scale is the scale you want the Output Sprite to be image image .
  • flip sets whether you want to flip the Output Sprite image image
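The role of the blend pixel can be sketched as a per-pixel select: wherever the from pixel equals the blend pixel, the to pixel is used instead. The code below is a simplified illustration, not the engine's merge. MergePixels is a hypothetical function, pixels are treated as plain 32-bit values compared exactly, and both inputs are assumed to be rows of the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Writes out[i] = (from[i] == blendPixel) ? to[i] : from[i] for 'count' pixels.
// 'from' and 'to' are only read, so neither input is modified.
inline void MergePixels(const uint32_t* from, const uint32_t* to, uint32_t* out,
                        std::size_t count, uint32_t blendPixel)
{
    assert(count == 0 || (from && to && out));
    std::size_t i = 0;
#if SKETCH_HAS_SSE2
    const __m128i key = _mm_set1_epi32(static_cast<int>(blendPixel));
    for (; i + 4 <= count; i += 4)
    {
        __m128i f = _mm_loadu_si128(reinterpret_cast<const __m128i*>(from + i));
        __m128i t = _mm_loadu_si128(reinterpret_cast<const __m128i*>(to + i));
        __m128i mask = _mm_cmpeq_epi32(f, key);              // all ones where from == blendPixel
        __m128i r = _mm_or_si128(_mm_and_si128(mask, t),     // take 'to' where masked
                                 _mm_andnot_si128(mask, f)); // keep 'from' elsewhere
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), r);
    }
#endif
    for (; i < count; ++i) // leftover pixels, or the no-SIMD path
        out[i] = (from[i] == blendPixel) ? to[i] : from[i];
}
```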

There can be great performance gains by merging your Sprites using SIMD. As with other SIMD instructions it is important to try to keep the width of the Sub Sprite a multiple of 4 or 8, e.g. 4, 8, 12, 16, 24, 32, 36…

The height does not need to be in multiples of 4 or 8, it can be any figure. It is the width that really matters.

  • SSE4/NEON: Draws 4 Pixels at once (Primary use in the OLC PGE 2.0 Mobile Engine)
  • AVX2: Draws 8 Pixels at once
  • AVX512: Draws 16 Pixels at once

When merging sprites, the engine (currently) will not auto scale your From Sprite to fit correctly within the To Sprite.

Remember The PGE is not Paint Shop Pro

image

If the pFromSprite is placed wholly or partially outside the bounds of the pToSprite, the engine automatically recalculates the width & height from its position and crops the pFromSprite to fit.

Note: The image (pFromSprite) and the image (pToSprite) are not affected by the merge, they are not overwritten.

Overloads

DrawMergeSprite_SIMD(const olc::vi2d& vPos, Sprite* pFromSprite, const olc::vi2d& vToSpritePos, Sprite* pToSprite, Pixel blendPixel, uint32_t scale, olc::Sprite::Flip flip)

DrawMergeSprite_SIMD(int32_t vPosx, int32_t vPosy, Sprite* pFromSprite, int32_t vToSpritePosx, int32_t vToSpritePosy,Sprite* pToSprite, Pixel blendPixel, uint32_t scale, olc::Sprite::Flip flip)

Usage

DrawMergeSprite({ 100, 500 }, sprFace1, { 10, 40 }, sprBody3D, olc::BLANK, 1, olc::Sprite::NONE);
DrawMergeSprite_SIMD({ 300, 500 }, sprFace1, { 10, 40 }, sprBody3D, olc::BLANK, 1, olc::Sprite::NONE);

Output

image