Why I Love DirectX Shader Reflection. Also, Why I Hate DirectX Shader Reflection.

Based on the title, it should be obvious that this isn’t the updated talk on my multithreaded renderer I mentioned last post.  That whole write-up is still in the works, but today I thought I’d jump into D3D shader reflection, how it’s both great and worth implementing, and how it’s completely ruined by its limitations.

So, my shader manager has undergone very little change since I first implemented it last year for Evac.  It’s a very simple class, just designed to hold all of my compiled shaders, keep track of the last shader used, and allow materials to switch shaders on demand.  Very bare bones, but very functional.  The biggest problem with it has been that I have to do too many things manually when I want to add a new shader: I have to write the HLSL, add an entry to my shaders enum, write a new C++ class for the shader, and then update the shader manager, since each shader having its own class means I have to explicitly create and destruct the new types.  Ideally, I should only need to write the HLSL and update the enum, and everything else would happen automatically.

When I was rebuilding my framework this year, this was an area I investigated for improvement.  I was able to refactor my shader class code so that 95% of it was generic and could live in a single class servicing all of my compiled shaders.  The problem was the input layout generation.  Since that had to be defined per vertex fragment (or at least, per unique vertex input struct), I needed a way to generate it properly for each shader, and that led to still having a C++ class per shader, even if 95% of it was copy/paste and the only unique code between them was the input layout declaration.  An improvement to be sure, but not nearly enough of one.

Now, I’ve been aware of, and interested in, the shader reflection system provided by D3D for a while, but I’d always considered the time commitment to research, implement, and debug it not worth it when I already had a working, albeit slightly tedious, shader system.  This week finally tipped the scales: I found myself avoiding implementing something via a new shader simply because I didn’t want to go through the hassle of the whole process.  So, I took the plunge.

Before I get into the source code, there are two things worth sharing.  The first is that I am using the shader reflection system solely to generate my input layout from my HLSL, in an effort to create a single, generic shader class; the system as a whole is very powerful and can do a lot more than the small subsection I’m discussing here.  The second is that I took the basis of my implementation from this post by Bobby Anguelov, and it’s probably worth a read if any of this interests you.  With that said, here’s the function I wrote that generates my input layouts:


void CompiledShader::CreateVertexInputLayout(ID3D11Device* pDevice, ShaderBytecode* pBytecode, const char* pFileName)
{
 ID3D11ShaderReflection* lVertexShaderReflection = nullptr;
 if (FAILED(D3DReflect(pBytecode->bytecode, pBytecode->size, IID_ID3D11ShaderReflection, (void**) &lVertexShaderReflection)))
 {
  return;
 }

 D3D11_SHADER_DESC lShaderDesc;
 lVertexShaderReflection->GetDesc(&lShaderDesc);

 std::ifstream lStream;
 lStream.open(pFileName, std::ios_base::binary);
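 // pFileName is the small per-shader metadata file that supplies InputSlot, InputSlotClass,
 // and InstanceDataStepRate for each input parameter (more on that below).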
 bool lStreamIsGood = lStream.is_open();
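 // Sentinel outside the valid 0-15 input slot range, so the first element of slot 0 always gets an offset of 0.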
 unsigned lLastInputSlot = 900;

 std::vector<D3D11_INPUT_ELEMENT_DESC> lInputLayoutDesc;
 for (unsigned lI = 0; lI < lShaderDesc.InputParameters; lI++)
 {
  D3D11_SIGNATURE_PARAMETER_DESC lParamDesc;
  lVertexShaderReflection->GetInputParameterDesc(lI, &lParamDesc);

  D3D11_INPUT_ELEMENT_DESC lElementDesc;
  lElementDesc.SemanticName = lParamDesc.SemanticName;
  lElementDesc.SemanticIndex = lParamDesc.SemanticIndex;

  if (lStreamIsGood)
  {
   lStream >> lElementDesc.InputSlot;
   lStream >> reinterpret_cast<unsigned&>(lElementDesc.InputSlotClass);
   lStream >> lElementDesc.InstanceDataStepRate;
  }
  else
  {
   lElementDesc.InputSlot = 0;
   lElementDesc.InputSlotClass = D3D11_INPUT_PER_VERTEX_DATA;
   lElementDesc.InstanceDataStepRate = 0;
  }

  lElementDesc.AlignedByteOffset = lElementDesc.InputSlot == lLastInputSlot ? D3D11_APPEND_ALIGNED_ELEMENT : 0;
  lLastInputSlot = lElementDesc.InputSlot;

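  // lParamDesc.Mask is a component bitmask of what the parameter uses: 1 = x, 3 = xy, 7 = xyz, 15 = xyzw.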
  if (lParamDesc.Mask == 1)
  {
   if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_UINT32) lElementDesc.Format = DXGI_FORMAT_R32_UINT;
   else if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_SINT32) lElementDesc.Format = DXGI_FORMAT_R32_SINT;
   else if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_FLOAT32) lElementDesc.Format = DXGI_FORMAT_R32_FLOAT;
  }
  else if (lParamDesc.Mask <= 3)
  {
   if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_UINT32) lElementDesc.Format = DXGI_FORMAT_R32G32_UINT;
   else if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_SINT32) lElementDesc.Format = DXGI_FORMAT_R32G32_SINT;
   else if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_FLOAT32) lElementDesc.Format = DXGI_FORMAT_R32G32_FLOAT;
  }
  else if (lParamDesc.Mask <= 7)
  {
   if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_UINT32) lElementDesc.Format = DXGI_FORMAT_R32G32B32_UINT;
   else if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_SINT32) lElementDesc.Format = DXGI_FORMAT_R32G32B32_SINT;
   else if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_FLOAT32) lElementDesc.Format = DXGI_FORMAT_R32G32B32_FLOAT;
  }
  else if (lParamDesc.Mask <= 15)
  {
   if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_UINT32) lElementDesc.Format = DXGI_FORMAT_R32G32B32A32_UINT;
   else if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_SINT32) lElementDesc.Format = DXGI_FORMAT_R32G32B32A32_SINT;
   else if (lParamDesc.ComponentType == D3D_REGISTER_COMPONENT_FLOAT32) lElementDesc.Format = DXGI_FORMAT_R32G32B32A32_FLOAT;
  }

  lInputLayoutDesc.push_back(lElementDesc);
 }

 lStream.close();

 pDevice->CreateInputLayout(&lInputLayoutDesc[0], lShaderDesc.InputParameters, pBytecode->bytecode, pBytecode->size, &m_layout);

 lVertexShaderReflection->Release();
}

Now, some places where I differ from Bobby’s implementation.

The easiest is the AlignedByteOffset.  He keeps track of how many bytes each parameter of the input struct takes up and calculates this as he goes.  However, the first element of a given input slot is always at offset zero, and any following element in the same input slot can be given D3D11_APPEND_ALIGNED_ELEMENT and get the correct result.  A small difference, and his version works, but this is simpler code and less likely to ever become a headache.  I am admittedly assuming that you’re not doing weird packing on your input structs that could otherwise break my code, but that would break Bobby’s too, so I don’t feel too bad about it.  You’ll also notice that my lLastInputSlot variable starts at 900, and probably think that’s weird.  The API documentation says that valid values for input slots are 0 – 15, and I needed a value that ensures the first element in slot 0 properly gets an offset of 0, so this was a way to do that.  Any value > 15 would work; I picked 900 for no good reason.

Now we start to get into the territory that infuriates me about D3D shader reflection.  And the worst part is that I understand why this limitation exists, and I understand it’s reasonable for things to be this way, but I’m mad that I can’t write code that fully automates this process and allows me to be as lazy as I want to be.  I am referring to the unholy trio of InputSlot, InputSlotClass, and InstanceDataStepRate.  These are related fields, and if you never combine multiple input streams you can safely default them to 0, D3D11_INPUT_PER_VERTEX_DATA, and 0 and live a happy, carefree life.  However, if you’re doing any batching through input stream combining, this becomes a very different, and annoying, story.

See, the reflection system is able to glean every other necessary piece of data from your HLSL because it directly exists in your HLSL; that’s how reflection works.  However, there is nothing there to denote what input slot a parameter belongs to, if it’s per_vertex or per_instance data, or what the step rate is if it’s per_instance data.  There aren’t even any optional syntax keywords to give the reflection system hints at what you want, which would be acceptable for making this work.  Instead, you get nothing!  So, my solution was to create a small metadata file for each vertex shader file that just denotes input slot, input slot class, and instance data step rate per input struct parameter.  If you do the same thing, it’s worth noting that D3D11_INPUT_PER_VERTEX_DATA is 0 and D3D11_INPUT_PER_INSTANCE_DATA is 1.
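To make that concrete, here’s what one of these metadata files might look like.  This is a made-up example, not one of my actual files: a hypothetical instanced vertex shader whose input struct has three per-vertex parameters streamed from slot 0, followed by the four rows of a per-instance world matrix streamed from slot 1 with a step rate of 1.  Each line is just InputSlot, InputSlotClass, and InstanceDataStepRate for one input parameter, in declaration order, since that’s the order the function above reads them in:

0 0 0
0 0 0
0 0 0
1 1 1
1 1 1
1 1 1
1 1 1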

So, I no longer have to create a C++ class per shader, and my shader manager automatically handles new shaders based on additions to my shaders enum, but I do have to create this metadata file per vertex fragment.  It’s sub-optimal to be sure, but definitely still a huge win over the old system.  And I was able to do the whole overhaul in a day.  So, if you’re in anywhere near the same boat I was, I’d definitely recommend looking into the shader reflection system.

However, there is one last caveat.  It doesn’t really matter to me as far as shipping my game this year, but I could see it being a pain at some point, and probably for other people.  If you use the D3DReflect function, you cannot pass validation for the Windows App Store.  Microsoft has an insane plan wherein you cannot compile or reflect your shaders at runtime at all.  I understand the logic here, but I also can’t help but think this undermines what I think is great about the reflection system.  I was able to put in relatively minimal effort and reap a huge benefit.  If I wanted to bring my game “up to code” to pass app store validation, I would need to put in a lot more work to reflect my input layout into a file at compile time and then load that at runtime.  It’s not undoable by any means, but it really forces you to either dive into the deep end or stay out entirely as far as the tools they’re providing here.  I guess I’m just not a fan.

Anyway, that’s that.  Hopefully you found something in all of this useful.  And at some point soon, I really will write about my finished multithreaded renderer.  But there may be a post or two before that happens.  We’ll see.

The Shift To GGX And A Glimpse Of Things To Come

I’m just going to jump into things and not really explain why there’s been a four-month gap since the last post.  It should be pretty sufficient to just say… school.  However, I now have a lot of stuff to post about, so I’m hoping that means a lot of posts in quick succession.  And I think many of them are going to be much more directly technical in nature, complete with code samples.  We’ll see how that actually shakes out.

So, as mentioned in my last post, one of the things I wanted to do this year is move towards a more physically based system for rendering.  Now, I might have pipe dreams of implementing global illumination and a full microfacet model and really doing this thing right, but let’s be realistic.  I only get so much time around the rest of my classes and staying sane, and it can’t all be spent on my lighting pipeline.  I have to deliver features and tools so my team can make the game we’re making, and when it comes down to it our game could ship just fine on last year’s lighting model if it had to.  Luckily, short of implementing CryEngine, there were a number of much smaller things I could accomplish that made a significant impact on both the visual fidelity of the system and the artists’ ability to represent a wider range of materials.  The first of these was to change the distribution function for my specular calculation.

Now a little back story.  Over the summer, Blizzard sent all of their graphics programmers to SIGGRAPH, and they were gracious enough to have that include me despite just being an intern.  While there, I attended the Physically Based Shading course, and it presented me with a lot of information that I hadn’t necessarily previously considered, but which made a lot of sense as soon as I saw it; it was basically my experience at GDC last year, all over again.  All of the talks brought a lot of useful information, but Brian Karis’ talk about UE4’s rendering system ended up being the true catalyst of that morning.  I did a lot more research on my own and ended up replacing my Blinn distribution function with GGX, and the results were pretty astonishing despite not adopting any other part of the Cook-Torrance microfacet model.  This change also rendered the specular texture completely unnecessary; I ended up repurposing those channels to store material properties like roughness, and it opened up a lot more possibilities as far as having varied materials across a single model.
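For anyone curious what the swap actually amounts to, here’s a rough sketch of the two distribution terms in isolation.  This isn’t lifted from my shaders; it’s just the textbook forms written C++-style (the HLSL is nearly identical), and the roughness-to-alpha remapping here is one common convention rather than necessarily the one you’d pick:

#include <cmath>

const float kPi = 3.14159265f;

// Normalized Blinn-Phong distribution, driven by a specular power.
float DistributionBlinn(float pNdotH, float pSpecularPower)
{
  return (pSpecularPower + 2.0f) / (2.0f * kPi) * std::pow(pNdotH, pSpecularPower);
}

// GGX (Trowbridge-Reitz) distribution, driven by roughness, with alpha = roughness * roughness.
float DistributionGGX(float pNdotH, float pRoughness)
{
  float lAlpha = pRoughness * pRoughness;
  float lAlphaSq = lAlpha * lAlpha;
  float lDenom = pNdotH * pNdotH * (lAlphaSq - 1.0f) + 1.0f;
  return lAlphaSq / (kPi * lDenom * lDenom);
}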

Here is a before and after comparison of the distribution functions:

Blinn

GGX

Now, as a programmer, I am perfectly capable of designing and implementing this system.  However, I am no artist, and Photoshop is definitely not my friend.  So, the results here are based on very terrible texture manipulation by me, and you might be scratching your head over how much of an improvement this really was.  But I still think it proves my point well enough.  In the top image, the entire car looks plasticky, which is pretty common for Blinn.  In the bottom, the highlight tails vary, but overall feel much warmer.  The yellowness of the nearby point light also has more effect on the objects in the scene in the GGX version.  And because material properties are stored in textures, the GGX version is also able to present the canopy as shiny while the body is much duller.

So, while this is certainly still not a fully physically based system, I was able to make significant improvements relatively quickly while maintaining nearly equal computational cost.  I call that a win.  Next time, I plan to delve into my finished multithreaded renderer, and talk about how incredibly wrong some elements of the code I previously posted about it were.  So, it’ll be fun.  Especially for me.  Look forward to it!

Errors With Continuous Output. Also, Terrain.

Last time I said I would be talking about my work on terrain and some errors I ran into over the course of developing my framework for this year.  So, surprisingly, that’s what’s about to happen.

I’ll talk about terrain first because it’s visually more appealing (and I actually remembered to take screenshots of it!) and also likely infinitely less boring to people that aren’t me.  Right now I’m taking a fairly traditional approach of utilizing a heightmap to generate a 2D array of vertices, and decoding the color values of each pixel into height values.  I’m still early into my implementation, so things are very basic but functional.  Here’s a screen capture:

So, again, I’m still very early into my implementation.  That image represents approximately 3 days’ worth of work.  I still have a lot of work to do in spatial partitioning, LOD, and blending different detail maps to get something that looks and performs as well as I need it to.  I also found an Nvidia sample that utilizes the tessellation hardware to great effect that I want to try to incorporate.  However, right now I’m able to generate 10 square kilometers’ worth of game world terrain and maintain over 60fps, so I’m not unhappy with how things are going.  To get a sense of the scale, here are some images:

TerrainWork2

TerrainWork

Now, unfortunately these screenshots have different terrains in them, and it kind of ruins the effect of what I was trying to illustrate.  But the movement of the terrain is minuscule as the car moves out of view into the far distance.  I would have retaken them to fix this inconsistency, except I’ve made dramatic changes to the lighting model that I want to save for my next post.  Sorry!
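Before moving on, since I glossed over the decode step above, here’s a minimal sketch of the heightmap-to-vertex-grid idea.  It’s not my actual terrain code; the names and scale parameters are made up for illustration, and it assumes an 8-bit grayscale heightmap already loaded into memory:

#include <vector>

struct TerrainVertex
{
  float x, y, z;
};

// Turn a row-major, 8-bit grayscale heightmap into an XZ grid of positions.
std::vector<TerrainVertex> BuildTerrainVertices(const unsigned char* pPixels,
  unsigned pWidth, unsigned pHeight, float pCellSize, float pMaxHeight)
{
  std::vector<TerrainVertex> lVertices;
  lVertices.reserve(pWidth * pHeight);

  for (unsigned lZ = 0; lZ < pHeight; ++lZ)
  {
    for (unsigned lX = 0; lX < pWidth; ++lX)
    {
      // Decode the 0-255 color value into a 0-pMaxHeight height value.
      float lHeight = (pPixels[lZ * pWidth + lX] / 255.0f) * pMaxHeight;
      TerrainVertex lVertex = { lX * pCellSize, lHeight, lZ * pCellSize };
      lVertices.push_back(lVertex);
    }
  }

  return lVertices;
}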

And that brings me to the other issue I wanted to talk about: errors in the framework.  Some of them are caused by subtle problems but are still obvious.  For example, setting the depth function to D3D11_COMPARISON_EQUAL can be a great way to re-use depth information and eliminate pixel overdraw on subsequent material passes.  However, if one pass multiplies the view-projection matrix against the world matrix on the CPU, and another does it on the GPU, you will not get the result you want because of the differing precision.  It seems obvious as I type it, but it took me about 30 minutes to realize what was going on.  But at least it was incredibly obvious that something WAS wrong, because the output was broken.

Things get tougher when your error produces continuous output and you don’t realize anything is wrong, and this can happen a lot in graphics.  A good example is the depth data usage in my lighting and post-processing.  The image in my last post was actually completely incorrect, but I didn’t realize it when it was initially taken.  Tweaking some data led me to realize that moving the camera far enough away caused illumination to zero out, which makes no sense.  But for a large range of distances it seemed correct, so it took me a while to even realize there was a problem.  A good way to simulate this (though it wasn’t my actual problem) is to bind the active depth buffer as the shader resource view my lighting samples, ensuring I get an all-black texture.  That’s obviously wrong, but the output you get can seem right within the right data set.  This problem also manifested itself in my depth of field effect.  When I first implemented it, I happened upon values that worked against my incorrect depth data.  It wasn’t until I started tweaking focal distance and width that I realized something was very wrong.

Those problems might be specific to me, but I think the real takeaway here is that when you implement something new in graphics, you really need to test a wide range of data to ensure your effect is really doing what you want, and not doing it by circumstance.  If you don’t, a fundamental error can persist in your code base for a long, long time without anyone realizing, and it makes it much harder to add new features when basic code additions don’t do what you should reasonably expect them to.

So, that’s it for this time.  I wish I’d taken some screenshots of output of the errors I talked about, but I think the information is still pretty worthwhile without pretty pictures.  Next time (hopefully soon), I’ll talk about the start of my move to a physically based rendering model, beginning with replacing the specular reflection model.

Back to Square One?

So, in my last post, I said that my big goal was to thread my renderer and to do a better job of writing it against a public interface so that it could be a stand-alone library.  Good goals, but now that they’re basically finished I have a lot of new goals.  However, since that post was made a week-plus after the work it described, and it’s now been a month since that post, I think my primary goal is to get better at writing these more regularly.  It might mean the technology changes in each update are less dramatic, but I think it’ll let me spend more time properly illustrating what I’m working on at the time and lead to better posts overall.  We’ll see how that turns out.

So, previously I had reworked my rendering pipeline to be multi-threaded, and managed to draw a triangle.  Which is great and terrible at the same time.  But, since the rewrite was still 85% built on my framework from last year, once the core worked in the new thread-safe system it was much less work to re-implement everything else I had last year.  I was quickly able to bring in refactored versions of my previous material and lighting processes and get to a point where I could render a lit scene that was as complex as I cared to hard code (since the editor wasn’t quite ready to go yet).  As you can tell from the image, I didn’t care to hard code very much.

So, that’s what you can see; let’s talk about what you can’t see.  Last year, I built my render target class to own a back buffer and a depth buffer, and created a system to copy a depth buffer from one render target into another.  I was told it was strange not to share a single depth buffer across my render targets, but there were cases where a process wanted the main geometry pass’s depth buffer yet needed to modify it for its own use without changing it for geometry.  There were also cases where I downscaled and upscaled render targets, which necessitated depth buffers of different sizes; that’s how I achieved my geometry-occluded outlines last year.  It worked for me, but it didn’t come without its difficulties.  And as I was re-implementing render targets this year, I realized that I had overlooked a lot of potential performance gains last year.

I was re-using depth information for geometry-occluded outlines, and I was using it to generate lighting information, but I wasn’t actually using it to cut down on overdraw.  And overdraw was a huge problem in Evac (especially the final level, which was huge).  So, I refactored my render target system to instead keep a pool of back and depth buffers, and a render target now just holds associations to the buffers in the pool that it wants.  This allows easy sharing of buffers between targets (and without DirectX complaining about binding a buffer for read and write simultaneously!), in a way that let me add a Z pre-pass process that builds the shared depth buffer while staying lightweight by only ever engaging the vertex shader.  Once that depth buffer is built, all processes that utilize it can set their depth-stencil state’s depth function to D3D11_COMPARISON_EQUAL to ensure that each screen pixel is engaged by a pixel shader only once.  And since the pixel shader is often where pipeline bottlenecks show up, optimizing its usage can be key to keeping a high-performing renderer.  So, ensuring that each process pass engages the pixel shader at most once per pixel is a huge boon, and I was very pleased with the results.
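For reference, the depth-stencil state those follow-up passes use boils down to something like the sketch below.  This isn’t my exact setup code (pDevice and pContext are stand-ins for however you hold your device and immediate context), but the two fields that matter are the depth function and the write mask:

#include <d3d11.h>

D3D11_DEPTH_STENCIL_DESC lDepthStencilDesc = {};
lDepthStencilDesc.DepthEnable = TRUE;
lDepthStencilDesc.DepthWriteMask = D3D11_DEPTH_WRITE_MASK_ZERO;  //The Z pre-pass already built the depth buffer
lDepthStencilDesc.DepthFunc = D3D11_COMPARISON_EQUAL;            //Only shade pixels that survived it
lDepthStencilDesc.StencilEnable = FALSE;

ID3D11DepthStencilState* lDepthEqualState = nullptr;
pDevice->CreateDepthStencilState(&lDepthStencilDesc, &lDepthEqualState);

//Later, before drawing a material pass that should respect the pre-pass:
pContext->OMSetDepthStencilState(lDepthEqualState, 0);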

The other thing you can tell from the image is that I’ve started work on my post-processing system.  It’s not going to be terribly different from last year’s base system, but I’m looking to implement a new set of effects to give the game a filmic look.  It starts with the depth of field effect you can see here, but there’s a lot more in the works to eventually make the game look good.  I hope.  Also, with the re-organization and re-optimization of the whole system, I’m hoping to have the frame time available to do a LOT more post-processing this year.

So, that’s all for this time.  There were actually a lot of errors going on behind the scenes around this time that looked correct enough to not notice them right away, but which created huge problems when I tried to tweak data.  I’ll probably talk about them next time as I think that they’re pretty interesting, and might be useful to anyone implementing a similar system.  Also, I’ll be talking about my initial work on terrain, so look forward to that.  Hopefully it won’t take a month.

New Year, Old Problems

It’s a new school year, and my senior year, so of course I decided that the most reasonable course of action was to put together a new team to build a new game in a new custom engine.  And I’d heavily rebuild my rendering system, too!  I’m smart like that.  And with new work come new posts, so aren’t you all just so lucky?  Also, this post should have been made last Thursday as I’ve made pretty major strides since the information I’m about to talk about, but… school.

At the end of last year I was fairly happy with what I had accomplished, but there were also a lot of things I wanted to add and/or fix that I just never found the time for.  Story of everyone’s life.  My initial work has targeted two major refactor issues that are unsurprisingly intertwined: making the renderer a standalone library and multithreading.

Two years ago, I saw Reggie Meisler give a talk to Game Engine Architecture Club about a basic render-thread system in DX9 (sorry, I don’t have a link to the slides offhand) and it got me started thinking.  Then last year at GDC, I saw a great talk by Bryan Dudash about utilizing deferred contexts in DX11 and the gears really started turning.  I quickly realized that I could combine the knowledge from those two talks into my existing framework and get a lot of benefit, but I also knew that the messy interface (or lack thereof really) into my renderer prevented me from realistically ensuring thread-safety.  And that’s why that never happened last year.

The general concept can be illustrated in this amazing diagram I drew in MS Paint.

There are two layers of parallelism in my current system, with the ability to add more later as time and performance dictate.  The first layer is to put the actual rendering on a separate thread that constantly loops, consuming the last command list buffer sent to it.  That was the basis of Reggie’s talk and has been a common technique for a while.  The second layer is to utilize deferred contexts to build the command lists for each pass process in parallel, and is the basic implementation discussed in Bryan Dudash’s GDC presentation.  Of course, currently I’m only utilizing those two layers to draw a triangle into a render target and then composite that render target into the final presentation buffer, which is super impressive and all, but it provides a successful proof of concept that I can move forward from.
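If you haven’t used deferred contexts before, the per-pass flow is roughly the sketch below.  The names here are stand-ins (pDevice and pImmediateContext are whatever you hold your device and immediate context in) and error handling is stripped, but it’s the CreateDeferredContext / FinishCommandList / ExecuteCommandList trio that makes the second layer possible:

#include <d3d11.h>

//Each pass process gets its own deferred context (created once, reused each frame).
ID3D11DeviceContext* lDeferredContext = nullptr;
pDevice->CreateDeferredContext(0, &lDeferredContext);

//On a worker thread: record the pass into the deferred context exactly like normal...
//lDeferredContext->OMSetRenderTargets(...), lDeferredContext->DrawIndexed(...), etc.

//...then bake the recorded work into a command list.
ID3D11CommandList* lCommandList = nullptr;
lDeferredContext->FinishCommandList(FALSE, &lCommandList);

//On the render thread: play the command list back through the immediate context.
pImmediateContext->ExecuteCommandList(lCommandList, FALSE);
lCommandList->Release();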

It turns out that DX11 does a pretty great job of keeping itself thread-safe as long as you don’t try to use a single context across two threads, so the major task in getting this working was keeping my own data thread-safe.  This necessitated three things: a simple public interface into the renderer, a transfer system for the command list buffers, and a transfer system for entities and assets.  The public interface was simple once I refactored and reorganized my classes, and now allows the renderer to run as a standalone library, fulfilling one of my major initial desires with this refactor.

The command list buffer is a read/write/pivot transfer system: the game-loop side writes into the write layer, the render-loop side consumes the read layer, and each hand-off swaps the layer in question with the pivot.  This does introduce a lock into the system, but I used a Benaphore to keep it as lightweight as possible.  Here’s the code:

/*
Project: Graphics
Purpose: Declaration of the command list buffer container
Coder/s: Matt Sutherlin (matt.sutherlin@digipen.edu)
Copyright: “All content © 2013 DigiPen (USA) Corporation, all rights reserved.”
*/

#pragma once

#include "../Definitions/ProcessEnums.hpp"
#include "../Definitions/DirectXIncludes.hpp"
#include "../Renderer/Benaphore.hpp"

struct CommandListLayer
{
  CommandListLayer() {
    m_bIsDirty = false;

    for (unsigned lI = 0; lI < Passes::NumberOf; ++lI)
    {
      m_commandLists[lI] = nullptr;
    }
  }
  ~CommandListLayer() {
    for (unsigned lI = 0; lI < Passes::NumberOf; ++lI)
    {
      SAFE_RELEASE(m_commandLists[lI]);
    }
  }

  bool                m_bIsDirty;
  ID3D11CommandList*  m_commandLists[Passes::NumberOf];
};

class CommandListBuffer
{
public:
  CommandListBuffer() {
    for (unsigned lI = 0; lI < CommandListLayers::NumberOf; ++lI)
    {
      m_layers[lI] = new CommandListLayer();
    }
  }
  ~CommandListBuffer() {
    for (unsigned lI = 0; lI < CommandListLayers::NumberOf; ++lI)
    {
      delete m_layers[lI];
    }
  }
  ID3D11CommandList** GetLayerCommandLists(CommandListLayers::Layer pLayer) {
    if (pLayer == CommandListLayers::Read)
    {
      m_lock.Lock();

      if (m_layers[CommandListLayers::Pivot]->m_bIsDirty)
      {
        m_layers[CommandListLayers::Read]->m_bIsDirty = false;

        std::swap(m_layers[CommandListLayers::Read], m_layers[CommandListLayers::Pivot]);
      }

      m_lock.Unlock();
    }

    return m_layers[pLayer]->m_commandLists;
  }
  void SwapWriteLayer() {
    m_lock.Lock();

    m_layers[CommandListLayers::Write]->m_bIsDirty = true;

    std::swap(m_layers[CommandListLayers::Write], m_layers[CommandListLayers::Pivot]);

    m_lock.Unlock();
  }
private:
  CommandListLayer*   m_layers[CommandListLayers::NumberOf];
  Benaphore           m_lock;
};
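Since the Benaphore gets used up there but isn’t shown, here’s a minimal sketch of the idea: an interlocked counter gates a Win32 semaphore, so the kernel object only gets touched when there’s actual contention.  This isn’t my exact class, just the classic pattern it’s built on:

#include <windows.h>

class Benaphore
{
public:
  Benaphore() {
    m_counter = 0;
    m_semaphore = CreateSemaphore(NULL, 0, 0x7fffffff, NULL);
  }
  ~Benaphore() {
    CloseHandle(m_semaphore);
  }
  void Lock() {
    //Uncontended case: the counter goes 0 -> 1 and we never touch the kernel.
    if (InterlockedIncrement(&m_counter) > 1)
    {
      WaitForSingleObject(m_semaphore, INFINITE);
    }
  }
  void Unlock() {
    //If someone else is waiting, wake exactly one of them.
    if (InterlockedDecrement(&m_counter) > 0)
    {
      ReleaseSemaphore(m_semaphore, 1, NULL);
    }
  }
private:
  LONG    m_counter;
  HANDLE  m_semaphore;
};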

The entity/asset system was by far the most complicated to implement, but it came down to a single container class that enqueued data changes (add, update, or remove) until a synchronization was called.  Here is the code:

/*
Project: Graphics
Purpose: Declaration of the transfer buffer container
Coder/s: Matt Sutherlin (matt.sutherlin@digipen.edu)
Copyright: “All content © 2013 DigiPen (USA) Corporation, all rights reserved.”
*/

#pragma once

#include <unordered_map>
#include <queue>
#include <type_traits>
#include <cstring>

//Q01:  How does this work?
//A01:  The game engine could potentially make requests for adding, updating, or
//      deleting resources at any time.  While the ID3D11Device is thread-safe, we 
//      still need to be careful with resource management.
//
//      Adds can create their new resources immediately (and need to do so to return 
//      a resource ID to the caller), but should not be added to traversal lists 
//      while process threads are running.  So we defer that until the next game loop.
//
//      Updates should only occur at the start of a new game loop, so they're deferred 
//      until the game thread calls for a synch.  We need to ensure the data we're 
//      traversing doesn't get changed out from under us, so we synch this before 
//      producer threads start for the frame.
//
//      Deletes have two levels of synchronization to deal with.  Removing the resource 
//      from traversal lists needs to happen at the next game loop and removing it 
//      from memory needs to happen at the next render loop after that.
//
//Q02:  What are the limitations?
//A02:  t_entry objects need to have an UpdateData function, and that function needs to 
//      take a t_data object.
//
//      t_id objects need to be able to be initialized by setting = 0, and need to 
//      properly increment when post-incremented.
//
//      t_entry and t_data objects MUST be pointer types.

template <typename t_entry, typename t_data, typename t_id>
class TransferBuffer
{
  typedef std::pair<t_id, t_entry>                              t_entryPair;
  typedef std::pair<t_id, t_data>                               t_dataPair;
  typedef typename std::unordered_map<t_id, t_entry>::iterator  t_iterator;
private:
  std::unordered_map<t_id, t_entry>   m_entries;
  std::queue<t_entryPair>             m_pendingAdditions;
  std::queue<t_id>                    m_markedDeletions;
  std::queue<t_entry>                 m_pendingDeletions;
  std::queue<t_dataPair>              m_pendingUpdates;
  t_id                                m_nextID;
public:
  TransferBuffer() {
    m_nextID = 0;
  }

  ~TransferBuffer() {

  }

  //This should only ever be called by the game engine!
  t_id AddEntry(t_data pData) {
    t_id lReturnID = m_nextID++;

    t_entry lEntry = new typename std::remove_pointer<t_entry>::type(pData);
    m_pendingAdditions.push(t_entryPair(lReturnID, lEntry));

    return lReturnID;
  }

  //This should only ever be called by the game engine!
  void RemoveEntry(t_id pID) {
    m_markedDeletions.push(pID);
  }

  //This should only ever be called by the game engine!
  void UpdateEntry(t_id pID, t_data pData) {
    t_data lData = new typename std::remove_pointer<t_data>::type();
    std::memcpy(lData, pData, sizeof(typename std::remove_pointer<t_data>::type));

    m_pendingUpdates.push(t_dataPair(pID, lData));
  }

  //This should only ever be called by parallel producer threads!
  t_iterator GetEntries() {
    return m_entries.begin();
  }

  t_iterator GetEnd() {
    return m_entries.end();
  }

  //This should only ever be called by the synchronous game thread!
  //Should get called once per game loop before threading deferred contexts
  void SynchAdd() {
    unsigned lSizeAdditions = m_pendingAdditions.size();

    for (unsigned lI = 0; lI < lSizeAdditions; ++lI)
    {
      t_entryPair lEntryPair = m_pendingAdditions.front();
      m_pendingAdditions.pop();
      m_entries.emplace(lEntryPair.first, lEntryPair.second);
    }
  }

  //This should only ever be called by the synchronous game thread!
  //Should get called directly after SynchAdd
  void SynchUpdate() {
    unsigned lSizeUpdates = m_pendingUpdates.size();

    for (unsigned lI = 0; lI < lSizeUpdates; ++lI)
    {
      t_dataPair lDataPair = m_pendingUpdates.front();
      m_pendingUpdates.pop();

      auto lIter = m_entries.find(lDataPair.first);

      if (lIter != m_entries.end())
      {
        lIter->second->UpdateData(lDataPair.second);
        delete lDataPair.second;
      }
    }
  }

  //This should only ever be called by synchronous game thread!
  //Should get called directly after SynchUpdate
  void SynchMarkedDelete() {
    unsigned lSizeDeletions = m_markedDeletions.size();

    for (unsigned lI = 0; lI < lSizeDeletions; ++lI)
    {
      t_id lID = m_markedDeletions.front();
      m_markedDeletions.pop();

      auto lIter = m_entries.find(lID);

      if (lIter != m_entries.end())
      {
        t_entry lEntry = lIter->second;
        m_pendingDeletions.push(lEntry);
        m_entries.erase(lIter);
      }
    }
  }

  //This should only ever be called by the parallel consumer thread!
  //Should only get called at the start of a render loop
  void SynchPendingDelete() {
    unsigned lSizeDeletions = m_pendingDeletions.size();

    for (unsigned lI = 0; lI < lSizeDeletions; ++lI)
    {
      t_entry lEntry = m_pendingDeletions.front();
      m_pendingDeletions.pop();
      delete lEntry;
    }
  }
};

And that’s how I solved my big thread-safety issue. The FAQ at the top of the file is worth reading, and while I’m preplanning for t_id to be some kind of GUID functor, I’m just using unsigned int for that type in all cases right now. I welcome any questions, comments, or criticisms of my methodology, but try not to be too hard on me as this is the first time I’ve actually posted code I’ve written.
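And since it might not be obvious how the pieces fit together from the class alone, here’s a hypothetical usage sketch.  RenderMesh and MeshData below are stand-ins for my real entry and data types; they just show the contract from Q02/A02 (pointer types, an UpdateData function, and construction from the data type), and the three phases are collapsed into one function for brevity even though they run on different threads:

//Hypothetical stand-in types, just to satisfy the limitations listed in Q02/A02.
struct MeshData
{
  float m_world[16];
};

struct RenderMesh
{
  RenderMesh(MeshData* pData) : m_data(*pData) {}
  void UpdateData(MeshData* pData) { m_data = *pData; }
  MeshData m_data;
};

void ExampleFrame(TransferBuffer<RenderMesh*, MeshData*, unsigned>& pMeshes, MeshData* pData)
{
  //Game thread, any time during the frame: the entry is created now, listed later.
  unsigned lMeshID = pMeshes.AddEntry(pData);
  pMeshes.UpdateEntry(lMeshID, pData);   //Deferred until SynchUpdate
  pMeshes.RemoveEntry(lMeshID);          //Deferred until SynchMarkedDelete

  //Game thread, once per frame, before the producer threads start.
  pMeshes.SynchAdd();
  pMeshes.SynchUpdate();
  pMeshes.SynchMarkedDelete();

  //Producer threads can then traverse the entries safely.
  for (auto lIter = pMeshes.GetEntries(); lIter != pMeshes.GetEnd(); ++lIter)
  {
    //Record draw calls for lIter->second here.
  }

  //Render thread, at the top of its next loop: actually free anything marked last frame.
  pMeshes.SynchPendingDelete();
}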

But that’s all for now. Like I said, I’ve actually managed to take the system much further in the last week, but I’m just swamped right now.  Hopefully I’ll have time to make the next post before that information is also out of date, but no promises.