diff --git a/README.md b/README.md
index d63a6a1..e2afd21 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,38 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Liam Dugan
+  * [LinkedIn](https://www.linkedin.com/in/liam-dugan-95a961135/), [personal website](http://liamdugan.com/)
+* Tested on: Windows 10, Ryzen 5 1600 @ 3.20GHz 16GB, GTX 1070 16GB (Personal Computer)
 
-### (TODO: Your README)
+# Boid Flocking
+See INSTRUCTION.md for an explanation of the flocking algorithm
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+![](images/BoidVidLong.gif)
+
+# Questions
+### For each implementation, how does changing the number of boids affect performance?
+
+![](images/numPart.png) 
+* NOTE: Test carried out with block size 128
+
+For a small number of Boids, the Naive approach dominates. This is likely due to the Naive approach having significantly less computational overhead. However when the number of Boids becomes large, the Naive approach is completely eclipsed by the Uniform Grid and Coherent Grid implementations. This is because, while they may be slower than Naive on the worst case, these two implementations only check the particles in the neighboring grid cubes, which is highly likely to be much less than the total number of particles.
+
+### For each implementation, how does changing the block count and block size affect performance?
+
+![](images/numThread.png)
+* NOTE: Test carried out with 10,000 Particles (Boids)
+
+For a small number of threads and a large number of blocks, the performance stays generally the same. A block size of 32, 64, and 128, all perform similarly across the algorithms. However, when we start to get to very large block size we start to see a noticeable drop in performance. 
+
+This is likely due to the way we structured our program. In our algorithm if our thread index is larger than the number of particles (N), we return. This essentially leaves any thread with index > N sitting doing nothing. Therefore in order to optimize this algorithm we need to aim to have almost exactly N threads on each task. Since a block size of 512 would only allow us to add threads in increments of 512, we would run the risk of going over N by a significant amount and wasting resources in the process.
+
+### Did you experience any performance improvements with the more coherent uniform grid? Was this the outcome you expected? Why or why not?
+
+Yes I experienced a noticeable performance improvement with the coherent grid. While is it counterintuitive for this to be the faster option (extra movement of data just to save one pointer dereference), it makes sense when you think about the larger context of the GPU and its bottlenecks. 
+
+The largest bottleneck for the GPU is memory access and so we want our algorithm to do as few pointer dereferences as possible. When we shuffle the data to make it coherent, we are doing 2 * N pointer dereferences (per buffer) in order to save 1 dereference (per buffer) every time a neighbor point is accessed in the future flocking algorithm. Thus in order for the coherent approach to be worthwhile we need to average at least 2 neighbors per point, which we achieve as the number of particles in the simulation becomes large.
+
+### Did changing cell width and checking 27 vs 8 neighboring cells affect performance? Why or why not?
+
+While changing the cell width to be 1/2 the search radius would allow you to avoid checking an unnecessarily large search area for neighboring points, the extra overhead it would create during the other steps of the algorithm (multiplying length of grid cell start/end index buffers by 8) would nullify any benefit it could have in all but the most extreme worst case scenarios. 
diff --git a/images/BoidVidLong.gif b/images/BoidVidLong.gif
new file mode 100644
index 0000000..5f473b8
Binary files /dev/null and b/images/BoidVidLong.gif differ
diff --git a/images/Boids.PNG b/images/Boids.PNG
new file mode 100644
index 0000000..76ff999
Binary files /dev/null and b/images/Boids.PNG differ
diff --git a/images/numPart.png b/images/numPart.png
new file mode 100644
index 0000000..833d408
Binary files /dev/null and b/images/numPart.png differ
diff --git a/images/numThread.png b/images/numThread.png
new file mode 100644
index 0000000..7e46717
Binary files /dev/null and b/images/numThread.png differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index fdd636d..b737097 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -10,5 +10,5 @@ set(SOURCE_FILES
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_61
     )
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..7f1602e 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -6,7 +6,6 @@
 #include "utilityCore.hpp"
 #include "kernel.h"
 
-// LOOK-2.1 potentially useful for doing grid-based neighbor search
 #ifndef imax
 #define imax( a, b ) ( ((a) > (b)) ? (a) : (b) )
 #endif
@@ -39,8 +38,7 @@ void checkCUDAError(const char *msg, int line = -1) {
 /*! Block size used for CUDA kernel launch. */
 #define blockSize 128
 
-// LOOK-1.2 Parameters for the boids algorithm.
-// These worked well in our reference implementation.
+// Parameters for the boids algorithm.
 #define rule1Distance 5.0f
 #define rule2Distance 3.0f
 #define rule3Distance 5.0f
@@ -61,33 +59,28 @@ void checkCUDAError(const char *msg, int line = -1) {
 int numObjects;
 dim3 threadsPerBlock(blockSize);
 
-// LOOK-1.2 - These buffers are here to hold all your boid information.
-// These get allocated for you in Boids::initSimulation.
-// Consider why you would need two velocity buffers in a simulation where each
-// boid cares about its neighbors' velocities.
-// These are called ping-pong buffers.
+// Buffers for the boid information.
 glm::vec3 *dev_pos;
 glm::vec3 *dev_vel1;
 glm::vec3 *dev_vel2;
 
-// LOOK-2.1 - these are NOT allocated for you. You'll have to set up the thrust
-// pointers on your own too.
+// Ping-pong buffer for shuffling when coherent
+glm::vec3 *dev_shuffle_pos;
+glm::vec3 *dev_shuffle_vel1;
+glm::vec3 *dev_shuffle_vel2;
 
 // For efficient sorting and the uniform grid. These should always be parallel.
 int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle?
 int *dev_particleGridIndices; // What grid cell is this particle in?
-// needed for use with thrust
+
+// used in key-value pair sort
 thrust::device_ptr<int> dev_thrust_particleArrayIndices;
 thrust::device_ptr<int> dev_thrust_particleGridIndices;
 
 int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs
 int *dev_gridCellEndIndices;   // to this cell?
 
-// TODO-2.3 - consider what additional buffers you might need to reshuffle
-// the position and velocity data to be coherent within cells.
-
-// LOOK-2.1 - Grid parameters based on simulation parameters.
-// These are automatically computed for you in Boids::initSimulation
+// These are automatically computed in Boids::initSimulation
 int gridCellCount;
 int gridSideCount;
 float gridCellWidth;
@@ -109,7 +102,6 @@ __host__ __device__ unsigned int hash(unsigned int a) {
 }
 
 /**
-* LOOK-1.2 - this is a typical helper function for a CUDA kernel.
 * Function for generating a random vec3.
 */
 __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) {
@@ -120,7 +112,6 @@ __host__ __device__ glm::vec3 generateRandomVec3(float time, int index) {
 }
 
 /**
-* LOOK-1.2 - This is a basic CUDA kernel.
 * CUDA kernel for generating boids with a specified mass randomly around the star.
 */
 __global__ void kernGenerateRandomPosArray(int time, int N, glm::vec3 * arr, float scale) {
@@ -140,8 +131,7 @@ void Boids::initSimulation(int N) {
   numObjects = N;
   dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
 
-  // LOOK-1.2 - This is basic CUDA memory management and error checking.
-  // Don't forget to cudaFree in  Boids::endSimulation.
+  // Malloc space for all buffers
   cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3));
   checkCUDAErrorWithLine("cudaMalloc dev_pos failed!");
 
@@ -151,12 +141,21 @@ void Boids::initSimulation(int N) {
   cudaMalloc((void**)&dev_vel2, N * sizeof(glm::vec3));
   checkCUDAErrorWithLine("cudaMalloc dev_vel2 failed!");
 
-  // LOOK-1.2 - This is a typical CUDA kernel invocation.
+  cudaMalloc((void**)&dev_shuffle_pos, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_shuffle_pos failed!");
+
+  cudaMalloc((void**)&dev_shuffle_vel1, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_shuffle_vel1 failed!");
+
+  cudaMalloc((void**)&dev_shuffle_vel2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_shuffle_vel2 failed!");
+
+  // Generate random starting position for the boids
   kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
     dev_pos, scene_scale);
   checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
 
-  // LOOK-2.1 computing grid params
+  // computing grid params
   gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
   int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
   gridSideCount = 2 * halfSideCount;
@@ -168,7 +167,22 @@ void Boids::initSimulation(int N) {
   gridMinimum.y -= halfGridWidth;
   gridMinimum.z -= halfGridWidth;
 
-  // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  // Additional buffers for the uniform grid and coherent grid methods
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+  dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
   cudaDeviceSynchronize();
 }
 
@@ -224,31 +238,102 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 ******************/
 
 /**
-* LOOK-1.2 You can use this as a helper for kernUpdateVelocityBruteForce.
-* __device__ code can be called from a __global__ context
 * Compute the new velocity on the body with index `iSelf` due to the `N` boids
 * in the `pos` and `vel` arrays.
 */
 __device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+
+  // Initialize the vectors that will be edited by the three rules
+  glm::vec3 rule1_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule2_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule3_vel(0.0f, 0.0f, 0.0f);
+
+  // Initialize other state relating to the rules
+  glm::vec3 rule1_perceived_center(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule2_c_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule3_perceived_vel(0.0f, 0.0f, 0.0f);
+  
+  // Loop through all of the boids
+  int rule1Count = 0;
+  int rule3Count = 0;
+  for (int i = 0; i < N; ++i)
+  {
+    if (i != iSelf)
+    {
+      glm::vec3 difference = pos[i] - pos[iSelf];
+      float len = glm::length(difference);
+
+      // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+      if (len < rule1Distance)
+      {
+        rule1Count++;
+        rule1_perceived_center += pos[i];
+      }
+      
+      // Rule 2: boids try to stay a distance d away from each other
+      if (len < rule2Distance)
+      {
+        rule2_c_vel -= (pos[i] - pos[iSelf]);
+      }
+
+      // Rule 3: boids try to match the speed of surrounding boids
+      if (len < rule3Distance)
+      {
+        rule3Count++;
+        if (glm::length(vel[i]) > 0)
+        {
+          rule3_perceived_vel += vel[i];
+        }
+      }
+    }
+  }
+
+  // Scale the perceived center by the number of Boids
+  if (rule1Count > 0)
+  {
+    rule1_perceived_center /= rule1Count;
+    rule1_vel = (rule1_perceived_center - pos[iSelf]) * rule1Scale;
+  }
+
+  // Scale the percieved velocity of the flock by the number of Boids
+  if (rule3Count > 0)
+  {
+    rule3_perceived_vel /= rule3Count;
+    rule3_vel = rule3_perceived_vel * rule3Scale;
+  }
+
+  // Calculate the velocities from the rules
+  rule2_vel = rule2_c_vel * rule2Scale;
+
+  return vel[iSelf] + rule1_vel + rule2_vel + rule3_vel;
 }
 
 /**
-* TODO-1.2 implement basic flocking
 * For each of the `N` bodies, update its position based on its current velocity.
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
-  // Compute a new velocity based on pos and vel1
+  
+  // Get my index
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  // Compute my velocity
+  glm::vec3 newVelocity = computeVelocityChange(N, index, pos, vel1);
+
   // Clamp the speed
-  // Record the new velocity into vel2. Question: why NOT vel1?
+  if (glm::length(newVelocity) > maxSpeed)
+  {
+    newVelocity = (newVelocity / glm::length(newVelocity)) * maxSpeed;
+  }
+
+  // Record the new velocity into vel2
+  vel2[index] = newVelocity;
 }
 
 /**
-* LOOK-1.2 Since this is pretty trivial, we implemented it for you.
 * For each of the `N` bodies, update its position based on its current velocity.
 */
 __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
@@ -272,12 +357,6 @@ __global__ void kernUpdatePos(int N, float dt, glm::vec3 *pos, glm::vec3 *vel) {
   pos[index] = thisPos;
 }
 
-// LOOK-2.1 Consider this method of computing a 1D index from a 3D grid index.
-// LOOK-2.3 Looking at this method, what would be the most memory efficient
-//          order for iterating over neighboring grid cells?
-//          for(x)
-//            for(y)
-//             for(z)? Or some other order?
 __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
   return x + y * gridResolution + z * gridResolution * gridResolution;
 }
@@ -285,14 +364,33 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
 __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 gridMin, float inverseCellWidth,
   glm::vec3 *pos, int *indices, int *gridIndices) {
-    // TODO-2.1
-    // - Label each boid with the index of its grid cell.
-    // - Set up a parallel array of integer indices as pointers to the actual
-    //   boid data in pos and vel1/vel2
+
+  // get my index
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+  
+  // get my position relative to the minimum corner of the grid
+  glm::vec3 myPos = pos[index];
+  glm::vec3 gridLocation = myPos - gridMin;
+  float myPosX = gridLocation.x;
+  float myPosY = gridLocation.y;
+  float myPosZ = gridLocation.z;
+  
+  // get my grid cell in 1D index
+  int gridCellX = static_cast<int>(myPosX * inverseCellWidth);
+  int gridCellY = static_cast<int>(myPosY * inverseCellWidth);
+  int gridCellZ = static_cast<int>(myPosZ * inverseCellWidth);
+  int gridIndex = gridIndex3Dto1D(gridCellX, gridCellY, gridCellZ, gridResolution);
+
+  // Label me with the index of my grid cell.
+  gridIndices[index] = gridIndex;
+
+  // Set up a parallel array of integer indices as pointers to the actual boid data in pos and vel1/vel2
+  indices[index] = index;
 }
 
-// LOOK-2.1 Consider how this could be useful for indicating that a cell
-//          does not enclose any boids
 __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
   int index = (blockIdx.x * blockDim.x) + threadIdx.x;
   if (index < N) {
@@ -302,26 +400,200 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
 
 __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   int *gridCellStartIndices, int *gridCellEndIndices) {
-  // TODO-2.1
-  // Identify the start point of each cell in the gridIndices array.
-  // This is basically a parallel unrolling of a loop that goes
-  // "this index doesn't match the one before it, must be a new cell!"
+
+  // get my index
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  // now just check that index, and the indices to the left and right
+  int myGridIndex = particleGridIndices[index];
+
+  // if the index to the left is different, or if there isn't an index to the left
+  // then mark this as the start of grid #myGridIndex
+  if (index == 0 || (index > 0 && myGridIndex != particleGridIndices[index - 1]))
+  {
+    gridCellStartIndices[myGridIndex] = index;
+  }
+
+  // if the index to the right is different, or if there isn't an index to the right 
+  // then mark this as the end of grid #myGridIndex
+  if (index == (N-1) || (index < (N - 1) && myGridIndex != particleGridIndices[index + 1]))
+  {
+    gridCellEndIndices[myGridIndex] = index;
+  }
 }
 
+/*
+ * Update a boid's velocity using the uniform grid to reduce the number of boids that need to be checked.
+ */
 __global__ void kernUpdateVelNeighborSearchScattered(
   int N, int gridResolution, glm::vec3 gridMin,
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   int *particleArrayIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
-  // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+ 
+  // first, as per usual, get my index
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  // Initialize the vectors that will be edited by the three rules
+  glm::vec3 rule1_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule2_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule3_vel(0.0f, 0.0f, 0.0f);
+
+  // Initialize other state relating to the rules
+  glm::vec3 rule1_perceived_center(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule2_c_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule3_perceived_vel(0.0f, 0.0f, 0.0f);
+  
+  /*** Now we'll get the grid cell and the octant within that cell that this particle is in ***/
+
+  // get my position relative to the minimum corner of the grid
+  glm::vec3 myPos = pos[particleArrayIndices[index]];
+  glm::vec3 gridLocation = myPos - gridMin;
+  float myPosX = gridLocation.x;
+  float myPosY = gridLocation.y;
+  float myPosZ = gridLocation.z;
+  
+  // get my grid cell in 3D float index
+  float gridCellX = myPosX * inverseCellWidth;
+  float gridCellY = myPosY * inverseCellWidth;
+  float gridCellZ = myPosZ * inverseCellWidth;
+
+  // also get my grid cell in int index
+  int gridCellXIndex = static_cast<int>(gridCellX);
+  int gridCellYIndex = static_cast<int>(gridCellY);
+  int gridCellZIndex = static_cast<int>(gridCellZ);
+
+  // also get the middle of each one of those grid cells
+  float gridCellXMiddle = static_cast<float>(static_cast<int>(gridCellX)) + 0.5f;
+  float gridCellYMiddle = static_cast<float>(static_cast<int>(gridCellY)) + 0.5f;
+  float gridCellZMiddle = static_cast<float>(static_cast<int>(gridCellZ)) + 0.5f;
+  
+  // now determine which octant we are in by these three numbers
+  int xSign = (gridCellX < gridCellXMiddle) ? -1 : 1;
+  int ySign = (gridCellY < gridCellYMiddle) ? -1 : 1;
+  int zSign = (gridCellZ < gridCellZMiddle) ? -1 : 1;
+
+  /*** Now that we know our octant, we can check the neighbors in the neighboring cells ***/
+
+  // for loop over the 8 neighbors to check for neighboring boids
+  int rule1Count = 0;
+  int rule3Count = 0;
+  for (int i = 0; i < 8; ++i)
+  {
+    /*** first we have to figure out the 3D grid index of the neighbor we are looping over ***/
+
+    // initialize neighbor's index to our index
+    int neighborGridX = gridCellXIndex;
+    int neighborGridY = gridCellYIndex;
+    int neighborGridZ = gridCellZIndex;
+
+    // based on the three numbers from earlier, calculate the true neighbor grid index
+    if (i % 2 > 0)
+    {
+      neighborGridX += xSign;
+    }
+
+    if (i % 4 > 1)
+    {
+      neighborGridY += ySign;
+    }
+    
+    if (i % 8 > 3)
+    {
+      neighborGridZ += zSign;
+    }
+
+    // if the neighbor's grid index is invalid, then skip it
+    if (neighborGridX < 0 || neighborGridX > (gridResolution - 1) ||
+      neighborGridY < 0 || neighborGridY > (gridResolution - 1) ||
+      neighborGridZ < 0 || neighborGridZ > (gridResolution - 1))
+    {
+      continue;
+    }
+    
+    // now that we have the neighbor's index, get that in 1D and find out where it starts and ends in the buffers
+    int neighborGridIndex = gridIndex3Dto1D(neighborGridX, neighborGridY, neighborGridZ, gridResolution);
+    int neighborGridParticleStart = gridCellStartIndices[neighborGridIndex];
+    int neighborGridParticleEnd = gridCellEndIndices[neighborGridIndex];
+
+    // if that grid cell doesn't have any particles in it or if there is some error then that's fine, just skip it
+    if (neighborGridParticleStart == -1 || neighborGridParticleEnd == -1 || neighborGridParticleStart > neighborGridParticleEnd)
+    {
+      continue;
+    }
+ 
+    // for each particle within that neighboring grid cell
+    for (int j = neighborGridParticleStart; j <= neighborGridParticleEnd; ++j)
+    {
+      // make sure it's not us
+      if (j != index)
+      {
+
+        /*** apply the three rules ***/
+
+        glm::vec3 difference = pos[particleArrayIndices[j]] - pos[particleArrayIndices[index]];
+        float len = glm::length(difference);
+     
+        // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+        if (len < rule1Distance)
+        {
+          rule1Count++;
+          rule1_perceived_center += pos[particleArrayIndices[j]];
+        }
+
+        // Rule 2: boids try to stay a distance d away from each other
+        if (len < rule2Distance)
+        {
+          rule2_c_vel -= (pos[particleArrayIndices[j]] - pos[particleArrayIndices[index]]);
+        }
+
+        // Rule 3: boids try to match the speed of surrounding boids
+        if (len < rule3Distance)
+        {
+          rule3Count++;
+          if (glm::length(vel1[particleArrayIndices[j]]) > 0)
+          {
+            rule3_perceived_vel += vel1[particleArrayIndices[j]];
+          }
+        }
+      }
+    }
+  }
+
+  // Scale rules 1 and 3 by the number of Neighbors
+  if (rule1Count > 0)
+  {
+    rule1_perceived_center /= rule1Count;
+    rule1_vel = (rule1_perceived_center - pos[particleArrayIndices[index]]) * rule1Scale;
+  }
+
+  if (rule3Count > 0)
+  {
+    rule3_perceived_vel /= rule3Count;
+    rule3_vel = rule3_perceived_vel * rule3Scale;
+  }
+
+  // Rule 2 doesn't need to be scaled
+  rule2_vel = rule2_c_vel * rule2Scale;
+
+  // calculate the new velocity by adding up all the rule results
+  glm::vec3 newVelocity = vel1[particleArrayIndices[index]] + rule1_vel + rule2_vel + rule3_vel;
+  
+  // scale the velocity to account for max speed
+  if (glm::length(newVelocity) > maxSpeed)
+  {
+    newVelocity = (newVelocity / glm::length(newVelocity)) * maxSpeed;
+  }
+
+  // put it into vel2 so we can later ping-pong it
+  vel2[particleArrayIndices[index]] = newVelocity;
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,72 +601,347 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  // directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+
+  // first, as per usual, get my index
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  // Initialize the vectors that will be edited by the three rules
+  glm::vec3 rule1_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule2_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule3_vel(0.0f, 0.0f, 0.0f);
+
+  // Initialize other state relating to the rules
+  glm::vec3 rule1_perceived_center(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule2_c_vel(0.0f, 0.0f, 0.0f);
+  glm::vec3 rule3_perceived_vel(0.0f, 0.0f, 0.0f);
+
+  /*** Now we'll get the grid cell and the octant within that cell that this particle is in ***/
+  
+  // get my position relative to the minimum corner of the grid
+  glm::vec3 myPos = pos[index];
+  glm::vec3 gridLocation = myPos - gridMin;
+  float myPosX = gridLocation.x;
+  float myPosY = gridLocation.y;
+  float myPosZ = gridLocation.z;
+  
+  // get my grid cell in 3D float index
+  float gridCellX = myPosX * inverseCellWidth;
+  float gridCellY = myPosY * inverseCellWidth;
+  float gridCellZ = myPosZ * inverseCellWidth;
+
+  // also get my grid cell in int index
+  int gridCellXIndex = static_cast<int>(gridCellX);
+  int gridCellYIndex = static_cast<int>(gridCellY);
+  int gridCellZIndex = static_cast<int>(gridCellZ);
+
+  // also get the middle of each one of those grid cells
+  float gridCellXMiddle = static_cast<float>(static_cast<int>(gridCellX)) + 0.5f;
+  float gridCellYMiddle = static_cast<float>(static_cast<int>(gridCellY)) + 0.5f;
+  float gridCellZMiddle = static_cast<float>(static_cast<int>(gridCellZ)) + 0.5f;
+  
+  // now determine which octant we are in by the three numbers
+  int xSign = (gridCellX < gridCellXMiddle) ? -1 : 1;
+  int ySign = (gridCellY < gridCellYMiddle) ? -1 : 1;
+  int zSign = (gridCellZ < gridCellZMiddle) ? -1 : 1;
+
+  /*** Now that we know our octant, we can check the neighbors in the neighboring cells ***/
+
+  // for loop over the 8 neighbors to check for neighboring boids
+  int rule1Count = 0;
+  int rule3Count = 0;
+  for (int i = 0; i < 8; ++i)
+  {
+    /*** first we have to figure out the 3D grid index of the neighbor we are looping over ***/
+
+    // initialize neighbor's index to our index
+    int neighborGridX = gridCellXIndex;
+    int neighborGridY = gridCellYIndex;
+    int neighborGridZ = gridCellZIndex;
+
+    // based on the three numbers from earlier, calculate the true neighbor grid index
+    if (i % 2 > 0)
+    {
+      neighborGridX += xSign;
+    }
+
+    if (i % 4 > 1)
+    {
+      neighborGridY += ySign;
+    }
+    
+    if (i % 8 > 3)
+    {
+      neighborGridZ += zSign;
+    }
+
+    // if the neighbor's grid index is invalid, then skip it
+    if (neighborGridX < 0 || neighborGridX > (gridResolution - 1) ||
+      neighborGridY < 0 || neighborGridY > (gridResolution - 1) ||
+      neighborGridZ < 0 || neighborGridZ > (gridResolution - 1))
+    {
+      continue;
+    }
+    
+    // now that we have the neighbor's index, get that in 1D and find out where it starts and ends in the buffers
+    int neighborGridIndex = gridIndex3Dto1D(neighborGridX, neighborGridY, neighborGridZ, gridResolution);
+    int neighborGridParticleStart = gridCellStartIndices[neighborGridIndex];
+    int neighborGridParticleEnd = gridCellEndIndices[neighborGridIndex];
+
+    // if that grid cell doesn't have any particles in it or if there is some error then that's fine, just skip it
+    if (neighborGridParticleStart == -1 || neighborGridParticleEnd == -1 || neighborGridParticleStart > neighborGridParticleEnd)
+    {
+      continue;
+    }
+    
+    // for each particle within that neighboring grid cell
+    for (int j = neighborGridParticleStart; j <= neighborGridParticleEnd; ++j)
+    {
+      // make sure it's not us
+      if (j != index)
+      {
+
+        /*** apply the three rules ***/
+
+        glm::vec3 difference = pos[j] - pos[index];
+        float len = glm::length(difference);
+     
+        // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
+        if (len < rule1Distance)
+        {
+          rule1Count++;
+          rule1_perceived_center += pos[j];
+        }
+
+        // Rule 2: boids try to stay a distance d away from each other
+        if (len < rule2Distance)
+        {
+          rule2_c_vel -= (pos[j] - pos[index]);
+        }
+
+        // Rule 3: boids try to match the speed of surrounding boids
+        if (len < rule3Distance)
+        {
+          rule3Count++;
+          if (glm::length(vel1[j]) > 0)
+          {
+            rule3_perceived_vel += vel1[j];
+          }
+        }
+      }
+    }
+  }
+
+  // Scale rules 1 and 3 by the number of Neighbors
+  if (rule1Count > 0)
+  {
+    rule1_perceived_center /= rule1Count;
+    rule1_vel = (rule1_perceived_center - pos[index]) * rule1Scale;
+  }
+
+  if (rule3Count > 0)
+  {
+    rule3_perceived_vel /= rule3Count;
+    rule3_vel = rule3_perceived_vel * rule3Scale;
+  }
+
+  // Rule 2 doesn't need to be scaled
+  rule2_vel = rule2_c_vel * rule2Scale;
+
+  // calculate the new velocity by adding up all the rule results
+  glm::vec3 newVelocity = vel1[index] + rule1_vel + rule2_vel + rule3_vel;
+  
+  // scale the velocity to account for max speed
+  if (glm::length(newVelocity) > maxSpeed)
+  {
+    newVelocity = (newVelocity / glm::length(newVelocity)) * maxSpeed;
+  }
+
+  // put it into vel2 so we can ping-pong it later
+  vel2[index] = newVelocity;
+}
+
+/*
+ * Takes pos, vel1, vel2 and shuffles them based on the indices in particleArrayIndices
+ * editing pos_shuffle, vel1_shuffle, and vel2_shuffle in the process
+ */
+__global__ void kernShuffle(int N, int* particleArrayIndices, glm::vec3* pos,
+  glm::vec3* pos_shuffle, glm::vec3* vel1, glm::vec3* vel1_shuffle,
+  glm::vec3* vel2, glm::vec3* vel2_shuffle) {
+
+  // get the index of the thing to switch
+  int index = threadIdx.x + (blockIdx.x * blockDim.x);
+  if (index >= N) {
+    return;
+  }
+
+  // shuffle all of the indices in parallel
+  pos_shuffle[index] = pos[particleArrayIndices[index]];
+  vel1_shuffle[index] = vel1[particleArrayIndices[index]];
+  vel2_shuffle[index] = vel2[particleArrayIndices[index]];
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  // Brute force the velocity Naively
+  kernUpdateVelocityBruteForce << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_pos, dev_vel1, dev_vel2);
+
+  // Update position
+  kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel1);
+
+  // ping-pong the velocity buffers
+  glm::vec3 *temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  // Assign all particles a grid index
+  kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum,
+                                                            gridInverseCellWidth, dev_pos, dev_particleArrayIndices, 
+                                                            dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  cudaDeviceSynchronize();
+
+  // Unstable key sort using Thrust
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + (numObjects - 1), 
+                      dev_thrust_particleArrayIndices);
+  checkCUDAErrorWithLine("thrust sort_by_key failed!");
+
+  // reset the grid start and end indices back to their defaults of -1 to denote empty grid
+  kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer failed!");
+
+  kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer failed!");
+
+  cudaDeviceSynchronize();
+
+  // find the start and end indices of each grid
+  kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices,
+                                                                  dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  cudaDeviceSynchronize();  
+
+  // Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchScattered << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount,
+                                                                              gridMinimum, gridInverseCellWidth, gridCellWidth, 
+                                                                              dev_gridCellStartIndices, dev_gridCellEndIndices, 
+                                                                              dev_particleArrayIndices, dev_pos, 
+                                                                              dev_vel1, dev_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+ 
+  cudaDeviceSynchronize();
+
+  // Update positions
+  kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos, dev_vel1);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+  // Ping-pong vel1 and vel2
+  glm::vec3 *temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
 }
 
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+  dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+
+  // Assign all particles a grid index
+  kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum,
+                                                            gridInverseCellWidth, dev_pos, dev_particleArrayIndices, 
+                                                            dev_particleGridIndices);
+  checkCUDAErrorWithLine("kernComputeIndices failed!");
+
+  cudaDeviceSynchronize();
+
+  // Unstable key sort using Thrust
+  thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + (numObjects - 1), 
+                      dev_thrust_particleArrayIndices);
+  checkCUDAErrorWithLine("thrust sort_by_key failed!");
+
+  // reset the grid start and end indices back to their defaults of -1 to denote empty grid
+  kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellStartIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer failed!");
+
+  kernResetIntBuffer << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_gridCellEndIndices, -1);
+  checkCUDAErrorWithLine("kernResetIntBuffer failed!");
+
+  cudaDeviceSynchronize();
+
+  // find the start and end indices of each grid
+  kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices,
+                                                                  dev_gridCellStartIndices, dev_gridCellEndIndices);
+  checkCUDAErrorWithLine("kernIdentifyCellStartEnd failed!");
+
+  cudaDeviceSynchronize();  
+  
+  // COHERENT ONLY: reshuffle all of the particle data
+  kernShuffle << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos, 
+                                                    dev_shuffle_pos, dev_vel1, dev_shuffle_vel1, dev_vel2, dev_shuffle_vel2);
+  checkCUDAErrorWithLine("kernShuffle failed!");
+
+  cudaDeviceSynchronize();
+
+  // Perform velocity updates using neighbor search
+  kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum, 
+                                                                             gridInverseCellWidth, gridCellWidth, 
+                                                                             dev_gridCellStartIndices, dev_gridCellEndIndices,
+                                                                             dev_shuffle_pos, dev_shuffle_vel1, dev_shuffle_vel2);
+  checkCUDAErrorWithLine("kernUpdateVelNeighborSearchScattered failed!");
+ 
+  cudaDeviceSynchronize();
+
+  // Update positions
+  kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_shuffle_pos, dev_shuffle_vel1);
+  checkCUDAErrorWithLine("kernUpdatePos failed!");
+
+
+  // ping-pong the newly shuffled buffers
+  glm::vec3 *pingPongTemp1 = dev_pos;
+  glm::vec3 *pingPongTemp2 = dev_vel1;
+  glm::vec3 *pingPongTemp3 = dev_vel2;
+  dev_pos = dev_shuffle_pos;
+  dev_vel1 = dev_shuffle_vel1;
+  dev_vel2 = dev_shuffle_vel2;
+  dev_shuffle_pos = pingPongTemp1;
+  dev_shuffle_vel1 = pingPongTemp2;
+  dev_shuffle_vel2 = pingPongTemp3;
+
+  // ping pong vel1 and vel2
+  glm::vec3 *temp = dev_vel1;
+  dev_vel1 = dev_vel2;
+  dev_vel2 = temp;
 }
 
 void Boids::endSimulation() {
+  // free main buffers
   cudaFree(dev_vel1);
   cudaFree(dev_vel2);
   cudaFree(dev_pos);
 
-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  // free ping-pong buffers
+  cudaFree(dev_shuffle_vel1);
+  cudaFree(dev_shuffle_vel2);
+  cudaFree(dev_shuffle_pos);
+
+  // free grid state buffers
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
 }
 
 void Boids::unitTest() {
-  // LOOK-1.2 Feel free to write additional tests here.
-
   // test unstable sort
   int *dev_intKeys;
   int *dev_intValues;
@@ -435,7 +982,6 @@ void Boids::unitTest() {
   // Wrap device vectors in thrust iterators for use with thrust.
   thrust::device_ptr<int> dev_thrust_keys(dev_intKeys);
   thrust::device_ptr<int> dev_thrust_values(dev_intValues);
-  // LOOK-2.1 Example for using thrust::sort_by_key
   thrust::sort_by_key(dev_thrust_keys, dev_thrust_keys + N, dev_thrust_values);
 
   // How to copy data back to the CPU side from the GPU
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..a5e5ba8 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -12,13 +12,13 @@
 // Configuration
 // ================
 
-// LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
+// toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
-// LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+// change this to adjust particle count in the simulation
+const int N_FOR_VIS = 10000;
 const float DT = 0.2f;
 
 /**
@@ -217,7 +217,7 @@ void initShaders(GLuint * program) {
     double timebase = 0;
     int frame = 0;
 
-    Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
+    Boids::unitTest(); // We run some basic example code to make sure
                        // your CUDA development setup is ready to go.
 
     while (!glfwWindowShouldClose(window)) {