Sunday, 3 August 2014

OpenCL on the Parallella using Eclipse

Writing OpenCL code on the Parallella board using Eclipse for people who just want to write code and not become Linux experts.

Finally, what I would call my first "execution" blog post. I got my board a couple of months ago and while it's been fun playing with it, achieving the ultimate goal - writing code - has been frustrating to say the least.

I've tried a number of different approaches only to realise (yet again) that keeping it simple while you don't know what you are doing is by far the best approach. 

Where I ended up last week was: "how about I try to get some existing code running in Eclipse on the Parallella?" Hopefully, the code would be compilable and correct so the only thing I'd have to get right would be the settings. To this end I found a short program that generated a colourised Mandelbrot set, posted by dar on the Parallella site. Don't try to find that original code because it has some pretty glaring errors. If you want to try this procedure, use the code below.

This really is a beginners guide. If you are already familiar with Eclipse you won't need most of this. I also don't make any claim about how efficient the code is.

0. Before you begin

The prerequisites are:

  • the Parallella SDK
  • the Brown Deer Technology SDK
  • the environment variables PATH and LD_LIBRARY_PATH set correctly
  • the Eclipse development environment

The first three come with the Ubuntu 14.04 release and I'm assuming future releases will also include them. For the record, the environment variables on my system are:
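Roughly speaking, they need to put the SDK tools on the path and the SDK libraries on the loader path. The exact locations below are my assumptions based on the default install locations (/usr/local/browndeer for the Brown Deer SDK, /opt/adapteva/esdk for the Epiphany SDK) - check them against your own system:

```shell
# Assumed install locations - check yours. The Brown Deer SDK
# defaults to /usr/local/browndeer; the Epiphany SDK (esdk)
# commonly lives under /opt/adapteva/esdk.
export PATH=/usr/local/browndeer/bin:/opt/adapteva/esdk/tools/host/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/browndeer/lib:/opt/adapteva/esdk/tools/host/lib:${LD_LIBRARY_PATH}
```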



Installing Eclipse is a matter of:

sudo apt-get install eclipse

Once you have downloaded Eclipse you need the C/C++ development add-in, CDT. To get this you use Help | Install New Software and pack your patience - it is SLOOOOW.

One tricky thing: for a program to gain access to the Epiphany co-processor it must be run by root. That means that you must log in as root (using sudo won't work) and therefore you need a root password. To set the root password you run:

sudo passwd

which will give you the usual new password prompt followed by the are-you-really-sure prompt. Then to run your program you use the command:

su

from your non-root login (linaro by default). This will then prompt you for the root password. Once you have logged on as root, check the environment variables PATH and LD_LIBRARY_PATH. They are needed at run time so root must have them set as described above.

1. Your first Eclipse Project

1.1 Get Yourself a New Project

To start with you need a new C/C++ project. File | New | Project pops up the box where you choose what type of project you want. I always choose C++ but for this particular project a standard C project would be fine. I called my project mandelbrot (not very original, I know).

1.2 Set Up your Compile Settings

Not a lot to do here but absolutely critical.

Your compile settings are accessible from the Project | Properties menu. For Properties to be active you need to have your project selected in the project tab.

Your tool chain should look like this:

Note I've got the C++ compiler and linker in there.

Your includes should look like this:

This will tell the compiler to find all of the Brown Deer Technology stuff.

Finally, your linker settings should look like this:

I'm pretty sure that this is all you need. I did fumble around a lot with other settings so if this does not work please let me know.
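If the Eclipse settings screens don't mean much to you yet, the whole setup boils down to the equivalent of the following command line. The include and library paths are my assumptions based on a default Brown Deer install, and -lstdcl and -locl are the libraries the stdcl host code links against:

```shell
# Compile the host code (the include path supplies stdcl.h)
gcc -I/usr/local/browndeer/include -c mandelbrot.c -o mandelbrot.o

# Link against the Brown Deer stdcl and OpenCL libraries
gcc mandelbrot.o -L/usr/local/browndeer/lib -lstdcl -locl -o mandelbrot
```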

1.3 Some Code

For CL projects you need host code and Epiphany code. The host code is compiled using the gcc/g++ compilers/linker etc in the tool chain. The Epiphany code is compiled at run time by the Brown Deer JIT compiler.

I created a source file folder (src) under the mandelbrot project folder. For the embedded paths in the code to work you should do this as well. The host code I ended up with was:

// The modifications porting this code to OpenCL are
// Copyright (c) 2012 Brown Deer Technology, LLC.
// Mandelbrot.c
// Written by User:Evercat
// This draws the Mandelbrot set and spits out a .bmp file.
// Should be quite portable (endian issues have been taken
// care of, for example)
// Released under the GNU Free Documentation License
// or the GNU Public License, whichever you prefer:
// 9 February, 2004.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdcl.h>
#include <errno.h>

#define OUTFILE "./mandelbrot.bmp"

#define WIDTH 1024
#define HEIGHT 768

#define CENTRE_X -0.5
#define CENTRE_Y 0
#define ZOOM 300

#define ITERATIONS 1024  // Higher is more detailed, but slower...

// Plotting functions and parameters...

#define bailoutr(n) (5*n  )
#define bailoutg(n) (20*n  )
#define bailoutb(n) 0

#define min(a,b) ((a<b)?a:b)

// Colours for the set itself...

#define IN_SET_R 0
#define IN_SET_G 0
#define IN_SET_B 0

void drawbmp(int width, int height, unsigned char* pixels, char * filename);

/////////////////////////////////// MAIN PROGRAM ///////////////////////////////////

int main (void)
{
   float startx; float endx;
   float starty; float endy;
   float dx; float dy;
   float dx_over_width,dy_over_height;

   char kern[] = "../src/mandelbrot.cl"; // adjust if you named your CL file differently
   void * openHandle;

   int iterations = ITERATIONS;
   int width = WIDTH;
   int height = HEIGHT;

   char strInfo[20];
   FILE * pFile;

   pFile = fopen(kern, "r");
   if (pFile == NULL) {
      printf("Opening the kernel file %s produced an error (%d). Make sure that the source code variable kern has a valid path to the cl code and that the code is readable.\n", kern, errno);
      exit(1);
   }
   fclose(pFile); // only open the file to check that it is there and readable

   pFile = fopen("./debug", "w");

   fprintf(pFile, "About to malloc pixels\n");
   cl_uchar* pixels = (cl_uchar*) clmalloc(stdacc, width * height * 3, 0);

   startx = CENTRE_X - ((float) WIDTH / (ZOOM * 2));
   endx = CENTRE_X + ((float) WIDTH / (ZOOM * 2));

   starty = CENTRE_Y - ((float) HEIGHT / (ZOOM * 2));
   endy = CENTRE_Y + ((float) HEIGHT / (ZOOM * 2));

   fprintf(pFile, "Plotting from (%f, %f) to (%f, %f)\n", startx, starty, endx, endy);

   dx = endx - startx;
   dy = endy - starty;
   dx_over_width = dx / width;
   dy_over_height = dy / height;

   fprintf(pFile, "Opening kernel file:%s\n", kern);
   openHandle = clopen(stdacc, kern, CLLD_NOW);

   fprintf(pFile, "Getting the kernel with clsym\n");
   cl_kernel krn = clsym(stdacc, openHandle, "mandel_kern", CLLD_NOW);

   clGetKernelInfo(krn, CL_KERNEL_FUNCTION_NAME, sizeof(strInfo), strInfo, NULL);
   fprintf(pFile, "The kernel is called: %s\n", strInfo);

   fprintf(pFile, "Calling clndrange\n");
   clndrange_t ndr = clndrange_init1d(0, height, 16);

   fprintf(pFile, "Calling clforka\n");
   clforka(stdacc, 0, krn, &ndr, CL_EVENT_WAIT,
      iterations, width, startx, starty, dx_over_width, dy_over_height, pixels);

   fprintf(pFile, "Transferring memory contents from the Epiphany using clmsync\n");
   clmsync(stdacc, 0, pixels, CL_MEM_HOST|CL_EVENT_WAIT);

   fprintf(pFile, "Calling drawbmp\n");
   drawbmp(width, height, pixels, OUTFILE);

   fprintf(pFile, "Saved bitmap to %s. Done.\n", OUTFILE);

   fclose(pFile);
   return 0;
}

void drawbmp (int width, int height, unsigned char* pixels, char * filename) {

   unsigned int headers[13];
   FILE * outfile;
   int extrabytes;
   int paddedsize;
   int x; int y; int n;

   extrabytes = 4 - ((width * 3) % 4); // How many bytes of padding to add to
                                       // each horizontal line - the size of
                                       // which must be a multiple of 4 bytes.
   if (extrabytes == 4)
      extrabytes = 0;

   paddedsize = ((width * 3) + extrabytes) * height;

   // Headers...

   headers[0]  = paddedsize + 54;      // bfSize (whole file size)
   headers[1]  = 0;                    // bfReserved (both)
   headers[2]  = 54;                   // bfOffbits
   headers[3]  = 40;                   // biSize
   headers[4]  = width;  // biWidth
   headers[5]  = height; // biHeight
                                       // 6 will be written directly...
   headers[7]  = 0;                    // biCompression
   headers[8]  = paddedsize;           // biSizeImage
   headers[9]  = 0;                    // biXPelsPerMeter
   headers[10] = 0;                    // biYPelsPerMeter
   headers[11] = 0;                    // biClrUsed
   headers[12] = 0;                    // biClrImportant

   outfile = fopen (filename, "wb");

   // Headers begin...
   // When printing ints and shorts, write out 1 character at time to
   // avoid endian issues.

   fprintf (outfile, "BM");

   for (n = 0; n <= 5; n++) {
      fprintf(outfile, "%c", headers[n] & 0x000000FF);
      fprintf(outfile, "%c", (headers[n] & 0x0000FF00) >> 8);
      fprintf(outfile, "%c", (headers[n] & 0x00FF0000) >> 16);
      fprintf(outfile, "%c", (headers[n] & (unsigned int) 0xFF000000) >> 24);
   }

   // These next 4 characters are for the biPlanes and biBitCount fields.

   fprintf(outfile, "%c", 1);
   fprintf(outfile, "%c", 0);
   fprintf(outfile, "%c", 24);
   fprintf(outfile, "%c", 0);

   for (n = 7; n <= 12; n++) {
      fprintf(outfile, "%c", headers[n] & 0x000000FF);
      fprintf(outfile, "%c", (headers[n] & 0x0000FF00) >> 8);
      fprintf(outfile, "%c", (headers[n] & 0x00FF0000) >> 16);
      fprintf(outfile, "%c", (headers[n] & (unsigned int) 0xFF000000) >> 24);
   }

   // Headers done, now write the data...

   for (y = height - 1; y >= 0; y--) {  // BMPs are written bottom to top.
      for (x = 0; x <= width - 1; x++) {
         // Also, it's written in (b,g,r) format...
         fprintf(outfile, "%c", pixels[(x * 3) + 2 + (y * width * 3)]);
         fprintf(outfile, "%c", pixels[(x * 3) + 1 + (y * width * 3)]);
         fprintf(outfile, "%c", pixels[(x * 3) + 0 + (y * width * 3)]);
      }
      if (extrabytes) // See above - BMP lines must be of lengths divisible by 4
         for (n = 1; n <= extrabytes; n++)
            fprintf(outfile, "%c", 0);
   }

   fclose (outfile);
}

Create a new file in your project and paste this into it.

The Epiphany code is:

#define set_red(n) (5*n  )
#define set_green(n) (20*n  )
#define set_blue(n) 0

__kernel void mandel_kern(
   int iterations,
   int width,
   float startx, 
   float starty, 
   float dx, 
   float dy, 
   __global uchar* pixels)
{
   int threeXwidth = 3 * width;
   unsigned char line[threeXwidth];
   int i, j, n, m, pixelBase;
   float x, y, r, s, nextr, nexts;

   j = get_global_id(0);

   for (i = 0; i < width; i++) {

      x = startx + i*dx;
      y = starty + j*dy;

      r = x;
      s = y;

      for (n = 0; n < iterations; n++) {

         nextr = ((r * r) - (s * s)) + x;
         nexts = (2 * r * s) + y;
         r = nextr;
         s = nexts;
         if ((r * r) + (s * s) > 4) break;
      }

      if (n == iterations) n = 0;

      line[(i * 3) + 0] = min(255,set_red(n));
      line[(i * 3) + 1] = min(255,set_green(n));
      line[(i * 3) + 2] = min(255,set_blue(n));
   }

   pixelBase = j * threeXwidth;
   for (m = 0; m < threeXwidth; m++)
      pixels[pixelBase + m] = line[m];
}

Paste it into a new file called mandelbrot.cl (or whatever name you used in the kern variable in the host code). The compiler used by Eclipse won't do anything with this file; it is here just for convenience.

1.4 Save and Build

If you've gotten this far you should now be able to save and build the program without any problems.

2 The Fun Starts Here

2.1 Debugging Host Code

Nothing works first time right? Right! I know that we are all geniuses but even we slip up occasionally.

Compile time on the host side is pretty much as usual. Get the includes and libraries right and it will all compile and link.

Run time debugging is when things start to get tricky. Prior to the execution of any code on the Epiphany accelerator, using the Eclipse front end to the gdb debugger works as you would expect. However, the Epiphany expects the calling application to have root privileges. I started my project as an ordinary user (linaro) and when it came to starting Eclipse as root things got a bit sticky. Logging in as root in a terminal window and then calling Eclipse didn't work at all, and typing sudo eclipse from Run on the start menu got Eclipse running but then struck all sorts of permission problems. I reverted to old school debugging.

Notice I have opened a file called ./debug and I write a line to it before pretty much every call in the program. This is the cleaned-up version of the debug writes that I used to figure out what was going on. I write them to a file to separate my output from the text that the stdcl libraries produce on the console (especially when the cl code actually runs). This is generally enough and, if the Brown Deer documentation were a little more comprehensive, would see you through.

Another gap in the documentation is how you go about checking whether your calls actually worked. (Notice that the only check I do is to open the cl file to see if it is there and readable.) If something does not work you get a Segmentation Fault a bit later. Brown Deer's stdcl library takes away a lot of the verbosity of "generic" OpenCL code. For example, you don't have to create a context of your own; the library provides the stdacc global variable, which is the context for the Epiphany accelerator. However, what is lacking in the documentation is how to check whether the calls worked, e.g. whether the clopen command found and successfully compiled the code. It might be there. If I find it I'll edit this post or write a new one.

2.2 Debugging Epiphany Code

I'm glad that I didn't really have to do this much.

Compile time for the Epiphany code happens at run time for the host code, because of the JIT compiler. The output from the compiler is displayed on the calling console. It reminded me of learning Pascal at university in the early 80's. I was fortunate that the cl code was correct enough for the compile time error to be obvious, and with a little nutting out the changes I had to make were not too onerous. I think the lesson here is "keep your kernels as simple as you can".

For run time debugging on the Epiphany there is a version of gdb called e-gdb. I did not have to go that far for this project but I think it will be key when things get more complicated. Andreas Olofsson got back to me about the developments in this area: there is significant work being done and things will get a lot easier in the near future. I think that for the moment the best approach is to get your code running as best you can in a friendly environment before you package it up into a kernel. Just make sure that you are only using the somewhat limited set of functions that are available in CL (the little copy loop at the end of the cl code above replaces a memcpy call in the original code).

I think I'll write another blog post once I get some idea about debugging on the Epiphany.

3. Final Thoughts

This guide is really only a toe in the water. So far I have only used the stdcl library and not gone anywhere near the COPRTHR library. There's lots of good stuff there that I have not gotten into yet.

The ARM cores on the Parallella are fine for a little snippet like the example but a bit too low powered for significant development. Cross compilation from a more powerful machine is the next step and I hope to write another blog entry on that in the near future.

Oh and one final thing... the output:

Friday, 20 December 2013

Sorting of Spatial Data for Meshed Multi-Core Processors


Many algorithms that operate on two dimensional data sets are well suited to multi-core processors such as the Parallella. The nature of raw two dimensional data is that it generally arrives as an unsorted "point cloud" where neither the x nor the y axis is sorted. In order to gain the most efficient use of the processor, the points must be sorted so that each core can work on a subset of the space while minimising the need for inter-core communication. Therefore it makes sense to start the process with an efficient, two dimensional sort where the data ends up evenly distributed amongst the cores in clusters of "near neighbours".
The target of this and subsequent blogs will be a design for Delaunay Triangulation. The target architecture is the Parallella board. I believe that the Epiphany III and Epiphany IV chips on the Parallella are well suited to spatial processing given that their cores are organised in a mesh, i.e. each core is connected to its nearest neighbours to the north, south, east and west via multiple high-speed buses. Thus, if close data points are clustered on each core and neighbouring clusters are on neighbouring cores, the distance to be travelled when the clusters are finally matched up will be minimised.

Goals of the Algorithm

The goals of the sorting algorithm are:
  • The x value of each point in a cluster is less than those in the clusters on the cores to the east and greater than those on the cores to the west.
  • The y value of each point in a cluster is less than those in the clusters on the cores to the north and greater than those in the clusters on the cores to the south.
  • The points are evenly distributed between cores
In addition we should take care that no unnecessary steps are introduced so that we end up with the most efficient process overall.

Distributing the Points

Given that the point cloud is assumed to be random and we want to have the same number of points per core, a round robin distribution seems the best way to start. This can be done as the data is being read in, which should give the cores enough time to complete at least some of the initial sort in parallel with the data load phase.

Initial Sort

Let's not reinvent the wheel here. I think that a quick sort of the x values into one index and the y values into another index is the way to go here.

Swap Sort

After the initial sort, each core will have a sorted group of points that could be from anywhere in the point cloud. The purpose of the swap sort is to push each point towards the neighbouring cores until the point sits with its nearest neighbours, or at least closer to them. I'm using a push or write-only method of communication between cores because a write message is a lot quicker than a read message. The cores must swap points (i.e. receive one point for every point sent) in order to preserve the even distribution of points.
The cores start the process by passing their largest x and y values to the cores to the east and north respectively via the c-mesh (let's call this the MyLargestX message). If the lowest x value of the core to the east is smaller than the highest x value of the sender then these two cores must initiate a swap. This can be done in response to the MyLargestX message (let's call the response MySmallestXY). Simultaneously, the cores can be swapping along the y axis with a MyLargestY message. The swap is then completed with a MyLargestXY call from the initial core passing both the x and y values to the point's new "home".
If the MyLargestX message contains an argument that is smaller than the smallest value on the receiving core then the receiving core does not initiate the swap. If, at a later time, that core receives a point that is smaller than the one received from the core to the south or west then the swap can be initiated.

End Conditions

The end conditions cannot be determined by any one core on behalf of all cores because each core only knows its own values and the largest values of the cores to the south and west. Therefore, each core must indicate to the controlling application that it has received a MyLargestX or MyLargestY that does not cause it to initiate a swap. This is only a tentative state because a core that is currently "happy" (i.e. has values that are larger than those of the cores to the south and west) may receive a point from the north or east that it must pass on. Therefore the controlling application must only terminate the process when all cores are "happy".

Potential Problems

Sub-Optimal Swapping Pathways

Because each core can only make decisions based on the data that is currently in its local memory, there may be cases where swaps initiated early in the process are then undone due to smaller values arriving from cores further to the north. Right now I can't see how this can be avoided.
Similarly, a swap to the west may be undone at a later stage after some swaps to the north. This could be avoided by swapping based on the x values (east-west) first and, when x is sorted, sorting on y (north-south). This would also free up some memory on each core given that there would only be one sort index needed at a time (or indeed the points could be moved around in memory, removing the need for an index at all).

Skewed Data Sets

This kind of sorting will end up with the point cloud spread out into a square. This is fine if the original data set is also roughly square, or at least evenly distributed over a regular sort of shape.

Diagram 1 – Near neighbours stay near.

This distribution probably will not work so well if the point cloud is not regular. Points that are not true near neighbours may end up on the same core and points that are near neighbours may end up on distant cores.

Diagram 2 – Distant points that are not near neighbours get squashed onto the same core(s)

Diagram 3 – Near neighbours get separated onto distant cores

In this case the user must be aware of the limitations of the sort and the algorithm that uses the sorted data must be robust enough to handle this situation.

Next: Using the sorted data

My next entry will be a description of how to triangulate the sorted points.
Don't hold your breath - the Parallella boards are being shipped and when I get mine, I'll be writing code, not English.

Tuesday, 3 December 2013

Something to play with while we wait.

I've just posted to github a lightweight C++ library that implements a three layer feed-forward back propagation neural network simulator. The repository is here:

I developed this version on the Pi using the C++ standard library. It is an update of a version that I wrote a couple of years ago and is a very C++ oriented solution. There are classes for links, nodes and layers, using templates, multiple inheritance and pretty much every C++ feature I know.

My intention is to use this as a base version from which I'll develop a version for the Parallella when it arrives. The next thing I'll work on is a matrix based implementation along the lines of the design outlined in the two previous posts on this blog.

If you want to download it and have a play please do. I hope my documentation is clear enough to get you started. 

One caveat though. I found github really confusing. If you are not familiar with it, downloading the files and not using git locally is probably the best way to go. Also, please don't post git or github questions - the answer will be "I have NO idea".

Please comment if you find any problems either on this post or on the repository wiki.

Thanks and have fun.

PS I just managed to get the Shuttle Statlog data uploaded. There are four training sets and one test set, the original data and transformed data in an open office spreadsheet and two scripts, one for create/train/save and one for test.

Tuesday, 1 October 2013

Training in Parallel

Training - the hard bit

Feeding forward is only part of the story. Any useful, real-world application with a significant number of inputs will need to be trained in an automatic fashion using a significant number of example data sets. Reed and Marks [1], the text that I mainly use, quote Hinton [2], who says that a typical network with w weights requires O(w) training patterns and O(w) weight updates for good generalisation. On a serial machine this would require O(w^3) time, which would only be reduced by a factor of w on parallel hardware, giving a training time of O(w^2). Clearly, an efficient implementation of the training algorithm is required.

How training works

Typically, before training commences, the weights on the links and bias values on the nodes are set to random values. When you feed your input pattern into the untrained network and calculate the output pattern, it will not bear any resemblance to the desired output. The weights and biases will have to be adjusted so that the calculated output matches the desired output (within predefined limits). The adjustment of each weight and bias is proportional to the amount of error it contributed to the final result.

For an output node the calculation is straightforward. The error is the desired value minus the calculated value. This error is then multiplied by the first derivative of the activation function, giving the delta for the bias and incoming links. This delta is used to "nudge" the incoming weights and output node bias in the direction that would result in a calculated output closer to the desired pattern.

The magnitude of the nudge is determined by a value called the "learning rate". This is a value between 0 and 1 which is a multiplicand when updating the weights and bias values.

The error for the hidden layer is a little more difficult. Each weight between the hidden and output layer has contributed a little to the error of the output nodes. The error of a hidden node is the sum of these contributions (specifically, each weight times the corresponding output node delta). Again, a delta for the hidden node is calculated from this summed error and the first derivative of the activation function of the node.

Once the error on each hidden and output node has been calculated the incoming weights and node biases must be updated. Reed and Marks describe two variations called Batch Update and Online Update.

Training Variation 1 - Batch Update

In Batch Update the whole training set is passed through the network. The deltas for every node for every training set are calculated. The "total error" for the training set is the average of each node's deltas. Once this has happened the weights and biases are updated once.

Training Variation 2 - Online Update

In Online Update, each training pattern, (i.e. one set of inputs and the matching output) is run through the network, the deltas are calculated and the weights are updated straight away.

Practical Considerations

When to Stop

The idea with training is that the network is updated so that the discrepancy between the generated output and desired output is small while maintaining the generality of the solution. This is a balancing act. Clearly you want the network to produce an output that is a recognisable pattern (e.g. if an output is ON then it is greater than 0.8 and if it is OFF then it is less than -0.8). However, if you train the network too much it will eventually get to the stage where it only recognises the training set that you used.

Recognising the situation where you have achieved the desired output levels can be done using a technique such as the Sum of the Squared Error (SSE).

Detecting over-training is not so easy. In practice, you keep a known set of inputs with their corresponding outputs aside as a test set. When you think that the network is getting close to a general solution you run the test set (without doing any updates) and see if it produces as good a result as the training set. If it does then great! Press Save. If the test set results are significantly worse than the training set then you might have gone too far.

Once you have a well generalised network you may want to continue training to see if you can improve the result, but if the test set results start to diverge then you should back up to your last known general solution.

For the purposes of this design we need to keep in mind the calculation of the SSE and the ability to run a test set and not update afterwards. The maintenance and use of the test set will be left to the host application or as a manual process.

How do you know if you have the best network?

You can think of the weights in a neural network like a large multi-dimensional surface. A good solution represents a low point on this surface where all of the important weights are at a minimum. 

Diagram 8 - 2D Representation of a solution space

It is possible for your network to get stuck in a local minimum that does not represent a good solution. In this case no amount of training will make it better. There may also be a number of good solutions, one of which is the best.

The only way of finding the best solution is to train the network many times from different starting points. The starting point in training is the set of random numbers that represent the initial weights. Therefore our system must have the ability to "unlearn" everything (hopefully after the user has pressed Save) and start again using a new set of random numbers.

Keeping the Cores Busy

To get the best performance out of the available hardware we should also consider how best to use all the features of the epiphany architecture.

Clearly the feed forward pass and error calculation (and weight update in online mode) are going to keep the cores busy for a significant time and I presume that this task will be the processing bottleneck. Therefore keeping the cores busy and reducing their waiting time will be the key to optimum performance.

The off-chip network is connected to local memory via the DMA controller. To keep the cores "fed" with data, we should try to arrange the host process to send the next training batch to the core's local memory while it is still working on the current one. This should allow the next batch to commence as soon as it has finished. 

Where we left off...

At the end of the feed forward pass (assuming that the host has been diligent and passed the target output values while the cores were busy) our local memory would look like this:

Diagram 9 - Local memory at the end of the Feed Forward Pass

In this diagram:

  • red indicates values that have been passed to the core by the host (t(u) and t(v) are the target values for z(u) and z(v))
  • blue indicates values that have been calculated by a core (z(u), z(v), y(p) and y(q) have been calculated on Core J while y(1).. y(p-1) have been passed from upstream cores and y(q+1) .. y(N) from downstream cores)
  • purple indicates values that could either be calculated or passed (i.e. the weights)

Training Stage 1: Calculate the Output Error and Delta

Calculating the error and associated delta for an output node is trivial. The host can determine which core will calculate each final output value and send the target values to it.

Training Stage 2: Calculate the Hidden Error and Delta

The hidden node error is a little more difficult. The problem is that in my current model, the outbound weights from each hidden node are distributed across all of the cores. A few possible solutions come to mind:

1. Swap space for speed

In the example, Core J can only calculate the part of the "hidden" error that is associated with Output(zu) and Output(zv) because it only has the links between Hidden(yp), Hidden(yq) and Output(zu), Output(zv). It actually wants to calculate the whole error attributed to Hidden(yp) and Hidden(yq). To do this it would have to have a copy of all the weights between its hidden nodes (yp and yq) and all the output node deltas.

This is possible if each core had a copy of its own outbound weights and we could distribute the output deltas by using the same mechanism we used with the hidden layer values.

Diagram 10 - Space for Speed compromise

Clearly this strategy requires each core to have two sets of Hidden-Output links: the inbound links for its output nodes and the outbound links for its hidden nodes. When training in batch mode the weights don't change from one training pattern to the next, so the two sets of weights start synchronised and remain so until the update pass.

The additional work of constantly updating two sets of weights, as required in online mode, suggests that this strategy would only be contemplated for batch mode.

2. Calculate and distribute

A less memory intensive method would be to calculate weight * delta for all weights and deltas available to the core and to pass those to the core that needs them.

This would mean that the data flowing around would use the fast on-chip write network to its fullest extent. The value calculated by each core would only have to be sent to the core that needs it, so the path would be determined by the Epiphany's "x then y" rule. The largest number of hops would be 6 (on an Epiphany-16), which would be, for example, between Core 1 and Core 13 at opposite corners as described in Diagram 7.

Once the hidden deltas have been calculated by the core that owns the hidden node, they are at least in the right place. That core can either accumulate them for a later batch update or use them to update the node's inbound weights straight away in online mode.

3. Let the ARM cores look after it

Clearly neither of these is a great solution. There is another possibility, however. The host CPUs (i.e. the ARM cores) also have a full set of data. Up until now we have only required them to keep sending data to the cores and not do any computation. There are two of them and both have considerable resources available in terms of memory and number-crunching power.

If the output value or output delta for each output node is passed back to the host, then the host could work out the remaining deltas while it is waiting for the next training pattern to be processed. The decision about what to pass back to the ARM cores would be based on how long the host takes to do its part of the job. The host's main task is still to keep the work up to the Epiphany.

Again, batch mode would be fine with this. The ARM cores would accumulate the deltas and, when the training set was done, send the deltas to each core, which could then update the weights. This would introduce a short pause for the Epiphany cores while the final output and hidden deltas and total batch errors are calculated and sent to each core for the update.

Online mode... again... would have a problem. If the weights are to be updated after every training example then the Epiphany cores would be waiting around for the ARM cores to calculate and send the updates. This does not seem to be a good solution.

Training Stage 3: Update Weights and Biases

Once the output and hidden node errors have been calculated, each bias and weight needs to be nudged towards a better solution. Given that each core has a copy of each incoming weight and (after we figure out the best way of determining the hidden layer error) each will have the error of its own nodes, the update of the weights is straightforward: each weight is incremented by delta * learningRate * the activation of the node feeding the weight.

In online mode this would happen straight away and after the update the delta could be discarded.

In batch mode the error would have to be accumulated somewhere, either by each core or by the host CPU and when the training set was complete the final "total batch error" could be calculated. The accumulated errors would then be used to update the weights.

Up Next...

While we wait for our Parallellas to arrive I thought I'd pull apart my Windows version and get a half decent interface together. I'll start on some documentation for that but it won't mean much until the guts have been written.

Also, I'll look around for some decent, well known test data. I want to be able to run a couple of big problems through it and test the scalability of the solution. If anyone knows of some good public test data please send me a link.

[1] Reed, R.D. and Marks, R. J. "Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks", MIT Press 1999.
[2] Hinton, G.E. Connectionist learning procedures. Artificial Intelligence 40(1): 143-150. 1989.