Monday, 11 May 2015

Tweaks for the new Parallella image (20150130)

Upgrading to the new Parallella image: Recommended - but I have some tips and warnings.

Why Upgrade? - Performance

The first thing that you will notice when you have everything running is that it all goes a little faster. For me, the lag associated with using the Parallella has gone from just noticeable to mostly unnoticeable. The big applications still take a while to load but the day to day stuff does not annoy me any more.

Also, the reboot is a lot more reliable.

Where to get it

The Getting Started pages have been radically upgraded as has the download page. Getting the right image is easy and loading this image is a single write operation that is well explained. To expand the file system to fill the whole SD card I used gparted on my laptop which also serves as my file backup machine.

Tip and Warnings In Summary:

  1. Get your heating under control
  2. Install LXDE and tightvncserver (and code::blocks if you are using it)
  3. Upgrade to the Brown Deer OpenCL version 1.6.2
  4. Tweak your paths


First let me say that blurting out a comment that the new image is hopelessly unreliable and that you are reverting to the old one is a dumb thing to do. Just get your fan running so that the Zynq stays under 70C (158F) and the new (and extremely rigorous) temperature daemon does not shut your Epiphany down. Nuff said.

PS. If you don't want to run xtemp continuously, I suggest you try olajep's temperature script:

dir=$(dirname $(grep -rl xadc /sys/bus/iio/devices/*/name))

# the sysfs files the xadc driver exposes (these definitions are missing from
# the snippet as posted; the in_temp0_* names are my assumption)
raw_file=${dir}/in_temp0_raw
offset_file=${dir}/in_temp0_offset
scale_file=${dir}/in_temp0_scale

raw=`cat ${raw_file}`
offset=`cat ${offset_file}`
scale=`cat ${scale_file}`

c_temp=`echo "scale=1;(($raw + $offset) * $scale) / 1000" | bc`
f_temp=`echo "scale=1;(($c_temp * 9) / 5) + 32" | bc`

echo "Zynq Temp: $c_temp C / $f_temp F"

which will print the current temperature out on the command line.


I'm using a micro-server Parallella and I am NOT a command line jock. I need a mouse, windows, buttons, menus and all that good stuff. The image I downloaded did not come with LXDE (the Lightweight X11 Desktop Environment), which was confusing at first. So, once you have booted up, I suggest you run:

sudo apt-get update

to update the existing packages, followed by:

sudo apt-get upgrade

to pick up any upgrades. Then:

sudo apt-get install lxde tightvncserver

to install all that Xerox-Parc windowing goodness.
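Once that is done, you can start a remote desktop session with something like (the geometry and colour depth here are just the values I use, adjust to taste):

tightvncserver :1 -geometry 1280x720 -depth 16

and point a VNC client at display :1 (port 5901) on the Parallella.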


The image also came with version 1.6.0 of the Brown Deer OpenCL compiler. Executables compiled with v1.6.0 will not run, so you will need to build the newer version yourself. To get the latest version, first run:

sudo su

to log in as root. To put the source files somewhere out of the way I used the root home directory, so:

cd ~

Then clone the source files (the coprthr repository from Brown Deer's GitHub):

git clone https://github.com/browndeer/coprthr.git

and from there it was plain sailing:

cd coprthr
./configure --enable-epiphany
make install (or sudo make install if you have not logged in as root)

I didn't have to install any required packages or point to any oddly placed libraries. (NB: this was my experience - as always if something goes wrong you'll have to sniff around to figure out the solution)


When I upgraded, I copied my various projects, byte for byte, onto the new file system. After installing code::blocks and sorting out the compiler, I really expected them to work. Instead I got numerous "file not found" errors that I had not seen before. I'm not sure why; it could be either the new environment or the new compiler, but either way you need to add a few more paths.

To find local include files, the compiler search path needs to include . (dot, the current directory).

Click Add and type .

To use the JIT compiler while Debugging your host code you need to include a PATH variable in the invocation command. 

The full string in the Executable path: text box is:

sudo -E LD_LIBRARY_PATH=/usr/local/browndeer/lib:/opt/adapteva/esdk/tools/host/lib PATH=/usr/local/browndeer/bin:/opt/adapteva/esdk/tools/host/bin:/opt/adapteva/esdk/tools/e-gnu/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin EPIPHANY_HDF=/opt/adapteva/esdk/bsps/current/platform.hdf /usr/bin/gdb

This is the path from before with a cut down PATH variable inserted.

I have not checked this on Eclipse but I would presume that it would be the same.

Final Thoughts...

The only other thing I can think of is that your default login is now parallella (pwd: parallella) and so your home path is /home/parallella and not /home/linaro. If you use any explicit paths in your code or anywhere else, you'll need to change that.

If I have forgotten anything else I've tweaked I'll edit this post.

On balance, I think the upgrade was worthwhile. I use the parallella as a development machine with no massive libraries (e.g. OpenCV), so code::blocks is the biggest thing I need. I run BOINC but getting that going is easy. If the UI response does not bother you and you have a lot of stuff installed (and are in the middle of something monumental), then maybe you want to keep going with the old image for now.

Up Next...

More neural networks.

Friday, 27 March 2015

Intercore Communication and Profiling (sort of...)

In this post I will describe three ways of passing data between cores and provide some empirical comparisons using a rudimentary method of profiling.

In my last post I described a method of passing data from one core to the others using a broadcast strategy. Each core calculated the base address of each other core and wrote to a data structure at a known address. At the time, I believed that this could be improved and made a little more programmer friendly. To test some alternative data passing methods, I designed a little experiment and used a rudimentary profiling strategy. I did find a significantly better alternative, but not where I expected.

Writing this post I'm using Brown Deer OpenCL (version 1.6) and the original Ubuntu 14.04 image.

To get my sample code execute:

git clone

Profiling in OpenCL

OpenCL provides the cl_event structure, which can be passed onto an execution queue created with the CL_QUEUE_PROFILING_ENABLE flag so that timing information can be retrieved using the clGetEventProfilingInfo call. This seems pretty straightforward but unfortunately it is currently not supported (according to the Parallella Quick Start Guide).

This leaves us with the timing support provided by C++. The <ctime> library provides the calls time(time_t & time) and clock_t clock = clock(). The time() call returns the number of seconds since 1 Jan 1970, which is too coarse to measure the execution of a small kernel. The clock() call returns the number of system clock ticks, which is a much finer interval and, if all of the profiling is done on the same machine, is accurate enough to get a feel for comparative execution times.

Surrounding the forka call with time stamps seems simple enough, but one must remember that the command queue is asynchronous, so any calls prior to the forka, and the forka itself, must be called with the flag CL_EVENT_WAIT.

#include <ctime>

clock_t tstart, tend;

clmsync(stdacc, 0, debug, CL_MEM_DEVICE|CL_EVENT_WAIT);

tstart = clock();
clforka(stdacc, 0, krn, &ndr, CL_EVENT_WAIT, n, 1, debug);
tend = clock();
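
If you want seconds rather than raw ticks, dividing by CLOCKS_PER_SEC (also from <ctime>) does the conversion:

double seconds = double(tend - tstart) / CLOCKS_PER_SEC;   // CLOCKS_PER_SEC is the standard tick-to-second factor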

The main problem with this approach is that it measures everything that the forka does. Have a look at Adam Taylor's host code for a simple "Hello, World!" application. There is a lot of setting up code, which is buried within the forka. This is a little clumsy but will have to do for now.

Intercore Value Passing

Core ID Abstraction

In my last post I used the little gem (from djm):

#define LOCAL_MEM_ADDRESS_BASE(gid) (((32 + ((gid) / 4)) << 26) | ((8 + ((gid) % 4)) << 20))

to calculate the base address of the core and then used:

*(float *)(LOCAL_MEM_ADDRESS_BASE(gid_next) + ((unsigned int) derived) + (index * sizeof(float))) = value_being_passed;

Now, I love bit shifting as much as the next programmer but I don't really have to work out what every base address is. If I know my global id ( gid = get_global_id() ) I can figure out where the executing core is in the mesh and its relationship with all other cores because the base addresses don't change.

My second love is hexadecimal memory locations, but they really increase your nerd status when your wife or girlfriend looks over your shoulder and wonders what on earth you are doing late into the night. Seriously though, does 0x8080 really mean anything to you? And what is its relationship with 0x84b0? How many hops are there between the two?

The eSDK has a way of partitioning the cores based on their location in the grid. The stdcl library does not support this but it is still a good idea. Instead of 0x8080 we could say core00 and then it is obvious that there are 5 hops to core31 (formerly known as 0x84b0).

That suggests to me that the first snippet above (although pure poetry) can be abstracted away into some nice, convenient and most importantly, efficient include file that looks like this:

#define core00 0x80800000
#define core10 0x80900000
#define core20 0x80a00000
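and so on. The list continues in the same pattern; for completeness, the values I calculate from LOCAL_MEM_ADDRESS_BASE for the remaining cores would be:

#define core30 0x80b00000
#define core01 0x84800000
#define core11 0x84900000
#define core21 0x84a00000
#define core31 0x84b00000
#define core02 0x88800000
#define core12 0x88900000
#define core22 0x88a00000
#define core32 0x88b00000
#define core03 0x8c800000
#define core13 0x8c900000
#define core23 0x8ca00000
#define core33 0x8cb00000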


Topology Abstraction

The Epiphany is a rectangular mesh with each core connected to its nearest neighbours. This is great, but what if your application is better suited to a ring structure, where each core only has to communicate with the two cores on either side of it in the ring? For that application, it would be convenient to refer to the next core as NEXT and the previous core as PREV. In this case, the ring topology would look like this:

If data only needed to flow around in one direction, then you would only need to refer to NEXT.

Similarly, you could define a row topology:

Or, if you really did need 3 Dimensional processing, a mesh topology where you only had to refer to NORTH, SOUTH, EAST and WEST would be really handy:

In the row and mesh topologies, the cores at the ends (e.g. RIGHT of core00) get assigned 0x0.

I have implemented these three topologies and used the ring in my sample code. The relevant code is:

unsigned int NEXT, PREV, ringIndex, gidOrder[CORECOUNT];
initRing(&NEXT, &PREV, &ringIndex, gidOrder);

The initRing call initialises all of the variables which are:
  • NEXT: the next core base address in the ring
  • PREV: the previous core base address in the ring
  • ringIndex: the position of the executing core within the ring (core00 is assigned 0 and core01 is 15)
  • gidOrder: an array with the global_ids in the order in which they come in the ring (gidOrder[ringIndex] == get_global_id(0))

Such definitions abstract away most of the need to refer to coreXX and mean that, once set up, no core has to execute an if... else... to figure out where it is in the mesh. The only exception is the row and mesh topologies, where the code has to check that it is not on an edge with no core further down the chain.

Abstracting the Assignment

One final abstraction is to clean up the assignment. To this end I've defined:
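In essence these are just the base-plus-offset arithmetic from earlier wrapped in a macro, something like:

#define NEIGHBOUR_LOC(CORE, ARRAY, INDEX, SIZE) ((CORE) + ((unsigned int)(ARRAY)) + ((INDEX) * (SIZE)))
#define NEIGHBOR_LOC(CORE, ARRAY, INDEX, SIZE)  NEIGHBOUR_LOC(CORE, ARRAY, INDEX, SIZE)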


(The only difference between the two is the spelling of neighbour/neighbor.) 

Which is used (for an int assignment):

*(int*)NEIGHBOUR_LOC(NEXT, vLocal, i, sizeof(int)) = vLocal[i];

I still find this a bit clumsy but better than the original.

The Passing Experiment

The whole point of this exercise was to see if I could find a better alternative to the core-to-core broadcast method I implemented in my feed-forward pass of my neural network.

(For those who don't want to wade through my first post, the feed-forward pass is a series of matrix multiplication steps where the task is split between all cores. Each step produces another matrix which then is one input of the next step. Every core needs all of the values produced in the previous step, therefore every core must pass its results to every other core.)

The Original Method - Broadcast

The original method iterated through every remote core and wrote directly into its memory. The code was simple enough, but my "back of the envelope" calculation estimated that if every core just sent one value, there would be 640 intercore hops in the whole process.

I did improve on the original a little. I only calculated each remote core's base address once (rather than every time I sent a value, which is obviously a waste of time).

Alternative Zero - Broadcast No Wait

I was also pretty sure that I didn't need so many calls to barrier() given that each core "owned" a chunk of the storage array on all cores and it alone wrote to it. So my first change was to implement the same broadcast strategy but with no barriers.

I wanted to write the best kernel I could, so I generated an array of all core IDs (core00... core33) and iterated through the array, sending all values to each core except when the destination core ID was the same as the local core ID (no point in sending values to itself).
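A sketch of that loop (the variable names here are only illustrative):

for (c = 0; c < CORECOUNT; c++)
{
    if (coreIds[c] == coreIds[gid])
        continue;                                   // don't send values to ourselves
    for (i = 0; i < n; i++)
        *(float *)NEIGHBOUR_LOC(coreIds[c], vLocal, i, sizeof(float)) = vLocal[i];
}
// note: no calls to barrier() anywhere in this version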

Then I came up with two alternative strategies to test using my ring topology described above.

Alternative One - Unicast

Unicast uses the ring topology and passes its local values to the next core. Then it passes the values it has just received from the previous core in the ring on to the next core, and so on until all values have flowed around the ring.

This would minimise the number of clashes but it does not use very much of the available bandwidth. Hence:

Alternative Two - Multicast

I read somewhere that the Epiphany has two channels on the cMesh (used for intercore write messages). Multicast (which should really be called Bicast) again uses the ring topology but passes values in two directions, thereby using both channels. Like Unicast, it first sends its own values (but in both directions) and then sends the values it has just received from its neighbours: values received from PREV are passed on to NEXT and values received from NEXT are passed on to PREV.

The Results

To see some sort of sensitivity I called each method 16 times, incrementing an argument (n) from 1 to 16 on each call. Each kernel passed n values to its neighbours. I expected to see some difference in execution time based on each kernel's algorithm and some sort of trend line as n increased. However, my initial experiment did not show any difference between the methods, or any effect from the volume of data passed, at all. The chart looked like this:

The Y-axis is the number of clock ticks per call and the X-axis has the passing method (Broadcast, Broadcast No Wait, Multicast and Unicast), in ascending order of the number of values passed.

This was a bit disappointing at first. Taking into account a bit of random scatter, all the methods looked the same and there was no discernible trend line based on n. The average for all methods was just over 119,000 ticks. I thought I should take a larger sample.

At first, I surrounded everything in a loop and ran it 100 times. This made hardly any difference at all. The average nudged up a tiny bit but the overall picture looked the same.

As I mentioned above, the timing method I used measured the whole of the forka call. Because there was so little change between 1 and 100 iterations, I can only assume that the 119,000-odd clock ticks were almost all overhead!

Clearly, to see any sort of difference I'd have to dramatically increase the workload. So... 100,000 iterations...

Finally, a data set that tells a story! The averages (which include the 119,000 tick loading time) were:

  • Broadcast: 2,545,798 ticks
  • Broadcast No Wait: 302,535 ticks
  • Multicast: 1,521,429 ticks
  • Unicast: 2,623,446 ticks
And a roughly steady increase with the amount of data being sent.

But hang on... what story does it tell?

I was completely amazed by this result. Getting rid of the calls to barrier() decreased the execution time by almost 90%! Sure the algorithm was a little better but not that much!

Similarly, Multicast, with almost half the calls to barrier(), ran in almost half the time.

Here I was thinking that the number of hops was the dominant time consumer!


From this experiment, I propose the following conclusions:

0. Don't Wait... Minimise the use of barrier()

Coordinating cores using calls to barrier() is expensive.

While I thought my fancy topologies would speed things up by reducing the number of hops, the algorithm needed the cores to coordinate. In a "pass-it-along" scenario, each core had to make sure that the cores around it had successfully delivered their value(s) before it could pass it (or them) along.

Broadcast, using point-to-point communication, didn't need coordination, so the extra transmission cost was insignificant.

1. Give your cores a lot of work to do. 

The thing that slows parallel architectures down in general and especially with distributed memory architectures is communication overhead. This is usually interpreted as, "The cores need to pass information between each other and wait for results". However, when working with a remote device, communication overhead includes the amount of time the main processor needs to send the accelerator the kernel and the data to work on as well as the intercore communication.

So, when designing your system, cut off as big a chunk of work as you can and get the little cores to do as much as possible - they are quite powerful... they can handle it.

Final thought...

While my topologies didn't deliver any significant gain in this case, I think that they would still be useful if the topology suited the algorithm. If I come across one, I'll write another post.

Up Next

I'm going to upgrade my system to the new image. I believe that the reboot is more reliable and you don't need to run your programs as root. Not a big deal, but I still forget to log in using su every now and then. I'll also revisit my old posts and update them if need be.

Then I'll keep going on my neural network simulator... Stay tuned.

Sunday, 1 February 2015

OpenCL on the Parallella - Structure, Communication and Debugging for Noobs

In this post, I'll cover the basic program structure of a simple OpenCL program, inter-core communication, dynamic space allocation and debugging when there is no debugger.

I'm writing on a Parallella-16 (server version) using the Brown Deer stdcl library (version 1.6), which uses the eSDK from the 14.04 Ubuntu build. My example code can be cloned from github with the command:

git clone -b blog9Files

Programming in OpenCL

The first thing to remember about OpenCL is that it was devised to provide a "standard" language for offloading numerically intensive tasks from the CPU to a GPU. The Epiphany accelerator is similar to a GPU in that it is a remote device with no operating system, and its memory is not shared with the CPU.

This means that you have to transmit the program and data to the accelerator, execute the program and retrieve the results. Once the program has halted, you have to start the whole process again. The data and code you had there previously will have disappeared.

Also, there are no basic operating system functions like threads or interrupts. You can program them yourself if you want, but you have to start from scratch (or use something like FreeRTOS). This is what's known as "bare metal" programming.

So, if this is what you are up for, let's start.

Basic Structure

If you've read my post on NDRanges, you will be familiar with the calls clndrange_init1D and forka. These are a shorthand way of distributing a kernel to a specific number of cores where you want those cores to carry out exactly the same task on data you supply. This is the classic Single Instruction Multiple Data (SIMD) execution plan that suits some problems very well. Matrix multiplication, which lies at the heart of the feed-forward back-propagation neural network, is one such problem.

The following diagram and code illustrate how this works for the Parallella.

In the above diagram, I've included all the calls you need to get your program and data to and from the accelerator. Here are the critical bits on the host side:

  • You must use the OpenCL types for all variables that you will pass. Here I've used cl_T to indicate OpenCL types. They start with cl_ followed by one of the basic types (int, float etc.). This ensures that the types you use on the host side are the same width as those on the accelerator.
  • Pointer variables (e.g. arrays) must be allocated to an OpenCL context (the standard context for the accelerator is stdacc) using clmalloc. Note that argument 2 is the size of the space allocated in bytes, so multiply your array size by the size of the contained type.
  • Use the clopen command that suits your chosen compilation strategy. The first one is for kernels that have been linked at compile time (see my previous blog post on Code::Blocks). The second is for the JIT compiler with the cl code in a file and the third (clsopen) is for the JIT compiler with the kernel in a string (for more discussion see the section on Space Allocation below).
  • The clmsync using the CL_MEM_DEVICE indicates that the data is outbound from the host to the accelerator. CL_MEM_HOST indicates inbound data.
  • For a discussion of cl_ndrange and forka see my previous blog post titled "What was that NDRange thing?". Note that all arguments are passed by value, therefore it is not necessary to call clmsync for simple variable types like cl_int and cl_float.
  • All calls that enqueue an activity to the accelerator have the flag CL_EVENT_NO_WAIT. This indicates that the host should execute the command and continue without waiting for the accelerator to complete the task. The cl_wait command ensures that they are all finished before continuing. (The cl_flush is there because the example code from Brown Deer uses one. The clmesg that appears during execution indicates that cl_flush has not been implemented. I think it is supposed to complete all "enqueuements".)
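To make the list above concrete, here is a minimal host-side sketch of that sequence, written against the Brown Deer examples as I remember them (the kernel file name, kernel name and sizes are placeholders, and the exact flag spellings may differ between stdcl versions):

#include <stdio.h>
#include <stdcl.h>

int main()
{
    int i;
    cl_int n = 1024;

    /* arrays that will be passed to the accelerator must live in the stdacc context */
    cl_float* arr_in  = (cl_float*)clmalloc(stdacc, n * sizeof(cl_float), 0);
    cl_float* arr_out = (cl_float*)clmalloc(stdacc, n * sizeof(cl_float), 0);
    for (i = 0; i < n; i++) arr_in[i] = (cl_float)i;

    /* JIT-compile the kernel source and look up the kernel by name */
    void* h = clopen(stdacc, "my_kernel.cl", CLLD_NOW);
    cl_kernel krn = clsym(stdacc, h, "my_kern", 0);

    /* queue the outbound data, the kernel and the inbound data, then wait for the lot */
    clndrange_t ndr = clndrange_init1d(0, 16, 16);               /* 16 cores */
    clmsync(stdacc, 0, arr_in, CL_MEM_DEVICE | CL_EVENT_NOWAIT);
    clforka(stdacc, 0, krn, &ndr, CL_EVENT_NOWAIT, n, arr_in, arr_out);
    clmsync(stdacc, 0, arr_out, CL_MEM_HOST | CL_EVENT_NOWAIT);
    clwait(stdacc, 0, CL_ALL_EVENT);

    printf("arr_out[0] = %f\n", arr_out[0]);

    clfree(arr_in);
    clfree(arr_out);
    return 0;
}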
On the Epiphany side you are (almost) back in the wonderful world of C-language programming. I say almost because there are a few additional things you have to think about and a few restrictions (especially if you have no debugger).

The arguments to the kernel are passed by value, therefore the arrays (arr_in and arr_out) are pointers into the Epiphany's global memory (I'm not sure where this memory actually is, but it is remote from the core). I think it is better to treat these arguments like Unix named pipes rather than variables. You can use them directly but the communication overhead in doing so is going to be large. Therefore, I declared local arrays and copied the remote data into them before starting the computation. Once the computation is complete, the results are copied back to remote memory before the kernel exits.

Note that I have used a simple for loop to copy the data from global to local memory. The examples supplied by Brown Deer all use the memcopy extension. I have not been able to find out how OpenCL extensions are used on the Parallella and so I've opted for a simple alternative that works in the same way which I'll replace as soon as I'm able.
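The copy itself amounts to no more than this (inWidth and the array names are illustrative):

    for (i = 0; i < inWidth; i++)
        in[i] = arr_in[i];      /* copy from off-chip global memory into the core's local array */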

Note also that the local arrays are declared statically with a fixed size at compile time. There was a post on the forums where djm asked about dynamic allocation and the response from dar (David Ritchie at Brown Deer) was not very promising (see: here). Therefore I've opted again for a simple, obvious approach using static arrays, the sizes of which I manipulate on the fly (see Space Allocation below).

Once you are in your kernel, the need to put cl_ on the front of variables disappears. I've put the symbol __private in front of the arrays but I think that is redundant. With no memory equivalent to OpenCL __local memory (memory shared between groups of cores), anything declared within the kernel is automatically private.

Space Allocation

Static array sizes are a problem if you have to decide how large they are before you know what data they will have to accommodate. If you use the offline compiler (as described in my Code::Blocks post) this is what you have to do. However, with a Just In Time compiler you can calculate the size required and generate code (or compiler directives) when the host code is running and alter the kernel source code before it is compiled.

Calling the JIT compiler can be done in two ways. The simpler way is using the call:

void * handle = clopen(stdacc, strKernelSourceFile, CLLD_NOW);

This reads the source file named in argument 2 and compiles it straight away. The source file is just like any other.

You can also use:

void * handle = clsopen(stdacc, strKernelCode, CLLD_NOW);

In this call, argument 2 is a string containing the actual code. The advantage of this call is that there are no files to be read and your kernel code is compiled into your executable. It is still not exactly secure (anyone can use a text editor to open the executable file and look for readable text) but at least you don't need to deliver two files when you deploy your system. The downside is that the code needs new line characters to make it humanly readable. You can deliver your kernel in one enormous line and the compiler will not complain but that is not really practical for humans. To use a new line character in a string you need to use the \ character at the end of each line. I decided to use the first method just to avoid having to maintain the \ characters. 
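For what it's worth, a small illustrative example of the string form, with the \ continuations and \n characters in place (the kernel body here is just a stand-in):

const char* strKernelCode = "\
__kernel void my_kern(int n, __global float* out)\n\
{\n\
    int i = get_global_id(0);\n\
    if (i < n) out[i] = (float)i;\n\
}\n";

void * handle = clsopen(stdacc, strKernelCode, CLLD_NOW);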

I think the solution is to develop using the first strategy and then write a quick pre-processing step that generates the string from the source code for the Release version. This would either add the \ before the new line character or strip out the new line characters altogether (along with the // comments that need a new line to terminate).


In my feed forward example I use a dynamically created include file filled with #defined values to contain all the characteristics of the neural network that do not change. To generate them I open, write to and close the include file from the host code.

At the top of the file:
#define PATHTOCLDEFSFILE "//home//linaro//Work//nnP//"

Inside the class, I calculate the values I want to define (see the function setNetworkTopology) and then call:

void writeDefsFile()
{
    fstream * pFile;
    cl_int i;

    if (checkExists(PATHTOCLDEFSFILE, true))
    {
        pFile = new fstream();
        pFile->open(PATHTOCLDEFSFILE, ios::out);
        (*pFile) << "#define CORECOUNT 16\n";
        (*pFile) << "#define LAYERCOUNT " << layerCount << "\n#define OUTPUTLAYER " << (layerCount - 1) << "\n";
        (*pFile) << "#define MAXWEIGHTTOLAYER " << maxWeightToLayer << "\n";
        (*pFile) << "#define LARGESTDERIVEDLAYER " << largestDerivedLayer << "\n";
        (*pFile) << "#define LARGESTINPUTLAYER " << largestInputLayer << "\n";
        (*pFile) << "#define INITWIDTHARRAY {";
        for (i = 0; i < layerCount - 1; i++)
            (*pFile) << (*layers)[i].nodeCount << ",";
        (*pFile) << (*layers)[i].nodeCount << "}\n";
        pFile->close();
        delete pFile;
    }
    else
        throw format_Error(ENN_ERR_CL_DEFS_NOT_FOUND);  // the defs file location could not be opened
}

Which produces the file:

#define CORECOUNT 16
#define LAYERCOUNT 4
#define INITWIDTHARRAY {8,17,29,12}

And finally, in the cl file the #include and an example to remind me:

#include "/home/linaro/Work/nnP/"
/// contains #defines for all static variables
/// example contents of
///#define CORECOUNT 16
///#define LAYERCOUNT 4
///#define OUTPUTLAYER 3          
///#define MAXWEIGHTTOLAYER 1024
///#define LARGESTINPUTLAYER 32   
///#define INITWIDTHARRAY {32,32,16,16}

And later the declarations:

    /// local storage
    __private int   widths[] = INITWIDTHARRAY;
    __private float wgt[MAXWEIGHTTOLAYER];
    __private float biases[LARGESTDERIVEDLAYER];
    __private float in[LARGESTINPUTLAYER];
    __private float derived[LARGESTDERIVEDLAYER];

Works a treat!

Intercore Communication

If you've read my very first post on the design of the neural network simulator, you will have seen that each intermediate value calculated by one core needs to be distributed to all of the other cores for use in the next round. To do this I use this little gem which was brought to my attention by djm:

#define LOCAL_MEM_ADDRESS_BASE(gid) (((32 + ((gid) / 4)) << 26) | ((8 + ((gid) % 4)) << 20))

This calculates the memory base address of a core given the global id (int gid = get_global_id(0)). It is defined as a macro to avoid the overhead of a function call.

To use this you then need to provide the offset of the structure into which you want to write your value:

*(float *)(LOCAL_MEM_ADDRESS_BASE(gid_next) + ((unsigned int) derived) + (index * sizeof(float))) = value_being_passed;

The components of this snippet are:

  • LOCAL_MEM_ADDRESS_BASE calculates the beginning of the memory range
  • ((unsigned int) derived) is the address (pointer to) the derived value array which is being cast to an unsigned int for the addition
  • (index * sizeof(float)) is the index of the array into which we want to write the value_being_passed. Don't forget, this is all byte arithmetic so you need to multiply the index by the size of the contained type to land up in the right place.
One BIG note here: this will only work if the code on the core being written to is the same as the code on the core doing the writing. The location of the remote array is based on its address on the local core. If the core being written to has allocated the array in a different place, you'll be writing somewhere you don't expect.


Here is an example from my feed forward kernel:

gid_next = (gid == (CORECOUNT - 1)) ? 0 : gid + 1;
while (gid_next != gid)
{
    for (n = localFirstNode; n < localLastNode; n++)
        *(float *)(LOCAL_MEM_ADDRESS_BASE(gid_next) + ((unsigned int) derived) + (n * sizeof(float))) = derived[n];

    barrier(CLK_GLOBAL_MEM_FENCE);  // the barrier discussed below (the fence flag is my assumption)
    gid_next = (gid_next == CORECOUNT - 1) ? 0 : gid_next + 1;
}

//  debug - watch the derived values arriving at core 0 from the other nodes
//  if (gid == 0)
//      for (i=0; i<curLayerWidth; i++)
//          debug[d++] = derived[i];


Each core iterates through the values it has derived and writes them into the derived array of each other core.

Note the barrier at the end of each iteration through the nodes. This is there to keep all cores in sync while they go about messing with each other's memory. This is handy during debugging, but barriers should be used sparingly given that they stall all cores until they have all reached the barrier. This ensures that the slowest core determines the execution time of all cores up to that point. In this case I don't believe they are necessary because I don't need anything else to be ready before I do the next write. I'll check that out when I focus on performance rather than correctness.

I've left some debugging code in this snippet. I'll cover that below.

A Few Notes About Performance and Efficiency

While I have not spent too much time on performance or efficiency, there were a few things that occurred to me while I was writing this kernel.

One thing was obvious enough to be put in from the beginning. I pass all the weights needed to calculate the output for the whole network into one big array called weights. This array dominates the space requirement for the calculation. For this reason, I copy only the weights needed to calculate the nodes assigned to each core into a local array called wgt. In this way not only is the calculation spread out over all cores but the space requirement is as well. (NB: now that I come to write this, I notice that I've actually allocated enough space on each core for all the weights between the biggest two layers - 16 times as much as they actually need. I'll fix that up in the next version.)

There is also something that I can improve on the host side. In the structure that I've presented above, the host (the ARM cores) provides the input data to the accelerator, fires off the kernel and then waits until it is done. With neural networks, and many other applications, the kernel is called multiple times with new data. Therefore, it would be more efficient for the host to leave the accelerator running while it collects the next input set, and only wait once that next input set is ready to be sent. The execution time would then be the slower of the host or the accelerator rather than the sum of the two.

The inter-core communication method that I've described above also leaves a lot to be desired. Each core writes to each other core's memory regardless of how far away it is in the mesh. By my calculation, if each core wrote one value to each other core, that would generate 640 inter-core write operations. It's hard to determine how long this would take given that they all happen in parallel and there would be clashes, but I'm sure there is a better way. Two possibilities come to mind:

  • Use the base address of the core to determine where it is in the mesh and then only write to its nearest neighbour in a predefined topology, starting with the value the core calculated itself and then passing on values that it received from its neighbours.
  • Use the Epiphany's broadcast mechanism.
I've got no idea how to do either of these just yet so they will have to be the subject of another post.


Debugging

I mentioned in my last post that I did not have any luck getting the Epiphany debugger going with OpenCL code. This is still the case.

Because I was mainly working with floats, I declared an array of floats on the host and made it 2048 items wide. I passed this to the kernel and gave myself an index int d = 0;. As the kernel executed I wrote values directly into this __global array, incrementing d on each assignment. I could write from one core, as in the debug example above. Or, if I wanted, say, 18 values from each core, I reinitialised d as 18 times the global id so that each core wrote to a different part of the array.

Then, on the host side, I pulled the values back just like my output values and wrote them to standard output separated by commas. I copied the output to a file and loaded it into a spreadsheet as a csv file. I had to make sense of it with a lot of copying and pasting, but at least I could see what was going on. Primitive, but it's the best I came up with and it did get the job done. I've left a few examples in the source on github.
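To give a flavour of the idea, a stripped-down, purely illustrative kernel using such a debug array might look like this:

__kernel void my_kern(int n, __global float* in, __global float* out, __global float* debug)
{
    int gid = get_global_id(0);
    int d = gid * 18;                       /* give each core its own 18-value slice of the debug array */

    float intermediate = in[gid] * 2.0f;    /* stand-in for a real intermediate result */
    debug[d++] = intermediate;              /* record it for inspection back on the host */

    out[gid] = intermediate;
}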

The process I used was:

  • Check the input using the debugger on the host side. If you don't start from a known position you'll have no hope of understanding what's going on in the kernel. 
  • Check the input on the accelerator side. Copy the input from the input arrays directly to the debug array and make sure that the data is arriving as you expect
  • Follow the execution through writing the intermediate results to the debug array.
  • Keep the execution constrained using barriers so that you know where the cores are up to when you write to the debug array
A few tips that I picked up along the way:

  • If you get a Segmentation Fault there is a problem on the host side. It could show up during a clmsync but it is a host problem. I found that it was usually that I had not allocated enough space in an array.
  • If there is a problem on the accelerator side the kernel will not return. Kill the waiting host program, scratch your head for a while and try again.
  • Initialise your arrays so that you can tell the difference between a calculated value that is wrong and a memory location that has not been written to. When you are dealing with lots of floats, the rubbish that is in the array before it is used looks very much like wrong answers.

Up Next...

I hope this summary of what I've learnt to date on a "serious" working kernel is of assistance to you. If you want to know more about the neural network side, I'll annotate the code on github, which will hopefully give you a better idea of how it works.

From here I'm going to move on to the simplest implementation of the back propagation algorithm that I can write. Given that the feed forward algorithm is the first part of the training sequence, followed by back propagation, I'll have to figure out how best to get these two to work together. Along the way I'll figure out how performance benchmarking works so that, as I work on increasing performance, I'll be able to tell whether my efforts have paid off.

As always, please let me know if I've missed anything or if you have any questions...