Sunday, 1 February 2015

OpenCL on the Parallella - Structure, Communication and Debugging for Noobs

In this post, I'll cover the basic program structure of a simple OpenCL program, inter-core communication, dynamic space allocation and debugging when there is no debugger.

I'm writing on a Parallella-16 (server version) using the Brown Deer stdcl library (version 1.6), which uses the eSDK (version from the 14.04 Ubuntu build). My example code can be cloned from github with the command:

git clone -b blog9Files

Programming in OpenCL

The first thing to remember about OpenCL is that it was devised to provide a "standard" language for offloading numerically intensive tasks from the CPU to a GPU. The Epiphany accelerator is similar to a GPU in that it is a remote device with no operating system, and its memory is not shared with the CPU.

This means that you have to transmit the program and data to the accelerator, execute the program and retrieve the results. Once the program has halted, you have to start the whole process again. The data and code you had there previously will have disappeared.

Also, there are no basic operating system services like threads or interrupts. You can program them yourself if you want, but you have to start from scratch (or use something like FreeRTOS). This is what's known as "bare metal" programming.

So, if this is what you are up for, let's start.

Basic Structure

If you've read my post on NDRanges, you will be familiar with the calls clndrange_init1D and forka. These are a shorthand way of distributing a kernel to a specific number of cores where you want those cores to carry out exactly the same task on data you supply. This is the classic Single Instruction Multiple Data (SIMD) execution plan that suits some problems very well. Matrix multiplication, which lies at the heart of the feed-forward back-propagation neural network, is one such problem.

The following diagram and code illustrate how this works for the Parallella.

In the above diagram, I've included all the calls you need to get your program and data to and from the accelerator. Here are the critical bits on the host side:

  • You must use the OpenCL types for all variables that you will pass. Here I've used cl_T to indicate OpenCL types: they start with cl_ followed by one of the basic types (int, float etc.). This ensures that the types you use on the host side have the same width as those on the accelerator.
  • Pointer variables (e.g. arrays) must be allocated to an OpenCL context (the standard context for the accelerator is stdacc) using clmalloc. Note that argument 2 is the size of the space allocated in bytes, so multiply your array size by the size of the contained type.
  • Use the clopen call that suits your chosen compilation strategy. The first form is for kernels that have been linked at compile time (see my previous blog post on Code::Blocks). The second is for the JIT compiler with the cl code in a file, and the third (clsopen) is for the JIT compiler with the kernel in a string (for more discussion see the section on Space Allocation below).
  • A clmsync with the CL_MEM_DEVICE flag indicates that the data is outbound from the host to the accelerator; CL_MEM_HOST indicates inbound data.
  • For a discussion of cl_ndrange and forka see my previous blog post titled "What was that NDRange thing?". Note that all scalar arguments are passed by value, so it is not necessary to call clmsync for simple variable types like cl_int and cl_float.
  • All calls that enqueue an activity for the accelerator take the flag CL_EVENT_NO_WAIT. This indicates that the host should issue the command and continue without waiting for the accelerator to complete the task. The cl_wait command then ensures that they have all finished before continuing. (The cl_flush is there because the example code from Brown Deer uses one. The clmesg that appears during execution indicates that cl_flush has not been implemented; I think it is supposed to complete all outstanding enqueued operations.)
On the Epiphany side you are (almost) back in the wonderful world of C-language programming. I say almost because there are a few additional things you have to think about and a few restrictions (especially if you have no debugger).

The arguments to the kernel are passed by value; therefore the arrays (arr_in and arr_out) are pointers into the Epiphany's global memory (I'm not sure where this memory actually is, but it is remote from the core). I think it is better to treat these arguments like Unix named pipes rather than variables. You can use them directly, but the communication overhead in doing so is going to be large. Therefore, I declared local arrays and copied the remote data into them before starting the computation. Once the computation is complete, the results are copied back to remote memory before the kernel exits.

Note that I have used a simple for loop to copy the data from global to local memory. The examples supplied by Brown Deer all use the memcopy extension. I have not been able to find out how OpenCL extensions are used on the Parallella, so I've opted for a simple alternative that works in the same way, which I'll replace as soon as I'm able.

Note also that the local arrays are declared statically with a fixed size at compile time. There was a post on the forums where djm asked about dynamic allocation and the response from dar (David Ritchie at Brown Deer) was not very promising (see: here). Therefore I've again opted for a simple, obvious approach using static arrays whose size I manipulate on the fly (see Space Allocation below).

Once you are in your kernel, the need to put cl_ on the front of types disappears. I've put the qualifier __private in front of the arrays but I think that is redundant: with no memory equivalent to OpenCL __local memory (memory shared between groups of cores), anything declared within the kernel is automatically private.

Space Allocation

Static array sizes are a problem if you have to decide how large they are before you know what data they will have to accommodate. If you use the offline compiler (as described in my Code::Blocks post), this is what you have to do. However, with a Just In Time compiler you can calculate the size required, generate code (or compiler directives) while the host code is running, and alter the kernel source before it is compiled.

Calling the JIT compiler can be done in two ways. The simpler way is using the call:

void * handle = clopen(stdacc, strKernelSourceFile, CLLD_NOW);

This reads the source file named in argument 2 and compiles it straight away. The source file is just like any other.

You can also use:

void * handle = clsopen(stdacc, strKernelCode, CLLD_NOW);

In this call, argument 2 is a string containing the actual code. The advantage of this call is that there are no files to be read and your kernel code is compiled into your executable. It is still not exactly secure (anyone can use a text editor to open the executable file and look for readable text) but at least you don't need to deliver two files when you deploy your system. The downside is that the code needs new line characters to make it human-readable. You can deliver your kernel in one enormous line and the compiler will not complain, but that is not really practical for humans. To continue a string literal across lines you need a \ character at the end of each line. I decided to use the first method just to avoid having to maintain the \ characters.

I think the solution is to develop using the first strategy and then write a quick pre-processing step that generates the string from the source code for the Release version. This would either add the \ before each new line character or strip out the new line characters altogether (along with the // comments that need a new line to terminate).


In my feed forward example I use a dynamically created include file filled with #defined constants to hold all the characteristics of the neural network that do not change. To generate them I open, write to and close a file called

At the top of the file:
#define PATHTOCLDEFSFILE "//home//linaro//Work//nnP//"

Inside the class, I calculate the values I want to define (see the function setNetworkTopology) and then call:

void writeDefsFile()
{
    fstream * pFile;
    cl_int i;

    if (checkExists(PATHTOCLDEFSFILE, true))
    {
        pFile = new fstream();
        pFile->open(PATHTOCLDEFSFILE, ios::out);
        (*pFile) << "#define CORECOUNT 16\n";
        (*pFile) << "#define LAYERCOUNT " << layerCount << "\n#define OUTPUTLAYER " << (layerCount - 1) << "\n";
        (*pFile) << "#define MAXWEIGHTTOLAYER " << maxWeightToLayer << "\n";
        (*pFile) << "#define LARGESTDERIVEDLAYER " << largestDerivedLayer << "\n";
        (*pFile) << "#define LARGESTINPUTLAYER " << largestInputLayer << "\n";
        (*pFile) << "#define INITWIDTHARRAY {";
        for (i = 0; i < layerCount - 1; i++)          // all but the last width get a trailing comma
            (*pFile) << (*layers)[i].nodeCount << ",";
        (*pFile) << (*layers)[i].nodeCount << "}\n";  // the last width closes the initialiser
        pFile->close();
        delete pFile;
    }
    else
        throw format_Error(ENN_ERR_CL_DEFS_NOT_FOUND);
}

Which produces a file along these lines:

#define CORECOUNT 16
#define LAYERCOUNT 4
#define INITWIDTHARRAY {8,17,29,12}

And finally, in the cl file the #include and an example to remind me:

#include "/home/linaro/Work/nnP/"
/// contains #defines for all static variables
/// example contents of
///#define CORECOUNT 16
///#define LAYERCOUNT 4
///#define OUTPUTLAYER 3          
///#define MAXWEIGHTTOLAYER 1024
///#define LARGESTINPUTLAYER 32   
///#define INITWIDTHARRAY {32,32,16,16}

And later the declarations:

    /// local storage
    __private int   widths[] = INITWIDTHARRAY;
    __private float wgt[MAXWEIGHTTOLAYER];
    __private float biases[LARGESTDERIVEDLAYER];
    __private float in[LARGESTINPUTLAYER];
    __private float derived[LARGESTDERIVEDLAYER];

Works a treat!

Intercore Communication

If you've read my very first post on the design of the neural network simulator, you will have seen that each intermediate value calculated by one core needs to be distributed to all of the other cores for use in the next round. To do this I use this little gem which was brought to my attention by djm:

#define LOCAL_MEM_ADDRESS_BASE(gid) (((32 + ((gid) / 4)) << 26) | ((8 + ((gid) % 4)) << 20))

This calculates the memory base address of a core given the global id (int gid = get_global_id(0)). It is defined as a macro to avoid the overhead of a function call.

To use this you then need to provide the offset of the structure into which you want to write your value:

*(float *)(LOCAL_MEM_ADDRESS_BASE(gid_next) + ((unsigned int) derived) + (index * sizeof(float))) = value_being_passed;

The components of this snippet are:

  • LOCAL_MEM_ADDRESS_BASE calculates the beginning of the memory range
  • ((unsigned int) derived) is the address of (pointer to) the derived value array, cast to an unsigned int for the addition
  • (index * sizeof(float)) is the offset of the element into which we want to write value_being_passed. Don't forget, this is all byte arithmetic, so you need to multiply the index by the size of the contained type to land up in the right place.
One BIG note here: this will only work if the code on the core being written to is the same as on the core doing the writing. The location of the remote array is based on its address on the local core. If the core being written to has allocated the array in a different place, you'll be writing somewhere you don't expect.


Here is an example from my feed forward kernel:

gid_next = (gid == (CORECOUNT - 1)) ? 0 : gid + 1;
while (gid_next != gid)
{
    for (n = localFirstNode; n < localLastNode; n++)
        *(float *)(LOCAL_MEM_ADDRESS_BASE(gid_next) + ((unsigned int) derived) + (n * sizeof(float))) = derived[n];
    gid_next = (gid_next == CORECOUNT - 1) ? 0 : gid_next + 1;
    barrier(CLK_GLOBAL_MEM_FENCE);  // keep the cores in step on each pass
}

//  debug - watch the derived values arriving at core 0 from the other nodes
//  if (gid == 0)
//      for (i=0; i<curLayerWidth; i++)
//          debug[d++] = derived[i];


Each core iterates through the values it has derived and writes them into the derived array of every other core.

Note the barrier at the end of each iteration through the nodes. This is to keep all cores in sync while they go about messing with each other's memory. This is handy during debugging, but barriers should be used sparingly, given they stall all cores until every core has reached the barrier. This ensures that the slowest core determines the execution time of all cores up to that point. In this case I don't believe they are necessary, because I don't need anything else to be ready before I do the next write. I'll check that out when I focus on performance rather than correctness.

I've left some debugging code in this snippet. I'll cover that below.

A Few Notes About Performance and Efficiency

While I have not spent much time on performance or efficiency, there were a few things that occurred to me while I was writing this kernel.

One thing was obvious enough to be put in from the beginning. I pass all the weights needed to calculate the output for the whole network into one big array called weights. This array dominates the space requirement for the calculation. For this reason, I copy only the weights needed to calculate the nodes assigned to each core into a local array called wgt. In this way, not only is the calculation spread out over all the cores, but the space requirement is as well. (NB: now that I come to write this, I notice that I've actually allocated enough space on each core for all the weights between the biggest two layers, 16 times as much as each actually needs. I'll fix that up in the next version.)

There is also something I can improve on the host side. In the structure I've presented above, the host (the ARM cores) provides the input data to the accelerator, fires off the kernel and then waits until it is done. With neural networks, and many other applications, the kernel is called multiple times with new data. It would therefore be more efficient for the host to leave the accelerator running while it collects the next input set, and only wait once the next input set is ready to be sent. The execution time would then be the slower of the host and the accelerator rather than the sum of the two.

The inter-core communication method I've described above also leaves a lot to be desired. Each core writes to each other core's memory regardless of how far away it is in the mesh. By my calculation, if each core wrote one value to each other core, that would generate 640 traversals of inter-core mesh links (summing the hop counts over every ordered pair of cores). It's hard to determine how long this would take, given that the writes happen in parallel and there would be clashes, but I'm sure there is a better way. Two possibilities come to mind:

  • Use the base address of the core to determine where it is in the mesh and then write only to its nearest neighbour in a predefined topology, starting with the value the core calculated itself and then passing on values it received from its neighbours.
  • Use the Epiphany's broadcast mechanism.
I've got no idea how to do either of these just yet so they will have to be the subject of another post.


Debugging When There Is No Debugger

I mentioned in my last post that I did not have any luck getting the Epiphany debugger going with OpenCL code. This is still the case.

Because I was mainly working with floats, I declared an array of 2048 floats on the host. I passed this to the kernel and gave myself an index, int d = 0;. As the kernel executed, I wrote values directly into the __global array, incrementing d on each assignment. I could write from one core, as in the debug example above. Or, if I wanted, say, 18 values from each core, I reinitialised d as 18 times the global id and each core wrote to a different part of the array. Then, on the host side, I pulled the values back just like my output values and wrote them to standard output separated by commas. I then copied the output to a file and loaded it into a spreadsheet as a csv file. I had to make sense of it with a lot of copying and pasting, but at least I could see what was going on. Primitive, but it was the best I came up with and it did get the job done. I've left a few examples in the source on github.

The process I used was:

  • Check the input using the debugger on the host side. If you don't start from a known position you'll have no hope of understanding what's going on in the kernel.
  • Check the input on the accelerator side. Copy the input from the input arrays directly to the debug array and make sure that the data is arriving as you expect.
  • Follow the execution through, writing the intermediate results to the debug array.
  • Keep the execution constrained using barriers so that you know where the cores are up to when you write to the debug array.
A few tips that I picked up along the way:

  • If you get a Segmentation Fault, the problem is on the host side. It may show up during a clmsync, but it is a host problem. I found that it was usually because I had not allocated enough space in an array.
  • If there is a problem on the accelerator side, the kernel will not return. Kill the waiting host program, scratch your head for a while and try again.
  • Initialise your arrays so that you can tell the difference between a calculated value that is wrong and a memory location that has not been written to. When you are dealing with lots of floats, the rubbish that is in the array before it is used looks very much like wrong answers.

Up Next...

I hope this summary of what I've learnt to date on a "serious" working kernel is of assistance to you. If you want to know more about the neural network side, I'll annotate the code on github, which will hopefully give you a better idea of how it works.

From here I'm going to move on to the simplest implementation of the back propagation algorithm that I can write. Given that the feed forward pass is the first part of the training sequence, followed by back propagation, I'll have to figure out how best to get the two to work together. Along the way I'll figure out how performance benchmarking works, so that as I work on increasing performance I'll be able to tell whether my efforts have paid off.

As always, please let me know if I've missed anything or if you have any questions...

Wednesday, 10 December 2014

Debugging with gdb in Code::Blocks

Part 2.1 Debugging the Host code

In my last post I covered setting up the GCC and Brown Deer OpenCL compilers to develop on the Parallella. In this post I'll cover debugging host code. Hopefully I'll be in a position to provide some thoughts on debugging accelerator code at some point in the future. I have to admit to having completely failed to get the e-gdb Epiphany debugger working with OpenCL. I'll discuss what happened at the end of this post.

Again I am indebted to AndyC for his assistance with this process. What I have written below is really just the same process he described, with pictures.

Setting Up

To start, you need to set up your debugger. This is in a similar vein to setting up a new compiler but simpler. You get there via the Settings | Debugger... menu.

Initially, you get the default settings and some common entries.

To create a new debugger configuration, press the Create Config button. You could just edit the default entry, but I prefer to leave the defaults where they are and work off copies. That way it is obvious what you have changed if you need to tweak the settings later on.

I gave my host debugger the name Debug Host to make it really obvious. I also created one for the epiphany debugger but, as I mentioned above, that did not lead to a happy outcome.

The critical string is the Executable path: which is:

sudo -E LD_LIBRARY_PATH=/opt/adapteva/esdk/tools/host/lib EPIPHANY_HDF=/opt/adapteva/esdk/bsps/current/platform.hdf /usr/bin/gdb

Similar to the Brown Deer compiler settings, we need to launch gdb as a super user and therefore have to pass in our environment settings LD_LIBRARY_PATH and EPIPHANY_HDF.

The box is red because Code::Blocks expects a path only and the sudo command leaves it confused. Just leave it red - it will still work.

You also need to make sure that gdb executes in the same directory as the executable, so change to it using the Debugger initialization commands. I prefer to use the full path, thereby eliminating the possibility that you are not where you think you are and bin/Debug does not exist in the directory tree below you.

Press OK.

The Compiler settings

Now you need to associate the output of the GCC compiler with this debugger configuration.

Go to the Settings | Compiler... menu and (with GNU GCC compiler selected) click the Toolchain executables tab.

Your new debugger configuration will now appear in the Debugger: selection list. Choose it and press OK.

For the debugger to be actually useful, it needs an executable with debugging symbols. To produce these the Produce debugging symbols [-g] option needs to be set when compiling. This can be done either in the Global compiler settings dialog (in the Compiler settings tab shown above) or in the Project build options dialog (via the Project | Build options... menu). If you want to deliver a "Release" version with a slightly smaller executable file, you would use the Project build options with -g set for the Debug version and without -g for the Release version.

Running, Breaking, Examining

Now you have associated your host executable with your Debug Host configuration.

To control the debugging session, there is the Debug menu and the Debug tool bar, shown below in the red box. 

The Debugging windows menu selection is used to select what you want to see during the debugging session. I find the Watches window the most useful. In your Debugger configuration you can ask it to automatically track local variables (way better than Eclipse where most of the values are "Optimized out" most of the time). The funny icon second from the right on the Debug toolbar (circled in red) is a quick way of getting to this selection.

To start a host code debugging session the host build target (in my example Debug) needs to be selected. Then give it somewhere to stop by setting a break point with F5 (in my example on line 80) and then press the red arrow.

After that, you are on your own. Use the Continue, Run to cursor, Next line and Step commands to navigate through your program and the various Debugging windows to figure out what is going on.

What about Epiphany Code?

Well, I don't know.

According to, Epiphany code uses the e-server program as a "host" on the ARM chip. The epiphany sends messages back to e-server, to which you connect using gdb's target remote command. You use the load command to load the code onto the epiphany and then set breakpoints etc. from the e-gdb command line.

For me, the load command failed every time. When I tried to do the same thing in Code::Blocks the message "(no debugging symbols found)" appeared during the initialization phase and I could not set breakpoints. I checked the compile commands that are sent to the Epiphany SDK and they all had -g set.

If you want to have a go here are my settings (that I pretty much guarantee will not work):

Then set your e-gdb config in your Brown Deer compiler settings.

In your project Build options... 

Note here that ndfork.elf is my build target for my OpenCL code.

And in your Project properties on the very last tab...

But use the IP address of the computer running e-server.

Good luck.

Final Words...

I really hope that someone can figure out how to debug OpenCL code running on the epiphany. I know that Adapteva is doing a lot of work on debugging at the moment and I'm confident that they will find a solution some time soon.

Thursday, 6 November 2014

Developing OpenCL with Code::Blocks on the Parallella

Part 1: Compiling and Linking OpenCL with Code::Blocks

In this post I'll go through the steps needed to set up the Code::Blocks IDE to compile and link both host code and accelerator (OpenCL) code using GNU GCC and the Brown Deer offline compiler clcc and linker clld.

While Brown Deer provide an OpenCL JIT compiler, using it during development can be tedious given that you can only tell if your OpenCL code is compiling correctly by running your application. Using the offline compiler from within Code::Blocks allows you to stay in one environment for both host and accelerator code.

I've been learning Code::Blocks because I was finding Eclipse tediously slow. This might improve in future, but for now Code::Blocks performs better on the Parallella and, critically, the compile time is as good as you would get using the command line. I found that Eclipse was a little better organised, but once I'd figured my way around Code::Blocks it was fine. Most of the oddities are in the compiler and project setup, and you don't tend to spend much time there once these are set.

I have to acknowledge the contribution of AndyC firstly for bringing my attention to Code::Blocks and then for helping me along the way when I got stuck. His original post is here:

I'm using the Ubuntu 14.04 distro with Brown Deer OpenCL ver. 1.6.0 and Code::Blocks 13.12.


1. Download Code::Blocks using: sudo apt-get install codeblocks

2. Rebuild the Brown Deer compiler! Yes, I was surprised too. The installation in the standard image has a problem with the linker. The getting started guide is here: it was written for version 1.5 but section 2.3 is relevant to version 1.6 as well.

Get the sources (git clone) and pay careful attention to the prerequisites: libconfig, libelf and libevent (GCC will already be installed). When I ran configure it could not find event.h. If this happens to you, first check whether you can find the library with whereis:

linaro-nano:~> whereis libevent
libevent: /usr/lib/ /usr/lib/ /usr/lib/libevent.a

This shows that libevent was installed, but I had to look into configure to discover that it was expecting event.h to be in .../include/event2. The ... turned out to be /usr, so I had to use --with-libevent=/usr. My final configure command was:

./configure --enable-epiphany --with-libevent=/usr

As with anything on Linux, if it does not work first time you have to sniff around a bit.

The make and sudo make install worked fine for me.


Code::Blocks has various ways of organising your work. In this post I'm going to concentrate on the setup you would use if you are primarily writing in C++ and OpenCL. Therefore I'm going to set the global settings to those used for OpenCL development. These will be inherited by all subsequent projects.

I'm also developing a command line project. Other types of project may need variations on these settings.

Setting up the Compilers

The first thing to do is to set up the compilers that you are going to use. You can do this before creating a project.

The object of this section is to set up a C++ compiler and an OpenCL compiler. If you only have one file with your accelerator code and want to keep your life a little simpler, read the sections on setting up GNU GCC and the basic build targets (Debug and Release) and use the method described in the last Project Setup section entitled Aside: Single File Compilation.


The "standard" compiler on the Parallella is the GNU GCC compiler and we'll use this for the host code.

To get to the compiler settings go to the Settings menu and choose Compiler... which will bring up this box:

GNU GCC is the default selected compiler and there are some standard flags set. These can stay where they are.

Go to the Linker settings tab and add the Brown Deer libraries that you will be using. Here I've only added stdcl and ocl, but for example, if you are using COPRTHR then add coprthr.

Next go to the Search directories tab and add:

and similarly add /usr/local/browndeer/lib to the Linker sub-tab.

The Toolchain executables and Custom variables tabs can be left alone. In the next tab along, Build options, I set the following, but these are not terribly important:

[Note: if you only want to compile one file with accelerator code in it, now is the time to skip forward to the section on Creating a Project and then the section titled: Aside: Single File Compilation.]

Brown Deer OpenCL

Next we have to set up the tool chain to compile the cl code. This is a little more complicated because the offline compiler has to be invoked with root privileges. We'll get to that at the very end.

First, create a copy of the GNU GCC compiler settings using the Copy button in the top section of the Global compiler settings box. This will first ask you for the name of the new compiler which I have called BD OpenCL:

This will create an exact copy of everything that we had for the GNU GCC compiler. From here on, make sure you select the BD OpenCL compiler otherwise you can get into a real mess.

First, switch off the standard flags:

Next, remove the settings from the Linker settings and Search directories tabs that we set for GNU GCC. These are not needed.

The next tab to look at is the Toolchain executables. This is critical. Ignoring the Debugger entry for now it must look like this:

The next tab, Custom variables is a way of setting "environment like" variables from within Code::Blocks. I found that I had to be explicit with all environment variables because my Linux user environment got a little mangled on the way in. I defined the following three:

These are merely local copies of your bash environment variables $PATH, $LD_LIBRARY_PATH and $EPIPHANY_HDF.

This next bit brings it all together.

Code::Blocks puts all of the settings together using a template and then passes the resulting string to a shell process. The template is found at the very end of the compiler settings - on the Other settings tab. Click the Advanced options button and don't be put off by the scary warning box:

Once you have pressed the Yes button with confidence, you get:

See, not that scary after all.

The template is the text in the Command line macro box. Because it is chopped off in the screenshot, the whole compiler string is:

sudo EPIPHANY_HDF=$EPIPHANY_HDF LD_LIBRARY_PATH=$LD_LIBRARY_PATH PATH=$PATH bash -c "$compiler $options $includes $file -o $object"

Take note of the Command macros: list on the right hand side of the box. If the compiler complains of something missing (e.g. if you want to use a resource compiler) you will have to find the macro in the list and include it in the appropriate place in the Command line macro.

Also, a quick word about sudo. Sudo is a funny little thing that gives you super user powers and little else. If you type sudo env at the command prompt you will see that sudo's environment is very sparse. The -E switch is supposed to preserve the caller's environment, but I found that the environment Code::Blocks passed in had LD_LIBRARY_PATH stripped out, and even when I set it explicitly, -E did not seem to pass it on. Therefore, the most reliable way of creating the correct environment for clcc was to set it explicitly at invocation.

Similarly, the linker template screen is:

The full string is:

sudo EPIPHANY_HDF=$EPIPHANY_HDF LD_LIBRARY_PATH=$LD_LIBRARY_PATH PATH=$PATH:/bin bash -c "$linker $link_options $libdirs $link_objects $libs  -o $exe_output $object"

Note that the Command: I've selected is Link object files to console executable. There are a number of different options here. You need to figure out which one is invoked based on your build target (see below).

Project Setup

We have not quite got to the stage of being able to compile an application yet. There are some project-specific steps needed first. Unfortunately they are a little circular and difficult to lay out in a nice linear fashion, so you might have to go through this process a couple of times before everything is in place.

Creating a Project

The File | New > Project... will do the job without any tricks. I've only used Console Applications so far and I generally assume that I'm writing in C++. You can leave the default compiler as GNU GCC.

Adding Files

At this point it is a good idea to add some files.

Adding a C++ file for the host code is also straightforward. Use the File | New > File... menu. When you get to the screen below, just add the file name with path and select the C++ build targets (more on this below).

Adding a CL file is essentially the same. I just choose an Empty C/C++ file and give it a name ending in .cl. Leave the build targets unchecked.

Once added it will appear in the Others folder in your project.

Build Targets

I think of a build target as an output that you want to produce in your project. For the purposes of this post, I want to produce a Linux console application containing the host code and an ELF (Executable and Linkable Format) file containing the accelerator code.

You have already created at least one build target when you created your project. The default targets are called Debug and Release and are created as a result of this step in the new project sequence:

To add, change or remove a build target you can either choose the Project | Properties menu or right click on the project and choose Properties. Either way you will get this screen:

The only changes I've made here are the Platforms (remove Windows and Mac) and Object names generation, because I like to give my main cpp file and cl file the same name.

The default build targets will be pretty much created correctly. The only little wrinkle is that the project creation dialog creates a "Hello World!" main.cpp and sets it as the build target file. As I said above, I like to name my main file after my project, so I have to create a new cpp file in the project, assign it to the Debug and Release build targets and then delete main.cpp.

[This next section is only relevant if you have set up the Brown Deer OpenCL compiler as a separate build step. If you are going to use the single .cl file option then skip forward to the section titled: Aside: Single File Compilation.]

To set up the build target for your cl files click on the Build targets tab and then click the Add button. In my example below, I've deleted the default Release build target and called my new target CL. Fill out the fields as shown here and select your cl file(s) in the Build target files box at the bottom:

Now that you have a build target for your accelerator code, you can select that when you create a new cl source file.

Project Build Options

Notice the Build options... button. This is where you choose which compiler is used for the selected build target and the default compiler switches for the project. Click Build options... or select Project | Build options from the main screen.

Two things to change here. With the CL build target selected, first set the Selected compiler to the OpenCL compiler we set up above, in this case we called it BD OpenCL. Second, set the Policy to be "Use target options only". The reason for this is that common compiler options (e.g. -Wall) are set globally and this setting overrides these global settings. To change the global settings click on the project name (in this case "ndfork") in the left hand panel.

This dialog is where you set all of your project specific settings. If there are include files, libraries, compiler variables that are only relevant to this project and not to all projects then dig around here and find where to set them. The Pre/post build steps tab can be useful if you want to do some behind-the-scenes shuffling before or after compilation. The output will appear on the compile log.

Final Step (the bit I always forget...)

The last thing you need to do is to set the build target and build steps for your cl file(s). In the project panel right-click on the cl file, choose Properties and then click on the Build tab:

Select both Compile file and Link file, and CL in the Belongs in targets box. For a cl file added before the CL build target was set up, none of these boxes will be checked. Once the CL build target exists you can select it when you add the file, but the Compile file and Link file check boxes are not selected by default, so if you wish to compile and link a file you must check these for every file you add.

Aside: Single File Compilation

By now you are probably quite astounded at the amount of complexity involved in getting the compilers and project set up. I was. If you are only going to have one cl file for your accelerator code and want to avoid some of that complexity, there is a way to compile your accelerator code without setting up a separate compiler and build targets. It is a "back door" way of compiling that bypasses the selected compiler.

If you right-click on the cl file, click Properties and select the Advanced tab you can define a Custom build. With the GNU GCC Compiler selected check the Use custom command to build this file: box. Then in the text box below it you need to put the whole compile string:

Note that the compiler has a full path and the options (-c and -v) are now explicit. This is because we are now using the settings for the GNU GCC compiler and are in its environment; the $options variable therefore refers to the GNU GCC options. The $file and $output variables will still work because they refer to this file.

I've left in the $EPIPHANY_HDF, $LD_LIBRARY_PATH and $PATH variables to be passed to sudo. If you want to do this you still have to set them up as custom variables (it would be better to set them using Project | Build options... rather than as a global setting associated with the compiler). If you include /usr/local/browndeer/bin in the $PATH then you can leave it out of the bash call.
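Putting those pieces together, the Custom build text box might contain something like the following. This is a sketch under assumptions: the clcc install path is typical for Brown Deer but may differ on your system, and $file and $output are the Code::Blocks macros described above.

```shell
sudo EPIPHANY_HDF=$EPIPHANY_HDF LD_LIBRARY_PATH=$LD_LIBRARY_PATH PATH=$PATH \
  bash -c "/usr/local/browndeer/bin/clcc -c -v $file -o $output"
```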

For those who skipped the OpenCL compiler setup, the values of the variables are those from your bash environment.


Compiling

Now that everything is in place we can compile a target.

From the main screen I use the tool bar but the Build menu provides the same functions or you can use Ctrl-F9 which must be the dumbest short-cut ever devised.

There are a couple of quirks to the OpenCL compilation. 

The offline compiler, clcc, creates an intermediate file which is then passed to the Epiphany compiler in the standard SDK. As a result, the file name associated with an error will look something like UfLa0a.cpp, so double-clicking on the error will not highlight it in the editor. Oddly enough, if you #include another cl file which has an error, the file name associated with that error is correct and double-clicking works.

The intermediate stage also means that any error causes a cascade of errors after it. Just start from the top of the list, fixing and recompiling as you go... the actual errors should be pretty obvious.

Adding a Virtual Target

If we leave things as they are now, we will have to build twice, once for each build target. If this is too tedious, Code::Blocks allows the creation of virtual targets which are combinations of "real" targets.

To create a virtual target go to the build targets tab in the project properties box (Project | Properties > Build targets). From here click the Virtual targets... button:

All pretty simple, but there is one trick here. Click on your cl target first (CL in my example) and close. Click Virtual targets... again and click on your host code target (Debug in the example). This ordering makes the cl target compile first so that (if it compiles correctly) the object file is ready for linking (see linking below).

One word of warning here... check your target files after you do this. My cl file was switched off in the CL target and switched on in the Debug target.

Now you can choose your virtual target on the button bar and have both targets compile sequentially.

Checking your ELF file

Brown Deer have included a version of the nm utility called clnm. This will show you the contents of the elf file:

If everything worked properly you should see your kernels in the list and some entries showing your source code if you have not used the -b (only binaries) compile option. (I'm not exactly sure why the kernels appear twice.)
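A hedged example invocation, assuming your CL build target produced files named after the project (ndfork here; your object and elf names will follow your own project):

```shell
# Inspect the object file produced by the CL build target.
clnm obj/CL/main.o

# Or inspect the elf, if your build produced one.
clnm ndfork.elf
```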

Using Compiled Accelerator Object Code

At this point we have generated a .elf file and confirmed its contents. I believe that there is a way of using the .elf file directly but I can't find a description of how to do it; if anyone can give me a hand on this I'd appreciate it. While working on the debugging I had a thought: "Maybe I can link the elf file just like the object file?" After all, an elf seems to be analogous to a DLL on Windows. It turns out that this is the case, so the method described below works for either the object file or the elf file.


The method described in the Brown Deer documentation uses the object file containing your accelerator code and links it into your executable. To do this you need to add the accelerator code object file onto the end of the link stage of the main executable. Go to the Project | Build options... click on the options for your main executable and then the Linker settings tab:

The Other linker options string must be the path to the cl object code. This will be added onto the end of the list of object files that are passed to the linker.
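For reference, the final link command that Code::Blocks issues might then expand to something like this. The library names and directories are assumptions based on a typical Brown Deer stdcl install; check your own compile log for the exact string.

```shell
g++ -o bin/Debug/ndfork obj/Debug/main.o \
    -L/usr/local/browndeer/lib -lstdcl -lOpenCL \
    obj/CL/main.o
```

Note how the cl object file from the Other linker options string lands at the end of the object list, after the host object and libraries.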


Calling your kernel from your host code is a tiny change. Your call to clopen using the JIT compiler would look like this:

void* clhandle = clopen(context, (char*)string_to_cl_file, CLLD_NOW);

With the code linked in:

void* clhandle = clopen(context, 0, CLLD_NOW); 

The path to your accelerator code has been replaced with a ZERO! That's it... nothing more to do. The short pause that happens as the JIT compiler goes about its business will disappear and everything else will look the same.

Final Thoughts

I hope that this guide has saved you some time and I hope Code::Blocks suits your way of working. As always, please let me know if I've missed anything or if any bits are confusing in any way.

Up next in Part 2 I'm going to get into some Debugging action. I split the two up because this guide is big enough as it is. I have not gotten into it yet so I hope that there will not be any back tracking to do on the compiler setup.