Thursday, 6 November 2014

Developing OpenCL with Code::Blocks on the Parallella

Part 1: Compiling and Linking OpenCL with Code::Blocks


In this post I'll go through the steps you need to set up the Code::Blocks IDE to compile and link both host code and accelerator (OpenCL) code, using GNU GCC and the Brown Deer offline compiler, clcc, and linker, clld.

While Brown Deer provide an OpenCL JIT compiler, using it during development can be tedious given that you can only tell if your OpenCL code is compiling correctly by running your application. Using the offline compiler from within Code::Blocks allows you to stay in one environment for both host and accelerator code.

I've been learning Code::Blocks because I was finding Eclipse tediously slow. This might improve in the future, but for now Code::Blocks performs better on the Parallella and, critically, the compile time is as good as you would get from the command line. I found that Eclipse was a little better organised, but once I'd figured out my way around Code::Blocks it was fine. Most of the oddities are in the compiler and project setup and you don't tend to spend much time there once these are set.

I have to acknowledge the contribution of AndyC firstly for bringing my attention to Code::Blocks and then for helping me along the way when I got stuck. His original post is here: http://forums.parallella.org/viewtopic.php?f=13&t=1658.

I'm using the Ubuntu 14.04 distro with Brown Deer OpenCL ver. 1.6.0 and Code::Blocks 13.12.






Prerequisites


1. Install Code::Blocks with: sudo apt-get install codeblocks

2. Rebuild the Brown Deer compiler! Yes I was surprised too. The installation in the standard image has a problem with the linker. The getting started guide is here: http://www.browndeertechnology.com/docs/Parallella_quick_start_guide.pdf. It was written for version 1.5 but section 2.3 is relevant to version 1.6 as well.

Get the sources: git clone https://github.com/browndeer/coprthr.git and pay careful attention to the prerequisites: libconfig, libelf and libevent (GCC will already be installed). When I ran configure it could not find event.h. If this happens to you, first check whether the library is installed with whereis:

linaro-nano:~> whereis libevent
libevent: /usr/lib/libevent.so /usr/lib/libevent.la /usr/lib/libevent.a

This shows that libevent was installed, but I had to look into configure to discover that it was expecting event.h to be in .../include/event2. The ... turned out to be /usr, so I had to use --with-libevent=/usr. My final configure command was:

./configure --enable-epiphany --with-libevent=/usr

As with anything on Linux, if it does not work first time you have to sniff around a bit.

The make and sudo make install worked fine for me.
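
To recap, the whole build sequence (assuming the libevent wrinkle above applies to you too) was:

git clone https://github.com/browndeer/coprthr.git
cd coprthr
./configure --enable-epiphany --with-libevent=/usr
make
sudo make install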


Code::Blocks


Code::Blocks has various ways of organising your work. In this post I'm going to concentrate on the setup you would use if you are primarily writing in C++ and OpenCL. Therefore I'm going to set the global settings to be those used for OpenCL development. These will be inherited by all subsequent projects.

I'm also developing a command line project. Other types of project may need variations on these settings.

Setting up the Compilers


The first thing to do is to set up the compilers that you are going to use. You can do this before creating a project.

The object of this section is to set up a C++ compiler and an OpenCL compiler. If you only have one file with your accelerator code and want to keep your life a little simpler, read the sections on setting up GNU GCC and the basic build targets (Debug and Release) and use the method described in the last Project Setup section entitled Aside: Single File Compilation.

GNU GCC


The "standard" compiler on the Parallella is the GNU GCC compiler and we'll use this for the host code.

To get to the compiler settings go to the Settings menu and choose Compiler... which will bring up this box:


GNU GCC is the default selected compiler and there are some standard flags set. These can stay where they are.

Go to the Linker settings tab and add the Brown Deer libraries that you will be using. Here I've only added stdcl and ocl, but for example, if you are using COPRTHR then add coprthr.




Next go to the Search directories tab and add /usr/local/browndeer/include to the Compiler sub-tab:



and similarly add /usr/local/browndeer/lib to the Linker sub-tab.

The Toolchain executables and Custom variables tabs can be left alone. In the next tab along, Build options, I set the following, but these settings are not terribly important:



[Note: if you only want to compile one file with accelerator code in it, now is the time to skip forward to the section on Creating a Project and then the section titled: Aside: Single File Compilation.]

Brown Deer OpenCL


Next we have to set up the tool chain to compile the cl code. This is a little more complicated because the offline compiler has to be invoked with root privileges. We'll get to that at the very end.

First, create a copy of the GNU GCC compiler settings using the Copy button in the top section of the Global compiler settings box. This will first ask you for the name of the new compiler which I have called BD OpenCL:



This will create an exact copy of everything that we had for the GNU GCC compiler. From here on, make sure you select the BD OpenCL compiler otherwise you can get into a real mess.

First, switch off the standard flags:


Next, remove the settings from the Linker settings and Search directory settings that we set for GNU GCC. These are not needed.

The next tab to look at is the Toolchain executables. This is critical. Ignoring the Debugger entry for now it must look like this:


The next tab, Custom variables is a way of setting "environment like" variables from within Code::Blocks. I found that I had to be explicit with all environment variables because my Linux user environment got a little mangled on the way in. I defined the following three:


These are merely local copies of your bash environment variables $PATH, $LD_LIBRARY_PATH and $EPIPHANY_HDF.

This next bit brings it all together.

Code::Blocks puts all of the settings together using a template and then passes the resulting string to a shell process. The template is found at the very end of the compiler settings - on the Other settings tab. Click the Advanced options button and don't be put off by the scary warning box:


Once you have pressed the Yes button with confidence, you get:


See, not that scary after all.

The template is the text in the Command line macro box. Because it is chopped off in the screenshot, the whole compiler string is:

sudo EPIPHANY_HDF=$EPIPHANY_HDF LD_LIBRARY_PATH=$LD_LIBRARY_PATH PATH=$PATH bash -c "$compiler $options $includes $file -o $object"
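
To make that concrete: with clcc as the compiler executable and -c -v as the only options (as set up here), a file called kernel.cl (an illustrative name, not one from this project) would expand to something like:

sudo EPIPHANY_HDF=$EPIPHANY_HDF LD_LIBRARY_PATH=$LD_LIBRARY_PATH PATH=$PATH bash -c "clcc -c -v kernel.cl -o kernel.o"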

Take note of the Command macros: list on the right hand side of the box. If the compiler complains of something being missing (e.g. if you want to use a resource compiler) you will have to find the macro in the list and include it in the appropriate place in the Command line macro.

Also, a quick word about sudo. Sudo is a funny little thing that gives you super user powers and little else. If you type sudo env at the command prompt you will see that sudo's environment is very sparse. The -E switch is supposed to preserve the caller's environment, but I found that Code::Blocks stripped LD_LIBRARY_PATH out of the environment, and even after setting it explicitly, -E did not seem to pass it on. Therefore, the most reliable way of creating the correct environment for clcc was to set it explicitly at invocation.

Similarly, the linker template screen is:


The full string is:

sudo EPIPHANY_HDF=$EPIPHANY_HDF LD_LIBRARY_PATH=$LD_LIBRARY_PATH PATH=$PATH:/bin bash -c "$linker $link_options $libdirs $link_objects $libs  -o $exe_output $object"

Note that the Command: I've selected is Link object files to console executable. There are a number of different options here. You need to figure out which one is invoked based on your build target (see below).

Project Setup


We have not quite gotten to the stage of being able to compile an application yet. There are some project specific steps that are needed first. Unfortunately they are a little circular and difficult to lay out in a nice linear fashion so you might have to go through this process a couple of times before everything is in place.

Creating a Project


The File | New > Project... menu will do the job without any tricks. I've only used Console Applications so far and I generally assume that I'm writing in C++. You can leave the default compiler as GNU GCC.


Adding Files

At this point it is a good idea to add some files.


Adding a C++ file for the host code is also straightforward. Use the File | New > File... menu. When you get to the screen below, just add the file name with its path and select the C++ build targets (more on this below).



Adding a CL file is essentially the same. I just choose an Empty C/C++ file and give it a name ending in .cl. Leave the build targets unchecked.



Once added it will appear in the Others folder in your project.

Build Targets


I think of a build target as an output that you want to produce in your project. For the purposes of this post, I want to produce a Linux console application containing the host code and an ELF (Executable and Linkable Format) file containing the accelerator code.

You have already created at least one build target when you created your project. The default targets are called Debug and Release and are created as a result of this step in the new project sequence:




To add, change or remove a build target you can either choose the Project | Properties menu or right click on the project and choose Properties. Either way you will get this screen:




The only changes I've made here are the Platforms (remove Windows and Mac) and Object names generation, because I like to give my main cpp file and cl file the same name.

The default build targets will be created pretty much correctly. The only little wrinkle is that the project creation dialog creates a "Hello World!" main.cpp and sets this as the build target file. As I said above, I like to name my main file after my project, so I have to create a new cpp file in the project, assign it to the Debug and Release build targets and then delete main.cpp.

[This next section is only relevant if you have set up the Brown Deer OpenCL compiler as a separate build step. If you are going to use the single .cl file option then skip forward to the section titled: Aside: Single File Compilation.]

To set up the build target for your cl files, click on the Build targets tab and then click the Add button. In my example below, I've deleted the default Release build target and called my new target CL. Fill out the fields as shown here and select your cl file(s) in the Build target files box at the bottom:



Now that you have a build target for your accelerator code, you can select that when you create a new cl source file.

Project Build Options


Notice the Build options... button. This is where you choose which compiler is used for the selected build target and the default compiler switches for the project. Click Build options... or select Project | Build options from the main screen.



Two things to change here. With the CL build target selected, first set the Selected compiler to the OpenCL compiler we set up above, in this case called BD OpenCL. Second, set the Policy to "Use target options only". The reason for this is that common compiler options (e.g. -Wall) are set globally, and this setting overrides them. To change the global settings, click on the project name (in this case "ndfork") in the left hand panel.

This dialog is where you set all of your project specific settings. If there are include files, libraries, compiler variables that are only relevant to this project and not to all projects then dig around here and find where to set them. The Pre/post build steps tab can be useful if you want to do some behind-the-scenes shuffling before or after compilation. The output will appear on the compile log.

Final Step (the bit I always forget...)


The last thing you need to do is to set the build target and build steps for your cl file(s). In the project panel right-click on the cl file and choose Properties and then click on the Build  tab:





Select both Compile file and Link file, and check CL in the Belongs in targets box. For any cl file that you added before setting up the CL build target, none of these boxes will be checked. Once you have added the CL build target you can select it as the build target when you add a file, but the Compile file and Link file check boxes are not selected by default, so if you wish to compile and link a file you must check these boxes for every file you add.


Aside: Single File Compilation


By now you are probably quite astounded at the amount of complexity involved in getting the compilers and project set up. I was. If you are only going to have one cl file for your accelerator code and want to avoid some of that complexity, there is a way to compile your accelerator code without setting up a separate compiler and build targets. It is a "back door" way of compiling that bypasses the selected compiler.

If you right-click on the cl file, click Properties and select the Advanced tab you can define a Custom build. With the GNU GCC Compiler selected check the Use custom command to build this file: box. Then in the text box below it you need to put the whole compile string:



Note that the compiler has a full path and the options (-c and -v) are now explicit. This is because we are now using the settings for the GNU GCC compiler and are in its environment, so the $options variable refers to the GNU GCC options. The $file and $output macros will still work because they refer to this file.
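
In case the screenshot is hard to read, my best reconstruction of that string, based on the compiler template above, is:

sudo EPIPHANY_HDF=$EPIPHANY_HDF LD_LIBRARY_PATH=$LD_LIBRARY_PATH PATH=$PATH bash -c "/usr/local/browndeer/bin/clcc -c -v $file -o $object"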

I've left in the $EPIPHANY_HDF, $LD_LIBRARY_PATH and $PATH variables to be passed to sudo. If you want to do this you still have to set them up as custom variables (it would be better to set them using Project | Build options... rather than as a global setting associated with the compiler). If you include /usr/local/browndeer/bin in the $PATH then you can leave it out of the bash call.



For those who skipped the OpenCL compiler setup, the values of the variables are those from your bash environment.

Compiling


Now that everything is in place we can compile a target.



From the main screen I use the tool bar but the Build menu provides the same functions or you can use Ctrl-F9 which must be the dumbest short-cut ever devised.

There are a couple of quirks to the OpenCL compilation. 

The offline compiler, clcc, creates an intermediate file which is then passed to the Epiphany compiler in the standard SDK. This means that the file name associated with an error will look something like UfLa0a.cpp, so double-clicking on the error will not highlight it in the editor. Oddly enough, if you #include another cl file which has an error, the file name associated with that error is correct and double-clicking works.

The intermediate stages also mean that any error causes a cascade of errors after it. Just start from the top of the list, recompiling as you go... the actual errors should be pretty obvious.

Adding a Virtual Target


If we leave things as they are now, we will have to build twice, once for each build target. If this is too tedious, Code::Blocks allows the creation of virtual targets which are combinations of "real" targets.

To create a virtual target go to the build targets tab in the project properties box (Project | Properties > Build targets). From here click the Virtual targets... button:


All pretty simple but there is one trick here. Click on your cl target first (CL in my example) and close. Click Virtual targets... again and click on your host code target (Debug in the example). This gets the cl target to compile first, which (if it compiles correctly) will produce the object file ready for linking (see Linking below).

One word of warning here... check your target files after you do this. My cl file had been switched off in the CL target and switched on in the Debug target.

Now you can choose your virtual target on the button bar and have both targets compile sequentially.

Checking your ELF file


Brown Deer have included a version of the nm utility called clnm. This will show you the contents of the elf file:


If everything worked properly you should see your kernels in the list, plus some entries showing your source code if you have not used the -b (binaries only) compile option. (I'm not exactly sure why the kernels appear twice.)
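
The invocation is just clnm followed by your elf file, something like this (the file name is illustrative):

clnm ndfork.elf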

Using Compiled Accelerator Object Code


At this point we have generated a .elf file and confirmed its contents. I believe that there is a way of using the .elf file directly but I can't find a description of how to do it. If anyone can give me a hand on this I'd appreciate it. While working on the debugging I had a thought: "Maybe I can link the elf file just like the object file?" After all, an elf seems to be analogous to a DLL on Windows. It turns out that this is the case, so the method described below works for either the object file or the elf file.

Linking


The method described in the Brown Deer documentation takes the object file containing your accelerator code and links it into your executable. To do this you need to add the accelerator code object file to the end of the link stage of the main executable. Go to Project | Build options..., click on the options for your main executable and then the Linker settings tab:


The Other linker options string must be the path to the cl object code. This will be added onto the end of the list of object files that are passed to the linker.

Calling


Calling your kernel from your host code is a tiny change. Your call to clopen using the JIT compiler would look like this:

void* clhandle = clopen(context, (char*)string_to_cl_file, CLLD_NOW);

With the code linked in:

void* clhandle = clopen(context, 0, CLLD_NOW); 

The path to your accelerator code has been replaced with a ZERO! That's it... nothing more to do. The short pause that happens as the JIT compiler goes about its business will disappear and everything else will look the same.
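
For completeness, the surrounding host code in the linked-in case looks roughly like this - a sketch on my part, with an illustrative kernel name:

void* clhandle = clopen(context, 0, CLLD_NOW);                 /* 0 means: use the cl code linked into the executable */
cl_kernel krn = clsym(context, clhandle, "my_kern", CLLD_NOW); /* kernel lookup by name is unchanged */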

Final Thoughts


I hope that this guide has saved you some time and I hope Code::Blocks suits your way of working. As always, please let me know if I've missed anything or if any bits are confusing in any way.

Up next in Part 2 I'm going to get into some Debugging action. I split the two up because this guide is big enough as it is. I have not gotten into it yet so I hope that there will not be any back tracking to do on the compiler setup. 

Cheers.

Saturday, 16 August 2014

What was that NDRange thing?

Digging a little deeper and pulling apart the OpenCL parts of the Mandelbrot Example

(This blog post refers to version 1.6.0 of the Brown Deer OpenCL libraries on the Parallella. If you are using newer versions there could be significant differences.)

I thought I'd stop and go back over the Mandelbrot example and make sure I understood how the thing worked. 

There were two things that seemed odd. Firstly, there was only one fork command (or rather a forka command) where I thought there would have been 16; and secondly, what was that clndrange_init1D command doing?

Turns out that the two are intimately linked. The ndRange controls the number of calls to your kernel and the fork kicks off the process. If you use forka you can pass in additional arguments. 

The nd part is short for n-Dimensional, so clndrange_init1D seems to suggest that the space created in the malloc statement is treated as a 1-dimensional space. Therefore, clndrange_init2D and clndrange_init3D would somehow create 2D and 3D spaces - or that is what I thought. It turns out that things are not exactly as clever as they might first appear.

Let's start from the beginning. OpenCL has the standard call:

clEnqueueNDRangeKernel(cl_command_queue queue,
                       cl_kernel kernel,
                       cl_uint work_dims,
                       const size_t *global_work_offset,
                       const size_t *global_work_size,
                       const size_t *local_work_size,
                       cl_uint num_events,
                       const cl_event *wait_list,
                       cl_event *event);


Where:
  • work_dims is the number of dimensions (1, 2 or 3)
  • global_work_offset is the global ID offsets in each dimension
  • global_work_size is the number of work-items in each dimension
  • local_work_size is the number of work-items in a work-group, in each dimension

The last three are pointers to arrays of integers, with global_work_offset[0] referring to the 1st dimension, global_work_offset[1] referring to the 2nd dimension and so on.
(Note: this is the "official" version of what these are supposed to do)
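
For reference, a plain 1D launch through this API looks something like the following sketch (assuming queue and kernel have already been created and the kernel arguments set):

size_t global_size[1] = { 768 };  /* total work-items in dimension 0 */
size_t local_size[1]  = { 16 };   /* work-items per work-group in dimension 0 */

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_size, local_size, 0, NULL, NULL);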


The Brown Deer stdcl library replaces this one call with a choice of:

clndrange_init1d(gtoff0, gtsz0, ltsz0) OR
clndrange_init2d(gtoff0, gtsz0, ltsz0, gtoff1, gtsz1, ltsz1) OR
clndrange_init3d(gtoff0, gtsz0, ltsz0, gtoff1, gtsz1, ltsz1, gtoff2, gtsz2, ltsz2)

followed by:

clfork(context, devnum, kernel, ndrange, flags ) OR, if you want to pass additional arguments:
clforka(context, devnum, kernel, ndrange, flags [, arg0, ..., argN ]) 

where 

gtoff0 is the global_work_offset for dimension zero
gtsz0 is the global_work_size for dimension zero and
ltsz0 is the local_work_size for dimension zero etc.

So clndrange_init?d is a convenient way of setting those values, which are then passed into the fork command via the clndrange_t* variable.
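
With stdcl, the raw launch sketched above collapses to something like this (assuming krn is a valid kernel handle):

clndrange_t ndr = clndrange_init1d(0, 768, 16);
clfork(stdacc, 0, krn, &ndr, CL_EVENT_WAIT);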

So, what does it actually do?


clndrange_init1D


I started with a 1 dimensional experiment. After much digging and experimentation here's the DIRT!

global_work_offset does NOTHING. Seriously! Don't believe me? Check out the Khronos documentation here: http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
global_work_size is the number of times your kernel will be called (one call, in OpenCL talk, is one work-item) and
local_work_size is the number returned to you in your kernel when you call get_local_size(0)

That's it! No fancy partitioning of the data space. Nothing. It took me quite a while to realise that it was that simple. 

There are some quirks to this call. The combination of global_work_size and local_work_size is a bit sensitive. I have not tried every combination available, but changes to global_work_size and local_work_size often produce a segmentation fault. For global_work_size less than 16 there seem to be problems if it and local_work_size are not equal. Above 16 you may get the error "Exceeded maximum thread block size", although despite the error it sometimes still seems to work. It does seem to work when global_work_size is a multiple of 16. If things are not right you can encounter Segmentation faults and Invalid Instruction faults, or you may find that your kernel does not seem to be called at all. (I have used 16 here because I'm using an Epiphany-16; strictly, I should be referring to the number of cores on your accelerator.)

Let's have a look what the mandelbrot example did.

First, it allocated some space in the standard accelerator context stdacc with:

cl_uchar* pixels = (cl_uchar*) clmalloc(stdacc, width * height * 3, 0);

(The *3 is because it creates an RGB bit map which needs 3 bytes per pixel.)

After grabbing the compiled kernel from the context with clsym(...), it calls:

clndrange_t ndr = clndrange_init1d(0, height, 16);

so it wants its kernel to be called height times. The 16 is actually ignored because the kernel never calls get_local_size(0).

Finally it calls:

clforka(stdacc, 0, krn, &ndr, CL_EVENT_WAIT, iterations, width, startx, starty, dx_over_width, dy_over_height, pixels);

where it passes in the width of each line as an argument along with the data structure, pixels. The kernel then generates one horizontal line of the final picture, using the call get_global_id(0) to determine which line it is on. (A slightly cleaner way would have been to pass width as the third argument in the clndrange_init1d() and then to call get_local_size(0); see the sketch below.)
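
That cleaner variant would look something like this - an untested sketch on my part:

/* host side: width becomes the local size instead of a kernel argument */
clndrange_t ndr = clndrange_init1d(0, height, width);
clforka(stdacc, 0, krn, &ndr, CL_EVENT_WAIT, iterations, startx, starty, dx_over_width, dy_over_height, pixels);

/* kernel side: recover width instead of receiving it as a parameter */
int width = get_local_size(0);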

So much for clndrange_init1D. The mandelbrot set is a good example of a problem where the calculation of each point is independent of the next. If you have such a problem then this simple model will be sufficient.

Next I tried a 2 dimensional version.


clndrange_init2D


Let me first say that I found clndrange_init2D to be a little temperamental. It seems that some combinations of values result in the kernel not being called at all. What's more, once it stopped working, regressing to the previous state did not get it working again. It was very frustrating. Therefore, everything written below must be read with the knowledge that I gave up without fully understanding what was going on.

The 2D call seems (see previous paragraph) to call the kernel gtsz0 multiplied by gtsz1 times. The local size values ltsz0 and ltsz1 are merely returned by get_local_size(0|1).

clndrange_init2d( gtoff0, gtsz0, ltsz0, gtoff1, gtsz1, ltsz1)

gtoff0: NULL! Nothing to do here
gtsz0: the number of calls for the first dimension 
ltsz0: the value returned by calling get_local_size(0)
gtoff1: NULL, as you might expect
gtsz1: the number of calls for the second dimension (one for each gtsz0 call)
ltsz1: the value returned by calling get_local_size(1)

I didn't go into clndrange_init3D but I'd lay money on it working in the same way as clndrange_init2D.


nD Example 


I took the mandelbrot example and removed all the fancy mathematics. I allocated two chunks of global memory and wrote a 1D kernel and a 2D kernel that just initialised the space to a given value.

The critical bits are:

int bytesPerCore = 16;  // how many bytes we want each core to process
int workItems = 32;     // the total number of workItems (threads sent to the Epiphany)

wrkArea1D = (cl_uchar*) clmalloc(stdacc, workItems * bytesPerCore, 0);
wrkArea2D = (cl_uchar*) clmalloc(stdacc, workItems * bytesPerCore, 0);

clndrange_t ndr1D = clndrange_init1d(NULL, workItems, bytesPerCore); 
clndrange_t ndr2D = clndrange_init2d(NULL, workItems, bytesPerCore/4, NULL, workItems, bytesPerCore/4);

and then in a common function call:

clforka(stdacc, 0, krn, ndr, CL_EVENT_WAIT, 1, rows, cols, wrkArea);

where 
krn is either "k_init1D" or "k_init2D"
ndr is either ndr1D or ndr2D and
wrkArea is either wrkArea1D or wrkArea2D

The kernel k_init1D works in the same way as the mandelbrot example, processing 16 bytes of data in each of the 32 calls and using get_global_id(0) to figure out where it should be.

The k_init2D kernel breaks a data set of the same size into 4x4 "tiles" so that adjacent data can be processed at the same time. I thought that I could cast the global data into a 2 dimensional array but that didn't work, so I had to do all of the offset arithmetic in code (see the sketch below). While this is not difficult, it did make the 2D kernel considerably longer, and given that speed is of the essence I would suggest that the only reason to do this would be if the algorithm overall works better in 2D than in 1D (or if you enjoy unnecessary complexity).
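
To give you an idea, the offset arithmetic for a 4x4 tile looks something like this - a simplified sketch, not the exact kernel from the example:

__kernel void k_init2D(int rowLength, __global uchar* wrkArea)
{
   int tileCol = get_global_id(0);  // which tile across
   int tileRow = get_global_id(1);  // which tile down
   int base = (tileRow * 4) * rowLength + (tileCol * 4);  // top-left byte of this 4x4 tile
   int r, c;

   for (r = 0; r < 4; r++)          // fill the tile one row at a time
      for (c = 0; c < 4; c++)
         wrkArea[base + (r * rowLength) + c] = 1;
}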

In the resultant data sets I include the value of get_global_id(0) to show which call processed which chunk of data. The 1D data has the global id in the first column and the result of get_local_size(0) in the second column. The first five lines are:

0       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1
1       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1
2       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1
3       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1
4       16      1       1       1       1       1       1       1       1       1       1       1       1       1       1

The 2D data set includes the result of get_global_id(1), which shows it jumping around between 28 and 31. This is the value returned by get_global_id(1) at the last call to the kernel. The top left value is the value returned by get_global_id(0) and the next value is the value returned by get_global_id(1). The results from the first four tiles (0 to 3) are:

0       28     1       1       1       31     1       1       2       31     1       1       3       29     1       1
1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1
1       1       1       1       1       1       1       1       1       1       1       1       1       1       1       1

The 2D call was also a lot more sensitive than the 1D. As I mentioned above, some combinations of values in the 2D version work and some don't. I suspect that the total number of calls must be a multiple of the number of cores available. So, before writing a big kernel on the assumption that clndrange_init2D works, think about how you want to process your data and check that a simple version works first.

The nD example code


The code for the example is a little long to list here so grab it from github. From your Eclipse workspace directory execute:

git clone https://github.com/nickoppen/nD.git

load it into eclipse and push, pull, tweak and generally rip it up till you are an ndrange EXPERT. And, if I've gotten anything wrong, please let me know.


Up Next


I don't know about you but I'm getting a bit tired of waiting around for Eclipse. AndyC posted a procedure on getting Code::Blocks working on the Parallella (http://forums.parallella.org/viewtopic.php?f=13&t=1658). 

I've installed Code::Blocks and it does perform better than Eclipse. Andy's procedure is designed for the standard SDK and I got into a mess when I tried to compile an OpenCL program on it so I'm going to have to do some tweaking. If I make some progress I'll write a post about it.

Sunday, 3 August 2014

OpenCL on the Parallella using Eclipse

Writing OpenCL code on the Parallella board using Eclipse for people who just want to write code and not become Linux experts.

Finally, what I would call my first "execution" blog post. I got my board a couple of months ago and while it's been fun playing with it, achieving the ultimate goal - writing code - has been frustrating to say the least.

I've tried a number of different approaches only to realise (yet again) that keeping it simple while you don't know what you are doing is by far the best approach. 

Where I ended up last week was: "how about I try to get some existing code running in Eclipse on the Parallella". Hopefully, the code will be compilable and correct, so the only thing I'll have to get right is the settings. To this end I found a short program, posted by dar on the Parallella site, that generated a colourised mandelbrot set. Don't try to find this code because it has some pretty glaring errors. If you want to try this procedure use the code below.

This really is a beginners guide. If you are already familiar with Eclipse you won't need most of this. I also don't make any claim about how efficient the code is.


0. Before you begin


The prerequisites are:

  • the Parallella SDK
  • the Brown Deer Technology SDK
  • the environment variables PATH and LD_LIBRARY_PATH set correctly
  • the Eclipse development environment


The first three come with the Ubuntu 14.04 release and I'm assuming future releases will also include them. For the record, the environment variables on my system are:

LD_LIBRARY_PATH=/usr/local/browndeer/lib:/usr/local/lib:/opt/adapteva/esdk/tools/host/lib:


PATH=/usr/local/browndeer/bin:/opt/adapteva/esdk/tools/host/bin:/opt/adapteva/esdk/tools/e-gnu/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games

Installing eclipse is a matter of:

sudo apt-get install eclipse

Once you have downloaded Eclipse you need the C/C++ development add-in, CDT. To get this use Help | Get New Software and pack your patience - it is SLOOOOW.

One tricky thing: for a program to gain access to the Epiphany co-processor it must be run by root. That means that you must log in as root (using sudo won't work) and therefore you need a root password. To reset the root password you run:

sudo passwd

which will give you the usual new password prompt followed by the are-you-really-sure prompt. Then to run your program you use the command:

su

from your non-root login (linaro by default). This will then prompt you for the root password. Once you have logged on as root, check the environment variables PATH and LD_LIBRARY_PATH; they are needed at run time, so root must have them set as described above.
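
A quick way to check (the values should match the ones listed above):

su
echo $PATH
echo $LD_LIBRARY_PATH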

1. Your first Eclipse Project


1.1 Get Yourself a New Project

To start with you need a new C/C++ project. File | New | Project pops up the box to choose what type of project you want. I always choose C++ but for this particular project a standard C project would be fine. I called my project mandelbrot (not very original, I know).

1.2 Set Up your Compile Settings

Not a lot to do here but absolutely critical.

Your compile settings are accessible from the Project | Properties menu. For Properties to be active you need to have the project tab and your project within it selected.

Your tool chain should look like this:



Note I've got the C++ compiler and linker in there.

Your includes should look like this:



This will tell the compiler to find all of the Brown Deer Technology stuff.

Finally, your linker settings should look like this:


I'm pretty sure that this is all you need. I did fumble around a lot with other settings so if this does not work please let me know.

1.3 Some Code

For CL projects you need host code and Epiphany code. The host code is compiled using the gcc/g++ compilers/linker etc in the tool chain. The Epiphany code is compiled at run time by the Brown Deer JIT compiler.

I created a source file folder (src) under the mandelbrot project folder. For the embedded paths in the code to work you should do this as well. The host code I ended up with was:

// The modifications porting this code to OpenCL are
// Copyright (c) 2012 Brown Deer Technology, LLC.
//
// Mandelbrot.c
// Written by User:Evercat
//
// This draws the Mandelbrot set and spits out a .bmp file.
// Should be quite portable (endian issues have been taken
// care of, for example)
//
// Released under the GNU Free Documentation License
// or the GNU Public License, whichever you prefer:
// 9 February, 2004.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdcl.h>
#include <errno.h>

#define OUTFILE "./mandelbrot.bmp"

#define WIDTH 1024
#define HEIGHT 768

#define CENTRE_X -0.5
#define CENTRE_Y 0
#define ZOOM 300

#define ITERATIONS 1024  // Higher is more detailed, but slower...

// Plotting functions and parameters...

#define bailoutr(n) (5*n  )
#define bailoutg(n) (20*n  )
#define bailoutb(n) 0

#define min(a,b) ((a<b)?a:b)

// Colours for the set itself...

#define IN_SET_R 0
#define IN_SET_G 0
#define IN_SET_B 0

void drawbmp(int width, int height, unsigned char* pixels, char * filename);

/////////////////////////////////// MAIN PROGRAM ///////////////////////////////////

int main (void)
{

   float startx; float endx;
   float starty; float endy;
   float dx; float dy;
   float dx_over_width,dy_over_height;

   char kern[] = "../src/mandel_kern.cl";
   void * openHandle;

   int iterations = ITERATIONS;
   int width = WIDTH;
   int height = HEIGHT;

   char strInfo[20];
   FILE * pFile;

   pFile = fopen(kern, "r");
   if (pFile == NULL)
   {
  printf("Opening the Kernel file: %s produced an error(%d). Make sure that the source code variable kern has a valid path to the cl code and that the code is readable.\n", kern, errno);
  exit(0);
   }
   else
  fclose(pFile); // only open the file to check that it is there and readable

   pFile = fopen("./debug", "w");

   fprintf(pFile, "About to malloc pixels\n");
   cl_uchar* pixels = (cl_uchar*) clmalloc(stdacc, width * height * 3, 0);

   startx = CENTRE_X - ((float) WIDTH / (ZOOM * 2));
   endx = CENTRE_X + ((float) WIDTH / (ZOOM * 2));

   starty = CENTRE_Y - ((float) HEIGHT / (ZOOM * 2));
   endy = CENTRE_Y + ((float) HEIGHT / (ZOOM * 2));

   fprintf(pFile, "Plotting from (%f, %f) to (%f, %f)\n", startx, starty, endx, endy);

   dx = endx - startx;
   dy = endy - starty;
   dx_over_width = dx / width;
   dy_over_height = dy / height;


   fprintf(pFile, "Opening kernel file:%s\n", kern);
   openHandle = clopen(stdacc, kern, CLLD_NOW);

   fprintf(pFile, "Getting the kernel with clsym\n");
   cl_kernel krn = clsym(stdacc, openHandle, "mandel_kern", CLLD_NOW);

   clGetKernelInfo(krn, CL_KERNEL_FUNCTION_NAME, sizeof(strInfo), strInfo, NULL);
   fprintf(pFile, "The kernel is called: %s\n", strInfo);

   fprintf(pFile, "Calling clndrange\n");
   clndrange_t ndr = clndrange_init1d(0, height, 16);

   fprintf(pFile, "Calling clforka\n");
   clforka(stdacc, 0, krn, &ndr, CL_EVENT_WAIT,
      iterations, width, startx, starty, dx_over_width, dy_over_height, pixels);

   fprintf(pFile, "Transferring memory contents from the Epiphany using clmsync\n");
   clmsync(stdacc, 0, pixels, CL_MEM_HOST|CL_EVENT_WAIT);

   fprintf(pFile, "Calling drawbmp\n");
   drawbmp(width, height, pixels, OUTFILE);

   fprintf(pFile, "Saved bitmap to %s. Done.\n", OUTFILE);
   clfree(pixels);
   fclose(pFile);

   return 0;
}


void drawbmp (int width, int height, unsigned char* pixels, char * filename) {

   unsigned int headers[13];
   FILE * outfile;
   int extrabytes;
   int paddedsize;
   int x; int y; int n;

   extrabytes = 4 - ((width * 3) % 4); // How many bytes of padding to add to
                                       // each horizontal line - the size of
                                       // which must be a multiple of 4 bytes.
   if (extrabytes == 4)
      extrabytes = 0;

   paddedsize = ((width * 3) + extrabytes) * height;

   // Headers...

   headers[0]  = paddedsize + 54;      // bfSize (whole file size)
   headers[1]  = 0;                    // bfReserved (both)
   headers[2]  = 54;                   // bfOffbits
   headers[3]  = 40;                   // biSize
   headers[4]  = width;  // biWidth
   headers[5]  = height; // biHeight
                                       // 6 will be written directly...
   headers[7]  = 0;                    // biCompression
   headers[8]  = paddedsize;           // biSizeImage
   headers[9]  = 0;                    // biXPelsPerMeter
   headers[10] = 0;                    // biYPelsPerMeter
   headers[11] = 0;                    // biClrUsed
   headers[12] = 0;                    // biClrImportant

   outfile = fopen (filename, "wb");

   // Headers begin...
   // When printing ints and shorts, write out 1 character at time to
   // avoid endian issues.

   fprintf (outfile, "BM");

   for (n = 0; n <= 5; n++)
   {
      fprintf(outfile, "%c", headers[n] & 0x000000FF);
      fprintf(outfile, "%c", (headers[n] & 0x0000FF00) >> 8);
      fprintf(outfile, "%c", (headers[n] & 0x00FF0000) >> 16);
      fprintf(outfile, "%c", (headers[n] & (unsigned int) 0xFF000000) >> 24);
   }

   // These next 4 characters are for the biPlanes and biBitCount fields.

   fprintf(outfile, "%c", 1);
   fprintf(outfile, "%c", 0);
   fprintf(outfile, "%c", 24);
   fprintf(outfile, "%c", 0);

   for (n = 7; n <= 12; n++)
   {
      fprintf(outfile, "%c", headers[n] & 0x000000FF);
      fprintf(outfile, "%c", (headers[n] & 0x0000FF00) >> 8);
      fprintf(outfile, "%c", (headers[n] & 0x00FF0000) >> 16);
      fprintf(outfile, "%c", (headers[n] & (unsigned int) 0xFF000000) >> 24);
   }

   // Headers done, now write the data...

   for (y = height - 1; y >= 0; y--)  // BMPs are written bottom to top.
   {
      for (x = 0; x <= width - 1; x++)
      {
         // Also, it's written in (b,g,r) format...

         fprintf(outfile, "%c", pixels[(x * 3) + 2 + (y * width * 3)]);
         fprintf(outfile, "%c", pixels[(x * 3) + 1 + (y * width * 3)]);
         fprintf(outfile, "%c", pixels[(x * 3) + 0 + (y * width * 3)]);
      }
      if (extrabytes) // See above - BMP lines must be of lengths divisible by 4
      {
         for (n = 1; n <= extrabytes; n++)
         {
            fprintf(outfile, "%c", 0);
         }
      }
   }

   fclose (outfile);
   return;
}

Create a new file in your project and paste this into it.

The Epiphany code is:

#define set_red(n) (5*n  )
#define set_green(n) (20*n  )
#define set_blue(n) 0

__kernel void mandel_kern(
   int iterations,
   int width,
   float startx, 
   float starty, 
   float dx, 
   float dy, 
   __global uchar* pixels
)
{
int threeXwidth = 3 * width;
   unsigned char line[threeXwidth];
   int i, j, n, m, pixelBase;
   float x, y, r, s, nextr, nexts;

   j = get_global_id(0);

   for (i = 0; i < width; i++) 
   {

      x = startx + i*dx;
      y = starty + j*dy;

      r = x; 
      s = y;

      for (n = 0; n < iterations; n++) 
      {

         nextr = ((r * r) - (s * s)) + x;
         nexts = (2 * r * s) + y;
         r = nextr;
         s = nexts;
         
         if ((r * r) + (s * s) > 4) break;

      }

      if (n == iterations) n=0;

      line[(i * 3) + 0 ] = min(255,set_red(n));
      line[(i * 3) + 1 ] = min(255,set_green(n));
      line[(i * 3) + 2 ] = min(255,set_blue(n));   

   }

pixelBase = j * threeXwidth;
  for (m =0; m < threeXwidth; m++)
  pixels[pixelBase + m] = line[m];

}

Paste it into a new file called mandel_kern.cl in the src folder (this is the path that the host code's kern variable expects). The compiler used by Eclipse won't do anything with this file; it is here just for convenience.

1.4 Save and Build

If you've gotten this far you should now be able to save and build the program without any problems.

2 The Fun Starts Here


2.1 Debugging Host Code

Nothing works first time right? Right! I know that we are all geniuses but even we slip up occasionally.

Compile time on the host side is pretty much as usual. Get the includes and libraries right and it will all compile and link.

Run time debugging is when things start to get tricky. Prior to the execution of any code on the Epiphany accelerator, using the Eclipse front end to the gdb debugger works as you would expect. However, the Epiphany expects the calling application to have root privileges. I started my project as an ordinary user (linaro) and when it came to starting Eclipse as root things got a bit sticky. First, logging in as root in a terminal window and then calling Eclipse didn't work at all, and typing sudo eclipse from Run on the start menu got Eclipse running but it then struck all sorts of permission problems. I reverted to old school debugging.

Notice I have opened a file called ./debug and I write a line to it pretty much before every call in the program. This is the cleaned up version of the debug writes that I used to figure out what was going on. I write them to a file to separate my output from the text that the stdcl libraries produce on the console (especially when the cl code actually runs). This is generally enough and, if the Brown Deer documentation were a little more comprehensive, would see you through.

Another gap in the documentation is how you go about checking whether your call actually worked. (Notice that the only check I do is to open the cl file to see if it is there and readable.) If something does not work you get a Segmentation Fault a bit later. Brown Deer's stdcl library takes away a lot of the verbosity of "generic" OpenCL code. For example, you don't have to create a context of your own; the library provides you with the stdacc global variable, which is the context for the Epiphany accelerator. However, what is lacking in the documentation is how to check if the calls worked, e.g. whether the clopen command found and successfully compiled the code. It might be there. If I find it I'll edit this post or write a new one.
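
In the meantime, the best I can offer is a defensive check on the handle - and note that clopen returning a null handle on failure is my assumption, not something I've confirmed in the documentation:

openHandle = clopen(stdacc, kern, CLLD_NOW);
if (openHandle == NULL)  /* ASSUMPTION: a null handle indicates failure - unverified */
{
   fprintf(stderr, "clopen failed for %s\n", kern);
   exit(1);
}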


2.2 Debugging Epiphany Code

I'm glad that I didn't really have to do this much.

Compile time for the JIT compiler is run time for the host code. The output from the compiler is displayed on the calling console. It reminded me of learning Pascal at university in the early 80's. I was fortunate that the cl code was correct enough for the compile time error to be obvious, and with a little nutting out the changes I had to make were not too onerous. I think the lesson here is "Keep your kernels as simple as you can".

For run time debugging on the Epiphany there is a version of gdb called e-gdb. I did not have to go that far for this project but I think it will be key when things get more complicated. Andreas Olofsson got back to me about the developments in this area; there is significant work being done and things will get a lot easier in the near future. I think that for the moment, the best approach is to get your code running as best you can in a friendly environment before you package it up into a kernel. Just make sure that you are only using the somewhat limited set of functions that are available in cl (the little copy loop at the end of the cl code above replaces a memcpy command in the original code).

I think I'll write another blog post once I get some idea about debugging on the Epiphany.


3. Final Thoughts


This guide is really only a toe in the water. So far I have only used the stdcl library and not gone anywhere near the COPRTHR library. There's lots of good stuff there that I have not gotten into yet.

The ARM cores on the Parallella are fine for a little snippet like the example but a bit too low powered for significant development. Cross compilation from a more powerful machine is the next step and I hope to write another blog entry on that in the near future.

Oh and one final thing... the output: