Tuesday, 19 April 2016

Getting COPRTHR and MPI running

In this post I get to grips with COPRTHR, with the new MPI library installed, and solve the puzzle of why direct calls to the Epiphany SDK did not work. There were some tricks to getting COPRTHR working, but some unexpected benefits as well.


Introduction


Working towards my goal of a general purpose infrastructure to wrap around my neural network simulator, I did some investigation into the Message Passing Interface (MPI) and direct memory access (DMA) memory transfers. The plan is to use MPI to provide the signalling between the host and epiphany and between epiphany cores. The DMA part comes in as a way to parallelise the processing so that you are transferring the next data set into local memory while you are processing the last set. DMA is also a lot quicker for larger data sets than individual reads.

To get MPI working I decided to bite the bullet and get into COPRTHR. While this was a little confusing at first, once I'd gotten everything wired up properly it all made sense. If you understand stdcl the overall structure is the same and, so far, I have not found anything that you can do in stdcl that you cannot do in COPRTHR.

I also want to use MPI because it allows asynchronous data transfer between cores. While direct memory writes are fast, the core being written to plays no part in the process and thus other coordination methods are required to ensure that the data has arrived before it is used. It also requires that the core doing the writing writes to the correct location. This is fine if all cores have the same code base but if they do not there needs to be signaling between sender and receiver to ensure that the sender is not writing over the wrong data (or program code!). MPI provides the abstraction needed to ensure an orderly, arms-length transfer between cores.

There was an unexpected benefit in that COPRTHR has a call that allows you to allocate a chunk of memory from within your kernel. While you still have the option to dynamically change your program and recompile on the fly as I described in a previous post, this call gives you access to unused local memory on the Epiphany which, if used carefully, can store data for local processing.


Preliminaries


The programs developed for this blog post were done on the 2015.01.30 image using COPRTHR version 1.6.2. The 2016.03 image was released recently so I'll install that soon and update anything that fails (which I don't expect to happen). Please also be aware that version 2.0 of COPRTHR is in the works so check your version before proceeding.

You can have a look at the example code or, to clone the source files, execute:

git clone https://github.com/nickoppen/coprthr_basic.git


Installing MPI


MPI has been around for a number of years and is a set of calls, implemented on many different systems, that allows message passing between those systems. Brown Deer recently brought out a COPRTHR extension that implements those calls for message passing between Epiphany cores. While it is an extension to version 1.6.2, I expect that COPRTHR version 2 will have MPI built in with greater functionality (although no details have been released as yet).

Get the COPRTHR mpi extension from here:


http://www.browndeertechnology.com/code/bdt-libcoprthr_mpi-preview.tgz

Just download it, un-compress it somewhere out of the way and run the install script as root: 

sudo ./install.sh


Using eSDK calls in an OpenCL program


You can actually use eSDK calls directly from your OpenCL kernel code. However, I found that there is an inconsistency between the layout of the eSDK and the hard wired compile scripts used by the clcc compiler.

To test whether this inconsistency exists on your system, fetch the latest version of the parallella-examples repository from github, which includes a couple of MPI examples:

cd ~/parallella-examples
git pull
cd mpi-nbody
make

If this compiles without errors then you are fine.

If you get errors such as:

undefined reference to `e_dma_copy(void*, void*, unsigned long)'
undefined reference to `e_mutex_lock(unsigned int, unsigned int, int*)' and
undefined reference to `e_mutex_unlock(unsigned int, unsigned int, int*)'

try the following:

ls /opt/adapteva/esdk/tools/e-gnu/epiphany-elf/sys-include

If this gives you "No such file or directory", run the following (note: this is a single command without any line breaks):

sudo ln -s /opt/adapteva/esdk/tools/e-gnu/epiphany-elf/include /opt/adapteva/esdk/tools/e-gnu/epiphany-elf/sys-include

Then run:

make clean
make

again from the mpi-nbody directory to check if that fixed it. 

What if that does not work?


This was the problem with my installation. The way I discovered it (thanks to jar) was by using the cldebug tool:

cldebug -t ./temp -- clcc --coprthr-cc -mtarget=e32 -D__link_mpi__ --dump-bin -I/opt/adapteva/esdk/tools/e-gnu/epiphany-elf/include -I ./ -DECORE=16 -DNPARTICLE=2048 -DSTEPS=32 -DSIZE=1 -DMPI_BUF_SIZE=1024 -DCOPRTHR_MPI_COMPAT mpi_tfunc.c

This looks very complicated but it really just calls the clcc command on the kernel file. The output looks even more horrendous but it contains the output of all the steps that the Brown Deer pre-processor goes through to produce a working executable.

If there is still something missing in your compile, sift through all of this text and look for something that is wrong. Given that I was missing a reference, I focused on include files and directories, which occur after the -I and -i switches. I found that it was including sys-include, which didn't exist on my system. Instead, the .h files (e_dma.h and e_mutex.h) were in a directory called include. To fix it with minimum disruption, I created a soft link (ln -s) to the existing include directory and called it sys-include. Thus sys-include became an alias for include and the compiler was happy.


Compiling


Below are the changes you need to make in Code::Blocks to get a program to compile using COPRTHR. If you have not used Code::Blocks, I suggest that you have a read through my post about setting it up from scratch. In this post I'm only going to cover the changes that you need to make to the stdcl set-up described there. If you are an emacs/make fan, I've included a Makefile along with the example code.


Compiler Options


Kernel Compilation

The first two changes are critical for correct compilation of your kernel code. Go to the compiler settings menu (Settings | Compiler) and select the compiler you have set up for compiling with clcc (mine is called BD OpenCL).

Under the Compiler Settings tab choose the Other options tab and type the switches shown below:
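In case the screenshot is hard to read, these are the same kernel-compilation switches that appear in the clcc invocation in the cldebug section further down, i.e. something along the lines of:

--coprthr-cc -mtarget=e32 -D__link_mpi__ --dump-bin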


Then under the #defines tab define the following two definitions:
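Again, these can be read off the clcc command line in the cldebug section below - most likely:

MPI_BUF_SIZE=1024
COPRTHR_MPI_COMPAT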




Host Application Compilation

Go back to the gcc compiler settings and change the Linker settings tab to include the COPRTHR link libraries (I'm not sure what the m library is for. It is in all the examples so I included it for good measure.):




Then under Search directories make sure that the compiler can find your included header files:


And that the linker can find the COPRTHR libraries:





Project Properties


As always in a new project, you need to define a build target but in this case the output attributes are not important:



This is the tricky bit. The compiler, with the switches defined above, generates a binary file in the Execution working directory (your project directory by default). This file has the rather verbose extension bin.3.e32 tacked onto the end of your kernel file name. Forget the .o file generated by the compiler; it is the bin.3.e32 file that you need at execution time. If your host application is being generated and run from your bin/Debug or bin/Release directory, you must copy the bin.3.e32 file to the same place.

To do this use the Post-build steps with a copy command and while you are at it, make the file readable by everyone. (Note: my kernel file is called pfunc.c. Replace this with your kernel file name.)
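As a sketch, assuming a Debug build target and my pfunc.c kernel, the two post-build steps would be along the lines of:

cp pfunc.cbin.3.e32 bin/Debug/pfunc.cbin.3.e32
chmod a+r bin/Debug/pfunc.cbin.3.e32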


Also, while you are in this box and after saving your new settings, pop up to the root project settings (coprthr_basic in this example) and switch off All Warnings (-Wall). It will just be annoying.

Then, because we are not interested in the .o or the linker output, switch off linking for the kernel file or files.




Using COPRTHR Host-side


The overall structure of a COPRTHR program is the same as an stdcl program as described in my previous post. The steps you need to go through are also the same but with different calls.


Loading and Compiling


Stdcl has the command clopen, which opens the epiphany and loads pre-compiled or JIT-compiled kernels, and clsopen for a kernel stored as an internal string.

COPRTHR has two commands:

int coprthr_dopen( const char* path, int flags); to open the epiphany (which should be matched with a coprthr_dclose()) where path is the #defined value COPRTHR_DEVICE_E32 and flags is COPRTHR_O_THREAD. The integer returned is the device descriptor that is needed in a number of subsequent calls.

and at least one of the following:

coprthr_program_t coprthr_cc_read_bin( const char* path, int flags ); to open a pre-compiled object file (our bin.3.e32 file), where path is the file location of the binary and flags is zero. The return value is a handle to the compiled file.

or

coprthr_program_t coprthr_cc( const char* src, size_t len, const char* opt, char** log ); compiles a string (src) using the compile options (opt), returning a handle to the program. The log parameter is a pointer to the compiler output. Pass in a NULL pointer and a suitable amount of space will be allocated (which must then be freed).


e.g.

int dd = coprthr_dopen(COPRTHR_DEVICE_E32, COPRTHR_O_THREAD);
printf("dd=%d\n", dd);
if (dd < 0)
{
    printf("device open failed\n");
    exit(0);
}

coprthr_program_t prg;
prg = coprthr_cc_read_bin("./pfunc.cbin.3.e32", 0);
printf("prg=%p\n", prg);
if (!(prg))
{
    printf("file pfunc.cbin.3.e32 not found\n");
    coprthr_dclose(dd);
    exit(0);
}

There is no call to compile source code in a file so read it into a string and compile it from there.
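As a rough sketch (the file name and the compile options here are placeholders - use whatever your kernel needs), reading the kernel source into a string and compiling it looks something like:

FILE * fp = fopen("pfunc.c", "r");        /// your kernel source file
fseek(fp, 0, SEEK_END);
size_t len = ftell(fp);
fseek(fp, 0, SEEK_SET);
char * src = (char *)malloc(len + 1);
fread(src, 1, len, fp);
src[len] = '\0';
fclose(fp);

char * log = NULL;                        /// coprthr_cc allocates space for the log - free it when done
coprthr_program_t prg = coprthr_cc(src, len, "-mtarget=e32", &log);   /// options shown are a guess
if (log) { printf("%s\n", log); free(log); }
free(src);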

Allocating Shared Memory


Similar to the stdcl call clmalloc, COPRTHR has the call:

coprthr_mem_t coprthr_dmalloc(int dd, size_t size, int flags); where a memory space of size is allocated on the device (dd) returning a handle of type coprthr_mem_t. The argument flags is not used.

I call it a handle rather than a pointer because it is really not working memory. If you try and use it as working memory you will be bitterly disappointed. The only way to write to and read from it is with the calls described in the next section.

Shared memory can be resized using the coprthr_drealloc(dd, size, flags) call and should be freed using coprthr_dfree(int dd, coprthr_mem_t mem) when you are done.

e.g.

coprthr_mem_t p_data_mem = coprthr_dmalloc(dd, WIDTH*sizeof(float), 0);

Writing to and Reading From Shared Memory


In stdcl you would declare a variable of type cl_T * where T is a compatible simple type (e.g. int or float), call clmalloc(), initialise it in your host application and then call clmsync to transfer the data to shared memory.

COPRTHR does things a little differently. The handle returned by coprthr_dmalloc refers to memory in shared space only. To initialise it you need to declare, allocate and write to memory in your host program and then call:


coprthr_event_t coprthr_dwrite(int dd, coprthr_mem_t mem, size_t offset, void* ptr, size_t len, int flags); where dd is your device descriptor, mem is your memory handle, offset is how far into the shared memory you want to start writing, ptr is a pointer to the host storage whose contents you want written into shared memory, len is the length (in bytes) of the data you want written and flags is one of COPRTHR_E_NOWAIT or COPRTHR_E_WAIT. The return value is an event handle which can be used in the call coprthr_dwaitevent(dd, event).

The contents of your host memory will be written to shared memory. In a similar way, once your kernel has finished its computation and written its results back, you call:

coprthr_event_t coprthr_dread(int dd, coprthr_mem_t mem, size_t offset, void* ptr, size_t len, int flags); where the arguments and return values are the same except of course the data gets read from shared memory into your host memory.

There is also coprthr_dcopy(...) which copies data from one device's shared memory into another, but that is not relevant on the Parallella given that there is only one device.

e.g.

float host_buffer[WIDTH];
for (i=0; i < WIDTH; i++) host_buffer[i] = -1 * i;

/// write data to shared DRAM
coprthr_mem_t p_data_mem = coprthr_dmalloc(dd, WIDTH*sizeof(float), 0);
coprthr_dwrite(dd, p_data_mem, 0, host_buffer, WIDTH*sizeof(float), COPRTHR_E_WAIT);

followed later by:

coprthr_dread(dd, p_data_mem, 0, host_buffer, WIDTH*sizeof(float),COPRTHR_E_WAIT);

Retrieving a Kernel


To call a kernel you need to retrieve a link to it from the program file you just read in or compiled. Similar to the stdcl clsym call, COPRTHR has:

coprthr_sym_t coprthr_getsym( coprthr_program_t program, const char* symbol); where program is the returned program handle and symbol is the name of the kernel. The return value is the handle to the kernel.

e.g.

coprthr_sym_t thr = coprthr_getsym(prg, "p_func_thread_mpi");
if (thr == (coprthr_sym_t)0xffffffff)
{
    printf("kernel p_func_thread_mpi not found\n");
    coprthr_dclose(dd);
    exit(0);
}


Calling a Kernel


There are three ways to call a kernel or kernels. The MPI library uses the call:

coprthr_mpiexec(int dd, int thrCount, coprthr_sym_t thr, void * args, int size, int flags); where dd is the device descriptor, thrCount is the number of threads you want to launch, thr is the handle to the kernel you are calling, args is a type-less pointer to a structure containing the kernel's arguments (see next paragraph), size is the size (in bytes) of the structure you are passing and flags is (presumably, because I have not seen them used) one of COPRTHR_E_NOWAIT or COPRTHR_E_WAIT. I also presume that the call returns a variable of type coprthr_event_t.

To use the args variable in this call you need to typedef a structure that contains the variables you want to pass to your kernel. Do this in a separate include file and include it in both the host file and kernel file. Declare a variable of this type in your host program in host memory. The simple data types (int, float etc) can be written to the structure but the arrays need to be allocated in shared memory and initialised as described above. The structure is then passed to the kernel as a type-less pointer which can then be cast back to the structure type inside your kernel code.
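A minimal sketch of such a header, with the field names used in the example below (your structure will obviously hold whatever your kernel needs):

/// my_args.h - included by both the host program and the kernel file
typedef struct
{
    float n;
    int width;
    void * p_mem;    /// device-side address of the shared memory block (from coprthr_memptr on the host)
} my_args_t;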

e.g.

my_args_t args = {
.n = 2.0,
.width = WIDTH,
.p_mem = coprthr_memptr(p_data_mem, 0),
};

coprthr_mpiexec(dd, ECORES, thr, &args, sizeof(args), 0);



The alternative ways are the COPRTHR calls that were available before the MPI library appeared. These are:

coprthr_event_t coprthr_dexec( int dd, coprthr_kernel_t krn, unsigned int nargs, void** args, unsigned int nthr, void* reserved, int flags ); where dd is the device descriptor, krn is the handle to the kernel, nargs is the number of elements in the args variable (see next paragraph), args is a pointer to an array of pointers of values to be passed to the kernel, nthr is the number of times the kernel should be executed, reserved is not used and flags is one of COPRTHR_E_NOWAIT or COPRTHR_E_WAIT. The return value is an event that can be used in the call coprthr_dwaitevent(dd, event).

and:

coprthr_event_t coprthr_dnexec( int dd, unsigned int nkrn, coprthr_kernel_t v_krn[], unsigned int v_nargs[], void** v_args[], unsigned int v_nthr[], void* v_reserved[], int flags); which allows a number of different kernels to be executed at once. Each array is a collection of the same arguments as in the dexec call.

The big difference between mpiexec and dexec is how the arguments are passed. The dexec call only accepts arrays and those arrays have to be in shared memory. This means passing singleton data via a one-element array, which is a bit clumsy. But remember, if you use mpiexec to call the kernel, you don't have to use any MPI.

e.g.

float n[] = { 2.0 };
int width[] = { WIDTH };
coprthr_mem_t p_arg_n = coprthr_dmalloc(dd, sizeof(float), 0);
coprthr_mem_t p_arg_width = coprthr_dmalloc(dd, sizeof(int), 0);
coprthr_dwrite(dd, p_arg_n, 0, n, sizeof(float), COPRTHR_E_WAIT);
coprthr_dwrite(dd, p_arg_width, 0, width, sizeof(int), COPRTHR_E_WAIT);
void * p_args[] = { &p_arg_n, &p_arg_width, &p_data_mem };

coprthr_dexec(dd, thr, 3, p_args, ECORES, 0, COPRTHR_E_WAIT);


Changes to Kernel Code


Finally, now that you have called your kernel, you need to retrieve your arguments. Remember that shared memory is slow, so if you want to use an argument more than once it is better to make a local copy.

For kernels called with coprthr_mpiexec, you need to cast the single kernel argument to your defined structure:


__kernel void p_func_thread_mpi( void * p_args )
{
    my_args_t* pargs = (my_args_t*)p_args;
    ...
}


With coprthr_dexec you need to de-reference the singleton arguments:


__kernel void p_func_thread_std( float * p_arg_n, int * p_arg_width, void * p_mem )
{
    float n = p_arg_n[0];
    int cnt = p_arg_width[0] / ECORES;
    float * g_pmem = (float *)p_mem;  /// to be used in dma copy
    ...
}

Reserving Local Memory


In stdcl there is no way to dynamically allocate space other than with source code modification and dynamic compilation. COPRTHR has two calls that allow you to use free space in local memory:

void * coprthr_tls_sbrk(int size); which returns the contents of the system variable holding the address of the first byte of free memory and then moves that free-memory address size bytes further along

and:

coprthr_tls_brk(void * addr); which sets the system variable containing the address of the first byte of free memory to addr.

These calls should be used as follows (if you are into neatness):

/// store the starting position of free space
void * memfree = coprthr_tls_sbrk(0);

/// grab a chunk of free space for local storage
float * localSpace = (float*)coprthr_tls_sbrk(cnt * sizeof(float));

/// do something with the space


/// reset the free space pointer to where it was initially
coprthr_tls_brk(memfree);

WARNING: The value returned by coprthr_tls_sbrk is an actual memory address. Remember, this is bare-metal programming with no comfy operating system preventing you from writing data where you shouldn't.


Copying using DMA


Once you have a reference to shared memory (called global memory in OpenCL terms) and have allocated some local space, you need to copy the data into memory local to the core. The eSDK documentation shows the basic synchronous copy command:

int e_dma_copy(void *dst, void *src, size_t bytes); which copies bytes of data from src to dst returning 0 if successful. This is a synchronous copy using the E_DMA_1 channel.

In our example the src is the pointer to global memory and the dst is the memory address returned by coprthr_tls_sbrk. You'd better make sure that bytes in e_dma_copy is the same as size in coprthr_tls_sbrk.

e.g.

e_dma_copy(localSpace, g_pmem + (cnt * rank), cnt*sizeof(float));

Asynchronous use of the DMA channels requires the use of:


int e_dma_start(e_dma_desc_t *descriptor, e_dma_id_t chan);
int e_dma_busy(e_dma_id_t chan);
void e_dma_wait(e_dma_id_t chan);
void e_dma_set_desc(e_dma_id_t chan, unsigned config, e_dma_desc_t *next_desc, unsigned stride_i_src, unsigned stride_i_dst, unsigned count_i, unsigned count_o, unsigned stride_o_src, unsigned stride_o_dst, void *addr_src, void *addr_dst, e_dma_desc_t *descriptor);

I will cover these in a future post.

Final Thoughts


This post is more a how-to post on using COPRTHR and a toe-in-the-water look at MPI and DMA copying. While I'm convinced that the combination will lead to better structured and faster kernels, I'm not sure by how much and at what cost in terms of program complexity.

My next step will be to compare MPI with direct memory writes along the lines of my data passing experiment. Until then, please let me know if I've missed anything.

Saturday, 16 January 2016

Eclipse Part 2 - Host Debugging with the Runtime Environment

Using the Eclipse Run Time Environment


As a follow-up to my post about compiling on Eclipse, I had a quick look at how to use the Eclipse runtime environment. This is handy for debugging host code running on the ARM chip.

I am currently using the 2015.1 Ubuntu version running on a Parallella server. The Brown Deer OpenCL version is 1.6.2. If you want some source code to practice on, use the mandelbrot example from the first Eclipse post.


Preliminaries


For this post I'm assuming that you have a program in Eclipse that has successfully compiled. The steps in the first Eclipse post are still correct for the 2015.1 version of Ubuntu.

I'm still using the JIT compiler because I have not yet figured out a way of using two different compilers in Eclipse. Code::Blocks lets you create different compile targets using different compilers and then combine the results at link time. If there are any gurus out there who know how to do this using Eclipse then please let me know.


Setting Up


The most important thing you need to do before you can run a program is to set up your environment. Eclipse has a runtime environment but it does not pick up critical bits that you need to use the JIT compiler and run your OpenCL application.

Here is what you need to do with a few other bits thrown in:

Choose Properties from the Project menu giving you this dialog:



Click the "Run/Debug Setting" entry on the list and you will see a default Launch configuration. Click Edit to show:



This dialog controls what and how your application is launched from within Eclipse.

The Name: defaults to your project name but in this box it refers to a specific launch configuration. You can have as many as you like to test out different combinations of settings.

I usually leave the other fields as they are because I'm using it to debug and that's what you get as a default. I might disable auto build if it gets annoying.

To get the program running you need the Environment tab.



This is where you set your environment variables. The minimum you will need are LD_LIBRARY_PATH and PATH. 

To add a new variable click the New... button. 

LD_LIBRARY_PATH you can copy from your environment and paste into the New... box as is. 

Do the same with PATH but add :/bin to the end of it. Oddly, this is not on your path and is not in Eclipse either, and without it basic stuff like the cp command (needed by the JIT compiler) will not work.
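If you are not sure what to paste, the quickest way is to print the current values in a terminal on the Parallella and copy the output straight into the New... box (remembering the :/bin suffix for PATH):

echo $LD_LIBRARY_PATH
echo $PATH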

If your program needs any other environment variables include them here also (except PWD - see below).

While we are here, have a look at the Arguments tab:



If your program uses command line arguments type them in the box exactly as you would on the command line. They will appear in argv as per normal.

If you want to change the working directory, un-click the Use Default box and type the directory you want in the text box. This stands in for the PWD variable. Note that Eclipse will not know what $HOME is unless you have included it as an environment variable in the Environment tab.


Running


Now that you have set the environment you can run your program from the tool bar:




The green Play icon has a small downwards-pointing arrow next to it. Your run-time configurations will appear at the top of the list. Click the one you want to use, or just press the icon if you only have one.

This will run your program using the console sub-window for text output.


Debugging


Notice the little insect on the tool bar next to the play button. This launches a debug session which uses the same configurations that you defined previously except it takes notice of break points. Set break points using the Run menu:




Press the creepy-crawly button to launch the debug session. This will pop up a perspective switch warning:



Just press "Remember my decision" and Yes never to be annoyed by this again.

Then you will see the full glory of the debug window:



There is more here than this short tutorial can cover and I cannot say that I'm in any way an expert. Underlying the fancy windows is gdb, which I find annoying, so I know only the minimum I need to figure out what is going wrong.

The usual navigation buttons are next to the Resume button and the in-scope variables are in the top right window. For some reason, variables are often "optimized out" which is frustrating.


Finally


I still have not made any progress with kernel debugging on the Epiphany. If I do, I'll let you know.

For now, have fun and write brilliant code.

Sunday, 6 December 2015

Getting It Done - What I learnt from finishing the Neural Network Algorithm

Sometimes, implementation causes rethinks. In this post I'll review the design decisions I've made getting the back propagation algorithm working, hopefully shedding some light on the practicalities of implementing an algorithm using the Parallella architecture that may be helpful to others.

Before reading this post, I highly recommend watching the presentation by Anish Varghese from the ANU. They did some tricky stuff in machine code to gain a performance boost that is not available in straight OpenCL, but their comments on the relative speed of inter-core communication and ARM-Epiphany communication are relevant regardless of language.

Writing this post I'm using Brown Deer OpenCL (version 1.6.2) and the Ubuntu 14.04 image from 30 Jan 2015.

To have a look at the code referred to in this post look here or execute the following shell command:

git clone -b blog12Files https://github.com/nickoppen/nnP.git


Introduction


Since the last post, I have written the code that trains my neural network - a process called back propagation. The first thing I had to do was restructure the feed forward kernel so that it did not discard the intermediate values that are needed in the backward pass. However, it was the management of the complexity that proved the biggest challenge. To manage it I made two design decisions, one of which worked out and one that didn't.

Restructure of Feed Forward


My primary goal was to get the algorithm working correctly with speed and efficiency in second place. I had already written the first part of the process (called the feed forward pass) and I'd made some design decisions that were suitable for that process in isolation but proved poor choices for back propagation. 

A neural network is made up of a number of layers (>=3). Each layer takes an input vector and produces an output vector which then becomes the input to the next layer. When only running the network forward the intermediate results can be discarded, and initially that is what I did.

However, when I came to write the back propagation process, I needed these values. I had to change the feed forward algorithm to keep them. While I think I came up with a better solution in the end, the basic lesson, "start with the end in mind" aka "Top down design" still applies.


Complexity - Sometimes you can make it worse!


Back propagation is a tricky little algorithm. Adjusting the weights requires calculating the error in the output and for the intermediate layers that is a sum of the errors on the outgoing weights. In a single threaded implementation you just take one (intermediate) node at a time and iterate through the outgoing weights which are all available in shared memory, calculate the error and then iterate through the incoming weights and calculate the adjustment. While this is a little simplified, the key point is that all of the data is available at all times. 

In the Epiphany architecture each processor has its own memory and so how the task is split up and how data is shared between processors are critical design decisions and come with their own set of problems.

Decision 1 - What goes where


In my first blog post I discussed carving up the task. It was a very "thinking about" sort of post. It had to be; I hadn't gotten the Parallella yet. For the purposes of this post, I'll take a simpler approach: EVERY CORE GETS 1/16th OF THE JOB (or as close to it as possible).

Complexity 1: Indexes

That does not sound complicated, but even in a simple input -> process -> output flow where you are processing data in arrays, you start to get a proliferation of indexes. For example:



You have one index for the global input and one for the local copy of the chunk processed by the individual core, from which you produce a chunk of output which you then have to copy back to global memory before exiting. If the dimensions of the input and output are different this means that you are juggling four indexes.

A neural network is a little more complicated still. They are multi-dimensional with the size of each dimension (i.e. each layer) changing as you pass through the data structure.

Complexity Reducing Idea No. 1: Keep all the arrays the same size. Therefore the data structure private to the core is the same size as the global arrays. You have to make it as big as the biggest layer and then recalculate where each core starts and stops as you pass through the data. The data flow would look like this (highly simplified):



Clearly there is wasted space in the memory local to all cores (where space is at a premium) but traversing all arrays needs only one index.

Initially I had the global input and the local copy of the input aligned, the local copy of the node biases aligned, and the intermediate values and the output aligned. That worked while I was only running the network forward. Then I got to the back propagation and remembered that I needed to keep the intermediate values generated by the hidden layers. Then I decided that it would be a good idea to keep a copy of the node biases (and weights) so that they could be updated locally and the training could be run multiple times. Then I decided that it would be more efficient for the input to be copied onto the front of the array containing the output for each layer so that it could be treated in the same way as the output from layer zero.

Idea No. 1 turned out to be not very bright at all. Now I have virtually no alignment and a lot more wasted space than I bargained for - none of the benefit and a much higher cost. The other thing about back propagation is that you are referring to values from the current layer, the next layer and the previous layer in order to adjust the weights, so you end up with lots of indexes anyway. This mess is still part of the files for this post. I'll clean it up for the next version.

The only solution as I see it is to minimise the use of private memory and use descriptive names for the indexes. Coming up with good names is tedious and difficult but in a complex program it will make a huge difference. You will also end up with a lot of "parallel" indexes in for loops, thus: 

local_index = global_start - core_offset;   // set the relative start positions
for (global_array_iterator = global_start; global_array_iterator < global_end; global_array_iterator++)
{
     local_array[local_index] = genius(global_array[global_array_iterator]);
     local_index++;    // manually advance one iterator
}

Also, using the overall strategy of copy-in, process, copy-out will mean that the local (__private) memory indexes and the global memory indexes do not get mixed up too much.

I store all of the indexes in an array of structures so that I don't have to work them out each time. This meant that there was an array of indexes that needed its own index, but that came for free because it was the same as the layer I was working on (i.e. the outermost for loop index).
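Something along these lines (the field names here are illustrative rather than the ones in the repository):

/// per-layer index bookkeeping - one entry per layer, indexed by the outermost for loop
typedef struct
{
    int globalNodeZero;    /// index of the layer's first node in the global arrays
    int globalWgtZero;     /// index of the layer's first weight in the global weight array
    int firstNode;         /// first node this core is responsible for
    int lastNode;          /// one past the last node this core is responsible for
} layerIndexes;

layerIndexes idx[MAXLAYERS];   /// MAXLAYERS is a placeholder for the number of layers in the network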

Decision 2: Passing Intermediate Values


So, each core is processing 1/16th of the data. In my neural network simulator I decided that the variable length data is passed in via a couple of arrays and that the space for them is compiled into the kernel using JIT compilation as I described in a previous post. The method I used to decide which core processed which section of the data was as follows:

#define CORECOUNT 16

int gid = get_global_id(0);
int allCores = dataWidth / CORECOUNT;  /// all cores get this many
int leftOver = dataWidth % CORECOUNT;  /// the remainder are distributed one per core starting from core0 until all are allocated

int start = ((gid * allCores) + min(gid, leftOver)); 
int end = start + allCores + ((gid < leftOver) ? 1 : 0);

This works well for each core to determine where to start processing in the global data structure. What it does not tell you is which core is processing an arbitrary data point. 

If you have read my hand-waving description of the back propagation process you will recall that the "back" part refers to the backwards flow of data to previous layers in order to determine the error of those layers (let's call this data "weight deltas"). 

I've arranged my weights to be contiguous for the node that receives their contribution in the forward pass. This allows the forward pass to calculate the node value by traversing a contiguous sequence of weight values. In the backwards pass the weight delta of each weight has to be returned to its "origin node". To pass it directly back to the core that "owns" that node I needed a similar simple formula.

Complexity 2: Inter-core communication

With processing distributed over a number of cores, figuring out what core is responsible for which datum can be difficult.

Complexity Reducing Idea No. 2: Use global memory as a scratch pad to communicate between the cores.

The good thing about global memory is that it is big (32MB). The bad thing is that it is slow. The ugly thing is that all cores have equal access to all of it, and therefore you'd better be careful about race conditions.

To return the weight delta to its "origin node", I calculate it when updating the "destination node" and write it to the global array called g_weightDeltas. Then, when I come to calculate the error of the "origin node", I read it back from g_weightDeltas, adding as I go. This is not quite as simple as it sounds. When calculating the error of each "origin node" you need a weight delta from every node in the subsequent layer. I decided to organise g_weightDeltas in the same way as the weights themselves for consistency. This means that writing them happens in the same order as calculating them, but reading them requires some index gymnastics to pick the right value from the array, thus:

outputError += g_weightDeltas[nextLayer_globalWgtZero + (w * curLayerWidth) + layerNodeIterator];

where:
outputError is the accumulated error for the node
nextLayer_globalWgtZero is the index of first weight connecting the current layer with the next layer (don't forget that I'm organising the weight deltas in the same way as the weights)
curLayerWidth is the number of nodes in the current layer (therefore the number of weights for each node in the next layer)
w is the current weight
layerNodeIterator is the current node
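Putting that line in context, the accumulation for one node of the current layer looks something like this (nextLayerWidth is my shorthand for the number of nodes in the next layer; the real code wraps this in the layer and node loops):

outputError = 0.0;
for (w = 0; w < nextLayerWidth; w++)   /// one weight delta from every node in the next layer
    outputError += g_weightDeltas[nextLayer_globalWgtZero + (w * curLayerWidth) + layerNodeIterator];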

Idea No. 2 did work out. For a little index gymnastics I saved a bit of tricky programming (aka laborious debugging) and I now have a way of checking the inter-core communication when I come to do it (leaving the global array there for now, I can compare the directly passed values against the global array values). The global array is easy to debug because I can call clmsync on return and check its values.

For this stage of the project I'm not really concerned about speed, but I will have to address this in the next version. I avoid race conditions in a macro way by using a barrier command to ensure that every layer has updated itself and the global array before the global array is read.

There is one little issue to remember when using global memory for internal purposes. It needs to be declared and allocated on the host. You don't need to do anything with it on the host but it cannot be allocated in a kernel.


Looking Forward


Having (yet again) learnt to design before coding, I've had a think about how I want the overall system to look. 

In my value passing experiment, I noticed that the overhead of launching a kernel is huge. Therefore the best way of getting the most performance out of the Epiphany is to launch the kernel once and then keep feeding it with data until there is no more data left.

This suits back propagation well because training requires multiple iterations through the training set which could be too big for global memory. Also, loading it into global memory before launching the kernel would introduce a serial component to the process that could be done by the host in parallel with the training that is happening on the Epiphany (and that's what we are all about after all).

Multiple executions of the feed forward algorithm also make sense in certain applications. If you are using the neural net to recognise features in a video stream you will probably want to pass frames to the network as they arrive. Waiting for the kernel to launch for each frame would probably cause frames to be dropped while the host waits.

So, my target system looks like this:





It looks a bit complicated but the key thing to get is that the host is reading the next lot of input data into global memory while the Epiphany is processing. It then uses some sort of signalling mechanism to indicate that the data is ready. When the Epiphany is done, it signals back to the result reader thread to pull back the results from global memory.

To simplify the diagram, I've assumed that the main host thread is always going to be slower than the Epiphany and therefore never has to wait. While this might be the case in some applications, it may also have to wait to ensure that it does not get too far ahead and starts over-writing unprocessed data.


Up Next...


The next job is to figure out how to do the signalling. I'm going to have a look at the MPI libraries now included in the Brown Deer system. I think that they will do the job. 

After that, I'll tighten up local memory and figure out how to pass the weight deltas back to their origin node in an efficient manner.

As always, please comment, especially if I've missed something or if you have a suggestion. You can comment either here on Blogger, on the Parallella forum or on reddit.