Sunday 9 October 2016

Benchmarking Broadcast Strategies

In this post I compare four different strategies to share data between all cores using direct memory writes. I'll evaluate the four strategies using the e_ctimer functions on the epiphany to measure overall execution time and cMesh waits.


Introduction


There are two things I want to look at in this post. Firstly, what performance improvement is available if you try and improve the way you move data around using the cMesh and secondly, how would you measure that improvement.

The reason I'm spending time looking at this is that inter-core communication is the Achilles heel of distributed memory multi-processors. There are a few algorithms that do not require the cores to share data (known as "embarrassingly parallel") but most do. Given that the whole purpose of using a parallel processing environment is performance, using the best communication strategy is crucial.

To be able to decide which is best, you need to be able to measure it. I used the e_ctimer functions to measure wall-clock performance of four different strategies. To gain some additional insight into what is going on, I used the slightly more complicated eMesh configuration registers to measure the wait times due to mesh traffic.

My conclusion is that, while some small improvements are available by using a carefully thought out strategy, the most important factor is how efficient your algorithm is.

Getting Started


I'm using the COPRTHR-2 beta pre-processor that is available here. At the time of writing it only runs on the Jan 2015 Parallella image.

$ uname -a

Linux parallella 3.14.12-parallella-xilinx-g40a90c3 #1 SMP PREEMPT Fri Jan 23 22:01:51 CET 2015 armv7l armv7l armv7l GNU/Linux

However, while I'm using COPRTHR-2 everything I use is available in COPRTHR-1 and the MPI add-in.

To have a look at the code, use git:

$ git clone -b broadcastStrat https://github.com/nickoppen/passing.git


The Strategies


In my previous post on passing strategies I looked at two broadcast strategies that were essentially the same other than that one had a barrier function on every iteration. It turns out that the barrier function completely dominates the time required by the function. On examination a barrier per iteration was not needed so the non-barrier version was the clear winner. This strategy is my base case.


Base Strategy - "Pass Up"


The base strategy passes all values "up". That is, each core passes the data to the core with a global id one greater than itself, looping back to zero when it reaches the "last" core with global id equal to fifteen. It stops when it gets back to its own global id.

Thus, the core with global id 5 would send its data to the other cores in the following order:

6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4 and stop because 5 is next.
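As a rough sketch of the "Pass Up" ordering (this is not the code from the repository; the function name and the CORES constant are made up for illustration and assume a 16-core chip), the destination list can be generated with a wrap-around loop:

```c
#include <stddef.h>

#define CORES 16 /* assumption: 16 cores with global ids 0..15 */

/* Fill order[] with the destinations for the "Pass Up" strategy:
 * start at self + 1, wrap from 15 back to 0, stop before reaching self.
 * Returns the number of destinations written (always CORES - 1). */
size_t pass_up_order(unsigned self, unsigned order[CORES - 1])
{
    size_t n = 0;
    for (unsigned next = (self + 1) % CORES; next != self; next = (next + 1) % CORES)
        order[n++] = next;
    return n;
}
```

For core 5 this produces 6 through 15 followed by 0 through 4, matching the list above.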


Refinement 1 - "0 to 15"


I need to do this sort of distribution in my neural network program and when I came to write that routine I thought that the original strategy was a bit complicated. Why not just send the data to the core with global id 0 then 1 then 2 and so on until 15, skipping the local node?

Thus core 5 would send in the following order:

0, 1, 2, 3, 4, skip, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

I didn't think that this would be a more efficient way to send data but at least the code was simpler.
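A minimal sketch of the "0 to 15" ordering, again with hypothetical names and assuming 16 cores, shows why the code is simpler: there is no wrap-around, just a skip.

```c
#include <stddef.h>

#define CORES 16 /* assumption: 16 cores with global ids 0..15 */

/* Fill order[] with every global id from 0 to CORES - 1 except self.
 * Returns the number of destinations written (always CORES - 1). */
size_t zero_to_fifteen_order(unsigned self, unsigned order[CORES - 1])
{
    size_t n = 0;
    for (unsigned id = 0; id < CORES; id++) {
        if (id == self)
            continue; /* skip the local core */
        order[n++] = id;
    }
    return n;
}
```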

With both of the preceding strategies, the cores treat the epiphany as if it is a linear array of cores. I thought that there might be some improvement by making more use of the vertical, north and south connections between the cores. To this end I came up with two strategies that ignored the artificially linear nature of the global id distribution.


Refinement 2 - "Random"


There is a handy site that generates random number sequences. I generated sixteen of them and ordered my cores accordingly. I removed the local core address for each core and from there it was simply a matter of iterating through the array of cores. No further thought required.


Refinement 3 - "Mapped"


In the final strategy, I tried to come up with a sending order that would reduce the number of cores trying to send data along the same channel at the same time, thus reducing the number of clashes and therefore the amount of time the cores had to wait.

I used a "nearest first" strategy. Thus direct neighbours get the data first. For core 5:



Then I send to cores that are two hops away:




* Note: I have not changed the default "horizontal then vertical" communication strategy employed by the epiphany hardware.

Then, after two hops I do three hops, four hops etc. until all fifteen other cores have received the data. I also vary the initial direction: left first on the top row, then up on the second etc.
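The "nearest first" idea can be sketched by sorting the other cores by hop count. This is only an approximation of the lists used in the kernel: it assumes a 4 x 4 grid with global id = row * 4 + column, breaks ties by global id, and does not reproduce the varying initial direction described above. All names are hypothetical.

```c
#include <stdlib.h>
#include <stddef.h>

#define ROWS 4
#define COLS 4
#define CORES (ROWS * COLS)

/* Number of mesh hops between two cores (Manhattan distance on the grid). */
static int hops(unsigned a, unsigned b)
{
    int dr = (int)(a / COLS) - (int)(b / COLS);
    int dc = (int)(a % COLS) - (int)(b % COLS);
    return abs(dr) + abs(dc);
}

/* Fill order[] with all other cores, one-hop neighbours first, then two
 * hops, and so on. Ties are broken by ascending global id. */
size_t nearest_first_order(unsigned self, unsigned order[CORES - 1])
{
    size_t n = 0;
    for (int d = 1; d <= (ROWS - 1) + (COLS - 1); d++)
        for (unsigned id = 0; id < CORES; id++)
            if (id != self && hops(self, id) == d)
                order[n++] = id;
    return n;
}
```

Under these assumptions core 5 sends to 1, 4, 6 and 9 first, then to the six cores that are two hops away, and finishes with core 15.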

Both the "Random" and "Mapped" strategies require a different list of core base addresses for every core. In my kernel all cores get all lists but only use one. If space is tight, these lists could be stored in shared memory and each core would only copy down the list it is going to use. 


The Implementation



Copying


As with my last post on data passing, I use the following to make the copy:

#define NEIGHBOUR_LOC(CORE, STRUCTURE, INDEX, SIZEOFTYPE) (CORE + ((unsigned int)STRUCTURE) + (INDEX * SIZEOFTYPE))

The #define slightly improves the readability in the inner loop:


for (i=firstI; i < lastI; i++)
   *(int *)NEIGHBOUR_LOC(core[coreI], vLocal,  i, (sizeof(int))) = vLocal[i];
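To make the address arithmetic behind NEIGHBOUR_LOC concrete, here is the same calculation written as a plain function on integers so that it can be checked on a host. The function name is hypothetical, and the example values (a core base address of 0x80800000 and a local address of 0x4000) are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Same arithmetic as NEIGHBOUR_LOC: the remote core's base address, plus
 * the address of the structure in local memory, plus the byte offset of
 * the element being written. */
uint32_t neighbour_addr(uint32_t core_base, uint32_t local_addr,
                        uint32_t index, uint32_t elem_size)
{
    return core_base + local_addr + index * elem_size;
}
```

With those example values, element 3 of an int array lands at 0x80800000 + 0x4000 + 3 * 4 = 0x8080400C. The calculation only works because every core runs the same kernel, so vLocal sits at the same local address on each of them.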

The kernel is called 32 times with the amount of data being copied increasing from 1 to 32 integers (ints). The whole loop is repeated 10,000 times to add to the workload and get some more representative numbers.


Clock Timer


The epiphany chip has two timers and they are implemented in hardware (so the best information comes from the Architecture Reference Document not the SDK Reference Document). They count downwards from their initial value.

 The basic usage of a timer is as follows:

e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX);   // set timer register to its max value
start_ticks = e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK); // start the timer to count clock ticks remembering the initial value

// implement the code you want timed 

stop_ticks = e_ctimer_get(E_CTIMER_0);    // remember the final value
e_ctimer_stop(E_CTIMER_0);                // stop timing
time = start_ticks - stop_ticks;          // calculate the elapsed time in ticks

Again, I've #defined these lines:

#define STARTCLOCK0(start_ticks) e_ctimer_set(E_CTIMER_0, E_CTIMER_MAX); start_ticks = e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);

#define STOPCLOCK0(stop_ticks) stop_ticks = e_ctimer_get(E_CTIMER_0); e_ctimer_stop(E_CTIMER_0);

Note: I've used a C style #define to define an inline function with multiple lines. I've done this to avoid the overhead associated with a function call.
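One caveat with multi-statement macros is that they can misbehave inside an unbraced if/else. A common C idiom is to wrap the body in do { ... } while (0). The sketch below shows the pattern with a simulated down-counting timer so that it runs on a host; the fake_ functions and FAKE_TIMER_MAX are stand-ins invented for this example, not eSDK calls.

```c
#include <stdint.h>

#define FAKE_TIMER_MAX 0xFFFFFFFFu

/* Hypothetical stand-ins for the eSDK timer calls. */
static uint32_t fake_timer;
static void     fake_ctimer_set(uint32_t v) { fake_timer = v; }
static uint32_t fake_ctimer_start(void)     { return fake_timer; }
static uint32_t fake_ctimer_get(void)       { return fake_timer; }
static void     fake_work(uint32_t ticks)   { fake_timer -= ticks; } /* counts down */

/* do/while(0) makes each macro behave like a single statement. */
#define STARTCLOCK(start_ticks) \
    do { fake_ctimer_set(FAKE_TIMER_MAX); (start_ticks) = fake_ctimer_start(); } while (0)
#define STOPCLOCK(stop_ticks) \
    do { (stop_ticks) = fake_ctimer_get(); } while (0)

/* Time a simulated workload; the timer counts down, so elapsed = start - stop. */
uint32_t time_fake_work(uint32_t ticks)
{
    uint32_t start, stop;
    if (ticks > 0)
        STARTCLOCK(start); /* safe in an unbraced if because of do/while(0) */
    else
        return 0;
    fake_work(ticks);
    STOPCLOCK(stop);
    return start - stop;
}
```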


Mesh Timer


The mesh timer is a little more complicated. The start and stop are the same as above but you have to tell the epiphany what mesh event you want to time. You can time wait or access events to the CPU and/or to the four connections (see the MESHCONFIG register definitions in the Architecture Reference). To do this you need to set a value in the E_MESH_CONFIG register. Then, to be neat, you need to restore E_MESH_CONFIG back to what it was before you started.

To initialise the timer to measure all mesh wait ticks:

#define E_MESHEVENT_ANYWAIT1 0x00000200  // Count all wait events

int mesh_reg = e_reg_read(E_REG_MESHCFG);   // make a copy of the previous value
int mesh_reg_timer = mesh_reg & 0xfffff0ff; // blank out bits 8 to 11
mesh_reg_timer = mesh_reg_timer | E_MESHEVENT_ANYWAIT1; // write desired event code in bits 8 to 11

e_reg_write(E_REG_MESHCFG, mesh_reg_timer); // write the new value back to the register

/* Note: the Architecture Reference document (Rev 14.03.11 pp. 148 - 149) seems to imply that bits 4 - 7 control E_CTIMER1 and bits 8 - 11 control E_CTIMER0. This is around the wrong way. Bits 4 - 7 control E_CTIMER0 and bits 8 - 11 control E_CTIMER1. */
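The masking step can be checked on a host with a small pure function. This sketch follows the bit layout observed above (bits 4 - 7 for E_CTIMER0, bits 8 - 11 for E_CTIMER1) and treats the "any wait" event as the 4-bit code 2, which is what 0x00000200 encodes in bits 8 to 11. The function name is hypothetical.

```c
#include <stdint.h>

/* Return a copy of the mesh config register value with the 4-bit event
 * code for the chosen timer replaced. Timer 0 uses bits 4-7 and timer 1
 * uses bits 8-11, following the observation in this post. */
uint32_t mesh_cfg_with_event(uint32_t reg, int timer, uint32_t event_code)
{
    unsigned shift = (timer == 0) ? 4u : 8u;
    uint32_t mask = (uint32_t)0xFu << shift;
    return (reg & ~mask) | ((event_code & 0xFu) << shift);
}
```

For timer 1 and event code 2 this is the same as (mesh_reg & 0xfffff0ff) | 0x00000200.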

Then you start your timer as described above but using E_CTIMER_MESH_1 (for E_CTIMER1) and the timer will count the event type you requested.

Stopping is the same as above but then, to be neat you should put the register back to where it was prior to the initialisation:

e_reg_write(E_REG_MESHCFG, mesh_reg); // where mesh_reg is previous value

Again, I've defined inline functions for brevity and performance:

#define PREPAREMESHTIMER1(mesh_reg, event_type)...
#define STARTMESHTIMER1(start_ticks)...
#define STOPMESHTIMER1(stop_ticks)...
#define RESETMESHTIMER1(mesh_reg)...

To make it obvious that I'm using E_CTIMER_0 for the clock and E_CTIMER_1 for the mesh timer I've added a "0" and a "1" onto the end of my inline function calls.

The Results


I plotted the wall clock time of each passing algorithm:



First, let me say that I have NO IDEA why the graph shows time reductions at 9, 17 and 25 integers. Maybe someone who understands the hardware better than me could explain that.

Second, for small amounts of data there are some significant gains to be had. The "mapped" strategy outperforms the baseline strategy by between 10% and 25% if you are sending 11 integers or fewer.

Thirdly, as the inner loop comes to dominate the execution time, the difference between the algorithms decreases. Therefore, for larger amounts of data, being really clever in your program delivers increasingly diminishing returns. The 25% difference writing one int shrank to about 1% for 32 ints.

Finally, and this is not obvious from the wall clock chart, the improvements in performance come from a more efficient algorithm rather than from better use of the cMesh. As I improved the algorithms I could see the execution time reducing. This is more obvious in the following, somewhat surprising chart:




This chart shows the average number of clock ticks that the cores have to wait for all mesh events. If the reduction in execution time was due to a more efficient use of the cMesh then you would expect that the best performers ("Mapped" and "Random") would be waiting less than the other two strategies.

It seems that the outer loop with fewer iterations is faster at pumping out data into the mesh which is therefore more clogged. However, the increase in waiting time of the "Mapped" strategy was only 6% of the total execution time compared to 4.2% for "Pass Up" and "0 to 15". Thus the difference made a small contribution to the convergence of the strategies as the volume of data increased.

I must also add that the steady upwards trend shown in the above graph is somewhat oversimplified. Initially I used 1,000 iterations. That showed the same trend but the lines jumped around all over the place. If you are measuring something that is only a small chunk of the total execution time (in this case, 4% - 6%) and something that is influenced by other factors (i.e. other cores' mesh traffic) you need to repeat the test many, many times.
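The averaging itself is simple, but it is worth accumulating in a wider type than the 32-bit tick counts so that many thousands of samples cannot overflow the sum. A minimal host-side sketch, with hypothetical function names:

```c
#include <stddef.h>
#include <stdint.h>

/* Mean of n tick counts, accumulated in 64 bits to avoid overflow. */
double mean_ticks(const uint32_t *samples, size_t n)
{
    if (n == 0)
        return 0.0;
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return (double)sum / (double)n;
}

/* Mesh wait ticks expressed as a percentage of total execution ticks. */
double wait_share_percent(double wait_ticks, double total_ticks)
{
    return total_ticks > 0.0 ? 100.0 * wait_ticks / total_ticks : 0.0;
}
```

For example, 60 wait ticks out of 1,000 total ticks is a 6% share.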


Conclusion


Using the e_ctimer functions is a great way to test out alternative implementations. It is like using a volt meter when developing circuits. However, you may need to run your test many thousands of times to get repeatable, representative results.

The cMesh does a pretty consistent job of delivering the data. If one of the broadcast strategies was inherently better I would expect the performance difference to be maintained and even increase as the amount of data rises. If you use direct memory writes for large amounts of data, simple is probably best.

For low volumes, while some improvement is available by tweaking the delivery process, it is your program that determines the overall performance.


Up Next...


I've briefly touched on DMA transfers in a previous post. I'd like to expand this test out beyond 32 ints and compare direct memory writes with DMA transfers. This will mean getting to understand the contents of e_dma_desc_t and how that all works.

I will also include mpi_bcast if it is available in the next release of COPRTHR2. Here the big question is whether the coordination overhead imposed by MPI is more or less time consuming than using barriers to coordinate the cores.

Tuesday 19 April 2016

Getting COPRTHR and MPI running

In this post I get to grips with COPRTHR with the new MPI library installed and solve the puzzle as to why direct calls to the epiphany SDK did not work. There were some tricks to getting COPRTHR working but some unexpected benefits as well.


Introduction


Working towards my goal of a general purpose infrastructure to wrap around my neural network simulator, I did some investigation into the Message Passing Interface (MPI) and direct memory access (DMA) memory transfers. The plan is to use MPI to provide the signalling between the host and epiphany and between epiphany cores. The DMA part comes in as a way to parallelise the processing so that you are transferring the next data set into local memory while you are processing the last set. DMA is also a lot quicker for larger data sets than individual reads.

To get MPI working I decided to bite the bullet and get into COPRTHR. While this was a little confusing at first once I'd gotten everything wired properly it all made sense. If you understand stdcl the overall structure is the same and so far, I have not found anything that you can do in stdcl that you cannot do in COPRTHR.

I also want to use MPI because it allows asynchronous data transfer between cores. While direct memory writes are fast, the core being written to plays no part in the process and thus other coordination methods are required to ensure that the data has arrived before it is used. It also requires that the core doing the writing writes to the correct location. This is fine if all cores have the same code base but if they do not there needs to be signaling between sender and receiver to ensure that the sender is not writing over the wrong data (or program code!). MPI provides the abstraction needed to ensure an orderly, arms-length transfer between cores.

There was an unexpected benefit in that COPRTHR has a call that allows you to allocate a chunk of memory from within your kernel. While you still have the option to dynamically change your program and recompile on the fly as I described in a previous post, there is a call that allows you access to unused memory on the epiphany which, if used carefully, can store data for local processing.


Preliminaries


The programs developed for this blog post were done on the 2015.01.30 image using COPRTHR version 1.6.2. The 2016.03 image was released recently so I'll install that soon and update anything that fails (which I don't expect it to do). Please also be aware that version 2.0 of COPRTHR is in the works so check your version before proceeding.

You can have a look at the example code or, to clone the source files, execute:

git clone https://github.com/nickoppen/coprthr_basic.git


Installing MPI


MPI has been around for a number of years and is a set of calls, implemented on different systems, that allows message passing between those systems. Brown Deer brought out a COPRTHR extension recently that implemented those calls for message passing between epiphany cores. While it is an extension to version 1.6.2 I expect that COPRTHR version 2 will have MPI built in with greater functionality (although no details have been released as yet).

Get the COPRTHR mpi extension from here:


http://www.browndeertechnology.com/code/bdt-libcoprthr_mpi-preview.tgz

Just download it, un-compress it somewhere out of the way and run the install script as root: 

sudo ./install.sh


Using eSDK calls in an OpenCL program


You can actually use eSDK calls directly from your OpenCL kernel code. However, I found that there is an inconsistency between the layout of the eSDK and the hard wired compile scripts used by the clcc compiler.

To test if this inconsistency exists on your system, fetch the latest version of the parallella examples from github which includes a couple of MPI examples:

cd
git fetch parallella-examples
cd parallella-examples/mpi-nbody
make

If this compiles without errors then you are fine.

If you get errors such as:

undefined reference to `e_dma_copy(void*, void*, unsigned long)'
undefined reference to `e_mutex_unlock(unsigned int, unsigned int, int*)'

try the following:

ls /opt/adapteva/esdk/tools/e-gnu/epiphany-elf/sys-include

if this gives you "No such file or directory" run the following (note: this is a single command without any breaks):

sudo ln -s /opt/adapteva/esdk/tools/e-gnu/epiphany-elf/include /opt/adapteva/esdk/tools/e-gnu/epiphany-elf/sys-include

Then run:

make clean
make

again from the mpi-nbody directory to check if that fixed it. 

If that does not work?


This was the problem with my installation. The way I discovered it (thanks to jar) was by using the cldebug tool:

cldebug -t ./temp -- clcc --coprthr-cc -mtarget=e32 -D__link_mpi__ --dump-bin -I/opt/adapteva/esdk/tools/e-gnu/epiphany-elf/include -I ./ -DECORE=16 -DNPARTICLE=2048 -DSTEPS=32 -DSIZE=1 -DMPI_BUF_SIZE=1024 -DCOPRTHR_MPI_COMPAT mpi_tfunc.c

This looks very complicated but it really just calls the clcc command on the kernel file. The output looks even more horrendous but it contains the output of all the steps that the Brown Deer pre-processor goes through to produce a working executable.

If there is still something missing in your compile, sift through all of this text and look for something that is wrong. Given that I was missing a reference, I focused on include files and directories which occur after -I and -i switches. I found that it was including sys-include which didn't exist on my system. Instead, the .h files (e_dma.h and e_mutex.h) were in a directory called include. To fix it with minimum disruption, I created a soft link (ln -s) of the existing include and called it sys-include. Thus sys-include was an alias for include and the compiler was happy.


Compiling


Below are the changes you need to make in Code::Blocks to get a program to compile using COPRTHR. If you have not used Code::Blocks I suggest that you have a read through my post about setting it up from scratch. In this post I'm only going to cover the changes that you need to make to the stdcl set up described there. If you are an emacs/make fan, I've included a Makefile along with the example code.


Compiler Options


Kernel Compilation

The first two changes are critical for correct compilation of your kernel code. Go to the Compiler Settings menu (Settings | Compiler) and select the compiler you have set up for compiling with clcc (mine is called BD OpenCL).

Under the Compiler Settings tab choose the Other options tab and type the switches shown below:


Then under the #defines tab define the following two definitions:




Host Application Compilation

Go back to the gcc compiler settings and change the Linker settings tab to include the COPRTHR link libraries (I'm not sure what the m library is for. It is in all the examples so I included it for good measure.):




Then under Search directories make sure that the compiler can find your included header files:


And that the linker can find the COPRTHR libraries:





Project Properties


As always in a new project, you need to define a build target but in this case the output attributes are not important:



This is the tricky bit. The compiler, with the switches defined above, generates a binary file in the Execution working directory (your project directory by default). This file has the rather verbose extension of bin.3.e32 tacked onto the end of your kernel file name. Forget the .o file generated by the compiler, it is the bin.3.e32 file that you need at execution time. If your host application is being generated and run from your bin/Debug or bin/Release directory, you must copy the bin.3.e32 file to the same place.

To do this use the Post-build steps with a copy command and while you are at it, make the file readable by everyone. (Note: my kernel file is called pfunc.c. Replace this with your kernel file name.)


Also, while you are in this box and after saving your new settings, pop up to the root project settings (coprthr_basic in this example) and switch off All Warnings (-Wall). It will just be annoying.

Then, because we are not interested in the .o or the linker output, switch off linking for the kernel file or files.




Using COPRTHR Host-side


The overall structure of a COPRTHR program is the same as an stdcl program as described in my previous post. The steps you need to go through are also the same but with different calls.


Loading and Compiling


Stdcl has the commands clopen, for opening the epiphany and loading pre-compiled or JIT compiled kernels, and clsopen, for a kernel stored as an internal string.

COPRTHR has two commands:

int coprthr_dopen( const char* path, int flags); to open the epiphany (which should be matched with a coprthr_dclose()) where path is the #defined value COPRTHR_DEVICE_E32 and flags is COPRTHR_O_THREAD. The integer returned is the device descriptor that is needed in a number of subsequent calls.

and at least one of the following:

coprthr_program_t coprthr_cc_read_bin( const char* path, int flags ); to open a pre-compiled object file (our bin.3.e32 file), where path is the file location of the binary and flags is zero. The return value is a handle to the compiled file.

or

coprthr_program_t coprthr_cc( const char* src, size_t len, const char* opt, char** log ); compiles a string (src) using the compile options (opt), returning a handle to the program. The log parameter is a pointer to the compiler output. Pass in a NULL pointer and a suitable amount of space will be allocated (which must then be freed).


e.g.

int dd = coprthr_dopen(COPRTHR_DEVICE_E32, COPRTHR_O_THREAD);
printf("dd=%d\n", dd);
if (dd < 0)
{
    printf("device open failed\n");
    exit(1);    /// non-zero exit status signals the failure
}

coprthr_program_t prg;
prg = coprthr_cc_read_bin("./pfunc.cbin.3.e32", 0);
printf("prg=%p \n", prg);
if (!(prg))
{
    printf("file pfunc.cbin.3.e32 not found\n");
    coprthr_dclose(dd);    /// close the device before bailing out
    exit(1);
}

There is no call to compile source code in a file so read it into a string and compile it from there.
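Reading a source file into a string only needs the standard C library. The helper below is a minimal sketch (the function name is made up for this example); the returned buffer and length could then be handed to coprthr_cc as the src and len arguments described above.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a newly allocated, NUL-terminated string.
 * On success returns the buffer (free it with free()) and stores the
 * length in *len_out if len_out is not NULL. Returns NULL on any error. */
char *read_file_to_string(const char *path, size_t *len_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long size = ftell(f);
    if (size < 0) { fclose(f); return NULL; }
    rewind(f);

    char *buf = malloc((size_t)size + 1);
    if (!buf) { fclose(f); return NULL; }

    size_t got = fread(buf, 1, (size_t)size, f);
    fclose(f);
    if (got != (size_t)size) { free(buf); return NULL; }

    buf[got] = '\0';
    if (len_out)
        *len_out = got;
    return buf;
}
```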

Allocating Shared Memory


Similar to the stdcl call clmalloc, COPRTHR has the call:

coprthr_mem_t coprthr_dmalloc(int dd, size_t size, int flags); where a memory space of size is allocated on the device (dd) returning a handle of type coprthr_mem_t. The argument flags is not used.

I call it a handle rather than a pointer because it is really not working memory. If you try and use it as working memory you will be bitterly disappointed. The only way to write to and read from it is with the calls described in the next section.

Shared memory can be resized using the coprthr_drealloc(dd, size, flags) call and should be freed using coprthr_dfree(int dd, coprthr_mem_t mem) when you are done.

e.g.

coprthr_mem_t p_data_mem = coprthr_dmalloc(dd, WIDTH*sizeof(float), 0);

Writing to and Reading From Shared Memory


In stdcl you would declare a variable of type cl_T * where T is a compatible simple type (e.g. int or float), call clmalloc(), initialise it in your host application and then call clmsync to transfer the data to shared memory.

COPRTHR does things a little differently. The handle returned by coprthr_dmalloc refers to memory in shared space only. To initialise it you need to declare, allocate and write to memory in your host program and then call:


coprthr_event_t coprthr_dwrite(int dd, coprthr_mem_t mem, size_t offset, void* ptr, size_t len, int flags); where dd is your device descriptor, mem is your memory handle, offset is how far into the shared memory you want to start writing, ptr is a pointer to your host storage the contents of which you want written into shared memory, len is the length (in bytes) of the data you want written and flags is one of COPRTHR_E_NOWAIT or COPRTHR_E_WAIT. The return value is an event handle which can be used in the call coprthr_dwaitevent(dd, event).

The contents of your host memory will be written to shared memory. In a similar way, once your kernel has finished its computation and written its results back, you call:

coprthr_event_t coprthr_dread(int dd, coprthr_mem_t mem, size_t offset, void* ptr,
size_t len, int flags); where the arguments and return values are the same except of course the data gets read from shared memory into your host memory.

There is also coprthr_dcopy(...) which copies data from one device's shared memory into another but that is not relevant on the Parallella given that there is only one device.

e.g.

float host_buffer[WIDTH];
for (i=0; i < WIDTH; i++) host_buffer[i] = -1 * i;

/// write data to shared DRAM
coprthr_mem_t p_data_mem = coprthr_dmalloc(dd, WIDTH*sizeof(float), 0);
coprthr_dwrite(dd, p_data_mem, 0, host_buffer, WIDTH*sizeof(float), COPRTHR_E_WAIT);

followed later by:

coprthr_dread(dd, p_data_mem, 0, host_buffer, WIDTH*sizeof(float),COPRTHR_E_WAIT);

Retrieving a Kernel


To call a kernel you need to retrieve a link to it from the program file you just read in or compiled. Similar to the stdcl clsym call, COPRTHR has:

coprthr_sym_t coprthr_getsym( coprthr_program_t program, const char* symbol); where program is the returned program handle and symbol is the name of the kernel. The return value is the handle to the kernel.

e.g.

coprthr_sym_t thr = coprthr_getsym(prg,"p_func_thread_mpi");
if (thr == (coprthr_sym_t)0xffffffff)
{
    printf("kernel p_func_thread_mpi not found\n");
    coprthr_dclose(dd);
    exit(1);
}


Calling a Kernel


There are three ways to call a kernel or kernels. The MPI library uses the call:

coprthr_mpiexec(int dd, int thrCount, coprthr_sym_t thr, void * args, int size, int flags); where dd is the device descriptor, thrCount is the number of threads you want to launch, thr is the handle to the kernel you are calling, args is a type-less pointer to a structure containing the kernel's arguments (see next paragraph), size is the size (in bytes) of the structure you are passing and flags (presumably, because I have not seen them used) is one of COPRTHR_E_NOWAIT or COPRTHR_E_WAIT. I also presume that the call returns a variable of type coprthr_event_t.

To use the args variable in this call you need to typedef a structure that contains the variables you want to pass to your kernel. Do this in a separate include file and include it in both the host file and kernel file. Declare a variable of this type in your host program in host memory. The simple data types (int, float etc) can be written to the structure but the arrays need to be allocated in shared memory and initialised as described above. The structure is then passed to the kernel as a type-less pointer which can then be cast back to the structure type inside your kernel code.

e.g.

my_args_t args = {
    .n = 2.0,
    .width = WIDTH,
    .p_mem = coprthr_memptr(p_data_mem, 0),
};

coprthr_mpiexec(dd, ECORES, thr, &args, sizeof(args), 0);



The alternative ways are the COPRTHR calls that were available before the MPI library appeared. These are:

coprthr_event_t coprthr_dexec( int dd, coprthr_kernel_t krn, unsigned int nargs, void** args, unsigned int nthr, void* reserved, int flags ); where dd is the device descriptor, krn is the handle to the kernel, nargs is the number of elements in the args variable (see next paragraph), args is a pointer to an array of pointers of values to be passed to the kernel, nthr is the number of times the kernel should be executed, reserved is not used and flags is one of COPRTHR_E_NOWAIT or COPRTHR_E_WAIT. The return value is an event that can be used in the call coprthr_dwaitevent(dd, event).

and:

coprthr_event_t coprthr_dnexec( int dd, unsigned int nkrn, coprthr_kernel_t v_krn[], unsigned int v_nargs[], void** v_args[], unsigned int v_nthr[], void* v_reserved[], int flags); which allows a number of different kernels to be executed at once. Each array is a collection of the same arguments as in the dexec call.

The big difference between mpiexec and dexec is how the arguments are passed. The dexec call only accepts arrays and those arrays have to be in shared memory. This means passing singleton data via a one element array, which is a bit clumsy. But remember, if you use mpiexec to call the kernel, you don't have to use any MPI.

e.g.

float n[] = { 2.0 };
int width[] = { WIDTH };
coprthr_mem_t p_arg_n = coprthr_dmalloc(dd, sizeof(float), 0);
coprthr_mem_t p_arg_width = coprthr_dmalloc(dd, sizeof(int), 0);
coprthr_dwrite(dd, p_arg_n,0,&n, sizeof(float), COPRTHR_E_WAIT);
coprthr_dwrite(dd, p_arg_width, 0, &width, sizeof(int),   COPRTHR_E_WAIT);
void * p_args[] = { &p_arg_n, &p_arg_width, &p_data_mem };

coprthr_dexec(dd, thr, 3, p_args, ECORES, 0, COPRTHR_E_WAIT);


Changes to Kernel Code


Finally, now that you have called your kernel, you need to retrieve your arguments. Remember that shared memory is slow so if you want to use them more than once it is better to make a local copy.

In kernels called with coprthr_mpiexec you need to cast the single kernel argument to your defined structure:


__kernel void p_func_thread_mpi( void * p_args )
{
    my_args_t* pargs = (my_args_t*)p_args;
    ...
}
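The cast-and-copy pattern can be tried on a host with an ordinary function standing in for the kernel. In this sketch the field names follow the my_args_t example earlier in the post, but the exact field types are assumptions and sum_scaled is a hypothetical function, not part of the example code.

```c
/* Shared between host and kernel via a common header. Field names follow
 * the example above; the types are assumptions made for this sketch. */
typedef struct {
    float n;
    int   width;
    void *p_mem;
} my_args_t;

/* Host-side stand-in for a kernel body: cast the type-less pointer back
 * to the structure, take a local copy, then use the fields. */
float sum_scaled(void *p_args)
{
    my_args_t args = *(my_args_t *)p_args;   /* local copy of the arguments */
    const float *data = (const float *)args.p_mem;
    float total = 0.0f;
    for (int i = 0; i < args.width; i++)
        total += args.n * data[i];
    return total;
}
```

Copying the structure once into a local variable follows the advice above: the arguments live in slow shared memory, so read them once and work from the copy.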


With coprthr_dexec you need to de-reference the singleton arguments:


__kernel void p_func_thread_std( float * p_arg_n,  int* p_arg_width, void * p_mem )
{
    float n = p_arg_n[0];
    int cnt = p_arg_width[0] / ECORES;
    float * g_pmem = (float *)p_mem;  /// to be used in dma copy
    ...
}

Reserving Local Memory


In stdcl there is no way to dynamically allocate space other than with source code modification and dynamic compilation. COPRTHR has two calls that allow you to use free space in local memory:

void * coprthr_tls_sbrk(int size); which returns the contents of the system variable containing the first byte of free memory and sets the free memory address system variable to be size bytes further along

and:

coprthr_tls_brk(void * addr); sets the system variable containing address of the first byte of free memory to be addr.

These calls should be used as follows (if you are into neatness):

/// store the starting position of free space
void * memfree = coprthr_tls_sbrk(0);

/// grab a chunk of free space for local storage
float * localSpace = (float*)coprthr_tls_sbrk(cnt * sizeof(float));

/// do something with the space


/// reset the free space pointer to where it was initially
coprthr_tls_brk(memfree);

WARNING: The value returned by coprthr_tls_sbrk is an actual memory address. Remember this is bare metal programming with no comfy operating system preventing you writing data where you shouldn't. 
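To see the mark, grab and release pattern without any hardware, here is a host-side simulation of a bump allocator with the same shape as the two calls. The sim_ names and the 1 KB pool are invented for this sketch; unlike the bare metal version it refuses requests that do not fit.

```c
#include <stddef.h>

#define SIM_HEAP_BYTES 1024 /* arbitrary pool size for the simulation */

static unsigned char  sim_heap[SIM_HEAP_BYTES];
static unsigned char *sim_free = sim_heap; /* first byte of free memory */

/* Like sbrk: return the current free pointer and advance it by size bytes.
 * Returns NULL if the pool cannot satisfy the request. */
void *sim_sbrk(size_t size)
{
    if (size > (size_t)(sim_heap + SIM_HEAP_BYTES - sim_free))
        return NULL;
    void *old = sim_free;
    sim_free += size;
    return old;
}

/* Like brk: set the free pointer back to a previously saved address. */
void sim_brk(void *addr)
{
    sim_free = (unsigned char *)addr;
}
```

Calling sim_sbrk(0) returns the current free pointer without moving it, which is why it works as a "mark" that can later be passed to sim_brk.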


Copying using DMA


Once you have a reference to shared memory (called global memory in OpenCL terms) and allocated some space, you need to copy it into memory local to the core. The eSDK documentation shows the basic synchronous copy command:

int e_dma_copy(void *dst, void *src, size_t bytes); which copies bytes of data from src to dst returning 0 if successful. This is a synchronous copy using the E_DMA_1 channel.

In our example the src is the pointer to global memory and the dst is the memory address returned by coprthr_tls_sbrk. You'd better make sure that bytes in e_dma_copy is the same as size in coprthr_tls_sbrk.

e.g.

e_dma_copy(localSpace, g_pmem + (cnt * rank), cnt*sizeof(float));
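The slice arithmetic in that call is easy to get wrong: the source pointer advances in elements (cnt * rank floats) while the length is in bytes (cnt * sizeof(float)). The sketch below checks the arithmetic on a host, with memcpy standing in for e_dma_copy; copy_slice is a hypothetical helper and width is assumed to divide evenly by the number of cores.

```c
#include <string.h>

/* Copy this core's share of a global array into local storage.
 * memcpy stands in for e_dma_copy so that the sketch runs on a host.
 * Returns the number of elements copied. */
int copy_slice(float *local, const float *global, int width, int cores, int rank)
{
    int cnt = width / cores;                 /* elements per core */
    memcpy(local,
           global + (cnt * rank),            /* pointer offset in elements */
           (size_t)cnt * sizeof(float));     /* length in bytes */
    return cnt;
}
```

With 16 elements shared between 4 cores, rank 2 copies elements 8 to 11.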

Asynchronous use of the DMA channels requires the use of:


int e_dma_start(e_dma_desc_t *descriptor, e_dma_id_t chan);
int e_dma_busy(e_dma_id_t chan);
void e_dma_wait(e_dma_id_t chan);
void e_dma_set_desc(e_dma_id_t chan, unsigned config, e_dma_desc_t *next_desc, unsigned stride_i_src, unsigned stride_i_dst, unsigned count_i, unsigned count_o, unsigned stride_o_src, unsigned stride_o_dst, void *addr_src, void *addr_dst, e_dma_desc_t *descriptor);

I will cover these in a future post.

Final Thoughts


This post is more a how-to post on using COPRTHR and a toe-in-the-water look at MPI and DMA copying. While I'm convinced that the combination will lead to better structured and faster kernels, I'm not sure by how much and at what cost in terms of the complexity of the program.

My next step will be to compare MPI with direct memory writes along the lines of my data passing experiment. Until then, please let me know if I've missed anything.