Friday, 20 December 2013

Sorting of Spatial Data for Meshed Multi-Core Processors


Many algorithms that operate on two-dimensional data sets are well suited to multi-core processors such as the Parallella. Raw two-dimensional data generally arrives as an unsorted “point cloud” in which neither the x nor the y axis is sorted. To make the most efficient use of the processor, the points must be sorted so that each core can work on a subset of the space while minimising the need for inter-core communication. Therefore it makes sense to start the process with an efficient, two-dimensional sort in which the data ends up evenly distributed amongst the cores in clusters of “near neighbours”.
The target of this and subsequent blogs will be a design for Delaunay Triangulation, with the Parallella board as the target architecture. I believe that the Epiphany III and Epiphany IV chips on the Parallella are well suited to spatial processing given that their cores are organised in a mesh, i.e. each core is connected to its nearest neighbours to the north, south, east and west via multiple high-speed buses. Thus, if close data points are clustered on each core and neighbouring clusters are on neighbouring cores, the distance to be travelled when the clusters are finally matched up will be minimised.

Goals of the Algorithm

The goals of the sorting algorithm are:
  • The x value of each point in a cluster is less than those in the clusters on the cores to the east and greater than those on the cores to the west.
  • The y value of each point in a cluster is less than those in the clusters on the cores to the north and greater than those in the clusters on the cores to the south.
  • The points are evenly distributed between the cores.
In addition, we should take care that no unnecessary steps are introduced so that we end up with the most efficient process overall.

Distributing the Points

Given that the point cloud is assumed to be random and we want the same number of points per core, a round robin distribution seems the best way to start. This can be done as the data is being read in, which should give the cores enough time to complete at least some of the initial sort in parallel with the data load phase.
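As a sketch (host-side C++, with `Point`, `roundRobin` and the core count as illustrative stand-ins for whatever the real data structures turn out to be), the round robin deal-out might look like this:

```cpp
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Hypothetical host-side helper: deal points out to per-core buckets in
// round-robin order so every core ends up with (nearly) the same count.
std::vector<std::vector<Point>> roundRobin(const std::vector<Point>& cloud,
                                           std::size_t nCores)
{
    std::vector<std::vector<Point>> buckets(nCores);
    for (std::size_t i = 0; i < cloud.size(); ++i)
        buckets[i % nCores].push_back(cloud[i]);
    return buckets;
}
```

Dealing out by index modulo the core count guarantees the bucket sizes differ by at most one, regardless of the input order.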

Initial Sort

Let's not reinvent the wheel here. I think that a quicksort of the x values into one index and the y values into another index is the way to go.
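A minimal sketch of the dual-index idea, using `std::sort` on index arrays rather than moving the points themselves (the names here are illustrative, not from any existing library):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Two sort orders over the same point storage: one index sorted on x,
// one sorted on y. The points themselves never move.
struct SortedIndices { std::vector<std::size_t> byX, byY; };

SortedIndices buildIndices(const std::vector<Point>& pts)
{
    SortedIndices s;
    s.byX.resize(pts.size());
    s.byY.resize(pts.size());
    for (std::size_t i = 0; i < pts.size(); ++i) s.byX[i] = s.byY[i] = i;
    std::sort(s.byX.begin(), s.byX.end(),
              [&](std::size_t a, std::size_t b) { return pts[a].x < pts[b].x; });
    std::sort(s.byY.begin(), s.byY.end(),
              [&](std::size_t a, std::size_t b) { return pts[a].y < pts[b].y; });
    return s;
}
```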

Swap Sort

After the initial sort, each core will have a sorted group of points that could be from anywhere in the point cloud. The purpose of the swap sort is to push each point towards the neighbouring core where it will be with its nearest neighbours, or at least closer to them. I'm using a push or write-only method of communication between cores because a write message is a lot quicker than a read message. The cores must swap points (i.e. receive one point for every point sent) in order to preserve the even distribution of points.
The cores start the process by passing their largest x and y values to the cores to the east and north respectively via the c-mesh (let's call this the MyLargestX message). If the lowest x value of the core to the east is smaller than the highest x value of the sender, then these two cores must initiate a swap. This can be done in response to the MyLargestX message (let's call the response MySmallestXY). Simultaneously, the cores can be swapping along the y axis with a MyLargestY message. The swap is then completed with a MyLargestXY call from the initial core passing both the x and y values to the point's new “home”.
If the MyLargestX message contains an argument that is smaller than the smallest value on the receiving core then the receiving core does not initiate the swap. If, at a later time, that core receives a point that is smaller than the value received from the core to the south or west, then the swap can be initiated.
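The swap decision on the receiving core reduces to a one-line predicate. A hedged sketch, assuming each core keeps its x values sorted ascending (function and argument names are mine, not from the design above):

```cpp
#include <vector>

// Receive-side test for the MyLargestX message: the eastern core
// initiates a swap only if its smallest x is below the western
// neighbour's largest x. myXsSorted is assumed sorted ascending.
bool shouldSwap(double neighbourLargestX, const std::vector<double>& myXsSorted)
{
    return !myXsSorted.empty() && myXsSorted.front() < neighbourLargestX;
}
```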

End Conditions

The end conditions cannot be determined by any one core on behalf of all cores because each core only knows its own values and the largest values of the cores to the south and west. Therefore, each core must indicate to the controlling application that it has received a MyLargestX or MyLargestY that does not cause it to initiate a swap. This is only a tentative state because a core that is currently “happy” (i.e. has values that are larger than those on the cores to the south and west) may receive a point from the north or east that it must pass on. Therefore the controlling application must only terminate the process when all cores are “happy”.

Potential Problems

Sub-Optimal Swapping Pathways

Because each core can only make decisions based on the data that is currently in its local memory, there may be some cases where swaps initiated early in the process are then undone due to smaller values arriving from cores further to the north. Right now I can't see how this can be avoided.
Similarly, a swap to the west may be undone at a later stage after some swaps to the north. This could be avoided by swapping based on the x values (east-west) first and, when x is sorted, sorting on y (north-south). This would also free up some memory on each core given that only one sort index would be needed at a time (or indeed the points could be moved around in memory, removing the need for an index at all).

Skewed Data Sets

This kind of sorting will end up with the point cloud spread out into a square. This is fine if the original data set is also roughly square, or at least regularly distributed.

Diagram 1 – Near neighbours stay near.

This distribution probably will not work so well if the point cloud is not regular. Points that are not true near neighbours may end up on the same core and points that are near neighbours may end up on distant cores.

Diagram 2 – Distant points that are not near neighbours get squashed onto the same core(s)

Diagram 3 – Near neighbours get separated onto distant cores

In this case the user must be aware of the limitations of the sort and the algorithm that uses the sorted data must be robust enough to handle this situation.

Next: Using the sorted data

My next entry will be a description of how to triangulate the sorted points.
Don't hold your breath – the Parallella boards are being shipped and when I get mine, I'll be writing code, not English.

Tuesday, 3 December 2013

Something to play with while we wait.

I've just posted a lightweight C++ library that implements a three-layer feed-forward back-propagation neural network simulator on GitHub. The repository is here:

I developed this version on the Pi using the C++ standard library. This is an update of a version that I wrote a couple of years ago and is a very C++ oriented solution. There are classes for links, nodes and layers, using templates, multiple inheritance and pretty much every C++ feature I know.

My intention is to use this as a base version from which I'll develop a version for the Parallella when it arrives. The next thing I'll work on is a matrix-based implementation along the lines of the design outlined in the two previous posts on this blog.

If you want to download it and have a play please do. I hope my documentation is clear enough to get you started. 

One caveat though. I found github really confusing. If you are not familiar with it, downloading the files and not using git locally is probably the best way to go. Also, please don't post git or github questions - the answer will be "I have NO idea".

Please comment if you find any problems either on this post or on the repository wiki.

Thanks and have fun.

PS I just managed to get the Shuttle Statlog data uploaded. There are four training sets and one test set, the original and transformed data in an OpenOffice spreadsheet, and two scripts: one for create/train/save and one for test.

Tuesday, 1 October 2013

Training in Parallel

Training - the hard bit

Feeding forward is only part of the story. Any useful, real-world application with a significant number of inputs will need to be trained automatically using a significant number of example data sets. Reed and Marks[1], the text that I mainly use, quote Hinton[2], who says that a typical network with w weights requires O(w) training patterns and O(w) weight updates for good generalisation. On a serial machine this would require O(w^3) time, which would only be reduced by a factor of w on parallel hardware, giving a training time of O(w^2). Clearly, an efficient implementation of the training algorithm is required.

How training works

Typically, before training commences, the weights on the links and the bias values on the nodes are set to random values. When you feed your input pattern into the untrained network and calculate the output pattern, it will not bear any resemblance to the desired output. The weights and biases will have to be adjusted so that the calculated output matches the desired output (within predefined limits). The adjustment of each weight and bias is proportional to the amount of error it contributed to the final result.

For an output node the calculation is straightforward. The error is the desired value minus the calculated value. This error is then multiplied by the first derivative of the activation function, giving the delta for the bias and incoming links. This delta is used to "nudge" the incoming weights and output node bias in the direction that would result in a calculated output closer to the desired pattern.

The magnitude of the nudge is determined by a value called the "learning rate". This is a value between 0 and 1 which is a multiplicand when updating the weights and bias values.

The error for the hidden layer is a little more difficult. Each weight between the hidden and output layers has contributed a little to the error of the output nodes. The error of a hidden node is the sum of these contributions (specifically, each weight times the corresponding output node delta). Again, a delta for the hidden node is calculated by multiplying this summed error by the first derivative of the node's activation function.
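A sketch of the hidden delta, again assuming tanh, with the outgoing weights and downstream deltas as plain arrays (illustrative names only):

```cpp
#include <cstddef>
#include <vector>

// Hidden-node delta: sum each outgoing weight times the downstream
// output delta, then multiply by the activation derivative
// (tanh form, written via the hidden output y: 1 - y^2).
double hiddenDelta(const std::vector<double>& outWeights,
                   const std::vector<double>& outputDeltas,
                   double y)
{
    double err = 0.0;
    for (std::size_t k = 0; k < outWeights.size(); ++k)
        err += outWeights[k] * outputDeltas[k];
    return err * (1.0 - y * y);
}
```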

Once the error on each hidden and output node has been calculated the incoming weights and node biases must be updated. Reed and Marks describe two variations called Batch Update and Online Update.

Training Variation 1 - Batch Update

In Batch Update the whole training set is passed through the network. The deltas for every node for every training set are calculated. The "total error" for the training set is the average of each node's deltas. Once this has happened the weights and biases are updated once.

Training Variation 2 - Online Update

In Online Update, each training pattern (i.e. one set of inputs and the matching output) is run through the network, the deltas are calculated and the weights are updated straight away.

Practical Considerations

When to Stop

The idea with training is that the network is updated so that the discrepancy between the generated output and desired output is small while maintaining the generality of the solution. This is a balancing act. Clearly you want the network to produce an output that is a recognisable pattern (e.g. if an output is ON then it is greater than 0.8 and if it is OFF then it is less than -0.8). However, if you train the network too much it will eventually get to the stage where it only recognises the training set that you used.

Recognising the situation where you have achieved the desired output levels can be done using a technique such as the Sum of the Squared Error (SSE).
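The SSE itself is a one-liner; a sketch:

```cpp
#include <cstddef>
#include <vector>

// Sum of squared errors over one pattern: sum of (target - output)^2
// across the output nodes. Assumes the two vectors are the same length.
double sse(const std::vector<double>& target, const std::vector<double>& out)
{
    double s = 0.0;
    for (std::size_t i = 0; i < target.size(); ++i) {
        double e = target[i] - out[i];
        s += e * e;
    }
    return s;
}
```

Summed (or averaged) over the whole training set, this gives a single number to watch as training progresses.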

Overtraining is not so easy. In practice, you keep a known set of inputs with their corresponding outputs aside as a test set. When you think the network is getting close to a general solution, you run the test set (without doing any updates) and see if its output is as good as the training set's. If it is then great! Press Save. If the test set results are significantly worse than the training set's then you might have gone too far.

Once you have a well generalised network you may want to continue training to see if you can improve the result, but if the test set results start to diverge then you should back up to your last known general solution.

For the purposes of this design we need to keep in mind the calculation of the SSE and the ability to run a test set and not update afterwards. The maintenance and use of the test set will be left to the host application or as a manual process.

How do you know if you have the best network?

You can think of the weights in a neural network like a large multi-dimensional surface. A good solution represents a low point on this surface where all of the important weights are at a minimum. 

Diagram 8 - 2D Representation of a solution space

It is possible for your network to get stuck in a local minimum that does not represent a good solution. In this case no amount of training will make it better. There may also be a number of good solutions, one of which is the best.

The only way of finding the best solution is to train the network many times from different starting points. The starting point in training is the set of random numbers that represent the initial weights. Therefore our system must have the ability to "unlearn" everything (hopefully after the user has pressed Save) and start again using a new set of random numbers.
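The "unlearn" step is just a re-randomisation of the weights from a fresh seed. A sketch using the standard library generator (the -0.5..0.5 range is an arbitrary choice for illustration):

```cpp
#include <random>
#include <vector>

// Wipe the network back to a fresh random starting point. A different
// seed gives a different starting point on the error surface, so
// repeated train/unlearn cycles explore different minima.
void reinitialise(std::vector<double>& weights, unsigned seed)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(-0.5, 0.5);
    for (double& w : weights) w = dist(gen);
}
```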

Keeping the Cores Busy

To get the best performance out of the available hardware we should also consider how best to use all the features of the epiphany architecture.

Clearly the feed forward pass and error calculation (and weight update in online mode) are going to keep the cores busy for a significant time, and I presume that this task will be the processing bottleneck. Therefore keeping the cores busy and reducing the waiting time will be the key to optimum performance.

The off-chip network is connected to local memory via the DMA controller. To keep the cores "fed" with data, we should try to arrange the host process to send the next training batch to the core's local memory while it is still working on the current one. This should allow the next batch to commence as soon as it has finished. 

Where we left off...

At the end of the feed forward pass (assuming that the host has been diligent and passed the target output values while the cores were busy) our local memory would look like this:

Diagram 9 - Local memory at the end of the Feed Forward Pass

In this diagram:

  • red indicates values that have been passed to the core by the host (t(u) and t(v) are the target values for z(u) and z(v))
  • blue indicates values that have been calculated by a core (z(u), z(v), y(p) and y(q) have been calculated on Core J while y(1).. y(p-1) have been passed from upstream cores and y(q+1) .. y(N) from downstream cores)
  • purple indicates values that could either be calculated or passed (i.e. the weights)

Training Stage 1: Calculate the Output Error and Delta

Calculating the error and associated delta for an output node is trivial. The host can determine which core will calculate each final output value and send the target values to it.

Training Stage 2: Calculate the Hidden Error and Delta

The hidden node error is a little more difficult. The problem is that in my current model, the outbound weights from each hidden node are distributed across all of the cores. A few possible solutions come to mind:

1. Swap space for speed

In the example, Core J can only calculate the "hidden" error that is associated with Output(zu) and Output(zv) because it only has the links between Hidden(yp), Hidden(yq) and Output(zu), Output(zv). It actually wants to calculate the whole error attributed to Hidden(yp) and Hidden(yq). To do this it would have to have a copy of all the weights between its hidden nodes (yp and yq) and all the output node deltas.

This is possible if each core had a copy of its own outbound weights and we could distribute the output deltas by using the same mechanism we used with the hidden layer values.

Diagram 10 - Space for Speed compromise

Clearly this strategy requires each core to have two sets of Hidden-Output links: the inbound links for its output nodes and the outbound links for its hidden nodes. When training in batch mode the weights don't change from one training set to the next, so the two sets of weights start synchronised and remain so until the update pass.

The additional work to constantly update two sets of weights required for on-line mode suggests that this strategy would only be contemplated for batch mode.

2. Calculate and distribute

A less memory intensive method would be to calculate weight * delta for all weights and deltas available to the core and to pass those to the core that needs them.

This would mean that the data flowing around would use the fast on-chip write network to its fullest extent. The value calculated by each core would only have to be sent to the core that needs it, so the path would be determined by the Epiphany's "x then y" rule. The largest number of hops would be 6 (on an Epiphany-16), for example between Core 1 and Core 13 at opposite corners as described in Diagram 7.

Once the hidden delta has been calculated by the core that owns the hidden node, it is at least in the right place. That core can either accumulate it for a later batch update or use it to update the node's inbound weights straight away in online mode.

3. Let the ARM cores look after it

Clearly neither of these is a great solution. There is another possibility, however. The host CPUs (i.e. the ARM cores) also have a full set of data. Up until now we have only required them to keep sending the data to the cores and not do any computation. There are two of them and both have considerable resources available in terms of memory and number crunching power.

If the output value or the output delta for each output node is passed back to the host, then it could work out the remaining deltas while it is waiting for the next training pattern to be processed. The decision about what to pass back to the ARM cores would be based on how long the host takes to do its part of the job. The host's main task is still to keep the work up to the Epiphany.

Again, batch mode would be fine with this. The ARM cores would accumulate the deltas and, when the training set was done, send them to each core, which could then update the weights. This would introduce a short pause for the Epiphany cores while the final output and hidden deltas and total batch errors are calculated and sent to each core for the update.

Online mode... again... would have a problem. If the weights are to be updated after every training example then the Epiphany cores would be waiting around for the ARM cores to calculate and send the updates. This does not seem to be a good solution.

Training Stage 3: Update Weights and Biases

Once the output and hidden node errors have been calculated, each bias and weight needs to be nudged towards a better solution. Given that each core has a copy of each incoming weight and (after we figure out the best way of determining the hidden layer error) each will have the error of its own nodes, the update of the weights is straightforward. Each weight is nudged by its input value * delta * learningRate.

In online mode this would happen straight away and after the update the delta could be discarded.

In batch mode the error would have to be accumulated somewhere, either by each core or by the host CPU, and when the training set was complete the final "total batch error" could be calculated. The accumulated errors would then be used to update the weights.

Up Next...

While we wait for our Parallellas to arrive I thought I'd pull apart my Windows version and get a half decent interface together. I'll start on some documentation for that but it won't mean much until the guts have been written.

Also, I'll look around for some decent, well known test data. I want to be able to get a couple of big problems to run through it and test out the scalability of the solution. If anyone knows of some good public test data please send me a link.

[1] Reed, R.D. and Marks, R. J. "Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks", MIT Press 1999.
[2] Hinton, G.E. Connectionist learning procedures. Artificial Intelligence 40(1): 143-150. 1989.

Monday, 2 September 2013

Parallel Neural Networks - Carving up the task


Over a couple of blogs I'll break down the process of getting the best performance out of a parallel architecture for processing neural networks. 

A few points before I begin:

First let me say that I am only going to look at Feed Forward - Back Propagation networks in the first instance. I might get into other types of neural networks with other training methods at a later time but for now I'm keeping things simple. This is also not a rigorous academic work. I'm only going to cover neural network theory in as much detail as I need in order to illustrate my program design.

Second, I must point out that this is an experiment. The method I develop here might not be the most efficient and if you believe you have a better way, please let me know. If I find a better way I'll update the post.

Third, I'm focussing on the Adapteva Parallella architecture. Whatever I write might work on other platforms but no promises from me. If you think it will work elsewhere please give it a try and let me know.

Carving Up the Task

To get the best performance it is important to carve up the task so that the available hardware is as busy as it can be. Let's begin by looking at the task.

Feed Forward - Back Propagation Networks

This is the basic shape of a feed forward - back propagation network:

Diagram 1 - Basic Structure

An input vector |x| (x(1) to x(n)) is fed into the network and produces an output vector |z| (z(1) to z(o)) via an intermediate or "hidden" vector |y| (y(1) to y(m)). How inputs are transformed into outputs will be explained in the next paragraph. In the diagram, each circle is referred to as a node and the nodes for each vector |x|, |y| and |z| are collectively known as a layer. Each node in the input layer is connected by a link (a.k.a. an arc) to every node in the hidden layer and similarly, each hidden node is linked to every output node.

The transition between input nodes and hidden nodes is shown in the following diagram:

Diagram 2 - Calculation of a single node (y(p))

In this example a single node in the hidden layer (y(p)) is calculated. To calculate y(p), each input value is multiplied by a weight (w(1,p) to w(n,p) in this diagram). The results are summed, added to the bias value b(p) and passed through the activation function. The activation function is usually a non-linear function such as sigmoid or tanh.
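A sketch of this single-node calculation, assuming tanh as the activation function (names are illustrative):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One hidden node y(p): weighted sum of the inputs plus the bias b(p),
// passed through the activation function (tanh here).
double nodeValue(const std::vector<double>& x,
                 const std::vector<double>& w,  // w(1,p) .. w(n,p)
                 double bias)
{
    double sum = bias;
    for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * w[i];
    return std::tanh(sum);
}
```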

This is the Feed Forward component of the behaviour. Back Propagation refers to the process by which weights and threshold values are adjusted in order for the network to "recognise" an input pattern and produce the desired output pattern. To use back propagation you need a large number of matched input and output vectors. The input vectors are presented to the network and the resultant outputs produced by the feed forward pass are compared to the desired outputs in the matched set. The error for each output node is calculated, and the weights between the hidden and output nodes, as well as the threshold values, are then adjusted to reduce the magnitude of the error. Once the weights of the hidden-to-output links have been adjusted, the weights of the input-to-hidden links are adjusted along with the threshold values on the hidden nodes. Thus the error is propagated backwards through the network. This is termed "training the network".

In the feed forward step weight multiplication must happen for each link between every input value and each hidden node and every hidden node and each output node. This is the bulk of the computation that must be optimised. The summation and Activation Function also represent significant work.

In the back propagation step each weight and threshold value must be adjusted in proportion to its contribution to the error. The adjustment is also "dampened" by a factor known as the learning rate (0..1), so the weight adjustment is the multiplication of three values. Therefore, training (a feed forward pass followed by a back propagation pass) represents significantly more work than a feed forward pass alone.

Organising the Feed Forward Pass

Splitting apart Diagram 1 the complete task now looks like this:

Diagram 3 - Diagram 1 rearranged

I have rearranged Diagram 1 to highlight the regularity of the computation. To calculate a derived value (i.e. a hidden or output node) the same pattern must be repeated. To again redraw the task in a more programmatic way:

Diagram 4 - Node values and weights as arrays

Diagram 4 now has all the computation steps to calculate a derived value (hidden nodes in this case). The input vector and weight vectors are shown as array elements that need to be multiplied together to form the input to the summation and subsequently the activation function. 

This is now the sort of problem that we can break down into work items for Parallella cores.

Let's Get Practical

One Node per Core

Probably the most obvious way to split up the task is to assign one node to each core. This has the advantage of being simple to understand and therefore to program. The big disadvantage is that optimal performance is unlikely because the number of nodes will rarely be evenly divisible by the number of cores.

Round Robin

It would also be possible to assign input * weight calculations from each link to each core in a Round Robin manner. This would ensure that all cores got the same amount of work but collecting all of the weighted inputs together for the summation step would be a communication headache.


The best way I have come up with to split up the task is to partition it based on the number of cores. Assume that there are six input nodes and 16 hidden nodes. To split the task evenly between cores, core J, somewhere in the middle of the array, would get the following task:

Diagram 5 - Core Assignment Example 1

Assigning the weight calculations evenly means that each core gets 10 to do. With 6 input values that means that there will be an overlap of 4 on each core. 

If a core completes a whole set of link multiplications (for yq in this example) then it has all the information it needs to do the summation and activation function for that node.

If it starts but does not finish the job then it passes the partial sum onto the next core (core K in this example) which then finishes the sum and applies the activation function (for yr).

If it completes the job but does not start it then it must wait for the previous core to pass the partial sum before completing the sum and applying the activation function (for yp in this example).

The only other case to consider is when a core does not get to start or complete a whole derived node's worth of input * weight calculations. 

Diagram 6 - Core Assignment Example 2

In this case, the core must wait for the incoming partial sum before passing it on to the next core. This would be a common occurrence on the Epiphany-64. 
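For reference, the contiguous partitioning itself is simple: if the T link multiplications are flattened into one array and split across C cores, core j owns a half-open range of it. A sketch (the even split, and the names, are my assumptions):

```cpp
#include <cstddef>
#include <utility>

// Even split of the flattened link array: with T link multiplications
// and C cores, core j owns the half-open range [first, second).
// Integer arithmetic spreads any remainder across the cores.
std::pair<std::size_t, std::size_t> coreRange(std::size_t j, std::size_t T,
                                              std::size_t C)
{
    return { j * T / C, (j + 1) * T / C };
}
```

A core's range then straddles zero or more node boundaries, which is exactly what determines whether it starts, finishes, or merely forwards a partial sum.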

Moving from the Hidden Layer to the Output Layer

The partitioning strategy assumes that all inputs to the calculation (x1..6 in the example) are known at the time the process starts. This can be arranged when calculating the hidden nodes: all inputs and weights needed by each core can be sent to it during the set up phase.

At the end of the calculation of the hidden node values however, each core only has those hidden values that it has calculated itself. Therefore an efficient method must be devised to transmit all calculated values to all cores.

The most obvious method would be to write all hidden node values back to global memory and then each core can read the ones that it does not have (or read all of them if it is quicker).

My approach would be to try something different. Why not use the on-chip write network to transmit the hidden node values to the left and right as they are calculated? In this way, the transmission can be started before all of the values are known and each core can start the second phase calculation when it has a full set of values.

It might be best to initially pass the hidden node values in one direction (e.g. to a core with a higher index; see next section) and then, when the core has received all of its upstream values, to pass in the other direction. This would mean that the on-chip write network would be writing in one direction at a time and therefore would hopefully avoid collisions.

I'm not sure which strategy will work best. The on-chip write network would be pretty busy for a period. However, if each core can determine when it has a full set of values, this strategy avoids the need for coordination to make sure that all values have been posted to global memory before the cores are updated. Also, the off-chip write network is 8 times slower and would get pretty congested while all the cores were updating.

Epiphany Internal Network Configuration

The partitioning strategy as outlined above has a linear relationship between the cores that reflects the linear organisation of the data. I propose that the most efficient flow of data between cores is as shown:

Diagram 7 - Data Pathway on a Epiphany-16 

Execution Flow

I think that it is worth having a think about how the execution will happen and how the inter-core communication will fit into it.

In Example 2 above there is only one job to do so starting at the lowest index and working through to the highest would be the most obvious option.

Example 1 however would be best done in reverse, i.e. the partial sum that is passed to the next core should be calculated first and passed so that it is ready when that core needs it. In this case the execution flow would be as follows:

0. Pass the input values and partitioned weight vectors (w(x,y) and w(y,z)) to each core
1. Execute the input * weight calculations for any partial sum that must be passed to the next core
2. Pass the partial sum
3. Execute the input * weight / summation / activation function for the nodes that are wholly the task of the core
4. Pass the derived node values to the neighbouring cores to update their local memory in preparation for the next layer
5. Execute the input * weight calculations for any node where a partial sum is expected from an upstream core
6. Wait for the partial sum (it should already be there from step 2)
7. Add the incoming partial sum to the sum of the locally calculated values
8. Calculate the activation function for the node
9. Pass the derived node value to the neighbouring cores
10. Repeat steps 1-9 for the second layer, ignoring steps 4 and 9 (no need to update neighbouring cores)
11. Pass the final output values back to the host.

Next Up... Training

Now that I've covered the feed forward process, I'll spend some time thinking about training. Again, it is the transition from the output layer back to the hidden layer that seems to be the tricky bit. 

Please post any comments if you have spotted any problems with my process or if you have any other ideas.