PipelineC (again), dct example, looking for help/interest

  • Thread starter Julian Kemmerer
  • Start date
J

Julian Kemmerer

Guest
Hi folks looking for feedback on PipelineC. Ideas of what to implement next..

I will point you to a recent reddit post which ultimately points to GitHub.

https://www.reddit.com/r/FPGA/comments/d0x2p5/serial_8x8_dct_in_pipelinec_lower_resource_usage/

Here is the code to get you interested:

// This is the unrolled version of the original dct copy-and-pasted algorithm
// https://www.geeksforgeeks.org/discrete-cosine-transform-algorithm-program/
// PipelineC iterations of dctTransformUnrolled are used
// to unroll the calculation serially in O(n^4) time

// Input 'matrix' and start=1 to begin calculation
// Input 'matrix' must stay constant until return .done

// 'sum' accumulates over iterations/clocks and should be pipelined
// So 'sum' must be a volatile global variable
// Keep track of when sum is valid and be read+written
volatile uint1_t dct_volatiles_valid;
// sum will temporarily store the sum of cosine signals
volatile float dct_sum;
// dct_result will store the discrete cosine transform
// Signal that this is the iteration containing the 'done' result
typedef struct dct_done_t
{
float matrix[DCT_M][DCT_N];
uint1_t done;
} dct_done_t;
volatile dct_done_t dct_result;
dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t start)
{
// Assume not done yet
dct_result.done = 0;

// Start validates volatiles
if(start)
{
dct_volatiles_valid = 1;
}

// Global func to handle getting to BRAM
// 1) Lookup constants from BRAM (using iterators)
// 2) Increment iterators
// Returns next iterators and constants and will increment when requested
dct_lookup_increment_t lookup_increment;
uint1_t do_increment;
// Only increment when volatiles valid
do_increment = dct_volatiles_valid;
lookup_increment = dct_lookup_increment(do_increment);

// Unpack struct for ease of reading calculation code below
float const_val;
const_val = lookup_increment.lookup.const_val;
float cos_val;
cos_val = lookup_increment.lookup.cos_val;
dct_iter_t i;
i = lookup_increment.incrementer.curr_iters.i;
dct_iter_t j;
j = lookup_increment.incrementer.curr_iters.j;
dct_iter_t k;
k = lookup_increment.incrementer.curr_iters.k;
dct_iter_t l;
l = lookup_increment.incrementer.curr_iters.l;
uint1_t reset_k;
reset_k = lookup_increment.incrementer.increment.reset_k;
uint1_t reset_l;
reset_l = lookup_increment.incrementer.increment.reset_l;
uint1_t done;
done = lookup_increment.incrementer.increment.done;


// Do math for this volatile iteration only when
// can safely read+write volatiles
if(dct_volatiles_valid)
{
// ~~~ The primary calculation ~~~:
// 1) Float * cosine constant from lookup table
float dct1;
dct1 = (float)matrix[k][l] * cos_val;
// 2) Increment sum
dct_sum = dct_sum + dct1;
// 3) constant * Float and assign into the output matrix
dct_result.matrix[j] = const_val * dct_sum;

// Sum accumulates during the k and l loops
// So reset when they are rolling over
if(reset_k & reset_l)
{
dct_sum = 0.0;
}

// Done yet?
dct_result.done = done;

// Reset volatiles once done
if(done)
{
dct_volatiles_valid = 0;
}
}

return dct_result;
}


What does this synthesize to?

Essentially a state machine where each state uses the same N clocks worth of logic to do work. (the body of dctTransformUnrolled).

Consider the 'execution' of the function in time order. The logic consists of:

~17% of time for getting lookup constants & incrementing the iterators (dct_lookup_increment), reading the [k][l] value out of input 'matrix'

~21% of time for the 1) Float * cosine constant from lookup table, a floating point multiplier

~34% of time for the 2) Increment sum addition, a floating point adder

~21% of time for the 3) constant * Float, a floating point multiplier

~5% of time for the 3) assignment into the output matrix at [j]

That pipeline takes some fixed number of clock cycles N. That means every N clock cycles 'dct_volatiles_valid' will =1 (after being set at the start). The algorithm unrolls as O(n^4) for 4096 total iterations. So the total latency in clock cycles is N * 4096.
 

Welcome to EDABoard.com

Sponsor

Back
Top