Lecture 7

Strategy 2

The input tile includes the halo elements.

[Figure: tile_s2]

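The block must be large enough to cover the input tile.
Each thread in the block loads one element of the input tile into shared memory.
Some threads are turned off during the computation.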

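In the 2D convolution figure below, the output tile covers the green box of the input. The input tile additionally covers the yellow halo region, because those elements are needed to compute the convolution near the edges of the output tile.

Therefore, the block dimensions are set to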

dim3 dimBlock(TILE_WIDTH + MASK_WIDTH - 1, TILE_WIDTH + MASK_WIDTH - 1, 1);
dim3 dimGrid((Width - 1) / TILE_WIDTH + 1, (Width - 1) / TILE_WIDTH + 1, 1);

TILE_WIDTH is actually the width of the output tile.

[Figure: tiled_2D_conv_s2]
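As a rough sizing check for this launch configuration, the sketch below works through the arithmetic at compile time. The values TILE_WIDTH = 16 and MASK_WIDTH = 5 are assumptions for illustration (the notes do not fix them), and the names BLOCK_WIDTH, blocks_per_dim, threads_per_block and active_threads are hypothetical.

```cuda
// Assumed values for illustration only; the notes do not fix them.
constexpr int TILE_WIDTH  = 16;  // output tile width
constexpr int MASK_WIDTH  = 5;   // mask width (odd)
constexpr int BLOCK_WIDTH = TILE_WIDTH + MASK_WIDTH - 1;  // input tile width, halo included

// Blocks needed along one dimension: ceil(width / TILE_WIDTH) in integer arithmetic.
constexpr int blocks_per_dim(int width) {
    return (width - 1) / TILE_WIDTH + 1;
}

// Threads launched per block, and how many of them compute an output element.
constexpr int threads_per_block = BLOCK_WIDTH * BLOCK_WIDTH;  // 20 * 20 = 400
constexpr int active_threads    = TILE_WIDTH * TILE_WIDTH;    // 16 * 16 = 256

static_assert(threads_per_block <= 1024, "block exceeds the 1024-thread limit");
```

With these assumed values, a 100 x 100 input needs a 7 x 7 grid, and 256 of the 400 threads in each block (64%) compute an output. The remaining threads only load halo elements.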

Converting output tile coordinates to input tile coordinates

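int tx = threadIdx.x;
int ty = threadIdx.y;

// Output coordinates: rows follow blockIdx.y / threadIdx.y, columns follow blockIdx.x / threadIdx.x.
int row_o = blockIdx.y * TILE_WIDTH + ty;
int col_o = blockIdx.x * TILE_WIDTH + tx;

// Input coordinates: shift up and left by the mask radius (radius = MASK_WIDTH / 2 for an odd mask).
int row_i = row_o - radius;
int col_i = col_o - radius;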
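To make the shift concrete, the mapping can be written as a pair of small helpers that run on both host and device. This is only a sketch: the sizes TILE_WIDTH = 16 and MASK_WIDTH = 5 are assumed, and Coord, output_coord and input_coord are hypothetical names that do not appear in the lecture code.

```cuda
// Assumed sizes for illustration; the notes do not fix them.
constexpr int TILE_WIDTH = 16;
constexpr int MASK_WIDTH = 5;
constexpr int RADIUS     = MASK_WIDTH / 2;  // halo width on each side

struct Coord { int row; int col; };

// Output coordinate associated with thread (tx, ty) of block (bx, by).
__host__ __device__ inline Coord output_coord(int bx, int by, int tx, int ty) {
    return { by * TILE_WIDTH + ty, bx * TILE_WIDTH + tx };
}

// Input coordinate the same thread loads: shifted up and left by the radius.
__host__ __device__ inline Coord input_coord(int bx, int by, int tx, int ty) {
    Coord o = output_coord(bx, by, tx, ty);
    return { o.row - RADIUS, o.col - RADIUS };
}
```

Under these assumptions, thread (0, 0) of block (0, 0) maps to input coordinate (-2, -2), which lies outside the image. This is the case the zero-fill in the load step handles.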

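[Figure: output2input]

Enough threads are now launched to load every element of the input tile. Elements that fall outside the input are loaded as 0.

float Pvalue = 0.f;

// tile is the __shared__ array holding the input tile, halo included.
if ((row_i >= 0 && row_i < Width) &&
    (col_i >= 0 && col_i < Width)) {
    tile[ty][tx] = N[row_i * Width + col_i];
}
else {
    tile[ty][tx] = 0.f;
}

// Wait until the whole input tile is in shared memory before any thread reads it.
__syncthreads();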

Not all threads compute an output.

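// Only the TILE_WIDTH x TILE_WIDTH threads whose output lies inside the image
// compute and write. The write must stay inside this guard: for the remaining
// threads, row_o / col_o fall in a neighboring block's output tile, and writing
// Pvalue = 0 there would overwrite that block's result.
if (ty < TILE_WIDTH && tx < TILE_WIDTH &&
    row_o < Width && col_o < Width) {
    for (int i = 0; i < MASK_WIDTH; i++) {
        for (int j = 0; j < MASK_WIDTH; j++) {
            Pvalue += Mc[i][j] * tile[i+ty][j+tx];
        }
    }
    P[row_o * Width + col_o] = Pvalue;
}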
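Putting the fragments together, a complete kernel for Strategy 2 might look like the sketch below. Several details are assumptions rather than something the notes state: the sizes TILE_WIDTH = 16 and MASK_WIDTH = 5, the mask Mc living in constant memory, a square Width x Width input, and the names BLOCK_WIDTH, conv2d_tiled_s2 and run_conv2d. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed sizes for illustration; the notes do not fix them.
constexpr int TILE_WIDTH  = 16;                           // output tile width
constexpr int MASK_WIDTH  = 5;                            // odd mask width
constexpr int BLOCK_WIDTH = TILE_WIDTH + MASK_WIDTH - 1;  // input tile width, halo included

// Convolution mask in constant memory (assumed; the notes only show Mc[i][j]).
__constant__ float Mc[MASK_WIDTH][MASK_WIDTH];

__global__ void conv2d_tiled_s2(const float* N, float* P, int Width) {
    __shared__ float tile[BLOCK_WIDTH][BLOCK_WIDTH];

    const int radius = MASK_WIDTH / 2;
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    // Output element this thread may compute.
    const int row_o = blockIdx.y * TILE_WIDTH + ty;
    const int col_o = blockIdx.x * TILE_WIDTH + tx;

    // Input element this thread loads: shifted up and left by the halo radius.
    const int row_i = row_o - radius;
    const int col_i = col_o - radius;

    // Every thread loads one input-tile element; out-of-range cells become 0.
    if (row_i >= 0 && row_i < Width && col_i >= 0 && col_i < Width)
        tile[ty][tx] = N[row_i * Width + col_i];
    else
        tile[ty][tx] = 0.f;

    __syncthreads();  // the whole tile must be loaded before any thread reads it

    // Only threads that own an output element inside the image compute and write.
    if (ty < TILE_WIDTH && tx < TILE_WIDTH && row_o < Width && col_o < Width) {
        float Pvalue = 0.f;
        for (int i = 0; i < MASK_WIDTH; i++)
            for (int j = 0; j < MASK_WIDTH; j++)
                Pvalue += Mc[i][j] * tile[i + ty][j + tx];
        P[row_o * Width + col_o] = Pvalue;
    }
}

// Hypothetical host wrapper: copies data to the GPU, launches the kernel,
// and returns the Width x Width result. mask holds MASK_WIDTH * MASK_WIDTH values.
std::vector<float> run_conv2d(const std::vector<float>& in,
                              const std::vector<float>& mask, int Width) {
    const size_t bytes = size_t(Width) * Width * sizeof(float);
    float *dN = nullptr, *dP = nullptr;
    cudaMalloc(&dN, bytes);
    cudaMalloc(&dP, bytes);
    cudaMemcpy(dN, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(Mc, mask.data(), MASK_WIDTH * MASK_WIDTH * sizeof(float));

    dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH, 1);
    dim3 dimGrid((Width - 1) / TILE_WIDTH + 1, (Width - 1) / TILE_WIDTH + 1, 1);
    conv2d_tiled_s2<<<dimGrid, dimBlock>>>(dN, dP, Width);

    std::vector<float> out(size_t(Width) * Width);
    cudaMemcpy(out.data(), dP, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel finishes
    cudaFree(dN);
    cudaFree(dP);
    return out;
}
```

Keeping the bounds check and the write to P inside the same guard as the computation means halo-only threads never touch the output, so each output element is written by exactly one thread.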