Dot product from "CUDA by Example" does not work for me

I am starting to read the book "CUDA by Example" and I have a problem with the dot product example that uses shared memory. I copied the example from the book and set N = x * 1024, threadsPerBlock = 32, blocksPerGrid = 8, and tested x = 2, 3, 4, 5. With x = 3 the result is wrong, but with x = 2, 4, 5 everything is fine. I don't understand where the problem is. The code is:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>

#define imin(a, b) ((a) < (b) ? (a) : (b))
#define sum_squares(x) ((x) * ((x) + 1) * (2 * (x) + 1) / 6)

const int x = 3;
const int N = x * 1024;
const int threadsPerBlock = 32;
const int blocksPerGrid = 8;

__global__ void dot(float *a, float *b, float *c)
{
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;
    float temp = 0;

    // Grid-stride loop: each thread accumulates its share of the products.
    while (tid < N)
    {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }

    cache[cacheIndex] = temp;

    __syncthreads();

    // Tree reduction in shared memory; blockDim.x must be a power of two.
    int i = blockDim.x / 2;
    while (i != 0)
    {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    // Thread 0 of each block writes that block's partial sum.
    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0];
}

int main()
{
    float *a, *b, *partial_c, result;
    float *d_a, *d_b, *d_partial_c;

    a = (float *)malloc(N * sizeof(float));
    b = (float *)malloc(N * sizeof(float));
    partial_c = (float *)malloc(blocksPerGrid * sizeof(float));

    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMalloc((void **)&d_b, N * sizeof(float));
    cudaMalloc((void **)&d_partial_c, blocksPerGrid * sizeof(float));

    for (int i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = 2 * i;
    }

    cudaMemcpy(d_a, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(float), cudaMemcpyHostToDevice);

    dot<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_partial_c);

    cudaMemcpy(partial_c, d_partial_c, blocksPerGrid * sizeof(float), cudaMemcpyDeviceToHost);

    // Finish the reduction on the host by adding the per-block partial sums.
    result = 0;
    for (int i = 0; i < blocksPerGrid; i++)
        result += partial_c[i];

    // Compare against the closed-form value, both sides evaluated in float.
    if (2 * sum_squares((float)(N - 1)) == result)
        printf(":)\n");
    else
        printf(":(\n");

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_partial_c);

    free(a);
    free(b);
    free(partial_c);

    getchar();
    return 0;
}

2016-07-23 Pavel Angel Mendoza Villafane

Because float is not precise enough: it carries only about 7 significant decimal digits. For x = 3, however, your expected result is

19317916672

which has 11 digits.

For x = 4 and 5, the results are also wrong on my machine.

2016-07-23 12:03:34 kangshiyin

OK, I changed float to double and it works. But I don't understand why, in my case, x = 4 and 5 (with a larger result) cause no problems while x = 3 (a smaller result) fails.

@PavelAngelMendozaVillafane You could print the two float numbers you compared, and the exact result, to see why. '2 * sum_squares((float)(N - 1))' does not necessarily give the correct result with 'float'. – kangshiyin
