I think I found a solution, at least as far as getting the correct answer from the GNU-compiled code is concerned. In my opinion, the key to the problem here is that, due to the multi-platform targeting of the OpenMP API, OMP directives are not actual code, but instructions that tell the compiler how to translate the source into device-oriented code. The OpenMP standard defines a series of default rules for the communication between host and device, designed to be both simple and effective. If I compile the example code with nvcc on a machine with an NVIDIA device, these defaults work nicely. If, on the contrary, I use other compilers, like gcc, I need to add compilation flags and explicit mapping directives to get the data handled the way I need.
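To illustrate what these default rules mean in practice, here is a minimal standalone sketch (the variable names are mine, not taken from the code below): as far as I understand the defaults, since OpenMP 4.5 a scalar referenced inside a target region is implicitly firstprivate, so a value written on the device is not copied back to the host unless it is mapped explicitly.

#include <stdio.h>

int main(void) {
    double implicit_scalar = 0.0;
    double mapped_scalar = 0.0;
    // Default rule: a referenced scalar is implicitly firstprivate,
    // so the device writes to a private copy and the host value
    // stays 0.0 (when the region actually runs on a device).
    #pragma omp target
    {
        implicit_scalar = 1.0;
    }
    // Explicit map(tofrom:) copies the updated value back to host.
    #pragma omp target map(tofrom: mapped_scalar)
    {
        mapped_scalar = 1.0;
    }
    printf("implicit: %lf, mapped: %lf\n", implicit_scalar, mapped_scalar);
    return 0;
}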
More specifically, it looks like when I compile the code with gcc, all the information that I map explicitly is passed between host and device, except the return value of the target-declared function. My solution was therefore to add a wrapper host function, gauss(), which opens the target region and calls the target-declared function, increment(), from within it. The difference is that the return value of increment() is not assigned to a variable that has to be mapped back to the host; instead, it is used to build the array ON the device. The for loop then just accumulates the array elements into the host-mapped wrapper result.
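For contrast, here is a condensed, hypothetical sketch of the shape that failed for me (this is not my original code, and gauss_on_device() is a name I made up for illustration): the whole computation lives in a target-declared function, and its return value is assigned inside the target region; that assigned value is what never made it back to the host when I compiled with gcc.

#include <stdio.h>
#include <math.h>
#include <omp.h>

#pragma omp begin declare target
double gauss_on_device(size_t size) {
    double sum = 0.0;
    const double dx = 20.0 / (1.0 * size);
    #pragma omp parallel for simd reduction(+:sum)
    for (size_t i = 0; i < size; i++) {
        double x = 10.0 * i / (0.5 * size) - 10.0;
        sum += exp(-1.0 * x * x) * dx;
    }
    return sum;
}
#pragma omp end declare target

int main(void) {
    double result = 0.0;
    #pragma omp target map(tofrom: result)
    {
        // The device function's return value is assigned inside the
        // target region; this is the value that got lost with gcc.
        result = gauss_on_device(1 << 20);
    }
    printf("result^2 = %lf\n", result * result);
    return 0;
}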
The code which worked for me is:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h> // int64_t, used as the loop index in main()
#include <omp.h>

// This is a device function: it is called on the device and its
// return value is consumed on the device.
#pragma omp begin declare target
double increment(double x_val, double diff_val) {
    return (exp(-1.0 * x_val * x_val) * diff_val);
}
#pragma omp end declare target
// This is a wrapper function: it is called by the host
// and it opens a target region where the device function
// is executed. The return value of the device function is
// used to build the return value of the wrapper.
double gauss(double* d_array, size_t size) {
    double d_result = 0.0;
    const double dx = 20.0 / (1.0 * size);
    // map(alloc:) gives d_array device-side storage only: its
    // contents are built and consumed on the device and are
    // never copied back to the host.
    #pragma omp target map(tofrom: d_result) map(to: dx, size) map(alloc: d_array[0:size])
    {
        #pragma omp teams distribute parallel for simd reduction(+:d_result)
        for (size_t i = 0; i < size; i++) {
            double x = 10.0 * i / (0.5 * size) - 10.0;
            // Here I call the target-declared function increment()
            // to get the integration element that I want to sum.
            d_array[i] = increment(x, dx);
            d_result += d_array[i];
        }
    }
    return d_result;
}
int main(int argc, char *argv[]) {
    double t_start = 0.0, t_end = 0.0;
    double result = 0.0;
    size_t gb = 1024 * 1024 * 1024;               // bytes in one GiB
    size_t size = 2;                              // work on 2 GiB of doubles
    size_t elements = gb / sizeof(double) * size;
    double *array = (double*)malloc(elements * sizeof(double));
    if (array == NULL) {
        fprintf(stderr, "Failed to allocate %zu bytes.\n", elements * sizeof(double));
        return 1;
    }
    // Testing inline: this works fine
    printf("Running offloaded calculation from main()...\n");
    t_start = omp_get_wtime();
    #pragma omp target map(tofrom: result) map(to: elements) map(alloc: array[0:elements])
    {
        const double dx = 20.0 / (1.0 * elements);
        #pragma omp teams distribute parallel for simd reduction(+:result)
        for (int64_t i = 0; i < elements; i++) {
            double x = 10.0 * i / (0.5 * elements) - 10.0;
            array[i] = exp(-1.0 * x * x) * dx;
            result += array[i];
        }
    }
    t_end = omp_get_wtime();
    // The loop approximates the Gaussian integral of exp(-x*x) over
    // (-10, 10), i.e. ~sqrt(pi), so squaring the sum gives ~pi.
    result *= result;
    printf("The result of the inline test is %lf.\n", result);
    printf("Inline calculation lasted %lfs.\n", (t_end - t_start));
    // Testing wrapper function with target-declared calls: also works
    printf("Running offloaded calculation from gauss()...\n");
    t_start = omp_get_wtime();
    result = gauss(array, elements);
    t_end = omp_get_wtime();
    result *= result;
    printf("The result of the test using gauss() is %lf.\n", result);
    printf("The calculation using gauss() lasted %lfs.\n", (t_end - t_start));
    free(array);
    return 0;
}
I compiled and ran it as:
$ gcc -O3 -fopenmp -no-pie -foffload=default -fcf-protection=none -fno-stack-protector -foffload-options="-O3 -fcf-protection=none -fno-stack-protector -lm -latomic -lgomp" gauss_test.c -o gnu_gauss_test -lm -lgomp
$ OMP_TARGET_OFFLOAD=DISABLED ./gnu_gauss_test
Running offloaded calculation from main()...
The result of the inline test is 3.141593.
Inline calculation lasted 0.259518s.
Running offloaded calculation from gauss()...
The result of the test using gauss() is 3.141593.
The calculation using gauss() lasted 0.195138s.
$ OMP_TARGET_OFFLOAD=MANDATORY ./gnu_gauss_test
Running offloaded calculation from main()...
The result of the inline test is 3.141593.
Inline calculation lasted 2.127423s.
Running offloaded calculation from gauss()...
The result of the test using gauss() is 3.141593.
The calculation using gauss() lasted 0.161285s.
i.e., I am now getting the correct accumulated value from both the inline calculation and the gauss() function call. (NOTE: the large execution time of the inline calculation is an artifact of GPU initialization, which, with my current implementation, appears to happen when the first omp target directive is encountered.)
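If one wants the first measurement to exclude that initialization cost, a possible approach (a minimal sketch, not something I benchmarked here) is to touch the device once with an empty target region before starting the clock:

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("Visible devices: %d\n", omp_get_num_devices());

    // Warm-up: an empty target region forces the one-off runtime
    // and device initialization outside of any timed section.
    double t0 = omp_get_wtime();
    #pragma omp target
    {
        // intentionally empty
    }
    double t1 = omp_get_wtime();
    printf("First target region (incl. init) took %lfs.\n", t1 - t0);

    // Any target region after this point no longer pays the
    // initialization cost, so its timing reflects the work itself.
    return 0;
}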