Corrupted memory when reading int32 connectivity into a int64 dataspace (investigated parallel reads), HDF5

Description

Problem: Reading a CGNS file (which was created for a default cgsize_t = int) with a CGNS program which has default cgsize_t = long int)

The issue is CGNS is passing a different memory type to HDF5 than what the memory buffer is. I’ll try to explain it for the case of an int32 dataset.

(1) You pass CGNS a buffer that is int64 (cgsize_t); this is what the dataset gets read into memory as.

(2) However, CGNS gets the memory data type of the dataset from the CGNS file, which in this case is int32, i.e.,

(2a) CGNS sets the dataset memory space in H5Dread to be int32.
(2b) But, CGNS lied, in step (2a), to HDF5 by telling HDF5 that it was reading into an int32 space, but it’s actually reading it into an int64 memory space

(3) The result, you get junk in the memory from HDF5.

Now, if CGNS is honest with HDF5 and instead reports the memory space as int 64,

if (rw_mode == CG_PAR_READ) {
type_id = H5T_NATIVE_INT64;
herr = H5Dread(data_id, type_id, mem_shape_id,
data_shape_id, plist_id, data[0].u.rbuf);

then HDF will convert the int32 dataset into an int64 dataset in memory, and everything will be correct.

So, the issue is CGNS assumes that the user is reading into memory a datatype buffer that is the same as the dataset type in the file. What CGNS should be doing instead is setting the memory space to match the memory buffer and not the file space buffer type. Then HDF will automatically handle the conversion to correct memory space datatype.

CGNS needs to do in readwrite_data_parallel something like (untested):

switch (type) {
case CGNS_ENUMV(Character):
type_id = H5T_NATIVE_CHAR;
break;
case CGNS_ENUMV(Integer):
if(sizeof(cgsize_t) == 8)
type_id = H5T_NATIVE_INT64;
else
type_id = H5T_NATIVE_INT32;
break;

Environment

None

Activity

Show:
cgns
March 2, 2019, 3:48 PM

I've only looked into pcgnslib.c. cgsize_t is more than just setting dimensions, it also determines the dataset datatype. Take for example,

int cgp_elements_write_data(int fn, int B, int Z, int S, cgsize_t start,
cgsize_t end, const cgsize_t *elements)

The connectivity array is of type cgsize_t. What this actually means in terms of HDF5 is that the memory space (and the file space for the write case) of the connectivity array is of type cgsize_t.

I don't think there is anything wrong, per se, with cgns_internels APIs. But it seems to me that CGNS confuses the memory space and file space when it comes to HDF5. It appears to me that CGNS internals always assumes that the memory space datatype is always the same as the file space datatype. This will not be the case with cgsize_t variables because this value can be int or long int depending on how the CGNS library was built.

stephen.guzik@colostate.edu
March 6, 2019, 6:26 PM

I would have preferred that users could specify the data type for element connectivity, independent of cgsize_t, but I guess we're stuck with this now. Otherwise, the proposed fix sounds good.

Scot Breitenfeld
October 4, 2019, 4:06 PM
Edited

After looking into it further, it boils down to this:


Problem:


The cgio_read APIs get the "void *data” data type from the CGNS file (i.e. the data type when the data was written to the file), this is for HDF5/CGNS. However, the HDF5 read routines expect the memory type of the variable that the data is being read into, not the data type in the file (i.e. it expects the data type of "void *data”). HDF5 automatically handles the conversion from the file data type to the memory data type.
We can’t use cgio_get_data_type to get this information in the cgio_read routines from the “cgio_num" and “id" because that returns the data type from the file.

Possible solutions:
(1) Overload the current APIs to add the memory type of the void *data. We do this for some of the mid-level CGNS APIs, so there are a mechanism and precedent for doing this. It also preserves backward compatibility, but it preserves something fundamentally flawed.
(2) Create a new API with the additional memory option and deprecate the current API in the next release. This would break backward compatibility.
I would lean for (2), but I would like some feedback before doing so. What is the historical precedent for this type of scenario?



Scot Breitenfeld
October 8, 2019, 10:36 PM

Looking more into the code, I do see that there was added cgio_<read/write>_data_type APIs which pass the memory type, they are currently only used for the general field read/writes from the MLL. There are also read_all/write_all _type APIs. They are not documented, I added a Jira issue to add them to the docs.

The cgio write APIs also confuse the memory and file space in HDF5, which the write _type APIs correct. I can just use those.



Scot Breitenfeld
October 9, 2019, 3:25 PM

If the non- _type read routines are removed then the following changes should be made to the ADFH routines, they are noted in the source. There are 3 instances.

 

Assignee

Scot Breitenfeld

Reporter

cgns

Components

Fix versions

Affects versions

Priority

Blocker
Configure