
Last month we installed the OS-related, firmware, … parts of the QFSDP JANUARY 2018 (12.2.1.1.6 to be more specific) on our test and dev systems at my customer.

GI and DB still need to be patched.

After our last patching experience (http://pfierens.blogspot.be/2017/06/applying-april-2017-qfsdp-12102-exadata.html), things could only get better.

Well, to put a very long story short: be cautious. We have by now run into 4 different bugs, causing instability of the RAC clusters, GI that refused to start up, loss of InfiniBand connectivity, …

So the Monday after the patching we were hit by instability of our Exadata OVM infrastructure for Test, Dev, and Qualification: dom0s rebooting, …

There also seemed to be an issue with the IB interfaces in the domUs; unfortunately we didn't have a crash dump, so Support couldn't really do anything.

The only way to get GI and the DBs up again was to reboot the VM; crsctl stop crs followed by crsctl start crs didn't really work, and the logs showed IB issues.
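
For reference, here is a rough sketch of what we went through on an affected domU (run as root; the Grid home path is my assumption, adjust it to your installation):

  # Attempted clusterware restart on an affected domU (sketch only)
  export GI_HOME=/u01/app/12.1.0.2/grid

  $GI_HOME/bin/crsctl stop crs -f    # hung / failed, logs pointed at IB
  $GI_HOME/bin/crsctl start crs      # did not bring the stack back either
  $GI_HOME/bin/crsctl check crs      # stack still unhealthy

  # In the end only a full reboot of the VM restored GI and the databases
  reboot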

Last time (I forgot to blog about that) we ran into the gnttab_max_frames issue, for which we had set the value to 512. After this patching it was back at 256, so we thought that might have been the reason, especially since this release introduced another parameter in grub.conf:

gnttab_max_maptrack_frames
gnttab_max_frames

The relation between the two was difficult to find, but in the end this turned out not to be the right diagnosis.

If you want some more information about gnttab_max_frames, please read this.
Shortly put: each virtual disk and each networking operation needs a number of grant frames to communicate; if this is not set correctly, you will have issues.
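
For illustration, this is roughly where these parameters live: the Xen hypervisor line in the dom0's grub.conf. The sketch below uses our pre-patch gnttab_max_frames value of 512; the maptrack value is purely illustrative, not a recommendation. A dom0 reboot is needed for changes here to take effect.

  # /boot/grub/grub.conf on a dom0 -- Xen hypervisor line (sketch only)
  # gnttab_max_frames=512 was our pre-patch value; the patch reset it to 256.
  # The gnttab_max_maptrack_frames value below is illustrative.
  kernel /xen.gz dom0_mem=... gnttab_max_frames=512 gnttab_max_maptrack_frames=1024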

Luckily, the Friday of that same week we were in the same situation again, and we decided to let the dom0 crash so that we would have a crash dump.

After uploading that crash dump to Support, they were able to see that the issue was at the Mellanox HCA firmware layer: between APR 2017 and JAN 2018 there were some 4000 changes in that firmware, and one of them, or some combination, caused our issue.


Bottom line: there seems to be an issue with the Mellanox HCA firmware shipped in this patch (which takes it from 2.11.1280 to 2.35.5532). You may encounter it if you have more than 8 VMs under one dom0; we had 9…
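
If you want to check which HCA firmware level you are on, ibstat reports it (the output line below is an example of what you would see on the new level):

  # Show the HCA firmware level on a node
  ibstat | grep -i 'firmware'
  #   Firmware version: 2.35.5532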

So basically we shut down one VM on each node and had stability again.

When it was confirmed in numerous conf calls that 8 was the magic number, we decided to move the Exadata monitoring VM functionality to another VM and shut down the monitoring VM, to be back at 8 VMs.
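
A quick way to count the guests per dom0; this assumes the xm toolstack, which was the default on Exadata OVM at the time (use xl accordingly if yours differs):

  # On each dom0: count running guests, excluding the header line and Domain-0
  xm list | tail -n +2 | grep -v '^Domain-0' | wc -l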

We had a stable situation until last Friday, when we ran into an issue with both IB switches being unresponsive and the second switch not taking over the SM master role. This issue is still under investigation and hopefully not related to the QFSDP JAN 2018…
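
To see which subnet manager is currently master on the fabric, sminfo from infiniband-diags can be run on any IB-connected host (the lid/guid values below are just an example):

  # Query the fabric's active subnet manager
  sminfo
  #   sminfo: sm lid 6 sm guid 0x21283a8d2fa0a0, activity count 1234 priority 5 state 3 SMINFO_MASTER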

If you have similar symptoms, point Support to these bugs:

  Bug 27724899 – Dom0 crashes with ib_umad_close with large no. of VMs 
  Bug 27691811 
  Bug 27267621 


UPDATE:

There also seems to be a bug in IB switch firmware version 2.2.7-1, solved in 2.2.10 (not released yet). Not everything is solved there: only the logging issue is fixed, not the main root cause; apparently there is a separate ER for that.



