Skip to main content

XPages EL/class-loader memory leak (now with solution)

 We have recently experienced OutOfMemory crashes of XPages app server. The server was recently upgraded to 12.0.1FP1, but we were getting some panic crashes in HTTP even before the upgrade (it was 9.0.1FP10). Our hopes were that the upgrade would stabilize the server, but it's not the case. At least now I start to see what's the problem. 

update 8.12.2022

There were actually 3 different leaks. I have rewritten the article to be a bit more clear. I also re-run some of the tests on 9.0.1FP10, so I assume the problems are also in earlier versions.

Problem 1

The server is hosting over 1000 NSF sharing the same design + some other custom apps. Not all NSFs are used via web as the app still has classic Notes UI in parallel, so it's a bit tricky to estimate the load. By using tell http xsp show modules I usually see around 350 NSFs active. We kept the default application timeout that should provide reasonable application recycling if it's not used continuously. 

We started to see OutOfMemory crash every 10-12 days and the memory was growing continuously, so I started digging. I used Eclipse MAT for the analysis as the heap dumps are still IBM-specific even when the JVM is not IBM JVM anymore. Opening of production heap dumps from the server that's up for a few days requires a few hours, so it's quite hard to do. But it showed me 2 things:

Leak detector pointed to com.sun.faces.el.impl.BeanInfoManager

The hashmap size seems to be continuously growing. There were references to classes from the apps in counts that should not be possible if the XPages apps were correctly unloaded. Further more checking the histogram view, it showed 5000 module classloaders. 
But there were only 318 instances of NSFComponentModule, so it seems that only 318 apps were active. 

I tried search for answers, but I've only found this post from 2008 about Websphere - https://www.ibm.com/support/pages/apar/PK69142  * PROBLEM DESCRIPTION: BeanInfoManager instances in the JSF's BeanInfoManager static hashmap are not removed when an application is stopped which leads to Classloaders not being garbage collected.

Seems similar to what I'm experiencing.

Isolated test 1

To confirm my theory, I have created 1000 nsfs with used JMeter to open all of them to see what happens - it crashed my dev server at roughly 235 apps as I configured just 1GB heap. But I at least knew what is the limit (the prod has 6GB heap, so it should be probably able to handle all of them).

By configuring xsp.application.timeout=1 at server level I tried to enforce quick cleanup. Using XPages Debug Toolbar I was able to see the heap and force garbage collection. 

By repeating the test I saw that the heap keep increasing by roughly 75MB during each batch. So I was getting to roughly 375KB increase per app life-cycle. By using multiplying it by 5000 ModuleClassloaders in the heapdump after cca 5 days I'm getting to 1.8GB of wasted heap space x 3 we get close to the full allowed heap of 6GB (but the server needs the heap for normal operation, so it would crash before 15 days, which is exactly what we are seeing).

So next step is to see if it's the app, our code, or the runtime...

Isolated test 2

I created a very simple app - 1 XPage, 1 app scope bean and one enum. Enums (or anything static) are more tricky to deal with and I originally thought that the problem is from them as the histogram in MAT is showing their counts.

This time I could run with a full set of 1000 NSFs and the increase was pretty small - roughly 4MB per batch of 1000, but it was still there. Here is a heap dump overview after the first and second batch.

The number of class loaders increased by 2k as there are 2 class loaders used for the app.
(the extra 1 is the app on the side that use for XPages Debug toolbar heap monitoring)

Verification test 3

Now I needed to confirm if it was the el bean management causing the problems. I have removed the reference from the XPage to the app scope bean and re-run the test. 

After 2 batches (2000 apps loaded and unloaded) I see nothing extra in the heap dump.
Just 230 class loaders and nothing from my app in the heap.
The XPage itself is also fully recycled (would be xsp.TestPage in my case).

Explanation

I had to look into the decompiled code to see what was going on. It turns out that there is a static hash-map containing metadata about classes used as beans. And there is no code that would ever clean this. Putting aside that new entry creation is using static synchronized method, which is something you want to avoid on an app server (but it's called just once for each class used as a bean in running NSF).

Solution for the BeanInfoManager problem

Luckily, the BeanInfoManager is reachable even from XPages app itself (assuming you have correct java policy permissions to do reflection). The idea is simple - remove the entries manually when the app is no longer needed. We can add custom ApplicationListeners - thanks Sven Hasselbach for this - http://hasselba.ch/blog/?p=740 .

With a simple class registered as ApplicationListener, I was able to remove the map entries


package mprtest;

import java.lang.reflect.Field;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import com.ibm.xsp.application.ApplicationEx;
import com.ibm.xsp.application.events.ApplicationListener2;
import com.sun.faces.el.impl.BeanInfoManager;

public class CleaningApplicationListener implements ApplicationListener2 {
	 public void applicationCreated(ApplicationEx arg0) {
	     }

	    public void applicationDestroyed(ApplicationEx arg0) {
	        ClassLoader cl=CleaningApplicationListener.class.getClassLoader();
			try {
				Field f = BeanInfoManager.class.getDeclaredField("mBeanInfoManagerByClass");
				f.setAccessible(true);
				Map m= (Map) f.get(null);
				synchronized(BeanInfoManager.class) {
					Iterator> it=m.entrySet().iterator();
					while(it.hasNext()){
						Entry me = it.next();
						if (me!=null) {
							Class beanInfoClass = me.getKey();
							if (beanInfoClass!=null) {
								ClassLoader cl2 = beanInfoClass.getClassLoader();
								if (cl2!=null && cl2.equals(cl)) {
									it.remove();
									System.out.println("removed " + me.getKey().getName());
								}
							}
						}
					}
				}
			} catch (Exception e) {
				e.printStackTrace();
			}
			
			
	    }

	    public void applicationRefreshed(ApplicationEx arg0) {
	    }
}
 

(there are probably too many null checks, but I was getting some strange NPE, so I have added some extra ones).

You need to register the class in META-INF/services/com.ibm.xsp.core.events.ApplicationListener file as noted on Sven's blog. 

Once it's active, it will report cleared entries, e.g.
[4674:0002-74F0] 07.12.2022 21:27:54   HTTP JVM: removed app.controller.empty
[4674:0002-74F0] 07.12.2022 21:27:54   HTTP JVM: removed app.Settings

Great, this has solved the problem for my simple app. Heap is good as new even after creating and destroying of 1000 apps. Unfortunately, it's not working for the production app....

Problem 2 - SSJS

I use Jesse's XPages Scaffolding in the app and other frameworks, so I've expected some problems coming from these additional classes. But it's not the case. 

I narrowed down my next problem to a simple XPage:

<?xml version="1.0" encoding="UTF-8"?>
<xp:view xmlns:xp="http://www.ibm.com/xsp/core">
<xp:this.beforePageLoad loaded="false"><![CDATA[#{javascript:pageController.beforePageLoad()}]]></xp:this.beforePageLoad>

<xp:text value="#{pageController.test}"></xp:text>
</xp:view>


Everything works fine when I remove the beforePageLoad line, so the bean and class is unregistered correctly. But I still see the ModuleClassLoader in the heap. If you look carefully, the beforePageLoad is not even loaded. Really weird, but the heapdump gives me another hint.


Solution to problem 2

The solution is similar to the first problem - just find the map using reflection and remove all the entries associated with the current class loader, so we can just add it to the existing applicationDestroyed method.

			//second part, remove the static JSContext types
			try {
				JSContext jscontext = JSContext.getStaticContext();
				Field f = JSContext.class.getDeclaredField("registry");
				f.setAccessible(true);
				Registry r = (Registry) f.get(jscontext);
				
				Field fRegistry = Registry.class.getDeclaredField("wrapperMap");
				fRegistry.setAccessible(true);
				IdentityHashMap<Class<?>, Object> map = (IdentityHashMap<Class<?>, Object>) fRegistry.get(r);
				synchronized(r) {
					Iterator<Entry<Class<?>, Object>> it=map.entrySet().iterator();
					while(it.hasNext()){
						Entry<Class<?>, Object> me = it.next();
						if (me!=null) {
							Class regClass = me.getKey();
							if (regClass!=null) {
								ClassLoader cl2 = regClass.getClassLoader();
								if (cl2!=null && cl2.equals(cl)) {
									System.out.println("removing JS type" + me.getKey().getName());
									it.remove();
								}
							}
						}
					}
				}
			} catch (Exception e) {
				e.printStackTrace();
			}
 

This worked well on the test app and even the production app when running the requests automatically using JMeter. But it wasn't working when doing manual tests. The class-loader was still present in the heap dump.

Luckily I got an idea (influenced by previous class loader garbage collection Google searches) - the app is configured with "Keep only the current page in memory" and the JMeter test does not do any navigation, just fresh requests, so nothing gets serialized. My test app had "Keep pages on disk", but there was just an app-scope bean, so nothing created from the module class loader was serialized. By adjusting these options I was again able to reproduce the problem consistently.

Problem 3 - Serialization

After re-running my test with 1000 dbs, Eclipse MAT started to show "java.io.ObjectStreamClass$Caches" as one of the suspects for a leak, so I was on the right track. 

Solution to problem 3

It turned out that the problem is in JDK and it has been fixed last year. But the fixes are for OpenJDK v11 and newer, so it's not applicable to Domino - see https://bugs.openjdk.org/browse/JDK-8277072.
But one of the reports was with a workaround https://bugs.openjdk.org/browse/JDK-8199589 - 

Workaround is again using reflection and doing similar clean up like the previous solutions, so it's just another block to the applicationDestroy() method. 

Final version

Here is the final version of the file, including all 3 cleaning routines. To run it, you need to allow several permissions (or run your Java with AllPermissions).


package mprtest;

import java.lang.ref.Reference;
import java.lang.reflect.Field;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;

import com.ibm.jscript.JSContext;
import com.ibm.jscript.types.Registry;
import com.ibm.xsp.application.ApplicationEx;
import com.ibm.xsp.application.events.ApplicationListener2;
import com.sun.faces.el.impl.BeanInfoManager;

public class CleaningApplicationListener implements ApplicationListener2 {
	 public void applicationCreated(ApplicationEx arg0) {
	     }

	    public void applicationDestroyed(ApplicationEx arg0) {
	        ClassLoader cl=CleaningApplicationListener.class.getClassLoader();
	        try {
				Field f = BeanInfoManager.class.getDeclaredField("mBeanInfoManagerByClass");
				f.setAccessible(true);
				Map<Class,BeanInfoManager> m= (Map<Class, BeanInfoManager>) f.get(null);
				synchronized(BeanInfoManager.class) {
					Iterator<Entry<Class, BeanInfoManager>> it=m.entrySet().iterator();
					while(it.hasNext()){
						Entry<Class, BeanInfoManager> me = it.next();
						if (me!=null) {
							Class beanInfoClass = me.getKey();
							if (beanInfoClass!=null) {
								ClassLoader cl2 = beanInfoClass.getClassLoader();
								if (cl2!=null && cl2.equals(cl)) {
									System.out.println("removed " + me.getKey().getName());
									it.remove();
								}
							}
						}
					}
				}
			} catch (Exception e) {
				e.printStackTrace();
			}
			
			//second part, remove the static JSContext types
			try {
				JSContext jscontext = JSContext.getStaticContext();
				Field f = JSContext.class.getDeclaredField("registry");
				f.setAccessible(true);
				Registry r = (Registry) f.get(jscontext);
				
				Field fRegistry = Registry.class.getDeclaredField("wrapperMap");
				fRegistry.setAccessible(true);
				IdentityHashMap<Class<?>, Object> map = (IdentityHashMap<Class<?>, Object>) fRegistry.get(r);
				synchronized(r) {
					Iterator<Entry<Class<?>, Object>> it=map.entrySet().iterator();
					while(it.hasNext()){
						Entry<Class<?>, Object> me = it.next();
						if (me!=null) {
							Class regClass = me.getKey();
							if (regClass!=null) {
								ClassLoader cl2 = regClass.getClassLoader();
								if (cl2!=null && cl2.equals(cl)) {
									System.out.println("removing JS type" + me.getKey().getName());
									it.remove();
								}
							}
						}
					}
				}
			} catch (Exception e) {
				e.printStackTrace();
			}
			
			//third part - serialization cache https://bugs.openjdk.org/browse/JDK-8199589
			  try {
		            Class<?> clazz = Class.forName("java.io.ObjectStreamClass$Caches");
		            clearCache(clazz, "localDescs", cl);
		            clearCache(clazz, "reflectors", cl);
		        } catch (Exception e) {
		            // Clean-up failed
		            e.printStackTrace();
		        }
			
			
	    }
	    
	    private void clearCache(Class<?> target, String mapName, ClassLoader mycl)
	            throws ReflectiveOperationException, SecurityException, ClassCastException {
	        Field f = target.getDeclaredField(mapName);
	        f.setAccessible(true);
	        Map<?,?> map = (Map<?,?>) f.get(null);
	        Iterator<?> keys = map.keySet().iterator();
	        while (keys.hasNext()) {
	            Object key = keys.next();
	            if (key instanceof Reference) {
	                Object clazz = ((Reference<?>) key).get();
	                if (clazz instanceof Class) {
	                    ClassLoader cl = ((Class<?>) clazz).getClassLoader();
                        if (cl!=null && cl.equals(mycl)) {
                        	System.out.println("removed from ObjectStreamClass cache: "+ ((Class<?>) clazz).getName());
                            keys.remove();
                        }
	                }
	            }
	        }
	    }

	    public void applicationRefreshed(ApplicationEx arg0) {
	    }
}

 
This worked well even when I tested the app manually. I will need to deploy it to test and production and keep it there for a few days to state that it's the final solution, but so far it looks good. 

Conclusion

There is the "Single copy template" option that is designed to improve class loading/management of deployments where you run many instances of the same app, but it has some specifics too, so I did not want to use it. This way we have full control over the design of each app and potentially can do some adjustments and staged roll-outs if needed.

I have opened a support case with HCL and I wrote this post also to provide better documentation. I'll update the article once have any news. Maybe we can get this fixed in one of the next versions. HCL is promising JVM updates, so it's worth checking.

A combination of JMeter and heap dumps is a good way to stress test your server. The authentication can be tricky, but for dev environments, you can always use the HTTPENABLEDANGEROUSCONNECTORHEADER=1 and just pass some user using $WSRU header. 

So be careful if you host many XPages apps on a server. If you use Java, which you should, you may sooner or later get into similar problems and need similar workarounds.






Comments

Popular posts from this blog

HCL Domino SSO with Microsoft Teams

 Microsoft Teams is probably one of the most used tools this year, it was already quite popular before the pandemic started to spread across the world this spring, but now most of the businesses I work with use it. After using it just like a chat/conferencing tool, many start to explore further capabilities of the platform. When working with Domino data in apps that are web-enabled, it can be quite easy - just add a web tab anywhere you want. The problem is, that you need to deal with user authentication. 

HCL Domino 12.0.2, Engage 2022 and HCL Factory tour Milan

 I haven't published my recap after Engage this year and the recent HCL Factory tour in Milan is a great opportunity to write a summary about what's happening in HCL (mostly Domino) space. It's a mix of news about 12.0.2, future directions, and my impressions, so it can be a bit chaotic, but I got the impression that many people see it similarly.  Engage 2022 Engage 2022 was great (as always). I love the atmosphere in Brudges. I visited it once after Engage a few years ago and I was happy to come back. This was also the first time I had the opportunity to speak at Engage, which obviously made it a bit more stressful, but also more fun. Together with Domino Jams, HCL continued conversations with customers and partners about the future of their products at Engage. Many of these ideas were now discussed in greater detail in Milan, some of them were even demoed.  My main takeaways from Engage were: Nomad (web and mobile) are a great addition to Notes family Restyle is a great g