Topic: Error caused by too many table attributes ?
Error caused by too many table attributes ?
Just stumbled across DataCleaner by accident and I am very impressed: good ideas and approach.
But for some scenarios - most of them in combination with very large tables (either in number of records or in number of selected/to be analyzed attributes) - I get stuck with "java.lang.OutOfMemoryError: Java heap space".
Played around with the little possibilities within the conf.xml, but without any success.
Is there any recommended setup for ensuring that memory will be used properly ?!
Thanx in advance,
Christian
But for some scenarios - most of them in combination with very large tables (either in number of records or in number of selected/to be analyzed attributes) - I get stuck with "java.lang.OutOfMemoryError: Java heap space".
Played around with the little possibilities within the conf.xml, but without any success.
Is there any recommended setup for ensuring that memory will be used properly ?!
Thanx in advance,
Christian
Hi Christian,
Can you give us a bit more info about which features of DataCleaner you're using. For example: Just how many columns are you profiling (tens, hundreds, thousands??) and which analyzers are you applying to them?
The main thing you can tune in terms of memory is in the conf.xml file. If you find the place where it says...
Can you give us a bit more info about which features of DataCleaner you're using. For example: Just how many columns are you profiling (tens, hundreds, thousands??) and which analyzers are you applying to them?
The main thing you can tune in terms of memory is in the conf.xml file. If you find the place where it says...
<in-memory max-rows-threshold="1000" />... you can lower this value (1000) to eg. 100 and it should save you quite a lot of memory. It will reduce the amount of records you are able to "drill to detail" to, but in some scenarios that is acceptable?
Hi Kasper,
thanks for the reply. But before I was posting this problem, I did a lot of test-runs with various settings.
So far it seems that the memory heap problem happens with most of the functions.
But here are the facts:
- setup an Oracle connection
- select a table with 6,8 Mio records
- select all 114 attributes
- choose "value distribution"
- start "analyze"
- get the error message:
And after that I have to close down and start up DataCleaner completely, as any other action after the error would create follow up errors.
I know that this requires some reasonable memory, but I am looking for a special limit where it still enables me to run this analysis (even if it takes longer).
Had a similar problem with Pentaho Data Integrator in the past and could fix it then by resizing the sort area memory consumption.
But so far it seems, that I have to split up the analysis in several steps. But it is quite painful, as I do not know in advance, with how many attributes being selected the memory consumption does not exceed.
Best regards,
Christian
thanks for the reply. But before I was posting this problem, I did a lot of test-runs with various settings.
So far it seems that the memory heap problem happens with most of the functions.
But here are the facts:
- setup an Oracle connection
- select a table with 6,8 Mio records
- select all 114 attributes
- choose "value distribution"
- start "analyze"
- get the error message:
ERROR: An error occurred in the analysis job!
java.lang.OutOfMemoryError: Java heap space
at java.lang.reflect.Array.newArray(Native Method)
at java.lang.reflect.Array.newInstance(Unknown Source)
at oracle.jdbc.driver.BufferCache.get(BufferCache.java:226)
at oracle.jdbc.driver.PhysicalConnection.getCharBuffer(PhysicalConnection.java:7600)
at oracle.jdbc.driver.OracleStatement.prepareAccessors(OracleStatement.java:991)
at oracle.jdbc.driver.T4CTTIdcb.receiveCommon(T4CTTIdcb.java:273)
at oracle.jdbc.driver.T4CTTIdcb.receive(T4CTTIdcb.java:144)
at oracle.jdbc.driver.T4C8Oall.readDCB(T4C8Oall.java:806)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:355)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:191)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:523)
at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:193)
at oracle.jdbc.driver.T4CStatement.executeForDescribe(T4CStatement.java:852)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1153)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1275)
at oracle.jdbc.driver.OracleStatement.executeQuery(OracleStatement.java:1477)
at oracle.jdbc.driver.OracleStatementWrapper.executeQuery(OracleStatementWrapper.java:392)
at org.eobjects.metamodel.JdbcDataContext.executeQuery(JdbcDataContext.java:534)
at org.eobjects.analyzer.job.runner.RowProcessingPublisher.run(RowProcessingPublisher.java:204)
at org.eobjects.analyzer.job.tasks.RunRowProcessingPublisherTask.execute(RunRowProcessingPublisherTask.java:40)
at org.eobjects.analyzer.job.concurrent.TaskRunnable.run(TaskRunnable.java:63)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
And after that I have to close down and start up DataCleaner completely, as any other action after the error would create follow up errors.
I know that this requires some reasonable memory, but I am looking for a special limit where it still enables me to run this analysis (even if it takes longer).
Had a similar problem with Pentaho Data Integrator in the past and could fix it then by resizing the sort area memory consumption.
But so far it seems, that I have to split up the analysis in several steps. But it is quite painful, as I do not know in advance, with how many attributes being selected the memory consumption does not exceed.
Best regards,
Christian
Hi Christian,
Well there is always the possibility to assign more memory for the process. Here's an example .cmd file contents with 2 gb (instead of the normal 1gb) assigned:
Well there is always the possibility to assign more memory for the process. Here's an example .cmd file contents with 2 gb (instead of the normal 1gb) assigned:
set DATACLEANER_HOME=%~dp0But I would also say that Value Distribution is among the most memory consuming analyzers there is and when you're applying it to 100+ columns then you're maybe stretching the application a bit too far in terms of what it can currently do. There are some ideas that we are experimenting with to possibly make this analyzer faster and less memory consuming but it's not a trivial task.
cd /d %DATACLEANER_HOME%
echo Using DATACLEANER_HOME: %DATACLEANER_HOME%
call java -Xmx2048m -jar DataCleaner.jar
Hi Kasper,
thanks for your reply. I understand that there are some limitations and it is not trivial to handle these.
So I will do some more tests to try to find out which would be a reasonable setting to have a stable run of DataCleaner.
If you are currently discussing some ideas reg.performance and memory consumption: maybe you should have a look on Pentaho Data Integrator.
I use it pretty much for static and repetitive data analysis/profiling and had there at the beginning also to fight with memory overflows.
The nice thing with Pentaho is, that I really could limit the usage of memory. The only effect is, that the total analysis time will take longer, but there is no more abort due to memory overflow.
I believe they can deal with the memory limitiations, because the results of each step can be written in output files (so I do not fill up memory with temporary data storage of profiling results) and further analysis will be done on this step-results.
So still both thumbs up for DataCleaner. I like the way to approach data profiling. Because this is really outstanding. But please keep in mind that DWH tables nowadays are really huge and from my experience it is more accepted to wait some time for the analysis results (eventually add some warning window reg.time consumption when starting up) then to face DataCleaner abort due to memory overflow. So there should be a way to make DataCleaner really stable.
However, I will continue with my changing parameters and testing and keep you up-to-date if you like.
Best regards,
Christian
thanks for your reply. I understand that there are some limitations and it is not trivial to handle these.
So I will do some more tests to try to find out which would be a reasonable setting to have a stable run of DataCleaner.
If you are currently discussing some ideas reg.performance and memory consumption: maybe you should have a look on Pentaho Data Integrator.
I use it pretty much for static and repetitive data analysis/profiling and had there at the beginning also to fight with memory overflows.
The nice thing with Pentaho is, that I really could limit the usage of memory. The only effect is, that the total analysis time will take longer, but there is no more abort due to memory overflow.
I believe they can deal with the memory limitiations, because the results of each step can be written in output files (so I do not fill up memory with temporary data storage of profiling results) and further analysis will be done on this step-results.
So still both thumbs up for DataCleaner. I like the way to approach data profiling. Because this is really outstanding. But please keep in mind that DWH tables nowadays are really huge and from my experience it is more accepted to wait some time for the analysis results (eventually add some warning window reg.time consumption when starting up) then to face DataCleaner abort due to memory overflow. So there should be a way to make DataCleaner really stable.
However, I will continue with my changing parameters and testing and keep you up-to-date if you like.
Best regards,
Christian
Hi Christian,
I know what you mean and the size of data is of course something that we always have in mind. Sometimes you just need to begin at a place and improve from there. I'm glad that you like the product though :)
Our current strategy for staging temporary data is based on a few assumptions about which types of columns the analyzer will be applied to. We had not anticipated that you would apply the value distribution to hundreds of columns at the same time and thus the strategy is not tuned towards your scenario.
We do have another strategy you might try. It's probably going to be somewhat slower but it should be stable because it also uses a disk-based database for the staging/result data. However I don't particularly like the approach from an architecture perspective because it involves a lot of I/O and in the long run I want to have a different approach to the querying for Value Dist. (ie. a more fundamental change, not just a different tuning).
To change the strategy, find the "<storage-provider>" element in conf.xml and put this instead of what's already there:
This will use an embedded H2-database to save the data on disk instead of in memory. Alternatively you can also use HSQLDB, by replacing the "h2" part of the above with "hsqldb".
Good luck :)
I know what you mean and the size of data is of course something that we always have in mind. Sometimes you just need to begin at a place and improve from there. I'm glad that you like the product though :)
Our current strategy for staging temporary data is based on a few assumptions about which types of columns the analyzer will be applied to. We had not anticipated that you would apply the value distribution to hundreds of columns at the same time and thus the strategy is not tuned towards your scenario.
We do have another strategy you might try. It's probably going to be somewhat slower but it should be stable because it also uses a disk-based database for the staging/result data. However I don't particularly like the approach from an architecture perspective because it involves a lot of I/O and in the long run I want to have a different approach to the querying for Value Dist. (ie. a more fundamental change, not just a different tuning).
To change the strategy, find the "<storage-provider>" element in conf.xml and put this instead of what's already there:
<storage-provider>
<h2-database>
<temp-directory>staging</temp-directory>
</h2-database>
</storage-provider>
This will use an embedded H2-database to save the data on disk instead of in memory. Alternatively you can also use HSQLDB, by replacing the "h2" part of the above with "hsqldb".
Good luck :)
Dear Kasper,
thanks, this helps a lot in my further research to identify limits and best workarounds (and which I am happy to share then).
Just one special finding with the memory overflow I want to report before I return to testing: I just wondered, why DataCleaner comes up with an error message immediately after firing of the analysis. Do you do some precalculation for an assumed memory consumption?
I have a lot of other memory based analysis applications and in general they're handling such situations with a warning stop during analysis and then I can choose for example between:
- try to free memory (on my own) and then continue
- stop analysis but keep the (incomplete) results which could be prepared so far
- stop the analysis completely
Best regards,
Christian
thanks, this helps a lot in my further research to identify limits and best workarounds (and which I am happy to share then).
Just one special finding with the memory overflow I want to report before I return to testing: I just wondered, why DataCleaner comes up with an error message immediately after firing of the analysis. Do you do some precalculation for an assumed memory consumption?
I have a lot of other memory based analysis applications and in general they're handling such situations with a warning stop during analysis and then I can choose for example between:
- try to free memory (on my own) and then continue
- stop analysis but keep the (incomplete) results which could be prepared so far
- stop the analysis completely
Best regards,
Christian
Does it really happen immediately? That sounds strange ... No we don't do any precalculation like that (not possible without knowing what's in the data).
Yes, definitely.
I will do additional testing in more detail so I can provide you with facts - hopefully quite soon, but it will take some time (there is also some business I have to do).
I will do additional testing in more detail so I can provide you with facts - hopefully quite soon, but it will take some time (there is also some business I have to do).
To readers of this post:
After a closer talk with Christian we managed to identify what was causing the out of memory issue. This has been fixed in the new version 2.1.1 of DataCleaner, see the news page for more details!
After a closer talk with Christian we managed to identify what was causing the out of memory issue. This has been fixed in the new version 2.1.1 of DataCleaner, see the news page for more details!
Some final words on this topic: did already some heavy analysis tests, it works and the fix even seems to speed up the performance.
Hi Kasper,
can you please explain a bit more how you fixed the issue?
We currently have the same problem (at least the same stacktrace more or less).
We also request lots of columns, and this post is the only one which comes near our problem at all.
Thank you very much
can you please explain a bit more how you fixed the issue?
We currently have the same problem (at least the same stacktrace more or less).
We also request lots of columns, and this post is the only one which comes near our problem at all.
Thank you very much
Hi abuehler,
The issue was solved as a bugfix. We figured out that we had set a constant FETCH_SIZE regardless of the amount of columns. That means that given a large amount of columns, multiplied by the constant FETCH_SIZE, we would actually fill up memory even before beginning processing!
If you are seeing an out of memory issue with the latest version, I doubt that it is the same issue. Can you share your stacktrace maybe?
The issue was solved as a bugfix. We figured out that we had set a constant FETCH_SIZE regardless of the amount of columns. That means that given a large amount of columns, multiplied by the constant FETCH_SIZE, we would actually fill up memory even before beginning processing!
If you are seeing an out of memory issue with the latest version, I doubt that it is the same issue. Can you share your stacktrace maybe?
Hi Kasper,
thank you very much for the reply.
The fetch_size "keyword" did the job for me now. We have a large prefetch size by default, so I will adjust that for oracle.
The problem occurred in a different product, but your thread was the only one on the web on which I found at least a clue for the solution, so I thought I give it a try and ask you directly :-)
Thanx for your help!
thank you very much for the reply.
The fetch_size "keyword" did the job for me now. We have a large prefetch size by default, so I will adjust that for oracle.
The problem occurred in a different product, but your thread was the only one on the web on which I found at least a clue for the solution, so I thought I give it a try and ask you directly :-)
Thanx for your help!
Hehe how nice! Glad that our experiences can help others! What is the product you're building? Maybe you would be interested to know that we use MetaModel as a library to query the databases and it actually has a quite nice way of figuring out the appropriate FETCH_SIZE now. It is open source so you can just use it if you want.
At the time we implemented the fetch size thing, I wrote this blog entry on the subject:
Speed up your JDBC queries with MetaModel.
Out of curiosity - what is the product you're developing?
At the time we implemented the fetch size thing, I wrote this blog entry on the subject:
Speed up your JDBC queries with MetaModel.
Out of curiosity - what is the product you're developing?
We build a Product Information Management system called Heiler Product Manager which handles a lot (millions) of items. HPM can be deployed on Oracle or on MSSQL Server...and actually, on MSSQL I never had any of those problems lol :-)
I'll have a look at MetaModel, thank you for the hint!
I'll have a look at MetaModel, thank you for the hint!
Okidoke! Good luck with it!
Log in by clicking the login link at the top of the screen
Go back to forum.


