Topic: performance

Topic by
tzimmerman

2011-12-15
19:55

I read Kasper's blog about performance benchmarks here (http://kasper.eobjects.org/2009/06/performance-benchmark-datacleaner.html) and was wondering what size files people are running through DataCleaner. What kind of environment are you running under (CPU, RAM etc.) and what kind of performance are you seeing?

I am currently trying to profile some very large files (~75MM records of 10 columns) and am having trouble getting through them. I have created some smaller test files of ~3MM records. Tweaking the config file settings for threads and in-memory record count, along with increasing the JVM heap size, has only gotten me to around 8 minutes to process the 3MM records (3MM rows x 10 cols).

The performance I have seen has not been linear and seems to get increasingly slower, so I am wondering if I am just expecting too much? Or maybe I have missed some configuration?

Any suggestions would be appreciated.

Reply by
kasper

2011-12-15
07:26
Hi Tim Zimmerman,

There are different performance characteristics depending on which components you add to your job. In particular, which analyzers you use can have a big impact.

Expect very-near-linear time on analyzers such as String analyzer, Number analyzer, Date/time analyzer and Boolean analyzer.

Depending on the data, something like the Pattern finder can either be near-linear in good cases (if most values have the same pattern) or be quite slow in bad cases (if there are lots and lots of patterns).

Most transformers and filters are linear-time, since they are completely stateless. A few exceptions exist though, like the Table lookup, which has a cache, so its performance depends on the amount of repetition in the data.

Which analyzers are you using?

Usually I am running with DC's default configuration (30 threads) on my 8 GB RAM / Intel Core i5 machine.

PS: I assume by "MM" you mean millions? Usually I would only write "M" though, so I hope I am not completely oblivious to what you're talking about :P

Reply by
tzimmerman

2011-12-15
20:42
I was just starting to explore the performance with different analyzers. Currently I have only a ValueDistribution analyzer for each of the 10 columns. Eventually there would be a number of different analyzers depending on the datatype as well as the "domain" of the particular piece of data (e.g. postal code, email address etc.).

My current test machine is a VM with 4GB RAM and Intel Xeon dual 2.93GHz processor. I assume I might expect faster processing if I can get more CPU and RAM and increase the number of threads?

By the way, I do mean millions by MM, sorry for any confusion, I am used to seeing it in financial context here where mm is commonly used.


P.S.
Thank you for your efforts in DataCleaner and also for such a very quick response.

Reply by
kasper

2011-12-15
08:56
Aha, yes, the value distribution is also a pretty tricky one. Indeed it is not linear time! But there are tricks you can apply...

Basically the reason it is not linear is that we need to save the temporary results on disk. This is because we want DC out-of-the-box to not crash due to out-of-memory problems.

BUT, if you have a lot of memory you can change the strategy! The trick is to open conf.xml (in the DC dir) and to replace this element:
<storage-provider>
<combined>
...
</combined>
</storage-provider>
With this:
<storage-provider>
<in-memory max-rows-threshold="1000"/>
</storage-provider>
Let me explain: The value distribution needs somewhere to hold its state, like "I've so far seen value X 5 times and value Y 7 times". But if a dataset contains e.g. ONLY UNIQUE values and you have millions of records, then keeping that state in memory would quickly run out of memory! So as a fail-safe precaution, we save that data in an embedded database as we go. But of course a database like that will get slower and slower as it grows, thus the non-linearity.
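
To make the "state" idea concrete, here is a minimal sketch of in-memory value counting in Java. It is only an illustration: the class name ValueCountSketch and its methods are hypothetical and are not DataCleaner's actual classes. The point it shows is that the memory used grows with the number of distinct values, not with the number of rows.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of in-memory value counting.
// This is NOT DataCleaner's implementation, just the general idea.
class ValueCountSketch {
    // One map entry per distinct value, so memory grows with the
    // number of distinct values rather than the number of rows.
    private final Map<String, Long> counts = new HashMap<>();

    // Record one occurrence of a value.
    void observe(String value) {
        counts.merge(value, 1L, Long::sum);
    }

    // How many times a value has been seen so far (0 if never).
    long countOf(String value) {
        return counts.getOrDefault(value, 0L);
    }

    // Number of distinct values currently held in memory.
    int distinctValues() {
        return counts.size();
    }

    public static void main(String[] args) {
        ValueCountSketch sketch = new ValueCountSketch();
        String[] column = {"X", "Y", "X", "Y", "Y", "X", "X", "Y", "Y", "X", "Y", "Y"};
        for (String value : column) {
            sketch.observe(value);
        }
        System.out.println("X seen " + sketch.countOf("X") + " times");
        System.out.println("Y seen " + sketch.countOf("Y") + " times");
        System.out.println("distinct values: " + sketch.distinctValues());
    }
}
```

With 12 rows but only 2 distinct values, the map holds just 2 entries. If every row were unique, the map would hold one entry per row, which is the case the disk-backed default guards against.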

The trick I described uses an in-memory approach instead of the database. This is vulnerable to out-of-memory issues if you're short on memory, but if you do expect to see repeated values (even, say, tens of thousands of distinct values, but not millions of distinct values) then it will work fine! Additionally you can tune the JVM to assign more memory, so if you have a big machine, almost everything is doable in that fashion.
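
As a rough way to check how much heap the JVM was actually given, a small sketch like the following can help. It assumes the heap is raised with the standard JVM option -Xmx (for example -Xmx2g); how that option is passed to DataCleaner's own launcher is not covered in this thread. The class name HeapCheck is hypothetical.

```java
// Hypothetical helper that reports the maximum heap the current JVM may use.
// Run it with the same JVM options as the application (e.g. -Xmx2g)
// to confirm that the heap setting took effect.
class HeapCheck {
    // Maximum heap size in megabytes, as reported by the JVM.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

If the reported value is much smaller than expected, the heap option was probably not applied to the JVM that runs the job.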

Regarding threads: Don't increase it, since I think with your usage 30 is quite enough. Having too many threads can also harm performance since they might just all be waiting (and thereby wasting performance because of the thread management overhead) for access to the same resource (the state of the analyzer).
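
The contention point can be sketched like this, assuming a single shared counting map as the "state of the analyzer". The class SharedStateSketch and its method are hypothetical and not taken from DataCleaner. All worker threads update the same map, so updates to the same key have to take turns, and adding more threads mostly adds waiting rather than throughput.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of many threads sharing one piece of state.
class SharedStateSketch {
    // Counts values using the given number of threads, all writing
    // to the same shared map. Values are assumed to be non-null.
    static Map<String, Long> countWithThreads(String[] values, int threads) {
        Map<String, Long> shared = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int offset = t;
            pool.submit(() -> {
                // Each thread handles every threads-th row, but all of
                // them update the same map, so writes to the same key
                // are serialized.
                for (int i = offset; i < values.length; i += threads) {
                    shared.merge(values[i], 1L, Long::sum);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("counting timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while counting", e);
        }
        return shared;
    }

    public static void main(String[] args) {
        String[] values = new String[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = (i % 4 == 0) ? "X" : "Y";
        }
        System.out.println(countWithThreads(values, 8));
    }
}
```

The result is the same whatever the thread count; only the amount of time spent waiting on the shared map changes.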

PS: Thanks for the kind words :)

Reply by
tzimmerman

2011-12-15
21:39
Arrgh!!! I am kicking myself. I thought I was using the in-memory storage but just realized I was looking at the row annotation in-memory storage element of the default combined storage!!

Thank you ... this seems to have made a big difference.
