I am currently researching a backend to use for a project with demanding data-aggregation requirements. The main project requirements are the following.
- Store millions of records for each user. Users might generate more than 1 million entries per year, so even with 100 users we are talking about 100 million entries per year.
- Data aggregation on those entries must be performed on the fly. Users need to be able to filter the entries by a large number of available criteria and then see summaries (totals, averages, etc.) and graphs of the results. Obviously I cannot precalculate any of the aggregation results because the filter combinations (and thus the result sets) are huge.
- Users are going to have access to their own data only, but it would be nice if anonymous statistics could be calculated across all the data.
- The data is mostly going to arrive in batches, e.g. a user uploading once per day, with a batch of around 3,000 records. In a later version there could be automated programs that upload every few minutes in smaller batches of, say, 100 items.
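To make the "filter, then summarize" requirement concrete, here is a minimal sketch of that kind of on-the-fly aggregation in plain Python. The record fields (`user_id`, `category`, `amount`) are hypothetical, just to illustrate arbitrary equality filters feeding totals and averages:

```python
# Hypothetical records; field names are illustrative, not from the project.
records = [
    {"user_id": 1, "category": "food", "amount": 12.5},
    {"user_id": 1, "category": "travel", "amount": 80.0},
    {"user_id": 2, "category": "food", "amount": 7.5},
]

def aggregate(records, **filters):
    # Apply arbitrary equality filters, then compute count, total, average.
    matched = [r for r in records
               if all(r.get(k) == v for k, v in filters.items())]
    total = sum(r["amount"] for r in matched)
    avg = total / len(matched) if matched else 0.0
    return {"count": len(matched), "total": total, "average": avg}

result = aggregate(records, user_id=1)
```

The point is that any database chosen has to execute this filter-then-aggregate step server-side over millions of rows per user, not client-side as above.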
I made a simple test: I created a table with 1 million rows and performed a simple sum over one column in both MongoDB and MySQL, and the performance difference was huge. I do not remember the exact numbers, but it was something like MySQL = 200 ms, MongoDB = 20 s.
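For reference, a hedged sketch of what that single-column sum looks like in each system. The table/collection and column names (`entries`, `amount`) are assumptions. Note that newer MongoDB versions (2.2+) added an aggregation framework; around the time map-reduce was the main option, which is much slower for queries like this. The pipeline is shown as plain data here, since no running server is assumed; with pymongo it would be passed to `db.entries.aggregate(...)`:

```python
# MySQL version of the benchmark query (column name assumed):
sql_query = "SELECT SUM(amount) AS total FROM entries"

# Equivalent MongoDB aggregation pipeline: group all documents into a
# single bucket (_id: None) and sum the "amount" field.
pipeline = [
    {"$group": {"_id": None, "total": {"$sum": "$amount"}}},
]
```

If the 20-second result came from map-reduce rather than a pipeline like this, that alone could account for much of the gap.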
I have also run the same test with CouchDB and got much worse results.
What seems promising speed-wise is Cassandra, which I was very enthusiastic about when I first discovered it. However, the documentation is scarce and I haven't found any solid examples of how to perform sums and other aggregate functions on the data. Is that possible?
As my test seems to show (maybe I did something wrong), with the current performance it is impossible to use MongoDB for such a project, although its automatic sharding functionality seems like a perfect fit.
Does anybody have experience with data aggregation in MongoDB, or any insights that might help with the implementation of this project?
If you’re looking for a very high performance DBMS and don’t need it to be relational, you might consider Cassandra – although its advantages only come into play if you have a database cluster instead of a single node.
You didn’t say what limits there are on the physical architecture. You did mention sharding which implies a cluster. IIRC MySQL clusters support sharding too.
It’d also be very useful to know what level of concurrency the system is intended to support, and how data would be added (drip-feed or batch).
You say “Obviously I cannot precalculate any of the aggregation results because the filter combinations (and thus the result sets) are huge.”
This is your biggest problem, and will be the most important factor in determining the performance of your system. Sure, you can’t maintain materialized views of every possible combination, but your biggest performance win is going to be maintaining limited pre-aggregated views and building an optimizer that can find the nearest match. It’s not all that hard.
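The "limited pre-aggregated views plus an optimizer that finds the nearest match" idea can be sketched roughly as follows. This is a hypothetical illustration, not a specific product feature: a handful of groupings are materialized, and a query planner picks the most specific view whose dimensions cover the query's filter columns, falling back to raw data only when none does:

```python
# Hypothetical set of materialized (pre-aggregated) groupings.
MATERIALIZED = {
    ("user_id",),
    ("user_id", "category"),
}

def nearest_view(filter_columns):
    # A view is usable if its grouping dimensions include every filtered
    # column; among usable views, prefer the smallest (cheapest) one.
    candidates = [v for v in MATERIALIZED
                  if set(filter_columns) <= set(v)]
    return min(candidates, key=len) if candidates else None
```

Queries hitting a materialized view scan thousands of pre-summed rows instead of millions of raw ones, which is where the performance win comes from.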
You could also try the free single node edition of Greenplum: http://www.greenplum.com/products/single-node/ and http://www.dbms2.com/2009/10/19/greenplum-free-single-node-edition/
If a simple sum of 1 million documents took 20 seconds in Mongo, you probably don't have enough RAM. It's important with Mongo that the entire dataset fits in memory, or performance will suffer. You also didn't mention how you did the count; perhaps it's a problem with your map-reduce code. There's just too little detail to say what the issue is, but I've done more complicated map-reduces over an order of magnitude more documents that took less time, just running on my laptop.
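For comparison, a map-reduce sum over such a collection would typically look like the sketch below (collection and field names assumed). The JavaScript strings are what would be passed to MongoDB's `mapReduce` command; the Python function simulates the same logic locally so the intent is clear without a running server:

```python
# JavaScript that MongoDB would execute server-side (field name assumed).
map_js = "function() { emit(null, this.amount); }"
reduce_js = "function(key, values) { return Array.sum(values); }"

def map_reduce_sum(docs):
    # Local simulation: map emits (null, amount) for each document,
    # reduce sums all values emitted under the single key.
    emitted = [d["amount"] for d in docs]
    return sum(emitted)
```

Even written correctly, map-reduce runs interpreted JavaScript per document, so it is expected to be far slower than MySQL's native `SUM()`; a bad reduce function or a dataset that doesn't fit in RAM makes it slower still.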