I have a website that calculates a user's involvement/activity using multiple MySQL queries.
For a typical user I will ask:
How many updates have they made? How many photos have they uploaded? And so on.
These are just basic COUNT queries on the relevant tables (updates, photos, and so on). I then total the COUNT values of each to get a score. Each query requires one JOIN and takes around 0.0006 seconds; with 10 queries in total per user, that's 0.006 seconds in all.
While that's not so bad for one user, I have to calculate this for 100,000 users, for a theoretical processing time of 10 minutes and some 1,000,000 DB queries. It feels like I'm approaching the problem in the wrong way, and I was wondering if anyone had any ideas?
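For illustration, one of the ten per-user queries looks roughly like this (the table and column names here are examples, not my exact schema):

```sql
-- Count one user's photos; the one JOIN is to the users table
SELECT COUNT(*)
FROM photos
INNER JOIN users ON users.user_id = photos.user_id
WHERE users.user_id = 42;
```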
I had thought about storing the user's score in their user record and incrementing it every time they carry out a particular action, but it's not as flexible (I can't go back and see how many points were accrued on a certain day, for instance).
Any help greatly appreciated!
Assuming each of your tables has a user_id field of some kind, you can do something like this to get the total "activities" your users have done:
SELECT users.user_id,
       (COALESCE(update_counts.update_count, 0)
        + COALESCE(photo_counts.photo_count, 0)) AS activity_count
FROM users
LEFT JOIN (
    SELECT user_id, COUNT(*) AS update_count
    FROM updates
    GROUP BY user_id
) AS update_counts ON users.user_id = update_counts.user_id
LEFT JOIN (
    SELECT user_id, COUNT(*) AS photo_count
    FROM photos
    GROUP BY user_id
) AS photo_counts ON users.user_id = photo_counts.user_id
The LEFT JOINs and COALESCE ensure that users with no updates or no photos still appear with a score of zero rather than dropping out of the result entirely.
Obviously you can add tables as you need to, and you can weight things as you see fit. It should perform reasonably well if you have an index on the user_id field in each table, though it does depend on how big your tables get.
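If those indexes don't already exist, adding them is straightforward (index names here are illustrative):

```sql
CREATE INDEX idx_updates_user_id ON updates (user_id);
CREATE INDEX idx_photos_user_id  ON photos  (user_id);
```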
Once your tables get huge, you are going to need to start caching an activity_count in a cache table. You can cache values by date if you need to, of course.
If you only need a rough estimate, you could run this query at some regular frequency (say, once every night) and cache the results; this would be less intrusive than writing triggers for every table to update a cache table.
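A nightly cache along those lines might look something like this; the activity_count_cache table and the created_at column are assumptions, not part of your schema:

```sql
-- One cached total per user per day
CREATE TABLE activity_count_cache (
    user_id        INT  NOT NULL,
    activity_date  DATE NOT NULL,
    activity_count INT  NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, activity_date)
);

-- Nightly job: recompute yesterday's totals from one source table
REPLACE INTO activity_count_cache (user_id, activity_date, activity_count)
SELECT user_id, DATE(created_at), COUNT(*)
FROM updates
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY user_id;
```

If you want a single combined total across all the activity tables, UNION ALL them in the inner SELECT before grouping, since a REPLACE per table would overwrite rather than sum.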
I would add an activities table and a user_activity link table. The link table takes an activity_id, a user_id, and a timestamp. So for instance, when a user uploads a photo, a record is created with activity_id = 2 (for 'photo upload', referenced from the activities table), the user_id, and the current timestamp. That's easily queryable and eliminates the worry of long-running queries when you have zillions of users.
Unless you want to do an 11-way join, I'd create a separate table for this purpose that you insert into after each user action.
That table should contain just the username, a timestamp, the section (the table the action came from), and the unique id from the other table, so you have a backreference for deletes and such.
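Sketched out, with assumed table and column names, that might be:

```sql
CREATE TABLE user_actions (
    username    VARCHAR(50) NOT NULL,
    action_time TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP,
    section     VARCHAR(30) NOT NULL,  -- source table, e.g. 'photos'
    source_id   INT         NOT NULL   -- id of the row in that table
);

-- The backreference lets you clean up when the source row is deleted
DELETE FROM user_actions WHERE section = 'photos' AND source_id = 123;
```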
It sounds to me as though you are trying to optimize before it is truly necessary. If you don't yet have 100,000 users, you do not need to worry about such issues.
With that being said, there is no reason not to optimize; just don't overthink the problem for a solution you don't yet need.
Although you might experience slight inconsistencies, you could try caching each user's counts when they log in (using memcached) and only updating the cache when one of the counts changes. If users are very active, it would be more efficient to only update every hour or so.
This may be overkill for your application, but you could always go the OLAP route. This would allow you to have pre-aggregated measures along multiple dimensions, such as users and time spans. This gives you a flexible framework for a variety of reporting needs. The SQL Server Analysis Services package has worked well for our company.