Eager loading
One feature in ORM that's really useful and helps optimization is eager loading. The idea is quite simple, you define what objects in a graph you want to load for a single query and the ORM will make sure that the full graph is loaded after the query. Really efficient for graphs with complex hierarchies and the last year or so I've found myself always defining load spans for eager loading for every query I use.
I really like the basic idea of how Load Spans are defined in LINQ to SQL (but what's up with the per-context thing?), it's quite simple:
DataLoadOptions options = new DataLoadOptions();
options.LoadWith(user => user.ToDoLists);
options.LoadWith(list => list.ToDoItems);
context.LoadOptions = options;
This will instruct the context to always load the association ToDoLists when loading a User and always load the association ToDoItems when loading a ToDoList. Pretty neat. "But what if the associations have millions of rows in the database?" - you might ask. Well they thought about that to and you can add a call to "AssociateWith" to constraint load spans. Here's an example:
options.AssociateWith(list => list.ToDoItems.Where(item => item.IsDone));
Which will constrain the load span to just eager load the items where IsDone is set to true. Again a very neat solution to a complex problem. An observant reader would expect a "but" right about now; and here it is:
The But!
When digging into this feature, the first thing one will notice is that the eager load works fine for one level down. That is; if the load span just asks for User.ToDoLists everything will work as expected. Our scenario will be different though, with a load span that tries to load multiple levels we will get a different result.
Consider this piece of code:
DataLoadOptions options = new DataLoadOptions();
options.LoadWith(user => user.ToDoLists);
options.LoadWith(list => list.ToDoItems);
IEnumerable usersQuery;
using (ToDoDataContext context = new ToDoDataContext())
{
context.LoadOptions = options;
usersQuery = from user in context.Users
select user;
foreach (var item in usersQuery)
{
Console.WriteLine("{0} ({1})", item.FirstName, item.ToDoLists.Count);
foreach (var list in item.ToDoLists)
Console.WriteLine("{0} ({1})", list.Name, list.ToDoItems.Count);
}
}
Looking at the code here one would assume that when we start to iterate through the query, the complete object graph will have been loaded. In Linq To Sql this isn't true. Looking at the SQL server profiler trace for this query the truth quickly reveals itself. This is the first query issued to the database:
-->SELECT --> [t0].[UserId], [t0].[FirstName], [t0].[LastName], [t0].[UserName], [t0].[Enabled], [t0].[Version]
-->FROM --> [dbo].[Users] AS [t0]
Not much eager about that query one would say. The first thing that happens is that all the users gets fetched to the application. Well no harm in that, it's a perfectly viable solution for eager loading, event though not the most efficient one. The problems comes with the next query issued:
exec sp_executesql N'SELECT [t0].[ToDoListId], [t0].[UserId], [t0].[Name], [t0].[Description], [t0].[Version], [t1].[ToDoItemId], [t1].[ToDoListId] AS [ToDoListId2], [t1].[Name] AS [Name2], [t1].[Completed], [t1].[Version] AS [Version2], (
-->SELECT --> COUNT(*)
-->FROM --> [dbo].[ToDoItems] AS [t2]
-->WHERE --> [t2].[ToDoListId] = [t0].[ToDoListId]
) AS [value]
-->FROM --> [dbo].[ToDoLists] AS [t0]
LEFT OUTER -->JOIN --> [dbo].[ToDoItems] AS [t1] -->ON --> [t1].[ToDoListId] = [t0].[ToDoListId]
-->WHERE --> [t0].[UserId] = @x1
ORDER BY [t0].[ToDoListId], [t1].[ToDoItemId]',N'@x1 int',@x1=1
After carefully examining this query, you'll see that while it actually eager loads lists together with items, this query will only fetch lists and items for a single user and looking at the rest of the trace it becomes quite evident that every user will issue a command to get it's list and items. So for every user object an additional query will be generated. It is just plain wrong to call that eager loading, eager loading shouldn't have it's number of queries dependent on the number of objects at the root hierarchy!
Another issue that I've found is that the "eager loading" isn't done at the query execution level, but at the materialization level of the root object. That means that when a user object get's materialized, a second query is sent to the database to fetch all the lists and items. Up until now one could still argue that it's eager loading, but this approach just makes it "automatic lazy loading". For this scenario the ToDoList can't possible be argued to be eager loaded, the only difference between this approach and "lazy loading" is that the lazy load get's triggered by the materialization and not by the access to the list itself. To make matters even worse, the database connection is opened and closed for every materialization of a user object, so if you don't have a couple of connections in your pool or no pooling at all, this will be very expensive indeed.
Now in the example above I use the deferred loading feature in LINQ. What if I materialized the query before the iteration? As it turns out, calling ToList before the iteration will not change the behavior against the database. It will still be the same queries issued to the database in the same fashion as before, only now the materialization is hidden in the call to ToList.
This is not what I would call eager loading or even load spans in the common use of the term.
How SHOULD it be?
If I instruct my framework to eager load with a load span, I expect the framework to load my object graph with as few queries as possible. At most that would mean one query per level in the hierarchy. Something like:
-->select --> * -->from --> users -->where --> name like @p0 + '%'
-->select --> * -->from --> todolists
inner -->join --> ( -->select --> * -->from --> users -->where --> name like @p0 + '%') users -->on --> users.userId = todolists.userId
-->select --> * -->from --> todoitems
inner -->join -->
( -->select --> * -->from --> todolists
inner -->join --> ( -->select --> * -->from --> users -->where --> name like @p0 + '%') users -->on --> users.userId = todolists.userId)
lists -->on --> items.listid = lists.listid
Summary
So Eager Load in Linq To SQL is only eager loading for one level at a time. As it is for lazy loading, with Load Options we will still issue one query per row (or object) at the root level and this is something we really want to avoid to spare the database. Which is kind of the point with eager loading, to spare the database. The way LINQ to SQL issues queries for the hierarchy will decrease the performance by log(n) where n is the number of root objects. Calling ToList won't change the behavior but it will control when in time all the queries will be issued to the database.
|