Sunday, March 20, 2011

Iteration over large data sets in Rails


We often need to iterate over database rows in our migrations. When dealing with millions of records, basic iteration techniques don't work well: every loaded object consumes system memory, and some approaches issue at least one database query per object.
# >> Book.count :all
# => 4000216
Solution 1
class BooksUpdateTitles < ActiveRecord::Migration
  def self.up
    Book.all.each do |book|
      # ...
    end
  end

  def self.down
  end
end
The problem with this version is that it loads all 4000216 objects into memory at once: memory will be exhausted, the machine will start swapping to disk, and the migration will take hours to complete.
We can optimize it a little by specifying the :select parameter in our query:
Solution 2
Book.find(:all, :select => 'id').each do |t|
  book = Book.find(t.id)
  # ...
end
Solution 2 still instantiates all the objects in memory, but each one carries only the id field; the full record is then loaded one at a time inside the loop.
We need to avoid loading all objects into memory at once: we'll iterate over the collection and load only the current object.
Solution 3
last_id = 0
# :order => 'id' guarantees we walk the table in primary-key order;
# find(:first) returns nil when no rows match, which ends the loop
while book = Book.find(:first, :conditions => ['id > ?', last_id], :order => 'id')
  # ...
  last_id = book.id
end
Solution 3 is OK, but it can be sped up by loading objects in batches instead of one by one.
Solution 4
last_id = 0
loop do
  books = Book.find(:all, :conditions => ['id > ?', last_id], :order => 'id', :limit => 100)
  # find(:all) returns an empty (but truthy) array when no rows are left,
  # so we must break explicitly instead of testing the assignment
  break if books.empty?
  # ...
  last_id = books.last.id
end
Examining the log:
...
Book Load (0.000176) SELECT * FROM `books` WHERE (id > 0) LIMIT 100
...
We have loaded 100 objects with a single query. Solution 4 is the best of the approaches so far for iterating over large data sets, as it keeps memory usage bounded and issues far fewer SQL requests.
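This pattern is generic enough to pull out into a helper. Below is a minimal sketch of such a helper; the each_batch name and its signature are our own for illustration, not part of ActiveRecord:
def each_batch(model, batch_size = 100)
  last_id = 0
  loop do
    # Fetch the next slice of rows in primary-key order
    batch = model.find(:all, :conditions => ['id > ?', last_id],
                       :order => 'id', :limit => batch_size)
    break if batch.empty?
    yield batch
    last_id = batch.last.id
  end
end

each_batch(Book) do |books|
  # ...
end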
Update:
Mitchell proposed a better solution: ActiveRecord's find_in_batches method. DHH committed this feature on February 23, 2009; it permits iterating over large data sets in batches:
Solution 5
Book.find_in_batches(:batch_size => 100) do |results|
  # Do something with results
end
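Rails 2.3 also ships find_each alongside find_in_batches: it yields records one at a time while still fetching them from the database in batches under the hood, which is often more convenient than handling the arrays yourself:
Book.find_each(:batch_size => 100) do |book|
  # book is yielded individually; rows are still fetched 100 at a time
end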
