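I know next to nothing about Rails. Our RDB holds a huge number of rows, and registering all of them in Elasticsearch takes XX hours, so I'm trying to do something about it. What follows are speculative notes. I'll verify them at some point.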

Rewriting the current code

This is the full-table import. It currently takes XX hours. __elasticsearch__ is a proxy object that the elasticsearch-rails gem adds to model classes. Methods such as import and index_document live on it. The implementation of import is at https://github.com/elastic/elasticsearch-rails/blob/master/elasticsearch-model/lib/elasticsearch/model/importing.rb#L140-L185

Hoge.__elasticsearch__.import(index: new_index_name)

It can be rewritten like this. There is a method for indexing a single record, __elasticsearch__.index_document. The implementation of index_document is at https://github.com/elastic/elasticsearch-rails/blob/master/elasticsearch-model/lib/elasticsearch/model/indexing.rb#L370-L378

Hoge.find_each do |o|
    o.__elasticsearch__.index_document(index: new_index_name)
end

find_each is a wrapper around find_in_batches, so the loop can be rewritten like this.

Hoge.find_in_batches do |records|
    records.each do |o|
        o.__elasticsearch__.index_document(index: new_index_name)
    end
end
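
To make the relationship between the two methods concrete, here is a minimal plain-Ruby sketch. FakeModel and its rows are hypothetical stand-ins, not ActiveRecord itself; the point is only that iterating record by record can be layered on top of iterating batch by batch.

```ruby
# FakeModel is a hypothetical stand-in for an ActiveRecord model.
class FakeModel
  ROWS = (1..10).to_a

  # Yields arrays of at most batch_size rows, in the spirit of find_in_batches.
  def self.find_in_batches(batch_size: 1000)
    ROWS.each_slice(batch_size) { |batch| yield batch }
  end

  # Yields rows one at a time by delegating to find_in_batches,
  # in the spirit of find_each.
  def self.find_each(batch_size: 1000, &block)
    find_in_batches(batch_size: batch_size) { |batch| batch.each(&block) }
  end
end

batches = []
FakeModel.find_in_batches(batch_size: 4) { |batch| batches << batch }

rows = []
FakeModel.find_each(batch_size: 4) { |row| rows << row }
```

Both loops visit the same rows in the same order. The only difference is whether the block receives one record or an array of records.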

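Because find_in_batches hands over the records as an array, the array can be passed to Parallel. Parallel is a gem that makes multithreaded processing easy to use. in_threads is the number of threads. The with_connection block checks a DB connection out of the pool for each thread, so in_threads should stay within the pool size. Pushing it too hard will blow up the EC2 instance, so keep it modest.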

Hoge.find_in_batches do |records|
    Parallel.each(records, in_threads: 4) do |o|
        ActiveRecord::Base.connection_pool.with_connection do
            o.__elasticsearch__.index_document(index: new_index_name)
        end
    end
end
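
For intuition about what in_threads: 4 buys, here is a rough standard-library sketch of a fixed set of worker threads draining one batch. each_in_threads is a hypothetical helper and is not how the Parallel gem is actually implemented. The indexed queue stands in for the HTTP call to Elasticsearch.

```ruby
# each_in_threads is a hypothetical helper, not part of the Parallel gem.
def each_in_threads(items, thread_count)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # once closed and drained, pop returns nil and workers exit

  workers = Array.new(thread_count) do
    Thread.new do
      while (item = queue.pop)
        yield item
      end
    end
  end
  workers.each(&:join)
end

indexed = Queue.new # thread-safe collector standing in for Elasticsearch
each_in_threads((1..20).to_a, 4) { |id| indexed << id }

indexed_ids = []
indexed_ids << indexed.pop until indexed.empty?
```

Threads should mostly help here because indexing spends its time waiting on network I/O, during which other Ruby threads can run. The order in which records get indexed is no longer deterministic.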

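By the way, watching memory in Datadog during the import, it comes close to dipping into swap. I'll try turning off the query cache that ActiveRecord keeps in memory. No idea whether this helps. If the import runs from a rake task or the console rather than inside a web request, the query cache is probably off already, in which case this changes nothing.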

Hoge.uncached do
    Hoge.find_in_batches do |records|
        Parallel.each(records, in_threads: 4) do |o|
            ActiveRecord::Base.connection_pool.with_connection do
                o.__elasticsearch__.index_document(index: new_index_name)
            end
        end
    end
end

So much for the speculation. I don't know whether any of it really helps. I'm not even sure it runs.

What I hope happens

Rows are bulk-fetched from the RDB with find_in_batches, so this part shouldn't be much worse than it is now
- __elasticsearch__.import calls find_in_batches internally, so I'd like to think there is no difference here
  - https://github.com/elastic/elasticsearch-rails/blob/master/elasticsearch-model/lib/elasticsearch/model/importing.rb#L140-L185
  - https://github.com/elastic/elasticsearch-rails/blob/e0d46a864db2d1287afee1d88280a838df261bcb/elasticsearch-model/lib/elasticsearch/model/adapters/active_record.rb#L96-L109
Going multithreaded with Parallel ought to help quite a bit, no? Or not?
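Registering documents in Elasticsearch one at a time with __elasticsearch__.index_document looks like room for improvement, given that a Bulk API exists. __elasticsearch__.import does take care of this: in the importing.rb code linked above, each batch is sent with a single client.bulk call, so the per-record rewrite gives that batching up in exchange for threads.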
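
As a sketch of what one bulk request per batch would look like, the following builds the array-of-actions body that the elasticsearch-ruby client accepts for bulk. Doc, build_bulk_body and the index name hoge_v2 are hypothetical, and the as_indexed_json here is a simplified stand-in for the one elasticsearch-model provides.

```ruby
# Doc is a hypothetical stand-in for a model instance.
Doc = Struct.new(:id, :title) do
  # Simplified stand-in for the model's as_indexed_json.
  def as_indexed_json
    { "title" => title }
  end
end

# Turns one batch of records into a bulk body: one index action per record.
def build_bulk_body(records, index_name)
  records.map do |record|
    { index: { _index: index_name, _id: record.id, data: record.as_indexed_json } }
  end
end

records = [Doc.new(1, "foo"), Doc.new(2, "bar")]
bulk_body = build_bulk_body(records, "hoge_v2")

# With a real client, the whole batch would go out as a single request:
#   Hoge.__elasticsearch__.client.bulk(body: bulk_body)
```

Combining this with threads would mean parallelizing over batches rather than over individual records, which keeps the number of HTTP requests low.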
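find_in_batches apparently ignores any order I give it and walks the table by primary key instead. I want to be able to resume an interrupted run. Since batches come in primary key order, recording the last processed id and passing the next id as start: might be enough. Not sure yet.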
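
A minimal sketch of the resume idea, assuming batches arrive in ascending id order. batches_from is a hypothetical in-memory stand-in for find_in_batches(start:, batch_size:), and the checkpoint variable would have to live somewhere durable (a file, Redis, a DB row) in a real run.

```ruby
# Hypothetical stand-in for Hoge.find_in_batches(start: ..., batch_size: ...).
def batches_from(rows, start, batch_size)
  rows.select { |id| id >= start }.each_slice(batch_size) { |batch| yield batch }
end

rows = (1..10).to_a
checkpoint = 0 # last id known to be indexed; persist this in practice
indexed = []

import = lambda do |crash_on: nil|
  batches_from(rows, checkpoint + 1, 3) do |batch|
    raise "simulated crash" if batch.include?(crash_on)
    indexed.concat(batch)   # stands in for indexing the batch
    checkpoint = batch.last # advance only after the batch succeeded
  end
end

begin
  import.call(crash_on: 5) # first run dies while handling the second batch
rescue RuntimeError
  # checkpoint still points at the last fully indexed batch
end

import.call # second run resumes from checkpoint + 1
```

Because the checkpoint only moves after a whole batch succeeds, a resumed run may redo part of a batch but should not skip anything. Re-indexing a document with the same id overwrites it, so the redo is harmless.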
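In the end this is all on a single node. We already run Sidekiq EC2 instances under AutoScaling separately, so throwing the work at them and letting it run as asynchronous, distributed jobs might simply be faster.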
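
One way the Sidekiq idea could be cut up, as a sketch only: split the primary key range into chunks and enqueue one job per chunk. id_ranges and ReindexJob are hypothetical names, and the numbers are made up. The import call in the comment assumes the query: option of elasticsearch-model's ActiveRecord import, which I have not tried.

```ruby
# id_ranges is a hypothetical helper that splits min_id..max_id into
# [from, to] pairs of at most chunk ids each.
def id_ranges(min_id, max_id, chunk)
  (min_id..max_id).step(chunk).map do |from|
    [from, [from + chunk - 1, max_id].min]
  end
end

ranges = id_ranges(1, 10_500, 5_000)

# Each pair could become one job (ReindexJob is hypothetical):
#   ranges.each { |from, to| ReindexJob.perform_async(from, to, new_index_name) }
# and inside the job, something along the lines of:
#   Hoge.__elasticsearch__.import(index: index_name,
#                                 query: -> { where(id: from..to) })
```

Independent id ranges would also make resuming easier, since a failed chunk can be retried on its own without touching the others.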