Exploring foreign data wrappers in Postgres

- postgres ruby sequel

For many years now Postgres has been my database of choice, but I still regularly find new and interesting features that I wasn’t yet aware of. The latest entry in this long list are “foreign data wrappers”, an implementation of the SQL/MED (“SQL Management of External Data”) specification which was added in Postgres 9.1.

This mechanism allows for integrating our database with data stored outside of it. The most common use case for this is probably accessing a remote RDBMS (Postgres or others), but in this example we will use the file remote data wrapper to access a CSV file from disk.

Integrating external data

We will use the basic world cities dataset from Simplemaps in this example, so go and download this file now if you want to follow along.

To be able to define a “foreign table”, we first need to install the file_fdw extension and define a server that uses it.

CREATE EXTENSION file_fdw;
CREATE SERVER cities FOREIGN DATA WRAPPER file_fdw;

Once we have our server in place, we can define a foreign table to access data from it:

CREATE FOREIGN TABLE cities (
  city text,
  city_ascii text,
  lat real,
  long real,
  population real,
  country text,
  iso2 text,
  iso3 text,
  province text
) SERVER cities
OPTIONS (
  filename '/tmp/simplemaps-worldcities-basic.csv',
  format 'csv',
  header 'true'
);

The above creates the schema of our table and also specifies the location of our data file, the file format (file_fdw doesn’t only support CSV) and instructs Postgres to ignore the first row of the document since it’s a header.

We can now query this CSV file just like any regular old table:

test=# select city, population::int from cities order by population desc limit 3;
    city     | population
-------------+------------
 Tokyo       |   22006300
 Mumbai      |   15834918
 Mexico City |   14919501
(3 rows)

I don’t know about you, but I find the ability to access CSV data without having to first write an importer pretty exciting and performance is rather nice too:

test=# explain analyze select avg(population) from cities;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1896.14..1896.15 rows=1 width=8) (actual time=12.437..12.437 rows=1 loops=1)
   ->  Foreign Scan on cities  (cost=0.00..1851.60 rows=17816 width=4) (actual time=0.064..10.648 rows=7322 loops=1)
         Foreign File: /tmp/simplemaps-worldcities-basic.csv
         Foreign File Size: 570124
 Planning time: 0.605 ms
 Execution time: 13.517 ms
(6 rows)

Enter Sequel

Now it’s time to access our foreign table from Ruby. For this purpose we’ll be using Jeremy Evans’ excellent Sequel gem. It comes with its own console, so the command sequel postgres://localhost/test will drop us into an IRB session with the gem already required and the Sequel::Database object stored in the constant DB:

→ sequel postgres://localhost/test
Your database is stored in DB...
>> DB
#=> #<Sequel::Postgres::Database: "postgres://localhost/test" {:test=>true, :loggers=>[]}>

As a first step, we assign our cities table to a variable. This returns a Sequel::Dataset which represents the query that will be executed when we want to retrieve the data.

cities = DB[:cities]
#=> #<Sequel::Postgres::Dataset: "SELECT * FROM \"cities\"">

Now we can define a dataset called top3, which very closely resembles the SQL query shown above.

top3 = cities.select(:city, :population).reverse_order(:population).limit(3)
#=> #<Sequel::Postgres::Dataset: "SELECT \"city\", \"population\" FROM \"cities\" ORDER BY \"population\" DESC LIMIT 3">

NB: if we had enabled Sequel’s core extensions the order clause could have been written as order(:population.desc), but whether or not one wants to use this style is left to personal preference.

All Sequel::Datasets include the Enumerable module, so we can use its #to_a method to materialize our dataset:

top3.to_a
#=> [{:city=>"Tokyo", :population=>22006300.0}, {:city=>"Mumbai", :population=>15834900.0}, {:city=>"Mexico City", :population=>14919500.0}]

Ruby and Sequel are blissfully ignorant of the fact that they are dealing with a foreign table and we’re executing the same query we issued from the Postgres console.

But I want objects!

While the above query worked, it returned an array of plain hashes. If you prefer objects it’s better to use Sequel::Model instead. To do this, we need to define a class like so:

class City < Sequel::Model ; end

This exposes the underlying dataset via a class method:

City.dataset
=> #<Sequel::Postgres::Dataset: "SELECT * FROM \"cities\"">

However, it’s not considered good style to directly chain methods onto that. Instead the documentation recommends using a single method per tasks. If you’re wondering about dataset_module here, it’s a convenience method that let’s us create named dataset methods that return cached datasets:

class City < Sequel::Model
  dataset_module do
    def top3
      select(:city, :population).
      reverse_order(:population).
      limit(3)
    end
  end
end

Now let’s use our newly defined class:

top3 = City.top3.to_a
#=> [#<City @values={:city=>"Tokyo", :population=>22006300.0}>, #<City @values={:city=>"Mumbai", :population=>15834900.0}>, #<City @values={:city=>"Mexico City", :population=>14919500.0}>]

top3.first.city
#=> "Tokyo"

Same result as before, though with object access instead of plain hashes.

Wrap up

Postgres’ foreign data wrappers are an interesting concept and I’m definitely going to spend some more time exploring them, especially the more exotic ones like LDAP, IMAP and RSS provided by the Multicorn project.

Sequel is a wonderful gem and a good reminder that ActiveRecord is far from being the only game in town when it comes to database access and ORM in Ruby.