Exwiw

Export What I Want (Exwiw) is a Ruby gem that allows you to export records from a database to a dump file(to specifically, the full list of INSERT sql) on the specified conditions.

When to use

Most of case in developing a software, There is no better choice than the same data in production. You might make well-crafted data, but it's very very hard to maintain.

If you find the way to maintain the data for develoment env, then exwiw might be a solution for that.

Export the full database and mask data and import to another database.
Setup some system to replicate and mask data in real-time to another database.

You want to export only the data you want to export.

Features

Export the full list of INSERT sql for the specified conditions.
Provide serveral masking options for sensitive columns.
Provide config generator for ActiveRecord.

Installation

bundle add exwiw

Most of cases, you want to add 'require: false' to the Gemfile.

If bundler is not being used to manage dependencies, install the gem by executing:

gem install exwiw

Supported Databases

mysql
postgresql
sqlite
mongodb (experimental, see MongoDB notes)

For MySQL, exwiw connects through whichever of the mysql2 or trilogy gem is available (preferring mysql2), so an app on either driver works without any extra setup. There is no separate trilogy adapter name — pass --adapter=mysql either way.

Set EXWIW_MYSQL_DRIVER=trilogy (or mysql2) to force a specific driver. This is useful when the mysql2 gem is linked against a libmysqlclient that can no longer load the server's auth plugin — e.g. a MySQL 9.x client drops the mysql_native_password plugin and raises Authentication plugin 'mysql_native_password' cannot be loaded on connect. The pure-Ruby trilogy driver implements that auth handshake itself and sidesteps the issue.

Usage

exwiw has two subcommands:

export (default) — generate INSERT/COPY SQL files. If the subcommand is omitted, export is assumed.
explain — print the compiled SQL and its EXPLAIN output for each query that export would run, without executing the SELECTs.

`exwiw export`

# dump & masking all records from database to dump.sql based on schema.json
# pass database password as an environment variable 'DATABASE_PASSWORD'
exwiw \
  --adapter=mysql \
  --host=localhost \
  --port=3306 \
  --user=reader \
  --database=app_production \
  --config-dir=exwiw \
  --target-table=shops \
  --ids=1 \ # comma separated ids
  --output-dir=dump \
  --log-level=info

By default --ids are matched against the target table's primary key. --ids-column=COLUMN matches them against a different column instead (e.g. --target-table=users --ids=alice@example.com --ids-column=email). Related tables are still extracted correctly: their foreign keys are resolved through the target via a subquery (WHERE fk IN (SELECT pk FROM target WHERE COLUMN IN (...))), so only the target table's filter column changes. This is the SQL-adapter counterpart of the mongodb --ids-field; the two are mutually exclusive and each is rejected by the other adapter family. Note: if COLUMN is itself masked, re-running delete-* against an already-imported (masked) dump won't match, so prefer a stable natural key.

When --target-table and --ids are omitted, exwiw dumps all tables defined in --config-dir:

# dump all tables
exwiw \
  --adapter=postgresql \
  --host=localhost \
  --port=5432 \
  --user=reader \
  --database=app_production \
  --config-dir=exwiw \
  --output-dir=dump

This command will generate sql files in the dump directory.

dump/insert-000-schema.sql — idempotent CREATE TABLE IF NOT EXISTS ... for every table in scope. Apply this first to provision an empty database.
dump/insert-{idx}-{table_name}.sql
dump/delete-{idx}-{table_name}.sql

idx means the order of the dump. bigger idx might depend on smaller idx, so you should import the dump in order.

insert-000-schema.sql is generated by shelling out to the database client tools (mysqldump for mysql, pg_dump for postgresql, and the sqlite3 driver for sqlite), so the corresponding client must be available on PATH when running exwiw. For mysql, set EXWIW_MYSQLDUMP to point at a specific mysqldump binary when the one on PATH is incompatible with the server (e.g. a MySQL 9.x mysqldump cannot load mysql_native_password against a server still using that auth plugin — EXWIW_MYSQLDUMP=/path/to/mysql@8.0/bin/mysqldump). The output is post-processed to make it idempotent: CREATE TABLE IF NOT EXISTS, CREATE INDEX IF NOT EXISTS (where the engine supports it), and PostgreSQL's ALTER TABLE ... ADD CONSTRAINT statements are wrapped in DO $$ ... EXCEPTION WHEN duplicate_object.

you need to delete the records before importing the dump, delete-{idx}-{table_name}.sql will help you to do that. This sql will delete "all" related records to the extract targets. idx meaning is the same as insert sql.

`exwiw explain`

Print the compiled SQL and its EXPLAIN output (estimate-only; EXPLAIN QUERY PLAN on SQLite) for each query that export would run, to stdout. No SELECT is executed. Supported for mysql, postgresql, and sqlite. The mongodb adapter is not yet supported.

# preview the queries exwiw would run, without executing the SELECTs
exwiw explain \
  --adapter=postgresql \
  --host=localhost --port=5432 --user=reader \
  --database=app_production \
  --config-dir=exwiw \
  --target-table=shops --ids=1

The --output-dir, --output-format, --insert-only, and --after-insert-hook options are dump-specific and rejected when used with explain.

Generator

The config generator is provided as a Rake task.

# generate table schema under exwiw/
bundle exec rake exwiw:schema:generate

By default, the schema files will be saved in the exwiw directory. You can specify a different output directory by setting the OUTPUT_DIR_PATH environment variable:

OUTPUT_DIR_PATH=custom_directory bundle exec rake exwiw:schema:generate

Multiple databases

If the application uses Rails' multiple-database support (connects_to), schema:generate buckets models by the database they connect to and writes each database's config files into its own subdirectory of the output directory, named after the database config name (primary, analytics, ...):

exwiw/
  primary/
    shops.json
    users.json
    schema_migrations.json
  analytics/
    analytics_events.json

Each database keeps its own Rails migration history, so a schema_migrations (and ar_internal_metadata) entry is emitted under every database that contains one — the example above shows primary/schema_migrations.json and would also produce analytics/schema_migrations.json when the analytics database has its own migration table. Single-database applications are unaffected and continue to write files flat into the output directory.

Limitations

The rails-managed table names are resolved from the global ActiveRecord::Base.schema_migrations_table_name / internal_metadata_table_name accessors, which are shared across all connections. A per-database override of these names is not detected, so such a table will be missing from that database's generated configs.

Mongoid applications

For MongoDB applications backed by Mongoid, a separate rake task introspects Mongoid document models and emits MongodbCollectionConfig files (the fields / _id / embedded_in shape described under MongoDB notes):

bundle exec rake exwiw:schema:generate_mongoid

It is a distinct task and class (Exwiw::MongoidSchemaGenerator) from the ActiveRecord generator because the two ORMs expose entirely different metadata. From each model it derives:

the collection name and the _id primary key,
fields from the declared Mongoid fields (referenced belongs_to foreign keys such as shop_id, and the created_at / updated_at columns added by Mongoid::Timestamps, are ordinary fields — their BSON ObjectId / Date values serialize as MongoDB Extended JSON at dump time). For an aliased field (field :ctry, as: :country), the generator emits the stored document key (ctry), never the Ruby accessor (country), so masking and projection target the key that actually appears in the document, and additionally records the accessor as mongoid_field_name on that field so the short key stays understandable (association aliases such as shop => shop_id and the built-in id => _id are not field renames and are not annotated),
belongs_tos from referenced belongs_to associations ({ table_name, foreign_key }). A referenced belongs_to declared on an embedded document is dropped (cross-collection refs from inside embedded subdocuments are unsupported — see MongoDB notes), but its foreign-key column is still kept as an ordinary field. A has_and_belongs_to_many association is also dropped (its foreign keys are stored as an array field, e.g. tag_ids, which exwiw cannot follow as a single-valued foreign key), while that *_ids array column is kept as an ordinary field,
embedded_in from embedded_in / embeds_many / embeds_one associations. Each embedded config names its immediate parent collection and the document key it lives under (store_as, defaulting to the relation name); nested embedding is represented as a chain (comments → embedded_in posts, posts → embedded_in users) rather than a flattened dot-path, matching how the adapter recurses through array and Hash subdocuments. A polymorphic embedded_in (embedded_in :addressable, polymorphic: true) has no single embedding parent collection and so cannot be expressed as an embedded_in config; the generator raises a clear error pointing you to define that collection's config by hand. A self-referential / cyclic embedding (Mongoid's recursively_embeds_many / recursively_embeds_one) makes a collection both a top-level document and embedded inside documents of its own type; exwiw represents a collection as either top-level or embedded, not both, so the generator likewise raises a clear error rather than emit a config that would silently make the collection undumpable.

Models in an inheritance hierarchy whose subclasses share the base's collection (Mongoid STI, distinguished by the auto-added _type discriminator) collapse into a single config: the generator discovers the subclasses via descendants (Mongoid registers only the base class in Mongoid.models) and unions every class's fields and belongs_tos into the collection config, so subclass-only fields and associations are not lost.

Regeneration preserves hand-edited replace_with, filter, skip, and bulk_insert_chunk_size values, like the ActiveRecord generator. Indexes are not written to the config — they are introspected from the live database at dump time (see MongoDB notes). Polymorphic belongs_to is not yet expanded by this task.

Configuration

This is an example of the one table schema:

{
    "name": "users",
    "primary_key": "id",
    "filter": "users.id > 0",
    "bulk_insert_chunk_size": 1000,
    "belongs_to": [{
        "name": "companies",
        "foreign_key": "company_id"
    }],
    "columns": [{
        "name": "id",
    }, {
        "name": "email",
        "replace_with": "user{id}@example.com"
    }, {
        "name": "company_id"
    }]
}

--config-dir will use all json files in the specified directory.

Output format

By default, exwiw generates INSERT statements. For PostgreSQL, you can pass --output-format=copy to generate COPY FROM stdin format instead, which is significantly faster for bulk loading.

The generated file uses tab-separated values with PostgreSQL's text-format escaping (\N for NULL, \\ for backslash, etc.). Import with psql:

psql -d app_dev -f dump/insert-001-shops.sql

--output-format=copy is only supported with the postgresql adapter.

Skip DELETE SQL output

By default, exwiw generates delete-*.sql files alongside the insert-*.sql files so that an existing dataset can be cleared before re-inserting. Pass --insert-only when you only need the insert files.

After-insert hook

--after-insert-hook=PATH runs a post-processing hook after all per-table insert/delete files have been written. The hook can be either a Ruby file (.rb) or any executable script (e.g. .sh).

Ruby hook (.rb): provides a tiny DSL with two builtins:

cli_options — Hash of all parsed CLI options (e.g. cli_options.fetch(:ids) returns the --ids array).
insert_sql(template) — appends an ERB-rendered string to a buffer. After the hook finishes, the buffer is concatenated and written to insert-{N+1}-after_insert.{ext} where {N+1} is one past the last per-table insert file. For the MongoDB adapter the equivalent alias insert_jsonl(template) is available; output goes to insert-{N+1}-after_insert.jsonl. Multiple insert_sql calls in a single hook are joined with "\n" into the same file. If no insert_sql call is made, no file is created.

Example hooks/seed_default_users.rb:

insert_sql <<~SQL
  -- seed default users for tenants <%= cli_options.fetch(:ids).join(',') %>
  <%- cli_options.fetch(:ids).each do |tenant_id| -%>
  INSERT INTO users (tenant_id, email) VALUES (<%= tenant_id %>, 'default@example.com');
  <%- end -%>
SQL

Shell hook: anything other than .rb is exec'd as a child process. It is a pure side-effect hook — exwiw does not capture its stdout. The hook receives these env vars and inherits DATABASE_PASSWORD from the parent:

EXWIW_OUTPUT_DIR, EXWIW_CONFIG_DIR
EXWIW_DATABASE_ADAPTER, EXWIW_DATABASE_HOST, EXWIW_DATABASE_PORT, EXWIW_DATABASE_USER, EXWIW_DATABASE_NAME
EXWIW_TARGET_TABLE, EXWIW_IDS (comma-separated), EXWIW_OUTPUT_FORMAT

A non-zero exit code from the shell hook aborts exwiw.

Note: Ruby hooks are evaluated via instance_eval inside the exwiw process — only pass paths you trust.

Skip a table

Set "skip": true on a table's config JSON to exclude it from data extraction. The table's DDL is still emitted into insert-000-schema.{sql,js} so the schema stays consistent, but no insert-* / delete-* files are generated for it and the table is never queried.

{
  "name": "audit_logs",
  "primary_key": "id",
  "skip": true,
  "belongs_tos": [],
  "columns": [{ "name": "id" }]
}

Constraints:

If another non-skipped table has a belongs_to entry pointing at a skipped table, exwiw raises ArgumentError on load. Remove the belongs_to entry on the referencing table, or unset skip on the referenced table.
Specifying a skipped table as --target-table raises ArgumentError.
skip: true is preserved by exwiw:schema:generate regenerations (the receiver value wins over the auto-generated config).

Polymorphic `belongs_to`

A Rails polymorphic association (belongs_to :reviewable, polymorphic: true) does not point at a single table — the target row is selected at runtime by a type column. exwiw models this as one belongs_to entry per concrete target table, each carrying two extra fields:

foreign_type — the type column on this table (e.g. reviewable_type).
type_value — the value stored in that column for this target (e.g. "Product"), i.e. the target model's polymorphic_name.

{
  "name": "reviews",
  "primary_key": "id",
  "belongs_tos": [
    {
      "table_name": "products",
      "foreign_key": "reviewable_id",
      "foreign_type": "reviewable_type",
      "type_value": "Product"
    },
    {
      "table_name": "shops",
      "foreign_key": "reviewable_id",
      "foreign_type": "reviewable_type",
      "type_value": "Shop"
    }
  ],
  "columns": [{ "name": "id" }, { "name": "reviewable_type" }, { "name": "reviewable_id" }]
}

exwiw:schema:generate expands a polymorphic belongs_to automatically: it finds every model that registers the association as a target via has_many / has_one ..., as: :reviewable and emits one entry per target table (ordered by table name so the output is stable across Ruby versions). A plain (non-polymorphic) belongs_to simply omits foreign_type / type_value.

At dump time, when a polymorphic belongs_to lies on the path to the dump target, exwiw constrains both the foreign key and the type column, so only rows of the matching type are extracted. For example, dumping products pulls only reviews whose reviewable_type = 'Product':

SELECT reviews.* FROM reviews
WHERE reviews.reviewable_id IN (/* products subquery */)
  AND reviews.reviewable_type = 'Product'

The same type filter is applied on the join path — and in the matching delete-*.sql bulk-delete subquery — when the polymorphic table is an intermediate hop rather than the directly-dumped table.

Rails-managed tables (special `type` values)

Some tables are owned by Rails itself rather than the application — they have no ActiveRecord model and Rails reserves the right to evolve their column shape between versions (e.g. schema_migrations, ar_internal_metadata). exwiw treats them as a distinct category via the type field on a table config:

type: "rails_managed_schema_migrations" — Rails' migration history table (ActiveRecord::Base.schema_migrations_table_name).
type: "rails_managed_internal_metadata" — Rails' internal metadata table (ActiveRecord::Base.internal_metadata_table_name).

exwiw:schema:generate emits these entries automatically when the corresponding tables exist on the connection — they are NOT pulled from ActiveRecord::Base.descendants because they have no model class.

A rails-managed entry has a minimal shape (no primary_key, no belongs_tos, no columns):

{
  "name": "schema_migrations",
  "type": "rails_managed_schema_migrations",
  "comment": "Managed internally by Rails. Tracks applied schema migrations."
}

Behavior at dump time:

Extraction uses SELECT * so the dump is robust against Rails-side column additions.
INSERT statements omit the column list (INSERT INTO schema_migrations VALUES (...)). For PostgreSQL --output-format=copy, the COPY header similarly omits the column list (COPY schema_migrations FROM stdin;).
No delete-*.sql file is generated for rails-managed tables, to avoid wiping migration history on the import target.

Constraints:

Defining primary_key, columns, or belongs_tos on a rails-managed entry is rejected with ArgumentError on load.
A rails-managed table cannot be used as --target-table.
In multi-database setups, the rails-managed entry is emitted under whichever database's connection actually contains the table (see Multiple databases). The table name itself is still derived from the global ActiveRecord::Base.schema_migrations_table_name / internal_metadata_table_name (prefix/suffix) accessors.

Composite primary keys (unsupported)

exwiw does not yet support tables with a composite primary key. When exwiw:schema:generate encounters a model whose primary_key is an array, it still emits a config entry so the table is not silently dropped, but marks it skip: true, tags it type: "unsupported_composite_primary_key", and records the key columns in a comment:

{
  "name": "composite_pk_records",
  "type": "unsupported_composite_primary_key",
  "skip": true,
  "comment": "exwiw does not support composite primary keys (organization_id, location_id); data extraction is skipped.",
  "belongs_tos": [],
  "columns": [{ "name": "organization_id" }, { "name": "location_id" }, { "name": "name" }]
}

Unlike rails-managed entries, columns and belongs_tos are retained so the entry is ready to wire up once composite-key support lands. The type is purely a marker — skip: true is what actually excludes the table from extraction, so removing skip (and supplying a workable primary_key) lets you opt the table back in manually.

Bulk insert chunk size

bulk_insert_chunk_size splits the generated INSERT statement into multiple statements, each containing at most the specified number of rows. This is useful when the number of records per table is large enough to hit limits like MySQL's max_allowed_packet.

If omitted, all records for a table are emitted as a single INSERT statement.

Filter

Some case, you don't need full records related to target. e.g. dump user access logs only for the last year. filter is here for that. Be careful to use this option, as it will be:

injected as it is in table condition(e.g. WHERE on mysql), so you are recommended to clearify table name of column to avoid ambiguity.
injected to every where / join clause, so it affects to all tables depends on filterted target-table. it results to data inconsistency.

Masking

exwiw provides several options for masking value.

`replace_with`

It will replace the value with the specified string, and you can use the column name with {} to replace the value with the column value.

For example, Let assume we have the record which id is 1, then "userid@example.com" will be replaced with "user1@example.com".

`raw_sql`

It will used instead of the original value.

For example, "raw_sql": "CONCAT('user', shops.id, '@example.com')" is equivalent to "replace_with": "user{id}@example.com". This is useful when you want to transform with functions provided by the database.

Notice that you are recommended to clearify table name of column to avoid ambiguity.

If it used with replace_with, replace_with will be ignored.

`map`

XXX: TODO

Given value will be evaluated as Ruby code, and treated as the proc.

"map": "proc { |r| 'user' + v['id'].to_s + '@example.com' }"

which is equivalent to "replace_with": "user{id}@example.com".

Notice this is the most powerful option, but you should be careful to use this option. Because this transformation occured on exwiw process, so much slower than other options. Most of case, this option is not recommended.

MongoDB notes

The MongoDB adapter is experimental. To use it:

Add gem "mongo" to your Gemfile in addition to exwiw (it is not declared as a runtime dependency of the gem).
Set --adapter=mongodb. --user / DATABASE_PASSWORD are optional and only needed when your MongoDB requires authentication.
The MongoDB adapter consumes a separate config type, MongodbCollectionConfig, with MongoDB-native naming. Use fields (instead of the SQL adapters' columns), and set "primary_key": "_id". Foreign keys (shop_id, user_id, ...) stay as ordinary fields.
--ids values are coerced to the type actually stored in _id before filtering: integer-looking ids become Integer, 24-char hex ids become BSON::ObjectId (Mongoid's default _id type — a plain String would never match an ObjectId), and any other string is left as-is.
--target-collection=COLLECTION is a mongodb-only alias of --target-table (use whichever reads better for MongoDB). Specifying both, or using --target-collection with a non-mongodb adapter, is an error.
--ids-field=FIELD matches --ids against FIELD on the target collection instead of its primary key (e.g. --target-collection=users --ids=a@example.com --ids-field=email). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are not type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is mongodb-only; the SQL adapters use --ids-column instead (see below).
Output is JSON Lines (insert-{idx}-{collection}.jsonl) using MongoDB Extended JSON (relaxed mode). Import with mongoimport: bash mongoimport --db app_dev --collection users --file dump/insert-002-users.jsonl
The leading dump/insert-000-schema.js contains db.createCollection(...) and db.<col>.createIndex(...) calls for every top-level collection (indexes are introspected from the source via listIndexes; the auto-created _id_ index is skipped). Apply it with mongosh before running mongoimport: bash mongosh "mongodb://localhost/app_dev" dump/insert-000-schema.js
Unlike SQL adapters, the MongoDB adapter does not emit delete-*.jsonl files (drop the database / collection yourself before importing if needed).
raw_sql is not supported (the MongodbField schema does not declare it; any raw_sql keys in scenario JSON are silently dropped on load). Use replace_with for masking.
The MongoDB adapter does not support the collection-level filter field (it raises NotImplementedError if set, since the SQL-string filter cannot be applied to MongoDB).

Embedded documents

MongoDB models often store one-to-many relationships as embedded subdocument arrays (e.g. users documents with a posts: [...] field). To mask fields inside embedded subdocuments, declare a separate config with embedded_in:

// scenario/users.json — top-level collection
{
  "name": "users",
  "primary_key": "_id",
  "belongs_tos": [{ "table_name": "shops", "foreign_key": "shop_id" }],
  "fields": [
    { "name": "_id" },
    { "name": "name", "replace_with": "masked{_id}" },
    { "name": "shop_id" }
  ]
}

// scenario/posts.json — embedded under users.posts
{
  "name": "posts",
  "primary_key": "_id",
  "embedded_in": { "collection_name": "users", "path": "posts" },
  "belongs_tos": [],
  "fields": [
    { "name": "_id" },
    { "name": "title", "replace_with": "masked-{_id}" }
  ]
}

At runtime:

posts is not dumped as its own jsonl file. Its replace_with rules are applied to the subdocuments inside the parent users document at the path posts.
path accepts dot-separated paths for nested fields (e.g. "profile.contacts").
Both arrays of subdocuments and a single Hash subdocument at path are supported. Multiple levels of nesting work via embedded chains.
Cross-collection references from inside an embedded subdocument (belongs_tos on an embedded config) are not supported and raise ArgumentError on load.
Specifying an embedded config as --target-table raises NotImplementedError; pass the top-level collection name instead.

How it works

Load the table information from the specified config file.
Calculate the dependency between tables.
Generate the full list of INSERT sql based on the specified conditions.
- If the processing table has no relation with target tables, then dump all records.
- If the processing table has relation with target tables, then dump the records which are related to the target tables.
Generate the full list of DELETE sql based on the specified conditions.
- If the processing table has no relation with target tables, then delete all records.
- If the processing table has relation with target tables, then delete the records which are related to the target tables.

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install.

To release a new version:

Run the Release PR workflow from the Actions tab with the new version number (e.g. 0.2.3). This creates a PR that bumps version.rb and CHANGELOG.md.
Merge the PR. The Release workflow runs automatically, creating a git tag and publishing the gem to rubygems.org.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/heyinc/exwiw.

License

The gem is available as open source under the terms of the MIT License.