Exwiw
Export What I Want (Exwiw) is a Ruby gem that allows you to export records from a database to a dump file(to specifically, the full list of INSERT sql) on the specified conditions.
When to use
Most of case in developing a software, There is no better choice than the same data in production. You might make well-crafted data, but it's very very hard to maintain.
If you find the way to maintain the data for develoment env, then exwiw might be a solution for that.
- Export the full database and mask data and import to another database.
- Setup some system to replicate and mask data in real-time to another database.
You want to export only the data you want to export.
Features
- Export the full list of INSERT sql for the specified conditions.
- Provide serveral masking options for sensitive columns.
- Provide config generator for ActiveRecord.
Installation
bundle add exwiw
Most of cases, you want to add 'require: false' to the Gemfile.
If bundler is not being used to manage dependencies, install the gem by executing:
gem install exwiw
Supported Databases
- mysql
- postgresql
- sqlite
- mongodb (experimental, see MongoDB notes)
For MySQL, exwiw connects through whichever of the mysql2 or trilogy gem is
available (preferring mysql2), so an app on either driver works without any
extra setup. There is no separate trilogy adapter name — pass --adapter=mysql
either way.
Set EXWIW_MYSQL_DRIVER=trilogy (or mysql2) to force a specific driver. This
is useful when the mysql2 gem is linked against a libmysqlclient that can no
longer load the server's auth plugin — e.g. a MySQL 9.x client drops the
mysql_native_password plugin and raises Authentication plugin
'mysql_native_password' cannot be loaded on connect. The pure-Ruby trilogy
driver implements that auth handshake itself and sidesteps the issue.
Usage
exwiw has two subcommands:
export(default) — generate INSERT/COPY SQL files. If the subcommand is omitted,exportis assumed.explain— print the compiled SQL and itsEXPLAINoutput for each query thatexportwould run, without executing the SELECTs.
exwiw export
# dump & masking all records from database to dump.sql based on schema.json
# pass database password as an environment variable 'DATABASE_PASSWORD'
exwiw \
--adapter=mysql \
--host=localhost \
--port=3306 \
--user=reader \
--database=app_production \
--config-dir=exwiw \
--target-table=shops \
--ids=1 \ # comma separated ids
--output-dir=dump \
--log-level=info
By default --ids are matched against the target table's primary key. --ids-column=COLUMN matches them against a different column instead (e.g. --target-table=users --ids=alice@example.com --ids-column=email). Related tables are still extracted correctly: their foreign keys are resolved through the target via a subquery (WHERE fk IN (SELECT pk FROM target WHERE COLUMN IN (...))), so only the target table's filter column changes. This is the SQL-adapter counterpart of the mongodb --ids-field; the two are mutually exclusive and each is rejected by the other adapter family. Note: if COLUMN is itself masked, re-running delete-* against an already-imported (masked) dump won't match, so prefer a stable natural key.
When --target-table and --ids are omitted, exwiw dumps all tables defined in --config-dir:
# dump all tables
exwiw \
--adapter=postgresql \
--host=localhost \
--port=5432 \
--user=reader \
--database=app_production \
--config-dir=exwiw \
--output-dir=dump
This command will generate sql files in the dump directory.
dump/insert-000-schema.sql— idempotentCREATE TABLE IF NOT EXISTS ...for every table in scope. Apply this first to provision an empty database.dump/insert-{idx}-{table_name}.sqldump/delete-{idx}-{table_name}.sql
idx means the order of the dump. bigger idx might depend on smaller idx, so you should import the dump in order.
insert-000-schema.sql is generated by shelling out to the database client tools (mysqldump for mysql, pg_dump for postgresql, and the sqlite3 driver for sqlite), so the corresponding client must be available on PATH when running exwiw. For mysql, set EXWIW_MYSQLDUMP to point at a specific mysqldump binary when the one on PATH is incompatible with the server (e.g. a MySQL 9.x mysqldump cannot load mysql_native_password against a server still using that auth plugin — EXWIW_MYSQLDUMP=/path/to/mysql@8.0/bin/mysqldump). The output is post-processed to make it idempotent: CREATE TABLE IF NOT EXISTS, CREATE INDEX IF NOT EXISTS (where the engine supports it), and PostgreSQL's ALTER TABLE ... ADD CONSTRAINT statements are wrapped in DO $$ ... EXCEPTION WHEN duplicate_object.
you need to delete the records before importing the dump,
delete-{idx}-{table_name}.sql will help you to do that.
This sql will delete "all" related records to the extract targets.
idx meaning is the same as insert sql.
exwiw explain
Print the compiled SQL and its EXPLAIN output (estimate-only; EXPLAIN QUERY PLAN on SQLite) for each query that export would run, to stdout. No SELECT is executed. Supported for mysql, postgresql, and sqlite. The mongodb adapter is not yet supported.
# preview the queries exwiw would run, without executing the SELECTs
exwiw explain \
--adapter=postgresql \
--host=localhost --port=5432 --user=reader \
--database=app_production \
--config-dir=exwiw \
--target-table=shops --ids=1
The --output-dir, --output-format, --insert-only, and --after-insert-hook options are dump-specific and rejected when used with explain.
Generator
The config generator is provided as a Rake task.
# generate table schema under exwiw/
bundle exec rake exwiw:schema:generate
By default, the schema files will be saved in the exwiw directory. You can specify a different output directory by setting the OUTPUT_DIR_PATH environment variable:
OUTPUT_DIR_PATH=custom_directory bundle exec rake exwiw:schema:generate
Multiple databases
If the application uses Rails' multiple-database support (connects_to), schema:generate buckets models by the database they connect to and writes each database's config files into its own subdirectory of the output directory, named after the database config name (primary, analytics, ...):
exwiw/
primary/
shops.json
users.json
schema_migrations.json
analytics/
analytics_events.json
Each database keeps its own Rails migration history, so a schema_migrations (and ar_internal_metadata) entry is emitted under every database that contains one — the example above shows primary/schema_migrations.json and would also produce analytics/schema_migrations.json when the analytics database has its own migration table. Single-database applications are unaffected and continue to write files flat into the output directory.
Limitations
- The rails-managed table names are resolved from the global
ActiveRecord::Base.schema_migrations_table_name/internal_metadata_table_nameaccessors, which are shared across all connections. A per-database override of these names is not detected, so such a table will be missing from that database's generated configs.
Mongoid applications
For MongoDB applications backed by Mongoid, a separate rake task introspects Mongoid document models and emits MongodbCollectionConfig files (the fields / _id / embedded_in shape described under MongoDB notes):
bundle exec rake exwiw:schema:generate_mongoid
It is a distinct task and class (Exwiw::MongoidSchemaGenerator) from the ActiveRecord generator because the two ORMs expose entirely different metadata. From each model it derives:
- the collection name and the
_idprimary key, fieldsfrom the declared Mongoid fields (referencedbelongs_toforeign keys such asshop_id, and thecreated_at/updated_atcolumns added byMongoid::Timestamps, are ordinary fields — their BSONObjectId/Datevalues serialize as MongoDB Extended JSON at dump time). For an aliased field (field :ctry, as: :country), the generator emits the stored document key (ctry), never the Ruby accessor (country), so masking and projection target the key that actually appears in the document, and additionally records the accessor asmongoid_field_nameon that field so the short key stays understandable (association aliases such asshop => shop_idand the built-inid => _idare not field renames and are not annotated),belongs_tosfrom referencedbelongs_toassociations ({ table_name, foreign_key }). A referencedbelongs_todeclared on an embedded document is dropped (cross-collection refs from inside embedded subdocuments are unsupported — see MongoDB notes), but its foreign-key column is still kept as an ordinary field. Ahas_and_belongs_to_manyassociation is also dropped (its foreign keys are stored as an array field, e.g.tag_ids, which exwiw cannot follow as a single-valued foreign key), while that*_idsarray column is kept as an ordinary field,embedded_infromembedded_in/embeds_many/embeds_oneassociations. Each embedded config names its immediate parent collection and the document key it lives under (store_as, defaulting to the relation name); nested embedding is represented as a chain (comments→embedded_inposts,posts→embedded_inusers) rather than a flattened dot-path, matching how the adapter recurses through array and Hash subdocuments. A polymorphicembedded_in(embedded_in :addressable, polymorphic: true) has no single embedding parent collection and so cannot be expressed as anembedded_inconfig; the generator raises a clear error pointing you to define that collection's config by hand. A self-referential / cyclic embedding (Mongoid'srecursively_embeds_many/recursively_embeds_one) makes a collection both a top-level document and embedded inside documents of its own type; exwiw represents a collection as either top-level or embedded, not both, so the generator likewise raises a clear error rather than emit a config that would silently make the collection undumpable.
Models in an inheritance hierarchy whose subclasses share the base's collection (Mongoid STI, distinguished by the auto-added _type discriminator) collapse into a single config: the generator discovers the subclasses via descendants (Mongoid registers only the base class in Mongoid.models) and unions every class's fields and belongs_tos into the collection config, so subclass-only fields and associations are not lost.
Regeneration preserves hand-edited replace_with, filter, skip, and bulk_insert_chunk_size values, like the ActiveRecord generator. Indexes are not written to the config — they are introspected from the live database at dump time (see MongoDB notes). Polymorphic belongs_to is not yet expanded by this task.
Configuration
This is an example of the one table schema:
{
"name": "users",
"primary_key": "id",
"filter": "users.id > 0",
"bulk_insert_chunk_size": 1000,
"belongs_to": [{
"name": "companies",
"foreign_key": "company_id"
}],
"columns": [{
"name": "id",
}, {
"name": "email",
"replace_with": "user{id}@example.com"
}, {
"name": "company_id"
}]
}
--config-dir will use all json files in the specified directory.
Output format
By default, exwiw generates INSERT statements. For PostgreSQL, you can pass --output-format=copy to generate COPY FROM stdin format instead, which is significantly faster for bulk loading.
The generated file uses tab-separated values with PostgreSQL's text-format escaping (\N for NULL, \\ for backslash, etc.). Import with psql:
psql -d app_dev -f dump/insert-001-shops.sql
--output-format=copy is only supported with the postgresql adapter.
Skip DELETE SQL output
By default, exwiw generates delete-*.sql files alongside the insert-*.sql files so that an existing dataset can be cleared before re-inserting. Pass --insert-only when you only need the insert files.
After-insert hook
--after-insert-hook=PATH runs a post-processing hook after all per-table insert/delete files have been written. The hook can be either a Ruby file (.rb) or any executable script (e.g. .sh).
Ruby hook (.rb): provides a tiny DSL with two builtins:
cli_options— Hash of all parsed CLI options (e.g.cli_options.fetch(:ids)returns the--idsarray).insert_sql(template)— appends an ERB-rendered string to a buffer. After the hook finishes, the buffer is concatenated and written toinsert-{N+1}-after_insert.{ext}where{N+1}is one past the last per-table insert file. For the MongoDB adapter the equivalent aliasinsert_jsonl(template)is available; output goes toinsert-{N+1}-after_insert.jsonl. Multipleinsert_sqlcalls in a single hook are joined with"\n"into the same file. If noinsert_sqlcall is made, no file is created.
Example hooks/seed_default_users.rb:
insert_sql <<~SQL
-- seed default users for tenants <%= cli_options.fetch(:ids).join(',') %>
<%- cli_options.fetch(:ids).each do |tenant_id| -%>
INSERT INTO users (tenant_id, email) VALUES (<%= tenant_id %>, 'default@example.com');
<%- end -%>
SQL
Shell hook: anything other than .rb is exec'd as a child process. It is a pure side-effect hook — exwiw does not capture its stdout. The hook receives these env vars and inherits DATABASE_PASSWORD from the parent:
EXWIW_OUTPUT_DIR,EXWIW_CONFIG_DIREXWIW_DATABASE_ADAPTER,EXWIW_DATABASE_HOST,EXWIW_DATABASE_PORT,EXWIW_DATABASE_USER,EXWIW_DATABASE_NAMEEXWIW_TARGET_TABLE,EXWIW_IDS(comma-separated),EXWIW_OUTPUT_FORMAT
A non-zero exit code from the shell hook aborts exwiw.
Note: Ruby hooks are evaluated via instance_eval inside the exwiw process — only pass paths you trust.
Skip a table
Set "skip": true on a table's config JSON to exclude it from data extraction. The table's DDL is still emitted into insert-000-schema.{sql,js} so the schema stays consistent, but no insert-* / delete-* files are generated for it and the table is never queried.
{
"name": "audit_logs",
"primary_key": "id",
"skip": true,
"belongs_tos": [],
"columns": [{ "name": "id" }]
}
Constraints:
- If another non-skipped table has a
belongs_toentry pointing at a skipped table, exwiw raisesArgumentErroron load. Remove thebelongs_toentry on the referencing table, or unsetskipon the referenced table. - Specifying a skipped table as
--target-tableraisesArgumentError. skip: trueis preserved byexwiw:schema:generateregenerations (the receiver value wins over the auto-generated config).
Polymorphic belongs_to
A Rails polymorphic association (belongs_to :reviewable, polymorphic: true) does not point at a single table — the target row is selected at runtime by a type column. exwiw models this as one belongs_to entry per concrete target table, each carrying two extra fields:
foreign_type— the type column on this table (e.g.reviewable_type).type_value— the value stored in that column for this target (e.g."Product"), i.e. the target model'spolymorphic_name.
{
"name": "reviews",
"primary_key": "id",
"belongs_tos": [
{
"table_name": "products",
"foreign_key": "reviewable_id",
"foreign_type": "reviewable_type",
"type_value": "Product"
},
{
"table_name": "shops",
"foreign_key": "reviewable_id",
"foreign_type": "reviewable_type",
"type_value": "Shop"
}
],
"columns": [{ "name": "id" }, { "name": "reviewable_type" }, { "name": "reviewable_id" }]
}
exwiw:schema:generate expands a polymorphic belongs_to automatically: it finds every model that registers the association as a target via has_many / has_one ..., as: :reviewable and emits one entry per target table (ordered by table name so the output is stable across Ruby versions). A plain (non-polymorphic) belongs_to simply omits foreign_type / type_value.
At dump time, when a polymorphic belongs_to lies on the path to the dump target, exwiw constrains both the foreign key and the type column, so only rows of the matching type are extracted. For example, dumping products pulls only reviews whose reviewable_type = 'Product':
SELECT reviews.* FROM reviews
WHERE reviews.reviewable_id IN (/* products subquery */)
AND reviews.reviewable_type = 'Product'
The same type filter is applied on the join path — and in the matching delete-*.sql bulk-delete subquery — when the polymorphic table is an intermediate hop rather than the directly-dumped table.
Rails-managed tables (special type values)
Some tables are owned by Rails itself rather than the application — they have no ActiveRecord model and Rails reserves the right to evolve their column shape between versions (e.g. schema_migrations, ar_internal_metadata). exwiw treats them as a distinct category via the type field on a table config:
type: "rails_managed_schema_migrations"— Rails' migration history table (ActiveRecord::Base.schema_migrations_table_name).type: "rails_managed_internal_metadata"— Rails' internal metadata table (ActiveRecord::Base.internal_metadata_table_name).
exwiw:schema:generate emits these entries automatically when the corresponding tables exist on the connection — they are NOT pulled from ActiveRecord::Base.descendants because they have no model class.
A rails-managed entry has a minimal shape (no primary_key, no belongs_tos, no columns):
{
"name": "schema_migrations",
"type": "rails_managed_schema_migrations",
"comment": "Managed internally by Rails. Tracks applied schema migrations."
}
Behavior at dump time:
- Extraction uses
SELECT *so the dump is robust against Rails-side column additions. INSERTstatements omit the column list (INSERT INTO schema_migrations VALUES (...)). For PostgreSQL--output-format=copy, theCOPYheader similarly omits the column list (COPY schema_migrations FROM stdin;).- No
delete-*.sqlfile is generated for rails-managed tables, to avoid wiping migration history on the import target.
Constraints:
- Defining
primary_key,columns, orbelongs_toson a rails-managed entry is rejected withArgumentErroron load. - A rails-managed table cannot be used as
--target-table. - In multi-database setups, the rails-managed entry is emitted under whichever database's connection actually contains the table (see Multiple databases). The table name itself is still derived from the global
ActiveRecord::Base.schema_migrations_table_name/internal_metadata_table_name(prefix/suffix) accessors.
Composite primary keys (unsupported)
exwiw does not yet support tables with a composite primary key. When exwiw:schema:generate encounters a model whose primary_key is an array, it still emits a config entry so the table is not silently dropped, but marks it skip: true, tags it type: "unsupported_composite_primary_key", and records the key columns in a comment:
{
"name": "composite_pk_records",
"type": "unsupported_composite_primary_key",
"skip": true,
"comment": "exwiw does not support composite primary keys (organization_id, location_id); data extraction is skipped.",
"belongs_tos": [],
"columns": [{ "name": "organization_id" }, { "name": "location_id" }, { "name": "name" }]
}
Unlike rails-managed entries, columns and belongs_tos are retained so the entry is ready to wire up once composite-key support lands. The type is purely a marker — skip: true is what actually excludes the table from extraction, so removing skip (and supplying a workable primary_key) lets you opt the table back in manually.
Bulk insert chunk size
bulk_insert_chunk_size splits the generated INSERT statement into multiple statements, each containing at most the specified number of rows. This is useful when the number of records per table is large enough to hit limits like MySQL's max_allowed_packet.
If omitted, all records for a table are emitted as a single INSERT statement.
Filter
Some case, you don't need full records related to target. e.g. dump user access logs only for the last year.
filter is here for that. Be careful to use this option, as it will be:
- injected as it is in table condition(e.g. WHERE on mysql), so you are recommended to clearify table name of column to avoid ambiguity.
- injected to every where / join clause, so it affects to all tables depends on filterted target-table. it results to data inconsistency.
Masking
exwiw provides several options for masking value.
replace_with
It will replace the value with the specified string,
and you can use the column name with {} to replace the value with the column value.
For example, Let assume we have the record which id is 1, then "userid@example.com" will be replaced with "user1@example.com".
raw_sql
It will used instead of the original value.
For example, "raw_sql": "CONCAT('user', shops.id, '@example.com')" is equivalent to
"replace_with": "user{id}@example.com".
This is useful when you want to transform with functions provided by the database.
Notice that you are recommended to clearify table name of column to avoid ambiguity.
If it used with replace_with, replace_with will be ignored.
map
XXX: TODO
Given value will be evaluated as Ruby code, and treated as the proc.
"map": "proc { |r| 'user' + v['id'].to_s + '@example.com' }"
which is equivalent to "replace_with": "user{id}@example.com".
Notice this is the most powerful option, but you should be careful to use this option. Because this transformation occured on exwiw process, so much slower than other options. Most of case, this option is not recommended.
MongoDB notes
The MongoDB adapter is experimental. To use it:
- Add
gem "mongo"to your Gemfile in addition toexwiw(it is not declared as a runtime dependency of the gem). - Set
--adapter=mongodb.--user/DATABASE_PASSWORDare optional and only needed when your MongoDB requires authentication. - The MongoDB adapter consumes a separate config type,
MongodbCollectionConfig, with MongoDB-native naming. Usefields(instead of the SQL adapters'columns), and set"primary_key": "_id". Foreign keys (shop_id,user_id, ...) stay as ordinary fields. --idsvalues are coerced to the type actually stored in_idbefore filtering: integer-looking ids becomeInteger, 24-char hex ids becomeBSON::ObjectId(Mongoid's default_idtype — a plain String would never match an ObjectId), and any other string is left as-is.--target-collection=COLLECTIONis a mongodb-only alias of--target-table(use whichever reads better for MongoDB). Specifying both, or using--target-collectionwith a non-mongodb adapter, is an error.--ids-field=FIELDmatches--idsagainstFIELDon the target collection instead of its primary key (e.g.--target-collection=users --ids=a@example.com --ids-field=email). Downstream foreign-key propagation still keys off the primary key, so only the target collection's filter changes. Unlike the primary-key path, the supplied ids are not type-coerced (the stored type of a custom field is unknown), so pass values matching the field's actual type. This flag is mongodb-only; the SQL adapters use--ids-columninstead (see below).- Output is JSON Lines (
insert-{idx}-{collection}.jsonl) using MongoDB Extended JSON (relaxed mode). Import withmongoimport:bash mongoimport --db app_dev --collection users --file dump/insert-002-users.jsonl - The leading
dump/insert-000-schema.jscontainsdb.createCollection(...)anddb.<col>.createIndex(...)calls for every top-level collection (indexes are introspected from the source vialistIndexes; the auto-created_id_index is skipped). Apply it with mongosh before runningmongoimport:bash mongosh "mongodb://localhost/app_dev" dump/insert-000-schema.js - Unlike SQL adapters, the MongoDB adapter does not emit
delete-*.jsonlfiles (drop the database / collection yourself before importing if needed). raw_sqlis not supported (theMongodbFieldschema does not declare it; anyraw_sqlkeys in scenario JSON are silently dropped on load). Usereplace_withfor masking.- The MongoDB adapter does not support the collection-level
filterfield (it raisesNotImplementedErrorif set, since the SQL-string filter cannot be applied to MongoDB).
Embedded documents
MongoDB models often store one-to-many relationships as embedded subdocument arrays (e.g. users documents with a posts: [...] field). To mask fields inside embedded subdocuments, declare a separate config with embedded_in:
// scenario/users.json — top-level collection
{
"name": "users",
"primary_key": "_id",
"belongs_tos": [{ "table_name": "shops", "foreign_key": "shop_id" }],
"fields": [
{ "name": "_id" },
{ "name": "name", "replace_with": "masked{_id}" },
{ "name": "shop_id" }
]
}
// scenario/posts.json — embedded under users.posts
{
"name": "posts",
"primary_key": "_id",
"embedded_in": { "collection_name": "users", "path": "posts" },
"belongs_tos": [],
"fields": [
{ "name": "_id" },
{ "name": "title", "replace_with": "masked-{_id}" }
]
}
At runtime:
postsis not dumped as its own jsonl file. Itsreplace_withrules are applied to the subdocuments inside the parentusersdocument at the pathposts.pathaccepts dot-separated paths for nested fields (e.g."profile.contacts").- Both arrays of subdocuments and a single Hash subdocument at
pathare supported. Multiple levels of nesting work via embedded chains. - Cross-collection references from inside an embedded subdocument (
belongs_toson an embedded config) are not supported and raiseArgumentErroron load. - Specifying an embedded config as
--target-tableraisesNotImplementedError; pass the top-level collection name instead.
How it works
- Load the table information from the specified config file.
- Calculate the dependency between tables.
- Generate the full list of INSERT sql based on the specified conditions.
- If the processing table has no relation with target tables, then dump all records.
- If the processing table has relation with target tables, then dump the records which are related to the target tables.
- Generate the full list of DELETE sql based on the specified conditions.
- If the processing table has no relation with target tables, then delete all records.
- If the processing table has relation with target tables, then delete the records which are related to the target tables.
Development
After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install.
To release a new version:
- Run the Release PR workflow from the Actions tab with the new version number (e.g.
0.2.3). This creates a PR that bumpsversion.rbandCHANGELOG.md. - Merge the PR. The Release workflow runs automatically, creating a git tag and publishing the gem to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/heyinc/exwiw.
License
The gem is available as open source under the terms of the MIT License.