How is database disk space calculated on Masto.host?

The Mastodon database includes many PostgreSQL indexes. Indexes make queries against large data sets faster. PostgreSQL indexes have one peculiarity, though: they do not release disk space when data is deleted from a table.

For example, let’s say you have a PostgreSQL database with 1000 records and a couple of indexes on those records. Then you decide to delete half of the records. You might expect the database disk space to be cut in half, but that does not happen in PostgreSQL. To release the disk space, you need to run a REINDEX, which by default blocks writes to a table while its indexes are rebuilt, so it should not be done while Mastodon is using the database. So, a Mastodon server must go offline to run the reindex.

Besides that, the reindex would also need to run frequently for the reported disk space usage to stay accurate. I could use a tool like pg_repack, which recovers the extra disk space without taking the database offline, but it would need to run just as often. In the end, many CPU cycles would be wasted, and it would open the door to other problems.

To work around this, I calculate database disk space usage with a multiplier. I take the size of the database backup archive (pg_dump custom format) as the base and multiply it by 4.9. Then I compare the result with the actual database disk usage. Whichever is smaller is the value presented under database disk space usage.

Currently, the multiplier is 4.9, but it can change in the future as Mastodon evolves or if I realise that the value is not producing accurate results.

To illustrate, let’s say a database’s pg_dump backup archive is 1 GB in size. I will assume the database uses 4.9 GB of disk space (1 GB × 4.9). Then I compare that with the actual disk space used by that database, and whichever is smaller is the value presented under database disk space usage.
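
The same calculation can be written as a few lines of Python. This is only a sketch of the rule described above, not the code Masto.host actually runs: the function name and the 6 GB and 4 GB "actual" sizes are made up for illustration, and the 4.9 multiplier is the current value, which may change.

```python
# Current multiplier from the text above; it may change in the future.
DUMP_MULTIPLIER = 4.9

GB = 1024 ** 3  # bytes in one gigabyte (binary)


def reported_db_usage(dump_bytes, actual_bytes, multiplier=DUMP_MULTIPLIER):
    """Return the smaller of the dump-based estimate and the actual on-disk size.

    Hypothetical helper: dump_bytes is the size of the pg_dump archive,
    actual_bytes is the disk space the database really occupies.
    """
    estimate = dump_bytes * multiplier
    return min(estimate, actual_bytes)


# 1 GB dump, 6 GB on disk (assumed): the 4.9 GB estimate is smaller, so it is shown.
bloated = reported_db_usage(1 * GB, 6 * GB)

# 1 GB dump, 4 GB on disk (assumed): the actual size is smaller, so it is shown.
compact = reported_db_usage(1 * GB, 4 * GB)

print(f"{bloated / GB:.1f} GB")  # 4.9 GB
print(f"{compact / GB:.1f} GB")  # 4.0 GB
```

In the first case the database carries extra space that a reindex would recover, so the estimate is reported. In the second case the database is already smaller than the estimate, so the real figure is reported.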

If you go to the Mastodon administration interface, you will see the actual disk space used by the database. The database disk space usage shown on Masto.host will always be less than or equal to that value. Also, the value you see on Masto.host should be closer to the disk space needed to restore the database, or to the disk space used after running the reindex command.