Transformations in PostgreSQL
Standardizing data across the database for a Data Warehouse
This is a short summary of how to use PostgreSQL functions to transform data from an OLTP application into something useful for an OLAP data warehouse, especially taking into consideration all the non-standard SQL data types supported by PostgreSQL.
The idea is to write a functions library, transformation_utils, to transform data in PostgreSQL. The process of transformation will include the following things:
- changing one data type to another, e.g., UUID to CHAR
- unnesting nested data types to store data in separate columns and exploit the columnar nature of Redshift
- cleaning up strings with characters not accepted by Redshift
These transformations will not be updates to existing columns. They will write data to new columns in the same table. All the old columns (columns from the source system) will exist as they did in the source system.
Numeric data types won’t require much transformation, as Redshift supports almost all of the common numeric data types.
Nomenclature — no change in column names from existing schema.
Amazon Redshift doesn’t allow certain characters, so we will have to write functions to remove them (or replace them with acceptable characters) before the data is loaded into Redshift; otherwise, the loading job will fail.
Removing Invalid Characters
One example of a range of characters not supported by Redshift is 0xD800–0xDB7F: High Surrogates.
select replace('There is a � bad character here','�',' ');
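A sketch of such a cleanup function, assuming the transformation_utils schema mentioned earlier. The name and the character whitelist here are assumptions: it treats everything outside printable ASCII (plus tab, newline, and carriage return) as invalid, whereas a real implementation would target only the specific ranges Redshift rejects.

```sql
-- Hypothetical helper: replace characters outside printable ASCII
-- (and tab/newline/carriage return) with a space.
CREATE OR REPLACE FUNCTION transformation_utils.clean_invalid_chars(input text)
RETURNS text AS $$
  SELECT regexp_replace(input, '[^\x09\x0a\x0d\x20-\x7e]', ' ', 'g');
$$ LANGUAGE sql IMMUTABLE;
```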
If writing a function doesn’t work out, the last resort will be using the ACCEPTINVCHARS option while writing the data frame from the Glue Job to Redshift. With that option, Redshift replaces each invalid character with a replacement character, which we can also specify. In case of a multi-byte invalid character, the replacement character is repeated as many times as the number of bytes in the original unaccepted character.
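For reference, ACCEPTINVCHARS is also an option of Redshift’s COPY command. A sketch, where the table name, S3 path, and IAM role are placeholders:

```sql
-- Placeholders throughout: table, S3 path, and IAM role are illustrative.
COPY staging_orders
FROM 's3://my-bucket/exports/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV
ACCEPTINVCHARS AS '?';  -- each invalid byte becomes '?'
```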
Nomenclature — no change in column names from existing schema. Also, please note that this is the one case where we don’t have to keep the original column as it is not accepted by Redshift.
UUIDs
These are 128-bit universally unique identifiers. Redshift doesn’t support storage of UUIDs natively, so in Redshift they’ll have to be stored as CHAR(36).
Nomenclature — no change in column names from existing schema.
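Since PostgreSQL can cast uuid to text directly, the extraction side can be as simple as the following, where the table and column names are illustrative:

```sql
-- A uuid rendered as text is 36 characters (32 hex digits + 4 hyphens),
-- so the target Redshift column can be CHAR(36).
SELECT id,
       id::text AS id_char
FROM users;
```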
Let’s say there’s a column which is an array of text values, and we want to break that array into many columns for the purpose of analysis. We must decide preemptively the maximum number of values this array can hold and create a column for each value. This way we can get all the array elements stored as columns, but we have to know the maximum number of elements in advance.
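For example, assuming a hypothetical orders table with a text[] column items, the first three elements can be pulled into columns by subscripting. PostgreSQL arrays are 1-based, and out-of-range subscripts return NULL:

```sql
SELECT id,
       items[1] AS item_1,
       items[2] AS item_2,
       items[3] AS item_3
FROM orders;
```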
If we don’t want to preemptively decide on the maximum number of elements in a given array, we’ll have to use unnest to create a row for every element of the array and store the elements as records. unnest will return 5 rows if the array has 5 elements.
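Continuing with the assumed orders table, a sketch of the unnest approach:

```sql
-- One output row per array element; the other columns of the row
-- are duplicated for each element.
SELECT id, unnest(items) AS item
FROM orders;
```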
Nomenclature — say there is an array column called items, and we want to get the first 3 items of that array as columns; the nomenclature will be item_1, item_2, and item_3. We will also store the complete array after converting it to JSON.
We can use the json_to_record function to convert any given JSON to a record type. In this case, whenever there are additions to the JSON structure, the structure of the target transformed table will need to be changed. For instance, look at this sample JSON from the Razorpay payment gateway.
Although the application will be capturing this in a JSON data type, native to PostgreSQL, we will have to convert this to a more usable, queryable format.
This will give us a record type which we can then insert into a new structured table where querying becomes easier. Although Amazon Redshift supports JSON, it isn’t recommended for large data sets, because storing so much information in a single column defeats the purpose of using a columnar database.
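A minimal sketch of json_to_record, using made-up payment fields rather than the full Razorpay payload. Note that the column definition list after AS is mandatory:

```sql
SELECT *
FROM json_to_record(
       '{"id": "pay_123", "amount": 50000, "currency": "INR"}'::json
     ) AS payment(id text, amount integer, currency text);
```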
Nomenclature — See example 1.
Upper and lower cases are achieved by using native PostgreSQL functions. Camel or snake casing in data won’t be of any use in data warehousing, as far as we can tell.
Nomenclature — if lower() or other casing functions are applied to a column, a suffix will be added to the column name.
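For instance, where the _lower suffix is one possible convention and the table and column are illustrative:

```sql
-- Keep the original column; add a lowercased copy under a suffixed name.
SELECT email,
       lower(email) AS email_lower
FROM users;
```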
To use all the temporal data types as dimensions in the data warehouse, we need to break them up into smaller usable units — dates and seconds of day — so that they can be queried against the date and time dimension tables. The date dimension table will contain the date_key in the format yyyymmdd, as prescribed by ISO 8601. Both of these dimensions will be pre-filled with data. To ensure that all the source tables can use these dimension tables with a simple JOIN, we need to ensure that we have temporal data types in these formats. For this too, we will create a couple of functions.
Timestamp to Date
The date converted by this function will be returned as an integer and stored as such — matching the primary key of the date dimension table. This key will act as a role-playing dimension.
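A sketch of such a function, assuming the transformation_utils schema; the function name is an assumption:

```sql
-- Convert a timestamp to an integer yyyymmdd date key,
-- e.g. timestamp '2021-03-14 09:26:53' becomes 20210314.
CREATE OR REPLACE FUNCTION transformation_utils.timestamp_to_date_key(ts timestamp)
RETURNS integer AS $$
  SELECT to_char(ts, 'YYYYMMDD')::integer;
$$ LANGUAGE sql IMMUTABLE;
```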
Time to Seconds of Day
The time returned by this function will be an integer and will be stored as such — matching the primary key of the time dimension. This key will also act as a role-playing dimension.
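A sketch under the same assumptions as above (schema and function name are ours):

```sql
-- Convert a time to integer seconds since midnight,
-- e.g. time '01:30:00' becomes 5400.
CREATE OR REPLACE FUNCTION transformation_utils.time_to_seconds_of_day(t time)
RETURNS integer AS $$
  SELECT extract(epoch FROM t)::integer;
$$ LANGUAGE sql IMMUTABLE;
```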
created_at will remain intact, but there will be two additional columns added to the table — one holding the date key and one holding the seconds of day.
Because we’ll be using PostGIS for storing all location data in the application database, and our data warehouse doesn’t support PostGIS data types, we’ll have to extract coordinates from the Points stored in the database. All location-related data will be stored as double precision floating point in Redshift.
For creating a point from a longitude-latitude pair, we do the following,
SELECT ST_MakePoint(-71.1043443253471, 42.3150676015829);
which gives us the hex WKB representation 0101000000E538D293ADC651C0F3699A2254284540.
To extract longitude & latitude back from this representation of a point, we’ll have to do the following (note that ST_X is the longitude and ST_Y is the latitude)
SELECT ST_X('0101000000E538D293ADC651C0F3699A2254284540') AS longitude, ST_Y('0101000000E538D293ADC651C0F3699A2254284540') AS latitude;
which gives us back the original longitude & latitude. Conversions for Linestrings will also work similarly; for example, we can get all the nodes of a linestring.
Other shapes and forms used in PostGIS are to be handled similarly.
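A sketch of extracting linestring nodes with ST_DumpPoints (requires PostGIS; the linestring literal is illustrative):

```sql
-- One row per node of the linestring; ST_X is longitude, ST_Y is latitude.
SELECT ST_X((dp).geom) AS longitude,
       ST_Y((dp).geom) AS latitude
FROM (
  SELECT ST_DumpPoints(
           ST_GeomFromText('LINESTRING(-71.104 42.315, -71.092 42.321)')
         ) AS dp
) AS nodes;
```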
Nomenclature — the original string that represents a linestring will remain. In case of a point, two new columns will be added to the table — if the column is order_location, the two new columns will be order_location_longitude and order_location_latitude.
In this example, how would we handle the following issues —
- When there are more items in the array order_items than we created columns for — we will store the complete array as JSON and load it to Redshift, and we will add new columns item_name_4 and so on, if required. It will be a manual effort.
- When there are changes (additions, or changes in names of keys) in a JSON column, we’ll again have to manually create a column. If the schema changes and there’s no manual change, the data of the JSON will still be queryable in Redshift — see the references for JSON support in Redshift.
- Use unnest only when you are sure you want to duplicate the whole row — or create a separate table with the unnested values and a reference to the original table.
- enum types should be stored as VARCHAR in Redshift.
- Multi-byte Character Load Errors in Redshift
- ACCEPTINVCHARS in Redshift
- Supported Data Types in Redshift
- Making Lines from Points
- Array Functions in PostgreSQL
- Sample API in Razorpay
- JSON Functions in Redshift
- PostGIS — ST_DumpPoints