Transformations in PostgreSQL
Standardizing data across the database for a Data Warehouse
This is a short summary of how to use PostgreSQL functions to transform data from an OLTP application into something that is useful for an OLAP data warehouse, especially taking into consideration all the non-standard SQL data types supported by PostgreSQL.
The idea is to write a functions library, transformation_utils, to transform data in PostgreSQL. The process of transformation will include the following:
- changing one data type to another, e.g., timestamp to date
- unnesting nested data types to store data in separate columns and exploit the columnar nature of Redshift
- cleaning up strings containing characters not accepted by Redshift
These transformations will not update existing columns; they will write data to new columns in the same table. All the original columns from the source system will remain as they are.
Numeric
Numeric data types won’t require much transformation, as Redshift supports almost all of PostgreSQL’s numeric data types.
Nomenclature — no change in column names from existing schema.
String
Amazon Redshift doesn’t accept certain characters, so we will have to write functions to remove them (or replace them with acceptable characters) before the data is loaded into Redshift; otherwise, the loading job will fail.
Removing Invalid Characters
One example of a character range not supported by Redshift is 0xD800–0xDB7F (High Surrogates).
select replace('There is a � bad character here','�',' ');
select encode('�','hex');
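A minimal sketch of such a cleanup function, assuming we strip control characters and the Unicode replacement character; the function name and the exact character class are assumptions and should be adjusted to whatever Redshift actually rejects:
create or replace function clean_string(input text)
returns text
language sql
immutable
as $$
    -- replace control characters and U+FFFD with a space; extend the class as needed
    select regexp_replace(input, '[\u0001-\u0008\u000B\u000C\u000E-\u001F\uFFFD]', ' ', 'g');
$$;
select clean_string('There is a � bad character here');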
If writing a function doesn’t work out, the last resort will be using the ACCEPTINVCHARS option while writing the data frame from the Glue job to Redshift. With that option, each invalid character is replaced with a replacement character, which we can also specify. In the case of a multi-byte invalid character, the replacement character is repeated as many times as there are bytes in the original unaccepted character.
Nomenclature — no change in column names from the existing schema. Also, note that this is the one case where we don’t have to keep the original column, as it is not accepted by Redshift.
UUID
These are 128-bit universally unique identifiers. Redshift doesn’t support storage of UUIDs, so in Redshift they’ll have to be stored as varchar.
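For example, assuming a table orders with a uuid column id (both names are illustrative), the cast is straightforward:
select id::varchar as id from orders;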
Nomenclature — no change in column names from existing schema.
Array
Let’s say there’s a column which is an array of text values, and we want to break that array into many columns for the purpose of analysis. We must decide preemptively the maximum number of values this array can hold and create a column for each value; we can then get all the array elements stored as columns, but only up to that known maximum.
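A minimal sketch, assuming a table orders with a text[] column items and a cap of three elements (table, column and cap are all assumptions):
select
    items[1] as item_1,
    items[2] as item_2,
    items[3] as item_3,
    to_json(items) as items_json  -- keep the complete array as JSON, per the nomenclature below
from orders;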
If we don’t want to preemptively decide on the maximum number of elements in a given array, we’ll have to use unnest to create rows for every element of the array and store the elements of the array as records. unnest will return 5 rows if the array has 5 elements.
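For instance, with the same hypothetical orders table and items column:
select id as order_id, unnest(items) as item
from orders;
-- an order whose array holds 5 items produces 5 rows here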
Nomenclature — Say there is an array called items, and we want to get the first 3 items of that array as columns; the nomenclature will be item_1, item_2 and item_3. We will store the complete array after converting it to JSON.
JSON
We can use the json_to_record function to convert any given JSON to a record type. In this case, whenever there are additions to the JSON structure, the structure of the target transformed table will need to be changed. For instance, look at this sample JSON from the Razorpay payment gateway. Although the application will capture this in PostgreSQL’s native json data type, we will have to convert it to a more usable, queryable format.
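A minimal sketch, assuming a payments table with a json column payload holding a Razorpay-style payment object; the column definition list is illustrative and has to mirror the actual keys:
select p.id, r.*
from payments p
cross join lateral json_to_record(p.payload)
    as r(payment_id text, amount numeric, currency text, status text);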
This will give us a record type which we can then insert into a new structured table where querying becomes easier. Although Amazon Redshift supports JSON, it is not recommended for large data sets, because storing so much information in a single column defeats the purpose of using a columnar database.
Nomenclature — See example 1.
Casing
Upper and lower cases are achieved by using native PostgreSQL functions. Camel or snake casing in data won’t be of any use in data warehousing, as far as we can tell.
Nomenclature — if upper(), lower() or another casing function is applied to a column, a suffix will be added to the column name; first_name will become first_name_lower.
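For instance, assuming a customers table (name is illustrative):
select first_name, lower(first_name) as first_name_lower
from customers;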
Temporal
To use all the temporal data types as dimensions in the data warehouse, we need to break them up into smaller usable units, dates and seconds of day, so that they can be queried against the date dimension and the time dimension.
The date dimension table will contain the date_key in the format yyyymmdd as prescribed by ISO-8601. Both of these dimensions will be pre-filled with data. To ensure that all the source tables can use these dimension tables with a simple JOIN, we need to have temporal data in these formats. For this too, we will create a couple of PL/pgSQL functions.
Timestamp to Date
The date converted by this function will be returned as an integer and stored as such, matching the primary key of the date dimension table. This key will act as a role-playing dimension.
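A minimal sketch of such a function; the name is an assumption, the only requirement being that it returns the yyyymmdd integer key:
create or replace function timestamp_to_date_key(ts timestamp)
returns integer
language plpgsql
immutable
as $$
begin
    -- e.g. 2023-04-05 12:34:56 -> 20230405
    return to_char(ts, 'YYYYMMDD')::integer;
end;
$$;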
Time to Seconds of Day
The time returned by this function will be an integer and will be stored as such, matching the primary key of the time dimension. This key will also act as a role-playing dimension.
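A corresponding sketch, again with an assumed name:
create or replace function timestamp_to_seconds_of_day(ts timestamp)
returns integer
language plpgsql
immutable
as $$
begin
    -- e.g. 12:34:56 -> 45296
    return extract(epoch from ts::time)::integer;
end;
$$;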
Nomenclature — created_at will remain intact, but there will be two additional columns added to the table: created_at_date and created_at_time.
Spatial
Because we’ll be using PostGIS for storing all location data in the application database, and our data warehouse doesn’t support PostGIS data types, we’ll have to extract latitude and longitude from the points stored in the database. All location related data will be stored as double precision floating point in Redshift.
Point
To create a point from a longitude-latitude pair, we do the following,
SELECT ST_MakePoint(-71.1043443253471, 42.3150676015829);
which gives us the following output
0101000000E538D293ADC651C0F3699A2254284540
To extract latitude & longitude back from this string representation of a point, we’ll have to do the following
SELECT ST_X('0101000000E538D293ADC651C0F3699A2254284540'::geometry) longitude, ST_Y('0101000000E538D293ADC651C0F3699A2254284540'::geometry) latitude;
which gives us back the original latitude & longitude. Conversions for Linestring will also work similarly. Here’s an example of getting all the nodes from a Linestring.
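A minimal sketch using ST_DumpPoints, assuming a routes table with a geometry(LineString) column route_geom (both names are illustrative):
select r.id,
       dp.path[1] as point_index,
       ST_X(dp.geom) as longitude,
       ST_Y(dp.geom) as latitude
from routes r
cross join lateral ST_DumpPoints(r.route_geom) as dp;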
Other shapes and forms used in PostGIS will be handled similarly.
Nomenclature — the original string that represents a point or a linestring will remain. In case of a point, two new columns will be added to the table: if the column is order_location, the two new columns will be order_location_latitude and order_location_longitude.
Examples
Example 1
In this example, this is how we would handle the following issues:
- When there are more items in the array order_items than we have columns for, we will store order_items as JSON and load it to Redshift. We will add a new column item_name_4, and so on, if required. It will be a manual effort.
- When there are changes (additions, or changes in names of keys) in a JSON column, we’ll again have to manually create a column. If the schema changes and there’s no manual change, the data of the JSON will still be queryable in Redshift — see the reference on JSON support in Redshift.
- Use unnest only when you are sure you want to duplicate the whole row, or create a separate table with the unnested values and a reference to the original table (see the sketch after this list).
- enum types should be stored as varchar in Redshift.
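A sketch of that separate-table approach, assuming orders(id, order_items text[]):
create table order_items_unnested as
select o.id as order_id,
       unnest(o.order_items) as item_name
from orders o;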
References
- Multi-byte Character Load Errors in Redshift
- ACCEPTINVCHARS in Redshift
- Supported Data Types in Redshift
- Making Lines from Points
- Array Functions in PostgreSQL
- Sample API in Razorpay
- JSON Functions in Redshift
- PostGIS — ST_DumpPoints