Transformations in PostgreSQL
Standardizing data across the database for a Data Warehouse
This is a short summary of how to use PostgreSQL functions to transform data from an OLTP application into something that is useful for an OLAP data warehouse, especially taking into consideration all the non-standard SQL data types supported by PostgreSQL.
The idea is to write a functions library, transformation_utils, to transform data in PostgreSQL. The process of transformation will include the following:
- changing one data type to another, e.g., timestamp to date
- unnesting nested data types to store data in separate columns and exploit the columnar nature of Redshift
- cleaning up strings containing characters not accepted by Redshift
These transformations will not update existing columns; they will write data to new columns in the same table. All the original columns from the source system will remain as they are.
Numeric
Numeric data types won’t require much transformation, as Redshift supports almost all of PostgreSQL’s numeric data types.
Nomenclature — no change in column names from existing schema.
String
Amazon Redshift doesn’t accept certain characters, so we will have to write functions to remove them (or replace them with acceptable characters) before the data is loaded into Redshift; otherwise, the loading job will fail.
Removing Invalid Characters
One example of a character range not supported by Redshift is 0xD800–0xDB7F (High Surrogates).
select replace('There is a � bad character here','�',' ');
select encode('�','hex');
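A minimal sketch of such a cleanup function, assuming we strip control characters and the Unicode replacement character; the function name and the exact character class are assumptions and should be adjusted to whatever Redshift actually rejects:
create or replace function clean_string(input text)
returns text
language sql
immutable
as $$
    -- replace control characters and U+FFFD with a space; extend the class as needed
    select regexp_replace(input, '[\u0001-\u0008\u000B\u000C\u000E-\u001F\uFFFD]', ' ', 'g');
$$;
select clean_string('There is a � bad character here');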
If writing a function doesn’t work out, the last resort will be using the ACCEPTINVCHARS option while writing the data frame from the Glue job to Redshift. With that option, each invalid character is replaced with a replacement character, which we can also specify. In the case of a multi-byte invalid character, the replacement character is repeated as many times as there are bytes in the original unaccepted character.
Nomenclature — no change in column names from the existing schema. Also, note that this is the one case where we don’t have to keep the original column, as it is not accepted by Redshift.
UUID
These are 128-bit universally unique identifiers. Redshift doesn’t support storage of UUIDs, so in Redshift they’ll have to be stored as varchar.
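For example, assuming a table orders with a uuid column id (both names are illustrative), the cast is straightforward:
select id::varchar as id from orders;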
Nomenclature — no change in column names from existing schema.
Array
Let’s say there’s a column which is an array of text values, and we want to break that array into many columns for the purpose of analysis. We must decide preemptively the maximum number of values this array can hold and create a column for each value; we can then get all the array elements stored as columns, but only up to that known maximum.
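A minimal sketch, assuming a table orders with a text[] column items and a cap of three elements (table, column and cap are all assumptions):
select
    items[1] as item_1,
    items[2] as item_2,
    items[3] as item_3,
    to_json(items) as items_json  -- keep the complete array as JSON, per the nomenclature below
from orders;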
If we don’t want to preemptively decide on the maximum number of elements in a given array, we’ll have to use unnest to create rows for every element of the array and store the elements of the array as records. unnest will return 5 rows if the array has 5 elements.
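For instance, with the same hypothetical orders table and items column:
select id as order_id, unnest(items) as item
from orders;
-- an order whose array holds 5 items produces 5 rows here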
Nomenclature — Say there is an array called items, and we want to get the first 3 items of that array as columns; the nomenclature will be item_1, item_2 and item_3. We will store the complete array after converting it to JSON.
JSON
We can use the json_to_record function to convert any given JSON to a record type. In this case, whenever there are additions to the JSON structure, the structure of the target transformed table will need to be changed. For instance, look at this sample JSON from the Razorpay payment gateway. Although the application will capture this in PostgreSQL’s native json data type, we will have to convert it to a more usable, queryable format.
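A minimal sketch, assuming a payments table with a json column payload holding a Razorpay-style payment object; the column definition list is illustrative and has to mirror the actual keys:
select p.id, r.*
from payments p
cross join lateral json_to_record(p.payload)
    as r(payment_id text, amount numeric, currency text, status text);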
This will give us a record type which we can then insert into a new structured table where querying becomes easier. Although Amazon Redshift supports JSON, it is not recommended for large data sets, because storing so much information in a single column defeats the purpose of using a columnar database.
Nomenclature — See example 1.
Casing
Upper and lower cases are achieved by using native PostgreSQL functions. Camel or snake casing in data won’t be of any use in data warehousing, as far as we can tell.
Nomenclature — if upper(), lower() or another casing function is applied to a column, a suffix will be added to the column name; first_name will become first_name_lower.
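For instance, assuming a customers table (name is illustrative):
select first_name, lower(first_name) as first_name_lower
from customers;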
Temporal
To use all the temporal data types as dimensions in the data warehouse, we need to break them up into smaller usable units, dates and seconds of day, so that they can be queried against the date dimension and the time dimension.
The date dimension table will contain the date_key in the format yyyymmdd as prescribed by ISO-8601. Both of these dimensions will be pre-filled with data. To ensure that all the source tables can use these dimension tables with a simple JOIN, we need to have temporal data in these formats. For this too, we will create a couple of PL/pgSQL functions.
Timestamp to Date
The date converted by this function will be returned as an integer and stored as such, matching the primary key of the date dimension table. This key will act as a role-playing dimension.
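A minimal sketch of such a function; the name is an assumption, the only requirement being that it returns the yyyymmdd integer key:
create or replace function timestamp_to_date_key(ts timestamp)
returns integer
language plpgsql
immutable
as $$
begin
    -- e.g. 2023-04-05 12:34:56 -> 20230405
    return to_char(ts, 'YYYYMMDD')::integer;
end;
$$;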
Time to Seconds of Day
The time returned by this function will be an integer and will be stored as such, matching the primary key of the time dimension. This key will also act as a role-playing dimension.
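A corresponding sketch, again with an assumed name:
create or replace function timestamp_to_seconds_of_day(ts timestamp)
returns integer
language plpgsql
immutable
as $$
begin
    -- e.g. 12:34:56 -> 45296
    return extract(epoch from ts::time)::integer;
end;
$$;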
Nomenclature — created_at will remain intact, but there will be two additional columns added to the table: created_at_date and created_at_time.
Spatial
Because we’ll be using PostGIS for storing all location data in the application database, and our data warehouse doesn’t support PostGIS data types, we’ll have to extract latitude and longitude from the points stored in the database. All location related data will be stored as double precision floating point in Redshift.
Point
To create a point from a longitude-latitude pair, we do the following,
SELECT ST_MakePoint(-71.1043443253471, 42.3150676015829);
which gives us the following output
0101000000E538D293ADC651C0F3699A2254284540
To extract latitude & longitude back from this string representation of a point, we’ll have to do the following
SELECT ST_X('0101000000E538D293ADC651C0F3699A2254284540'::geometry) longitude, ST_Y('0101000000E538D293ADC651C0F3699A2254284540'::geometry) latitude;
which gives us back the original latitude & longitude. Conversions for Linestring will also work similarly. Here’s an example of getting all the nodes from a Linestring.
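A minimal sketch using ST_DumpPoints, assuming a routes table with a geometry(LineString) column route_geom (both names are illustrative):
select r.id,
       dp.path[1] as point_index,
       ST_X(dp.geom) as longitude,
       ST_Y(dp.geom) as latitude
from routes r
cross join lateral ST_DumpPoints(r.route_geom) as dp;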
Other shapes and forms used in PostGIS will be handled similarly.
Nomenclature — the original string that represents a point or a linestring will remain. In case of a point, two new columns will be added to the table: if the column is order_location, the two new columns will be order_location_latitude and order_location_longitude.
Examples
Example 1
In this example, this is how we would handle the following issues:
- When there are more items in the array order_items than we have columns for, we will store order_items as JSON and load it to Redshift. We will add a new column item_name_4, and so on, if required. It will be a manual effort.
- When there are changes (additions, or changes in names of keys) in a JSON column, we’ll again have to manually create a column. If the schema changes and there’s no manual change, the data of the JSON will still be queryable in Redshift — see the reference on JSON support in Redshift.
- Use unnest only when you are sure you want to duplicate the whole row, or create a separate table with the unnested values and a reference to the original table (see the sketch after this list).
- enum types should be stored as varchar in Redshift.
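A sketch of that separate-table approach, assuming orders(id, order_items text[]):
create table order_items_unnested as
select o.id as order_id,
       unnest(o.order_items) as item_name
from orders o;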
References
- Multi-byte Character Load Errors in Redshift
- ACCEPTINVCHARS in Redshift
- Supported Data Types in Redshift
- Making Lines from Points
- Array Functions in PostgreSQL
- Sample API in Razorpay
- JSON Functions in Redshift
- PostGIS — ST_DumpPoints