Background and Motive
Data predominantly is core to any and every organisation and so does Data Engineering. Especially in financial/banking industry where data’s availability and its fast processing in timely manner is even more important, from different
perspective like customer retention, decision-making and regulatory requirements too.
An ask for sufficient processing power along with efficiently persisting exponentially increasing data has its challenge that are to be looked at.
Working with traditional RDBMS databases like Oracle /SQL Server/ DB2, etc might pose challenges when organisations and projects starts using humongous data (or BigData as you may call) or when they are in-parallel looking for data warehousing solutions.
Traditional RDBMS might be the appropriate choice when it comes to using them as OLTP systems but they start falling apart, when your use cases are among below or they change in due course to any of the below
- OLAP support
- dealing with terabytes of data
- historical analysis
- reading on millions of records in one go
- planning and predicting
- problem solving/decision support
- multi-dimensional view of different business functions
- periodic long running batch jobs using large amounts of data
This is where Teradata with its Massive Parallel Processing shines.
Below content, is my attempt to provide guidance, share learnings, talk through the use cases and encourage using the best practices around Teradata and its usage in world of data engineering and massive parallel processing.
The idea is not to talk about the features of Teradata, rather it is to pick major problem statements (around data) and respond/tackle them with best possible ways using Teradata.
To get the most out of this article. It is a prerequisite for the reader to have a basic understanding of RDBMS, way they work, and terminologies like throughput, latency, data consistency, parallelism, etc.
Different USE-CASES and their Addressment
1) What all should be considered if I need high throughput with my Teradata system?
Teradata works on Massively Parallel Processing (MPP) architecture. MPP essentially (as name suggests) try to break a big task into smaller chunks and run them parallelly leveraging the several independent yet collaborative processing units called
AMP. An AMP stands for Access Module Processor and it is an independent processing unit with its own independent storage and because these resources are NOT shared it is also called as Shared Nothing Architecture.
Apart from AMP and associated storage, there is another important component called the Parsing Engine and this is responsible to parse the query, produce explain plan and pass it on to the AMP for actual processing.
Visualize each AMP as bearer/owner of only a specific sub-set of data. Say your organisation deals in sales of 50 different products. Now you may choose to store your data of your products into a table having an Index defined on ProductName. Now the data
for each product will be assigned to each AMP in your Teradata system. Imagine you have Teradata setup with 50 AMP then each AMP will have only one Product stored. If you have in total only 25 AMPs then each AMP might store 2 distinct product information.
Now, in order to reap most of all this, it is important to formulate the queries in right way, and in turn have your index right. When doing numerous INSERTS/ UPDATES with massive data you would want to use allAMP together to achieve maximum parallelism
and speed. At the same time when you attempt to read data for a specific condition using SELECT query, you would want to traverse only a single AMP because reading through all AMP will will be a costly affair. And that’s where the Index column will help in
doing a pin pointed search going to a single AMP and bring back the result.
Refer below example where No ‘WHERE’ clause is specified in the SELECT query, and thus the optimizer will do a Full-Table scan and each and every row of the table is accessed (going through each AMP and each record with in). This can take from several
minutes to several hours depending upon the size of data you are dealing with.
Select * from Product
Now consider below query where Product table has an index on ProductName
This is going to give you a focus search- will go to a single AMP and bring back the result.
Select * from Product where ProductName =’Fishing-Equipment’
This is simplest of an example above (for ease of understanding) but the approach will remain same, no matter how complex the use case is.
Having said that, take a step forward and also consider a situation where you may reach to a single AMP using the ProductName =’Fishing-Equipment’ which can give you several years of historical data. But you might be interested in only a specific date (Purchase_Date).
This is where PPI (Partitioned Primary Index) will help out. Adding a PPI on date column to your Table along with a NUPI/UPI on ProductName will ensure the data with-in each AMP is further partitioned and ordered on date giving further
boost to performance. Remember PPI apart from giving improved performance, also help in getting rid of old data(purging) in a performant way because a whole partition is dropped instead of row by row deletion.
A PPI will look like as below:
PARTITION BY (RANGE_N
(PURCHASE_DATE BETWEEN DATE '2020-01-01' AND DATE '2020-12-31' EACH INTERVAL '1' DAY)
Precisely, to answer the question on ‘How to achieve throughput/performance/uniqueness/’, focus on your query and then go for right choice of index which should be governed by
a) data uniqueness and balancing the data in each AMP (for each value of that key) – A Unique Primary Index (UPI) will be very well balanced because every key will have only and only one row (unlike Non-Unique Primary index)
b) on a column or set of columns which are expected to be involved in search criteria (where clause) or in the joining conditions.
c) need of further data ordering with in an AMP along with a purging requirement (say based on date, month or year value). This calls for use of PPI.
2) My data is rapidly increasing I need extra space yet same throughput, How?
Teradata is highly scalable actually ‘Linearly scalable’. It means that the throughput of system is almost directly proportionate to the number of nodes. You can have thousands of nodes in your Teradata setup. For example, you can double the capacity
of the system by doubling the number of AMPs. Think of AMP as a processing unit with its own dedicated storage. When your data increases you increase AMP and the data might either re-distribute or simply new data gets into newly created AMP depending upon
the Index keys.
So, to make the most of all the AMP proper distribution of data is essential and thus a thoughtful choice of Index keys (UPI or NUPI as the condition may require). We are living in era where businesses change rapidly and data grows exponentially, Teradata
comes with flexibility of horizontal expansion for capacity and speed to respond to such scenarios.
3) I need to load millions of records in a table in few seconds, How?
Teradata’s provided Export/Import utility comes handy for loading and unloading data to and from it. It makes a good choice for smaller set of data which can span to few 10’s of thousands. But it will be too slow for larger datasets and it can even take
hours to load just 1 million records.
Now when you wanted to load a huge file containing millions of records and want to do it real fast. UseFASTLOAD as it does this in seconds.
Actually, instead of reading and importing record by record FASTLOAD works both by importing data in chunks (several records together) and by dividing & conquering (leverages all AMP together). Essentially it is a two phased process, in phase one
the input file is divided in chunks/blocks of data and given to AMP in parallel which stores and hashes them and then redistributes them to the designated AMP. In Phase 2, each AMP orders those record sets as per their row hash. This way data is loaded parallelly
and stored in an ordered fashion.
Here is a LIVE example along with the script and output screen showing how FASTLOAD internally works and how it was able to load 1 million records in just 6 seconds
In my script below, I am connecting to my server with credentials, specifying the input file type (comma separated in this case), defining the fields in file, file path and finally the Teradata table for import.
Script for FASTLOAD
.LOGON mymachine /userid1,Password12;
.SET RECORD VARTEXT ",";
FILE = "C:\Users\CG\Desktop\1million_records.txt";
INSERT INTO MY_DATABASE.TEST_DATA_IMPORT
Notice below in the First Phase, the record chunks are being sent to individual AMP for processing. The value of 50,000 is checkpoint specified in the script above which lets Teradata to log an entry after every 50,000 records are processed.
This helps in resuming from the logged point in case of failures.
In the Second Phase:
The data which was collected and redistributed (as part of Phase 1) is now committed in AMP in an ordered fashion as per the hash row value.
Notice above that the 1 million record set is loaded in 6 seconds using Fastload. I tried the same load using export/import utility and I have to cancel the load after 30 minutes (which could only load 45K records).
Very similar to FASTLOAD Teradata also provide MULTILOAD which works the same way while it is also capable of orchestrating load of data for multiple files in parallel. If you have a use case to frequently load multiple files into Teradata
go with this one.
To export the data from Teradata tables into an output file with high speed use FASTEXPORT. This too works on the principle of working on record blocks instead of record by record export.
4) Can I further tune my database for performance?
The answer is Yes. You can tune it in many ways, if you know well enough- your data
, metadata and how it is expected to be processed.
a) Drive your table joins simplistically
To make the most of the parallelism (using all AMP and improve performance the optimiser creates an execution plan. If the scripts and the join in them do not have a simple flow then the optimiser gets confused and could not come up with an optimised execution
plan. I have seen some queries taking several hours which could have taken couple of minutes or even seconds, had the query structure is fluid. Let me take example of 4 tables
Table A (having 200 million records)
Table B (100 million records)
Table C (100 thousand records)
Table D (1000 records)
‘A’ joins with ‘B’ then ‘B’ joins to ‘C’ and then ‘C’ to ‘D’, this is a very fluid /simple flow of joins and optimiser will select ‘A’ as the DRIVING table and produce an optimal plan.
A --> B --> C --> D
This is going to be super quick.
Now look at this where A additionally has some dependency with X and Y
Table X (having 50,000 records)
Table Y (having 100 records)
Join flow is like this in the query
A --> B --> C --> D
A --> X --> Y
At this point Optimiser might treat A as driving table but cannot come up with an optimal plan, because it confuses how to distribute data processing to AMPs.
I have this similar issue and it was crazily taking hours to run this query.
An ideal way is to join A, X and Y together and create a separate table (say AXY) with data.
Now use this new table as the driving table to do remaining work, such that the join flow looks as below
AXY -> B -> C -> D
It was a relief to see this query running in less than 4 minutes which earlier was not even completing in 4 hours. Small change big difference.
You can come across similar thing and if you see degraded performance, look back to your flow and try to figure out the root cause.
b) Usage of Multiset tables
If you know for sure that your data uniqueness is maintained and duplicate check is part of your Application/script you are sorted and you should go for MULTSET tables instead of SET tables. These are nothing but metadata of table structure which
tell the Teradata optimiser the treatment for records.
So, if your table is defined as a SET type, Optimiser will ensure that duplicate records are not accidentally inserted and for this reason, optimiser has to trade off performance for data uniqueness. Optimiser will do a check every-time, to see, whether
or not, a particular operation is causing duplicate insertion.
This small precaution of using MULTISET tables(when you know no duplicate will ever come in) will make a huge difference and definitely will bring a performance boost.
c) Use Spool Space judiciously
Whenever your scripts/queries are creating big tables using ‘AS Select’ (with data) clause, it is better to change it and rather do it in two steps first by creating a blank table and then doing an Insert. Similarly, when you are Updating some very large
table which is getting time consuming and resource intensive. It will be optimal to first delete the records and re-insert with all modifications required. In both these situations, much of the spool space can be used which can cause the process to be slow
or error out due to spool space crunch. Generally, in large organisations, Teradata is used as an Enterprise wide tool and can be shared between several departments. There-fore the usage of Spool space should be done judiciously to avoid space crunch.
Also bear in mind that Teradata distributes space to each AMP equally. So, if your data itself is imbalanced (may be due to improper choice of index keys), just because of one AMP getting full with space can cost you failure of your whole job/query or spool
d) Get rid of Staging tables
When not in use or no longer needed, keep Dropping all your volatile table and your temporary tables. Such tables can keep using both the spool space and permanent disk space.
e) Compress data for good
Teradata stores rows/records in blocks. And can also read data on block basis. When you compress and then store data, more records tend to fit in the same block size. This eventually helps in improving the query response time because ultimately lesser I/O
operations are performed to fetch same data.
You can provide compression command during the Table creation time or you can also do it later using a ALTER table command.
f) Operational: Collect those statistics and utilize Viewpoint
Last but not the least. Help the Teradata optimiser to help you. Execute COLLECT STATISTICS after INSERT operation or before running any script which is using multiple table and their joins. This collect command is basically gathering information
across Teradata system regarding the environment, the metadata, the data and its distribution. The optimiser uses this information to produce an optimal execution plan. Get into a habit of running COLLECT STATISTICS before big queries because
this do make a positive difference.
Teradata also comes with a tool called Viewpoint, this is very handy in monitoring various activities happening on the Teradata cluster at a given point in time. It also allows user to look at the Explain plan, Query Band, analysing delayed queries, etc.
This is also used as an Operational tool to go back in time and look at the queries / activities (using ‘Rewind’) and ‘Calendar’ utility of Viewpoint allows reminders to be set for monthly/weekly maintenance activities. Again, a good habit to use it frequently.
If we understand the underlying working of the Teradata system and ensure we are both using it the right way and the right feature at the right place, it can bring out the real capabilities and give lightning speed to our Application data ingestion and consumption
patterns. I am sure if the above points are kept in mind while developing data model and read/write queries and SQL scripts around it, this will have ample positive improvement to your overall system’s performance and will also help reduce re-work/re-engineering
down the line.
I hope these experiences can be of help to you too.
I will try to come up with something more interesting in my next whitepaper/article/blog.
Please do share your comments/questions/likes/dislikes. This further motivates to keep writing.
Do connect me at https://www.linkedin.com/in/chandan-goel-12abcd/
- Chandan Goel (Data Architect @PublicisSapient)