TeraTeam UK Teradata DBA Definition
Skewing:
Data on a Teradata system is distributed across the AMPs solely on the basis of the primary index (using a hashing algorithm) . Therefore, if the primary index is a unique one, the data is distributed evenly.
Example, 100mb of data with a unique primary index on a 100 AMP system, with a unique Primary index would assign 1 mb of data to each AMP.
Teradata assigns space on the basis of the maximum space usage one AMP times the number of AMPs, so max space usage here is 1mb x 100 AMPs = 100mb of space for 100mb of data = perfect distribution. Each AMP is always assigned the same space for each table, regardless of use.
However, if the Primary index choice for a table is non-unique, for an extreme example, in error a user creates a table with a PI of ‘sex’ with a choice of only male and female then the data is distributed only across 2 AMPs of the (hypothetical) 100. Thus 50 mb each on 2 AMPs. Space usage is then calculated as the maximum on a single AMP (50mb - assuming a 50/50 split between M and F) x 100 AMPs = 5000mb of space, so the data takes up 50 times the space on the system as each AMP has 50mb of space assigned for this data.
When this happens the table is said to be skewed.
Skewing in Queries.
In joins, the data is distributed on the joining columns.
When joining 2 (or more) tables, in order to perform the join, Teradata must have the data on the same AMP.
If the query is joining the tables on identical PI values, the data is already on the same AMP and no redistribution takes place. This is extremely efficient.
However, when joining tables with different PI's, or joining on non-PI values, the data for at least one table must be redistributed on the joining columns in order to perform the join. This may done in Spool or Memory depending on the size of the tables (or subsets of table data) selected.
Therefore the uniqueness of the joining column values is now a deciding factor in Spool distribution.
Where tables or spool are distributed on non-unique values, leading to a handful of Amps having most of the data, this is referred to as being 'skewed'.
Queries on Skewed data will take proportionally longer to run - queries on data distributed evenly over 240 Amps will be 10 times quicker than data distributed evenly on 24 Amps. This is because the query is only as fast as the slowest Amp.
Skewed Spool is created when data is redistributed by the optimiser on Columns which contain skewed values.
95% of occasions when a User receives a message that his query has run out of spool, the cause is: Skewed data or Skewed Spool. Because Teradata calculates a users total spool requirement by examining the amp using the most spool and then multiplying this value by the number of Amps.
Example:
Table STAFF_DETAILS
Staff_Number (Unique Primary Index)
First Name (non- unique many values - some repeated frequently)
Last Name (non- unique many values - some repeated frequently)
Location (non- unique many values - some repeated frequently)
Country (Very Non-Unique, mostly 'UK' for the sake of argument)
Sex ( Extremely non-unique - only 2 values)
No_Dependants (Extremely non-unique)
The above table will have only one entry per Staff_Number and as this is the Unique PI, distribution will be excellent.
A join on 'Staff_number' to another table with a PI of Staff_Number will be very efficient.
A join on 'Last Name' will produce some skewing, with popular values such as 'Smith', 'Davies' or 'Patel' producing Amps with proportionally higher workloads. This may produce a Spool space error dependant on data volumes.
A join on 'Sex', having only 2 values will distribute on only 2 Amps and is almost certain to produce a spool error (unless there are very few rows involved). The exception being that if the joining table is very small (one row for each value perhaps then the small table will be duplicated on all AMPs, assuming up to date statistics exist which let the Optimiser know the correct numbers of values.
You may have already deduced therefore that a query that runs effectively on 100 Amp system may not work on a system with 200 Amps if the data IF the data is not distributed evenly. In the above example, distribution on the Surname 'Smith' may produce a peak Amp Usage of 150mb on a 240 Amps system where the user has 40GB (166 per AMP) and will just run. If more Amps are added, the spool is divided further and the 150mb peak may now therefore blow the users spool.



