Thursday, March 28, 2013

Clustered Tables vs Heap Tables



Problem
One very important design aspect when creating a new table is the decision to create or not create a clustered index.  A table that does not have a clustered index is referred to as a HEAP and a table that has a clustered index is referred to as a clustered table.  A clustered table provides a few benefits over a heap such as physically storing the data based on the clustered index, the ability to use the index to find the rows quickly and the ability to reorganize the data by rebuilding the clustered index.  Depending on the INSERT, UPDATE and DELETE activity against your tables your physical data can become very fragmented.  This fragmentation can lead to wasted space in your database, because of partly full pages as well as the need to read several more pages in order to satisfy the query.  So what can be done?
Solution
The primary issue that we want to address is the fragmentation that occurs with normal database activity.  Depending on whether your table has a clustered index or not will determine if you can easily address the fragmentation problem down to the physical data level.  Because a heap or a clustered index determines the physical storage of your table data, there can only be one of these per table.  So a table can either have one heap or one clustered index.
Let's take a look at the differences between a heap and clustered table.
HEAP
  • Data is not stored in any particular order
  • Specific data can not be retrieved quickly, unless there are also non-clustered indexes
  • Data pages are not linked, so sequential access needs to refer back to the index allocation map (IAM) pages
  • Since there is no clustered index, additional time is not needed to maintain the index
  • Since there is no clustered index, there is not the need for additional space to store the clustered index tree
  • These tables have a index_id value of 0 in the sys.indexes catalog view

Clustered Table
  • Data is stored in order based on the clustered index key
  • Data can be retrieved quickly based on the clustered index key, if the query uses the indexed columns
  • Data pages are linked for faster sequential access
  • Additional time is needed to maintain clustered index based on INSERTS, UPDATES and DELETES
  • Additional space is needed to store clustered index tree
  • These tables have a index_id value of 1 in the sys.indexes catalog view
source: SQL Server 2005 books online
So based on the above you can see there are a few fundamental differences on whether a table has a clustered index or not.
Fragmentation
A problem that occurs on all tables is the issue of becoming fragmented.  Depending on the activity performed such as DELETES, INSERTS and UPDATES, your heap tables and clustered tables can become fragmented.  A lot of this depends on the activity as well as the key values that are used for your clustered index. 
  • If your heap table only has INSERTS occurring, your table will not become fragmented, since only new data is written.
  • If your clustered index key is sequential, such as an identity value, and you only have INSERTS, again this will not become fragmented since the new data is always written at the end of the clustered index.
  • But if your table is either a heap or a clustered table and there are a lot of INSERTS, UPDATES and DELETES the data pages can become very fragmented.  This results in wasted space as well as additional data pages to read to satisfy the queries. 
    • When a table is created as a heap, SQL Server does not force where the new data pages are written.  Whenever new data is written this data is always written at the end of the table or on the next available page that is assigned to this table.  When data is deleted the space becomes free in the data pages, but it is not reused because new data is always written to the next available page.
    • With a clustered index, depending on the index key, new records may be written to existing pages where free space exists or there may be need to split a page into multiple pages in order to insert the new data.  When deletes occur the same issue occurs as with a heap, but this free space may be used again if data needs to be inserted into one of the existing pages that has free space.
    • So based on this, your heap table could become more fragmented then your clustered table.
Identifying Fragmentation
To identify whether your clustered table or heap table is fragmented you need to either run DBCC SHOWCONTIG (2000 or 2005) or use the new DMV sys.dm_db_index_physical_stats (2005).  These commands will give you insight into the fragmentation problems that may exist in your table.  For further information on this take a look at this past tip: SQL Server 2000 to 2005 Crosswalk - Database Fragmentation.
Resolving Fragmentation
Clustered Tables
Resolving the fragmentation for a clustered table can be done easily by rebuilding or reorganizing your clustered index.  

Heap Tables
For heap tables this is not as easy.  The following are different options you can take to resolve the fragmentation:
  1. Create a clustered index
  2. Create a new table and insert data from the heap table into the new table based on some sort order
  3. Export the data, truncate the table and import the data back into the table
Additional Info
When creating a new table via Enterprise Manager or Management Studio when you specify a primary key for the table, the management tools automatically make this a clustered index, but this can be overridden.  When creating a new table via scripts you need to identify that the table be created with a clustered index.   So based on this most of your tables are going to have a clustered index, because of the primary key, but if you do not specify a primary key or build a clustered index the data will be stored as a heap.

No comments:

Post a Comment