Execution engine in Query processor SQL Server

The Execution Engine is, at its heart, a collection of physical operators that perform the functions of the query processor, which is to execute your query in an efficient way. Or, to look at it from the other direction, these operations implemented by the Execution Engine define the choices available to the Query Optimizer when building execution plans.

A scan reads an entire structure, which could be a heap(Table scan), a clustered index(Clustered Index scan), or a non-clustered index(Index scan).A seek, on the other hand, does not scan an entire structure but, instead, efficiently retrieves rows from an index. Therefore seeks can only be performed on a clustered or non-clustered index(Clustered Index seek or index seek). Just to make the difference between these structures clear, a heap contains all of a table’s columns, and its data is not stored sorted in any particular order. Conversely, in a clustered index, the data is stored logically sorted by the clustering key and, in addition to the clustering key, the clustered index also contains the remaining columns of the table. On the other hand, a non-clustered index can be defined on a clustered index or a heap, and usually contains only a subset of the columns of the table.

Scanning:

Scanning heap — SELECT * FROM DatabaseLog

SELECT * FROM Person.Address

Both the Table Scan and Clustered Index Scan operations are similar in that they scan the entire base table, but the former is used for heaps and the second one for clustered indexes.

SELECT AddressID, City, StateProvinceID FROM Person.Address

Note that the Query Optimizer was able to solve this query without even accessing the base table Person.Address, and instead decided to scan the IX_Address_ AddressLine1_AddressLine2_City_StateProvinceID_PostalCode index, which comprises fewer pages. The index definition includes AddressLine1, AddressLine2, City, StateProvinceID and PostalCode, so it can clearly cover columns requested in the query. However, you may wonder where the index is getting the AddressID column from. When a non-clustered index is created on a table with a clustered index, each non-clustered index row also includes the table clustering key. This clustering key is used to find which record from the clustered index is referred to by the non-clustered index row . In this case, as I mentioned earlier, AddressID is the clustering key of the table and it is stored in every row of the non-clustered index, which is why the index was able to cover this column.

Seeking:

An Index Seek does not scan the entire index, but instead navigates the B-tree index structure to quickly find one or more records.A benefit of a Clustered Index Seek, compared to a non-clustered Index Seek, is that the former can cover any column of the table.

SELECT AddressID, City, StateProvinceID FROM Person.Address WHERE AddressID = 12037

SELECT AddressID, StateProvinceID FROM Person.Address WHERE StateProvinceID = 32

Above the base table was not used at all and it was not even necessary to scan the entire index: there is a non-clustered index on the StateProvinceID and, as mentioned previously, it also contains the clustering key AddressID.

Bookmark Lookup:

what happens if a non-clustered index is useful to quickly find one or more records, but does not cover the query.Query Optimizer has to decide if it is more efficient to both use the non-clustered index to find these records quickly and also access the base table to obtain the additional fields, or to just go straight to the base table and scan it.

SELECT AddressID, City, StateProvinceID, ModifiedDate FROM Person.Address WHERE StateProvinceID = 32

Query Optimizer is choosing the index IX_Address_ StateProvinceID to find the records quickly. However, because the index does not cover the additional columns, it also needs to use the base table (in this case the clustered index) to get that additional information. This operation is called a bookmark lookup, and it is performed by the Key Lookup operator.

SELECT AddressID, City, StateProvinceID, ModifiedDate FROM Person.Address WHERE StateProvinceID = 20

This time, the Query Optimizer estimated that more records could be returned than when StateProvinceID was equal to 32, and it decided that it was cheaper to do a Table Scan than to do many bookmark lookups.Well, since a bookmark lookup requires random I/O, which is very expensive, it would not take many records for the Query Optimizer to switch from a bookmark lookup to a Clustered Index Scan (or a Table Scan).

Since non-clustered indexes can exist on both heaps and clustered indexes, we can also have a bookmark lookup on a heap. To follow the next example, create an index on the DatabaseLog table, which is a heap, by running the following statement:

CREATE INDEX IX_Object ON DatabaseLog(Object)
SELECT * FROM DatabaseLog WHERE Object = ‘City’

Heaps do not have clustering keys like clustered indexes do, and instead they have row identifiers (RID). A RID is a row locator that includes information like the database file, page, and slot numbers to allow a specific record to be easily located.Every row in a non-clustered index created on a heap contains the RID of the corresponding heap record.

Aggregations:

Aggregations are used in databases to summarize information about some set of data.SQL Server has two operators to implement aggregations, Stream Aggregate and Hash Aggregate, and they can be used to solve queries with aggregation functions (like SUM, AVG or MAX), the GROUP BY clause, or the DISTINCT keyword.

Stream Aggregate and Merge Join, require data to be already sorted.To provide sorted data, the Query Optimizer may employ an existing index, or it may explicitly introduce a Sort operator.Hashing is used by the Hash Aggregate and Hash Join operators, both of which work by building a hash table in memory.

Stream Aggregate:

Queries using an aggregate function and no GROUP BY clause are called scalar aggregates, as they return a single value, and are always implemented by the Stream Aggregate operator.

SELECT AVG(ListPrice) FROM Production.Product

SELECT ProductLine, COUNT(*) FROM Production.Product GROUP BY ProductLine

A Stream Aggregate operator always requires its input to be sorted by the GROUP BY clause predicate so, in this case, the Sort operator shown in the plan will provide the data sorted by the ProductLine column.

Since the purpose of the Stream Aggregate operator is to aggregate values based on groups, its algorithm relies on the fact that its input is already sorted by the GROUP BY clause, and thus records from the same group are next to each other.

Hash Aggregate:

Hash Aggregate and Hash Join, which work in a similar way and are, in fact, implemented by the same physical operator: Hash Match.

The Query Optimizer can select a Hash Aggregate for big tables where the data is not sorted, there is no need to sort it, and its cardinality estimates only a few groups.The SalesOrderHeader table has no index on the ContactID column:

SELECT ContactID, COUNT(*) FROM Sales.SalesOrderHeader GROUP BY ContactID

A hash table is created in memory, and a hash value is calculated for each row processed and for each hash value calculated, the algorithm will check if the corresponding group already exists on the hash table and, if it does not, it will create a new entry for it. In this way, the values for each record are aggregated in this entry on the hash table, and only one row for each group is stored in memory.

Joins:

There are three join operators that SQL Server uses to implement logical joins: the Nested Loops Join, the Merge Join and the Hash Join.

Nested Loops Join:

SELECT e.EmployeeID FROM HumanResources.Employee AS e INNER JOIN Sales.SalesPerson AS s ON e.EmployeeID = s.SalesPersonID

The input shown at the top in a Nested Loops Join plan is known as the outer input and the one at the bottom is the inner input. The algorithm for the Nested Loops Join is very simple: the operator used to access the outer input is executed only once, and the operator used to access the inner input is executed once for every record that qualifies on the outer input.

The Query Optimizer is more likely to choose a Nested Loops Join when the outer input is small and the inner input has an index on the join key. This join type can be especially effective when the inner input is potentially large, as only a few rows, indicated by the outer input, will be searched.

Merge Join:

SELECT Name FROM Sales.Store AS S JOIN Sales.Customer AS C ON S.CustomerID = C.CustomerID WHERE C.CustomerType = N’S’

One difference between this and a Nested Loops Join is that, in a Merge Join, both input operators are executed only once.Another difference is that a Merge Join requires an equality operator and its inputs sorted on the join predicate. In this example, the join predicate has an equality operator, is using the CustomerID column, and both clustered indexes are ordered by CustomerID, which is their clustering key.

Taking benefit from the fact that both of its inputs are sorted on the join predicate, a Merge Join simultaneously reads a row from each input and compares them. If the rows match, they are returned.

If the inputs are not sorted, the Query Optimizer it is not likely to choose a Merge Join.

Hash Join:

SELECT pv.ProductID, v.VendorID, v.Name FROM Purchasing.ProductVendor pv JOIN Purchasing.Vendor v ON (pv.VendorID = v.VendorID) WHERE StandardPrice > $10

In the same way as the Merge Join, the Hash Join requires an equality operator on the join predicate but, unlike the Merge Join, it does not require its inputs to be sorted and also its operations in both inputs are executed only once.The Query Optimizer will use a cardinality estimation to detect the smaller of the two inputs, called the build input, and will use it to build a hash table in memory.After the build input is hashed, the second table, called the probe input, will be read and compared to the hash table. If rows are matched they will be returned.