SQL and more with KRUTI

Thursday, 7 April 2016

All about Searching - Full-Text Indexes

SQL server has the capability to store text-based data in the form of char, varchar, nchar, nvarchar, text, ntext, image, xml, or varbinary(max). When querying the data stored in such columns the “LIKE” clause would be highly used for pattern matching.

Consider the below table

CREATE TABLE [dbo].[Books](

[id] [int] IDENTITY(1,1) NOT NULL,

[Author] [varchar](20) NULL,

[Title] [varchar](100) NULL,

[Published_yr] [date] NOT NULL,

[Synopsis] [nvarchar](max) NULL,

CONSTRAINT [pk_id] PRIMARY KEY CLUSTERED

(

[id] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

SET ANSI_PADDING OFF

SQL script for the table
dbo.Books.sql
It would be easier if we were filtering the rows based on authors something like this

SELECT * FROM BOOKS WHERE Author LIKE 'Nicholas%'

However consider the scenario where I wish to check if Jeremy is one of the characters in the books based on the synopsis given. My query would look something like this

SELECT * FROM BOOKS WHERE Synopsis LIKE '%Jeremy%'

It will work fine doing a table scans for the pattern match. However consider if this was a library management system consisting of millions of records. In such a scenario performing a table scan over a table is not a wise option. For such situations SQL server provides a functionality called full-text search. SQL Server's full-text search engine gives you fast queries and advanced pattern matching in an enterprise environment. It works against text data in full-text indexes by operating on words and phrases based on rules of a particular language such as English.

To understand its architecture and how it works we need to familiar ourselves with its terminologies.
https://technet.microsoft.com/en-us/library/ms142581(v=sql.105).aspx

Full-Text Terminologies

1) Full-Text Indexes

A special token-based index built and maintained by the Full Text Engine for SQL server. It is used to track occurrences of words or word-forms in columns containing unstructured text. Full-Text index structures are different from normal B-tree index structures. It’s an inverted, stacked, compressed index structure based on individual tokens from the text being indexed. Only one Full-Text index can be created per table. It does require a unique key to be set for each row for the given table. It helps to keep this key as small as possible. This could be a primary key. This structure is elaborated in the following link
https://technet.microsoft.com/en-us/library/ms142505(v=sql.105).aspx

2) Full-Text Catalogs

It’s a logical container for Full-Text indexes. A database can have one or more full-text catalogs. The indexes are aggregated as a collection in terms of catalog purely for administration and maintenance purpose. It is used to set common property values for all the indexes it stores.

3) Word Stemmers

For any language like English we have certain words which are considered as base or root words. From these new-words or its conjugates can be formed. Consider the word “Eat” from this we can derive words like “eating, will eat, shall eat, ate, etc” these are its derived forms. Such words are called stemmers. Identifying the word stemmers comes in handy for pattern searching while using full-text indexes.

4) Word Breakers

Any characters which delimit sentences or phrases for the language are called word breakers (Spaces Excluded). They are particular for each supported language in SQL server. Once identified further action can be taken in building the FTS index or processing the query.

5) StopWords and StopList

Every language has certain words like conjunctions, pro-nouns which provide no meaningful benefit when used in search criteria’s. For example words like “the, are, and, to,etc” are commonly discarded for full-text indexes. These words are called stopwords. By excluding these words the Full-Text indexes become more efficient. The indexes however do take into account the position of such words when encountered. As the name suggests StopList is a list of Stop words. SQL server lets you create & customize your own StopLists.

https://technet.microsoft.com/en-us/library/ms142551(v=sql.105).aspx

6) Population / Crawl Process

Unlike standard SQL Server indexes that are automatically maintained during data modifications full-text indexes aren’t repopulated. Creating and maintaining a full-text index involves populating the index by using a process called a population (also known as a crawl). There are 3 modes of population

A) Full Population - As the name suggests builds full-text indexes for all the rows of the table. Similar to that of a standard index rebuild operation. The default behavior of SQL server is to populate a new full-text index fully as soon as it is created.

B) Change Tracking-Based Population /Update Population This uses change tracking functionality of SQL server. SQL server maintains a table which tracks modifications made to the index after its initial full population. Based on CHANGE_TRACKING property of the index whether set to AUTO or MANUAL the indexes are populated.Default behavior is Automatic Change_Tracking.

C) Incremental Timestamp-Based Population - This requires that the indexed table has a column of timestamp data type. If such a column does not exists SQL server will treat it as a full population operation. If the column exists SQL Server will update the modified rows based on the value of the time-stamped column.

https://technet.microsoft.com/en-us/library/ms142575(v=sql.105).aspx

7) Thesaurus

Thesaurus comprises of synonyms or word expansions for any given word. Each language has its own set of defined synonyms. This is an XML file stored in file system. This broadens our search criteria to find similar words at querying time.

Thesaurus matching occurs only for CONTAINS and CONTAINSTABLE queries that specify the FORMSOF THESAURUS clause and for FREETEXT and FREETEXTABLE queries. The default location for the thesaurus files is as follows

<SQL_Server_data_files_path>\<VersionSpecificInstance>\MSSQL\FTDATA\

C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\FTData

Beginning in SQL Server 2008, full-text search supports more than 50 diverse languages, such as English, Spanish, Chinese, Japanese, Arabic, Bengali, and Hindi. Once can check the list of supported full-text languages using the following query

SELECT * FROM sys.fulltext_languages

There are a list of system tables corresponding to fulltext which you can view for each of the above described components

Full-Text Index Architecture

1) Client application submits a query containing Full-Text search contents

2) Query is passed to the SQL server engine comprising of SQL Server Query Processor and Full-Text Engine along with various other components

3) The Full-Text searches are sent to Full-Text Engine during both the stage Compilation and Execution. SQL Server Query Processor is responsible for parsing, binding, optimization and execution.The Full-Text Engine receives the Full-Text part of the queries from the query processor. It works in conjunction with the SQL server query processor.

4) In order to full the requests a Full-Text engine calls for the Indexer. The indexer is responsible for evaluating the StopList and populating Full-Text Indexes which form a part of the database from SQL server 2008 onwards. Prior to this they were stored separately on file systems. StopList will also contain stopwords which are ignored during query evaluation.

5) Thesaurus is used during Full-Text query compilation and execution. Its external to the SQL server database. This file is also used by the Filer Daemon Launcher Service which is again external to the SQLservr.Exe

6) Filter Daemon Manager is responsible for monitoring the Filer Daemon Launcher Service or Host Service.Its external to the SQL Server Engine.Inside SQL server engine the user tables are evaluated during full-text index population or crawl process by the indexer. This data is also used by the Filter Daemon Manager process which passes this information to the Host Service.

7) The Filter Daemon Host is responsible for accessing, filtering, and word breaking data from tables, as well as for word breaking and stemming the query input: It comprises of

a) The protocol handler which is responsible for pulling SQL server data from memory and passing the database data to the filter.

b) The filter extracts chunks of data from the input provided by protocol handler. This is dependent on the document type. The embedded formatting is removed by the filter retaining only the textual contents and their position information. This result-set is then passed to the work-breakers.
c)The word-breakers delimit the result-set. These word-breakers are also passed to the indexer by the Filter Daemon Launcher Service. Just like the stoplist the word-breaker are ignored

8) During indexing process the indexer uses all of the above result-set namely

Tokenized daata from full-text crawl threads
Delimited result-set from the Filter Daemon Host Service
Stop-list words that are to be excluded

It then creates index-able words to inverted index fragments. All these fragments are then merged to form the full-text indexes for the tables

9) During the querying process the Full-Text Engine uses the full-text indexes generated by the indexer. The word-breakers and thesaurus are first referenced to generate all forms of query predicate. The query processor then looks-up for all these forms in the full-text indexes. The Full-Text Engine is also responsible for the optimization process much like a standard query optimization process to retrieve the data.

10) The SQL query execution results and Full-Text query execution results are combined and the final result-set is sent across the client.

In the next blog we shall see how to create Full-Text Catalogs,Indexes and how to use full-text index queries.

Tuesday, 15 March 2016

Memory Fundamentals for SQL Server - References

Book References

https://books.google.co.in/books/about/Professional_SQL_Server_2012_Internals_a.html?id=-TjHHnP8ivEC

https://books.google.co.in/books/about/Microsoft_SQL_Server_2008_Internals.html?id=NJUpAQAAMAAJ&redir_esc=y

https://books.google.co.in/books/about/Microsoft_SQL_Server_2012_Internals.html?id=wK1CAwAAQBAJ&redir_esc=y

Topic Wise Blogs

Keep Learning, Keep Exploring, Keep Experimenting

P.S. ( Not on Production directly )

Monday, 14 March 2016

Memory Fundamentals for SQL Server - Buffer Pool, Mem To Leave

In the Previous blogs we saw about SQL server logical tree and the various memory caches.

http://sqlandmorewithkruti.blogspot.in/2016/02/memory-fundamentals-for-sql-server-sql.html
In this blog we shall see about Buffer Pool and Mem To Leave a little more in detail.

The Buffer Pool

Buffer Pool comprises of a major component for SQL server memory. It comprises of various chunks of memory spread across User Mode of Virtual Address Space. Apart from the sections mentioned above any temporary memory needed by SQL server to perform hash or sort operations for a given query come from buffer pool. If the existing memory is insufficient SQL server may also use tempdb for the same.

Thread Space Calculation

We know that SQL server executes tasks in terms of threads.
Microsoft suggests the below tables for configuring maximum number of worker threads
https://msdn.microsoft.com/en-us/library/ms190219.aspx

Number of CPUs	32-bit computer	64-bit computer
<= 4 processors	256	512
8 processors	288	576
16 processors	352	704
32 processors	480	960
64 processors	736	1472
128 processors	4224	4480
256 processors	8320	8576

StackSize is the space required to create each thread stack. For each thread On x86 (32-bit) its 0.5MB, on x64 (64-bit) its 2MB, and on IA64 its 4MB.

Determining the Maximum Potential for Non-Buffer Pool Usage based on No. of Threads

This option is the most popular, as the aim is to calculate the worst-case scenario for memory requirements other than SQL Server’s buffer pool. You should allow the following:

➤ 2GB for Windows
➤ xGB for SQL Server worker threads. You can figure out how many threads your instance will configure using Table shown earlier. Each thread will use 0.5MB on x86, 2MB on x64,
and 4MB on Itanium.
➤ 1GB for multi-page allocations, linked servers, and other consumers of memory outside the buffer pool. See the “Reserved Memory, or MemToLeave” section for more details and other consumers.
➤ 1–3GB for other applications that might be running on the system, such as backup programs.
For example, on a server with eight CPU cores and 16GB of RAM running SQL Server 2008 x64 and a third-party backup utility, you would allow the following:

➤ 2GB for Windows
➤ 1GB for worker threads (576 × 2MB rounded down)
➤ 1GB for MPAs, etc.
➤ 1GB for the backup program

The total is 5GB, and you would configure Max Server Memory to 11GB. This is just a rough estimate how memory for the buffer pool can be calculated.
You can view the existing threads within SQL server with the below query.

SELECT * FROM SYS.dm_os_threads

In SQL server terms a buffer is equivalent to a page in memory of size 8KB. A buffer is a page in memory that’s the same size as a data or index page. The main components comprising the data cache and plan cache. You can view the pages in SQL server buffer as follows

You can modify the query further to view the pages allocated per database.

SELECT CASE
database_id

WHEN 32767 THEN 'ResourceDb'

ELSE db_name(database_id)

END AS 'Database',page_type,

count(*) as "Page Count",

count(*)*8 AS 'Cached Size (KB)'

FROM sys.dm_os_buffer_descriptors

GROUP BY db_name(database_id) ,database_id, page_type

ORDER BY 'Cached Size (KB)' desc

Hashing

To access these data pages faster SQL server implements a technique called hashing. A hash table is structure in memory pointing to various buffer pages . This is implemented as a linked list with hash tables spanning across multiple hash pages. Given a dbid-fileno-pageno identifier (a combination of the database ID, fi le number, and page number), the hash function converts that key to the hash bucket, any data retrieved by SQL server is from pages within memory. If the page is not available in memory SQL server will fetch the same from the data file to any of the available buffers.When a read is requested, it gets a list of 64 pages in cache and the server checks whether the free buffer list is below a certain threshold.

BUF array

Bpool needs to maintain information regarding the pages allocated and their types in order to maintain the number of free buffers. It does so using 2 array structures. One stores the pointer to the start of each region in buffer pool. The other stores the count of 8KB pages that are reserved for this region. Each page in the BPool will have a corresponding BUF structure. Each page's BUF structure functions as a type of header for it. Each time a page is referenced, this reference count is incremented. SQL also maintains a free buffer list which contains the count of free pages.

LRU-k Algorithm

The amount of time that pages stay in cache is determined by a least recently used (LRU) policy. The header of each page (BUF structure) in cache stores details about the last two times it was accessed, and a periodic scan through the cache examines these values. A counter is maintained that is decremented if the page hasn’t been accessed for a while; and when SQL Server needs to free up some cache, the pages with the lowest counter are flushed first.

Commit Bitmap

Pages are first updated in memory only then written onto disk. These include data and log pages. SQL server maintains what is called a Commit Bitmap. As each page in the BPool is committed, its corresponding bit in the commit bitmap is set. The commit bitmap is also set when the reference count decreases as per the LRU-k algorithm. Once a page is dirty and successfully written to disk it is un-hashed and added to the free list. If a page's reference count reaches zero and the page is not dirty, it is simply freed—i.e., moved to the free list without writing anything to disk.

Lazy-writer

Lazywriter is a thread that runs within SQL server. One for each NUMA node. Its task is to scan the BufferPool to ensure enough number of free pages are available in the free buffer list. It also monitors the server memory. If the OS free physical memory drops below 5MB the lazywriter releases memory to the operating system instead of adding it to the free list. If more than 5 MB of physical memory is free, the lazywriter recommits memory to the buffer pool by adding it to the free list.

The lazywriter recommits memory to the buffer pool only when it repopulates the free list; a server at rest does not grow its buffer pool. Lazywriter also ensure that the minimum size of buffer pool is maintained before releasing memory back to operating system. SQL Server uses separate physical structures for a page and its header allows a page to be flushed to disk and to "move" from list to list without anything on the page actually changing. Once a page is written onto the disk and marked as free its contents can be over-written.

CheckPoint

Issuing a checkpoint performs similar functionality like lazywriter of scanning the buffer-pool for flushing out any dirty pages. Difference being it cannot move pages to the free list. Checkpoint can be fired manually to run issuing the CHECKPOINT command. Lazywriter thread wakes up on its own during memory pressure. Also all dirty data pages are written to disk by the checkpoint, not just pages corresponding to committed transactions. The job of the checkpoint is to keep the potential time needed to recover the database to a small value.

Ideally CheckPoint is issued every minute. However this value is set based on the Recovery Interval setting for a database instance.Trace flag 3502 is an undocumented trace flag that records in the error log when a checkpoint starts and stops.

https://msdn.microsoft.com/en-IN/library/ms188748.aspx

You can view the all of the Page Life Expentancy, Checkpoint, Lazywriter setting for a given database instance using perfmon as follows

MeM To Leave

As seen in the above diagram MemToLeave is used for Dll's,COM objects, extended stored procs, linked servers, distributed queries,SQL server code.The size of the MemToLeave region can be adjusted using the -g command line parameter. This may be required when more memory is needed for linked server queries or any other objects residing in MemToLeave region. On the other hand reducing this is not advisable but provides more virtual memory space for Buffer Pool region.

One may view the loaded modules through the below command.

SELECT * FROM SYS.dm_os_loaded_modules

Since MultiPage Allocations are a part of MemToLeave region,
Allocations needing contiguous memory blocks higher than 8KB come from this region. It could be procedure cache or data pages as well. However not necessary allocations needing lower than 8KB will always end up in buffer pool. It could also be found in MemToLeave region.

As stated earlier SQL server memory is a vast topic to be covered within a few blogs. But as promised I shall provide the references that may be used to study each of these topics in detail in the next blog.

Monday, 22 February 2016

Memory Fundamentals for SQL Server - SQL Server Memory Components, SQL Server Memory Logical Tree

In the Previous blogs we have seen about virtual memory related concepts.
http://sqlandmorewithkruti.blogspot.in/2016/01/memory-fundamentals-for-sql-server.html

We came across concepts like Address Window Extension, Physical Address Extension.

http://sqlandmorewithkruti.blogspot.in/2016/02/memory-fundamentals-for-sql-server-awe.html

We also saw the various CPU related architectural concepts and briefly touch based SQLOS.

http://sqlandmorewithkruti.blogspot.in/2016/02/memory-fundamentals-for-sql-server.html

In this blog we shall see how SQL server memory is divided.

When SQL server starts memory is first reserved for the MEM To Leave area also known as Virtual Address Space Reservation area. This is not user configurable. Windows Memory Management allows user process to reserve a contiguous block of address space without actually consuming committed pages. This is called VAS reservation or MEM to Leave. Components of the Mem To Leave area are mentioned above. All allocations for a contiguous memory block larger than 8KB come from the MemToLeave region. it's possible that allocations smaller than 8KB could end up coming from the MemToLeave region.

The default size of Mem To Leave region is 256MB but can be changed using the -g parameter during SQL server startup.

Next comes the BUFFER Pool

Calculation of BufferPool Region

We have the maximum server memory option for any SQL server instance. This option decides the amount of memory that will be allocated for the Buffer Pool region. Bpool consists of 32 separate memory regions organized into 8KB pages.

In 32-Bit Systems

Once the MemToLeave region is occupied the remaining Virtual Address Space is checked. The remainder of the VAS is calculated and reserved for SQL server. If this value is less than Max Server Configuration then that becomes the size of the Buffer Pool.

The calculation of the Buffer Pool size however differs if the AWE is enabled and /PAE switch is used. Here the physcial memory on the server or the max server memory configuration setting, whichever is smaller becomes the size of the Buffer Pool.

In 64-Bit Systems

There is no need to use AWE to allocate memory above 3GB for SQL Server since the user mode VAS is 8TB there is always ample VAS to utilize all of the physical memory available on the server.

For 64 bit servers, the maximum size of the BPool is the size of physical memory or the max server memory configuration, whichever is smaller. For example, a 64 bit server with 16GB RAM, and a default max server memory configuration, the maximum size of the Bpool would be 16GB.

SQL server memory logical level tree

The SQL Server memory manager has a three-level structure. At the bottom are memory nodes which are the lowest level allocators for SQL Server memory, the second level consists of memory clerks which are used to access the memory nodes and cache stores which are used for caching. The top layer contains memory objects which provide a smaller level of granularity than the memory clerks allow directly.

Memory Nodes
As described in the previous blog it’s a logical unit in which SQL server access memory.

Memory Clerks

Memory clerks access memory nodes to allocate memory to different objects. Information on different types of memory clerks is listed using the below query. Till SQL server version 2008 you will be able to see the segregation of the single pages allocated and multiple pages allocated for particular objects.

select distinct type from sys.dm_os_memory_clerks where
type like 'mem%' order by type.

Memory Caches

SQL Server uses three types of caching mechanism: object store, cache store, and user store.

Object Store :

Object stores are used to cache homogeneous types of stateless data. Each entry within this store can be used by only one client at a time. SQLOS does not control the lifetime of the entries in the object store type of cache. Hence it does not implement Least Recently Used (LRU) Algorithm for costing the entries and maintaining their lifetime. One example for use of an object store is the SNI, which uses the object store for pooling network buffers. No state, no cost , all buffers of the same size.

User Store:

User stores are used for objects that has its own storage user's mechanism. How the objects are stored depends on the underlying development framework. SQLOS controls the lifetime of entries and maintains a cost based on LRU algorithm. One example of User Store objects is metadata cache.

Cache Store:

For cache store the lifetime is fully controlled by SQLOS's caching framework. The storage mechanism is SQL OS managed. SQLOS does this with what is called the hashing technique wherein the entry for each object is stored in one or multiple tables. This supports for faster look-ups. SQLOS controls the lifetime of entries and maintains a cost based on LRU algorithm. Object Plans (CACHESTORE_OBJCP), SQL Plans (CACHESTORE_SQLCP), Bound Trees (CACHESTORE_PHDR), Extended Stored Procedures (CACHESTORE_XPROC) all fall within this cache store.

I recommend you go through the below blogs to study further.

http://blogs.msdn.com/b/slavao/archive/2005/03/18/398651.aspx
http://blogs.msdn.com/b/slavao/archive/2005/08/30/458036.aspx
http://blogs.msdn.com/b/slavao/archive/2005/02/19/376714.aspx
https://msdn.microsoft.com/en-us/library/cc293624.aspx

Thus SQL server memory gets divided into various stores. On the whole MemToLeave and BufferPool is what comprises the SQL server memory. In the next blog we shall see how SQL server responds to memory pressure. How the memory caches are accounted for based on LRU & hashing algorithms.

SQL and more with KRUTI

PageViews Last Month

Thursday, 7 April 2016

All about Searching - Full-Text Indexes

Full-Text Terminologies

Full-Text Index Architecture

Tuesday, 15 March 2016

Memory Fundamentals for SQL Server - References

Book References

Topic Wise Blogs

P.S. ( Not on Production directly )

Monday, 14 March 2016

Memory Fundamentals for SQL Server - Buffer Pool, Mem To Leave

The Buffer Pool

MeM To Leave

Monday, 22 February 2016

Memory Fundamentals for SQL Server - SQL Server Memory Components, SQL Server Memory Logical Tree

SQL server memory logical level tree

Popular Posts

Contact Form

PageViews Last Month

Thursday, 7 April 2016

All about Searching - Full-Text Indexes

Full-Text Terminologies

Full-Text Index Architecture

Tuesday, 15 March 2016

Memory Fundamentals for SQL Server - References

Book References

Topic Wise Blogs

P.S. (** Not on Production directly **)

Monday, 14 March 2016

Memory Fundamentals for SQL Server - Buffer Pool, Mem To Leave

The Buffer Pool

MeM To Leave

Monday, 22 February 2016

Memory Fundamentals for SQL Server - SQL Server Memory Components, SQL Server Memory Logical Tree

SQL server memory logical level tree

Popular Posts

Contact Form

P.S. ( Not on Production directly )