Software Development News: .NET, Java, PHP, Ruby, Agile, Databases, SOA, JavaScript, Open Source

Methods & Tools


Database

Using Windows Azure to Speed up Genome Research and Save Millions of Dollars

A year ago at Virginia Tech researchers needed 2 weeks to analyze just 1 genome. Today, they can analyze 100 genomes each day. Why is this important? Scientists can learn more about our DNA and uncover more effective strategies for detecting, diagnosing, and treating diseases such as cancer. What’s helping to make this possible? An innovative solution developed by Virginia Polytechnic Institute and State University (Virginia Tech) that’s based on Windows Azure and the Windows Azure HDInsight Service.

There are currently an estimated 2,000 DNA sequencers generating around 15 petabytes of data every year. Additionally, data volumes are doubling every 8 months, significantly ahead of Moore's law, under which compute capability doubles only every 24 months. Most institutions can’t afford to scale data centers fast enough to store and analyze all of the new information. To overcome this challenge, Virginia Tech developed a high-performance computing (HPC) solution with Windows Azure. It gives global researchers a highly scalable, on-demand IT infrastructure in the cloud that they can use to store and analyze Big Data, accelerate genome research, and increase collaboration.

To make it easy for researchers to use the solution, Virginia Tech developed two cloud applications. One streamlines the creation of Genome Analysis Toolkit (GATK) pipelines (for DNA sequencing) using Windows Azure HDInsight. The other program simplifies the use of Hadoop MapReduce pipelines to automate data transfers and analyze information that resides on local and cloud-based systems in a hybrid scenario.

The new solution is saving Virginia Tech—and other organizations—millions of dollars because scientists pay only for the resources that they use. This includes Windows Azure Blob storage for temporary or long-term data storage and HDInsight clusters for on-demand HPC nodes. Provisioning a new resource takes just seconds.

Global scientists can also collaborate with less effort because they can now easily share insights and data sets virtually anytime, anywhere—and with any device. As a result, in the future scientists or doctors may be able to use the solution to develop custom treatments for individual patients faster, by engaging in genome analysis directly at hospitals.  

You can learn more about Virginia Tech’s solution by watching the video below or reading the detailed case study here.   

Categories: Database

Assigning Roles to MySQL Users

Database Journal News - Thu, 02/13/2014 - 09:01

Ever since Oracle became the owner of MySQL when it acquired Sun Microsystems in 2010, improvements to the software haven’t been as forthcoming as one might hope. There still doesn’t seem to be anything like groups in MySQL. According to Oracle, we can expect it to arrive for MySQL 7.0! Until then, this article presents a few software offerings that may help tide you over.

Categories: Database

RUNSTATS: What's new in DB2 10 for Linux, UNIX, and Windows

IBM - DB2 and Informix Articles - Thu, 02/13/2014 - 06:00
DB2 10.1 provides significant usability, performance, serviceability, and database administration enhancements for database statistics. In this article, learn about the significant performance enhancements to the RUNSTATS facility. Examples show how to take advantage of new features such as new keywords, index sampling options, enhancements to automatic statistics collection, and functions to query asynchronous automatic runstats work.
Categories: Database

Making Big Data Work for Everyone

This has been a really exciting week for the Microsoft data platform. On Monday, we announced the general availability of Power BI for Office 365. This morning, we announced that support for Hadoop 2.2 clusters is now in preview for Windows Azure HDInsight—our cloud-based Hadoop service. And, I just presented at Strata, one of our industry’s premier big data events, where I talked about making data work for everyone.

A few weeks ago, I shared some thoughts about the changing nature of data in the modern business enterprise, leading to the conclusion that there is a “new value” of data. At Microsoft, we have a bold vision of bringing big data to a billion people. That’s how we think of big data: a billion people are more informed and have a better perspective of the world around them. 

When I say, “bringing big data to a billion people,” I’m talking about changing the way people live and work through data. And, I am talking about everyone, not just data scientists and experts. Big data is valuable for people in all walks of life. It’s certainly valuable for the enterprise; people in businesses and organizations of all types clearly need the power of big data to drive faster decisions and insights. But it’s also valuable for consumers, citizens and people of all generations, whether they are seeking answers and insights for some other aspect of their lives or just satisfying their own curiosity.

Let me share a personal example of “satisfying curiosity.” A week ago, my boys and I were talking about the World Cup and how it’s being hosted in Brazil and how the Winter Olympics are being hosted in Russia. Somehow, this dialog led to us wondering about the populations of countries by continent. I was able to produce the view below in just a couple of minutes using some of our business intelligence tools, Power BI for Office 365 and Excel. I opened Excel, did a Power Query search on country population, found the data I wanted and then added that data to a worksheet. I then searched on countries by continent, found the data and added that to another worksheet. I inserted both worksheets into Power View, linked them together, dragged the right data columns into the Power View canvas, made the continents a slicer (a type of filter), and voila!, I had the information my boys were curious about. 

This little story illuminates the art of what’s possible when disparate data sources are quickly and easily brought together for analysis and visualization. Our family had a better, more informed dialog over lunch because of the power of structured information that is accessible in a friction-free way. Now just imagine what the professionals in finance and engineering I work with every day are doing – Power BI has changed how I run my business and organization day-to-day.

At Microsoft, we are striving to put the power of data insights into the hands of the people who are closest to real-world challenges. We want subject matter experts to analyze and visualize data and explore solutions as easily as they can create a spreadsheet. Think of all the different kinds of people that could benefit from this: How would farmers change their planting and animal-husbandry practices? How might law enforcement practices change? How will city planners create better experiences for all citizens of their communities?

Our view is that it takes the combined effect of three elements to bring big data to a billion people: robust tools that everyday people can use, easy access to all kinds of data sets, and a complete data platform.

Today, with the latest additions and updates to our data platform, we are another step closer to our vision. As I mentioned at the outset, on Monday, we announced the general availability of Power BI for Office 365, which brings together Microsoft’s strengths in cloud computing, productivity software, and business intelligence to enable people to easily find, analyze and visualize data in Excel, discover valuable insights, and share and collaborate on those insights from anywhere with Office 365. More than that, it does so in a turnkey way, empowering users while giving IT the ability to help manage and maximize the work. In addition, we are making available a preview of Windows Azure HDInsight with support for Hadoop 2.2. Windows Azure HDInsight combines the best of open source and flexibility of cloud deployment with the security and manageability required to accelerate the adoption of Hadoop in the enterprise. With this new version, we’ve updated HDInsight to take advantage of some key capabilities in the latest version of Hadoop.

It’s a really exciting time in the data industry. We are at the tipping point of bringing big data to a billion people. It’s happening now: doctors, farmers, lawyers, teachers and marketing teams are incorporating big data into their daily decision-making and fundamentally reinventing the way they do business.

Quentin Clark
Corporate Vice President
Data Platform Group

Categories: Database

Simplifying Business Intelligence through Power BI for Office 365

A couple of weeks ago, Quentin Clark published “What Drives Microsoft’s Data Platform Vision?” In that post, Quentin outlined our work to simplify business intelligence. Here, I’d like to connect that introduction to our focus on Power BI for Office 365.

In my role at Microsoft, I demo software A LOT. In fact, showing off our solutions is one of my favorite parts of the job. My team and I love to show the capabilities of the product we are building—the ease of use, the fast and fluid interface, the power of collaboration in data analysis, etc. We deeply believe in our mission to simplify everything associated with Big Data and Analytics, and we think we have solutions that do exactly that! It is all about simplicity: to make BI/Analytics transparent and to enable everyday users to discover insights from data without having to learn new or complex tools.

From “OH NO!” to “Oh Yeah!!”

A few weeks ago, just as I was about to start an important meeting with a customer, I realized that I did not have my “demo laptop” with me. I started to panic…clammy palms and a nagging unease with “What am I going to do?” running through my head. The demo laptop is a big deal, you see. It has our entire “stack” installed, everything from Microsoft Office to SharePoint to SQL Server to a local instance of Hadoop, etc. Usually, no demo laptop means no demo.

This was a significant meeting with an important audience that had been difficult to schedule. Luckily, the meeting room had a PC with a browser (Chrome, actually) and a connection to the internet, and it turns out that this was all I needed for a great demo. All the components which had been set up previously on my laptop are now available online, as a simple yet powerful cloud service.

On the PC, I pointed the browser to my Power BI site, which I signed up for at powerbi.com, and as simple as that, I had access to everything required to demo how to identify and share some really interesting insights from a dataset. I searched for interesting data using Power Query. I quickly created a mash up and some visualizations in Excel—my favorite (and broadly familiar) data tool—and then published it to my Power BI Site on SharePoint Online in Office 365. The act of publishing the Excel workbook saved my work on the SharePoint site, which is provisioned and managed for me, and automatically moved the associated analytical model into Windows Azure.

A Little Background

To understand how exciting this experience was for me, it might help to have a little background. In the past, I—or someone on my team—would spend a few hours setting up a demo machine in preparation for a customer meeting or presentation at a conference. Essentially, we needed to maintain a “mini IT” capability on the team. Now, however, I am free of that dependency. I can dive into a feature-rich product demo with nothing more than an internet connection and a browser. Just like our customers, I can focus on the data, compelling visualizations, and business insights. I can show off the capabilities of our software and data tools without worrying about the infrastructure.

Another cool aspect of this experience is that when I published the Excel workbook to Power BI on Office 365, the solution instantly became a cloud-scale, multi-user, refreshable BI solution—no longer an island of data, but a manageable, reliable organizational asset. The solution renders in all modern browsers; it is device and platform independent, not even requiring a specific version of Excel on the consuming user’s desktop. Anyone with access to the Excel workbook through SharePoint Online, on any type of device, can work with the data and its underlying analytical model.

The Analytical Model

The analytical model is a core part of the Excel workbook and is incredibly powerful. The model describes the natural business language, relationships, and calculations that structure the overall solution. The model enables interactive and visual exploration and reporting of the data by business users—through Power View—without requiring a technical understanding of data modeling concepts like schema, primary and secondary keys, hierarchies, etc. Creating this model is as simple as—and, in fact, is initiated by—creating a chart or a PivotTable in Excel.

The embedded analytical model enables the Q&A capabilities of Power BI, which provide the ability to answer questions submitted using natural language. The business terminology captured in the model becomes the basis for the natural language interpretation by Q&A. Of course, the basic analytical model can—and usually should be—further annotated (through Excel) to strengthen the natural language interpretation capabilities. All it takes for business users to get insights on their data is to type in questions, similar to the way they ask a colleague a question. Power BI translates the natural language question into a technical query and returns results automatically formatted and visualized in the most relevant fashion. The best way to understand this is to see it in action; check out this video (made by Patrick Baumgartner on my team):

What Does It All Mean?

Power BI for Office 365 is a powerful solution that enables every business professional, regardless of technical skills or experience with data analysis, to explore data and uncover important, meaningful insights and to share these with others quickly and easily. This empowers business users who understand and run the business to make data-driven decisions without the usual complexities and dependencies associated with IT and BI systems. No special tools, no complex software installations, no hardware infrastructure to configure and manage. Just the simple, everyday tools of the business world: Excel, a browser, and natural business language. This is what makes Power BI so powerful; this is what cloud computing enables. So, give it a shot: go to PowerBI.com to learn more and sign up. Free trials are available now!

Kamal Hathi
Director of PM
Data Platform Group

Categories: Database

MySQL February Newsletter - Prepare Early for MySQL Connect 2014 Call for Papers

MySQL AB - Tue, 02/11/2014 - 01:59
Welcome to the Oracle Information InDepth Newsletter: MySQL Edition for February 2014. The call for papers for MySQL Connect 2014 is scheduled to open on March 4 for five weeks. Read more details in this edition and start preparing the topics you'd like to share with attendees.
Categories: Database, Vendor

Oracle 11g and Index Access

Database Journal News - Mon, 02/10/2014 - 17:32

New index access paths in Oracle 11g and later releases can use existing multi-column indexes even when the column you're looking for isn't the leading column. Read on to see how Oracle accomplishes this feat.

Categories: Database

pgBadger 5: Analyze your logs daily with the incremental mode

PostgreSQL News - Mon, 02/10/2014 - 01:00

Paris, France - Feb. 10, 2014

DALIBO is proud to announce the release of pgBadger 5, a PostgreSQL performance analyzer, built for speed with fully detailed reports based on your Postgres log files.

This major version comes with a bunch of new metrics such as the SQL queries times histogram, some fixes in the new HTML5 design and the ability to build cumulative reports.

New incremental mode

The incremental mode is an old request, raised at PgCon Ottawa 2012, that concerns the ability to build incremental reports over successive runs of pgBadger. It is now possible to run pgBadger once a day (or even every hour) and have cumulative reports per day and per week. A top index page allows you to go directly to the weekly and daily reports.

Here's a screenshot of the new index page: http://dalibo.github.io/pgbadger/screenshots/pgbadgerv5_index.png

This mode has been built with simplicity in mind. You just need to run pgBadger with cron as follows:

0 23 * * * pgbadger -q -I -O /var/www/pgbadger/ /var/log/postgresql.log

This is enough to have daily and weekly reports viewable using your browser.

Take a look at our demo here: http://dalibo.github.io/pgbadger/demov5/

There's also a useful improvement that allows pgBadger to seek directly to the last parsed position in the same log file across successive executions. This feature is only available when using the incremental mode or the -l option and parsing a single log file. Say you have a weekly rotated log file and want to run pgBadger each day. With 2GB of log per day, pgBadger used to spend 5 minutes per block of 2 GB just to reach the last position in the log, so by the end of the week this feature saves you 35 minutes. Now pgBadger starts parsing new log entries immediately. This feature is compatible with the multiprocess mode using the -j option (n processes for one log file).

New Histograms

This new major release adds metrics such as an hourly graphical representation of the average count and duration of the top normalized queries. The same applies to errors and events: you can see graphically at which hours they occur most often.

For example: http://dalibo.github.io/pgbadger/screenshots/pgbadgerv5_histogram.png

There's also a new "Histogram of query times", which is a new graph in the top queries slide that shows the query times distribution during the analyzed period.

For example: http://dalibo.github.io/pgbadger/screenshots/pgbadgerv5_histogram_2.png

There are also some graph and report improvements. The mouse tracker formatting has been reworked: it now shows a vertical crosshair and all dataset values at once when the mouse pointer moves over a series. Automatic query formatting has also been changed; it is now triggered on a double click, as a simple click was painful when you wanted to copy some part of a query.

Autovacuum reports now associate the database name with the autovacuum and autoanalyze entries. Statistics now refer to "dbname.schema.table"; previous versions only showed the pair "schema.table".

This release also adds "Session peak" information and a report about "Simultaneous sessions". Parameters log_connections and log_disconnections must be enabled in postgresql.conf for this.

Links & Credits

DALIBO would like to thank the developers who submitted patches and the users who reported bugs and feature requests, especially Martin Prochazka, Herve Werner, tmihail, Reeshna Ramakrishnan, Guillaume Smet, Alexander Korotkov and Casey Allen Shobe.

pgBadger is an open project. Any contribution to build a better tool is welcome. You just have to send your ideas, feature requests or patches using the GitHub tools or directly on our mailing list.

About pgBadger:

pgBadger is a new-generation log analyzer for PostgreSQL, created by Gilles Darold (also the author of ora2pg, the powerful migration tool). pgBadger is a fast and easy tool to analyze your SQL traffic and create HTML5 reports with dynamic graphs. pgBadger is the perfect tool to understand the behavior of your PostgreSQL servers and identify which SQL queries need to be optimized.

Docs, Download & Demo at http://dalibo.github.io/pgbadger/

About DALIBO:

DALIBO is the leading PostgreSQL company in France, providing support, training and consulting to its customers since 2005. The company contributes to the PostgreSQL community in various ways, including code, articles, translations, free conferences and workshops.

Check out DALIBO's open source projects at http://dalibo.github.io

Categories: Database, Open Source

Differential Database Backup with Memory-Optimized Tables

This blog describes how a differential database backup is taken on a database with one or more memory-optimized tables. For full database backups, please refer to Database Backup with Memory-Optimized Tables for the details, and to Creating and Managing Storage for Memory-Optimized Objects for a more in-depth understanding of the data/delta files referenced later in this section.

For disk-based tables, the differential database backup includes only the pages that have changed since the last full database backup. SQL Server 2014 supports differential backup on databases with memory-optimized tables. The differential backup of a database with memory-optimized tables contains the following data:

  • The differential backup component for the disk-based tables is exactly the same as it is in databases without memory-optimized tables.
  • Active transaction log. This is exactly the same as with a full database backup.
  • For the memory-optimized filegroup, the differential backup uses the same algorithm as a full database backup to identify the data/delta files for backup, but it then filters the set of files as follows (a query to inspect these file states is sketched just below this list):
    • Data File – A data file contains newly inserted rows, and once it is full it is closed. Once a data file is closed, it is only accessed in read-only mode. A data file is backed up only if it was closed after the last full database backup. In other words, the differential backup only backs up the data files containing rows inserted since the last full database backup.
    • Delta File – A delta file stores references to the deleted data rows. A delta file is always backed up. Since any future transaction can delete a row, a delta file can be modified at any time during its lifetime and is therefore never closed. Note that delta files typically take < 10% of the storage.
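
For reference, the state of each checkpoint file pair can be inspected through the sys.dm_db_xtp_checkpoint_files DMV; a minimal sketch, using the same DMV that the companion post on full database backups queries:

select file_type_desc, state_desc, file_size_in_bytes, file_size_used_in_bytes, inserted_row_count, deleted_row_count
from sys.dm_db_xtp_checkpoint_files
order by file_type_desc, upper_bound_tsn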

Note that this optimization is only available in RTM. In CTP2, the differential database backup included all data/delta files, just like a full database backup.

Let us walk through an example showing the difference in size between a full database backup and a differential database backup. The example is based on the following database and table schema. Please refer to the blog State-transition-of-checkpoint-files-in-databases-with-memory-optimized-tables for details on the states of CFPs in the following examples.

CREATE DATABASE imoltp

GO

ALTER DATABASE imoltp ADD FILEGROUP imoltp_mod CONTAINS MEMORY_OPTIMIZED_DATA

ALTER DATABASE imoltp ADD FILE (name='imoltp_mod', filename='c:\data\imoltp_mod') TO FILEGROUP imoltp_mod

GO

use imoltp

go

-- create the table with each row around 8K

CREATE TABLE dbo.t_memopt (

       c1 int NOT NULL,

       c2 char(40) NOT NULL,

       c3 char(8000) NOT NULL,

       CONSTRAINT [pk_t_memopt_c1] PRIMARY KEY NONCLUSTERED HASH (c1)

WITH (BUCKET_COUNT = 100000)

) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)

go

-- load 8000 rows. This should load 5 16MB files

declare @i int = 0

while (@i < 8000)

begin

       insert t_memopt values (@i, 'a', replicate ('b', 8000))

       set @i += 1;

end

-- Do Manual checkpoint

Checkpoint

Example-1: Full and Differential database backup of a database after loading 8000 rows and completing the manual checkpoint
  • Full Database Backup: Wait 5 minutes and then do the full database backup. The 5-minute wait is more of a safety factor to make sure relevant data files are not missed due to time-drift. This is not a concern in a production environment, as typically there is a significant time span between checkpoints, the full database backup, and the subsequent differential database backup.

BACKUP DATABASE [imoltp] TO DISK = N'C:\data\imoltp-full-data.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

  • Output:

Processed 304 pages for database 'imoltp', file 'imoltp' on file 1.

10 percent processed.

20 percent processed.

Processed 8064 pages for database 'imoltp', file 'imoltp_mod' on file 1.

30 percent processed.

40 percent processed.

Processed 8612 pages for database 'imoltp', file 'imoltp_log' on file 1.

100 percent processed.

BACKUP DATABASE successfully processed 16980 pages in 1.592 seconds (83.323 MB/sec).

  • Result: The size of the full database backup is around 138 MB. This includes around 70MB of transaction log and around 70MB of CFPs.
  • Differential Database Backup: Since we have just completed the full database backup, the differential backup will skip the data files, as no new data files have been closed since that backup. Take the differential backup using the following command:

BACKUP DATABASE [imoltp] TO 

DISK = N'C:\data\imoltp-diff-data.bak' WITH DIFFERENTIAL, NOFORMAT, INIT, 

NAME = N'imoltp-diff Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

GO

  • Output:

Processed 48 pages for database 'imoltp', file 'imoltp' on file 1.

Processed 0 pages for database 'imoltp', file 'imoltp_mod' on file 1.

10 percent processed.

20 percent processed.

30 percent processed.

40 percent processed.

51 percent processed.

60 percent processed.

71 percent processed.

81 percent processed.

91 percent processed.

Processed 8613 pages for database 'imoltp', file 'imoltp_log' on file 1.

100 percent processed.

BACKUP DATABASE WITH DIFFERENTIAL successfully processed 8661 pages in 0.760 seconds (89.031 MB/sec).

  • Result: The size of the differential database backup is around 72 MB, predominantly consisting of transaction log. The output shows that the existing data files in ‘ACTIVE’ state are not backed up, as they have not changed since the last full database backup.
Example-2: Differential database backup after deleting 4000 rows

 -- now delete 50% rows

declare @i int = 0

while (@i <= 8000)

begin

       delete t_memopt where c1 = @i

       set @i += 2;

end

go

checkpoint

go

  • Checkpoint File Pairs (CFPs) – No changes, as we only deleted existing rows.

  • Differential Database Backup: Execute the following command

-- do the differential database backup

BACKUP DATABASE [imoltp] TO 

DISK = N'C:\data\imoltp-diff-data-del50-chkpt.bak' WITH  DIFFERENTIAL , NOFORMAT, INIT, 

NAME = N'imoltp-diff Database Backup', SKIP, NOREWIND, NOUNLOAD,  STATS = 10

GO

  • Output: the page counts reported are in 8k pages.

Processed 88 pages for database 'imoltp', file 'imoltp' on file 1.

Processed 47 pages for database 'imoltp', file 'imoltp_mod' on file 1.

10 percent processed.

20 percent processed.

30 percent processed.

40 percent processed.

50 percent processed.

60 percent processed.

Processed 8845 pages for database 'imoltp', file 'imoltp_log' on file 1.

100 percent processed.

BACKUP DATABASE WITH DIFFERENTIAL successfully processed 8980 pages in 0.941 seconds (74.552 MB/sec).

  • Result: Note that the transaction log has a few more pages, to account for the deleted rows and the checkpoint. Also, 47 pages from files in the memory-optimized filegroup were included in the backup. Of these 47 pages, 32 are from the empty data file that was created by the manual checkpoint: the internal page size in data files is 256K, or 32 8k pages, and the first page in a data file contains header information, so it is always part of the backup. The other 15 pages are from delta files. The total size of the backup is around 74.4MB.
Categories: Database

Getting Trained on Microsoft’s Expanding Data Platform

With data volumes exploding, having the right technology to find insights from your data is critical to long-term success. Leading organizations are adjusting their strategies to focus on data management and analytics, and we are seeing a consistent increase in organizations adopting the Microsoft data platform to address their growing needs around data. The trend is clear: CIOs named business intelligence (BI) and analytics their top technology priority in 2012, and again in 2013, and Gartner expects this focus to continue during 2014.2

At Microsoft, we have great momentum in the data platform space, and we are proud to be recognized by analysts: IDC reports that Microsoft SQL Server continues to be the unit leader and became the #2 database vendor by revenue.1 Microsoft was named a leader in both the Enterprise Data Warehouse and Business Intelligence Waves by Forrester,3,4 and is named a leader in the OPDMS Magic Quadrant.5

The market is growing and Microsoft has strong momentum in this space, so this is a great time to dig in and learn more about the technology that makes up our data platform through the new courses in the Microsoft Virtual Academy.

Microsoft’s data platform products

Quentin Clark recently outlined our data platform vision. This calendar year we will be delivering an unprecedented lineup of new and updated products and services:

  • SQL Server 2014 delivers mission-critical analytics and performance by bringing to market new in-memory capabilities built into the core database for OLTP (10x and up to 30x gains) and data warehousing (100x). SQL Server 2014 provides the best platform for hybrid cloud scenarios, like cloud backup and cloud disaster recovery, and significantly simplifies the on-ramp process to cloud for our customers with new point-and-click experiences for deploying cloud scenarios in the tools that are already familiar to database administrators (DBAs).
  • Power BI for Office 365 is a new self-service BI solution delivered through Excel and Office 365 which provides users with data analysis and visualization capabilities to identify deeper business insights from their on-premises and cloud data.
  • Windows Azure SQL Database is a fully managed relational database service that offers massive scale-out with global reach, built-in high availability, options for predictable performance, and flexible manageability. Offered in different service tiers to meet basic and high-end needs, SQL Database enables you to rapidly build, extend, and scale relational cloud applications with familiar tools.
  • Windows Azure HDInsight makes Apache Hadoop available as a service in the cloud, and also makes the MapReduce software framework available in a simpler, more scalable, and cost-efficient Windows Azure environment.
  • Parallel Data Warehouse (PDW) is a massively parallel processing data warehousing appliance built for any volume of relational data (with up to 100x performance gains) and provides the simplest way to integrate with Hadoop. With PolyBase, PDW can also seamlessly query relational and non-relational data.
In-depth learning through live online technical events 

To support the availability of these products, we’re offering live online events that will enable in-depth learning of our data platform offerings. These sessions are available now through the Microsoft Virtual Academy (MVA) and are geared towards IT professionals, developers, database administrators and technical decision makers. In each of these events, you’ll hear the latest information from our engineering and product specialists to help you grow your skills and better understand what differentiates Microsoft’s data offerings.

Here is a brief overview of the sessions that you can register for right now:

Business Intelligence

Faster Insights with Power BI Jumpstart | Register for the live virtual event on February 11

Session Overview: Are you a power Excel user? If you're trying to make sense of ever-growing piles of data, and you're into data discovery, visualization, and collaboration, get ready for Power BI. Excel, always great for analyzing data, is now even more powerful with Power BI for Office 365. Join this Jump Start, and learn about the tools you need to provide faster data insights to your organization, including Power Query, Power Map, and natural language querying. This live, demo-rich session provides a full-day drilldown into Power BI features and capabilities, led by the team of Microsoft experts who own them.

Data Management for Modern Business Applications

SQL Server in Windows Azure VM Role Jumpstart | Register for the live virtual event on February 18

Session Overview: If you're wondering how to use Windows Azure as a hosting environment for your SQL Server virtual machines, join the experts as they walk you through it, with practical, real-world demos. SQL Server in Windows Azure VM is an easy and full-featured way to be up and running in 10 minutes with a database server in the cloud. You use it on demand and pay as you go, and you get the full functionality of your own data center. For short-term test environments, it is a popular choice. SQL Server in Azure VM also includes pre-built data warehouse images and business intelligence features. Don't miss this chance to learn more about it.

Here’s a snapshot of the great content available to you now, with more to come later on the MVA data platform page:

Data Management for Modern Business Applications

Modern Data Warehouse

For more courses and training, keep tabs on the MVA data platform page and the TechNet virtual labs as well.

Thanks for digging in.

Eron Kelly
General Manager
Data Platform Marketing

----------- 

1Market Analysis: Worldwide Relational Database Management Systems 2013–2017 Forecast and 2012 Vendor Shares, IDC report # 241292 by Carl W. Olofson, May 2013

2Business Intelligence and Analytics Will Remain CIO's Top Technology Priority G00258063 by W. Roy Schulte | Neil Chandler | Gareth Herschel | Douglas Laney | Rita L. Sallam | Joao Tapadinhas | Dan Sommer 25 November 2013

3The Forrester Wave™: Enterprise Data Warehouse, Q4 2013, Forrester Research, Inc.,  December 9, 2013

4The Forrester Wave™: Enterprise Business Intelligence Platforms, Q4 2013, Forrester Research, Inc.,  December 18, 2013

5Gartner, Magic Quadrant for Operational Database Management Systems by Donald Feinberg, Merv Adrian and Nick Heudecker, October 21, 2013.

Disclaimer:

Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. 

Categories: Database

Windows Azure SQL Database - Uploading Data by using SQL Server Integration Services

Database Journal News - Thu, 02/06/2014 - 09:01

Recently we presented a procedure for uploading data to a Windows Azure SQL Database from an on-premise SQL Server instance. Today we will accomplish the same objective by employing a more traditional approach to data extraction, transformation, and loading (ETL) that relies on SQL Server Integration Services (SSIS).

Categories: Database

Using R with databases

IBM - DB2 and Informix Articles - Thu, 02/06/2014 - 06:00
R is not just the 18th letter of the English alphabet; it is a very powerful open source programming language that excels at data analysis and graphics. This article explains how to use the power of R with data that's housed in relational database servers. Learn how to use R to access data stored in DB2 with BLU Acceleration and IBM BLU Acceleration for Cloud environments. Detailed examples show how R can help you explore data and perform data analysis tasks.
Categories: Database

Database Backup with Memory-Optimized Tables

Memory-optimized tables are backed up as part of regular database backups, so you don’t need to do anything special to manage backups on databases with memory-optimized tables. As for disk-based tables, the CHECKSUM of data/delta files is validated as part of the database backup to proactively detect any storage corruption. However, as described in the blog on storage allocation and management, the storage used by memory-optimized tables can be much larger than their footprint in memory. A full backup of a database with one or more memory-optimized tables consists of the allocated storage for disk-based tables, the active transaction log, and the data/delta file pairs (i.e. checkpoint file pairs) for memory-optimized tables. This blog focuses on the size of database backups that you can expect for a database with memory-optimized tables.

Full Database Backup

For the discussion here, we will focus on database backups for databases with just durable memory-optimized tables, because the backup of the disk-based tables is the same irrespective of the existence of memory-optimized tables. The data/delta file pairs, also referred to as Checkpoint File Pairs or CFPs, residing in the memory-optimized filegroup can be in various states at a given time; please refer to merge-operation-in-memory-optimized-tables for details. The table below describes what part of the files is backed up, both in CTP2 and in RTM.

CFP State                     Backup in CTP2                    Backup in RTM

PRECREATED                    File metadata only                File metadata only
UNDER CONSTRUCTION            File metadata + allocated bytes   File metadata only
ACTIVE                        File metadata + allocated bytes   File metadata + used bytes
MERGE SOURCE                  File metadata + allocated bytes   File metadata + used bytes
MERGE TARGET                  File metadata + allocated bytes   File metadata only
REQUIRED FOR BACKUP/HA        File metadata + allocated bytes   File metadata + used bytes
IN TRANSITION TO TOMBSTONE    File metadata + allocated bytes   File metadata only
TOMBSTONE                     File metadata + allocated bytes   File metadata only

Table - 1: CFP and Database Backup

You will find that the size of a database backup in SQL Server 2014 RTM is noticeably smaller than what you had in CTP2.

Let us walk through a few examples to show the size of the backups. All these examples are based on the following database and table schema, using pre-RTM bits. For the states of checkpoint file pairs (i.e. CFPs) in these examples, please refer to the blog state-transition-of-checkpoint-files-in-databases-with-memory-optimized-tables.

CREATE DATABASE imoltp

GO

ALTER DATABASE imoltp ADD FILEGROUP imoltp_mod CONTAINS MEMORY_OPTIMIZED_DATA

ALTER DATABASE imoltp ADD FILE (name='imoltp_mod', filename='c:\data\imoltp_mod') TO FILEGROUP imoltp_mod

GO

use imoltp

go

-- create the table with each row around 8K

CREATE TABLE dbo.t_memopt (

       c1 int NOT NULL,

       c2 char(40) NOT NULL,

       c3 char(8000) NOT NULL,

       CONSTRAINT [pk_t_memopt_c1] PRIMARY KEY NONCLUSTERED HASH (c1)

WITH (BUCKET_COUNT = 100000)

) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)

Go

We will use the following query to look at the states of the CFPs:

select file_type_desc, state, state_desc, internal_storage_slot,file_size_in_bytes, file_size_used_in_bytes, inserted_row_count, deleted_row_count, lower_bound_tsn, upper_bound_tsn

from sys.dm_db_xtp_checkpoint_files

order by file_type_desc, upper_bound_tsn

Example-1: Backup a database with no rows in memory-optimized tables
  • Backup Command:

BACKUP DATABASE [imoltp] TO DISK = N'C:\data\imoltp-empty-data.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-Full Database Backup', SKIP, NOREWIND, NOUNLOAD,STATS = 10

  • Result: The size of the full database backup is around 5.3MB. As you can see, even though the total storage taken on disk is (16MB*9) + (1MB*9) = 153MB, the size of the backup is much smaller. Another thing to note is that even though no data rows have been inserted, significant storage (i.e. 144 MB) is already taken, primarily because of the fixed storage overhead.
Example-2: Backup the database after loading 8000 rows

-- load 8000 rows. This should use 5 16MB files

declare @i int = 0

while (@i < 8000)

begin

       insert t_memopt values (@i, 'a', replicate ('b', 8000))

       set @i += 1;

end

Checkpoint files: Besides 8 CFPs in ‘PRECREATED’, there are now 5 CFPs ‘UNDER CONSTRUCTION’, each storing up to 1870 rows, to contain the 8000 data rows just inserted. Please refer to blog state-transition-of-checkpoint-files-in-databases-with-memory-optimized-tables.

  • Backup Command:

BACKUP DATABASE [imoltp] TO  DISK = N'C:\data\imoltp-full-data.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

  • Output:

Processed 304 pages for database 'imoltp', file 'imoltp' on file 1.

Processed 0 pages for database 'imoltp', file 'imoltp_mod' on file 1.

10 percent processed.

20 percent processed.

Processed 8543 pages for database 'imoltp', file 'imoltp_log' on file 1.

100 percent processed.

BACKUP DATABASE successfully processed 8847 pages in 1.191 seconds (58.027 MB/sec).

  • Result: The size of the full database backup is around 73.5MB. Note that the size of the transaction log is 70MB and no data is actually backed up, because the CFPs are either in the ‘PRECREATED’ or ‘UNDER CONSTRUCTION’ state, as shown in Table-1. The data is only guaranteed to be in the data/delta files after a CFP transitions into the ‘ACTIVE’ state.
Example-3: Backup the database after taking an explicit checkpoint

-- do a manual checkpoint

checkpoint

  • Checkpoint files: Besides 8 CFPs in ‘PRECREATED’, there are now 5 CFPs with state ‘ACTIVE’. The manual checkpoint transitions the CFPs in ‘UNDER CONSTRUCTION’ state to ‘ACTIVE’.
  • Backup Command:

-- the backup will include full log and the data. So the size is double - 138MB

BACKUP DATABASE [imoltp] TO DISK = N'C:\data\imoltp-full-data-chkpt.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

GO

  • Output:

Processed 304 pages for database 'imoltp', file 'imoltp' on file 1.

10 percent processed.

20 percent processed.

Processed 8064 pages for database 'imoltp', file 'imoltp_mod' on file 1.

30 percent processed.

40 percent processed.

Processed 8548 pages for database 'imoltp', file 'imoltp_log' on file 1.

100 percent processed.

BACKUP DATABASE successfully processed 16916 pages in 1.872 seconds (70.594 MB/sec).

  • Result: The size of the full database backup is around 138MB. It consists of 70MB of transaction log and also 66MB (8064 8k pages) of contents from data/delta files.  
Example-4: Backup the database after deleting 50% rows and merging the CFPs

-- now delete 50% rows

declare @i int = 0

while (@i <= 8000)

begin

       delete t_memopt where c1 = @i

       set @i += 2;

end

checkpoint

-- Do the manual merge. It generates merge-target and other files stay as regular files

-- The transaction range here is picked up by querying the DMV

-- sys.dm_db_xtp_checkpoint_files. Please find the appropriate range for your test.

--

exec sys.sp_xtp_merge_checkpoint_files 'imoltp',  1877, 12004

go

-- This installs the merge leading to 5 CFPs as MERGE SOURCE and the merge target

-- transitions to ACTIVE state

Checkpoint

go

BACKUP DATABASE [imoltp]

TO DISK = N'C:\data\imoltp-full-data-chkpt-del50-merge-chkpt.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

GO

  • Checkpoint files: Besides 8 CFPs in ‘PRECREATED’, there are now 5 CFPs with state ‘MERGE SOURCE’ and 2 CFPs as ‘ACTIVE’.
  • Backup Command:

-- the backup will include full log and the data.

BACKUP DATABASE [imoltp] TO DISK = N'C:\data\imoltp-full-data-chkpt.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

GO

  • Output:

Processed 304 pages for database 'imoltp', file 'imoltp' on file 1.

10 percent processed.

20 percent processed.

Processed 12143 pages for database 'imoltp', file 'imoltp_mod' on file 1.

30 percent processed.

40 percent processed.

Processed 8815 pages for database 'imoltp', file 'imoltp_log' on file 1.

100 percent processed.

BACKUP DATABASE successfully processed 21262 pages in 1.794 seconds (92.588 MB/sec).

  • Result: The size of the full database backup is around 172 MB. It consists of 72MB of transaction log and also 100MB of data/delta files. The reason the size has grown by around 50% even though we deleted 4000 rows is that the ‘MERGE SOURCE’ CFPs still contain all 8000 rows.
Example-5: Backup after MERGE SOURCE CFPs transition to TOMBSTONE state

checkpoint

go

-- Do the log backup. This log backup is around 71MB

BACKUP LOG [imoltp] TO DISK = N'C:\data\imoltp-log-1.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-log Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

GO

checkpoint

go

-- Do one more log backup. This backup reduces active transaction log size

-- to 7MB

BACKUP LOG [imoltp] TO DISK = N'C:\data\imoltp-log-2.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-log Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

GO

checkpoint

go

-- do one more log backup

BACKUP LOG [imoltp] TO DISK = N'C:\data\imoltp-log-3.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-log Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

GO

  • Checkpoint files: Besides the 8 CFPs in ‘PRECREATED’, there are now 6 CFPs ‘IN TRANSITION TO TOMBSTONE’ and 5 CFPs ‘ACTIVE’. Only one ACTIVE CFP contains the 4000 remaining data rows; the other ACTIVE CFPs were created due to the manual checkpoints. In a production environment, the checkpoints will occur automatically and log backups will be taken as part of the regular operational process.
  • Backup Command:

Since the CFPs have either been converted to TOMBSTONE or are in transition to it, the size of the database backup is now 38MB (only one copy of the data, as the transaction log has been backed up and freed).

BACKUP DATABASE [imoltp]

TO DISK = N'C:\data\imoltp-full-data-chkpt-del50-merge-chkpt-logbkup3.bak'

WITH NOFORMAT, INIT, NAME = N'imoltp-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

GO

  • Output:

Processed 288 pages for database 'imoltp', file 'imoltp' on file 1.

10 percent processed.

Processed 4128 pages for database 'imoltp', file 'imoltp_mod' on file 1.

Processed 23 pages for database 'imoltp', file 'imoltp_log' on file 1.

100 percent processed.

BACKUP DATABASE successfully processed 4439 pages in 0.394 seconds (88.012 MB/sec).

  • Result: The size of the full database backup is around 38 MB. This is because we have around 32MB of data in the ACTIVE CFPs. There are no CFPs in the ‘MERGE SOURCE’ or ‘REQUIRED FOR BACKUP/HA’ states.
Summary

The size of a backup of a database with one or more memory-optimized tables is typically bigger than the in-memory size of the memory-optimized tables, but smaller than their on-disk storage. The extra size depends upon the number of Checkpoint File Pairs (i.e. CFPs) in the ‘MERGE SOURCE’ and ‘REQUIRED FOR BACKUP/HA’ states, which in turn depends upon the workload.

Categories: Database

DBD::Pg 3.0.0 released

PostgreSQL News - Wed, 02/05/2014 - 01:00

Version 3.0.0 of DBD::Pg, the Perl interface to Postgres, has just been released. The new version is available from CPAN.

Categories: Database, Open Source

Natively Compiled Stored Procedures with SQL Server 2014

Database Journal News - Mon, 02/03/2014 - 09:01

With the new SQL Server 2014 In-Memory OLTP tables, stored procedures can be natively compiled and can improve performance. Read on to learn more.

Categories: Database

Barman 1.3.0 released

PostgreSQL News - Mon, 02/03/2014 - 01:00

2ndQuadrant is proud to announce the release of version 1.3.0 of Barman, Backup and Recovery Manager for PostgreSQL.

This major release has a new code infrastructure in terms of output, subprocesses, remote commands, file system, events ('hooks') and metadata management, making the application more robust.

Barman 1.3.0 introduces a few configuration options for:

  • managing hook scripts before and after archiving a WAL file
  • network compression of remote backup and recovery operations, and
  • immediate checkpoint at backup time

Administrators can now force a rebuild of the "xlog.db" file (WAL archive) through the "'barman rebuild-xlogdb'" command.

This major version now correctly handles those tablespaces that had been created inside the "PGDATA" directory, removing content duplication.

Remote recovery code has been improved in order to behave exactly like local recovery does, including preparation of the directories layout ("PGDATA" and tablespaces).

Minor bugs have also been fixed.

Many thanks go to Ecometer (www.ecometer.it), Agile Business Group (www.agilebg.com) and a French company (that prefers to remain anonymous) for funding the development of this release.

For a complete list of changes, see the "Release Notes" section below.

Release notes:
  • Refactored "BackupInfo" class for backup metadata to use the new "FieldListFile" class ("infofile" module)
  • Refactored output layer to use a dedicated module, in order to facilitate integration with Nagios ("NagiosOutputWriter" class)
  • Refactored subprocess handling in order to isolate stdin/stderr/stdout channels ("command_wrappers" module)
  • Refactored hook scripts management
  • Extracted logging configuration and userid enforcement from the configuration class.
  • Support for hook scripts to be executed before and after a WAL file is archived, through the "'pre_archive_script'" and "'post_archive_script'" configuration options.
  • Implemented immediate checkpoint capability with "--immediate-checkpoint" command option and "'immediate_checkpoint'" configuration option
  • Implemented network compression for remote backup and recovery through the "'network_compression'" configuration option (#19)
  • Implemented the "'rebuild-xlogdb'" command (Closes #27 and #28)
  • Added deduplication of tablespaces located inside the "PGDATA" directory
  • Refactored remote recovery code to work the same way local recovery does, by performing remote directory preparation (assuming the remote user has the right permissions on the remote server)
  • "'barman backup'" now tries and create server directories before attempting to execute a full backup (#14)
  • Fixed bug #22: improved documentation for tablespaces relocation
  • Fixed bug #31: "'barman cron'" checks directory permissions for lock file
  • Fixed bug #32: "xlog.db" read access during cron activities
About Barman:

Barman (Backup and Recovery Manager) is an open source administration tool for disaster recovery of PostgreSQL servers written in Python. It allows your organisation to perform remote backups of multiple servers in business critical environments and helps DBAs during the recovery phase. Barman’s most requested features include backup catalogues, retention policies, remote recovery, archiving and compression of WAL files and backups. Barman is distributed under GNU GPL 3.

Categories: Database, Open Source

Microsoft to Join the Big Data Conversation as an Elite Sponsor at Strata, Santa Clara, CA

Join the big data conversations at Strata Santa Clara. On February 11-13, Microsoft will join 2,700 of the world’s leading big data scientists, CIOs and executives at the Strata conference in Santa Clara, CA. For some, Microsoft does not come immediately to mind when thinking about leadership in data; however, along with the power of HDInsight, Microsoft is working to democratize big data – connecting companies to previously untouched data sources and enabling everyone to gain insight through familiar and powerful tools in Microsoft Office Excel.

Quentin Clark, Corporate Vice President of the Data Platform Group, will share some of that vision in his keynote address, followed by an interview with O’Reilly’s Mac Slocum - streamed live from the Strata conference website. Additionally, Herain Oberoi, Director of Product Marketing, Microsoft, will deliver a joint session with Bryan Hurd, Director of Advanced Analytics, Microsoft Cybercrime Center, highlighting how they use Microsoft’s portfolio of Big Data solutions, including simple, familiar end user tools and cutting-edge cloud technologies, to fight organized crime and BotNets that commit hundreds of millions of dollars of identity fraud every year (view the Cybercrime Center video on YouTube). Visit the Strata website for the full event agenda.

Microsoft is also hosting a hack-a-thon on February 8-9.  Not limited to technology, this big data hack-a-thon is aimed at spending 1½ days hacking for a good cause. Extra credit will go to those who develop using HDInsight and Power BI, but you are free to use whichever technology you like.  Sign up here.

Join the conversation – attend one of our sessions in person, visit our exhibit booth (#401) and interact with our data experts, or follow us on Facebook.com/sqlserver and Twitter @SQLServer. You will see how Microsoft’s complete data platform delivers Big Data: enabling you to enrich your data with external data sources, connect any type of data – structured or unstructured – and extend the accessibility of data analysis beyond scientists into the hands of every employee.

Categories: Database

IBM Business Analytics Proven Practices: IBM Cognos BI Dispatcher Routing Explained

IBM - DB2 and Informix Articles - Fri, 01/31/2014 - 06:00
A short document to explain the main concepts of IBM Cognos Dispatchers when routing requests.
Categories: Database

VoltDB Launches Version 4.0 of Its In-Memory Operational Database

Database Journal News - Thu, 01/30/2014 - 22:30

VoltDB adds enhanced analytics, expanded integrations, MySQL migration and more to VoltDB 4.0, the latest version of its in-memory database.

Categories: Database

In-Memory OLTP Index Troubleshooting, Part II

With the In-Memory OLTP feature in SQL Server 2014 we introduce two new types of indexes for memory-optimized tables: the memory-optimized NONCLUSTERED indexes, and the memory-optimized HASH indexes.

Memory-optimized NONCLUSTERED indexes behave similarly to traditional NONCLUSTERED indexes from an application point of view. The main difference is that memory-optimized indexes are always covering (i.e. all columns are virtually included), while with traditional disk-based NONCLUSTERED indexes you need to specify which columns you want to include alongside the index key columns.

Memory-optimized HASH indexes behave differently from NONCLUSTERED indexes. They are optimized for point-lookup operations, and do not support ordered scans or inequality seek operations. In addition, you need to specify a BUCKET_COUNT when creating the index, and you need to pick the right value in order for the index to perform optimally. In earlier posts we covered the following three index troubleshooting aspects:

  • Bucket_count troubleshooting for HASH indexes: how to find out whether a bucket_count is too low, and how to determine what should be the correct bucket_count. You can also find more information about determining the right bucket_count in Books Online.
  • Searching on a subset of the index key: HASH indexes do not support searching on the leading columns of an index key.
  • Searching on inequality predicates and ordered scans: both operations are not supported with HASH indexes, but they are supported with NONCLUSTERED indexes. But note that NONCLUSTERED indexes support ordered scans only in the direction indicated with the index key: for example, if the index key is (c1 DESC), the index supports retrieving the values for c1 in descending order, but not in ascending order.

All that said, it seems that HASH indexes have quite a few limitations, compared with NONCLUSTERED indexes. For that reason, it is usually a safer bet to start with NONCLUSTERED indexes, both for new applications and when migrating existing disk-based tables to memory-optimized. You can then use HASH indexes to further optimize the workload. Indeed, HASH indexes are the most efficient index for equality search (point lookup) and full table scan operations.
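
To make that trade-off concrete, here is a minimal sketch of a memory-optimized table that carries both index types; the table name, index names and BUCKET_COUNT below are made up for illustration. The HASH index serves equality lookups on c1, while the NONCLUSTERED index supports inequality predicates and ordered scans on c2:

-- Illustrative only: point lookups on c1 via the HASH index,
-- range scans and ordered retrieval on c2 via the NONCLUSTERED index
CREATE TABLE dbo.t_demo (
       c1 int NOT NULL,
       c2 datetime NOT NULL,
       CONSTRAINT [pk_t_demo_c1] PRIMARY KEY NONCLUSTERED HASH (c1) WITH (BUCKET_COUNT = 1000000),
       INDEX [ix_t_demo_c2] NONCLUSTERED (c2)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)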

In the remainder of this post we:

  • Describe a way to troubleshoot seek vs. scans on memory-optimized indexes using the new DMV sys.dm_db_xtp_index_stats: if there are more full index scans than expected, one of the above-mentioned issues may be the case: trying to search on a subset of a HASH index key or trying to search on inequality predicates with a HASH index.
  • Troubleshoot duplicates: you may run into problems if the index keys of a memory-optimized index contain a lot of duplicate values, particularly when using a HASH index. We describe how to detect this situation and how to work around.
Troubleshooting seek vs. scan using XTP index DMVs

The DMV sys.dm_db_xtp_index_stats shows statistics for index operations performed by the in-memory storage engine. The DMV contains stats about the usage of each index since its creation in memory – note that memory-optimized indexes are always recreated on database restart. You can use the following query to retrieve key statistics about the usage of indexes on your table:

SELECT ix.index_id, ix.name, scans_started, rows_returned

FROM sys.dm_db_xtp_index_stats ixs JOIN sys.indexes ix ON

ix.object_id=ixs.object_id AND ix.index_id=ixs.index_id

WHERE ix.object_id=object_id('<table name>')

For troubleshooting indexes, the columns ‘scans_started’ and ‘rows_returned’ contain key information:

  • scans_started – this is the number of scan operations the in-memory storage engine has started. Note that from the storage engine point of view, all operations to locate a row or the location to insert a new row are scans: for example, a full index scan, a point lookup and a row insert all require a single scan operation.
  • rows_returned – the cumulative number of rows returned by all scan operations in the storage engine. Note that this number reflects the rows returned by the storage engine, not the number of rows returned to the client. For example, the query plan may call for a filter condition or aggregation that reduces the number of rows before it is returned to the client.
    • Insert operations do not result in rows being returned. Update and delete operations consist of a scan, to locate the row to be updated, followed by the actual row update/delete.

If the number of rows_returned is significantly larger than scans_started, this is an indication that, on average, index operations scan a large part of the index. If all index operations are expected to be point lookups, this could be an indication of one of the earlier-mentioned problems, where the query calls for an operation that is not supported by the index and thus reverts to a full index scan: a search on a subset of the hash index key columns, or a search on inequality predicates with a hash index.

If scans_started is larger than rows_returned, this is an indication that the workload is insert-heavy, or that a lot of point lookups failed to locate a row.
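
A minimal sketch that surfaces this ratio per index, extending the DMV query above (the '<table name>' placeholder is the same as before; how high is "too high" remains workload-dependent):

SELECT ix.index_id, ix.name, ixs.scans_started, ixs.rows_returned,
       ixs.rows_returned * 1.0 / nullif(ixs.scans_started, 0) AS avg_rows_per_scan
FROM sys.dm_db_xtp_index_stats ixs JOIN sys.indexes ix ON
ix.object_id=ixs.object_id AND ix.index_id=ixs.index_id
WHERE ix.object_id=object_id('<table name>')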

Index keys with many duplicate values

Issue: Index keys with many duplicate values can cause performance problems. If each index key has 5 duplicates this is usually not a problem, but if the discrepancy between the number of unique index keys and the number of rows in the tables becomes very large – more than 10X – this can become problematic.

All rows with the same index key end up in the same duplicate chain. For hash indexes this can create a lot of overhead in case of hash collisions: if multiple index keys end up in the same bucket due to a hash collision, index scanners always need to scan the full duplicate chain for the first value before they can locate the first row corresponding to the second value. For nonclustered indexes this causes additional overhead for garbage collection.

Symptom: For hash indexes the performance of DML operations degrades and CPU utilization increases. In addition, there is an increase in CPU utilization during database startup, and a potential increase in recovery time. This becomes especially clear when inserting a large number of rows.

For nonclustered indexes garbage collection will start to consume more resources, resulting in an overall increased CPU utilization in the system. The problem does not affect DML operations directly,[1] but it does put more overall pressure on system resources.

How to troubleshoot: The average number of duplicate rows for a given index key can be obtained using T-SQL queries. First determine the row count of the table, then determine the number of unique index key values. Divide the row count by the number of unique index keys to obtain the average number of duplicates.

To determine the row count for the table use the following query:

select count(*) as 'Row count' from <tableName>

To determine the number of unique index key values use the following query:

select count(*) as 'Distinct index key values' from (select distinct <indexKeyColumns> from <tableName>) a
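
The two counts can also be combined into a single query; a minimal sketch using the same <tableName> and <indexKeyColumns> placeholders as above:

-- average duplicates per key = row count / distinct index key values
select count(*) * 1.0 /
       (select count(*) from (select distinct <indexKeyColumns> from <tableName>) a)
       as 'Average duplicates per index key'
from <tableName>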

For hash indexes specifically, you can also troubleshoot using the hash index stats DMV. Use the following query:

SELECT hs.object_id, object_name(hs.object_id) AS 'object name', i.name as 'index name', hs.*

FROM sys.dm_db_xtp_hash_index_stats AS hs

JOIN sys.indexes AS i ON hs.object_id=i.object_id AND hs.index_id=i.index_id;

If the average chain length is high and the number of empty buckets is high, it is likely that there are many rows with duplicate index key values or there is a skew in the key values.

Workaround: First, evaluate if the index is truly needed. The index may have been added in the past to support queries that are no longer part of the workload.

For HASH indexes there are two ways to work around the issue of many duplicates:

  • In most cases you will want to use a NONCLUSTERED index instead, as NONCLUSTERED indexes generally perform better in case of duplicates. If you go for this option, consider uniquifying the index key, as indicated below.
  • Alternatively, you can over-size the index by using a very high bucket count; for example, 20 – 100 times the number of unique index key values. This will make hash collisions unlikely.

For NONCLUSTERED indexes with a lot of duplicates, consider adding additional columns to the index key. For example, you can add the primary key columns to the index key to make it unique, in other words to uniquify the index.
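
As a purely illustrative sketch (table and column names here are made up), a uniquified NONCLUSTERED index on a memory-optimized table might look like this, with the primary key column c1 appended to the index key on c2 so that every index key value is unique:

-- Illustrative only: c2 is the searched-on column with many duplicates;
-- appending the primary key column c1 makes the index key unique
CREATE TABLE dbo.t_orders (
       c1 int NOT NULL,
       c2 datetime NOT NULL,
       CONSTRAINT [pk_t_orders_c1] PRIMARY KEY NONCLUSTERED (c1),
       INDEX [ix_t_orders_c2_c1] NONCLUSTERED (c2, c1)
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA)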

Adding columns to the index key does come with a performance penalty for DML operations. Therefore, it is important to test the new uniquified index under production load, and compare the performance with the old index. Consider the overall throughput of the system, query performance, and also the overall resource utilization, especially CPU.


[1] In SQL 2014 CTP2 user transactions could be affected, due to garbage collection performed at commit time. This issue will be resolved in SQL 2014 RTM: garbage collection no longer directly affects user transactions.

Categories: Database