Large database being hammered in the evening!

n30_mkii · 14 Jan 2012 at 14:40

As per the title at work we have a SQL 2005 node as part of a cluster.

This particular node hosts our a few DB's but mainly our biggest database which is around 900GB in size. We have been getting complaints that in the evening this database becomes unusable for a couple of hours. Our initial thoughts were that backups would be causing this but after a little more monitoring we can see they are not the cause.

Basically our backups do an incremental each evening with a full on a Friday. The incremental backup finished between 8-9pm, we have even suspended the backup on this server one evening to fully rule this out for our DB team.

I have looked at the server reports using various monitoring and can clearly see that between just after 9pm something kicks off on the server and it constantly begins reading in 200MB/s data from the SAN. This continues until around midnight.

Looking into a little further I can see that the read operation is purely being done by the SQL process and at the time I can see it's the massive DB that seems to being read in (file monitoring level) i.e. gigs and gigs of it. No other process on the server is causing it. And it's that single file from the servers data drive.

There are no scheduled maintenance plans at that time the earliest on is at around 10:45 PM but this starts a 9pm. It does it every night at the same time so it must be something that's scheduled. The fact is that whilst this DB is been read in so intesivly end users are suffering and this is a DB which must have 24/7 access.

I have looked through SQL logs but nothing is showing as being run at that time. I'm not a DBA and know only a little of SQL but seem unable to track down the cause and to be honest our DB team are a waste of space at anything other than setting up the odd maintenance plan.

Does anyone have any ideas as to what could be causing this? If someone had created a very bad scheduled report etc would that cause it? If so how could I track that down? Could it be some sort of expensive query that is been remotely executed? Is there any circumstance that SQL would be doing something with the DB outside of a maintenance plan?

Any help would be really appreciated as I'm really struggling to pinpoint the issue. Perfmon is showing large read disk queue depths as to be exepected with the I/O that we are seeing. I've tried looking in activity monitor within management studio but can't seem to see anything I'm guessing there may be something I'm missing

Rroff · 14 Jan 2012 at 14:46

It does sound like a report or some kind of indexing task somewhere running, if it was maintenance tasks i.e. optimisation they usually do a fair amount of write as well.

Unfortunatly its 7+ years since I did database admin but there should be query logs somewhere.

Randomface74 · 14 Jan 2012 at 14:47

n30_mkii · 14 Jan 2012 at 14:48

We see around 1MB/s write's nothing compared the the reads.

If I connected into the server again this evening, is there anything I can do in SQL that will show me if some massive report or a indexing job was running?

AV isn't scanning other than realtime and as stated this is all to do with the SQL process MDF's and LDF's etc are also excluded from AV scans

The only other thing one of the DB team mentioned was something regarding auto update statistics was enabled for the DB and they didn't know if that may cause this issue?

n30_mkii · 16 Jan 2012 at 12:14

Still stuggling with this. Anyone have any further thoughts?

Appreciated as always,

ruffneck · 16 Jan 2012 at 12:19

Have you checked through the SQL jobs? I had a similar problem once, turned out to be SQL replication.

IndieSchmindie · 16 Jan 2012 at 12:20

You need to have a look in the activity monitor whilst it's happening really. You'll be able to see the queries it's running and figure out what's happening.

skyripper · 16 Jan 2012 at 12:22

Ideas off the top of my head..

Database - Replication/Mirroring, stats gathering/reindexing, crappy SQL report/extract, SQL backups, scheduling issues
Non-Database - antivirus, defragmentation, OS and filesystem backups

I'm not a DBA, but I do sit with them

If it is stats generation and depending on the number of tables and number of rows in these tables - and the growth/changerate in them, you could boot different tables/tablespaces stats collection to different days.

Turn on SQL tracing at 8:45pm and see what kicks in... Turn it off again after the monster performance hit starts to occur and take it from there.

oneilldo · 16 Jan 2012 at 14:02

Just use SQL Server profiler and save results to a table. Then the next day select from the table order by duration or reads descending and you will find the culprit. Depending on usage you might want someone in to keep an eye during the trace running as it can add load to the server and also the table it creates could be quite large. Profiler can create large temp files as well depending on usage

A faster way might be If you can connect to SQL server management studio at the time it is happening, run exec sp_who2 look for status runnable and a large value for DiskIO. Make a note of the SPID of the ones that are highest for this. Wait 1 minute or so and run exec sp_who2 again now get the diskio out for the highest again. Your looking for the SPID where the DiskIO has increased the most in that minute period.

sp_who2 will also show the DBName and userid for that Spid so that should help you track it down.
dbcc inputbuffer(XX) where xx is the SPID will show what is running (or last ran) from that SPID.

NB: The values are for that connection so if you leave connections open for a while the values could be quite high. However comparing the values as outlined above between a minute period and identifying the SPID with the highest difference

101001101 · 16 Jan 2012 at 14:23

^^^^^ great advice, SQL Profiler is a really handy place to start to look for this kind of issue - I have used it myself many times its a life safer.

Just fire it up in the evening and set a filter to filter all SQL traffic that is > than a specific value for the CPU load - I personally would run it manually and watch it myself.

oneilldo · 16 Jan 2012 at 14:58

Thanks, it should be as I have been using it for > 10 years lol

Curiosityx · 16 Jan 2012 at 21:02

Just throwing this out there, have you looked at the LAN for any bottlenecks, oversubscription on switch uplinks or possible NIC teaming issues?

n30_mkii · 16 Jan 2012 at 22:22

oneilldo said:
Just use SQL Server profiler and save results to a table. Then the next day select from the table order by duration or reads descending and you will find the culprit. Depending on usage you might want someone in to keep an eye during the trace running as it can add load to the server and also the table it creates could be quite large. Profiler can create large temp files as well depending on usage

A faster way might be If you can connect to SQL server management studio at the time it is happening, run exec sp_who2 look for status runnable and a large value for DiskIO. Make a note of the SPID of the ones that are highest for this. Wait 1 minute or so and run exec sp_who2 again now get the diskio out for the highest again. Your looking for the SPID where the DiskIO has increased the most in that minute period.

sp_who2 will also show the DBName and userid for that Spid so that should help you track it down.
dbcc inputbuffer(XX) where xx is the SPID will show what is running (or last ran) from that SPID.

NB: The values are for that connection so if you leave connections open for a while the values could be quite high. However comparing the values as outlined above between a minute period and identifying the SPID with the highest difference

That has worked a treat I think I have tracked it down to a remote DBCC execution. Hopefully I can nail it down now many thanks to everyone for your help

oneilldo · 17 Jan 2012 at 09:29

no probs, if you need anymore help then let me know.