SQL performance

Caporegime
Joined
18 Oct 2002
Posts
32,618
I don't know much about SQL or what to expect performance-wise. We have a PSQL instance running on AWS with plenty of resources (CPU and memory never top 5%).

Some of the queries are incredibly slow, like taking several minutes, but the amount of data we get is not that large. We are thinking of dumping the DB altogether and just using flat files (something like Google protobufs).

There is a table:
id | mid | type | timestamp | int1 | int2 | float1 | float2 | str | add_time

id, mid, type and timestamp are all ints. str is a string but is only 3 characters.
add_time is an SQL time type.

The query is like this:
SELECT * FROM data WHERE mid >400 AND timestamp > 1404913361 ORDER BY timestamp ASC;

There are about 500 entries for a particular value of "mid" and maybe 5-10 different "mid" values.


Not sure of the sizes of the SQL data types, but I am estimating between 20-100KB of data, which shouldn't be that much. Does the querying and sorting really take that long? Is waiting 3 minutes to get this data realistic? If so we will drop SQL at once, because we can do similar in a few milliseconds using flat files.
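
If it helps answer the "how much data" part, I'm told Postgres can report the actual on-disk size, so someone could sanity-check my 20-100KB estimate with something roughly like this (table name "data" as in the query above):

Code:
 -- rough sketch: total size of the table including its indexes
 SELECT pg_size_pretty(pg_total_relation_size('data'));
 -- and how many rows the query above actually touches
 SELECT count(*) FROM data WHERE mid > 400 AND timestamp > 1404913361;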
 
Man of Honour
Joined
26 Dec 2003
Posts
30,925
Location
Shropshire
That's way too long for 500 records. I can grab a few thousand records from our data and have them pulled into a nicely formatted report in seconds, so minutes to pull 500 records of raw data is definitely not right.
 
Associate
Joined
5 Oct 2007
Posts
353
You probably don't have any indexes, so it is having to do a full table scan to satisfy the WHERE clause, which is very slow. I'm a Microsoft SQL man myself but I presume PSQL behaves the same way.
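
I don't know the exact Postgres commands off-hand, but I'd expect you can check what indexes you already have with something roughly like this (pg_indexes being the Postgres catalogue view, if I remember rightly):

Code:
 -- list the indexes defined on the table
 SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'data';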
 
Caporegime
OP
Joined
18 Oct 2002
Posts
32,618
How many rows are in the table in total?

Currently 6,990,058 although we tend to purge old data as it isn't needed.


The data is pulled from the database and displayed on a webserver as some nice graphs. It's things like the number of users connected, a simple heartbeat every 5 minutes that server X is alive, and some performance metrics. When the webserver is first loaded it needs to grab 24 hours' worth of data, and this is where the 3-5 minute wait is happening. The webserver then updates every 5 minutes and we request only the data since the last update, so this is typically just a few dozen rows, but even that takes several seconds, which just seems absurd.


We are well versed in handling big data; we crunch through hundreds of GB without issue. The DB was meant to make life easier for creating a webserver, but it is proving to be a pain. What is strange is that sometimes it is much faster (e.g. 10x, which we still consider very slow), yet the resources of the AWS server the DB is running on are not pushed at all.
 
Caporegime
OP
Joined
18 Oct 2002
Posts
32,618
You probably don't have any indexes, so it is having to do a full table scan to satisfy the WHERE clause, which is very slow. I'm a Microsoft SQL man myself but I presume PSQL behaves the same way.

Yeah, I was thinking it was this kind of issue, but I know next to nothing about databases so I wanted to get some feedback, as I have no idea what to expect.
 
Soldato
Joined
18 Oct 2002
Posts
3,926
Location
SW London
It sounds like you're doing a full table scan on all 7 million rows.
3 minutes still sounds like a pretty long time to do that query though.

What indexes are on the table at the moment?
Can you get a query plan to see exactly what it's doing?

I'm not sure what profiling tools you have on Postgres, but you should be able to get some info about where the time is being spent.
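
I think in Postgres the query plan bit is just EXPLAIN in front of the query, something along these lines:

Code:
 -- EXPLAIN shows the plan; adding ANALYZE actually runs the query and times each step
 EXPLAIN ANALYZE SELECT * FROM data WHERE mid > 400 AND timestamp > 1404913361 ORDER BY timestamp ASC;

If it says "Seq Scan on data" then it's reading the whole table.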
 
Caporegime
OP
Joined
18 Oct 2002
Posts
32,618
These are the current indexes:
table_name | index_name | column_name
------------+--------------------+-------------
data | data_pkey | id
data | idx_data_timestamp | timestamp


Thinking about the typical queries, the id column is not important; the timestamp is, though, and the mid and type columns are also important.
Queries are like this typically:

Code:
 SELECT * FROM data WHERE mid >400 AND timestamp > 1404913361 ORDER BY timestamp ASC;

or
Code:
 SELECT * FROM data WHERE mid >400 AND timestamp > 1404913361 AND type = 2 ORDER BY timestamp ASC;
 
Soldato
Joined
18 Oct 2002
Posts
3,926
Location
SW London
It does kind of seem that what you're doing may not be suited to using an RDBMS.
From what you've described you're just querying a single table over and over again, presumably with increasing timestamps to get the latest data?

What exactly is the data for and how does it end up in the db?
When you mention getting the previous 24 hours of data on startup and then getting regular updates, could you do something with a message queue?

Whatever inserts to the db now could push a message onto the queue with a particular time to live. Then on startup you could fetch the current state of the world, which would give you all live messages.
Then, once subscribed, you would receive all further updates which you could buffer/window as appropriate if you want to process things at set intervals by the consumer.
 
Associate
Joined
15 Oct 2005
Posts
647
It does kind of seem that what you're doing may not be suited to using an RDBMS.

This is what I thought. If you already use AWS, would a NoSQL offering such as DynamoDB not be a better option if you just want very fast reads?

Then, as said above, a queue such as Amazon SQS could be used if required.
 
Soldato
Joined
16 Feb 2004
Posts
4,786
Location
London
These are the current indexes:
table_name | index_name | column_name
------------+--------------------+-------------
data | data_pkey | id
data | idx_data_timestamp | timestamp


Thinking about the typical queries, the id column is not important; the timestamp is, though, and the mid and type columns are also important.
Queries are like this typically:

Code:
 SELECT * FROM data WHERE mid >400 AND timestamp > 1404913361 ORDER BY timestamp ASC;

or
Code:
 SELECT * FROM data WHERE mid >400 AND timestamp > 1404913361 AND type = 2 ORDER BY timestamp ASC;

Looks like you can add a combined index over mid and timestamp to help speed it up. It should add a fair amount of performance if the values in mid are fairly limited and repeating.

The initial day's queries will be sending a fair amount of data over the pipe to your app; you'd need to find out the query speed and the send speed to see if there's a bottleneck somewhere. How many rows are being sent for the day?
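
Something like this should do it, I'd have thought (index name is just an example, and I believe CONCURRENTLY avoids locking writes while it builds):

Code:
 CREATE INDEX CONCURRENTLY idx_data_mid_timestamp ON data (mid, timestamp);

Then run the query again (or an EXPLAIN) to check the new index is actually being picked up.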
 
Caporegime
OP
Joined
18 Oct 2002
Posts
32,618
It does kind of seem that what you're doing may not be suited to using an RDBMS.
From what you've described you're just querying a single table over and over again, presumably with increasing timestamps to get the latest data?

What exactly is the data for and how does it end up in the db?
When you mention getting the previous 24 hours of data on startup and then getting regular updates, could you do something with a message queue?

Whatever inserts to the db now could push a message onto the queue with a particular time to live. Then on startup you could fetch the current state of the world, which would give you all live messages.
Then, once subscribed, you would receive all further updates which you could buffer/window as appropriate if you want to process things at set intervals by the consumer.

That is kind of what we are thinking.

For our main number-crunching software we don't bother with a database, because we need to load the previous X days' worth of historic data, which is saved in files for each day. We can load GBs in very little time. The DB was just used to monitor some things like the number of connected users, heartbeats from servers, and CPU and memory utilisation. The idea of a DB was just to make integration with a webserver easy, but it hasn't really achieved that for us.


I will see if I can get our sys admin to look at making an index using mid and timestamp as a temp solution.
 
Caporegime
OP
Joined
18 Oct 2002
Posts
32,618
Create a single index on fields (timestamp, mid)

Make sure stats are being collected just after you load a large number of rows into the table.

Will see if I can get the sys admin to do this. He was very against the idea of making changes to the indexing, but I'm not sure he likes getting criticism from someone who knows next to nothing about DBs!
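
From what I can gather the stats part at least is a one-liner, so hopefully an easy sell (my understanding, not tested):

Code:
 -- refresh the planner statistics for the table after a big load
 ANALYZE data;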
 
Associate
Joined
19 Jul 2011
Posts
2,343
Just ask him to look at the IO counters for the database drives when you run your query :)

Alternatively put it in a way where he/she can be a hero.
"We're looking for help with making this query quicker. Can you check if there are any indexes we could add as I have a feeling we're missing a trick here".

Imagine using a dictionary to look for a bunch of words beginning with M. If you don't have an index (so you at least know to start from pages 110 through to 115), then you have to start at the beginning and check each page. That's what your server is doing for 7 million rows.
 
Associate
Joined
29 Dec 2004
Posts
421
Location
Fife, Scotland
I don't use the database you are using, but I suspect indexes and how they are interacting with the WHERE and ORDER BY clauses are at the bottom of the problem.

Have you looked at the query execution plan?

http://www.postgresql.org/docs/8.1/static/sql-explain.html

WHERE mid >400 AND timestamp > 1404913361 ORDER BY timestamp ASC

I would have thought it should run through in timestamp order from the given value, discarding rows where mid <= 400. That implies an index on timestamp, mid.

Note: I see someone else has suggested this index!
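
If it were me I'd try something along these lines and then re-check the plan (index name is just an example):

Code:
 CREATE INDEX idx_data_timestamp_mid ON data (timestamp, mid);
 EXPLAIN SELECT * FROM data WHERE mid > 400 AND timestamp > 1404913361 ORDER BY timestamp ASC;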
 