Noobie to SQL. I have a simple query here that returns 70 million rows, and my work laptop cannot handle that volume when I import it into Tableau. Usually 20 million rows or fewer work fine. Here’s my problem.
Table name: Table1
Fields: UniqueID, State, Date, claim_type
Query:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021'
This gives me what I want, BUT, I can limit the query significantly if I count the number of uniqueIDs that have been used in 3 or more different states. I use this query to do that.
SELECT unique_id, count(distinct states), claim_type, date FROM table1 WHERE date >= '11-09-2021' GROUP BY Unique_id, claim_type, date HAVING COUNT(DISTINCT states) > 3
The only issue is, when I put this query into Tableau it only displays the FIRST state a unique_id showed up in, and the first date it showed up. A unique_id shows up in multiple states over multiple dates, so when I use this count aggregation it’s only giving me the first result and not the whole picture.
Any ideas here? I am totally lost and spent a whole business day trying to fix this
Expected output would be something like
uniqueID | state | claim type | Date
123 | Ohio | C | 01-01-2021
123 | Nebraska | I | 02-08-2021
123 | Georgia | D | 03-08-2021
Answers:
Method 1
If your table has only those four columns and your queries are based on date ranges, you need an index to help optimize that. If 70 million records exist, how far back does that go… years? If your data since 2021-09-11 is only, say, 30k records, that should be all you are blowing through for your results.
I would ensure you have an index based on (and in this order) (date, uniqueID, claim_type, states). Also, you mentioned you wanted a count of 3 OR MORE; your query’s > 3 will return counts of 4 or more unless you change it to COUNT(DISTINCT states) >= 3.
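For example (a sketch, assuming MySQL and the column names from the question; the index name is made up):
-- sketch: composite index in the order suggested above (name is arbitrary)
CREATE INDEX idx_date_id_claim_states ON table1 (date, uniqueID, claim_type, states);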
Then, to get the entries you care about, you need
SELECT date, uniqueID, claim_type
FROM table1
WHERE date >= '2021-09-11'
GROUP BY date, uniqueID, claim_type
HAVING COUNT(DISTINCT states) >= 3
This gives just the date/uniqueID/claim_type combinations that HAD 3 or more states. Then you would use THIS result set to get the other entries via
SELECT DISTINCT t1.date, t1.uniqueID, t1.claim_type, t1.states
FROM (
    SELECT date, uniqueID, claim_type
    FROM table1
    WHERE date >= '2021-09-11'
    GROUP BY date, uniqueID, claim_type
    HAVING COUNT(DISTINCT states) >= 3
) PQ
JOIN table1 t1
    ON PQ.date = t1.date
    AND PQ.uniqueID = t1.uniqueID
    AND PQ.claim_type = t1.claim_type
The "PQ" (pre-query) gets the qualified records. It then joins back to the original table, grabs all records matching those qualifying date/uniqueID/claim_type combinations, and returns all of their states.
Method 2
Yes, you are grouping rows, so you ‘lose’ information in the grouped result.
You won’t get 70m records with your grouped query.
Why don’t you split your import into smaller chunks? Like limiting the rows to chunks of, say, 15 million:
1st:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000;
2nd:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000 OFFSET 15000000;
3rd:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000 OFFSET 30000000;
and so on..
I know it’s not a perfect or very handy solution, but maybe it gets you to the desired outcome.
See this link for more information about LIMIT and OFFSET:
https://www.bitdegree.org/learn/mysql-limit-offset
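One caveat with this chunking approach: without an ORDER BY, the database is free to return rows in any order, so consecutive LIMIT/OFFSET pages can overlap or skip rows. A sketch of the second chunk with a deterministic sort (assuming uniqueID plus date is a stable enough ordering for this table):
-- sketch: second 15-million-row chunk with a deterministic ORDER BY so pages line up
SELECT uniqueID, states, claim_type, date
FROM table1
WHERE date >= '11-09-2021'
ORDER BY uniqueID, date
LIMIT 15000000 OFFSET 15000000;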
Method 3
It is wise in the long run to use the DATE datatype. That requires dates to look like '2021-09-11', not '09-11-2021'. That will let > correctly compare dates that are in two different years. If your data is coming from some source that formats it '11-09-2021', use STR_TO_DATE() to convert as it goes in; you can reconstruct that format on output via DATE_FORMAT().
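For example, assuming MySQL and that the incoming strings really are DD-MM-YYYY (the assumption stated at the end of this answer), the conversion could look like this; the inserted values are just illustrations:
-- sketch: parse a DD-MM-YYYY string into a DATE on the way in...
INSERT INTO table1 (uniqueID, states, claim_type, date)
VALUES (123, 'Ohio', 'C', STR_TO_DATE('11-09-2021', '%d-%m-%Y'));
-- ...and reconstruct the DD-MM-YYYY format on the way out
SELECT uniqueID, DATE_FORMAT(date, '%d-%m-%Y') AS formatted_date
FROM table1
WHERE date >= '2021-09-11';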
Once you have done that, we can talk about optimizing
SELECT unique_id, count(distinct states), claim_type, date FROM table1 WHERE date >= '2021-09-11' GROUP BY Unique_id, claim_type, date HAVING COUNT(DISTINCT states) > 3
Tentatively, I recommend this composite index to speed up the query:
INDEX(Unique_id, claim_type, date, states)
That will also help with your other query.
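In MySQL that index could be added like so (a sketch; the index name is arbitrary, and the column names should match your actual schema):
-- sketch: adding the suggested composite index
ALTER TABLE table1 ADD INDEX idx_id_claim_date_states (Unique_id, claim_type, date, states);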
(I am assuming the ambiguous '11-09-2021' is DD-MM-YYYY.)
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.