I have some data in a sql database and I’d like to calculate the slope. The data has this layout:
Date | Keyword | Score 2012-01-10 | ipad | 0.12 2012-01-11 | ipad | 0.17 2012-01-12 | ipad | 0.24 2012-01-10 | taco | 0.19 2012-01-11 | taco | 0.34 2012-01-12 | taco | 0.45
I’d like the final output to look like this by creating a new table using SQL:
Date | Keyword | Score | Slope 2012-01-10 | ipad | 0.12 | 0.06 2012-01-11 | ipad | 0.17 | 0.06 2012-01-12 | ipad | 0.24 | 0.06 2012-01-10 | taco | 0.19 | 0.13 2012-01-11 | taco | 0.34 | 0.13 2012-01-12 | taco | 0.45 | 0.13
To complicate things, not all Keywords have 3 dates worth of data, some have only 2 for instance.
The simpler the SQL the better since my database is proprietary and I’m not quite sure what formulas are available, although I know it can do OVER(PARTITION BY) if that helps. Thank you!
UPDATE: I define the slope as best fit y=mx+p aka in excel it would be =slope()
Here is another actual example that I usually manipulate in excel:
date keyword score slope 1/22/2012 water bottle 0.010885442 0.000334784 1/23/2012 water bottle 0.011203949 0.000334784 1/24/2012 water bottle 0.008460835 0.000334784 1/25/2012 water bottle 0.010363991 0.000334784 1/26/2012 water bottle 0.011800716 0.000334784 1/27/2012 water bottle 0.012948411 0.000334784 1/28/2012 water bottle 0.012732459 0.000334784 1/29/2012 water bottle 0.011682568 0.000334784
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
The cleanest one I could make:
SELECT Scores.Date, Scores.Keyword, Scores.Score, (N * Sum_XY - Sum_X * Sum_Y)/(N * Sum_X2 - Sum_X * Sum_X) AS Slope FROM Scores INNER JOIN ( SELECT Keyword, COUNT(*) AS N, SUM(CAST(Date as float)) AS Sum_X, SUM(CAST(Date as float) * CAST(Date as float)) AS Sum_X2, SUM(Score) AS Sum_Y, SUM(CAST(Date as float) * Score) AS Sum_XY FROM Scores GROUP BY Keyword ) G ON G.Keyword = Scores.Keyword;
It uses Simple Linear Regression to calculate the slope.
Result:
Date Keyword Score Slope 2012-01-22 water bottle 0,010885442 0,000334784345222076 2012-01-23 water bottle 0,011203949 0,000334784345222076 2012-01-24 water bottle 0,008460835 0,000334784345222076 2012-01-25 water bottle 0,010363991 0,000334784345222076 2012-01-26 water bottle 0,011800716 0,000334784345222076 2012-01-27 water bottle 0,012948411 0,000334784345222076 2012-01-28 water bottle 0,012732459 0,000334784345222076 2012-01-29 water bottle 0,011682568 0,000334784345222076
Every database system seems to have a different approach to converting dates to numbers:
- MySQL:
TO_SECONDS(date)
orTO_DAYS(date)
- Oracle:
TO_NUMBER(TO_CHAR(date, 'J'))
ordate - TO_DATE('1','yyyy')
- MS SQL Server:
CAST(date AS float)
(or equivalentCONVERT
)
Method 2
If you’re defining slope as just the slope from the earliest point to the latest point, and if score only increases with date, then you can get the output above with this:
SELECT * FROM scores JOIN (SELECT foo.keyword, (MAX(score)-MIN(score)) / DATEDIFF(MAX(date),MIN(date)) AS score FROM scores GROUP BY keyword) a USING(keyword);
However if you want linear regression, or if scores can decrease as well as increase with time, you’ll need something more complex.
Method 3
Cast to decimal does not give correct results for me, it is not linear to the dates. Use TO_DAYS(date_field)
instead, this becomes correct.
Method 4
Use this
SUM(CONVERT(float, datediff(dd, '1/1/1900', date_field)))
instead of
SUM(CAST(date_field AS float))
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0