Questions About Microsoft_Linear_Regression Algorithm

Jul 2, 2007

Currently I want to run a vanilla multivariate regression and get some statistics back about the regression that is built. For instance, besides the coefficients, I also want the two-sided p-values on the coefficients and the R2 of the model.

I've tried playing with the Microsoft_Linear_Regression algorithm and have run into two issues. I'm doing all this programmatically using DMX queries rather than through the BI studio.

(a) I can never get the coefficients from the regression to match with results I would get from running R or Excel. The results are close but still significantly off. I suspect this is because the Linear Regression is just a subset of the Decision/Regression Trees functionality, in which case some kind of Bayesian prior is being incorporated here. Is that the issue? And if so, is there some way to turn off the Bayesian scoring and get a vanilla multivariate regression? I don't see anything in the inputs to the linear regression that would let me do this, and even running Microsoft_Decision_Trees with a few different settings, I can't get the output I'm looking for. If there's no way to turn off the Bayesian scoring, can someone explain to me what the prior being used here is and how Bayesian learning is being applied to the regression?

(b) Using the Generic Tree Viewer, I see that there are a few "statistics" values in the Node_Distribution, but I'm not sure what they're referring to. One of them looks like it might be the MSE. I could play with this some more to find out, but I'm hoping someone here can save me that work and tell me what these numbers are. Hopefully they will constitute enough information for me to rebuild the p-values and the R2.
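For reference when comparing against R or Excel, here is a minimal sketch of an ordinary least squares fit with no prior of any kind, written in plain Python. The names `solve` and `ols` are hypothetical helpers for illustration and are not part of Analysis Services or DMX. The sketch returns the coefficients, their t statistics, and R2; the two-sided p-values would then follow from a Student t distribution with n - k degrees of freedom, a step left out here because the Python standard library has no t-distribution CDF.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return (beta, t_stats, r2)."""
    x = [[1.0] + list(r) for r in rows]          # prepend the intercept column
    n, k = len(x), len(x[0])
    xtx = [[sum(x[i][a] * x[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(x[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtx, xty)                       # normal equations (X'X) beta = X'y
    resid = [y[i] - sum(bj * xj for bj, xj in zip(beta, x[i])) for i in range(n)]
    sse = sum(e * e for e in resid)
    mean_y = sum(y) / n
    sst = sum((v - mean_y) ** 2 for v in y)
    r2 = 1.0 - sse / sst
    sigma2 = sse / (n - k)                       # unbiased residual variance
    t_stats = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        var_j = sigma2 * solve(xtx, unit)[j]     # j-th diagonal entry of sigma2 * (X'X)^-1
        t_stats.append(beta[j] / math.sqrt(var_j))
    return beta, t_stats, r2
```

The sketch assumes more rows than coefficients and a fit that is not exact, since a zero residual variance would make the t statistics undefined.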

Hello, I was working with the Microsoft Time Series (MTS) algorithm on simulated data in order to understand it better. I simulated 24 points of the model y[t] = 5.74 - 0.1486 y[t-1] + e[t] and 19 points of the model y[t] = 10.48 - 0.0486 y[t-1] + e[t] (a change of level), where e ~ N(0, 0.01). The MTS output is: if time >= 23.5 then AR(3), else AR(1): y[t] = 6.23 - 0.2536 y[t-1]. So I am wondering: how does the algorithm work with the time variable as a split variable? Is it treated like the other variables? Does it consider only 4 time points? Why does the MTS algorithm produce AR(p) models where p is rather large (as in this example: I simulated an AR(1) model and the output is an AR(3) model)? What about parsimonious models? An AR model is a stationary model, so what happens if the data have a trend? Do we need to remove the trend before the MTS algorithm can be used? Thanks for your time.
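The simulated series described above can be reproduced with a short sketch like the following. The function name `simulate_ar1`, the seed, and the starting value are assumptions made for illustration, and N(0, 0.01) is read here as a variance of 0.01 (standard deviation 0.1); the post does not say which convention was used.

```python
import random

def simulate_ar1(n, intercept, phi, sigma, y0, rng):
    """Return n points of y[t] = intercept + phi * y[t-1] + e[t], with e ~ N(0, sigma^2)."""
    series, prev = [], y0
    for _ in range(n):
        prev = intercept + phi * prev + rng.gauss(0.0, sigma)
        series.append(prev)
    return series

rng = random.Random(0)                 # fixed seed so the run is repeatable
sigma = 0.01 ** 0.5                    # assumption: 0.01 is the variance
start = 5.74 / (1 + 0.1486)            # stationary mean of the first regime
first = simulate_ar1(24, 5.74, -0.1486, sigma, start, rng)
second = simulate_ar1(19, 10.48, -0.0486, sigma, first[-1], rng)
series = first + second                # 43 points with a level shift after point 24
```

The first regime fluctuates around 5.74 / 1.1486 (about 5.0) and the second around 10.48 / 1.0486 (about 10.0), so the change of level falls between points 24 and 25.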

First of all, I would like to greet everybody, as I'm new to this forum and, in fact, new to Data Mining.

To introduce myself: I'm a student of Computer Science and I'm trying to use the Time Series algorithm for weather analysis. I know that forecasting weather is a hopeless task even for the fastest computers in the world, but what I'm trying to do is a kind of a posteriori analysis of historical data, to notice dependencies or characteristic weather behavior in a specified region and perhaps make some short-term predictions.

I tried the Time Series algorithm, although I have some doubts about the methodological justification of this choice (if you have any critical comments, please share them with me). But my main questions are about the usage of the algorithm itself:

I've read the documentation and a tutorial on this page for historical predictions, but I still don't know what exactly HistoricalModelCount and HistoricalModelGap are. I know that my historical predictions are bounded by -HistoricalModelCount*HistoricalModelGap*, but that is rather operational knowledge... The explanation is always clouded by an "internal model" phrase. Can you point me to a document where I can find more detailed information? (What is the form of the model? How is it built? etc.)

Periodicity Hint. How should I treat these optional values? Are they other possible periods of the data? I have weather measurements made every six hours for thirteen years**, so is it a good choice to set this parameter to {365*4,4} (the first value for a year and the second for a day)?
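The arithmetic behind the proposed {365*4,4} value can be written out as a tiny sketch. The helper name `periodicity_hint` is hypothetical, and the braces-and-commas string simply mirrors the notation used in the question; whether these are the right hints for this weather data is the open question being asked.

```python
def periodicity_hint(hours_between_samples, days_per_year=365):
    """Build a hint string from the sampling interval: samples per year, then samples per day."""
    samples_per_day = 24 // hours_between_samples
    samples_per_year = samples_per_day * days_per_year
    return "{%d, %d}" % (samples_per_year, samples_per_day)

hint = periodicity_hint(6)   # six-hourly measurements
```

For six-hourly data this gives 4 samples per day and 1460 per year, ignoring leap days.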

This is a technical question and I'm really ashamed to bother you with it. On the time chart in the model viewer I can see dates from the last year only. Zooming in and out, and clicking insanely on every pixel of the screen, did not give any result (apart from broken mouse buttons). Is it possible to browse that data in the mining model viewer chart? Thank you in advance for your replies!

*This formula suggests how these parameters could work, but I would like to know for sure; I don't want to make some awful mistakes in my project. :-)
**Of course I plan to reduce the amount of data, but the period will stay.

The first question is about the performance of the Time Series algorithm.

Using the SQL Server 2005 Time Series algorithm, I built a data mining model. But after three days, it is still training. The data has 2,200,00 rows.

So what can I do to improve the processing speed?

Thanks!

The second question is about parameters in the Data Mining Query Task.

The Data Mining Query Task is used to get data from a data mining model. On the mining model tab, I chose a mining model. On the query tab, I wrote a DMX query, "select flattened top 100 predicttimeseries([Xssl],1) from [Time Series XSSL]". Finally, I chose a table to receive the data from the mining model.

I am confused about the key column of the case table and the key time column of the nested table when using the Time Series algorithm.

In my case, the case table structure is as below:

Territory key text (the ID is actually dimrisk_key; in this case, I use the name column binding to bind it to the Territory column of the case table Dimrisks),

While the nested table structure is as below:

Cal_month key time (in this case, the ID is actually dimdate_key; again, I used the name column binding property to bind Cal_month to the ID)

So my question is: since the key column of the case table has been set to Territory, does the model training still cover all the cases (rows) based on the ID of the table?

Also, in the nested table, since the key time column has been set to Cal_month rather than the Dimdate_key of the nested table, would the single series be based on Cal_month?

I hope this is clear. Thanks for your advice and help.

I am looking forward to hearing from you shortly.

I am new to DM and I am not sure which algorithm would be best to use.

I am trying to build a custom comparator application that companies can use to compare themselves against other companies based on certain pieces of information. I need to group a company with 10 other companies based on 6 attributes. I need the ability to apply weightings to each of the 6 attributes and have those taken into consideration when determining which 10 other companies each company is grouped with. Each group must contain 11 members: the company of the logged-in user and 10 other companies that it will be compared against.

At first I thought that clustering would be a good fit for this, but I cannot see a way to mandate that each cluster contain exactly 11 members, I cannot see a way to weight the inputs, and I think each company can only be in one cluster at a time, which does not meet my requirements.
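The requirement as stated (a fixed group size, per-attribute weights, and overlapping groups) can be pictured as a weighted nearest-neighbour lookup rather than a clustering. Below is a minimal sketch under that assumption; `peer_group` and its arguments are hypothetical names, it is not a feature of any Microsoft mining algorithm, and it presumes the attributes are numeric and already scaled to comparable ranges.

```python
import math

def peer_group(target, companies, weights, size=10):
    """Return the names of the `size` companies closest to `target`.

    companies: dict mapping company name -> list of numeric attribute values
    weights:   one non-negative weight per attribute
    """
    def distance(a, b):
        # Weighted Euclidean distance: heavier weights make an attribute matter more.
        return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

    others = [(distance(companies[target], vec), name)
              for name, vec in companies.items() if name != target]
    return [name for _, name in sorted(others)[:size]]
```

Each company gets its own group of exactly 10 peers, and one company can appear in many groups, which is the behaviour the clustering approach could not provide.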

Well, I have read in Claude Seidman's book about data mining that some of the algorithms inside Microsoft Decision Trees are the CART, CHAID and C4.5 algorithms. Could anyone explain these tree algorithms to me, and explain how they are used together in one case?

Hello, do you know if the algorithm for the BINARY_CHECKSUM function is documented somewhere? I would like to use it to avoid returning some string fields from the server. By returning only the checksum I could look up the string in a hash table, and I think this could make the code more efficient on slow connections. Thanks in advance and kind regards, Orly Junior

What kind of algorithm does the MAX command use? I have a table where I need to get the last Transaction ID and increment it by 1, so I can use it as the next TransID every time I insert a new record into the table. I use the MAX command to obtain the last TransID in the table in this process. However, someone suggested that there is a problem with this: if multiple users try to insert a record into the same table and processing is slow, they might come up with the same next TransID. He came up with the idea of having a separate table that contains only the TransID and using this table to determine the next TransID. Will this really make a difference as far as processing speed is concerned, or is using a MAX command on the same table to come up with the next TransID enough? Do you have a better suggestion?
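The concern about two users computing the same next TransID can be demonstrated with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in database; the table and column names are hypothetical. In SQL Server the comparable approach would be an IDENTITY column, so that the server assigns the key at insert time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (trans_id INTEGER PRIMARY KEY, note TEXT)")
conn.execute("INSERT INTO trans (trans_id, note) VALUES (1, 'first')")

def next_id_via_max(connection):
    """Read MAX(trans_id) + 1 without inserting; this read-then-write gap is the racy step."""
    (current,) = connection.execute("SELECT MAX(trans_id) FROM trans").fetchone()
    return (current or 0) + 1

# Two users both read before either one inserts, so both compute the same id.
id_a = next_id_via_max(conn)
id_b = next_id_via_max(conn)
collision = id_a == id_b

# Letting the database assign the key at insert time closes that gap.
cur_a = conn.execute("INSERT INTO trans (note) VALUES ('user A')")
cur_b = conn.execute("INSERT INTO trans (note) VALUES ('user B')")
generated = (cur_a.lastrowid, cur_b.lastrowid)
```

The issue is correctness rather than speed: any scheme that reads a value and later writes it back needs either a lock around both steps or a key generated by the database itself.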

I have a few questions regarding the Clustering algorithm.

If I process the clustering model with K (the number of clusters) ranging from 2 to n, how can I find a measure of variation and loss of information for each model (any kind of measure)? The purpose is to decide which K to take.

Which clustering method is better to use when segmenting data: K-means or EM?

I want to predict which products can be sold together. Please help me decide which algorithm is best (association, clustering or decision trees), and please let me know how to use the case table and nested table. My table structure is
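One common generic measure for comparing different values of K is the within-cluster sum of squares (WCSS), which shrinks as K grows; the K where the improvement levels off is often taken as a reasonable choice. The sketch below computes it with a plain k-means written from scratch. It runs outside Analysis Services, the function names are hypothetical, and it is not the scoring that the Microsoft Clustering algorithm uses internally.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def kmeans_wcss(points, k, iterations=20, seed=0):
    """Run plain k-means and return the within-cluster sum of squared distances."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

Calling `kmeans_wcss(points, k)` for k = 2 to n and comparing the results gives one curve to judge K by. Because k-means depends on its starting centers, trying several seeds per K is advisable.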

1. Is it legal and OK to use an MSDN SQL copy in a production environment, or is it strictly for test environments?

2. If I own a legal copy of SQL 7 with 5 CALs, can I legally use SQL MSDE and have more than 5 people access my SQL server, or am I also limited to 5 users as with my original?

Hi, I am using SQL Server 2005 as the back end for my project. We are developing a stand-alone web application for a client, so we need to host this application on his server. He is not willing to install SQL Server 2005 on his server, so we are placing the .mdf file in the data directory of the project.

Before that, when I developed against SQL Server 2005, I used the aes_256 algorithm to encrypt and decrypt the pwd column using symmetric keys. It was working fine.

But when I took the .mdf file and added it to my project, it threw an error at creation of the symmetric key: "Either no algorithm has been specified or the bitlength and the algorithm specified for the key are not available in this installation of Windows."

I'm writing my master's thesis about a new plug-in algorithm, the LVQ algorithm. I worked through the tutorial with the pair_wise_linear_regression algorithm and I have some doubts. I searched for the code of the algorithm in the tutorial files and did not find it. I have my new algorithm programmed in C++ and ready to attach, but I don't know where to put it: in which file do I have to put it to start defining the COM interfaces? And in which file of the tutorial's SRC folder is the code of the pair_wise_linear_regression algorithm?

Obviously, for Person1 and 200501 I expect to see $3000 in the MS Time Series Viewer, correct? Instead I see REVENUE(actual) - 200501 VALUE = XXX, where XXX is an absolutely different number.

Also, there are negative numbers in the forecast area, which is not correct from a business point of view. Person1, who is a tough guy, tried to shoot me. What am I doing wrong? Could you please give me an idea of how to extract correct historical and predicted information?

I have code for a nearest neighbour algorithm, and I want to build a data mining algorithm using that code.

I have the following link that includes the source code for a sample plug-in algorithm written in C#.

The managed plug-in framework is available for download here: http://www.microsoft.com/downloads/details.aspx?familyid=DF0BA5AA-B4BD-4705-AA0A-B477BA72A9CB&displaylang=en#DMAPI

But I am confused about where to insert my algorithm logic.

What is the algorithm that generates the itemsets in the Association model? I'm looking to possibly use this part of the Association algorithm (i.e. the grouping into itemsets) in a separate plug-in algorithm.
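As a generic illustration of how frequent itemsets can be generated, here is a minimal Apriori-style sketch. It is offered as an example of the technique and is not taken from the Association model's implementation; the function name `frequent_itemsets` and the support threshold are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    baskets = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for basket in baskets for item in basket}
    result = {}
    size = 1
    while current:
        # Count how many baskets contain each candidate itemset.
        counts = {c: sum(1 for b in baskets if c <= b) for c in current}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        size += 1
        # Join step: unions of two frequent sets that have exactly the next size.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == size}
        # Prune step: keep a candidate only if all of its smaller subsets are frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in frequent for s in combinations(c, size - 1))}
    return result
```

The pruning relies on the fact that an itemset can only be frequent if every one of its subsets is frequent, which keeps the number of candidates counted at each level small.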

I am building data mining models to predict the amount of data storage in GB we will need in the future based on what we have used in the past. I have a table for each device with the amount of storage on that device for each day going back one year. I am using the Time Series algorithm to build these mining models. In many cases, where the storage size does not change abruptly, the model is able to predict several periods forward. However, when there are abrupt changes in storage size (due to factors such as truncating transaction logs on the database), the mining model will not predict more than two periods. Is there something I can change in the parameters the Time Series algorithm uses so that it can predict farther forward in time, or is this the wrong algorithm for data that has a sawtooth pattern with a negative linear component?

I am on a project that will search for an optimal route for a user from a starting point to his/her destination on a map stored in my SQL Server 2005. I have created two versions to test the performance of the path finding algorithm. I have a few classes, which are:

PriorityQueue class, implemented as a List() object plus code to keep it sorted
PathNode class, whose instances are the nodes of the search tree and carry the heuristic value
DataSource class, which stores data retrieved from SQL Server 2005 in RAM for faster path finding
PathFinding class, which implements the path-searching algorithm (based on the A* algorithm), with a PriorityQueue as the open list, a List() object as the closed list, and PathNode objects as the entries in both lists; it retrieves its data from the DataSource object, which loads the whole table from SQL Server 2005

In the first version, I simply used a SELECT query to retrieve each corresponding node's data from SQL Server 2005, which made performance very poor, as I confirmed with SQL Server Profiler. In the current version, I load all the data into RAM to speed up execution, which has successfully achieved under 1 second as opposed to about 8 seconds for the first version.

Now, my problem is to port the algorithm part to my SQL Server 2005 using SQL CLR integration, to achieve better results without burdening the client PC. My question is: how should I go about this? I tried before and got several errors, such as one saying I need to serialize my current PathNode class, which I did. Do I need to make every class UDT compatible, or is there another way?
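The structure described above (an open list ordered by priority, a closed list, and nodes carrying a heuristic value) can be sketched compactly as follows. This is a generic illustration and not the poster's code: it uses a binary heap for the open list instead of a sorted list, and the graph format and the `a_star` name are assumptions.

```python
import heapq

def a_star(graph, heuristic, start, goal):
    """Return (cost, path) for the cheapest route from start to goal, or None if unreachable.

    graph:     dict mapping node -> list of (neighbour, edge_cost) pairs
    heuristic: function giving an estimate of the remaining cost from a node to the goal
    """
    # Each heap entry is (estimated total cost, cost so far, node, path taken).
    open_heap = [(heuristic(start), 0.0, start, [start])]
    closed = set()
    while open_heap:
        _, cost, node, path = heapq.heappop(open_heap)
        if node == goal:
            return cost, path
        if node in closed:
            continue                      # a cheaper entry for this node was already expanded
        closed.add(node)
        for neighbour, weight in graph.get(node, []):
            if neighbour not in closed:
                new_cost = cost + weight
                heapq.heappush(open_heap, (new_cost + heuristic(neighbour), new_cost,
                                           neighbour, path + [neighbour]))
    return None
```

The sketch assumes a consistent heuristic (a constant zero reduces it to Dijkstra's algorithm). A heap makes each push and pop logarithmic in the size of the open list, which avoids re-sorting the whole list after every insertion.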

I was walking through the Text Mining example, which at one step required me to set the algorithm parameter MAXIMUM_OUTPUT_ATTRIBUTES=0. When I tried that, the project would not build, giving the error: Error (Data mining): The 'MAXIMUM_INPUT_ATTRIBUTES' data mining parameter is not valid for the 'XYZ' model.

I got the same error when I tried to set Hidden_Node_ratio for Microsoft_Neural_Network. When I open "Set Algorithm Properties" from the Mining Model, I do not see these properties listed with their defaults.

I have installed:
SQL Server 2005 Standard Edition
Microsoft SQL Server Management Studio 9.00.1399.00
Microsoft Analysis Services Client Tools 2005.090.1399.00

I'm using SQL Server 2005. The problem I have is as follows. I have several production lines and, as with everything, parts in the line tend to break. I have data on all the breaks that occurred in the last 2 years. What I want to do is predict the next break and the production line it's going to happen on. I would also like to go to a future date and check what possible breaks might occur on that date. I've run quite a few models but none of them helps me with future events. I think I might be using the wrong algorithm, or I'm just not doing it right. Could somebody please suggest an algorithm and maybe point me to a web site that has a tutorial similar to my problem?

As we know, a clustering algorithm comes with the managed plug-in algorithm API.

Has anyone developed any other plug-in algorithm? I want to check which things need to be modified. I am not a data mining algorithm developer; I just want to see where we have to make changes. It would be better if I could get source code for an algorithm other than clustering.

I need to create a set of cases for a project that uses the Microsoft Association Rules algorithm to recommend products to customers. My question is: must the training set include all customer transactions, or is some percentage of the total transactions sufficient? If I do not use all customer transactions, could the algorithm fail to consider some products in its itemsets or rules and therefore be unable to make recommendations about them?

Hi, I am a novice data miner, working primarily in the BI field. I want to learn more about Data Mining, so I am doing some experimenting.

I have a question regarding input attributes. I am particularly wondering about the Neural Network algorithm, but also about Data Mining in general. What I am thinking about is whether, and if so to what extent, I should create derived attributes for the algorithms. I'll try to clarify with an example:

Let's say I am analysing sales performance for departments in a large company. Some of those departments have a high staff turnover, which might affect sales negatively (although I don't know that...). The high staff turnover could be detected, by the algorithm and by humans, by looking at each sale and which salesperson handled it. If one department has many more distinct salespersons than other departments of the same size during the same time period, this is a sign of high staff turnover.

Now, is this info enough for the algorithm? Or should I add a column to the case dataset, where I discretize the staff turnover as "High, Medium, Low"? Does this help the algorithm, or can it affect the performance?

I hope you get the idea of my question; otherwise, ask me!
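The derived column described in the question could be built along these lines. This is only a sketch of the preprocessing step: the function name `turnover_band`, the input shape, and the two thresholds are hypothetical, and it does not settle whether the extra column helps a given algorithm.

```python
from collections import defaultdict

def turnover_band(sales, low=3, high=6):
    """Map each department to Low/Medium/High by its number of distinct salespersons.

    sales: iterable of (department, salesperson) pairs, one per sale
    low, high: hypothetical cut-offs; at most `low` is Low, at least `high` is High
    """
    people = defaultdict(set)
    for department, salesperson in sales:
        people[department].add(salesperson)
    bands = {}
    for department, names in people.items():
        count = len(names)
        bands[department] = "Low" if count <= low else "High" if count >= high else "Medium"
    return bands
```

As the question notes, the comparison is only meaningful between departments of similar size over the same time period, so in practice the count would need to be normalised before banding.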

We are running SQL Server 7.0 SP2, and are experiencing the following out-of-space error message:

"Could not allocate new page for database 'FooBar'. There are no more pages available in filegroup SECONDARY. Space can be created by dropping objects, adding additional files, or allowing file growth."

Needless to say, the database is set for 10% unlimited autogrowth and there IS available space in the partition where the filegroup resides.

Any ideas as to why this is happening? What is SQL Server's algorithm for allocating space when growing a database? Must it satisfy the request in one 'extent' and the cause of our problem is that our disk is fragmented?