Machine Learning Model Ideas
- Looking at prior work and the data set, this page provides information and ideas on which model best fits the data set
- (Some of the documentation is provided with the support of ChatGPT and should not be taken as fact)
- First, the data set was split into 80% training data and 20% test data using train_test_split()
- The training data and test data were then normalized using norm() (a sketch of these two steps follows this list)
- Created a neural network
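As a rough illustration of the split and normalization steps, here is a minimal sketch assuming scikit-learn's train_test_split and a simple standardization computed from training-set statistics; the project's own norm() helper is not shown on this page, so its exact behavior, the file path, and the assumption of numeric columns are guesses.

```python
# Hedged sketch of the 80/20 split and normalization; the file path and the
# exact normalization performed by the project's norm() helper are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("flare_data.csv")  # hypothetical path to the flare data set

# 80% training data, 20% test data
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Standardize using statistics computed on the training data only
# (assumes the feature columns are numeric)
train_mean, train_std = train.mean(), train.std()
train_norm = (train - train_mean) / train_std
test_norm = (test - train_mean) / train_std
```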
The create_model function takes a single argument, input_len. Its purpose is to create a neural network model using the Keras API in Python.
The first line of the function creates an "Input" layer for the model, with the shape of the input being (len(train.columns), ). This means that the input to the model has a shape equal to the number of columns in a data set called "train".
Next, there are six dense layers defined in the code using the "Dense" function from the Keras API. Each dense layer is connected to the previous one using the Keras functional API's "( )( )" syntax: the first set of parentheses configures the layer itself, and the second set passes in the output of the previous layer. The number of units in each dense layer is specified using the "units" argument, and the activation function used by each dense layer is specified using the "activation" argument.
The first three dense layers are defined with 128, 128, and 64 units respectively and with the "relu" activation function. These layers form the bulk of the neural network model, with the input passing through these dense layers and the activations being computed in each layer.
The next three dense layers are defined with 32, 64, and 16 units respectively, using the "sigmoid" activation function; each of these feeds one of the model's three outputs. The "y1_output", "y2_output", and "y3_output" variables are the output layers of the model, each having a single unit and a different name specified using the "name" argument.
Finally, a "Model" object is created using the "Model" function from the Keras API. This is used to define the overall architecture of the model and connect all the layers together. The "inputs" argument specifies the input layer of the model, which is the "input_layer" variable created earlier, and the "outputs" argument specifies the output layers of the model, which are the "y1_output", "y2_output", and "y3_output" variables.
The "model.summary()" function call at the end of the code prints a summary of the model, which includes information about the number of parameters and the shapes of the layers in the model.
The function returns the "model" object, which can be used for training, evaluation, or prediction.
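Since the code itself is not shown on this page, the following is a minimal reconstruction of create_model based on the description above. The way the 32-, 64-, and 16-unit sigmoid layers feed three single-unit output layers is an assumption, and while the description computes the input shape from len(train.columns), the sketch uses the input_len argument so that it is self-contained.

```python
# Sketch of create_model as described above; the branch wiring is an assumption.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

def create_model(input_len):
    # Input layer whose width matches the number of feature columns
    # (the original reportedly used shape=(len(train.columns),))
    input_layer = Input(shape=(input_len,))

    # Shared fully connected layers with ReLU activations
    x = Dense(units=128, activation='relu')(input_layer)
    x = Dense(units=128, activation='relu')(x)
    x = Dense(units=64, activation='relu')(x)

    # Three sigmoid branches, one per output (widths taken from the text)
    b1 = Dense(units=32, activation='sigmoid')(x)
    b2 = Dense(units=64, activation='sigmoid')(x)
    b3 = Dense(units=16, activation='sigmoid')(x)

    # Single-unit, named output layers
    y1_output = Dense(units=1, name='y1_output')(b1)
    y2_output = Dense(units=1, name='y2_output')(b2)
    y3_output = Dense(units=1, name='y3_output')(b3)

    # Connect the input and the three outputs into one model
    model = Model(inputs=input_layer, outputs=[y1_output, y2_output, y3_output])
    model.summary()
    return model
```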
Layers are the building blocks of a neural network. In a neural network, layers perform transformations on the input data, and the output from one layer is passed as input to the next layer. The combination of multiple layers in a neural network enables it to model complex relationships between the input and output variables.
In the code, the layers are implemented using the "Dense" function from the Keras API, which creates a dense, fully-connected layer. The "units" argument specifies the number of neurons in the layer, and the "input_shape" argument specifies the shape of the input data.
Activation functions are mathematical functions that are applied element-wise to the outputs from each neuron in a layer. They introduce non-linearity into the output from the neurons, which is essential for modeling complex relationships between the input and output variables in a neural network.
There are several activation functions that are commonly used in neural networks, including:
- ReLU (Rectified Linear Unit): replaces all negative values in the output from a neuron with zero. It is defined as f(x) = max(0, x).
- Sigmoid: compresses the output from a neuron to the range between 0 and 1. It is defined as f(x) = 1 / (1 + exp(-x)).
- Tanh (Hyperbolic Tangent): compresses the output from a neuron to the range between -1 and 1. It is defined as f(x) = 2 / (1 + exp(-2x)) - 1.

In the code, the activation functions are specified using the "activation" argument in the "Dense" function.
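As a small illustration (not taken from the project code), the formulas above can be written directly in NumPy:

```python
# Element-wise activation functions, matching the formulas listed above
import numpy as np

def relu(x):
    return np.maximum(0, x)               # f(x) = max(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))           # f(x) = 1 / (1 + exp(-x))

def tanh(x):
    return 2 / (1 + np.exp(-2 * x)) - 1   # equivalent to np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))
print(sigmoid(x))
print(tanh(x))
```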
- Project
- Added an additional column to the dataset, 'sumFlares', which is the total number of X-, C-, and M-class flares in each row (see the sketch below).
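A minimal sketch of how such a column could be added with pandas; the flare-count column names used here ('c_class', 'm_class', 'x_class') and the file path are assumptions and may differ in the actual data set.

```python
# Hedged sketch of the 'sumFlares' column; column names are assumptions
import pandas as pd

df = pd.read_csv("flare_data.csv")  # hypothetical path
df["sumFlares"] = df["c_class"] + df["m_class"] + df["x_class"]
```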
Created a graph using Power BI which maps the influence of each column on the occurrence of solar flares. This graph showed that a LargestSpot value of 1 or 2 increases the likelihood of a solar flare. They also found that an 'activity' value of 1 means that solar flares are much less likely to form.
The Keras model was chosen.
Multiple machine learning models were used and their accuracies were compared; the one with the highest accuracy was saved. Approximately 10-30% of the data was set aside as test data. A sketch of this comparison follows the list below.
- Support Vector Machines (SVM)
- Decision Tree
- Naive Bayes
  - Very simple to implement
  - Often outperforms more sophisticated models
- Linear Regression
- Logistic Regression
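Below is a minimal sketch of how such a comparison might look with scikit-learn, assuming a binary "did a flare occur" target built from sumFlares and already-numeric features; the column names, the 20% test split, and the use of joblib to save the best model are all assumptions. Linear regression is omitted here because it targets a continuous outcome (such as flare counts) rather than classification accuracy.

```python
# Hedged sketch of comparing several models and saving the most accurate one;
# feature/target columns are assumptions and features are assumed numeric.
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("flare_data.csv")            # hypothetical path
X = df.drop(columns=["sumFlares"])            # assumed feature set
y = (df["sumFlares"] > 0).astype(int)         # assumed binary target: any flare?

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

best_name, best_model, best_acc = None, None, 0.0
for name, model in models.items():
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)         # accuracy on the held-out split
    print(f"{name}: {acc:.3f}")
    if acc > best_acc:
        best_name, best_model, best_acc = name, model, acc

print(f"Best model: {best_name} ({best_acc:.3f})")
joblib.dump(best_model, "best_model.joblib")  # save the most accurate model
```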
My research on predictive analysis and correlation has yielded some interesting findings. First and foremost, regression appears to be what we are looking for, and it is really the best bet when trying to use a large dataset like this for any kind of prediction. Before going further, we must understand exactly what a correlation and a correlation coefficient tell us.
A correlation coefficient is a descriptive statistic; it summarizes data without really letting you infer anything. This is where statistical significance comes into play. Building on descriptive measures like central tendency, distribution, and variability, you can use an inferential test such as an "F test" or a "t test" to calculate the statistical significance of your data. Now what does it mean when something is statistically significant? It simply means that it is unlikely to happen solely by chance or random factors, which must mean that there are factors that affect whether it will occur or not; in our case, whether a solar flare will occur or not. It is to be expected that there are factors at play that affect the outcome of things, but when doing formal analysis of a dataset, it is important to prove this as a fact. This will also help in determining exactly which variables affect our outcome. By running statistical significance testing on pairs of variables, you can find exactly which ones are likely to actually affect the outcome.
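As a sketch of what a significance check on a pair of variables could look like, SciPy provides both a correlation coefficient with its p-value and a t test. The column names below are taken from the notes above, but the file path is an assumption.

```python
# Hedged sketch: correlation coefficient, its p-value, and a t test with SciPy
import pandas as pd
from scipy import stats

df = pd.read_csv("flare_data.csv")                  # hypothetical path

# Pearson correlation coefficient plus the p-value (statistical significance)
r, p_value = stats.pearsonr(df["LargestSpot"], df["sumFlares"])
print(f"r = {r:.3f}, p = {p_value:.4f}")

# Independent-samples t test: do rows with activity == 1 differ from the rest?
group_a = df.loc[df["activity"] == 1, "sumFlares"]
group_b = df.loc[df["activity"] != 1, "sumFlares"]
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {t_p:.4f}")
```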
With this in mind, we can use the correlations that we already have to do regression. There are a couple of ways to go about this. A simple linear regression will give us a fairly general prediction of outcomes: based on the correlation coefficient, the prediction tells us where the data trends. If the correlation is positive, the prediction will show the two variables increasing together, and vice versa. You can go fairly in-depth with linear regression, so it is certainly something to consider for our predictive analysis.

There are also decision trees. In our case, I believe a regression tree is the most useful kind. A regression tree predicts continuous values based on previous data; essentially, it predicts what is likely to happen based on past events. Since our data set is constructed entirely from past events, I think this suits us perfectly. In essence, a regression tree uses linear regression itself, so it really all comes back to that. There are also classification trees, which determine whether something has happened or not, giving binary yes-or-no outcomes. Classification trees can be used to great effect and, from my understanding, they and regression trees make up the backbone of machine learning analysis, so they are certainly something to consider. However, a classification tree could be redundant alongside statistical significance testing: they are two different tests, but they give you roughly the same information. A classification tree outputs a "yes" or "no", whereas statistical significance is more of a "most likely", so in that sense it could be useful to do both, as each can help reinforce conclusions made by the other.
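The sketch below shows, under the same column-name assumptions as earlier, how these three approaches could be tried with scikit-learn: a linear regression and a regression tree on the continuous flare count, and a classification tree on a yes/no flare label.

```python
# Hedged sketch of linear regression, a regression tree, and a classification tree
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

df = pd.read_csv("flare_data.csv")                  # hypothetical path
X = df.drop(columns=["sumFlares"])                  # assumed numeric features

# Continuous target (flare count) for the regression approaches
y_count = df["sumFlares"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y_count, test_size=0.2, random_state=42)
linreg = LinearRegression().fit(X_train, y_train)
reg_tree = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
print("linear regression R^2:", linreg.score(X_test, y_test))
print("regression tree   R^2:", reg_tree.score(X_test, y_test))

# Binary target (did any flare occur?) for the classification tree
y_flag = (df["sumFlares"] > 0).astype(int)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X, y_flag, test_size=0.2, random_state=42)
clf_tree = DecisionTreeClassifier(max_depth=4).fit(Xc_train, yc_train)
print("classification tree accuracy:", clf_tree.score(Xc_test, yc_test))
```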
Another way to approach predictive analysis is to use Bayes' Theorem. As we know, this theorem describes the probability of an event based on prior knowledge of conditions related to the event. A Naive Bayes approach in particular would be fairly easy to apply to our dataset. Like the trees, I believe using this alongside significance testing would be a worthwhile endeavor and would lead to a greater understanding of our dataset.
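As a worked illustration of the theorem itself (separate from the scikit-learn Naive Bayes model in the earlier comparison sketch), the conditional probability of a flare given a condition can be estimated directly from the data; the column names follow the earlier assumptions.

```python
# Hedged sketch: Bayes' theorem estimated from the data set
# P(flare | activity = 1) = P(activity = 1 | flare) * P(flare) / P(activity = 1)
import pandas as pd

df = pd.read_csv("flare_data.csv")             # hypothetical path
flare = df["sumFlares"] > 0                    # assumed flare indicator
active = df["activity"] == 1                   # assumed condition of interest

p_flare = flare.mean()                         # P(flare)
p_active_given_flare = active[flare].mean()    # P(activity = 1 | flare)
p_active = active.mean()                       # P(activity = 1)

p_flare_given_active = p_active_given_flare * p_flare / p_active
print(f"P(flare | activity = 1) = {p_flare_given_active:.3f}")
```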
In summary, going forward, it would be worthwhile to consider and use the following:
- Linear Regression
- Statistical Significance
- Regression Tree
- Classification Tree
- Bayes' Theorem
Each of these methods uses sample data either to predict outcomes or to provide deeper insight into the variables that can aid in predicting outcomes. They are also all widely used in machine learning, and if that is how we want to approach our predictive modeling, then these methods are where we should start.