Given an overfit vector and a dataset, use genetic algorithms to reduce overfitting by eliminating the noise and achieving a lower error.
The function is a linear function on 11 dimensions with mean squared error as the error function.
The solution is an 11-dimensional vector of the form:
About Genetic Algorithms
Genetic algorithms take an approach similar to natural evolution, using "survival of the fittest" to find a solution that meets our requirements.
The algorithm mimics natural selection, but in a very simplified manner.
A pool of vectors which satisfy the constraints described by the problem.
Each individual is an ordered list of values, each of which is called a gene.
In a broad sense, the algorithm is a loop with 3 abstract steps with a termination condition.
A set of vectors is picked randomly or given by the problem statement.
The population undergoes crossover pairwise: 2 individuals, referred to as parents, create 2 new individuals, referred to as offspring, which inherit their genes from either parent depending on the crossover technique.
A subset of the population undergoes changes in a subset of their genes.
The survival of an individual is based on its fitness: a function that maps an individual to a number. The higher an individual's fitness, the higher its chance of surviving selection. The population that remains after selection is the new generation.
The algorithm can be stopped after a generation if it satisfies the termination condition. The usual termination conditions are either a fixed number of generations or the convergence of the fitness.
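The loop described above can be sketched as follows. This is a minimal, abstract version: the operators, the elitist "keep the best of parents and offspring" selection, and the default generation count are illustrative assumptions, not our exact settings.

```python
def genetic_algorithm(init_population, fitness, crossover, mutate,
                      n_generations=50):
    """Abstract GA loop: crossover -> mutation -> selection, repeated
    until the termination condition (here, a generation count) is met."""
    population = [list(ind) for ind in init_population]
    for _ in range(n_generations):
        # Crossover: each pair of parents produces offspring.
        offspring = []
        for i in range(0, len(population) - 1, 2):
            offspring.extend(crossover(population[i], population[i + 1]))
        # Mutation: each offspring may change in a subset of its genes.
        offspring = [mutate(ind) for ind in offspring]
        # Selection: the fittest individuals form the new generation.
        combined = population + offspring
        combined.sort(key=fitness, reverse=True)
        population = combined[:len(population)]
    return population
```

The returned population is sorted by fitness, so its first element is the best individual found.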
The basic genetic algorithm described above gives an abstract understanding of the method; the specifics of each step depend on the problem. We tried multiple implementations of each step, which finally led us from the original overfit vector to our current submission.
Given that the problem statement had already given us the over-fit vector, the goal was to eliminate the noise from the weights.
We ran our original program (by program I mean our genetic algorithm implementation) on the given overfit vector and observed the succeeding generations, in order to identify the noise in the given data.
Why? Because identifying the noise was the key to eliminating overfitting. Once we had a rough idea of which elements of the vector corresponded to noise, we set them to 0 manually.
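The manual zeroing step amounts to the following. Both the vector values and the indices treated as noise here are made up for illustration; they are not the actual overfit vector or the indices we identified.

```python
# Hypothetical overfit vector: some weights have suspiciously tiny
# magnitudes, which is what "noise" looked like across generations.
overfit_vector = [1.2e-12, 0.83, -6.2, 5.4e-11, 0.03, -2.1,
                  1.5e-9, 7.9, -0.45, 3.2, 1.1e-10]

# Indices whose weights we would treat as noise (assumed, for example).
noise_indices = [0, 3, 6, 10]

denoised = list(overfit_vector)
for i in noise_indices:
    denoised[i] = 0.0  # eliminate the suspected noise manually
```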
Crossover has a large impact when the population size is large, but in our case we had to keep it small. Each crossover of a pair of parents produces 2 offspring.
Over the course of the project, we tried and tested many heuristics for crossover.
Keep the Best
Before all the individuals are sent for crossover and mutation, we kept unchanged copies of some of them (in the final version, just 1) in the population by default.
We allowed each such individual to be crossed over, or in other words "to mate", but before that made sure to keep one unchanged copy of it in the population.
Prioritizing the fittest
Rather than letting all the individuals take part in crossover, we select only the top half of the vectors for creating the next generation. The process of picking individuals out of this filtered population, however, is random.
This is part of the selection, since we eliminate half of the population before producing offspring, ensuring that only the best vectors are allowed to "mate".
We used this heuristic for the major part of our project. However, towards the end, when we were fine-tuning our vectors, we skipped selection in this step and allowed all the vectors to "mate", i.e. be crossed over.
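The top-half filtering followed by random pairing can be sketched as below; the number of pairs produced is an assumption (enough to keep the population size constant at 2 offspring per pair):

```python
import random

def select_parents(population, fitness):
    """Keep only the top half of the population by fitness, then draw
    mating pairs uniformly at random from that filtered pool."""
    ranked = sorted(population, key=fitness, reverse=True)
    pool = ranked[:len(ranked) // 2]
    return [random.sample(pool, 2) for _ in range(len(population) // 2)]
```

Only the filtering is deterministic; which of the surviving vectors actually mate, and with whom, stays random.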
We tried many variants of our crossover function but in the end settled on simulated binary crossover, described below.
Simulated Binary Crossover
As shown below, our crossover function takes two parents along with a hyper-parameter: the distribution index.
The distribution index is restricted to a given range and determines how far the children deviate from the parents: the greater its value, the closer the children are to the parents.
The entire idea behind simulated binary crossover is to generate two children from two parents satisfying the following equation, while controlling the variation between parents and children through the distribution index.
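The standard per-gene form of simulated binary crossover is sketched below, with `eta` as the distribution index (a higher `eta` keeps children closer to the parents); the default value is an assumption.

```python
import random

def sbx_crossover(p1, p2, eta=3.0):
    """Simulated binary crossover: for each gene, draw a spread factor
    beta from a distribution controlled by eta, then blend the parents
    so that each pair of child genes averages to the parent genes."""
    c1, c2 = [], []
    for x1, x2 in zip(p1, p2):
        u = random.random()
        if u <= 0.5:
            beta = (2.0 * u) ** (1.0 / (eta + 1.0))
        else:
            beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
        c1.append(0.5 * ((1 + beta) * x1 + (1 - beta) * x2))
        c2.append(0.5 * ((1 - beta) * x1 + (1 + beta) * x2))
    return c1, c2
```

A useful invariant: for every gene, the two children sum to the same value as the two parents, so crossover never shifts the pair's mean.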
For weights of this scale and such large error values, mutation was the most crucial part of the algorithm, and hence involved a lot of experimentation. The final method we used is described below.
The amount by which a gene is mutated should be of the same order as the value itself. The probability of mutation must also not be very large: the larger the mutation, the smaller its probability should be.
A Gaussian distribution with an appropriate mean and standard deviation satisfies these conditions.
The mutated value should be close to the original value since the mutation is small, so the mean of the distribution is the current value of the gene.
The mutation must be of the same order as the value itself, so the standard deviation is derived from the value and a scale, where scale is a hyper-parameter that has to be tuned.
The scale is the order by which the mutation is smaller than the value itself; for finer changes we set the scale to be high.
Since there is a strict bound on the weights, any weight that crosses the bound after mutation is set to 0.
The above function was vectorized and was applied to the population in parallel.
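A vectorized sketch of the mutation described above; the mutation probability, scale, and bound values here are illustrative assumptions, not our tuned hyper-parameters.

```python
import numpy as np

def mutate_population(population, p_mut=0.1, scale=1e3, bound=10.0):
    """Vectorized Gaussian mutation. Each gene mutates with probability
    p_mut; the new value is drawn from a Gaussian centred on the old
    value with standard deviation |value| / scale, so the change is
    much smaller than (but proportional to) the value itself. Genes
    that leave [-bound, bound] are reset to 0."""
    pop = np.asarray(population, dtype=float)
    mask = np.random.random(pop.shape) < p_mut          # which genes mutate
    noise = np.random.normal(0.0, 1.0, pop.shape) * np.abs(pop) / scale
    mutated = np.where(mask, pop + noise, pop)
    mutated[np.abs(mutated) > bound] = 0.0              # enforce the bound
    return mutated
```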
The main purpose of selection here is to sort the new offspring by their fitness values in descending order.
Selection happens based on the fitness values. In our original implementation, the only selection taking place is during crossover, when we eliminate half the population.
In later implementations, we included the following snippet because we were no longer eliminating individuals from mating during crossover.
Saving the fittest vector
Other than sorting, we also save the fittest member of the population: we compare it with the best vector so far and replace it if it performs better. This gives us the best vector across all generations.
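This bookkeeping step can be sketched as follows; the function name and signature are assumptions for illustration.

```python
def update_best(population, fitness, best_so_far=None):
    """Compare the fittest member of the current population against the
    best vector seen so far and keep whichever performs better, so the
    overall best is never lost between generations."""
    current_best = max(population, key=fitness)
    if best_so_far is None or fitness(current_best) > fitness(best_so_far):
        return current_best
    return best_so_far
```

Calling this once per generation yields the best vector across all populations, even if later generations regress.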
The fitness function is designed based on the problem statement and the requirements. For a simple linear regression problem with mean squared error, the fitness function can just be based on the train error.
The fitness function is more complicated when there is a chance of falling into a local minimum, but in the case of mean squared error on a linear function there are no local minima.
Throughout our implementation, we have played around with several forms of the fitness function, using it to sort the population, starting at initialization, based on the chosen definition of fitness.
Shown below are some of our choices:
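As hedged sketches of such choices (the exact forms and the weight `w` are assumptions, not necessarily the definitions we used), both reward lower error:

```python
def fitness_train_only(train_error, validation_error):
    """Simplest choice, as described above: reward a lower train
    error; the validation error is ignored."""
    return -train_error

def fitness_weighted(train_error, validation_error, w=1.0):
    """A blended variant: penalise both errors, with w (an assumed
    hyper-parameter) weighting the validation error."""
    return -(train_error + w * validation_error)
```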
There is no definite termination condition for a genetic algorithm; it is the programmer's choice. The termination condition we used was the number of generations (a hyper-parameter), due to the limited number of queries.
The 3 images given below show 3 iterations of the genetic algorithm with certain chosen vectors as the population. All our major generations have been documented as files under a dedicated folder.
Note: when the following was generated, selection was being done within crossover (see the Crossover Section for more details).
The values were changed multiple times until suitable ones were found. All the hyper-parameters are listed below:
The names of the above parameters are self-explanatory in terms of their functionality. The necessity of one of them has been explored further in the Crossover Section.
The errors of the best vector we have submitted are:
Presented below is a rough sketch of our transition from the given overfit vector to the best submitted vector.
Removal of noise
This increased both the errors severely, since the vector was no longer overfit.
Fitting on train error
The population is filled with the underfit array, the train error is used for fitness, and the algorithm is run in batches of 10 generations with hyper-parameters changed in between.
Changing the values after every 10 generations was done manually, based on the pattern of the train and test errors across the complete population.
Why does it work?
Since the noise which resulted in overfitting was eliminated, it made the weights underfit the data.
The weights were selected based on the training error, whereas the hyper-parameter tuning was done based on the validation error.
This resulted in a good vector which performed well on unseen data.
To find the correct values for the noise weights, we did manual testing, due to the limit on the number of queries.
To determine one of the hyper-parameters, we checked the variation in the error for each vector by tweaking the values of the weights.
We also compared our vectors manually to approximate the direction of the minimum.