# Random forest

## Learning algorithm

Build each tree according to the following algorithm:

1. Let N be the number of training cases (samples) and M the number of features.

2. Choose a number m of features used to determine the decision at each node of the tree; m should be much smaller than M.

3. Sample N times with replacement from the N training cases to form the training set for this tree (i.e., bootstrap sampling), and use the cases that were not selected (the out-of-bag samples) to estimate the prediction error.

4. At each node, randomly select m features, and base the decision at that node on them: compute the best split according to these m features.

5. Grow each tree fully, without pruning (pruning may be applied after a normal tree classifier has been built).
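The steps above can be sketched in a few lines; this is a minimal illustration of steps 2–4 only (the data, sizes, and random seed are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: N cases (samples), M features (step 1).
N, M = 100, 16
X = rng.normal(size=(N, M))
y = rng.integers(0, 2, size=N)

# Step 2: m features per node, with m much smaller than M
# (sqrt(M) is a common rule of thumb).
m = int(np.sqrt(M))

# Step 3: bootstrap sampling -- draw N indices with replacement.
boot_idx = rng.integers(0, N, size=N)
X_boot, y_boot = X[boot_idx], y[boot_idx]

# The cases never drawn are the out-of-bag (OOB) samples,
# used to estimate the prediction error.
oob_mask = np.ones(N, dtype=bool)
oob_mask[boot_idx] = False

# Step 4: at each node, a fresh random subset of m features is considered.
feature_subset = rng.choice(M, size=m, replace=False)

print(X_boot.shape, int(oob_mask.sum()), len(feature_subset))
```

About a third of the cases are typically left out of each bootstrap sample, which is what makes the out-of-bag error estimate in step 3 possible.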

Random forests offer the following advantages:

1) For many kinds of data, they can produce highly accurate classifiers;

2) They can handle a large number of input variables;

3) They can assess the importance of variables when determining a class;

4) When building the forest, they produce an internal, unbiased estimate of the generalization error;

5) They include a good method for estimating missing data, and can maintain accuracy even when a large proportion of the data is missing;

6) They provide an experimental method for detecting variable interactions;

7) For imbalanced classification data sets, they can balance the errors;

8) They compute proximities between cases, which is very useful for data mining, detecting outliers, and visualizing the data;

9) Using the above, they can be extended to unlabeled data, typically via unsupervised clustering; outliers can also be detected and the data viewed this way;

10) The learning process is very fast.

## Related concepts

1. Split: during the training of a decision tree, the training data set is split into two sub-data sets again and again; this process is called splitting.

2. Candidate features: in the process of constructing the decision tree, features must be selected from the full feature set in some order. The candidate features at a given step are the features that have not yet been selected before that step. For example, if all the features are A, B, C, D, E, then in the first step the candidate features are A, B, C, D, E; if C is selected in the first step, then in the second step the candidate features are A, B, D, E.

3. Split features: the counterpart of the candidate features. Every feature that has been selected is a split feature; in the example above, the first split feature is C. Because these selected features divide the data set into disjoint parts, they are called split features.
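The relationship between candidate features and split features can be shown in a few lines of Python. Selecting C in step 1 follows the example above; the second selection, A, is a hypothetical choice for illustration:

```python
# All features in the example above.
all_features = {"A", "B", "C", "D", "E"}

candidates = set(all_features)   # step 1: every feature is a candidate
split_features = []

# Step 1 selects C, so C becomes a split feature and leaves the candidates.
split_features.append("C")
candidates.discard("C")
print(sorted(candidates))        # candidates for step 2: ['A', 'B', 'D', 'E']

# Step 2 might then select, say, A (hypothetical).
split_features.append("A")
candidates.discard("A")
print(sorted(candidates))        # candidates for step 3: ['B', 'D', 'E']
```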

## Decision tree construction

We use the process of selecting a quantitative trading tool to illustrate how a decision tree is constructed. Suppose we want to choose a good quantitative tool to help us trade stocks better. How do we choose?

Step 1: Check whether the data provided by the tool is comprehensive; if it is not, do not use the tool.

Step 2: Check whether the API provided by the tool is easy to use; if the API is not good, do not use it.

Step 3: Check whether the tool's backtesting process is reliable; if the backtesting is unreliable, do not use it.
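The three checks above form a single path through a decision tree. Written as code (a hypothetical sketch; the function and parameter names are invented):

```python
def choose_tool(data_comprehensive: bool, api_easy: bool, backtest_reliable: bool) -> str:
    """Each `if` corresponds to one internal node of the decision tree."""
    if not data_comprehensive:      # step 1
        return "do not use"
    if not api_easy:                # step 2
        return "do not use"
    if not backtest_reliable:       # step 3
        return "do not use"
    return "use"

print(choose_tool(True, True, True))    # -> use
print(choose_tool(True, False, True))   # -> do not use
```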

It can be seen that the main job of the decision tree is to select features that divide the data set, finally assigning the data to one of two labels. How do we choose the best feature? Continuing the quantitative-tool example: suppose there are 100 quantitative tools on the market forming the training data set, each labeled either "usable" or "unusable".

We first try to divide the data set into two categories by "is the API easy to use". We find that the APIs of 90 quantitative tools are easy to use, while those of 10 tools are not. Among the 90, 40 are labeled "usable" and 50 are labeled "unusable". The feature "API is easy to use" therefore does not classify the data particularly well: given a new quantitative tool, even if its API is easy to use, you still cannot confidently label it "usable".
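One common way to quantify how well a feature splits the data is the Gini impurity (the text does not name a specific criterion; this is an illustrative choice). For the "API is easy to use" branch with 40 "usable" and 50 "unusable" tools, the impurity stays close to 0.5, confirming that the split is not very informative. The 10 tools with a hard-to-use API are not broken down in the text, so a 5/5 label split is assumed here:

```python
def gini(counts):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# 90 tools with an easy API: 40 "usable", 50 "unusable".
easy = [40, 50]
# The remaining 10 tools: labels unspecified in the text; 5/5 assumed.
hard = [5, 5]

# Impurity of the split = weighted average over the two branches.
weighted = (90 / 100) * gini(easy) + (10 / 100) * gini(hard)
print(round(gini(easy), 4), round(weighted, 4))  # both near 0.5 -> weak split
```

A perfect split would drive the weighted impurity toward 0; values near 0.5 (the maximum for two balanced classes) mean the feature barely separates the labels.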

## Random forest construction

How do we build a random forest? There are two aspects: random selection of data, and random selection of candidate features.

1. Random selection of data:

First, sample with replacement from the original data set to construct a sub-data set whose size is the same as that of the original data set. Elements may be repeated across different sub-data sets, and elements within the same sub-data set may also be repeated. Second, use each sub-data set to construct a sub-decision tree; a sample to be classified is put into every sub-decision tree, and each sub-decision tree outputs a result. Finally, when new data needs to be classified by the random forest, the forest's output is obtained by voting over the results of the sub-decision trees. As shown in Figure 3, if the random forest contains 3 sub-decision trees, and 2 of them classify a sample as class A while 1 classifies it as class B, then the classification result of the random forest is class A.
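The voting step from the Figure 3 example can be written directly:

```python
from collections import Counter

# 3 sub-decision trees: 2 vote class A, 1 votes class B (as in Figure 3).
tree_predictions = ["A", "A", "B"]

# The forest's prediction is the majority vote of its trees.
forest_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(forest_prediction)  # -> A
```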

2. Random selection of candidate features:

Similar to the random selection of data, each split of a subtree in the random forest does not consider all candidate features; instead, a subset is randomly drawn from the candidate features, and the optimal split feature is then chosen from that subset. In this way, the decision trees in the random forest differ from one another, which improves the diversity of the system and thereby the classification performance.
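Both kinds of randomness can be combined in a minimal sketch. For brevity this uses decision stumps (depth-1 trees, split at the median) instead of fully grown trees, and the data, tree count, and subset size are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data (illustrative): the class depends only on feature 0,
# the remaining features are noise.
N, M = 200, 8
X = rng.normal(size=(N, M))
y = (X[:, 0] > 0).astype(int)

def fit_stump(X, y, feats):
    """Fit a depth-1 tree (stump) considering only the given feature subset:
    split each feature at its median and label each side by majority class."""
    best = None  # (error, feature, threshold, left_label, right_label)
    for f in feats:
        t = np.median(X[:, f])
        right = X[:, f] > t
        left_label = int(y[~right].mean() > 0.5)
        right_label = int(y[right].mean() > 0.5)
        pred = np.where(right, right_label, left_label)
        err = np.mean(pred != y)
        if best is None or err < best[0]:
            best = (err, f, t, left_label, right_label)
    return best[1:]

def forest_predict(X, stumps):
    """Majority vote over all stumps."""
    votes = [np.where(X[:, f] > t, rl, ll) for f, t, ll, rl in stumps]
    return (np.mean(votes, axis=0) > 0.5).astype(int)

m = 3                        # candidate-feature subset size per split
stumps = []
for _ in range(40):          # 40 trees
    idx = rng.integers(0, N, size=N)               # randomness 1: bootstrap the data
    feats = rng.choice(M, size=m, replace=False)   # randomness 2: random feature subset
    stumps.append(fit_stump(X[idx], y[idx], feats))

acc = (forest_predict(X, stumps) == y).mean()
print("forest accuracy:", round(float(acc), 2))
```

Only the trees whose random subset happens to include feature 0 can find the true signal; the rest vote roughly at random, yet the majority vote still recovers a good classifier — which is exactly the diversity argument above.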