Data mining technology

Technical process

Considering the data itself, data mining usually requires eight steps, including data cleaning, data transformation, the data mining implementation process, pattern evaluation, and knowledge representation.

(1) Information collection: Abstract the characteristic information needed for the analysis according to the chosen data analysis object, then select an appropriate information collection method and store the collected information in a database. For massive data, choosing a suitable data warehouse for data storage and management is crucial.

(2) Data integration: Logically or physically centralize data from different sources, in different formats, and with different characteristics, so as to provide enterprises with comprehensive data sharing.

(3) Data reduction: Most data mining algorithms take a long time to run even on a small amount of data, and the amount of data is often very large when mining business operation data. Data reduction techniques can be used to obtain a reduced representation of the data set that is much smaller but still close to preserving the integrity of the original data, so that mining the reduced data gives the same, or almost the same, results as mining the original data.

(4) Data cleaning: Some of the data in the database is incomplete (some attributes of interest are missing values), noisy (containing wrong attribute values), or inconsistent (the same information is expressed in different ways), so data cleaning is required in order to store complete, correct, and consistent data in the data warehouse.

(5) Data transformation: Transform the data into a form suitable for data mining through smoothing, aggregation, data generalization, and normalization. For some real-valued data, transforming the data through concept hierarchies and discretization is also an important step.

(6) Data mining process: Based on the data in the data warehouse, select appropriate analysis tools and apply statistical methods, case-based reasoning, decision trees, rule-based reasoning, fuzzy sets, and even neural networks and genetic algorithms to process the information and obtain useful analysis information.

(7) Model evaluation: From a business perspective, industry experts verify the correctness of the data mining results.

(8) Knowledge representation: The analysis information obtained by data mining is presented to users in a visual manner, or stored as new knowledge in the knowledge base for use by other applications.

The data mining process is iterative. If a step does not achieve its expected goal, you need to go back to the previous step, readjust, and execute it again. Not every data mining job requires every step listed here. For example, when a job has only a single data source, step (2), data integration, can be omitted.

Steps (3) data reduction, (4) data cleaning, and (5) data transformation are together called data preprocessing. In data mining, at least 60% of the cost may be spent in step (1), the information collection stage, and at least 60% of the energy and time is spent on data preprocessing.
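The preprocessing steps above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample records, the alias table, and the helper names (clean, min_max_scale, discretize, reduce_by_sampling) are hypothetical, and random sampling is only one of several possible reduction techniques.

```python
import random

# Hypothetical raw records: None marks a missing value, and the same
# city is written in inconsistent ways.
RAW = [
    {"city": "NYC", "income": 52000.0},
    {"city": "new york", "income": None},
    {"city": "New York", "income": 61000.0},
    {"city": "Boston", "income": 48000.0},
]

# Hypothetical lookup table mapping inconsistent spellings to one label.
CITY_ALIASES = {"nyc": "New York", "new york": "New York", "boston": "Boston"}


def clean(records):
    """Data cleaning: unify inconsistent labels, fill missing incomes with the mean."""
    known = [r["income"] for r in records if r["income"] is not None]
    mean_income = sum(known) / len(known)
    cleaned = []
    for r in records:
        income = r["income"] if r["income"] is not None else mean_income
        cleaned.append({"city": CITY_ALIASES[r["city"].lower()], "income": income})
    return cleaned


def min_max_scale(values):
    """Data transformation: rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, bins):
    """Data transformation: equal-width discretization into bin indices 0..bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def reduce_by_sampling(records, n, seed=0):
    """Data reduction: keep a simple random sample of at most n records."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


cleaned = clean(RAW)
incomes = [r["income"] for r in cleaned]
scaled = min_max_scale(incomes)
levels = discretize(incomes, bins=3)
sample = reduce_by_sampling(cleaned, n=2)
```

The order in which the helpers are called here is illustrative only; as noted above, real jobs iterate over these steps and may skip some of them.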

Operation method

Neural network

Neural networks are well suited to data mining problems because of their robustness, self-organization and adaptability, parallel processing, distributed storage, and high fault tolerance. Three kinds of model are typical: feedforward neural network models, used for classification, prediction, and pattern recognition; feedback neural network models, represented by Hopfield's discrete and continuous models, used for associative memory and optimization calculations; and self-organizing map models, represented by the ART model and the Kohonen model, used for clustering. The disadvantage of the neural network method is its "black box" nature: it is difficult for people to understand the learning and decision-making process of the network.
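As an illustration of the associative-memory use mentioned above, the following is a minimal sketch of a discrete Hopfield-style network in Python. It assumes bipolar (+1/-1) units, Hebbian weights, and one-unit-at-a-time updates; the function names and the stored pattern are hypothetical.

```python
def train_hopfield(patterns):
    """Build the weight matrix with the Hebbian rule; no self-connections."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w


def recall(w, state, max_sweeps=10):
    """Update units one at a time until the state stops changing."""
    state = list(state)
    n = len(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            total = sum(w[i][j] * state[j] for j in range(n))
            new = 1 if total >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state


stored = [1, -1, 1, -1, 1, -1, 1, -1]  # hypothetical pattern to memorize
weights = train_hopfield([stored])
noisy = list(stored)
noisy[0] = -noisy[0]  # corrupt one unit
restored = recall(weights, noisy)  # the network settles back to the stored pattern
```

With a single stored pattern the corrupted unit is pulled back by the other seven, so the recalled state matches the stored one. Storing many patterns in a small network would not be this reliable.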

Genetic algorithm

A genetic algorithm is a random search algorithm based on biological natural selection and genetic mechanisms. Its implicit parallelism, easy integration with other models, and other properties have led to its use in data mining.

Sunil successfully developed a data mining tool based on a genetic algorithm and used it to conduct data mining experiments on the real databases of two plane crashes. The results show that the genetic algorithm is one of the effective methods for data mining [4]. Genetic algorithms are also applied in combination with neural networks, rough sets, and other technologies. For example, a genetic algorithm can be used to optimize the structure of a neural network, deleting redundant connections and hidden units without increasing the error rate; a genetic algorithm and the BP (backpropagation) algorithm can be used together to train a neural network, after which rules are extracted from the network. However, genetic algorithms are relatively complicated, and the problem of premature convergence to a local minimum has not yet been solved.
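The selection, crossover, and mutation cycle behind a genetic algorithm can be sketched as follows. This is a toy Python example under stated assumptions, not the tool described above: the OneMax fitness function (count the 1 bits), tournament selection, one-point crossover, elitism, and all parameter values are illustrative choices.

```python
import random


def fitness(bits):
    """Toy objective (OneMax): the number of 1 bits."""
    return sum(bits)


def tournament(pop, rng, k=3):
    """Selection: pick the fittest of k randomly chosen individuals."""
    return max(rng.sample(pop, k), key=fitness)


def crossover(a, b, rng):
    """One-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(bits, rng, rate):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def run_ga(n_bits=20, pop_size=30, generations=40, rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = [max(map(fitness, pop))]  # best fitness per generation
    for _ in range(generations):
        elite = max(pop, key=fitness)  # elitism: keep the best individual unchanged
        children = [elite]
        while len(children) < pop_size:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            children.append(mutate(child, rng, rate))
        pop = children
        history.append(max(map(fitness, pop)))
    return max(pop, key=fitness), history


best, history = run_ga()
```

Because the best individual is carried over unchanged, the best fitness never decreases from one generation to the next. Mutation is what gives the search a chance to leave a local optimum, which relates to the premature convergence problem noted above.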

Decision tree method

The decision tree is an algorithm commonly used in predictive models. It finds valuable, latent information in a large amount of data by purposefully classifying it. Its main advantages are a simple description and fast classification, which make it especially suitable for large-scale data processing. The earliest and most influential decision tree method is the well-known ID3 algorithm, based on information entropy, proposed by Quinlan. Its main problems are: ID3 is a non-incremental learning algorithm; the ID3 decision tree is a univariate decision tree, so it is difficult to express complex concepts; the relationships between attributes are not emphasized enough; and its noise resistance is poor. In response to these problems, many improved algorithms have emerged. For example, Schlimmer and Fisher designed the ID4 incremental learning algorithm, and Zhong Ming and Chen Wenwei proposed the IBLE algorithm.
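The entropy-based attribute selection at the heart of ID3 can be sketched in Python as follows. Only the choice of the splitting attribute is shown, not the full recursive tree construction, and the tiny data set and attribute names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """ID3 splits on the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical training rows: "outlook" fully determines "play", "windy" does not.
ROWS = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

root = best_attribute(ROWS, ["outlook", "windy"], "play")
```

Each split tests a single attribute, which is the univariate limitation described above.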

Rough set method

Rough set theory is a mathematical tool for studying imprecise and uncertain knowledge. The rough set method has several advantages: no additional information is required; the representation space of the input information is simplified; and the algorithm is simple and easy to operate. The object that rough sets process is an information table similar to a two-dimensional relational table. However, the mathematical basis of rough sets is set theory, which makes it difficult to deal with continuous attributes directly, and continuous attributes are common in real information tables. The discretization of continuous attributes is therefore the difficulty that restricts the practical application of rough set theory.
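The core rough set construction, the lower and upper approximations of a target set over an information table, can be sketched in Python. The table contents and the function names below are hypothetical; the attributes are discrete, in line with the limitation noted above.

```python
from collections import defaultdict


def equivalence_classes(table, attributes):
    """Group object ids that are indiscernible on the given condition attributes."""
    classes = defaultdict(set)
    for obj_id, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj_id)
    return list(classes.values())


def approximations(table, attributes, target_set):
    """Return (lower, upper) approximations of target_set."""
    lower, upper = set(), set()
    for cls in equivalence_classes(table, attributes):
        if cls <= target_set:  # class lies entirely inside the target: certain members
            lower |= cls
        if cls & target_set:   # class overlaps the target: possible members
            upper |= cls
    return lower, upper


# Hypothetical information table; "flu" is the decision attribute.
TABLE = {
    "p1": {"headache": "yes", "temp": "high", "flu": "yes"},
    "p2": {"headache": "yes", "temp": "high", "flu": "no"},
    "p3": {"headache": "no", "temp": "high", "flu": "yes"},
    "p4": {"headache": "no", "temp": "normal", "flu": "no"},
}

flu = {obj for obj, row in TABLE.items() if row["flu"] == "yes"}
lower, upper = approximations(TABLE, ["headache", "temp"], flu)
```

Here p1 and p2 are indiscernible but differ on the decision, so they fall in the upper approximation only. The gap between the two approximations is what makes the set "rough".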

Method of covering positive examples and rejecting counterexamples

It uses the idea of covering all positive examples and rejecting all counterexamples to find rules. First, choose a seed from the set of positive examples and compare it one by one with the examples in the set of negative examples. If it is compatible with the selector formed by a field value, that selector is discarded; otherwise, it is retained. Looping over all positive seeds in this way yields the rules for the positive examples (conjunctions of selectors). Typical algorithms are Michalski's AQ11 method, Hong Jiarong's improved AQ15 method, and his AE5 method.
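The covering idea can be sketched in Python as a simplified greedy variant. This is not AQ11, AQ15, or AE5: those algorithms are considerably more elaborate. The representation of a rule as a dictionary of selectors, the greedy choice of selectors, and the example data are all assumptions made for illustration.

```python
def covers(rule, example):
    """A rule is a dict attribute -> required value (a conjunction of selectors)."""
    return all(example[a] == v for a, v in rule.items())


def rule_from_seed(seed, negatives, attributes):
    """Add the seed's selectors greedily until no negative example is covered."""
    rule = {}
    remaining = list(negatives)  # negatives still covered by the rule
    while remaining:
        candidates = [a for a in attributes if a not in rule]
        if not candidates:
            raise ValueError("seed cannot be separated from a negative example")
        # Prefer the selector that rejects the most remaining negatives.
        best = max(candidates, key=lambda a: sum(n[a] != seed[a] for n in remaining))
        rule[best] = seed[best]
        remaining = [n for n in remaining if covers(rule, n)]
    return rule


def learn_rules(positives, negatives, attributes):
    """Loop over positive seeds until every positive example is covered."""
    rules = []
    uncovered = list(positives)
    while uncovered:
        rule = rule_from_seed(uncovered[0], negatives, attributes)
        rules.append(rule)
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return rules


# Hypothetical examples described by two discrete attributes.
ATTRS = ["shape", "color"]
POS = [{"shape": "round", "color": "red"}, {"shape": "round", "color": "green"}]
NEG = [{"shape": "square", "color": "red"}, {"shape": "square", "color": "green"}]

rules = learn_rules(POS, NEG, ATTRS)
```

On this data a single selector on shape separates the two classes, so one rule covers both positive examples and rejects both counterexamples.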

Statistical analysis method

There are two kinds of relationship between database field items: functional relationships (deterministic relationships that can be expressed by a function formula) and correlation relationships (which cannot be expressed by a function formula but are still definite relationships). Statistical methods can be used to analyze them, that is, statistical principles are applied to the information in the database. The methods include common statistics (finding the maximum, minimum, sum, average, and so on in a large amount of data), regression analysis (using regression equations to express the quantitative relationship between variables), correlation analysis (using correlation coefficients to measure the degree of correlation between variables), and difference analysis (using the values of sample statistics to determine whether there is a difference between population parameters).
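Three of the analyses listed above (common statistics, regression, and correlation) can be sketched in plain Python. The data are hypothetical and chosen to lie exactly on the line y = 2x + 1, so the fitted coefficients are easy to check by hand; the function names are illustrative.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Correlation analysis: Pearson correlation coefficient of two samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def linear_regression(xs, ys):
    """Regression analysis: least-squares slope and intercept of y = slope*x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]  # hypothetical data, exactly y = 2x + 1

# Common statistics over one field.
summary = {"min": min(ys), "max": max(ys), "sum": sum(ys), "mean": mean(ys)}
slope, intercept = linear_regression(xs, ys)
r = pearson(xs, ys)
```

Real data would scatter around the fitted line, and the correlation coefficient would then fall strictly between -1 and 1.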

Fuzzy set method

Fuzzy set theory is used to perform fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis on practical problems. The higher the complexity of a system, the stronger its fuzziness. In general, fuzzy set theory uses the degree of membership to describe fuzzy things. On the basis of traditional fuzzy theory and probability and statistics, Li Deyi and others proposed a model for converting between qualitative and quantitative uncertainty, the cloud model, and developed cloud theory from it.
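The degree of membership mentioned above can be illustrated with a short Python sketch. It assumes triangular membership functions and the common min/max definitions of fuzzy intersection and union; the "warm" and "hot" temperature sets and their breakpoints are hypothetical. The cloud model is not shown.

```python
def triangular(x, a, b, c):
    """Membership rises linearly from a to the peak b, then falls to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzy_and(m1, m2):
    """Intersection of two fuzzy sets: minimum of the memberships."""
    return min(m1, m2)


def fuzzy_or(m1, m2):
    """Union of two fuzzy sets: maximum of the memberships."""
    return max(m1, m2)


def fuzzy_not(m):
    """Complement of a fuzzy set."""
    return 1.0 - m


def warm(t):
    """Hypothetical fuzzy set 'warm' over temperatures in degrees Celsius."""
    return triangular(t, 15.0, 25.0, 35.0)


def hot(t):
    """Hypothetical fuzzy set 'hot' over temperatures in degrees Celsius."""
    return triangular(t, 25.0, 35.0, 45.0)


both = fuzzy_and(warm(30.0), hot(30.0))    # 30 degrees is partly warm and partly hot
either = fuzzy_or(warm(20.0), hot(20.0))   # 20 degrees is partly warm, not hot at all
```

Unlike a crisp set, a temperature can belong to "warm" and "hot" at the same time, each to a degree between 0 and 1.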

Mining objects

According to the information storage format, the objects used for mining include relational databases, object-oriented databases, data warehouses, text data sources, multimedia databases, spatial databases, temporal databases, heterogeneous databases, and the Internet.

Data mining software

SAS EM

Modeler, from IBM SPSS

K-Miner, from Shenzhou General Company

Tempo, from Merrill Lynch Data Technology Co., Ltd.
