




Original Paper
Debbie Rankin1, PhD, Corresponding author, d.rankin1@ulster.ac.uk, +442871675841
Michaela Black1, PhD, mm.black@ulster.ac.uk
Raymond Bond2, PhD, rb.bond@ulster.ac.uk
Jonathan Wallace2, MSc, jg.wallace@ulster.ac.uk
Maurice Mulvenna2, PhD, md.mulvenna@ulster.ac.uk
Gorka Epelde3,4, PhD, gepelde@
1 School of Computing, Engineering and Intelligent Systems, Ulster University, Derry~Londonderry, Northern Ireland, United Kingdom
2 School of Computing, Ulster University, Jordanstown, Northern Ireland, United Kingdom
3 Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
4 Biodonostia Health Research Institute, eHealth Group, Donostia-San Sebastián, Spain
Reliability of Supervised Machine Learning Using Synthetic Data in Healthcare: A Model to Preserve Privacy for Data Sharing
Abstract
Background:
The exploitation of synthetic data in healthcare is at an early stage. Synthetic data generation could unlock the vast potential within healthcare datasets that are too sensitive for release due to privacy concerns. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalisability are scarce.
Objective:
This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.
Methods:
A total of 19 open healthcare datasets containing both categorical and numerical data have been selected for experimental work. Synthetic data is generated using three popular synthetic data generators that apply Classification and Regression Trees, parametric and Bayesian network approaches. Real and synthetic data are used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest and support vector machine. Models are tested only on real data to determine whether a model developed by training on synthetic data can be put into use by healthcare departments and used to accurately classify new, real examples. Evaluation metrics are computed and differentials in these scores are compared. The impact of statistical disclosure control on model performance is also assessed.
Results:
The accuracy of ML models trained on synthetic data is lower than that of models trained on real data in 92% of cases. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 17.7-19.3%, whilst other models have lower deviations of 5.8-7.2%. The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26.3% of cases for CART and parametric synthetic data, and in 21.1% of cases for Bayesian network generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 94.7% of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 73.7%, 52.6% and 68.4% of cases for CART, parametric and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.
Conclusions:
The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared to models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of its robustness. Synthetic data must ensure that individual privacy and data utility are preserved in order to instil confidence in healthcare departments when utilising such data to inform policy decision-making.
Keywords: Synthetic Data; Supervised Machine Learning; Data Utility; Healthcare; Decision Support; Statistical Disclosure Control
Introduction
Background
National Healthcare Departments hold vast volumes of data on patients and the population that is not being used to its full potential due to valid privacy concerns. Machine learning (ML) has the potential to vastly improve decisions and outcomes in healthcare, and yet these improvements have not been fully realised. The reason may be in part related to an issue that faces many data scientists and researchers in the area: the limited availability of or access to data, or the readiness of healthcare institutions to share data. Privacy concerns over personal data, and in particular healthcare data, mean that although the data exists, it is deemed too sensitive for public release [1], even in the case of serious research.
One way to overcome the issue of data availability is to use fully synthetic data as an alternative to real data. The exploitation of synthetic data in healthcare is at an early stage and is gaining increasing attention. Synthetic data is data that is simulated from real data by using the underlying statistical properties of the real data to produce synthetic datasets that exhibit these same statistical properties. Synthetic data can represent the population in the original data whilst avoiding any divulgence of real, potentially personal, confidential and sensitive data. In the case of health-related data, this would ensure that actual patient records are not disclosed, thus avoiding governance and confidentiality issues. There are three types of synthetic data: fully synthetic, partially synthetic, and hybrid synthetic. This work considers fully synthetic data, which does not contain original data.
Synthetic data can be used in two ways: to augment an existing dataset, thus increasing its size, for times when a dataset is unbalanced due to the limited occurrence of an event or when more examples are required [2,3]; and to generate a fully synthetic dataset that is representative of the original dataset, for times when data is not available due to its sensitive nature [4]. The latter is considered in this work as a key requirement for healthcare data sharing.
Traditionally, data perturbation techniques such as data swapping, data masking, cell suppression and adding noise have been applied to real data to modify and thus protect the data from disclosure prior to releasing it. However, such methods do not eliminate disclosure risk and can impact the utility of the data, particularly if multivariate relationships are not considered [5]. Synthetic data was first proposed by Rubin [6] and Little [7]. Raghunathan, Reiter and Rubin [8] implemented and extended upon this, pioneering the multiple imputation approach to synthetic data generation, exemplified in a range of studies [9-14]. Reiter [15] then introduced an alternative method of synthesising data through a non-parametric tree-based technique that utilises Classification and Regression Trees (CART). A more recent technique proposes a Bayesian network approach for synthetic data generation [16]. Synthetic data is considered a secure approach for enabling public release of sensitive data as it goes beyond traditional de-identification methods by generating a fake dataset that does not contain any of the original, identifiable information from which it was generated, whilst retaining the valid statistical properties of the real data. Therefore, the risk of disclosure of a real person or reverse engineering is considered to be unlikely [17].
Whilst a number of synthetic data generators have been developed, empirical evidence of their efficacy has not been fully explored. This work extends a preliminary study [18] and investigates whether fully synthetic data can preserve the hidden complex patterns that supervised ML can uncover from real data, and therefore whether it can be used as a valid alternative to real data when developing eHealth applications and healthcare policy making solutions. This will be achieved by experimenting with a range of open healthcare datasets. Synthetic data will be generated using three well-known synthetic data generation techniques. Supervised ML algorithms will be used to validate the performance of the synthetic datasets. Statistical disclosure control (SDC) methods that can further decrease the disclosure risk associated with synthetic data will also be considered.
Overview
To inform the viability of the use of synthetic data as a valid and reliable alternative to real data in the healthcare domain, we will answer the following research questions:
What is the differential in performance when using synthetic data versus real data for training and testing supervised ML models?
What is the variance of the absolute difference of accuracies between ML models trained on real and synthetic datasets?
How often does the winning ML technique change when moving from training on real data to training on synthetic data?
What is the impact of statistical disclosure control (i.e. privacy protection) measures on the utility of synthetic data (i.e. similarity to real data)?
To answer these questions, 19 open healthcare datasets containing both categorical and numerical data have been selected for experimentation [19]. Synthetic datasets are generated for each of these 19 datasets using three popular synthetic data generators that apply CART [15,17], parametric [8,17] and Bayesian network [16] approaches, respectively, to enable a robust comparison of the three synthetic data generation techniques across a broad range of data.
Initially we analyse whether the multivariate relationships that exist in the real data are preserved in the synthetic versions of the data, for data generated using each of the three synthetic data generation techniques, by computing pairwise mutual information scores for each variable pair combination in each dataset [16]. It is important that such relationships are retained when data is synthesised.
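As an illustration of the pairwise check described above, the following is a minimal sketch rather than the procedure used in the study. It assumes that numerical attributes are discretised into equal-width bins before scoring and that the normalised form of mutual information from Scikit-Learn is acceptable; the function name `pairwise_mutual_information` and the `bins` parameter are hypothetical.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import normalized_mutual_info_score


def pairwise_mutual_information(df: pd.DataFrame, bins: int = 10) -> pd.DataFrame:
    """Return a symmetric matrix of normalised mutual information scores."""
    # Assumption: numerical columns with many distinct values are binned so that
    # every column can be treated as a discrete variable.
    discrete = pd.DataFrame(index=df.index)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]) and df[col].nunique() > bins:
            discrete[col] = pd.cut(df[col], bins=bins, labels=False)
        else:
            discrete[col] = df[col].astype("category").cat.codes

    # The diagonal is 1.0 because a variable shares all information with itself.
    scores = pd.DataFrame(1.0, index=df.columns, columns=df.columns)
    for a, b in combinations(df.columns, 2):
        nmi = normalized_mutual_info_score(discrete[a], discrete[b])
        scores.loc[a, b] = scores.loc[b, a] = nmi
    return scores
```

Computing this matrix once for a real dataset and once for each synthetic version, and then inspecting the element-wise absolute difference, gives one rough view of how well the variable pair relationships have been retained.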
To evaluate the utility of synthetic data for machine learning, we then investigate the performance of supervised ML models trained on synthetic data and tested on real data, compared with models trained on real data and also tested on the real data. This allows us to determine if a model developed using synthetic data can classify real data examples as accurately and reliably as a model developed using real data. We consider five different supervised machine learning models to compare performance and determine if there are differences in robustness across each of these models. Standard evaluation metrics are computed for models trained on real and synthetic data, for each ML model, and for each dataset [20]. The differences in accuracy for models trained on synthetic data versus models trained on real data are computed to analyse the extent to which synthetic data causes a degradation in model performance, if any.
It is pertinent that the optimal ML model built using synthetic data matches the optimal ML model that would be selected if real data were used in the model training process. This would provide stakeholders in healthcare with confidence in the use of synthetic data for model development. Thus, we consider how often the best ML classifier built using synthetic data matches the best ML model built using real data.
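The comparison of winning classifiers can be sketched as follows. This is an illustrative outline only: the function names and the nested dictionary layout (dataset code, then classifier name, then accuracy) are assumptions, and the accuracies shown are made-up values, not results from this study.

```python
def winning_classifier(accuracies: dict) -> str:
    """Return the classifier name with the highest accuracy (first one wins ties)."""
    return max(accuracies, key=accuracies.get)


def match_rate(real_results: dict, synthetic_results: dict) -> float:
    """Fraction of datasets whose winner is the same for real and synthetic training."""
    matches = sum(
        winning_classifier(real_results[name]) == winning_classifier(synthetic_results[name])
        for name in real_results
    )
    return matches / len(real_results)


# Hypothetical accuracies for two datasets, for illustration only.
real = {"A": {"DT": 0.95, "SVM": 0.90}, "B": {"DT": 0.80, "SVM": 0.85}}
synthetic = {"A": {"DT": 0.70, "SVM": 0.88}, "B": {"DT": 0.60, "SVM": 0.82}}
print(match_rate(real, synthetic))  # 0.5: the winner matches for B but not for A
```

How ties are broken is a design choice that the text does not specify; here the first classifier listed is taken as the winner.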
Finally, the impact of a number of statistical disclosure control methods on model performance is assessed. Statistical disclosure control methods seek to further enhance data privacy; however, this can lead to a loss in usefulness of the data [21], and we consider the extent to which performance degradation occurs as a result of SDC.
This large-scale assessment of the reliability of synthetic data when used for supervised ML, utilising 19 healthcare datasets and 3 synthetic data generation techniques, provides an important contribution in relation to the trust and confidence that stakeholders in healthcare can have in synthetic data. We also propose a pipeline to illustrate how synthetic data can potentially fit within the healthcare provider context. This work demonstrates the promising performance of synthetic data whilst highlighting its limitations and future work directions to overcome them.
Synthetic Data: Present and Future Use
The validity and disclosure risk associated with synthetic data has been under investigation by the U.S. Census Bureau since 2003 for the purpose of creating public use data from a combination of sensitive data from the Census Bureau’s Survey of Income and Program Participation (SIPP), the Internal Revenue Service’s (IRS) individual lifetime earnings data, and the Social Security Administration’s (SSA) individual benefit data [22,23]. The goal was to enable the release of synthesised person-level records containing personal and financial characteristics from confidential datasets, whilst preserving privacy. Successful results have led to the release of public use synthetic data files. Researchers can have their work validated against the Gold Standard (real) data by the Census Bureau, thus enabling them to determine the impact of synthetic data on their exploratory analyses and model development and have confidence in their results, whilst also allowing the Census Bureau to continuously improve their synthesis techniques. The public release of this data has provided significant benefit to the research community and general population, enabling more extensive economic policy research to be performed by groups who could not previously access useful data [24-29]. This work led to the release of further synthetic datasets by the Census Bureau. The Synthetic Longitudinal Business Database (SynLBD) comprises data from an annual economic census of establishments in the U.S. [30]. This dataset provides broad access to rich data that supports the research and policy-making communities in business and employment related topics. OnTheMap is a tool utilising synthetic data to provide workforce related maps, demographic profiles and reports of U.S. citizens, as well as disaster event information and the impact of such events on workers and employers [31]. Similarly, synthetic data has also been under investigation in the UK as a means to provide public access to rich data from UK Longitudinal Studies [32-34] that contain highly sensitive data linking national census data to administrative data for individuals and their families.
These datasets enable researchers to explore data and to develop and test code and models, with no restrictions, outside the secure environment where real data resides. The data owners provide a validation mechanism whereby results, code and models can be validated on behalf of researchers on the real data within the secure environment, and feedback provided. This process increases research productivity whilst ensuring the development of robust and valid models [35].
Whilst synthetic data has been used to accelerate and democratise business and economic policy research [22-35], it is not currently in use for healthcare research, an area that could benefit enormously. With advancements in technology, particularly ML and artificial intelligence (AI), the potential to develop diagnostic tools for clinicians and data-driven decision-making platforms for health policy-makers is ever-increasing [36,37]. Such tools require access to healthcare data, for example, to train AI algorithms and produce models that can identify health conditions and health-related patterns across the population. Currently it can take a lengthy period of time for researchers to gain access to healthcare data, a rich and under-utilised resource, due to privacy concerns [38-42]. For example, in the case of the 40-month MIDAS Project [36,43], which developed a data-driven decision-making tool for healthcare policy-makers, it took more than 20 months to obtain access to the required data due to legal and ethical constraints. In addition, a number of important data variables could not be made available, which restricted the utility of the platform under development. With the help of synthetic data, such data, with more or all variables included, could have been made available in a matter of weeks, thus providing more time for development and evaluation of the platform. The platform could then have been installed in healthcare sites more quickly and connected to real data for validation and comparison of performance for synthetic versus real data, enabling performance tweaks to mitigate bias introduced by synthetic data, if any. Synthetic data could also enable cross-site analytics across various health regions, which would enable policy-makers to connect their health spaces and potentially provide significant enhancements to cross-national health policy.
The ultimate goal of this work is to further assess the validity and disclosure risk of synthetic data under the stringent conditions associated with healthcare data, with a view to successfully developing a pipeline for use in healthcare. Such a pipeline would enable synthetic datasets to be released publicly to researchers who would otherwise not be able to access the data, or access it in a timely fashion, in order to accelerate research by enabling the wider research community to use the data for analysis and model development. The results of such analyses and the models and code developed can then be given to healthcare departments for validation on the real data and, if effective, can be put into use by clinicians and health policy-makers.
Synthetic Data Pipeline for Healthcare
To understand how healthcare departments can benefit from synthetic data, we propose the pipeline shown in Figure 1. This is a proposed synthetic data sharing pipeline provided as an illustration of how synthetic data can potentially work within a real healthcare setting to expedite data analytics. In future work, we plan to test this pipeline in a real setting. In this pipeline, real data resides within the National Healthcare Department infrastructure. The data cannot be shared externally due to its sensitive and private nature. Healthcare departments may only have a small number of data science staff with the expertise necessary to apply ML techniques to many of their datasets, and so they cannot maximise the use of their data nor discover its uses due to lack of resources. By applying a synthetic data generation technique to the real data along with SDC measures, a synthetic dataset can be produced and made available to the external research community in place of the real data. External researchers, in large numbers and with wide-ranging expertise, can potentially develop optimal ML models trained on the synthetic data and share the performance of the ML model, the model itself and the model specification with the National Healthcare Department. The healthcare department can then test the ML model on real data, or in-house technical staff can rebuild the model according to the specification provided by researchers. The specification can include the program code written by researchers, details of the ML algorithm to use (e.g. decision tree, support vector machine), and the optimal hyperparameter settings determined during development. Using these settings, the model can then be rebuilt, this time by training on the real data, to which in-house staff have access, instead of the synthetic data.
Figure 1. Proposed synthetic data sharing pipeline to illustrate how synthetic data could be implemented to expedite healthcare data analytics.
Methods
Dataset Selection
For experimentation, 19 open healthcare datasets have been selected from the UCI Machine Learning Repository [19]. Missing values have been removed from the datasets either by removing features with a high number of missing values or by removing observations where a feature contains a missing value. The experimental datasets and their properties are summarised in Table 1. These datasets were selected to enable an analysis of synthetic data performance when applied to datasets of differing volume and data types (categorical and numerical).
Table 1. Summary of experimental datasets [a].

| Code | Dataset | No. of Attributes | No. of Categorical Attributes | No. of Numerical Attributes | No. of Classes/Labels | No. of Observations |
|---|---|---|---|---|---|---|
| A | Breast Cancer Wisconsin (Original) | 9 | 0 | 9 | 2 | 683 |
| B | Breast Cancer | 9 | 9 | 0 | 2 | 277 |
| C | Breast Cancer Coimbra | 9 | 0 | 9 | 2 | 116 |
| D | Breast Tissue | 9 | 0 | 9 | 6 | 106 |
| E | Chronic Kidney Disease | 21 | 12 | 9 | 2 | 209 |
| F | Cardiotocography (3 Class) | 21 | 0 | 21 | 3 | 2126 |
| G | Cardiotocography (10 Class) | 21 | 0 | 21 | 10 | 2126 |
| H | Dermatology | 34 | 33 | 1 | 6 | 358 |
| I | Diabetic Retinopathy | 19 | 3 | 16 | 2 | 1151 |
| J | Echocardiogram | 10 | 2 | 8 | 3 | 106 |
| K | EEG Eye State | 14 | 0 | 14 | 2 | 14980 |
| L | Heart Disease | 13 | 8 | 5 | 2 | 303 |
| M | Lymphography | 18 | 18 | 0 | 4 | 148 |
| N | Post-Operative Patient Data | 8 | 8 | 0 | 3 | 87 |
| O | Primary Tumor | 15 | 15 | 0 | 21 | 336 |
| P | Stroke | 10 | 7 | 3 | 2 | 29072 |
| Q | Thoracic Surgery | 16 | 13 | 3 | 2 | 470 |
| R | Thyroid Disease | 22 | 16 | 6 | 28 | 5786 |
| S | Thyroid Disease (New) | 5 | 0 | 5 | 3 | 215 |
| | Total | 283 | 144 | 139 | 105 | 58,655 |

[a] Each dataset has been encoded with a letter (column 1) and will be referenced using this letter for the remainder of the paper.
Generating Synthetic Data
In this work, we analyse and assess the performance of three publicly available synthetic data generation techniques that are based on well-known, seminal work in the area [6-10,15,16]. These methods are a parametric data synthesis technique, a non-parametric tree-based synthesis technique that utilises CART [15], and a synthesis technique that utilises Bayesian networks [16]. Whilst other approaches exist, some are developed for specific datasets and problems, e.g. SimPop simulates population survey data [44], and Synthea simulates patient population and electronic health record data [45], whereas the techniques considered here are more general. The R package synthpop, developed by Nowok, Raab and Dibben [17], provides a publicly available implementation of the parametric and CART-based synthetic data generators. The DataSynthesizer Python implementation, developed by Ping, Stoyanovich and Howe [16], provides a publicly available implementation of the Bayesian network based synthetic data generator. These implementations have been utilised in this experimental work.
Attributes are synthesised sequentially in both the parametric and CART methods. The synthetic values for the first attribute are synthesised using a random sample from the original observed data, since it has no predictors from previously synthesised attributes in the dataset. When synthesising attributes, both categorical and numerical, with the non-parametric method, the CART method is applied. CART is applied to all variables that have predictors, i.e. attributes prior to them in the sequence, and draws from the conditional distributions fitted to the original data using CART models. The parametric method synthesises each attribute based on its data type. Numerical attributes are synthesised using normal linear regression. Categorical attributes are synthesised using polytomous logistic regression where the attribute has more than two levels, whilst logistic regression is applied to synthesise binary categorical variables [17]. The Bayesian network method of synthesising data learns a differentially private Bayesian network that captures the correlation structure between attributes in the real data and draws samples from this model to produce synthetic data [16].
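To make the sequential, tree-based idea concrete, the sketch below re-creates it in Python for numerical attributes only. It is not the synthpop implementation used in this work (which is an R package and also handles categorical attributes); the function name, the use of Scikit-Learn's DecisionTreeRegressor, and the `min_samples_leaf=5` setting are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def synthesise_sequential_cart(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Illustrative sequential CART-style synthesis for numerical attributes."""
    rng = np.random.default_rng(seed)
    columns = list(real.columns)
    synthetic = pd.DataFrame(index=range(n_rows))

    # First attribute: it has no predictors, so sample observed values with replacement.
    first = columns[0]
    synthetic[first] = rng.choice(real[first].to_numpy(), size=n_rows, replace=True)

    for position in range(1, len(columns)):
        target, predictors = columns[position], columns[:position]
        # Fit a tree on the real data using only attributes earlier in the sequence.
        tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=seed)
        tree.fit(real[predictors], real[target])

        # Group the observed target values by the leaf each real row falls into.
        real_leaves = tree.apply(real[predictors])
        target_values = real[target].to_numpy()
        donors = {leaf: target_values[real_leaves == leaf] for leaf in np.unique(real_leaves)}

        # Each synthetic row draws one observed value from the donors in its own leaf,
        # i.e. from the conditional distribution approximated by the tree.
        synthetic_leaves = tree.apply(synthetic[predictors])
        synthetic[target] = [rng.choice(donors[leaf]) for leaf in synthetic_leaves]
    return synthetic
```

Because every value is drawn from observed values, a generator of this kind can reproduce real values for individual attributes, which is one reason additional disclosure control may be considered on top of synthesis.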
Supervised Machine Learning with Real and Synthetic Data
A key measure of data utility of a synthetic dataset for the purpose of ML is how well a supervised ML model trained on synthetic data performs when tasked with classifying real data. This will determine whether supervised ML models will be robust enough to classify real data examples if only synthetic data is provided for the training of these models.
To evaluate whether synthetic datasets can be used as a valid alternative to real datasets in ML, for each of the 19 datasets (Table 1), five different classification models were trained. Initially the models were trained and tested on the real data to obtain a performance benchmark. Subsequently, a classifier was trained on each of the synthetic datasets, generated using parametric, CART and Bayesian network techniques, and then tested with the real data. Models are tested on real data only, to determine whether a model developed by training on synthetic data can be put into use by healthcare departments and used to accurately classify new, real examples.
The models applied to each dataset were: stochastic gradient descent (SGD), decision tree (DT), k-nearest neighbors (KNN), random forest (RF), and support vector machine (SVM). This selection of algorithms was applied to determine how well each performed when trained with the real data compared with the synthetic data, with both tested on real data.
The classifiers were implemented using Python’s Scikit-Learn 0.21.3 machine learning library and are as follows:
Stochastic gradient descent classification was implemented using SGDClassifier, a simple linear classifier, with loss="hinge", random_state=0 and all other parameters set to their defaults.
Decision tree classification was implemented using DecisionTreeClassifier, an optimised version of CART, with criterion="gini", max_depth=10 and random_state=0 and all other parameters set to their defaults.
K-Nearest Neighbors classification was implemented using KNeighborsClassifier with n_neighbors=10, weights='uniform', leaf_size=30, p=2, metric='minkowski', n_jobs=2 and all other parameters set to their defaults.
Random Forest classification was implemented using RandomForestClassifier with criterion="gini", max_depth=10, min_samples_split=2, n_estimators=10, random_state=1 and all other parameters set to their defaults.
Support Vector Machine classification was implemented using SVC with C=1.0, degree=3, kernel='rbf', probability=True, random_state=None and all other parameters set to their defaults.
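The settings listed above can be collected into a single helper, sketched below. The parameter values are those stated in the text; the function name `build_classifiers` and the short dictionary keys are hypothetical conveniences. Defaults for unspecified parameters may differ between Scikit-Learn 0.21.3 and later releases.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers() -> dict:
    """Return fresh, unfitted instances of the five classifiers keyed by abbreviation."""
    return {
        "SGD": SGDClassifier(loss="hinge", random_state=0),
        "DT": DecisionTreeClassifier(criterion="gini", max_depth=10, random_state=0),
        "KNN": KNeighborsClassifier(
            n_neighbors=10, weights="uniform", leaf_size=30,
            p=2, metric="minkowski", n_jobs=2,
        ),
        "RF": RandomForestClassifier(
            criterion="gini", max_depth=10, min_samples_split=2,
            n_estimators=10, random_state=1,
        ),
        "SVM": SVC(C=1.0, degree=3, kernel="rbf", probability=True, random_state=None),
    }
```

Returning new instances on each call avoids accidentally reusing a model that was already fitted on a different dataset.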
For training and testing, Python’s Scikit-Learn 0.21.3 ShuffleSplit random permutation cross-validator was used with 10 splitting iterations and a train/test split of 75/25. Categorical attributes were transformed into indicator attributes using one-hot encoding.
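One possible way to wire these pieces together is sketched below. The text does not state exactly how the splits were applied to the synthetic data, so this sketch makes explicit assumptions: the synthetic-trained model is fitted once on the full synthetic dataset, both models are scored on the same held-out 25% of real data in each of the 10 splits, real and synthetic tables are one-hot encoded jointly so that their indicator columns align, and `random_state=0` is chosen for the splitter. All function names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import ShuffleSplit


def one_hot_together(real_X: pd.DataFrame, synthetic_X: pd.DataFrame):
    """One-hot encode both tables jointly so they share the same indicator columns."""
    combined = pd.concat([real_X, synthetic_X], keys=["real", "synthetic"])
    encoded = pd.get_dummies(combined, dtype=float)
    return encoded.loc["real"], encoded.loc["synthetic"]


def compare_real_and_synthetic(model, real_X, real_y, synthetic_X, synthetic_y):
    """Return (real-trained accuracy, synthetic-trained accuracy, absolute difference)."""
    real_X, synthetic_X = one_hot_together(real_X, synthetic_X)
    real_y = np.asarray(real_y)

    # Assumption: the synthetic-trained model sees the whole synthetic dataset.
    candidate = clone(model).fit(synthetic_X, np.asarray(synthetic_y))

    # 10 random permutation splits with 25% of the real data held out for testing.
    splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
    real_scores, synthetic_scores = [], []
    for train_idx, test_idx in splitter.split(real_X):
        X_test, y_test = real_X.iloc[test_idx], real_y[test_idx]
        benchmark = clone(model).fit(real_X.iloc[train_idx], real_y[train_idx])
        real_scores.append(benchmark.score(X_test, y_test))
        synthetic_scores.append(candidate.score(X_test, y_test))

    real_acc = float(np.mean(real_scores))
    synthetic_acc = float(np.mean(synthetic_scores))
    return real_acc, synthetic_acc, abs(real_acc - synthetic_acc)
```

The third returned value corresponds to the absolute difference in accuracy between models trained on real and on synthetic data, both tested on real data.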
Statistical Disclosure Control
Synthetic data is considered not to contain real units and therefore the risk of disclosure of a real person is considered to be unlikely [46]. Whilst unlikely, the scenario where some of the generated synthetic data is very similar to the real data, resulting in potential disclosure risk, must be considered, and where additional protections can be applied to synthetic data it is plausible to do so. Additional statistical disclosure control (SDC) measures, beyond data synthesis, can be applied as a precautionary measure to add further protections to synthetic data by reducing the risk of reproducing real person records and replicating outlier data, thus