Improve the speed of queries using complex survey designs in R

I have a large data set (more than 20 million observations) that I analyze with the survey package, and it is taking me ages to run simple queries. I have tried to find a way to speed up my code, but I would like to know whether there are better ways to make it more efficient.

In my benchmark, I compare the speed of three commands using svyby/svytotal:

1) Simple command svyby/svytotal
2) Parallel computing with foreach dopar using 7 cores
3) A compiled version of option 2

Spoiler: option 3 is more than twice as fast as option 1, but it is not suitable for large data sets because it relies on parallel computing, which quickly hits memory limits when dealing with large data sets. I face this problem even with 16 GB of RAM. There are some solutions to this memory limitation, but none of them applies to survey design objects.

Any ideas on how to make it faster without crashing because of memory limits?

My code, with a reproducible example:

# Load Packages 
library(survey) 
library(data.table) 
library(compiler) 
library(foreach) 
library(doParallel) 
options(digits=3) 

# Load Data 
data(api) 

# Convert data to data.table format (mostly to increase speed of the process) 
apiclus1 <- as.data.table(apiclus1) 

# Replicate the observations 1000 times 
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 1000), ] 

# create a count variable 
apiclus1[, Vcount := 1] 

# create survey design 
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

1) Simple code

t1 <- Sys.time() 
table1 <- svyby(~Vcount, 
       ~stype+dnum+cname, 
       design = dclus1, 
       svytotal) 
T1 <- Sys.time() - t1

2) Parallel computing with foreach dopar using 7 cores

# in this option, I create a list with different subsets of the survey design 
# that will be passed to different CPU cores to work at the same time 

subdesign <- function(i){ subset(dclus1, dnum==i)} 
groups <- unique(apiclus1$dnum) 
list_subsets <- lapply(groups[], subdesign) # apply function and get all  subsets in a list 
i <- NULL 

# Start Parallel 
registerDoParallel(cores=7) 

t2 <- Sys.time() 
table2 <- foreach (i = list_subsets, .combine= rbind, .packages="survey")  %dopar% { 
    options(survey.lonely.psu = "remove") 
    svyby(~Vcount, 
     ~stype+dnum+cname, 
     design = i, 
     svytotal)} 
T2 <- Sys.time() - t2

3) A compiled version of option 2

# make a function of the previous query 
query2 <- function (list_subsets) { foreach (i = list_subsets, .combine=  rbind, .packages="survey") %dopar% { 
    svyby(~Vcount, 
     ~stype+dnum+cname, 
     design = i, 
     svytotal)}} 

# Compile the function to increase speed 
query3 <- cmpfun(query2) 

t3 <- Sys.time() 
table3 <- query3 (list_subsets) 
T3 <- Sys.time() - t3

Results

>T1: 1.9 secs 
>T2: 1.13 secs 
>T3: 0.58 secs 

barplot(c(T1, T2, T3), 
     names.arg = c("1) simple table", "2) parallel", "3) compiled parallel"), 
     ylab="Seconds")

Source

2015-09-03 rafa.pereira

See 'refdata' from package 'ref' for a possible way to subset the data without making copies for parallel processing.

I have tried refdata @A.Webb, but it did not work. The code got slower and it is still hitting the memory limit. Am I doing something wrong? 'groups <- unique(apiclus1$dnum) subdesign <- function(i){ refdata(subset(dclus1, dnum==i))} list_subsets <- lapply(groups[], subdesign) i <- NULL table3 <- foreach (i = 1:length(groups), .combine= rbind, .packages=c("survey","ref")) %dopar% { options(survey.lonely.psu = "remove") svyby(~Vcount, ~stype+dnum+cname, design = derefdata(list_subsets[[i]]), svytotal)}'

@RafaelPereira use 'MonetDB.R' and 'survey' together. For example, see https://github.com/ajdamico/asdfree/search?utf8=%E2%9C%93&q=MonetDB.R

Thanks for laying out this question so well. Working efficiently with large survey data sets in R probably requires some basic SQL syntax (which is much easier to learn than R). MonetDB is the only big-data option compatible with the survey package, so exploring other high-performance packages is (probably) not going to be fruitful. Generally, when I am exploring a huge data set, I write SQL queries directly rather than using the survey package, because the standard error computations are computationally intensive (and variances are not as useful during interactive data exploration). Notice how the final SQL timing blows away all of the other options. To calculate a quick weighted mean, use something like "SELECT by_column , SUM(your_column * the_weight)/SUM(the_weight) FROM yourdata GROUP BY by_column"
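For readers less used to SQL, here is a minimal base R sketch of what that weighted-mean statement computes. The data frame and its column names simply mirror the placeholders in the SQL string; they are hypothetical toy values, not part of the api data.

```r
# Toy data; column names mirror the placeholders in the SQL statement
# (hypothetical values, for illustration only).
yourdata <- data.frame(
  by_column   = c("a", "a", "b", "b", "b"),
  your_column = c(10, 20, 1, 2, 3),
  the_weight  = c(1, 3, 2, 2, 1)
)

# SUM(your_column * the_weight) ... GROUP BY by_column
num <- tapply(yourdata$your_column * yourdata$the_weight, yourdata$by_column, sum)

# SUM(the_weight) ... GROUP BY by_column
den <- tapply(yourdata$the_weight, yourdata$by_column, sum)

# One weighted mean per group, with no standard errors
weighted_means <- num / den
weighted_means
```

This gives point estimates only, which is exactly why it is cheap: nothing about the sampling design is used.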

When you do need standard errors interactively, linearization (svydesign) is often more computationally intensive than replication (svrepdesign), but creating the replication design (as I have done with jk1w_dclus1 below) sometimes requires more familiarity with survey methods than some users are comfortable with.
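To give a feel for why replication can be cheap, here is a rough base R sketch of a delete-one-cluster (JK1) jackknife for a weighted total, compared with the with-replacement linearization formula. It assumes an unstratified cluster sample and ignores the finite population correction; the toy values are hypothetical and this is not the survey package's own code.

```r
# Toy cluster sample: 4 clusters (PSUs), hypothetical values and weights.
psu <- c(1, 1, 2, 2, 3, 4, 4, 4)
y   <- c(5, 7, 3, 4, 9, 2, 6, 1)
w   <- c(10, 10, 12, 12, 8, 9, 9, 9)

n <- length(unique(psu))                   # number of clusters
cluster_totals <- tapply(w * y, psu, sum)  # weighted total per cluster
total_hat <- sum(cluster_totals)           # full-sample weighted total

# Replication (JK1): drop one cluster at a time, scale the remaining
# weights by n / (n - 1), and recompute the total.
replicate_totals <- sapply(unique(psu), function(k) {
  keep <- psu != k
  sum(w[keep] * y[keep]) * n / (n - 1)
})
var_jk1 <- (n - 1) / n * sum((replicate_totals - total_hat)^2)

# Linearization, with-replacement form, no finite population correction.
var_lin <- n / (n - 1) * sum((cluster_totals - mean(cluster_totals))^2)

c(jackknife = sqrt(var_jk1), linearization = sqrt(var_lin))
```

For a total in this simple setting the two variance estimates coincide, so each replicate is just one more weighted sum. For other statistics and designs the two can differ slightly.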

# Load Packages 
library(MonetDB.R) 
library(MonetDBLite) 
library(DBI) # suggested in comments and needed on OSX 
library(survey) 

# Load Data 
data(api) 

# Replicate the observations 10000 times 
apiclus1 <- apiclus1[rep(seq_len(nrow(apiclus1)), 10000), ] 

# create a count variable 
apiclus1$vcount <- 1 

# create survey design 
dclus1 <- svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc) 


dbfolder <- tempdir() 

db <- dbConnect(MonetDBLite() , dbfolder) 
dbWriteTable(db , 'apiclus1' , apiclus1) 


db_dclus1 <- 
    svydesign(
     weight = ~pw , 
     id = ~dnum , 
     data = "apiclus1" , 
     dbtype = "MonetDBLite" , 
     dbname = dbfolder , 
     fpc = ~fpc 
    ) 

# you provided a design without strata, 
# so type="JK1" matches that most closely. 
# but see survey:::as.svrepdesign for other linearization-to-replication options 
jk1w <- jk1weights(psu = apiclus1$dnum , fpc = apiclus1$fpc) 

# after the replicate-weights have been constructed, 
# here's the `svrepdesign` call.. 
jk1w_dclus1 <- 
    svrepdesign(
     weight = ~pw , 
     type = "JK1" , 
     repweights = jk1w$repweights , 
     combined.weights = FALSE , 
     scale = jk1w$scale , 
     rscales = jk1w$rscales , 
     data = 'apiclus1' , 
     dbtype = "MonetDBLite" , 
     dbname = dbfolder 
    ) 

# slow 
system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal)) 
# > system.time(res1 <- svyby(~vcount,~stype+dnum+cname,design = dclus1,svytotal)) 
    # user system elapsed 
    # 17.40 2.86 20.27 


# faster 
system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal)) 
# > system.time(res2 <- svyby(~vcount,~stype+dnum+cname,design = db_dclus1,svytotal)) 
    # user system elapsed 
    # 13.00 1.20 14.18 


# fastest 
system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal)) 
# > system.time(res3 <- svyby(~vcount,~stype+dnum+cname,design = jk1w_dclus1,svytotal)) 
    # user system elapsed 
    # 10.75 1.19 11.96 

# same standard errors across the board 
all.equal(SE(res1) , SE(res2)) 
all.equal(SE(res2) , SE(res3)) 
# NOTE: the replicate-weighted design will be slightly different 
# for certain designs. however this technique is defensible 
# and gets used in 
# https://github.com/ajdamico/asdfree/tree/master/Censo%20Demografico 


# at the point you do not care about standard errors, 
# learn some sql: 
system.time(res4 <- dbGetQuery(db , "SELECT stype , dnum , cname , SUM(pw) FROM apiclus1 GROUP BY stype , dnum , cname")) 
# because this is near-instantaneous, no matter how much data you have. 

# same numbers as res1: 
all.equal(as.numeric(sort(coef(res1))) , sort(res4$L1)) 
# > system.time(res4 <- dbGetQuery(db , "SELECT stype , dnum , cname , SUM(pw) FROM apiclus1 GROUP BY stype , dnum , cname")) 
    # user system elapsed 
    # 0.15 0.20 0.23
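
If no database is at hand, the same grouped SUM(pw) can be accumulated chunk by chunk, because a weighted total is just a sum and partial sums add up. The following is only a sketch in base R under that assumption; the data frame `big` and the chunk size are hypothetical stand-ins for a table that would really be read in pieces from disk.

```r
# Hypothetical toy data standing in for a table too large to hold at once.
set.seed(1)
big <- data.frame(
  stype = sample(c("E", "H", "M"), 1000, replace = TRUE),
  pw    = runif(1000, 1, 50)
)

chunk_size <- 100
starts <- seq(1, nrow(big), by = chunk_size)

# Running totals per group, updated one chunk at a time.
totals <- numeric(0)
for (s in starts) {
  chunk <- big[s:min(s + chunk_size - 1, nrow(big)), ]
  # SUM(pw) GROUP BY stype, for this chunk only
  part <- vapply(split(chunk$pw, chunk$stype), sum, numeric(1))
  # Start any group seen for the first time at zero, then add the partial sums
  new_groups <- setdiff(names(part), names(totals))
  totals[new_groups] <- 0
  totals[names(part)] <- totals[names(part)] + part
}
totals <- totals[order(names(totals))]
totals
```

Only one chunk plus the small vector of running totals is in memory at any time. As with the SQL query, this yields point estimates without standard errors.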

Source

2016-02-21 04:12:41

Hi. I could not reproduce your answer. When I run the line 'db <- dbConnect(MonetDBLite(), dbfolder)', I get the following error: 'Error in MonetDBLite::monetdb_embedded_startup(embedded, !getOption("monetdb.debug.embedded", unused argument (getOption("monetdb.sequential", TRUE))'. Any idea what is going on? I am using R 3.2.4, the latest RStudio 0.99.893 and Windows 10.

@RafaelPereira try with 'library(DBI)', and if that still does not work, please open a separate Stack Overflow question with a minimal reproducible example. Being unable to start MonetDBLite is a different issue. Thanks.

I have just created the question (not sure it is well formulated): [have a look](http://stackoverflow.com/questions/36175255/create-a-connection-to-a-dbms-in-r)
