variograms (gstat) with large data sets

variograms (gstat) with large data sets

gruenewald


Dear all,
I am trying to compute variograms with variogram() from the gstat package.
The problem is that some of my data sets are very large (> 400,000 points).
Running the command takes several hours and does not produce any
error message.
Nevertheless, the result does not seem to be correct: the first few bins
are fine (up to a distance of about 300 m), but beyond that the lags are
much larger than the spatial extent of the data and the bins are no
longer contiguous. Running the code on smaller areas gives correct
results.
That is why I suspect that the problem is memory.

I am running the code with R 2.10.1 on a Linux grid (Intel(R) Core(TM)
i7-2600 CPU @ 3.40 GHz; 32-bit).

So my questions:
- Is there a better way to calculate variograms from such large data
sets, or do I have to reduce the data?
- Could parallel computation (on multiple cores) be a solution? If so,
how could that be done?

Here is the code I am using:
"scans" is a 3 column vector containing x, y, and z values resulting
from a high resolution (1 m) digital elevation model. The extent of the
data is about 600*600 m, the

library(sp)     # provides coordinates()
library(gstat)  # provides variogram()

# define 50 log-spaced bin boundaries with a maximum of 600 m
x <- 1:50
a <- exp(log(600) / 50)  # common ratio, so that a^50 = 600
logwidth <- a^x

# sample variogram
coordinates(scans) <- ~ V1 + V2  # promote scans to a SpatialPointsDataFrame
v <- variogram(V3 ~ 1, scans, boundaries = logwidth)
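
As a quick sanity check, the boundaries can be inspected to confirm
that they are geometrically spaced and end at exactly 600 m (a minimal
sketch):

# sanity check: 50 boundaries, geometrically spaced, last one exactly 600
length(logwidth)                                      # 50
tail(logwidth, 1)                                     # 600
all.equal(logwidth[-1] / logwidth[-50], rep(a, 49))   # constant ratio a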

Thank you very much,
Tom

--
Thomas Grünewald
WSL Institute for Snow and Avalanche Research SLF
Research Unit Snow and Permafrost
Team Snow Cover and Micrometeorology
Flüelastr. 11
CH-7260 Davos Dorf
Tel. +41/81/417 0365
Fax. +41/81/417 0110
[hidden email]
http://www.slf.ch



Re: variograms (gstat) with large data sets

Edzer Pebesma


On 11/17/2011 01:59 PM, gruenewald wrote:

> [...]
> So my questions:
> - Is there a better way to calculate variograms from such large data
> sets, or do I have to reduce the data?

Well, you could just take smaller samples of the data. Most likely,
variograms of 40,000 observations will give you enough information;
maybe even 4,000 -- it all depends a bit on the spatial distribution
of the points.
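
A minimal sketch of that subsampling, assuming scans is already a
SpatialPointsDataFrame as in your code, with 40,000 as an example size:

# draw a random subsample and compute its variogram
set.seed(1)                        # reproducible sample
idx <- sample(nrow(scans), 40000)  # 40,000 of the > 400,000 points
v_sub <- variogram(V3 ~ 1, scans[idx, ], boundaries = logwidth)
plot(v_sub)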

> - Could parallel computation (on multiple cores) be a solution? If so,
> how could that be done?

Difficult -- if you split the data into 10 slices, compute variograms
for each of them, and average those, the result is not the same as the
variogram of the full data set, because point pairs across slices are
not considered.


--
Edzer Pebesma
Institute for Geoinformatics (ifgi), University of Münster
Weseler Straße 253, 48151 Münster, Germany. Phone: +49 251
8333081, Fax: +49 251 8339763  http://ifgi.uni-muenster.de
http://www.52north.org/geostatistics      [hidden email]

Re: variograms (gstat) with large data sets

Paul Hiemstra
On 11/21/2011 08:22 PM, Edzer Pebesma wrote:

> On 11/17/2011 01:59 PM, gruenewald wrote:
>> [...]
>> - Could parallel computation (on multiple cores) be a solution? If so,
>> how could that be done?
> Difficult -- if you split the data into 10 slices, compute variograms
> for each of them, and average those, the result is not the same as the
> variogram of the full data set, because point pairs across slices are
> not considered.

First generating the point pairs and then calculating the semivariances
in parallel might work. But I agree with Edzer that you probably do not
need the full data set to get a good variogram model. You can compute
variograms for increasing amounts of data and check whether the result
changes; the variogram model will probably converge beyond a certain
number of data points.
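
A minimal sketch of such a convergence check, with the sample sizes as
my own example values; mclapply is in the parallel package (R >= 2.14),
or in the older multicore package for R versions like yours:

library(parallel)

# variograms of increasing random subsamples, one sample size per core;
# if the curves for two sizes overlap, the smaller sample is sufficient
sizes <- c(4000, 10000, 40000, 100000)
vgms <- mclapply(sizes, function(n) {
  idx <- sample(nrow(scans), n)
  variogram(V3 ~ 1, scans[idx, ], boundaries = logwidth)
}, mc.cores = length(sizes))
names(vgms) <- sizes

Plotting gamma against dist for each element of vgms then shows whether
the empirical variogram has stabilized.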

regards,
Paul



--
Paul Hiemstra, Ph.D.
Global Climate Division
Royal Netherlands Meteorological Institute (KNMI)
Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
P.O. Box 201 | 3730 AE | De Bilt
tel: +31 30 2206 494

http://intamap.geo.uu.nl/~paul
http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770

_______________________________________________
Geostatistics mailing list
[hidden email]
http://list.52north.org/mailman/listinfo/geostatistics
http://geostatistics.forum.52north.org