Using Cesium for display of remote parquet (Wide Format).

parquet
spatial
recipe
wide

This page renders points from an iSamples wide-format parquet file on Cesium using point primitives.

NoteWide Format Advantages

This page uses the wide parquet schema which:

  • Is 60% smaller (275 MB vs 691 MB)
  • Has 79% fewer rows (~2.5M vs ~11.6M)
  • Uses simpler queries (direct column access via p__* columns instead of edge row JOINs)
  • Provides 2-4x faster query performance over HTTP

See Narrow vs Wide Performance for benchmarks.

DuckDB-WASM running in the browser cannot access local files via file:// URLs due to browser security restrictions. However, you can use a local cached file when running quarto preview:

Local Development (recommended)

The repository includes a cached parquet file. To use it:

  1. Ensure the file exists in docs/assets/oc_isamples_pqg_wide.parquet (275MB)

    • The file must be in Quarto’s output directory docs/assets/, not just the source assets/ directory
    • If needed, copy: cp assets/oc_isamples_pqg_wide.parquet docs/assets/
  2. When running quarto preview, use the full localhost URL:

    http://localhost:4979/assets/oc_isamples_pqg_wide.parquet

    (Replace 4979 with your actual preview port)

Alternative: Python HTTP server

# In the directory containing your parquet file:
cd /Users/raymondyee/Data/iSample
python3 -m http.server 8000

Then use: http://localhost:8000/oc_isamples_pqg_wide.parquet

Benefits of wide format file: - 60% smaller than narrow format (275 MB vs 691 MB) - Much faster initial load (less network transfer) - Simpler queries with direct column access - Works offline once cached

Limitation: Only works during local development, not on published GitHub Pages.

WarningHeads up: first interaction may be slow

The first click or query can take a few seconds while the in‑browser database engine initializes and the remote Parquet file is fetched and indexed. Subsequent interactions are much faster because both the browser and DuckDB cache metadata and column chunks, so later queries reuse what was already loaded.

Code
db = {
  const instance = await DuckDBClient.of();
  await instance.query(`create view nodes as select * from read_parquet('${parquet_path}')`)
  return instance;
}


async function loadData(query, params = [], waiting_id = null, key = "default") {
    // latest-only guard per key
    loadData._latest = loadData._latest || new Map();
    const requestToken = Symbol();
    loadData._latest.set(key, requestToken);

    // Get loading indicator
    const waiter = waiting_id ? document.getElementById(waiting_id) : null;
    if (waiter) waiter.hidden = false;

    try {
        // Run the (slow) query
        const _results = await db.query(query, params);
        // Ignore stale responses
        if (loadData._latest.get(key) !== requestToken) return null;
        return _results;
    } catch (error) {
        if (waiter && loadData._latest.get(key) === requestToken) {
            waiter.innerHTML = `<pre>${error}</pre>`;
        }
        return null;
    } finally {
        // Hide the waiter (if there is one) only if latest
        if (waiter && loadData._latest.get(key) === requestToken) {
            waiter.hidden = true;
        }
    }
}

locations = {
    // Performance telemetry
    performance.mark('locations-start');

    // Get loading indicator element for progress updates
    const loadingDiv = document.getElementById('loading_1');
    if (loadingDiv) {
        loadingDiv.hidden = false;
        loadingDiv.innerHTML = 'Loading geocodes...';
    }

    // Fast query: just get all distinct geocodes (no classification!)
    const query = `
        SELECT DISTINCT
            pid,
            latitude,
            longitude
        FROM nodes
        WHERE otype = 'GeospatialCoordLocation'
    `;

    performance.mark('query-start');
    const data = await loadData(query, [], "loading_1", "locations");
    performance.mark('query-end');
    performance.measure('locations-query', 'query-start', 'query-end');
    const queryTime = performance.getEntriesByName('locations-query')[0].duration;
    console.log(`Query executed in ${queryTime.toFixed(0)}ms - retrieved ${data.length} locations`);

    // Clear the existing PointPrimitiveCollection
    content.points.removeAll();

    // Single color for all points (blue)
    const defaultColor = Cesium.Color.fromCssColorString('#2E86AB');
    const defaultSize = 4;

    // Render points in chunks to keep UI responsive
    const CHUNK_SIZE = 500;
    const scalar = new Cesium.NearFarScalar(1.5e2, 2, 8.0e6, 0.2);

    performance.mark('render-start');
    for (let i = 0; i < data.length; i += CHUNK_SIZE) {
        const chunk = data.slice(i, i + CHUNK_SIZE);
        const endIdx = Math.min(i + CHUNK_SIZE, data.length);

        // Update progress indicator
        if (loadingDiv) {
            const pct = Math.round((endIdx / data.length) * 100);
            loadingDiv.innerHTML = `Rendering geocodes... ${endIdx.toLocaleString()}/${data.length.toLocaleString()} (${pct}%)`;
        }

        // Add points for this chunk
        for (const row of chunk) {
            content.points.add({
                id: row.pid,
                position: Cesium.Cartesian3.fromDegrees(
                    row.longitude,  //longitude
                    row.latitude,   //latitude
                    0               //elevation, m
                ),
                pixelSize: defaultSize,
                color: defaultColor,
                scaleByDistance: scalar,
            });
        }

        // Yield to browser between chunks to keep UI responsive
        if (i + CHUNK_SIZE < data.length) {
            await new Promise(resolve => setTimeout(resolve, 0));
        }
    }
    performance.mark('render-end');
    performance.measure('locations-render', 'render-start', 'render-end');
    const renderTime = performance.getEntriesByName('locations-render')[0].duration;

    // Hide loading indicator
    if (loadingDiv) {
        loadingDiv.hidden = true;
    }

    performance.mark('locations-end');
    performance.measure('locations-total', 'locations-start', 'locations-end');
    const totalTime = performance.getEntriesByName('locations-total')[0].duration;

    console.log(`Rendering completed in ${renderTime.toFixed(0)}ms`);
    console.log(`Total time (query + render): ${totalTime.toFixed(0)}ms`);

    content.enableTracking();
    return data;
}


function createShowPrimitive(viewer) {
    return function(movement) {
        // Get the point at the mouse end position
        const selectPoint = viewer.viewer.scene.pick(movement.endPosition);

        // Clear the current selection, if there is one and it is different to the selectPoint
        if (viewer.currentSelection !== null) {
            //console.log(`selected.p ${viewer.currentSelection}`)
            if (Cesium.defined(selectPoint) && selectPoint !== viewer.currentSelection) {
                console.log(`selected.p 2 ${viewer.currentSelection}`)
                viewer.currentSelection.primitive.pixelSize = 4;
                viewer.currentSelection.primitive.outlineColor = Cesium.Color.TRANSPARENT;
                viewer.currentSelection.outlineWidth = 0;
                viewer.currentSelection = null;
            }
        }

        // If selectPoint is valid and no currently selected point
        if (Cesium.defined(selectPoint) && selectPoint.hasOwnProperty("primitive")) {
            //console.log(`showPrimitiveId ${selectPoint.id}`);
            //const carto = Cesium.Cartographic.fromCartesian(selectPoint.primitive.position)
            viewer.pointLabel.position = selectPoint.primitive.position;
            viewer.pointLabel.label.show = true;
            //viewer.pointLabel.label.text = `id:${selectPoint.id}, ${carto}`;
            viewer.pointLabel.label.text = `${selectPoint.id}`;
            selectPoint.primitive.pixelSize = 20;
            selectPoint.primitive.outlineColor = Cesium.Color.YELLOW;
            selectPoint.primitive.outlineWidth = 3;
            viewer.currentSelection = selectPoint;
        } else {
            viewer.pointLabel.label.show = false;
        }
    }
}

class CView {
    constructor(target) {
        this.viewer = new Cesium.Viewer(
            target, {
                timeline: false,
                animation: false,
                baseLayerPicker: false,
                fullscreenElement: target,
                terrain: Cesium.Terrain.fromWorldTerrain()
            });
        this.currentSelection = null;
        this.point_size = 1;
        this.n_points = 0;
        // https://cesium.com/learn/cesiumjs/ref-doc/PointPrimitiveCollection.html
        this.points = new Cesium.PointPrimitiveCollection();
        this.viewer.scene.primitives.add(this.points);

        this.pointLabel = this.viewer.entities.add({
            label: {
            show: false,
            showBackground: true,
            font: "14px monospace",
            horizontalOrigin: Cesium.HorizontalOrigin.LEFT,
            verticalOrigin: Cesium.VerticalOrigin.BOTTOM,
            pixelOffset: new Cesium.Cartesian2(15, 0),
            // this attribute will prevent this entity clipped by the terrain
            disableDepthTestDistance: Number.POSITIVE_INFINITY,
            text:"",
            },
        });

        this.pickHandler = new Cesium.ScreenSpaceEventHandler(this.viewer.scene.canvas);
        // Can also do this rather than wait for the points to be generated
        //this.pickHandler.setInputAction(createShowPrimitive(this), Cesium.ScreenSpaceEventType.MOUSE_MOVE);

        this.selectHandler = new Cesium.ScreenSpaceEventHandler(this.viewer.scene.canvas);
        this.selectHandler.setInputAction((e) => {
            const selectPoint = this.viewer.scene.pick(e.position);
            if (Cesium.defined(selectPoint) && selectPoint.hasOwnProperty("primitive")) {
                mutable clickedPointId = selectPoint.id;
            }
        },Cesium.ScreenSpaceEventType.LEFT_CLICK);

    }

    enableTracking() {
        this.pickHandler.setInputAction(createShowPrimitive(this), Cesium.ScreenSpaceEventType.MOUSE_MOVE);
    }
}

content = new CView("cesiumContainer");

async function getGeoRecord(pid) {
    if (pid === null || pid ==="" || pid == "unset") {
        return "unset";
    }
    const q = `SELECT row_id, pid, otype, latitude, longitude FROM nodes WHERE otype='GeospatialCoordLocation' AND pid=?`;
    const rows = await loadData(q, [pid], "loading_geo", "geo");
    return rows && rows.length ? rows[0] : null;
}

// WIDE FORMAT: Direct Location (p__sample_location / EVENT_SAMPLE_LOCATION)
// Uses p__sample_location column instead of edge row JOINs
async function get_samples_1(pid) {
    if (pid === null || pid ==="" || pid == "unset") {
        return [];
    }
    // Direct Location (EVENT_SAMPLE_LOCATION) - WIDE FORMAT version
    // Uses p__* columns instead of edge rows
    const q = `
        SELECT
            geo.latitude,
            geo.longitude,
            site.label AS sample_site_label,
            site.pid AS sample_site_pid,
            samp.pid AS sample_pid,
            samp.alternate_identifiers AS sample_alternate_identifiers,
            samp.label AS sample_label,
            samp.description AS sample_description,
            samp.thumbnail_url AS sample_thumbnail_url,
            samp.thumbnail_url IS NOT NULL as has_thumbnail,
            'direct_event_location' as location_path
        FROM nodes AS geo
        -- Wide format: SamplingEvent has p__sample_location column with geo row_ids
        JOIN nodes AS se ON (
            se.otype = 'SamplingEvent'
            AND list_contains(se.p__sample_location, geo.row_id)
        )
        -- Wide format: SamplingEvent has p__sampling_site column with site row_ids
        JOIN nodes AS site ON (
            site.otype = 'SamplingSite'
            AND list_contains(se.p__sampling_site, site.row_id)
        )
        -- Wide format: MaterialSampleRecord has p__produced_by column with event row_ids
        JOIN nodes AS samp ON (
            samp.otype = 'MaterialSampleRecord'
            AND list_contains(samp.p__produced_by, se.row_id)
        )
        WHERE geo.pid = ?
          AND geo.otype = 'GeospatialCoordLocation'
        ORDER BY has_thumbnail DESC
    `;
    performance.mark('samples1-start');
    const result = await loadData(q, [pid], "loading_s1", "samples_1");
    performance.mark('samples1-end');
    performance.measure('samples1-query', 'samples1-start', 'samples1-end');
    const queryTime = performance.getEntriesByName('samples1-query')[0].duration;
    console.log(`Direct location query (wide) executed in ${queryTime.toFixed(0)}ms - retrieved ${result?.length || 0} samples`);
    return result ?? [];
}

// WIDE FORMAT: Site-Mediated Location (p__sampling_site → p__site_location)
// Uses p__site_location and p__sampling_site columns
async function get_samples_2(pid) {
    if (pid === null || pid ==="" || pid == "unset") {
        return [];
    }
    // Site-Mediated Location (SITE_LOCATION) - WIDE FORMAT version
    const q = `
        SELECT
            geo.latitude,
            geo.longitude,
            site.label AS sample_site_label,
            site.pid AS sample_site_pid,
            samp.pid AS sample_pid,
            samp.alternate_identifiers AS sample_alternate_identifiers,
            samp.label AS sample_label,
            samp.description AS sample_description,
            samp.thumbnail_url AS sample_thumbnail_url,
            samp.thumbnail_url IS NOT NULL as has_thumbnail,
            'via_site_location' as location_path
        FROM nodes AS geo
        -- Wide format: SamplingSite has p__site_location column with geo row_ids
        JOIN nodes AS site ON (
            site.otype = 'SamplingSite'
            AND list_contains(site.p__site_location, geo.row_id)
        )
        -- Wide format: SamplingEvent has p__sampling_site column with site row_ids
        JOIN nodes AS se ON (
            se.otype = 'SamplingEvent'
            AND list_contains(se.p__sampling_site, site.row_id)
        )
        -- Wide format: MaterialSampleRecord has p__produced_by column with event row_ids
        JOIN nodes AS samp ON (
            samp.otype = 'MaterialSampleRecord'
            AND list_contains(samp.p__produced_by, se.row_id)
        )
        WHERE geo.pid = ?
          AND geo.otype = 'GeospatialCoordLocation'
        ORDER BY has_thumbnail DESC
    `;
    performance.mark('samples2-start');
    const result = await loadData(q, [pid], "loading_s2", "samples_2");
    performance.mark('samples2-end');
    performance.measure('samples2-query', 'samples2-start', 'samples2-end');
    const queryTime = performance.getEntriesByName('samples2-query')[0].duration;
    console.log(`Site-mediated location query (wide) executed in ${queryTime.toFixed(0)}ms - retrieved ${result?.length || 0} samples`);
    return result ?? [];
}

// WIDE FORMAT: Eric Kansa's authoritative query (p__sample_location + p__sampling_site)
// This is the wide format equivalent of get_samples_at_geo_cord_location_via_sample_event
async function get_samples_at_geo_cord_location_via_sample_event(pid) {
    if (pid === null || pid ==="" || pid == "unset") {
        return [];
    }
    // Eric Kansa's authoritative query - WIDE FORMAT version
    // Source pattern: https://github.com/ekansa/open-context-py
    const q = `
        SELECT
            geo.latitude,
            geo.longitude,
            site.label AS sample_site_label,
            site.pid AS sample_site_pid,
            samp.pid AS sample_pid,
            samp.alternate_identifiers AS sample_alternate_identifiers,
            samp.label AS sample_label,
            samp.description AS sample_description,
            samp.thumbnail_url AS sample_thumbnail_url,
            samp.thumbnail_url IS NOT NULL as has_thumbnail
        FROM nodes AS geo
        -- Wide format: SamplingEvent.p__sample_location contains geo row_ids
        JOIN nodes AS se ON (
            se.otype = 'SamplingEvent'
            AND list_contains(se.p__sample_location, geo.row_id)
        )
        -- Wide format: SamplingEvent.p__sampling_site contains site row_ids
        JOIN nodes AS site ON (
            site.otype = 'SamplingSite'
            AND list_contains(se.p__sampling_site, site.row_id)
        )
        -- Wide format: MaterialSampleRecord.p__produced_by contains event row_ids
        JOIN nodes AS samp ON (
            samp.otype = 'MaterialSampleRecord'
            AND list_contains(samp.p__produced_by, se.row_id)
        )
        WHERE geo.pid = ?
          AND geo.otype = 'GeospatialCoordLocation'
        ORDER BY has_thumbnail DESC
    `;
    performance.mark('eric-query-start');
    const result = await loadData(q, [pid], "loading_combined", "samples_combined");
    performance.mark('eric-query-end');
    performance.measure('eric-query', 'eric-query-start', 'eric-query-end');
    const queryTime = performance.getEntriesByName('eric-query')[0].duration;
    console.log(`Eric's query (wide) executed in ${queryTime.toFixed(0)}ms - retrieved ${result?.length || 0} samples`);
    return result ?? [];
}

// WIDE FORMAT: Get full sample data via sample PID
async function get_sample_data_via_sample_pid(sample_pid) {
    if (sample_pid === null || sample_pid === "" || sample_pid === "unset") {
        return null;
    }
    // Wide format: Uses p__produced_by, p__sample_location, p__sampling_site columns
    const q = `
        SELECT
            samp.row_id,
            samp.pid AS sample_pid,
            samp.alternate_identifiers AS sample_alternate_identifiers,
            samp.label AS sample_label,
            samp.description AS sample_description,
            samp.thumbnail_url AS sample_thumbnail_url,
            samp.thumbnail_url IS NOT NULL as has_thumbnail,
            geo.latitude,
            geo.longitude,
            site.label AS sample_site_label,
            site.pid AS sample_site_pid
        FROM nodes AS samp
        -- Wide format: use p__produced_by column
        JOIN nodes AS se ON (
            se.otype = 'SamplingEvent'
            AND list_contains(samp.p__produced_by, se.row_id)
        )
        -- Wide format: use p__sample_location column
        JOIN nodes AS geo ON (
            geo.otype = 'GeospatialCoordLocation'
            AND list_contains(se.p__sample_location, geo.row_id)
        )
        -- Wide format: use p__sampling_site column
        JOIN nodes AS site ON (
            site.otype = 'SamplingSite'
            AND list_contains(se.p__sampling_site, site.row_id)
        )
        WHERE samp.pid = ?
          AND samp.otype = 'MaterialSampleRecord'
    `;
    const result = await loadData(q, [sample_pid], "loading_sample_data", "sample_data");
    return result && result.length ? result[0] : null;
}

// WIDE FORMAT: Get agent info (who collected/registered)
async function get_sample_data_agents_sample_pid(sample_pid) {
    if (sample_pid === null || sample_pid === "" || sample_pid === "unset") {
        return [];
    }
    // Wide format: Uses p__produced_by and p__responsibility/p__registrant columns
    const q = `
        WITH event_agents AS (
            SELECT
                samp.pid AS sample_pid,
                samp.label AS sample_label,
                samp.description AS sample_description,
                samp.thumbnail_url AS sample_thumbnail_url,
                samp.thumbnail_url IS NOT NULL as has_thumbnail,
                'responsibility' AS predicate,
                unnest(se.p__responsibility) AS agent_row_id
            FROM nodes AS samp
            JOIN nodes AS se ON (
                se.otype = 'SamplingEvent'
                AND list_contains(samp.p__produced_by, se.row_id)
            )
            WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord'

            UNION ALL

            SELECT
                samp.pid AS sample_pid,
                samp.label AS sample_label,
                samp.description AS sample_description,
                samp.thumbnail_url AS sample_thumbnail_url,
                samp.thumbnail_url IS NOT NULL as has_thumbnail,
                'registrant' AS predicate,
                unnest(samp.p__registrant) AS agent_row_id
            FROM nodes AS samp
            WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord'
        )
        SELECT
            ea.sample_pid,
            ea.sample_label,
            ea.sample_description,
            ea.sample_thumbnail_url,
            ea.has_thumbnail,
            ea.predicate,
            agent.pid AS agent_pid,
            agent.name AS agent_name,
            agent.alternate_identifiers AS agent_alternate_identifiers
        FROM event_agents ea
        JOIN nodes AS agent ON (
            agent.row_id = ea.agent_row_id
            AND agent.otype = 'Agent'
        )
    `;
    const result = await loadData(q, [sample_pid, sample_pid], "loading_agents", "agents");
    return result ?? [];
}

// WIDE FORMAT: Get classification keywords and types
async function get_sample_types_and_keywords_via_sample_pid(sample_pid) {
    if (sample_pid === null || sample_pid === "" || sample_pid === "unset") {
        return [];
    }
    // Wide format: Sample has p__keywords, p__has_sample_object_type, p__has_material_category columns
    const q = `
        WITH sample_concepts AS (
            SELECT
                samp.pid AS sample_pid,
                samp.label AS sample_label,
                'keywords' AS predicate,
                unnest(samp.p__keywords) AS concept_row_id
            FROM nodes AS samp
            WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord'

            UNION ALL

            SELECT
                samp.pid AS sample_pid,
                samp.label AS sample_label,
                'has_sample_object_type' AS predicate,
                unnest(samp.p__has_sample_object_type) AS concept_row_id
            FROM nodes AS samp
            WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord'

            UNION ALL

            SELECT
                samp.pid AS sample_pid,
                samp.label AS sample_label,
                'has_material_category' AS predicate,
                unnest(samp.p__has_material_category) AS concept_row_id
            FROM nodes AS samp
            WHERE samp.pid = ? AND samp.otype = 'MaterialSampleRecord'
        )
        SELECT
            sc.sample_pid,
            sc.sample_label,
            sc.predicate,
            kw.pid AS keyword_pid,
            kw.label AS keyword
        FROM sample_concepts sc
        JOIN nodes AS kw ON (
            kw.row_id = sc.concept_row_id
            AND kw.otype = 'IdentifiedConcept'
        )
    `;
    const result = await loadData(q, [sample_pid, sample_pid, sample_pid], "loading_keywords", "keywords");
    return result ?? [];
}

async function locationUsedBy(rowid){
    if (rowid === undefined || rowid === null) {
        return [];
    }
    // Wide format: Check which entities reference this location via p__* columns
    const q = `
        SELECT pid, otype FROM nodes
        WHERE list_contains(p__sample_location, ?)
           OR list_contains(p__site_location, ?)
    `;
    return db.query(q, [rowid, rowid]);
}

mutable clickedPointId = "unset";
// Loading flags to control UI clearing while fetching
mutable geoLoading = false;
mutable s1Loading = false;
mutable s2Loading = false;
mutable combinedLoading = false;

// Precompute selection-driven data with loading flags
selectedGeoRecord = {
    mutable geoLoading = true;
    try {
        return await getGeoRecord(clickedPointId);
    } finally {
        mutable geoLoading = false;
    }
}

selectedSamples1 = {
    mutable s1Loading = true;
    try {
        return await get_samples_1(clickedPointId);
    } finally {
        mutable s1Loading = false;
    }
}

selectedSamples2 = {
    mutable s2Loading = true;
    try {
        return await get_samples_2(clickedPointId);
    } finally {
        mutable s2Loading = false;
    }
}

selectedSamplesCombined = {
    mutable combinedLoading = true;
    try {
        return await get_samples_at_geo_cord_location_via_sample_event(clickedPointId);
    } finally {
        mutable combinedLoading = false;
    }
}

md`Retrieved ${pointdata.length} locations from ${parquet_path}.`;
Loading…
Code
viewof pointdata = {
    const data_table = Inputs.table(locations, {
        header: {
            pid: "PID",
            latitude: "Latitude",
            longitude: "Longitude",
            location_type: "Location Type"
        },
    });
    return data_table;
}

The click point ID is “”.

1 getGeoRecord (selected)

Code
pid = clickedPointId
testrecord = selectedGeoRecord;

2 Samples at Location via Sampling Event (Eric Kansa’s Query - Wide Format)

This query implements Eric Kansa’s authoritative get_samples_at_geo_cord_location_via_sample_event function from open-context-py, translated to wide format.

NoteWide Format Query Advantage

Narrow format requires 7+ JOINs through edge rows:

JOIN nodes AS rel_se ON (rel_se.p = 'sample_location' AND list_contains(rel_se.o, geo.row_id))
JOIN nodes AS se ON (rel_se.s = se.row_id ...)

Wide format uses direct column access (3 JOINs):

JOIN nodes AS se ON (se.otype = 'SamplingEvent' AND list_contains(se.p__sample_location, geo.row_id))

This is typically 2-4x faster over HTTP.

**Query Strategy (p__sample_location + p__sampling_site): - Starts at a GeospatialCoordLocation (clicked point) - Walks backward** via p__sample_location column to find SamplingEvents that reference this location - From those events, finds MaterialSampleRecords via p__produced_by column - Requires site context (INNER JOIN on p__sampling_site → SamplingSite)

Returns: - Geographic coordinates: latitude, longitude - Sample metadata: sample_pid, sample_label, sample_description, sample_alternate_identifiers - Site context: sample_site_label, sample_site_pid - Media: sample_thumbnail_url, has_thumbnail

Ordering: Prioritizes samples with images (ORDER BY has_thumbnail DESC)

Important: This query only returns samples whose sampling events directly reference this geolocation via p__sample_location (EVENT_SAMPLE_LOCATION). Samples that reach this location only through their site’s p__site_location (SITE_LOCATION) are not included. This means site marker locations may return 0 results if no events were recorded at that exact coordinate.

3 Understanding Location Traversals Using Typed Edges (Wide Format)

3.1 The 14 iSamples Typed Edge Types

The iSamples property graph uses 14 typed edges to express all relationships. In wide format, these are represented as p__* columns on entity rows rather than separate edge rows:

Edge Type Wide Format Column From → To
MSR_PRODUCED_BY p__produced_by MaterialSampleRecord → SamplingEvent
MSR_REGISTRANT p__registrant MaterialSampleRecord → Agent
MSR_KEYWORDS p__keywords MaterialSampleRecord → IdentifiedConcept
MSR_HAS_CONTEXT_CATEGORY p__has_context_category MaterialSampleRecord → IdentifiedConcept
MSR_HAS_MATERIAL_CATEGORY p__has_material_category MaterialSampleRecord → IdentifiedConcept
MSR_HAS_SAMPLE_OBJECT_TYPE p__has_sample_object_type MaterialSampleRecord → IdentifiedConcept
EVENT_SAMPLING_SITE p__sampling_site SamplingEvent → SamplingSite
EVENT_SAMPLE_LOCATION p__sample_location SamplingEvent → GeospatialCoordLocation
EVENT_RESPONSIBILITY p__responsibility SamplingEvent → Agent
SITE_LOCATION p__site_location SamplingSite → GeospatialCoordLocation

3.2 Two Ways to Get Geographic Coordinates (Wide Format)

**Direct Location (via p__sample_location / EVENT_SAMPLE_LOCATION)**

MaterialSampleRecord
  → p__produced_by →
SamplingEvent
  → p__sample_location →
GeospatialCoordLocation

**Site-Mediated Location (via p__sampling_site + p__site_location)**

MaterialSampleRecord
  → p__produced_by →
SamplingEvent
  → p__sampling_site →
SamplingSite
  → p__site_location →
GeospatialCoordLocation

Key Differences: - Direct uses p__sample_location: 2 edges from Sample to Location - Site-Mediated uses p__sampling_site + p__site_location: 3 edges from Sample to Location - Direct = “Where was this specific sample collected?” - Site-Mediated = “What named site is this sample from, and where is that site?”

Wide Format Advantage: Instead of JOINing through separate edge rows (otype=‘edge’), we directly access the p__* columns on entity rows using list_contains().

Important: The queries below use INNER JOIN for both traversals, meaning samples must have connections through both to appear in results. Samples with only one traversal will be excluded.

3.3 Full Relationship Map Using Typed Edges (Wide Format)

                                    Agent
                                      ↑
                                      | p__responsibility, p__registrant
                                      |
MaterialSampleRecord ─p__produced_by─→ SamplingEvent ─p__sample_location─→ GeospatialCoordLocation
    |                                       |                                         ↑
    |                                       |                                         |
    | p__keywords,                          └─p__sampling_site─→ SamplingSite ─p__site_location─┘
    | p__has_sample_object_type,
    | p__has_material_category
    |
    └──→ IdentifiedConcept

Edge Categories (Wide Format Columns): - DIRECT LOCATION: p__produced_byp__sample_location - SITE-MEDIATED LOCATION: p__produced_byp__sampling_sitep__site_location - AGENT EDGES: p__registrant, p__responsibility - CONCEPT EDGES: p__keywords, p__has_*_category, p__has_sample_object_type

Key Insight: SamplingEvent is the central hub for most relationships, except concept edges which attach directly to MaterialSampleRecord.

3.4 Query Pattern Analysis (Wide Format)

The following analysis shows Eric’s query functions in wide format:

3.4.1 1. get_sample_data_via_sample_pid - Uses BOTH Location Traversals

MaterialSampleRecord (WHERE pid = ?)
  → p__produced_by → SamplingEvent
    ├─→ p__sample_location → GeospatialCoordLocation [Direct]
    └─→ p__sampling_site → SamplingSite [Site-Mediated]

Returns: sample metadata + lat/lon + site label/pid
Required: BOTH traversals must exist (INNER JOIN)

3.4.2 2. get_sample_data_agents_sample_pid - Uses Agent Edges

MaterialSampleRecord (WHERE pid = ?)
  → p__produced_by → SamplingEvent
    → p__responsibility → Agent

Returns: sample metadata + agent info (who collected/registered)
Independent of: Location traversals (no geographic data)

3.4.3 3. get_sample_types_and_keywords_via_sample_pid - Uses Concept Edges

MaterialSampleRecord (WHERE pid = ?)
  → p__keywords → IdentifiedConcept
  → p__has_sample_object_type → IdentifiedConcept
  → p__has_material_category → IdentifiedConcept

Returns: sample metadata + classification keywords/types
Independent of: Location traversals and SamplingEvent!

3.4.4 4. get_samples_at_geo_cord_location_via_sample_event - Eric’s Canonical Query

GeospatialCoordLocation (WHERE pid = ?)  ← START HERE (reverse!)
  ← p__sample_location ← SamplingEvent [Direct, reversed]
    ├─→ p__sampling_site → SamplingSite [Site context - provenance!]
    └─← p__produced_by ← MaterialSampleRecord [complete chain]

Returns: all samples at a given location + site info
Direction: geo → samples (opposite of other queries)

Key Insight (from Eric Kansa): This canonical query requires BOTH p__sample_location AND p__sampling_site - it’s not just “find samples at this location” but “find samples at this location that have proper site provenance.”

Summary Table:

Function Direct Location Site-Mediated Direction Notes
get_sample_data_via_sample_pid ✅ Required ✅ Required Forward INNER JOIN - no row if either missing
get_sample_data_agents_sample_pid ❌ N/A ❌ N/A N/A Uses agent edges instead
get_sample_types_and_keywords_via_sample_pid ❌ N/A ❌ N/A N/A Direct concept edges
get_samples_at_geo_cord_location_via_sample_event ✅ Required ✅ Required Reverse Geo → samples with site provenance

6 Geographic Location Classification

Tip✅ IMPLEMENTED - Differentiated Geographic Visualization

Current implementation: GeospatialCoordLocations are now color-coded by their semantic role in the property graph:

  • 🔵 Blue (small) - sample_location_only: Precise field collection points (EVENT_SAMPLE_LOCATION)
  • 🟣 Purple (large) - site_location_only: Administrative site markers (SITE_LOCATION)
  • 🟠 Orange (medium) - both: Dual-purpose locations (used for both direct and site-mediated location)

Discovery: Analysis of the OpenContext parquet data reveals that geos fall into three distinct categories based on their usage:

  1. sample_location_only: Precise field collection points (EVENT_SAMPLE_LOCATION)
    • Most common category
    • Represents exact GPS coordinates where sampling events occurred
    • Varies per event, even within the same site
  2. site_location_only: Administrative site markers (SITE_LOCATION)
    • Represents general/reference locations for named archaeological sites
    • One coordinate per site
    • May not correspond to any actual collection point
  3. both: 10,346 geos (5.2%) - Dual-purpose locations
    • Used as BOTH p__sample_location AND p__site_location
    • Primarily single-location sites (85% of all sites)
    • Occasionally one of many locations at multi-location sites (e.g., PKAP)

Site spatial patterns: - 85.4% of sites are compact (single location) - all events at one coordinate - Example: Suberde - 384 events at one location - 14.6% of sites are distributed (multiple locations) - events spread across space - Example: PKAP Survey Area - 15,446 events across 544 different coordinates - Poggio Civitate - 29,985 events across 11,112 coordinates

6.1 Benefits of Current Implementation

  1. Educational: Makes EVENT_SAMPLE_LOCATION vs SITE_LOCATION distinction visually concrete
    • Users can SEE the semantic difference between precise and administrative locations
    • Blue points show where samples were actually collected (EVENT_SAMPLE_LOCATION)
    • Purple points show administrative site markers (SITE_LOCATION)
    • Demonstrates the complementary nature of the two geographic paths
  2. Exploratory: Enables visual understanding of spatial patterns
    • Archaeological sites appear as purple markers (large points)
    • Field collection points appear as blue markers (small points)
    • Dual-purpose locations appear as orange markers (medium points)
    • No UI filters required - the colors provide immediate visual differentiation
  3. Analytical: Reveals site spatial structure at a glance
    • Compact sites: tight cluster of blue points around purple marker
    • Survey areas: purple marker with cloud of blue points spread across region
    • Identifies sampling strategies and field methodologies by visual inspection

6.2 Wide Format Advantage

The classification query is simpler in wide format because it directly checks p__sample_location and p__site_location columns instead of querying through edge rows.

Narrow format (edge rows):

JOIN nodes e ON (geo.row_id = e.o[1])
WHERE e.p IN ('sample_location', 'site_location')

Wide format (direct columns):

LEFT JOIN nodes AS se ON (list_contains(se.p__sample_location, geo.row_id))
LEFT JOIN nodes AS site ON (list_contains(site.p__site_location, geo.row_id))

6.3 Implementation Status

Status: ✅ IMPLEMENTED (Basic color-coding by location type)

What’s implemented: - ✅ Classification query using p__sample_location and p__site_location columns - ✅ Conditional styling by location_type - ✅ Color-coded points: Blue (sample_location), Purple (site_location), Orange (both) - ✅ Size differentiation: 3px (field points), 6px (sites), 5px (dual-purpose)

Future enhancements (not yet implemented): - ⬜ UI filter controls (checkbox toggles for each location type) - ⬜ Site Explorer Mode (click site → highlight all sample_locations) - ⬜ Convex hull/region drawing for distributed sites - ⬜ Dynamic statistics display on site selection

This implementation transforms the visualization from uniform points into a pedagogical tool that visually demonstrates the EVENT_SAMPLE_LOCATION vs SITE_LOCATION distinction in the iSamples metadata model architecture.

7 See Also