OpenContext Parquet Data Analysis - Enhanced Edition

Tags: parquet, spatial, property-graph

This document provides an enhanced analysis of the OpenContext iSamples parquet file, demonstrating the property graph structure and how to work with archaeological specimen data.

1 Understanding the Property Graph Structure

The OpenContext iSamples parquet file implements a sophisticated property graph model that combines the flexibility of graph databases with the analytical performance of columnar storage. Unlike traditional relational databases or pure graph databases, this approach stores both entities (nodes) and relationships (edges) in a single table structure.

1.1 Why a Property Graph?

Archaeological and specimen data inherently forms a network:

  • Samples are collected at sites during events
  • Sites have geographic locations
  • Samples have material types from controlled vocabularies
  • People (agents) have various roles in the collection process

This interconnected nature makes a graph model ideal for representing the complex relationships while maintaining query performance.
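As a minimal sketch of the single-table idea (with hypothetical row values, not actual dataset rows), entity rows and edge rows share one schema and are distinguished only by `otype`:

```javascript
// Hypothetical rows illustrating the single-table property graph model:
// entities and edges live in the same table, distinguished by otype.
const rows = [
  { row_id: 1, otype: "MaterialSampleRecord", label: "Pottery sherd" },
  { row_id: 2, otype: "SamplingEvent", label: "2019 excavation" },
  { row_id: 3, otype: "GeospatialCoordLocation", latitude: 37.66, longitude: 32.83 },
  // Edge rows: s (subject row_id), p (predicate), o (array of object row_ids)
  { row_id: 4, otype: "_edge_", s: 1, p: "produced_by", o: [2] },
  { row_id: 5, otype: "_edge_", s: 2, p: "sample_location", o: [3] },
];

// Split the table into its two roles, as a graph engine would.
const edges = rows.filter(r => r.otype === "_edge_");
const entities = rows.filter(r => r.otype !== "_edge_");
console.log(entities.length, "entities;", edges.length, "edges"); // 3 entities; 2 edges
```

The same `otype` filter is what every query in this document starts with.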

2 Setup

Code
db = {
    // parquet_path is assumed to be defined elsewhere in the notebook.
    const instance = await DuckDBClient.of();
    await instance.query(`CREATE VIEW nodes AS SELECT * FROM read_parquet('${parquet_path}')`);
    return instance;
}

// Helper function for loading data with visual feedback
async function loadData(query, params=[], waiting_id=null) {
    const waiter = document.getElementById(waiting_id);
    if (waiter) {
        waiter.hidden = false;
    }
    try {
        const _results = await db.query(query, params); // DuckDBClient.query takes the params array directly, not spread arguments
        return _results;
    } catch (error) {
        if (waiter) {
            waiter.innerHTML = `<pre>${error}</pre>`;
        }
        return null;
    } finally {
        if (waiter) {
            waiter.hidden = true;
        }
    }
}

3 Data Model Deep Dive

3.1 Entity Types in the Dataset

The parquet file contains 7 distinct object types (otype), each serving a specific purpose in the archaeological data model:

Code
entityTypeDescriptions = {
    return [
        {otype: "_edge_", purpose: "Relationships between entities", icon: "🔗"},
        {otype: "MaterialSampleRecord", purpose: "Physical samples/specimens", icon: "🪨"},
        {otype: "SamplingEvent", purpose: "When/how samples were collected", icon: "📅"},
        {otype: "GeospatialCoordLocation", purpose: "Geographic coordinates", icon: "📍"},
        {otype: "SamplingSite", purpose: "Archaeological sites/dig locations", icon: "🏛️"},
        {otype: "IdentifiedConcept", purpose: "Controlled vocabulary terms", icon: "📚"},
        {otype: "Agent", purpose: "People and organizations", icon: "👤"}
    ];
}

viewof entityTypeTable = Inputs.table(entityTypeDescriptions, {
    header: {
        otype: "Entity Type",
        purpose: "Purpose",
        icon: "Icon"
    }
})

3.2 Entity Distribution

Code
entityStats = {
    const query = `
        SELECT
            otype,
            COUNT(*) as count,
            ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
        FROM nodes
        GROUP BY otype
        ORDER BY count DESC
    `;
    const data = await loadData(query, [], "loading_entity_stats");
    return data;
}
Code
viewof entityStatsTable = Inputs.table(entityStats, {
    header: {
        otype: "Entity Type",
        count: "Count",
        percentage: "Percentage"
    },
    format: {
        count: d => d.toLocaleString(),
        percentage: d => d + "%"
    }
})


3.3 How Entities Connect: The Edge Model

Edges use a triple structure inspired by RDF:

  • Subject (s): The source entity’s row_id
  • Predicate (p): The relationship type
  • Object (o): Array of target entity row_ids

This allows representing both simple (1:1) and complex (1:many) relationships efficiently.
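A small JavaScript sketch (hypothetical row_ids and predicate) shows how one edge row with an o-array encodes a 1:many relationship, and how it flattens into RDF-style triples:

```javascript
// One hypothetical edge row: a subject related to three targets (1:many).
// "has_member" is an illustrative predicate, not necessarily in the dataset.
const edge = { otype: "_edge_", s: 10, p: "has_member", o: [21, 22, 23] };

// Flatten the o-array into individual (subject, predicate, object) triples.
const triples = edge.o.map(target => [edge.s, edge.p, target]);
console.log(triples);
// [[10, "has_member", 21], [10, "has_member", 22], [10, "has_member", 23]]
```

A 1:1 relationship is simply the degenerate case where `o` holds a single row_id, which is why the queries below index `o[1]` (DuckDB lists are 1-indexed).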

Code
// Visualize common relationship patterns
relationshipPatterns = {
    const query = `
        SELECT
            p as relationship,
            COUNT(*) as usage_count,
            COUNT(DISTINCT s) as unique_subjects
        FROM nodes
        WHERE otype = '_edge_'
          AND p IS NOT NULL
        GROUP BY p
        ORDER BY usage_count DESC
        LIMIT 15
    `;
    const data = await loadData(query, [], "loading_relationships");
    return data;
}
Code
viewof relationshipTable = Inputs.table(relationshipPatterns, {
    header: {
        relationship: "Relationship Type",
        usage_count: "Total Uses",
        unique_subjects: "Unique Subjects"
    },
    format: {
        usage_count: d => d.toLocaleString(),
        unique_subjects: d => d.toLocaleString()
    }
})

4 🚨 Critical Discovery: Correct Relationship Paths

Before you query this data, understand this key insight:

❌ Common Mistake: Assuming a direct Sample → Location relationship
✅ Reality: All location queries require multi-hop traversal through SamplingEvent

4.1 The Correct Paths Discovered

Path 1: Direct Event Location

MaterialSampleRecord → produced_by → SamplingEvent → sample_location → GeospatialCoordLocation

Path 2: Via Site Location

MaterialSampleRecord → produced_by → SamplingEvent → sampling_site → SamplingSite → site_location → GeospatialCoordLocation

This discovery unlocked 1,096,274 samples that were previously inaccessible due to incorrect query patterns!
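Both paths amount to following a chain of predicates through the edge table. A toy in-memory version (hypothetical row_ids; the real queries run in DuckDB) makes the traversal explicit:

```javascript
// Toy graph mirroring the two corrected paths, with hypothetical row_ids.
const nodes = [
  { row_id: 1, otype: "MaterialSampleRecord", label: "Sample A" },
  { row_id: 2, otype: "SamplingEvent" },
  { row_id: 3, otype: "GeospatialCoordLocation", latitude: 37.66, longitude: 32.83 },
  { row_id: 4, otype: "SamplingSite", label: "Çatalhöyük" },
  { row_id: 5, otype: "GeospatialCoordLocation", latitude: 37.67, longitude: 32.82 },
  { row_id: 6, otype: "_edge_", s: 1, p: "produced_by", o: [2] },
  { row_id: 7, otype: "_edge_", s: 2, p: "sample_location", o: [3] },
  { row_id: 8, otype: "_edge_", s: 2, p: "sampling_site", o: [4] },
  { row_id: 9, otype: "_edge_", s: 4, p: "site_location", o: [5] },
];

// Follow a chain of predicates from a starting row_id; returns target rows.
function traverse(nodes, startId, predicates) {
  let frontier = [startId];
  for (const p of predicates) {
    frontier = nodes
      .filter(r => r.otype === "_edge_" && frontier.includes(r.s) && r.p === p)
      .flatMap(r => r.o);
  }
  return nodes.filter(r => frontier.includes(r.row_id));
}

// Path 1: direct event location
const [direct] = traverse(nodes, 1, ["produced_by", "sample_location"]);
// Path 2: via the site
const [viaSite] = traverse(nodes, 1, ["produced_by", "sampling_site", "site_location"]);
console.log(direct.latitude, viaSite.latitude); // 37.66 37.67
```

Each SQL `JOIN nodes e ON ... AND e.p = '...'` in the queries below is one iteration of this loop.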

5 Working with the Graph: Query Patterns

5.1 Finding Samples with Locations (CORRECTED)

The most common need is connecting samples to their geographic coordinates. This requires traversing the graph through edges:

Code
// Example: Get samples with direct location assignments (CORRECTED)
// Path: Sample -> produced_by -> SamplingEvent -> sample_location -> GeospatialCoordLocation
sampleLocationExample = {
    const query = `
        WITH sample_locations AS (
            SELECT
                s.pid as sample_id,
                s.label as sample_label,
                g.latitude,
                g.longitude,
                'direct_event_location' as location_relationship
            FROM nodes s
            JOIN nodes e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
            JOIN nodes event ON e1.o[1] = event.row_id
            JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
            JOIN nodes g ON e2.o[1] = g.row_id
            WHERE s.otype = 'MaterialSampleRecord'
              AND event.otype = 'SamplingEvent'
              AND g.otype = 'GeospatialCoordLocation'
              AND g.latitude IS NOT NULL
            LIMIT 5
        )
        SELECT * FROM sample_locations
    `;
    const data = await loadData(query, [], "loading_sample_loc_example");
    return data;
}
Code
viewof sampleLocationTable = Inputs.table(sampleLocationExample, {
    layout: "auto"
})

5.2 ⚠️ Why Previous Queries Failed

Many existing examples tried this incorrect pattern:

-- ❌ BROKEN: This relationship doesn't exist!
FROM MaterialSampleRecord s
JOIN edge e ON s.row_id = e.s AND e.p = 'sample_location'
JOIN GeospatialCoordLocation g ON e.o[1] = g.row_id

Result: 0 samples found

The correct pattern requires going through SamplingEvent:

-- ✅ CORRECT: Multi-hop traversal
FROM MaterialSampleRecord s
JOIN edge e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
JOIN SamplingEvent event ON e1.o[1] = event.row_id
JOIN edge e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
JOIN GeospatialCoordLocation g ON e2.o[1] = g.row_id

Result: 1,096,274 samples found!

5.3 Multi-Hop Traversal: Sample → Event → Site → Location

Many samples don’t have direct coordinates but are linked through their collection event and site:

Code
// Trace the full chain from sample to site location
siteChainExample = {
    const query = `
        SELECT
            samp.pid as sample_id,
            event.pid as event_id,
            site.label as site_name,
            loc.latitude,
            loc.longitude
        FROM nodes samp
        JOIN nodes e1 ON samp.row_id = e1.s AND e1.p = 'produced_by'
        JOIN nodes event ON e1.o[1] = event.row_id
        JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
        JOIN nodes site ON e2.o[1] = site.row_id
        JOIN nodes e3 ON site.row_id = e3.s AND e3.p = 'site_location'
        JOIN nodes loc ON e3.o[1] = loc.row_id
        WHERE samp.otype = 'MaterialSampleRecord'
          AND event.otype = 'SamplingEvent'
          AND site.otype = 'SamplingSite'
          AND loc.otype = 'GeospatialCoordLocation'
        LIMIT 5
    `;
    const data = await loadData(query, [], "loading_chain_example");
    return data;
}
Code
viewof siteChainTable = Inputs.table(siteChainExample, {
    layout: "auto",
    width: {
        sample_id: 150,
        event_id: 150,
        site_name: 200
    }
})

6 Site Analysis

6.1 Top Archaeological Sites by Sample Count

Code
topSites = {
    const query = `
        WITH site_samples AS (
            SELECT
                site.label as site_name,
                site.pid as site_id,
                COUNT(DISTINCT samp.row_id) as sample_count
            FROM nodes samp
            JOIN nodes e1 ON samp.row_id = e1.s AND e1.p = 'produced_by'
            JOIN nodes event ON e1.o[1] = event.row_id
            JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
            JOIN nodes site ON e2.o[1] = site.row_id
            WHERE samp.otype = 'MaterialSampleRecord'
              AND event.otype = 'SamplingEvent'
              AND site.otype = 'SamplingSite'
            GROUP BY site.label, site.pid
        )
        SELECT * FROM site_samples
        ORDER BY sample_count DESC
        LIMIT 20
    `;
    const data = await loadData(query, [], "loading_top_sites");
    return data;
}
Code
viewof topSitesTable = Inputs.table(topSites, {
    header: {
        site_name: "Site Name",
        site_id: "Site ID",
        sample_count: "Sample Count"
    },
    format: {
        sample_count: d => d.toLocaleString()
    }
})

7 Material Analysis

7.1 Material Type Distribution

Understanding what types of materials are found across the dataset:

Code
materialTypes = {
    const query = `
        SELECT
            mat.label as material_type,
            mat.name as category,
            COUNT(DISTINCT samp.row_id) as sample_count
        FROM nodes samp
        JOIN nodes e ON samp.row_id = e.s AND e.p = 'has_material_category'
        JOIN nodes mat ON e.o[1] = mat.row_id
        WHERE samp.otype = 'MaterialSampleRecord'
          AND e.otype = '_edge_'
          AND mat.otype = 'IdentifiedConcept'
        GROUP BY mat.label, mat.name
        ORDER BY sample_count DESC
        LIMIT 30
    `;
    const data = await loadData(query, [], "loading_materials");
    return data;
}
Code
viewof materialTable = Inputs.table(materialTypes, {
    header: {
        material_type: "Material Type",
        category: "Category",
        sample_count: "Sample Count"
    },
    format: {
        sample_count: d => d.toLocaleString()
    }
})

8 Spatial Distribution

8.1 Geographic Coverage

Code
spatialStats = {
    const query = `
        WITH coord_stats AS (
            SELECT
                MIN(latitude) as min_lat,
                MAX(latitude) as max_lat,
                MIN(longitude) as min_lon,
                MAX(longitude) as max_lon,
                AVG(latitude) as avg_lat,
                AVG(longitude) as avg_lon,
                COUNT(*) as total_locations,
                COUNT(CASE WHEN obfuscated THEN 1 END) as obfuscated_count
            FROM nodes
            WHERE otype = 'GeospatialCoordLocation'
              AND latitude IS NOT NULL
              AND longitude IS NOT NULL
        )
        SELECT * FROM coord_stats
    `;
    const data = await loadData(query, [], "loading_spatial");
    return data;
}
Code
spatialDisplay = {
    const stats = spatialStats[0];
    return html`<div style="padding: 1rem; background: #f0f9ff; border-radius: 8px;">
        <h4 style="margin-top: 0;">Geographic Coverage</h4>
        <p>Total locations: <strong>${stats.total_locations.toLocaleString()}</strong></p>
        <p>Obfuscated locations: <strong>${stats.obfuscated_count.toLocaleString()}</strong>
           (${(stats.obfuscated_count / stats.total_locations * 100).toFixed(1)}%)</p>
        <p>Latitude range: <strong>${stats.min_lat.toFixed(2)}° to ${stats.max_lat.toFixed(2)}°</strong></p>
        <p>Longitude range: <strong>${stats.min_lon.toFixed(2)}° to ${stats.max_lon.toFixed(2)}°</strong></p>
        <p>Center point: <strong>${stats.avg_lat.toFixed(2)}°, ${stats.avg_lon.toFixed(2)}°</strong></p>
    </div>`;
}

8.2 Handling Sensitive Location Data

Archaeological sites often require location protection:

Code
obfuscationStats = {
    const query = `
        SELECT
            obfuscated,
            COUNT(*) as location_count,
            AVG(CASE WHEN latitude IS NOT NULL THEN 1 ELSE 0 END) * 100 as pct_with_coords
        FROM nodes
        WHERE otype = 'GeospatialCoordLocation'
        GROUP BY obfuscated
    `;
    const data = await loadData(query, [], "loading_obfusc_stats");
    return data;
}
Code
viewof obfuscationTable = Inputs.table(obfuscationStats, {
    header: {
        obfuscated: "Location Protection",
        location_count: "Count",
        pct_with_coords: "% With Coordinates"
    },
    format: {
        obfuscated: d => d ? "🔒 Protected" : "📍 Precise",
        location_count: d => d.toLocaleString(),
        pct_with_coords: d => d.toFixed(1) + "%"
    }
})
Important: Data Usage Note

When visualizing archaeological data, always respect location sensitivity flags. Obfuscated coordinates are intentionally imprecise to protect archaeological sites from looting.
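One way to honor the flag before mapping is to partition points on the `obfuscated` column and style the protected ones as deliberately coarse markers (a sketch with hypothetical sample values; only the column names come from the dataset):

```javascript
// Partition locations by their protection flag before rendering a map layer.
// Coordinate values below are hypothetical.
const locations = [
  { latitude: 37.66, longitude: 32.83, obfuscated: false },
  { latitude: 30.33, longitude: 35.44, obfuscated: true },
];

const precise = locations.filter(l => !l.obfuscated);
const protectedLocs = locations.filter(l => l.obfuscated);

// Render protected points with a large, vague marker; never try to
// "de-obfuscate" them or display them at street-level zoom.
const markers = [
  ...precise.map(l => ({ ...l, radius: 2, note: "precise" })),
  ...protectedLocs.map(l => ({ ...l, radius: 12, note: "approximate (protected)" })),
];
console.log(markers.length); // 2
```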

9 🔍 Debugging Methodology: How We Found the Correct Paths

9.1 Step 1: Verify Relationship Existence

Code
// Debug: What relationships actually exist FROM MaterialSampleRecord?
debugRelationships = {
    const query = `
        SELECT DISTINCT e.p as predicate, COUNT(*) as count
        FROM nodes s
        JOIN nodes e ON s.row_id = e.s
        WHERE s.otype = 'MaterialSampleRecord'
          AND e.otype = '_edge_'
        GROUP BY e.p
        ORDER BY count DESC
    `;
    const data = await loadData(query, [], "loading_debug_rels");
    return data;
}
Code
viewof debugTable = Inputs.table(debugRelationships, {
    header: {
        predicate: "Relationship Type",
        count: "Usage Count"
    },
    format: {
        count: d => d.toLocaleString()
    }
})

Notice: No direct sample_location relationship! This confirms why direct queries failed.

9.2 Step 2: Trace the Path Through SamplingEvent

Code
// Debug: What relationships exist FROM SamplingEvent?
debugEventRelationships = {
    const query = `
        SELECT DISTINCT e.p as predicate, COUNT(*) as count
        FROM nodes s
        JOIN nodes e ON s.row_id = e.s
        WHERE s.otype = 'SamplingEvent'
          AND e.otype = '_edge_'
        GROUP BY e.p
        ORDER BY count DESC
    `;
    const data = await loadData(query, [], "loading_debug_events");
    return data;
}
Code
viewof debugEventTable = Inputs.table(debugEventRelationships, {
    header: {
        predicate: "Event Relationship",
        count: "Count"
    },
    format: {
        count: d => d.toLocaleString()
    }
})

Key Discovery: SamplingEvent has both sample_location AND sampling_site relationships!

9.3 Step 3: Validate the Complete Chain

Code
// Test: How many samples can we locate using the corrected path?
locationValidation = {
    const query = `
        WITH validation_stats AS (
            -- Direct path count
            SELECT 'Direct Event Location' as path_type, COUNT(*) as sample_count
            FROM nodes s
            JOIN nodes e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
            JOIN nodes event ON e1.o[1] = event.row_id
            JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
            JOIN nodes g ON e2.o[1] = g.row_id
            WHERE s.otype = 'MaterialSampleRecord'
              AND event.otype = 'SamplingEvent'
              AND g.otype = 'GeospatialCoordLocation'
              AND g.latitude IS NOT NULL

            UNION ALL

            -- Site path count
            SELECT 'Via Site Location' as path_type, COUNT(*) as sample_count
            FROM nodes s
            JOIN nodes e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
            JOIN nodes event ON e1.o[1] = event.row_id
            JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
            JOIN nodes site ON e2.o[1] = site.row_id
            JOIN nodes e3 ON site.row_id = e3.s AND e3.p = 'site_location'
            JOIN nodes g ON e3.o[1] = g.row_id
            WHERE s.otype = 'MaterialSampleRecord'
              AND event.otype = 'SamplingEvent'
              AND site.otype = 'SamplingSite'
              AND g.otype = 'GeospatialCoordLocation'
              AND g.latitude IS NOT NULL
        )
        SELECT * FROM validation_stats
    `;
    const data = await loadData(query, [], "loading_validation");
    return data;
}
Code
viewof validationTable = Inputs.table(locationValidation, {
    header: {
        path_type: "Query Path",
        sample_count: "Samples Found"
    },
    format: {
        sample_count: d => d.toLocaleString()
    }
})

🎉 Success! Both paths yield over 1M samples each.

9.4 Debugging Lessons Learned

  1. Never assume direct relationships exist - always verify the graph structure first
  2. Trace step-by-step - build from simple entity counts to complex joins
  3. Test multiple paths - property graphs often have alternative routes
  4. Validate results - sanity check your numbers against known entity counts

10 Performance & Optimization Strategies

10.1 Query Performance Guidelines

When working with this 11.6M row dataset:

  1. Filter Early: Always apply otype filters first

    -- Good: Reduces to ~1M rows immediately
    WHERE otype = 'MaterialSampleRecord'
    
    -- Avoid: Scans all 11M rows
    WHERE label LIKE '%pottery%'
  2. Use Views for Complex Patterns: Pre-compute common joins

    CREATE VIEW samples_with_coords AS
    SELECT ... -- complex join query
  3. Leverage DuckDB’s Columnar Format: Aggregate before detailed analysis

10.2 Data Loading Strategies

For web applications:

Code
// Progressive loading pattern for large datasets
progressiveLoadExample = {
    // Start with aggregated overview
    const overview = await db.query(`
        SELECT
            ROUND(latitude/10)*10 as lat_bucket,
            ROUND(longitude/10)*10 as lon_bucket,
            COUNT(*) as point_count
        FROM nodes
        WHERE otype = 'GeospatialCoordLocation'
          AND latitude IS NOT NULL
        GROUP BY lat_bucket, lon_bucket
    `);

    return {
        strategy: "Progressive Loading",
        initial_points: overview.length,
        full_dataset: 198433,
        reduction_factor: Math.round(198433 / overview.length)
    };
}
Code
loadStrategyDisplay = {
    const stats = progressiveLoadExample; // cell references resolve without await
    return html`<div style="padding: 1rem; background: #e0f2fe; border-radius: 8px;">
        <h4 style="margin-top: 0;">Loading Strategy Impact</h4>
        <p>Initial load: <strong>${stats.initial_points.toLocaleString()}</strong> aggregated points</p>
        <p>Full dataset: <strong>${stats.full_dataset.toLocaleString()}</strong> individual locations</p>
        <p>Reduction factor: <strong>${stats.reduction_factor}x</strong> faster initial load</p>
    </div>`;
}

11 Data Quality Metrics

Code
dataQuality = {
    const query = `
        WITH quality_checks AS (
            SELECT
                'Total Rows' as metric,
                COUNT(*) as value
            FROM nodes

            UNION ALL

            SELECT
                'Unique PIDs' as metric,
                COUNT(DISTINCT pid) as value
            FROM nodes

            UNION ALL

            SELECT
                'Samples with Event Location' as metric,
                COUNT(DISTINCT s.row_id) as value
            FROM nodes s
            -- Corrected path: sample -> produced_by -> event -> sample_location
            JOIN nodes e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
            JOIN nodes e2 ON e1.o[1] = e2.s AND e2.p = 'sample_location'
            WHERE s.otype = 'MaterialSampleRecord'

            UNION ALL

            SELECT
                'Samples with Site Location' as metric,
                COUNT(DISTINCT s.row_id) as value
            FROM nodes s
            -- Corrected path: sample -> produced_by -> event -> sampling_site -> site -> site_location
            JOIN nodes e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
            JOIN nodes e2 ON e1.o[1] = e2.s AND e2.p = 'sampling_site'
            JOIN nodes e3 ON e2.o[1] = e3.s AND e3.p = 'site_location'
            WHERE s.otype = 'MaterialSampleRecord'
        )
        SELECT * FROM quality_checks
    `;
    const data = await loadData(query, [], "loading_quality");
    return data;
}
Code
viewof qualityTable = Inputs.table(dataQuality, {
    header: {
        metric: "Quality Metric",
        value: "Count"
    },
    format: {
        value: d => d.toLocaleString()
    }
})

12 Archaeological Data Insights

12.1 Top Archaeological Sites by Sample Count

Code
topSitesByCount = {
    const query = `
        WITH sample_to_site AS (
            SELECT
                site.label as site_name,
                COUNT(DISTINCT samp.row_id) as sample_count
            FROM nodes samp
            JOIN nodes e1 ON samp.row_id = e1.s AND e1.p = 'produced_by'
            JOIN nodes event ON e1.o[1] = event.row_id
            JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
            JOIN nodes site ON e2.o[1] = site.row_id
            WHERE samp.otype = 'MaterialSampleRecord'
              AND event.otype = 'SamplingEvent'
              AND site.otype = 'SamplingSite'
            GROUP BY site.label
        )
        SELECT * FROM sample_to_site
        ORDER BY sample_count DESC
        LIMIT 10
    `;
    const data = await loadData(query, [], "loading_top_sites_by_count");
    return data;
}
Code
viewof topSitesByCountTable = Inputs.table(topSitesByCount, {
    header: {
        site_name: "Archaeological Site",
        sample_count: "Sample Count"
    },
    format: {
        sample_count: d => d.toLocaleString()
    }
})

12.2 Material Type Distribution

Code
materialDistribution = {
    const query = `
        SELECT
            mat.label as material_type,
            COUNT(DISTINCT samp.row_id) as sample_count
        FROM nodes samp
        JOIN nodes e ON samp.row_id = e.s AND e.p = 'has_material_category'
        JOIN nodes mat ON e.o[1] = mat.row_id
        WHERE samp.otype = 'MaterialSampleRecord'
          AND e.otype = '_edge_'
          AND mat.otype = 'IdentifiedConcept'
        GROUP BY mat.label
        ORDER BY sample_count DESC
        LIMIT 10
    `;
    const data = await loadData(query, [], "loading_material_distribution");
    return data;
}
Code
viewof materialDistributionTable = Inputs.table(materialDistribution, {
    header: {
        material_type: "Material Type",
        sample_count: "Sample Count"
    },
    format: {
        sample_count: d => d.toLocaleString()
    }
})

Key Insights:

  • Çatalhöyük leads with 145,900+ samples - one of the world’s largest Neolithic sites
  • Biogenic non-organic materials dominate (bones, shells), reflecting archaeological preservation
  • Global coverage spans from the Arctic (Finnmark) to temperate zones

13 Summary: Key Lessons for Querying OpenContext Parquet

13.1 🎯 Essential Discoveries

  1. Critical Bug Fix: Direct Sample→Location queries don’t work

    • Problem: Returned 0 results from 1M+ sample dataset
    • Solution: Always traverse through SamplingEvent
    • Impact: Unlocked access to 1,096,274 located samples
  2. Correct Relationship Paths:

    ✅ Sample → produced_by → SamplingEvent → sample_location → Location
    ✅ Sample → produced_by → SamplingEvent → sampling_site → Site → site_location → Location
  3. Property Graph Structure:

    • 79% edges, 21% entities in 11.6M rows
    • Multi-hop traversal required for meaningful queries
    • No shortcuts exist - respect the graph model

13.2 🔧 Debugging Methodology

  1. Verify relationships exist before building complex queries
  2. Trace step-by-step from simple counts to complex joins
  3. Test multiple paths - graphs often have alternative routes
  4. Validate results against known entity counts

13.3 ⚡ Performance Guidelines

  1. Filter by otype first - reduces 11M rows to manageable subsets
  2. Use CTEs for complex multi-hop queries
  3. Aggregate before filtering when possible
  4. Respect obfuscated coordinates for site protection

13.4 🏛️ Archaeological Context

  • Major sites: Çatalhöyük, Petra, Polis Chrysochous dominate sample counts
  • Material types: Biogenic non-organic materials most common
  • Global reach: Arctic to Antarctic coverage with sensitive location protection
  • Research value: 1M+ precisely located specimens for spatial analysis

13.5 🚀 Advanced Applications

This corrected understanding enables:

  • Spatial clustering analysis of archaeological finds
  • Temporal pattern recognition through sampling events
  • Site similarity studies via material type distributions
  • Collection bias analysis through agent and responsibility networks

The key to success: Understand the graph model first, query second. This property graph structure reflects the real-world complexity of archaeological data collection and enables sophisticated analysis when queried correctly.

14 Next Steps

Ready to analyze this data? Remember:

  1. Start with entity relationship exploration
  2. Build queries incrementally
  3. Validate results at each step
  4. Respect archaeological site sensitivities

Happy querying! 🏺