OpenContext Parquet Data Analysis - Enhanced Edition
parquet
spatial
property-graph
This document provides an enhanced analysis of the OpenContext iSamples parquet file, demonstrating the property graph structure and how to work with archaeological specimen data.
1 Understanding the Property Graph Structure
The OpenContext iSamples parquet file implements a sophisticated property graph model that combines the flexibility of graph databases with the analytical performance of columnar storage. Unlike traditional relational databases or pure graph databases, this approach stores both entities (nodes) and relationships (edges) in a single table structure.
1.1 Why a Property Graph?
Archaeological and specimen data inherently forms a network:
- Samples are collected at sites during events
- Sites have geographic locations
- Samples have material types from controlled vocabularies
- People (agents) have various roles in the collection process
This interconnected nature makes a graph model ideal for representing the complex relationships while maintaining query performance.
2 Setup
Code
db = {
const instance = await DuckDBClient.of();
await instance.query(`CREATE VIEW nodes AS SELECT * FROM read_parquet('${parquet_path}')`);
return instance;
}
// Helper function for loading data with visual feedback
async function loadData(query, params=[], waiting_id=null) {
const waiter = document.getElementById(waiting_id);
if (waiter) {
waiter.hidden = false;
}
try {
const _results = await db.query(query, ...params);
return _results;
} catch (error) {
if (waiter) {
waiter.innerHTML = `<pre>${error}</pre>`;
}
return null;
} finally {
if (waiter) {
waiter.hidden = true;
}
}
}3 Data Model Deep Dive
3.1 Entity Types in the Dataset
The parquet file contains 7 distinct object types (otype), each serving a specific purpose in the archaeological data model:
Code
entityTypeDescriptions = {
return [
{otype: "_edge_", purpose: "Relationships between entities", icon: "🔗"},
{otype: "MaterialSampleRecord", purpose: "Physical samples/specimens", icon: "🪨"},
{otype: "SamplingEvent", purpose: "When/how samples were collected", icon: "📅"},
{otype: "GeospatialCoordLocation", purpose: "Geographic coordinates", icon: "📍"},
{otype: "SamplingSite", purpose: "Archaeological sites/dig locations", icon: "🏛️"},
{otype: "IdentifiedConcept", purpose: "Controlled vocabulary terms", icon: "📚"},
{otype: "Agent", purpose: "People and organizations", icon: "👤"}
];
}
viewof entityTypeTable = Inputs.table(entityTypeDescriptions, {
header: {
otype: "Entity Type",
purpose: "Purpose",
icon: "Icon"
}
})3.2 Entity Distribution
Code
entityStats = {
const query = `
SELECT
otype,
COUNT(*) as count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) as percentage
FROM nodes
GROUP BY otype
ORDER BY count DESC
`;
const data = await loadData(query, [], "loading_entity_stats");
return data;
}
Loading entity statistics…
Code
viewof entityStatsTable = Inputs.table(entityStats, {
header: {
otype: "Entity Type",
count: "Count",
percentage: "Percentage"
},
format: {
count: d => d.toLocaleString(),
percentage: d => d + "%"
}
})Total records:
3.3 How Entities Connect: The Edge Model
Edges use a triple structure inspired by RDF:
- Subject (s): The source entity’s
row_id - Predicate (p): The relationship type
- Object (o): Array of target entity
row_ids
This allows representing both simple (1:1) and complex (1:many) relationships efficiently.
Code
// Visualize common relationship patterns
relationshipPatterns = {
const query = `
SELECT
p as relationship,
COUNT(*) as usage_count,
COUNT(DISTINCT s) as unique_subjects
FROM nodes
WHERE otype = '_edge_'
AND p IS NOT NULL
GROUP BY p
ORDER BY usage_count DESC
LIMIT 15
`;
const data = await loadData(query, [], "loading_relationships");
return data;
}
Loading relationship patterns…
Code
viewof relationshipTable = Inputs.table(relationshipPatterns, {
header: {
relationship: "Relationship Type",
usage_count: "Total Uses",
unique_subjects: "Unique Subjects"
},
format: {
usage_count: d => d.toLocaleString(),
unique_subjects: d => d.toLocaleString()
}
})4 🚨 Critical Discovery: Correct Relationship Paths
Before you query this data, understand this key insight:
❌ Common Mistake: Assuming direct Sample → Location relationships ✅ Reality: All location queries require multi-hop traversal through SamplingEvent
4.1 The Correct Paths Discovered
Path 1: Direct Event Location
MaterialSampleRecord → produced_by → SamplingEvent → sample_location → GeospatialCoordLocation
Path 2: Via Site Location
MaterialSampleRecord → produced_by → SamplingEvent → sampling_site → SamplingSite → site_location → GeospatialCoordLocation
This discovery unlocked 1,096,274 samples that were previously inaccessible due to incorrect query patterns!
5 Working with the Graph: Query Patterns
5.1 Finding Samples with Locations (CORRECTED)
The most common need is connecting samples to their geographic coordinates. This requires traversing the graph through edges:
Code
// Example: Get samples with direct location assignments (CORRECTED)
// Path: Sample -> produced_by -> SamplingEvent -> sample_location -> GeospatialCoordLocation
sampleLocationExample = {
const query = `
WITH sample_locations AS (
SELECT
s.pid as sample_id,
s.label as sample_label,
g.latitude,
g.longitude,
'direct_event_location' as location_relationship
FROM nodes s
JOIN nodes e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
JOIN nodes event ON e1.o[1] = event.row_id
JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
JOIN nodes g ON e2.o[1] = g.row_id
WHERE s.otype = 'MaterialSampleRecord'
AND event.otype = 'SamplingEvent'
AND g.otype = 'GeospatialCoordLocation'
AND g.latitude IS NOT NULL
LIMIT 5
)
SELECT * FROM sample_locations
`;
const data = await loadData(query, [], "loading_sample_loc_example");
return data;
}
Loading example…
Code
viewof sampleLocationTable = Inputs.table(sampleLocationExample, {
layout: "auto"
})5.2 ⚠️ Why Previous Queries Failed
Many existing examples tried this incorrect pattern:
-- ❌ BROKEN: This relationship doesn't exist!
FROM MaterialSampleRecord s
JOIN edge e ON s.row_id = e.s AND e.p = 'sample_location'
JOIN GeospatialCoordLocation g ON e.o[1] = g.row_idResult: 0 samples found
The correct pattern requires going through SamplingEvent:
-- ✅ CORRECT: Multi-hop traversal
FROM MaterialSampleRecord s
JOIN edge e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
JOIN SamplingEvent event ON e1.o[1] = event.row_id
JOIN edge e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
JOIN GeospatialCoordLocation g ON e2.o[1] = g.row_idResult: 1,096,274 samples found!
5.3 Multi-Hop Traversal: Sample → Event → Site → Location
Many samples don’t have direct coordinates but are linked through their collection event and site:
Code
// Trace the full chain from sample to site location
siteChainExample = {
const query = `
SELECT
samp.pid as sample_id,
event.pid as event_id,
site.label as site_name,
loc.latitude,
loc.longitude
FROM nodes samp
JOIN nodes e1 ON samp.row_id = e1.s AND e1.p = 'produced_by'
JOIN nodes event ON e1.o[1] = event.row_id
JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
JOIN nodes site ON e2.o[1] = site.row_id
JOIN nodes e3 ON site.row_id = e3.s AND e3.p = 'site_location'
JOIN nodes loc ON e3.o[1] = loc.row_id
WHERE samp.otype = 'MaterialSampleRecord'
AND event.otype = 'SamplingEvent'
AND site.otype = 'SamplingSite'
AND loc.otype = 'GeospatialCoordLocation'
LIMIT 5
`;
const data = await loadData(query, [], "loading_chain_example");
return data;
}
Loading traversal example…
Code
viewof siteChainTable = Inputs.table(siteChainExample, {
layout: "auto",
width: {
sample_id: 150,
event_id: 150,
site_name: 200
}
})6 Site Analysis
6.1 Top Archaeological Sites by Sample Count
Code
topSites = {
const query = `
WITH site_samples AS (
SELECT
site.label as site_name,
site.pid as site_id,
COUNT(DISTINCT samp.row_id) as sample_count
FROM nodes samp
JOIN nodes e1 ON samp.row_id = e1.s AND e1.p = 'produced_by'
JOIN nodes event ON e1.o[1] = event.row_id
JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
JOIN nodes site ON e2.o[1] = site.row_id
WHERE samp.otype = 'MaterialSampleRecord'
AND event.otype = 'SamplingEvent'
AND site.otype = 'SamplingSite'
GROUP BY site.label, site.pid
)
SELECT * FROM site_samples
ORDER BY sample_count DESC
LIMIT 20
`;
const data = await loadData(query, [], "loading_top_sites");
return data;
}
Loading site statistics…
Code
viewof topSitesTable = Inputs.table(topSites, {
header: {
site_name: "Site Name",
site_id: "Site ID",
sample_count: "Sample Count"
},
format: {
sample_count: d => d.toLocaleString()
}
})7 Material Analysis
7.1 Material Type Distribution
Understanding what types of materials are found across the dataset:
Code
materialTypes = {
const query = `
SELECT
mat.label as material_type,
mat.name as category,
COUNT(DISTINCT samp.row_id) as sample_count
FROM nodes samp
JOIN nodes e ON samp.row_id = e.s AND e.p = 'has_material_category'
JOIN nodes mat ON e.o[1] = mat.row_id
WHERE samp.otype = 'MaterialSampleRecord'
AND e.otype = '_edge_'
AND mat.otype = 'IdentifiedConcept'
GROUP BY mat.label, mat.name
ORDER BY sample_count DESC
LIMIT 30
`;
const data = await loadData(query, [], "loading_materials");
return data;
}
Analyzing materials…
Code
viewof materialTable = Inputs.table(materialTypes, {
header: {
material_type: "Material Type",
category: "Category",
sample_count: "Sample Count"
},
format: {
sample_count: d => d.toLocaleString()
}
})8 Spatial Distribution
8.1 Geographic Coverage
Code
spatialStats = {
const query = `
WITH coord_stats AS (
SELECT
MIN(latitude) as min_lat,
MAX(latitude) as max_lat,
MIN(longitude) as min_lon,
MAX(longitude) as max_lon,
AVG(latitude) as avg_lat,
AVG(longitude) as avg_lon,
COUNT(*) as total_locations,
COUNT(CASE WHEN obfuscated THEN 1 END) as obfuscated_count
FROM nodes
WHERE otype = 'GeospatialCoordLocation'
AND latitude IS NOT NULL
AND longitude IS NOT NULL
)
SELECT * FROM coord_stats
`;
const data = await loadData(query, [], "loading_spatial");
return data;
}
Loading spatial statistics…
Code
viewof spatialDisplay = {
const stats = spatialStats[0];
return html`<div style="padding: 1rem; background: #f0f9ff; border-radius: 8px;">
<h4 style="margin-top: 0;">Geographic Coverage</h4>
<p>Total locations: <strong>${stats.total_locations.toLocaleString()}</strong></p>
<p>Obfuscated locations: <strong>${stats.obfuscated_count.toLocaleString()}</strong>
(${(stats.obfuscated_count / stats.total_locations * 100).toFixed(1)}%)</p>
<p>Latitude range: <strong>${stats.min_lat.toFixed(2)}° to ${stats.max_lat.toFixed(2)}°</strong></p>
<p>Longitude range: <strong>${stats.min_lon.toFixed(2)}° to ${stats.max_lon.toFixed(2)}°</strong></p>
<p>Center point: <strong>${stats.avg_lat.toFixed(2)}°, ${stats.avg_lon.toFixed(2)}°</strong></p>
</div>`;
}8.2 Handling Sensitive Location Data
Archaeological sites often require location protection:
Code
obfuscationStats = {
const query = `
SELECT
obfuscated,
COUNT(*) as location_count,
AVG(CASE WHEN latitude IS NOT NULL THEN 1 ELSE 0 END) * 100 as pct_with_coords
FROM nodes
WHERE otype = 'GeospatialCoordLocation'
GROUP BY obfuscated
`;
const data = await loadData(query, [], "loading_obfusc_stats");
return data;
}
Analyzing location sensitivity…
Code
viewof obfuscationTable = Inputs.table(obfuscationStats, {
header: {
obfuscated: "Location Protection",
location_count: "Count",
pct_with_coords: "% With Coordinates"
},
format: {
obfuscated: d => d ? "🔒 Protected" : "📍 Precise",
location_count: d => d.toLocaleString(),
pct_with_coords: d => d.toFixed(1) + "%"
}
})
ImportantData Usage Note
When visualizing archaeological data, always respect location sensitivity flags. Obfuscated coordinates are intentionally imprecise to protect archaeological sites from looting.
9 🔍 Debugging Methodology: How We Found the Correct Paths
9.1 Step 1: Verify Relationship Existence
Code
// Debug: What relationships actually exist FROM MaterialSampleRecord?
debugRelationships = {
const query = `
SELECT DISTINCT e.p as predicate, COUNT(*) as count
FROM nodes s
JOIN nodes e ON s.row_id = e.s
WHERE s.otype = 'MaterialSampleRecord'
AND e.otype = '_edge_'
GROUP BY e.p
ORDER BY count DESC
`;
const data = await loadData(query, [], "loading_debug_rels");
return data;
}
Debugging relationships…
Code
viewof debugTable = Inputs.table(debugRelationships, {
header: {
predicate: "Relationship Type",
count: "Usage Count"
},
format: {
count: d => d.toLocaleString()
}
})Notice: No direct sample_location relationship! This confirms why direct queries failed.
9.2 Step 2: Trace the Path Through SamplingEvent
Code
// Debug: What relationships exist FROM SamplingEvent?
debugEventRelationships = {
const query = `
SELECT DISTINCT e.p as predicate, COUNT(*) as count
FROM nodes s
JOIN nodes e ON s.row_id = e.s
WHERE s.otype = 'SamplingEvent'
AND e.otype = '_edge_'
GROUP BY e.p
ORDER BY count DESC
`;
const data = await loadData(query, [], "loading_debug_events");
return data;
}
Debugging event relationships…
Code
viewof debugEventTable = Inputs.table(debugEventRelationships, {
header: {
predicate: "Event Relationship",
count: "Count"
},
format: {
count: d => d.toLocaleString()
}
})Key Discovery: SamplingEvent has both sample_location AND sampling_site relationships!
9.3 Step 3: Validate the Complete Chain
Code
// Test: How many samples can we locate using the corrected path?
locationValidation = {
const query = `
WITH validation_stats AS (
-- Direct path count
SELECT 'Direct Event Location' as path_type, COUNT(*) as sample_count
FROM nodes s
JOIN nodes e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
JOIN nodes event ON e1.o[1] = event.row_id
JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sample_location'
JOIN nodes g ON e2.o[1] = g.row_id
WHERE s.otype = 'MaterialSampleRecord'
AND event.otype = 'SamplingEvent'
AND g.otype = 'GeospatialCoordLocation'
AND g.latitude IS NOT NULL
UNION ALL
-- Site path count
SELECT 'Via Site Location' as path_type, COUNT(*) as sample_count
FROM nodes s
JOIN nodes e1 ON s.row_id = e1.s AND e1.p = 'produced_by'
JOIN nodes event ON e1.o[1] = event.row_id
JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
JOIN nodes site ON e2.o[1] = site.row_id
JOIN nodes e3 ON site.row_id = e3.s AND e3.p = 'site_location'
JOIN nodes g ON e3.o[1] = g.row_id
WHERE s.otype = 'MaterialSampleRecord'
AND event.otype = 'SamplingEvent'
AND site.otype = 'SamplingSite'
AND g.otype = 'GeospatialCoordLocation'
AND g.latitude IS NOT NULL
)
SELECT * FROM validation_stats
`;
const data = await loadData(query, [], "loading_validation");
return data;
}
Validating corrected paths…
Code
viewof validationTable = Inputs.table(locationValidation, {
header: {
path_type: "Query Path",
sample_count: "Samples Found"
},
format: {
sample_count: d => d.toLocaleString()
}
})🎉 Success! Both paths yield over 1M samples each.
9.4 Debugging Lessons Learned
- Never assume direct relationships exist - always verify the graph structure first
- Trace step-by-step - build from simple entity counts to complex joins
- Test multiple paths - property graphs often have alternative routes
- Validate results - sanity check your numbers against known entity counts
10 Performance & Optimization Strategies
10.1 Query Performance Guidelines
When working with this 11.6M row dataset:
Filter Early: Always apply
otypefilters first-- Good: Reduces to ~1M rows immediately WHERE otype = 'MaterialSampleRecord' -- Avoid: Scans all 11M rows WHERE label LIKE '%pottery%'Use Views for Complex Patterns: Pre-compute common joins
CREATE VIEW samples_with_coords AS SELECT ... -- complex join queryLeverage DuckDB’s Columnar Format: Aggregate before detailed analysis
10.2 Data Loading Strategies
For web applications:
Code
// Progressive loading pattern for large datasets
progressiveLoadExample = {
// Start with aggregated overview
const overview = await db.query(`
SELECT
ROUND(latitude/10)*10 as lat_bucket,
ROUND(longitude/10)*10 as lon_bucket,
COUNT(*) as point_count
FROM nodes
WHERE otype = 'GeospatialCoordLocation'
AND latitude IS NOT NULL
GROUP BY lat_bucket, lon_bucket
`);
return {
strategy: "Progressive Loading",
initial_points: overview.length,
full_dataset: 198433,
reduction_factor: Math.round(198433 / overview.length)
};
}Code
viewof loadStrategyDisplay = {
const stats = await progressiveLoadExample;
return html`<div style="padding: 1rem; background: #e0f2fe; border-radius: 8px;">
<h4 style="margin-top: 0;">Loading Strategy Impact</h4>
<p>Initial load: <strong>${stats.initial_points.toLocaleString()}</strong> aggregated points</p>
<p>Full dataset: <strong>${stats.full_dataset.toLocaleString()}</strong> individual locations</p>
<p>Reduction factor: <strong>${stats.reduction_factor}x</strong> faster initial load</p>
</div>`;
}11 Data Quality Metrics
Code
dataQuality = {
const query = `
WITH quality_checks AS (
SELECT
'Total Rows' as metric,
COUNT(*) as value
FROM nodes
UNION ALL
SELECT
'Unique PIDs' as metric,
COUNT(DISTINCT pid) as value
FROM nodes
UNION ALL
SELECT
'Samples with Direct Location' as metric,
COUNT(DISTINCT s.row_id) as value
FROM nodes s
JOIN nodes e ON s.row_id = e.s AND e.p = 'sample_location'
WHERE s.otype = 'MaterialSampleRecord'
UNION ALL
SELECT
'Samples with Site Location' as metric,
COUNT(DISTINCT s.row_id) as value
FROM nodes s
JOIN nodes e ON s.row_id = e.s AND e.p = 'produced_by'
WHERE s.otype = 'MaterialSampleRecord'
)
SELECT * FROM quality_checks
`;
const data = await loadData(query, [], "loading_quality");
return data;
}
Checking data quality…
Code
viewof qualityTable = Inputs.table(dataQuality, {
header: {
metric: "Quality Metric",
value: "Count"
},
format: {
value: d => d.toLocaleString()
}
})12 Archaeological Data Insights
12.1 Top Archaeological Sites by Sample Count
Code
topSitesByCount = {
const query = `
WITH sample_to_site AS (
SELECT
site.label as site_name,
COUNT(DISTINCT samp.row_id) as sample_count
FROM nodes samp
JOIN nodes e1 ON samp.row_id = e1.s AND e1.p = 'produced_by'
JOIN nodes event ON e1.o[1] = event.row_id
JOIN nodes e2 ON event.row_id = e2.s AND e2.p = 'sampling_site'
JOIN nodes site ON e2.o[1] = site.row_id
WHERE samp.otype = 'MaterialSampleRecord'
AND event.otype = 'SamplingEvent'
AND site.otype = 'SamplingSite'
GROUP BY site.label
)
SELECT * FROM sample_to_site
ORDER BY sample_count DESC
LIMIT 10
`;
const data = await loadData(query, [], "loading_top_sites");
return data;
}
Loading top archaeological sites…
Code
viewof topSitesTable = Inputs.table(topSitesByCount, {
header: {
site_name: "Archaeological Site",
sample_count: "Sample Count"
},
format: {
sample_count: d => d.toLocaleString()
}
})12.2 Material Type Distribution
Code
materialDistribution = {
const query = `
SELECT
mat.label as material_type,
COUNT(DISTINCT samp.row_id) as sample_count
FROM nodes samp
JOIN nodes e ON samp.row_id = e.s AND e.p = 'has_material_category'
JOIN nodes mat ON e.o[1] = mat.row_id
WHERE samp.otype = 'MaterialSampleRecord'
AND e.otype = '_edge_'
AND mat.otype = 'IdentifiedConcept'
GROUP BY mat.label
ORDER BY sample_count DESC
LIMIT 10
`;
const data = await loadData(query, [], "loading_materials");
return data;
}
Loading material types…
Code
viewof materialTable = Inputs.table(materialDistribution, {
header: {
material_type: "Material Type",
sample_count: "Sample Count"
},
format: {
sample_count: d => d.toLocaleString()
}
})Key Insights: - Çatalhöyük leads with 145,900+ samples - one of the world’s largest Neolithic sites - Biogenic non-organic materials dominate (bones, shells) reflecting archaeological preservation - Global coverage spans from Arctic (Finnmark) to temperate zones
13 Summary: Key Lessons for Querying OpenContext Parquet
13.1 🎯 Essential Discoveries
Critical Bug Fix: Direct Sample→Location queries don’t work
- Problem: Returned 0 results from 1M+ sample dataset
- Solution: Always traverse through SamplingEvent
- Impact: Unlocked access to 1,096,274 located samples
Correct Relationship Paths:
✅ Sample → produced_by → SamplingEvent → sample_location → Location ✅ Sample → produced_by → SamplingEvent → sampling_site → Site → site_location → LocationProperty Graph Structure:
- 79% edges, 21% entities in 11.6M rows
- Multi-hop traversal required for meaningful queries
- No shortcuts exist - respect the graph model
13.2 🔧 Debugging Methodology
- Verify relationships exist before building complex queries
- Trace step-by-step from simple counts to complex joins
- Test multiple paths - graphs often have alternative routes
- Validate results against known entity counts
13.3 ⚡ Performance Guidelines
- Filter by
otypefirst - reduces 11M rows to manageable subsets - Use CTEs for complex multi-hop queries
- Aggregate before filtering when possible
- Respect obfuscated coordinates for site protection
13.4 🏛️ Archaeological Context
- Major sites: Çatalhöyük, Petra, Polis Chrysochous dominate sample counts
- Material types: Biogenic non-organic materials most common
- Global reach: Arctic to Antarctic coverage with sensitive location protection
- Research value: 1M+ precisely located specimens for spatial analysis
13.5 🚀 Advanced Applications
This corrected understanding enables: - Spatial clustering analysis of archaeological finds - Temporal pattern recognition through sampling events - Site similarity studies via material type distributions - Collection bias analysis through agent and responsibility networks
The key to success: Understand the graph model first, query second. This property graph structure reflects the real-world complexity of archaeological data collection and enables sophisticated analysis when queried correctly.
14 Next Steps
Ready to analyze this data? Remember: 1. Start with entity relationship exploration 2. Build queries incrementally 3. Validate results at each step 4. Respect archaeological site sensitivities
Happy querying! 🏺