Skip to main content

Working with attributes and data types

While geometry defines where features are located, attributes define what features are and their characteristics. Effective attribute design is crucial for creating GeoPackage databases that are both functional and maintainable. This section explores how to choose appropriate data types, structure attributes effectively, and implement best practices for attribute management in Danish municipal applications.

Understanding SQLite Data Types in GeoPackage

GeoPackage uses SQLite as its underlying database engine, which has a flexible but specific approach to data types. Understanding SQLite's type system is essential for proper attribute design.

SQLite Type Affinity

SQLite uses "type affinity" rather than strict typing. Each column has a recommended type, but SQLite will attempt to convert values to the appropriate type when possible.

The five storage classes:

NULL - Represents missing or unknown values.

INTEGER - Signed integers stored in 1, 2, 3, 4, 6, or 8 bytes depending on magnitude.

REAL - Floating-point numbers stored as 8-byte IEEE floating point values.

TEXT - Text strings stored using the database encoding (UTF-8, UTF-16BE, or UTF-16LE).

BLOB - Binary data stored exactly as input (used internally for GeoPackage geometries).

Declared Types and Type Affinity

When you declare a column type in SQLite, it determines the column's "affinity" - how values are stored and compared.

INTEGER affinity:

building_id INTEGER
floor_count INTEGER
construction_year INTEGER
parcel_number INTEGER

REAL affinity:

area_m2 REAL
latitude REAL
temperature REAL
percentage REAL

TEXT affinity:

street_name TEXT
owner_name TEXT
postal_code TEXT  -- Even though it contains digits
cadastral_number TEXT

NUMERIC affinity - Prefers numeric storage but falls back to TEXT:

measurement NUMERIC  -- Could store as INTEGER, REAL, or TEXT

BLOB affinity:

document_pdf BLOB
signature_image BLOB

Choosing Appropriate Data Types

Selecting the right data type for each attribute ensures data integrity, optimal storage, and correct behavior in queries and analysis.

Text Data

Use TEXT for any alphanumeric content, identifiers, codes, names, and descriptions.

Examples from Danish municipalities:

CREATE TABLE properties (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POLYGON NOT NULL,
    
    -- Identifiers and codes
    property_id TEXT UNIQUE NOT NULL,
    bbr_id TEXT,  -- Danish Building and Dwelling Register ID
    cvr_number TEXT,  -- Central Business Register number
    
    -- Names and descriptions
    street_name TEXT NOT NULL,
    street_number TEXT,  -- "123", "123A", "123-125"
    floor TEXT,  -- "st", "1", "2", "kld"
    side TEXT,  -- "tv", "th", "mf"
    city TEXT,
    postal_code TEXT,  -- "8000", not numeric because no math operations
    
    -- Longer text content
    legal_description TEXT,
    owner_notes TEXT,
    restrictions TEXT,
    comments TEXT
);

Best practices for TEXT columns:

  • Always use TEXT for postal codes, even though they're numeric - you never perform arithmetic on them
  • Use TEXT for identifiers that may contain letters or leading zeros
  • Consider maximum length requirements in application logic, though SQLite doesn't enforce VARCHAR limits
  • Store multi-line content using newline characters within TEXT columns
  • Use consistent character encoding (UTF-8 is standard for GeoPackage)

Common mistake to avoid:

-- WRONG: Using INTEGER for postal codes
postal_code INTEGER  -- Loses leading zeros: "0800" becomes 800

-- CORRECT: Using TEXT
postal_code TEXT  -- Preserves "0800", "8000", "DK-8000" exactly

Numeric Data

Choose between INTEGER and REAL based on whether values should be whole numbers or can have decimal precision.

INTEGER - Use for:

CREATE TABLE buildings (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POLYGON NOT NULL,
    
    -- Counts and quantities
    floor_count INTEGER,
    room_count INTEGER,
    apartment_count INTEGER,
    parking_spaces INTEGER,
    
    -- Years
    construction_year INTEGER,  -- 2015, 1987
    renovation_year INTEGER,
    demolition_year INTEGER,
    
    -- Whole number measurements
    max_building_height_m INTEGER,  -- When precision isn't needed
    
    -- Codes and identifiers (when purely numeric)
    use_code INTEGER,
    
    -- Status and priority values
    priority_level INTEGER CHECK(priority_level BETWEEN 1 AND 10),
    inspection_status INTEGER
);

REAL - Use for:

CREATE TABLE parcels (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POLYGON NOT NULL,
    
    -- Measurements requiring precision
    area_m2 REAL,
    perimeter_m REAL,
    frontage_width_m REAL,
    
    -- Percentages and ratios
    building_coverage_pct REAL CHECK(building_coverage_pct >= 0 AND building_coverage_pct <= 100),
    floor_area_ratio REAL,
    green_space_ratio REAL,
    
    -- Coordinates (when stored as attributes)
    centroid_x REAL,
    centroid_y REAL,
    
    -- Financial values
    assessed_value_dkk REAL,
    tax_amount_dkk REAL,
    
    -- Precise measurements
    elevation_m REAL,
    slope_degrees REAL,
    temperature_c REAL
);

Choosing between INTEGER and REAL:

Example 1: Building height

-- If you only need whole meters:
max_height_m INTEGER  -- 15, 23, 8

-- If you need precision:
max_height_m REAL  -- 15.7, 23.25, 8.9

Example 2: Area measurements

-- Area always needs decimal precision:
area_m2 REAL  -- 145.67, 1203.89

-- WRONG: Using INTEGER for area loses precision
area_m2 INTEGER  -- 145.67 becomes 145

Working with currency:

-- For Danish kroner, REAL is appropriate
property_value_dkk REAL,
annual_tax_dkk REAL,

-- For precise financial calculations, consider storing as INTEGER øre:
amount_oere INTEGER,  -- 10050 øre = 100.50 DKK
-- Then convert: amount_dkk = amount_oere / 100.0

Boolean Values

SQLite doesn't have a native BOOLEAN type, but the convention is to use INTEGER with values 0 (false) and 1 (true).

CREATE TABLE infrastructure (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom LINESTRING NOT NULL,
    
    -- Boolean flags using INTEGER
    active INTEGER DEFAULT 1,  -- Active/inactive
    public_access INTEGER DEFAULT 0,  -- Public/private
    winter_maintenance INTEGER,  -- Yes/no
    has_lighting INTEGER DEFAULT 0,
    requires_permit INTEGER DEFAULT 1,
    heritage_protected INTEGER DEFAULT 0,
    
    -- Alternative: use CHECK constraint for clarity
    accessible INTEGER CHECK(accessible IN (0, 1)),
    
    -- Can also use TEXT with CHECK constraint
    operational TEXT CHECK(operational IN ('yes', 'no')) DEFAULT 'yes'
);

Best practices for boolean attributes:

-- Good: Clear column names that read as questions
heritage_protected INTEGER  -- "Is this heritage protected?"
requires_permit INTEGER     -- "Does this require a permit?"
has_sidewalk INTEGER        -- "Does this have a sidewalk?"

-- Good: Explicit default values
active INTEGER DEFAULT 1,
temporary INTEGER DEFAULT 0,

-- Good: CHECK constraint ensures only 0 or 1
status INTEGER CHECK(status IN (0, 1)) DEFAULT 1,

-- Alternative: TEXT with named values for readability
active TEXT CHECK(active IN ('yes', 'no')) DEFAULT 'yes'

Querying boolean values:

-- Find all heritage-protected buildings
SELECT * FROM buildings WHERE heritage_protected = 1;

-- Find all non-active infrastructure
SELECT * FROM infrastructure WHERE active = 0;

-- Boolean logic in WHERE clauses
SELECT * FROM roads 
WHERE has_bike_lane = 1 AND has_lighting = 1;

Date and Time Data

SQLite doesn't have dedicated date/time types but provides date and time functions that work with TEXT, REAL, or INTEGER storage.

TEXT format (ISO 8601) - Recommended:

CREATE TABLE inspections (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    asset_fid INTEGER NOT NULL,
    
    -- Date only (YYYY-MM-DD)
    inspection_date DATE,  -- '2025-01-15'
    scheduled_date DATE,   -- '2025-02-20'
    
    -- Date and time (YYYY-MM-DD HH:MM:SS)
    created_timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,  -- '2025-01-15 14:30:00'
    updated_timestamp DATETIME,
    
    -- With timezone information
    reported_at TEXT,  -- '2025-01-15T14:30:00Z' or '2025-01-15T14:30:00+01:00'
    
    FOREIGN KEY (asset_fid) REFERENCES assets(fid)
);

Working with dates:

-- Insert current date
INSERT INTO inspections (inspection_date) VALUES (date('now'));

-- Insert current timestamp
INSERT INTO inspections (created_timestamp) VALUES (datetime('now'));

-- Date calculations
SELECT * FROM inspections 
WHERE inspection_date > date('now', '-30 days');

-- Extract date parts
SELECT 
    inspection_date,
    strftime('%Y', inspection_date) AS year,
    strftime('%m', inspection_date) AS month,
    strftime('%d', inspection_date) AS day,
    strftime('%w', inspection_date) AS day_of_week
FROM inspections;

-- Date ranges
SELECT * FROM projects
WHERE start_date BETWEEN '2024-01-01' AND '2024-12-31';

-- Calculate duration
SELECT 
    project_name,
    julianday(end_date) - julianday(start_date) AS duration_days
FROM projects;

Danish date considerations:

CREATE TABLE permits (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POLYGON,
    
    application_date DATE NOT NULL,
    approval_date DATE,
    expiry_date DATE,
    
    -- For display in Danish format (dd-mm-yyyy), convert in application layer
    -- Store in ISO format (yyyy-mm-dd) in database
    
    -- Check expiry status
    CHECK(expiry_date IS NULL OR expiry_date >= approval_date)
);

-- Query for expired permits
SELECT * FROM permits 
WHERE expiry_date IS NOT NULL 
  AND expiry_date < date('now');

-- Query for permits expiring within 30 days
SELECT * FROM permits 
WHERE expiry_date BETWEEN date('now') AND date('now', '+30 days');

Common date patterns:

-- Validity period tracking
CREATE TABLE zoning (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom MULTIPOLYGON NOT NULL,
    zone_code TEXT,
    
    effective_from DATE NOT NULL,
    effective_to DATE,  -- NULL means still effective
    
    -- Query current zones:
    -- WHERE effective_from <= date('now') 
    --   AND (effective_to IS NULL OR effective_to > date('now'))
);

-- Temporal versioning
CREATE TABLE parcels (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POLYGON NOT NULL,
    parcel_id TEXT NOT NULL,
    
    valid_from DATE NOT NULL,
    valid_to DATE,
    
    owner_name TEXT,
    area_m2 REAL,
    
    -- Composite uniqueness: parcel can appear multiple times with different date ranges
    UNIQUE(parcel_id, valid_from)
);

The Schema Extension: gpkg_data_columns

The GeoPackage standard includes an optional Schema Extension that provides additional metadata and constraints for attribute columns. The core of this extension is the gpkg_data_columns table, which allows you to document column purposes, specify value constraints, and define enumerated types.

Understanding gpkg_data_columns

The gpkg_data_columns table stores additional information about columns in your feature tables, including descriptions, MIME types, and most importantly, constraint definitions.

Table structure:

CREATE TABLE gpkg_data_columns (
  table_name TEXT NOT NULL,
  column_name TEXT NOT NULL,
  name TEXT,
  title TEXT,
  description TEXT,
  mime_type TEXT,
  constraint_name TEXT,
  CONSTRAINT pk_gdc PRIMARY KEY (table_name, column_name),
  CONSTRAINT fk_gdc_tn FOREIGN KEY (table_name) REFERENCES gpkg_contents(table_name)
);

Column definitions:

  • table_name - Name of the feature table
  • column_name - Name of the column being described
  • name - A short, machine-readable identifier for the column
  • title - A human-readable title for the column
  • description - Detailed description of the column's purpose and content
  • mime_type - MIME type for the data (e.g., 'text/plain', 'application/json')
  • constraint_name - Reference to a constraint in gpkg_data_column_constraints

The gpkg_data_column_constraints Table

This companion table defines the actual constraints, particularly for enumerated values.

CREATE TABLE gpkg_data_column_constraints (
  constraint_name TEXT NOT NULL,
  constraint_type TEXT NOT NULL, -- 'range', 'enum', or 'glob'
  value TEXT,
  min NUMERIC,
  min_is_inclusive BOOLEAN,
  max NUMERIC,
  max_is_inclusive BOOLEAN,
  description TEXT,
  CONSTRAINT gdcc_ntv UNIQUE (constraint_name, constraint_type, value)
);

Constraint types:

  • enum - Enumerated list of valid values
  • range - Numeric range constraints
  • glob - Pattern matching (SQLite GLOB syntax)

Implementing Enumerated Values with gpkg_data_columns

Using the Schema Extension provides better documentation and validation than simple CHECK constraints.

Example 1: Road classification with full documentation

-- 1. Create the feature table
CREATE TABLE roads (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom LINESTRING NOT NULL,
    road_id TEXT UNIQUE NOT NULL,
    road_name TEXT,
    road_class TEXT,
    surface_type TEXT,
    condition TEXT,
    ownership TEXT
);

-- 2. Register in gpkg_contents (as before)
INSERT INTO gpkg_contents (
    table_name, data_type, identifier, description,
    srs_id
) VALUES (
    'roads', 'features', 'Road Network',
    'Municipal road centerlines with classification and condition',
    25832
);

-- 3. Register geometry column (as before)
INSERT INTO gpkg_geometry_columns (
    table_name, column_name, geometry_type_name,
    srs_id, z, m
) VALUES (
    'roads', 'geom', 'LINESTRING', 25832, 0, 0
);

-- 4. Define enumerated constraints for road_class
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('road_class_enum', 'enum', 'Motorway', 'High-speed divided highway (Motorvej)'),
    ('road_class_enum', 'enum', 'Primary', 'Major through roads (Primærrute)'),
    ('road_class_enum', 'enum', 'Secondary', 'Regional connectors (Sekundærrute)'),
    ('road_class_enum', 'enum', 'Residential', 'Local residential streets (Lokalvej)'),
    ('road_class_enum', 'enum', 'Service', 'Service roads and driveways (Adgangsvej)'),
    ('road_class_enum', 'enum', 'Track', 'Unpaved tracks (Sti)');

-- 5. Define enumerated constraints for surface_type
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('surface_type_enum', 'enum', 'Asphalt', 'Asphalt surface (Asfalt)'),
    ('surface_type_enum', 'enum', 'Concrete', 'Concrete surface (Beton)'),
    ('surface_type_enum', 'enum', 'Paved', 'Other paved surface (Befæstet)'),
    ('surface_type_enum', 'enum', 'Gravel', 'Gravel surface (Grus)'),
    ('surface_type_enum', 'enum', 'Dirt', 'Dirt or earth (Jord)'),
    ('surface_type_enum', 'enum', 'Grass', 'Grass surface (Græs)');

-- 6. Define enumerated constraints for condition
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('condition_enum', 'enum', 'Excellent', 'No defects, like new'),
    ('condition_enum', 'enum', 'Good', 'Minor wear, fully functional'),
    ('condition_enum', 'enum', 'Fair', 'Moderate wear, some maintenance needed'),
    ('condition_enum', 'enum', 'Poor', 'Significant defects, major maintenance needed'),
    ('condition_enum', 'enum', 'Critical', 'Severe defects, immediate attention required');

-- 7. Define enumerated constraints for ownership
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('ownership_enum', 'enum', 'Municipal', 'Owned by municipality (Kommunal)'),
    ('ownership_enum', 'enum', 'State', 'Owned by state (Statslig)'),
    ('ownership_enum', 'enum', 'Private', 'Private ownership (Privat)'),
    ('ownership_enum', 'enum', 'Other', 'Other ownership type (Anden)');

-- 8. Document columns in gpkg_data_columns
INSERT INTO gpkg_data_columns (
    table_name, column_name, name, title, description, constraint_name
) VALUES 
    ('roads', 'road_class', 'road_class', 'Road Classification',
     'Functional classification of road based on traffic importance and connectivity',
     'road_class_enum'),
    
    ('roads', 'surface_type', 'surface_type', 'Surface Type',
     'Type of road surface material',
     'surface_type_enum'),
    
    ('roads', 'condition', 'condition', 'Condition Rating',
     'Overall condition assessment based on latest inspection',
     'condition_enum'),
    
    ('roads', 'ownership', 'ownership', 'Ownership Type',
     'Entity responsible for maintenance and ownership',
     'ownership_enum'),
    
    ('roads', 'road_name', 'road_name', 'Road Name',
     'Official name of the road (Danish)',
     NULL);

-- 9. Query to see all valid values for a column
SELECT 
    dc.table_name,
    dc.column_name,
    dc.title,
    dcc.value,
    dcc.description
FROM gpkg_data_columns dc
JOIN gpkg_data_column_constraints dcc 
    ON dc.constraint_name = dcc.constraint_name
WHERE dc.table_name = 'roads' 
  AND dc.column_name = 'road_class'
ORDER BY dcc.value;

Example 2: Building types with hierarchical classification

CREATE TABLE buildings (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POLYGON NOT NULL,
    building_id TEXT UNIQUE NOT NULL,
    building_type TEXT,
    building_subtype TEXT,
    usage_category TEXT
);

-- Define building type enumerations
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('building_type_enum', 'enum', 'Residential', 'Residential building (Bolig)'),
    ('building_type_enum', 'enum', 'Commercial', 'Commercial building (Erhverv)'),
    ('building_type_enum', 'enum', 'Industrial', 'Industrial building (Industri)'),
    ('building_type_enum', 'enum', 'Agricultural', 'Agricultural building (Landbrug)'),
    ('building_type_enum', 'enum', 'Public', 'Public/institutional building (Offentlig)'),
    ('building_type_enum', 'enum', 'Mixed', 'Mixed use building (Blandet)');

-- Define residential subtypes
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('res_subtype_enum', 'enum', 'Detached', 'Single-family detached house (Parcelhus)'),
    ('res_subtype_enum', 'enum', 'Semi-detached', 'Semi-detached house (Dobbelthus)'),
    ('res_subtype_enum', 'enum', 'Terraced', 'Terraced/row house (Rækkehus)'),
    ('res_subtype_enum', 'enum', 'Apartment', 'Apartment building (Etagebolig)'),
    ('res_subtype_enum', 'enum', 'Farm', 'Farmhouse (Stuehus)');

-- Define usage categories
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('usage_enum', 'enum', 'Permanent', 'Permanent residence (Helårsbolig)'),
    ('usage_enum', 'enum', 'Seasonal', 'Seasonal/summer house (Fritidsbolig)'),
    ('usage_enum', 'enum', 'Office', 'Office use (Kontor)'),
    ('usage_enum', 'enum', 'Retail', 'Retail use (Butik)'),
    ('usage_enum', 'enum', 'Warehouse', 'Warehouse/storage (Lager)'),
    ('usage_enum', 'enum', 'Educational', 'School or educational (Undervisning)'),
    ('usage_enum', 'enum', 'Healthcare', 'Healthcare facility (Sundhed)'),
    ('usage_enum', 'enum', 'Cultural', 'Cultural facility (Kultur)');

-- Document columns
INSERT INTO gpkg_data_columns (
    table_name, column_name, name, title, description, constraint_name
) VALUES 
    ('buildings', 'building_type', 'building_type', 'Building Type',
     'Primary classification of building use',
     'building_type_enum'),
    
    ('buildings', 'building_subtype', 'building_subtype', 'Building Subtype',
     'Detailed subtype classification (primarily for residential buildings)',
     'res_subtype_enum'),
    
    ('buildings', 'usage_category', 'usage_category', 'Usage Category',
     'Specific usage category of the building',
     'usage_enum');

Example 3: Energy labels for buildings

CREATE TABLE building_energy (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    building_fid INTEGER NOT NULL,
    energy_label TEXT,
    assessment_date DATE,
    valid_until DATE,
    FOREIGN KEY (building_fid) REFERENCES buildings(fid)
);

-- Define energy label enumeration (Danish system)
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('energy_label_enum', 'enum', 'A2020', 'Best - Very low energy consumption (2020 standard)'),
    ('energy_label_enum', 'enum', 'A2015', 'Excellent - Very low energy consumption (2015 standard)'),
    ('energy_label_enum', 'enum', 'A2010', 'Excellent - Very low energy consumption (2010 standard)'),
    ('energy_label_enum', 'enum', 'B', 'Good - Low energy consumption'),
    ('energy_label_enum', 'enum', 'C', 'Average - Moderate energy consumption'),
    ('energy_label_enum', 'enum', 'D', 'Below average - Moderate-high energy consumption'),
    ('energy_label_enum', 'enum', 'E', 'Poor - High energy consumption'),
    ('energy_label_enum', 'enum', 'F', 'Very poor - Very high energy consumption'),
    ('energy_label_enum', 'enum', 'G', 'Worst - Extremely high energy consumption');

-- Document the column
INSERT INTO gpkg_data_columns (
    table_name, column_name, name, title, description, constraint_name
) VALUES 
    ('building_energy', 'energy_label', 'energy_label', 'Energy Performance Label',
     'Danish energy performance certificate rating (Energimærke)',
     'energy_label_enum');

Using Range Constraints

The Schema Extension also supports numeric range constraints.

CREATE TABLE inspections (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    asset_fid INTEGER NOT NULL,
    inspection_score INTEGER,
    confidence_level REAL,
    temperature_c REAL
);

-- Define range constraint for inspection score (0-100)
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, min, min_is_inclusive, max, max_is_inclusive, description
) VALUES 
    ('score_range', 'range', 0, 1, 100, 1, 'Inspection score must be between 0 and 100 inclusive');

-- Define range constraint for confidence level (0.0-1.0)
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, min, min_is_inclusive, max, max_is_inclusive, description
) VALUES 
    ('confidence_range', 'range', 0.0, 1, 1.0, 1, 'Confidence level as decimal between 0 and 1');

-- Define range constraint for temperature (-40 to +50 °C)
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, min, min_is_inclusive, max, max_is_inclusive, description
) VALUES 
    ('temp_range', 'range', -40.0, 1, 50.0, 1, 'Temperature in Celsius, typical Danish range');

-- Document columns
INSERT INTO gpkg_data_columns (
    table_name, column_name, name, title, description, constraint_name
) VALUES 
    ('inspections', 'inspection_score', 'inspection_score', 'Inspection Score',
     'Overall condition score from 0 (worst) to 100 (best)',
     'score_range'),
    
    ('inspections', 'confidence_level', 'confidence_level', 'Confidence Level',
     'Inspector confidence in assessment as decimal 0.0-1.0',
     'confidence_range'),
    
    ('inspections', 'temperature_c', 'temperature_c', 'Temperature (°C)',
     'Ambient temperature during inspection',
     'temp_range');

Using GLOB Pattern Constraints

GLOB constraints use SQLite's pattern matching for text validation.

CREATE TABLE cadastral_parcels (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POLYGON NOT NULL,
    cadastral_number TEXT,
    postal_code TEXT
);

-- Danish cadastral numbers follow pattern: XX-XXXXX (e.g., 12-34567)
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('cadastral_pattern', 'glob', '[0-9][0-9]-[0-9][0-9][0-9][0-9][0-9]',
     'Cadastral number format: two digits, hyphen, five digits');

-- Danish postal codes are exactly 4 digits
INSERT INTO gpkg_data_column_constraints (
    constraint_name, constraint_type, value, description
) VALUES 
    ('postal_code_pattern', 'glob', '[0-9][0-9][0-9][0-9]',
     'Danish postal code: exactly four digits');

-- Document columns
INSERT INTO gpkg_data_columns (
    table_name, column_name, name, title, description, constraint_name
) VALUES 
    ('cadastral_parcels', 'cadastral_number', 'cadastral_number', 'Cadastral Number',
     'Official cadastral parcel identifier (Matrikelnummer)',
     'cadastral_pattern'),
    
    ('cadastral_parcels', 'postal_code', 'postal_code', 'Postal Code',
     'Danish postal code (Postnummer)',
     'postal_code_pattern');

Querying Schema Extension Tables

Get all documented columns:

SELECT 
    dc.table_name,
    dc.column_name,
    dc.title,
    dc.description,
    dc.constraint_name
FROM gpkg_data_columns dc
ORDER BY dc.table_name, dc.column_name;

Get enumerated values for a specific column:

SELECT 
    dcc.value,
    dcc.description
FROM gpkg_data_columns dc
JOIN gpkg_data_column_constraints dcc 
    ON dc.constraint_name = dcc.constraint_name
WHERE dc.table_name = 'roads'
  AND dc.column_name = 'road_class'
  AND dcc.constraint_type = 'enum'
ORDER BY dcc.value;

Get all constraints for a table:

SELECT 
    dc.column_name,
    dc.title,
    dcc.constraint_type,
    dcc.value,
    dcc.min,
    dcc.max,
    dcc.description
FROM gpkg_data_columns dc
JOIN gpkg_data_column_constraints dcc 
    ON dc.constraint_name = dcc.constraint_name
WHERE dc.table_name = 'buildings'
ORDER BY dc.column_name, dcc.constraint_type, dcc.value;

Generate documentation report:

SELECT 
    dc.table_name AS 'Table',
    dc.column_name AS 'Column',
    dc.title AS 'Title',
    dc.description AS 'Description',
    GROUP_CONCAT(dcc.value, ', ') AS 'Valid Values'
FROM gpkg_data_columns dc
LEFT JOIN gpkg_data_column_constraints dcc 
    ON dc.constraint_name = dcc.constraint_name 
    AND dcc.constraint_type = 'enum'
GROUP BY dc.table_name, dc.column_name
ORDER BY dc.table_name, dc.column_name;

Benefits of Using gpkg_data_columns

Self-documenting data - Descriptions and constraints are stored within the GeoPackage, traveling with the data.

Interoperability - Applications that support the Schema Extension can automatically validate data and populate user interfaces with valid values.

Data quality - Enumerated values and range constraints provide clear expectations for data entry.

Internationalization - Descriptions can include translations or references to multiple languages.

Consistency - Standardized constraint definitions across multiple tables ensure uniform data quality.

Discoverability - Users can query the schema to understand what values are valid without consulting external documentation.

Combining CHECK Constraints with gpkg_data_columns

For maximum data integrity, combine SQL CHECK constraints with Schema Extension documentation:

CREATE TABLE roads (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom LINESTRING NOT NULL,
    road_class TEXT CHECK(road_class IN ('Motorway', 'Primary', 'Secondary', 'Residential', 'Service', 'Track')),
    condition TEXT CHECK(condition IN ('Excellent', 'Good', 'Fair', 'Poor', 'Critical'))
);

-- SQL CHECK constraint enforces at database level
-- gpkg_data_columns provides documentation and descriptions
INSERT INTO gpkg_data_column_constraints (...) VALUES (...);

This dual approach ensures:

  • Database-level validation (CHECK constraint)
  • Human-readable documentation (gpkg_data_columns)
  • Application-level discovery (gpkg_data_column_constraints)

NULL Values and Data Quality

Understanding and properly handling NULL (missing or unknown data) is crucial for data quality.

NULL represents:

  • Unknown information
  • Not applicable
  • Not yet collected
  • Intentionally left blank

Working with NULL:

CREATE TABLE buildings (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POLYGON NOT NULL,  -- Geometry should never be NULL
    
    building_id TEXT NOT NULL,  -- Required field
    
    construction_year INTEGER,  -- NULL if unknown
    renovation_year INTEGER,    -- NULL if never renovated or unknown
    demolition_year INTEGER,    -- NULL if not demolished
    
    owner_name TEXT,  -- NULL if ownership unknown or not public
    
    energy_label TEXT,  -- NULL if not yet assessed
    
    heritage_status TEXT  -- NULL if not evaluated
);

-- Querying NULL values
SELECT * FROM buildings WHERE construction_year IS NULL;  -- Unknown construction year
SELECT * FROM buildings WHERE renovation_year IS NULL;    -- Never renovated OR unknown

-- Counting NULL values
SELECT 
    COUNT(*) AS total_buildings,
    COUNT(construction_year) AS with_construction_year,  -- COUNT ignores NULLs
    COUNT(*) - COUNT(construction_year) AS missing_construction_year,
    COUNT(energy_label) AS with_energy_label
FROM buildings;

-- NULL-safe comparisons
SELECT * FROM buildings 
WHERE COALESCE(renovation_year, 0) > 2000;  -- Treats NULL as 0

-- Providing defaults for display
SELECT 
    building_id,
    COALESCE(owner_name, 'Unknown') AS owner,
    COALESCE(energy_label, 'Not Assessed') AS energy
FROM buildings;

NULL vs empty string:

-- NULL means unknown/not applicable
owner_name TEXT  -- NULL = ownership unknown

-- Empty string means known to be empty
owner_name TEXT  -- '' = known to have no owner

-- Best practice: distinguish between them
SELECT * FROM buildings WHERE owner_name IS NULL;  -- Unknown owner
SELECT * FROM buildings WHERE owner_name = '';     -- Known to be empty
SELECT * FROM buildings WHERE owner_name IS NULL OR owner_name = '';  -- Either

Default values vs NULL:

-- Use DEFAULT when a sensible default exists
active INTEGER DEFAULT 1,  -- Assume active unless specified otherwise
created_date DATE DEFAULT CURRENT_DATE,  -- Auto-populate

-- Use NULL when there's no sensible default
construction_year INTEGER,  -- No reasonable default
last_inspection DATE,  -- NULL until first inspection

Advanced Attribute Patterns

Multi-language Support

For international data or bilingual requirements:

CREATE TABLE points_of_interest (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POINT NOT NULL,
    name_da TEXT,  -- Danish name
    name_en TEXT,  -- English name
    description_da TEXT,
    description_en TEXT,
    category_da TEXT,
    category_en TEXT
);

-- Document with Schema Extension
INSERT INTO gpkg_data_columns (
    table_name, column_name, name, title, description
) VALUES 
    ('points_of_interest', 'name_da', 'name_da', 'Name (Danish)',
     'Point of interest name in Danish'),
    ('points_of_interest', 'name_en', 'name_en', 'Name (English)',
     'Point of interest name in English');

Measurement Units Documentation

CREATE TABLE sensors (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POINT NOT NULL,
    sensor_id TEXT,
    water_level_cm REAL,
    flow_rate_m3h REAL,
    pressure_bar REAL
);

-- Document units in descriptions
INSERT INTO gpkg_data_columns (
    table_name, column_name, name, title, description
) VALUES 
    ('sensors', 'water_level_cm', 'water_level', 'Water Level',
     'Water level measurement in centimeters (cm)'),
    ('sensors', 'flow_rate_m3h', 'flow_rate', 'Flow Rate',
     'Flow rate in cubic meters per hour (m³/h)'),
    ('sensors', 'pressure_bar', 'pressure', 'Pressure',
     'Pressure measurement in bar');

Coded Values with External References

CREATE TABLE land_parcels (
    fid INTEGER PRIMARY KEY AUTOINCREMENT,
    geom POLYGON NOT NULL,
    land_use_code TEXT,
    planning_code TEXT
);

-- Reference external classification systems
INSERT INTO gpkg_data_columns (
    table_name, column_name, name, title, description
) VALUES 
    ('land_parcels', 'land_use_code', 'land_use_code', 'Land Use Code',
     'Land use classification according to Danish Planning Act §5b. See https://plandata.dk for complete classification.'),
    ('land_parcels', 'planning_code', 'planning_code', 'Planning Zone Code',
     'Planning zone code from municipal plan. Reference: Local Plan XXXX');

Data Validation Strategies

Validation Queries

-- Find features with invalid enum values
SELECT f.fid, f.road_class
FROM roads f
WHERE f.road_class NOT IN (
    SELECT dcc.value
    FROM gpkg_data_columns dc
    JOIN gpkg_data_column_constraints dcc ON dc.constraint_name = dcc.constraint_name
    WHERE dc.table_name = 'roads'
      AND dc.column_name = 'road_class'
      AND dcc.constraint_type = 'enum'
);

-- Find features with out-of-range values
SELECT f.fid, f.inspection_score
FROM inspections f
WHERE f.inspection_score < (
    SELECT min FROM gpkg_data_column_constraints
    WHERE constraint_name = 'score_range'
) OR f.inspection_score > (
    SELECT max FROM gpkg_data_column_constraints
    WHERE constraint_name = 'score_range'
);

-- Check for required attributes
SELECT fid, building_id
FROM buildings
WHERE building_type IS NULL OR building_type = '';

Creating Validation Views

CREATE VIEW data_quality_check AS
SELECT 
    'roads' AS layer,
    'road_class' AS attribute,
    COUNT(*) AS invalid_count
FROM roads
WHERE road_class NOT IN ('Motorway', 'Primary', 'Secondary', 'Residential', 'Service', 'Track')

UNION ALL

SELECT 
    'buildings' AS layer,
    'energy_label' AS attribute,
    COUNT(*) AS invalid_count
FROM buildings
WHERE energy_label IS NOT NULL 
  AND energy_label NOT IN ('A2020', 'A2015', 'A2010', 'B', 'C', 'D', 'E', 'F', 'G');

Best Practices Summary

1. Choose appropriate data types - Use INTEGER for counts and years, REAL for measurements, TEXT for identifiers and codes, DATE for temporal data.

2. Use the Schema Extension - Document columns with gpkg_data_columns and define constraints with gpkg_data_column_constraints for better data quality and self-documentation.

3. Define enumerated values properly - Use the Schema Extension for complex enumerations, CHECK constraints for simple validation, or both for maximum integrity.

4. Handle NULL values consistently - Distinguish between unknown (NULL) and empty values, use defaults where appropriate.

5. Document units and references - Always specify measurement units in column names or descriptions, reference external classification systems where applicable.

6. Implement validation - Create validation queries and views to monitor data quality over time.

7. Consider internationalization - Use separate columns for different languages when supporting multiple languages.

8. Use meaningful column names - Choose names that are self-explanatory and follow consistent conventions.


Knowledge Check Quiz

Test your understanding of working with attributes and data types in GeoPackage:

Question 1: Why should Danish postal codes be stored as TEXT rather than INTEGER?

a) TEXT columns are faster to query
b) Postal codes may have leading zeros that would be lost as INTEGER
c) TEXT columns take less storage space
d) INTEGER doesn't support 4-digit numbers

Question 2: What is the correct data type for storing a building's floor area in square meters?

a) INTEGER - because area is a number
b) TEXT - to preserve decimal precision
c) REAL - because area requires decimal precision
d) NUMERIC - for flexibility

Question 3: In SQLite/GeoPackage, how are boolean values conventionally stored?

a) As TRUE/FALSE keywords
b) As INTEGER values 0 (false) and 1 (true)
c) As TEXT values 'yes' and 'no'
d) As BOOLEAN data type

Question 4: What is the recommended format for storing dates in GeoPackage?

a) DD-MM-YYYY as TEXT for Danish format
b) Unix timestamp as INTEGER
c) YYYY-MM-DD as TEXT (ISO 8601)
d) Julian day as REAL

Question 5: What is the purpose of the gpkg_data_columns table?

a) To store the actual feature data
b) To provide additional metadata and documentation for attribute columns
c) To define coordinate reference systems
d) To create spatial indexes

Question 6: In the Schema Extension, which table stores the actual enumerated values for a constraint?

a) gpkg_data_columns
b) gpkg_contents
c) gpkg_data_column_constraints
d) gpkg_extensions

Question 7: What are the three constraint types supported by gpkg_data_column_constraints?

a) PRIMARY, FOREIGN, UNIQUE
b) range, enum, glob
c) INTEGER, REAL, TEXT
d) NOT NULL, CHECK, DEFAULT

Question 8: When should you use a separate lookup table instead of the Schema Extension's enum constraints?

a) Always - lookup tables are always better
b) Never - Schema Extension is always sufficient
c) When you need additional metadata, hierarchical relationships, or frequently changing values
d) Only for numeric values

Question 9: What does a NULL value represent in a database?

a) The number zero
b) An empty string
c) Missing, unknown, or not applicable data
d) An error in the data

Question 10: For a column storing energy labels (A2020, A2015, B, C, D, E, F, G), what's the best validation approach?

a) No validation needed - trust user input
b) Use INTEGER codes instead (1, 2, 3...)
c) Use Schema Extension enum constraints and optionally a CHECK constraint
d) Store as free text and validate in the application

Question 11: What is the correct way to query for records where renovation_year is NULL?

a) WHERE renovation_year = NULL
b) WHERE renovation_year IS NULL
c) WHERE renovation_year == NULL
d) WHERE ISNULL(renovation_year)

Question 12: When storing currency amounts in Danish kroner, which approach provides the most precision?

a) TEXT to preserve exact values
b) INTEGER storing øre (1 DKK = 100 øre)
c) REAL for decimal values
d) Either INTEGER (øre) or REAL depending on precision needs

Question 13: What pattern would you use in a GLOB constraint to validate Danish postal codes (exactly 4 digits)?

a) [0-9]{4}
b) [0-9][0-9][0-9][0-9]
c) \d\d\d\d
d) ####

Question 14: For a building height column where you need values like 15.7 meters, which data type should you use?

a) INTEGER - store as 157 decimeters
b) TEXT - to preserve decimals
c) REAL - for decimal precision
d) NUMERIC - for flexibility

Question 15: What's the benefit of combining SQL CHECK constraints with Schema Extension constraints?

a) No benefit - it's redundant
b) Database-level enforcement plus human-readable documentation
c) Faster query performance
d) Reduced storage requirements


Answer Key

  1. b) Postal codes may have leading zeros that would be lost as INTEGER - For example, "0800" would become 800 as an INTEGER, losing the leading zero. TEXT preserves the exact format.
  2. c) REAL - because area requires decimal precision - Floor areas like 145.67 m² need decimal values. INTEGER would truncate to 145 m².
  3. b) As INTEGER values 0 (false) and 1 (true) - SQLite doesn't have a native BOOLEAN type, so the convention is to use INTEGER with 0 = false and 1 = true.
  4. c) YYYY-MM-DD as TEXT (ISO 8601) - This format sorts correctly, works with SQLite date functions, and is the international standard.
  5. b) To provide additional metadata and documentation for attribute columns - It stores titles, descriptions, and references to constraints for attribute columns.
  6. c) gpkg_data_column_constraints - This table stores the actual constraint definitions, including enumerated values, ranges, and patterns.
  7. b) range, enum, glob - These are the three constraint types: numeric ranges, enumerated values, and pattern matching.
  8. c) When you need additional metadata, hierarchical relationships, or frequently changing values - Lookup tables provide more flexibility for complex classifications.
  9. c) Missing, unknown, or not applicable data - NULL specifically represents the absence of a value, which is different from zero or empty string.
  10. c) Use Schema Extension enum constraints and optionally a CHECK constraint - This provides both documentation (Schema Extension) and database-level validation (CHECK constraint).
  11. b) WHERE renovation_year IS NULL - NULL requires the IS operator, not the equality operator (=).
  12. d) Either INTEGER (øre) or REAL depending on precision needs - INTEGER øre (10050 = 100.50 DKK) provides exact precision for financial calculations; REAL works for general currency storage.
  13. b) [0-9][0-9][0-9][0-9] - SQLite GLOB uses this syntax for exactly four digits. The regex-style {4} syntax doesn't work in GLOB.
  14. c) REAL - for decimal precision - REAL handles decimal values like 15.7 meters correctly.
  15. b) Database-level enforcement plus human-readable documentation - CHECK constraints enforce rules at the database level, while Schema Extension provides discoverable documentation for applications and users.
Updated on Jan 17, 2026