Logic for inferring the semantic types of Text fields based on their TextFingerprints. These tests only run against Fields that don't have existing semantic types. | (ns metabase.analyze.classifiers.text-fingerprint (:require [metabase.analyze.fingerprint.schema :as fingerprint.schema] [metabase.analyze.schema :as analyze.schema] [metabase.sync.util :as sync-util] [metabase.util.log :as log] [metabase.util.malli :as mu] [metabase.util.malli.schema :as ms])) |
Fields that have at least this percent of values that satisfy some predicate (such as | (def ^:private ^:const ^Double percent-valid-threshold 0.95) |
Fields that have at least this lower percent of values that satisfy some predicate (such as | (def ^:private ^Double lower-percent-valid-threshold 0.7) |
Is the value of | (mu/defn- percent-key-above-threshold? [threshold :- :double text-fingerprint :- fingerprint.schema/TextFingerprint percent-key :- :keyword] (when-let [percent (get text-fingerprint percent-key)] (>= percent threshold))) |
Map of keys inside the | (def ^:private percent-key->semantic-type {:percent-json [:type/SerializedJSON percent-valid-threshold] :percent-url [:type/URL percent-valid-threshold] :percent-email [:type/Email percent-valid-threshold] :percent-state [:type/State lower-percent-valid-threshold]}) |
(mu/defn- infer-semantic-type-for-text-fingerprint :- [:maybe ms/FieldType] "Check various percentages inside the `text-fingerprint` and return the corresponding semantic type to mark the Field as if the percent passes the threshold." [text-fingerprint :- fingerprint.schema/TextFingerprint] (some (fn [[percent-key [semantic-type threshold]]] (when (percent-key-above-threshold? threshold text-fingerprint percent-key) semantic-type)) percent-key->semantic-type)) | |
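The threshold lookup above can be sketched in plain Clojure, without the Malli schemas. This is a minimal illustration, not the namespace's actual code: `infer-type` is a hypothetical name, and the sample fingerprint maps are invented for the example.

```clojure
;; Thresholds copied from the definitions above.
(def percent-valid-threshold 0.95)
(def lower-percent-valid-threshold 0.7)

;; Fingerprint key -> [semantic type, threshold the percentage must reach].
(def percent-key->semantic-type
  {:percent-json  [:type/SerializedJSON percent-valid-threshold]
   :percent-url   [:type/URL            percent-valid-threshold]
   :percent-email [:type/Email          percent-valid-threshold]
   :percent-state [:type/State          lower-percent-valid-threshold]})

;; Hypothetical, schema-free version of the inference step:
;; return the first semantic type whose percentage meets its threshold.
(defn infer-type [text-fingerprint]
  (some (fn [[percent-key [semantic-type threshold]]]
          (when-let [percent (get text-fingerprint percent-key)]
            (when (>= percent threshold)
              semantic-type)))
        percent-key->semantic-type))

(infer-type {:percent-json 0.0 :percent-url 0.0 :percent-email 0.98 :percent-state 0.0})
;; => :type/Email

(infer-type {:percent-state 0.75})
;; => :type/State (State uses the lower 0.7 threshold)

(infer-type {:percent-email 0.9})
;; => nil (0.9 is below the 0.95 threshold)
```

Because `some` returns the first match, the result when several percentages pass at once depends on the iteration order of the map, so the sketch only exercises inputs where a single key qualifies.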
We can edit the semantic type if it is currently unset or if it was set during the current analysis phase. The original
field might exist in the metadata at | (defn- can-edit-semantic-type? [field] (or (nil? (:semantic_type field)) (let [original (get (meta field) :sync.classify/original)] (and original (nil? (:semantic_type original)))))) |
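To make the metadata check concrete, here is a small sketch of how the predicate behaves on plain maps. The function body mirrors the definition above (made public here only so it can be called directly); the field maps are invented for illustration and are not real Metabase Field records.

```clojure
;; Same logic as the private predicate above, reproduced so the example is self-contained.
(defn can-edit-semantic-type? [field]
  (or (nil? (:semantic_type field))
      (let [original (get (meta field) :sync.classify/original)]
        (and original (nil? (:semantic_type original))))))

;; Unset semantic type: editable.
(can-edit-semantic-type? {:name "email" :semantic_type nil})
;; => true

;; Already typed, with no record of the original field: not editable (falsey).
(can-edit-semantic-type? {:name "email" :semantic_type :type/Email})
;; => nil

;; Typed during the current analysis: the original, stored in metadata, had no semantic type.
(can-edit-semantic-type?
 (with-meta {:name "site" :semantic_type :type/URL}
            {:sync.classify/original {:name "site" :semantic_type nil}}))
;; => true

;; The original already had a semantic type: not editable.
(can-edit-semantic-type?
 (with-meta {:name "site" :semantic_type :type/URL}
            {:sync.classify/original {:name "site" :semantic_type :type/Category}}))
;; => false
```

Note that the predicate returns a truthy or falsey value rather than a strict boolean, which is enough for its use inside `and`.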
(mu/defn infer-semantic-type :- [:maybe analyze.schema/Field] "Do classification for `:type/Text` Fields with a valid `TextFingerprint`. Currently this only checks the various recorded percentages, but this is subject to change in the future." [field :- analyze.schema/Field fingerprint :- [:maybe fingerprint.schema/Fingerprint]] (when (and (isa? (:base_type field) :type/Text) (can-edit-semantic-type? field)) (when-let [text-fingerprint (get-in fingerprint [:type :type/Text])] (when-let [inferred-semantic-type (infer-semantic-type-for-text-fingerprint text-fingerprint)] (log/debugf "Based on the fingerprint of %s, we're marking it as %s." (sync-util/name-for-logging field) inferred-semantic-type) (assoc field :semantic_type inferred-semantic-type))))) | |
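The overall flow of the public entry point can be sketched as follows. This is a simplified stand-in under stated assumptions: `infer-from-text-fingerprint` and `infer-semantic-type-sketch` are hypothetical names, the stand-in inference only looks at `:percent-email`, the edit check is reduced to "semantic type is unset", and schema validation and logging are omitted.

```clojure
;; Hypothetical stand-in for the private inference function; checks only :percent-email.
(defn infer-from-text-fingerprint [text-fingerprint]
  (when (>= (get text-fingerprint :percent-email 0.0) 0.95)
    :type/Email))

;; Simplified version of the entry point: gate on base type and on an unset
;; semantic type, pull out the text fingerprint, then attach the inferred type.
(defn infer-semantic-type-sketch [field fingerprint]
  (when (and (isa? (:base_type field) :type/Text)
             (nil? (:semantic_type field)))
    (when-let [text-fingerprint (get-in fingerprint [:type :type/Text])]
      (when-let [inferred (infer-from-text-fingerprint text-fingerprint)]
        (assoc field :semantic_type inferred)))))

(infer-semantic-type-sketch
 {:base_type :type/Text :semantic_type nil}
 {:type {:type/Text {:percent-email 1.0}}})
;; => {:base_type :type/Text, :semantic_type :type/Email}

;; Non-text fields are skipped.
(infer-semantic-type-sketch
 {:base_type :type/Integer :semantic_type nil}
 {:type {:type/Text {:percent-email 1.0}}})
;; => nil

;; A missing fingerprint yields nil rather than an error.
(infer-semantic-type-sketch {:base_type :type/Text :semantic_type nil} nil)
;; => nil
```

Returning `nil` when nothing is inferred lets a caller tell "no change" apart from an updated field map.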