Logic for inferring the semantic types of Text fields based on their TextFingerprints. These tests only run against Fields that don't have existing semantic types. | (ns metabase.analyze.classifiers.text-fingerprint (:require [metabase.analyze.fingerprint.schema :as fingerprint.schema] [metabase.analyze.schema :as analyze.schema] [metabase.sync.util :as sync-util] [metabase.util.log :as log] [metabase.util.malli :as mu] [metabase.util.malli.schema :as ms])) |
Fields that have at least this percent of values that satisfy some predicate (such as | (def ^:private ^:const ^Double percent-valid-threshold 0.95) |
Fields that have at least this lower percent of values that satisfy some predicate (such as | (def ^:private ^Double lower-percent-valid-threshold 0.7) |
Is the value of | (mu/defn- percent-key-above-threshold?
  [threshold :- :double
   text-fingerprint :- fingerprint.schema/TextFingerprint
   percent-key :- :keyword]
  (when-let [percent (get text-fingerprint percent-key)]
    (>= percent threshold))) |
Map of keys inside the | (def ^:private percent-key->semantic-type
  {:percent-json [:type/SerializedJSON percent-valid-threshold]
   :percent-url [:type/URL percent-valid-threshold]
   :percent-email [:type/Email percent-valid-threshold]
   :percent-state [:type/State lower-percent-valid-threshold]}) |
(mu/defn- infer-semantic-type-for-text-fingerprint :- [:maybe ms/FieldType]
  "Check various percentages inside the `text-fingerprint` and return the corresponding semantic type to mark the Field
  as if the percent passes the threshold."
  [text-fingerprint :- fingerprint.schema/TextFingerprint]
  (some (fn [[percent-key [semantic-type threshold]]]
          (when (percent-key-above-threshold? threshold text-fingerprint percent-key)
            semantic-type))
        percent-key->semantic-type)) | |
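The threshold lookup above can be tried at a REPL without Metabase on the classpath. What follows is a minimal sketch, not the namespace's actual code: it drops the Malli schemas, `infer-type` is a hypothetical stand-in name, and the fingerprint maps are made-up values that carry only the `:percent-*` keys.

```clojure
;; Standalone sketch in plain Clojure; no Metabase or Malli dependencies.
(def percent-valid-threshold 0.95)
(def lower-percent-valid-threshold 0.7)

(def percent-key->semantic-type
  {:percent-json  [:type/SerializedJSON percent-valid-threshold]
   :percent-url   [:type/URL percent-valid-threshold]
   :percent-email [:type/Email percent-valid-threshold]
   :percent-state [:type/State lower-percent-valid-threshold]})

(defn infer-type
  "Hypothetical stand-in for the private inference function: returns the first
  semantic type whose percentage meets its threshold, or nil if none does."
  [text-fingerprint]
  (some (fn [[percent-key [semantic-type threshold]]]
          ;; A missing key yields nil here, so that entry is skipped.
          (when-let [percent (get text-fingerprint percent-key)]
            (when (>= percent threshold)
              semantic-type)))
        percent-key->semantic-type))

;; Made-up fingerprints for illustration only.
(infer-type {:percent-json 0.0 :percent-url 0.0 :percent-email 0.98 :percent-state 0.0})
;; => :type/Email

(infer-type {:percent-state 0.75})
;; => :type/State (passes the lower 0.7 threshold)

(infer-type {:percent-email 0.5})
;; => nil
```

Because `some` stops at the first match, a fingerprint that passes more than one threshold gets whichever entry the map yields first; for a small literal map like this one that is the order in which the entries are written.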
We can edit the semantic type if it's currently unset or if it was set during the current analysis phase. The original
field might exist in the metadata at | (defn- can-edit-semantic-type?
  [field]
  (or (nil? (:semantic_type field))
      (let [original (get (meta field) :sync.classify/original)]
        (and original
             (nil? (:semantic_type original)))))) |
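The role of the metadata is easier to see with concrete values. The sketch below repeats the predicate as a public function so it can be evaluated on its own; the field maps are hypothetical and far smaller than real Field records.

```clojure
;; Standalone copy of the predicate, made public for REPL experiments.
(defn can-edit-semantic-type?
  [field]
  (or (nil? (:semantic_type field))
      (let [original (get (meta field) :sync.classify/original)]
        (and original
             (nil? (:semantic_type original))))))

;; No semantic type yet: editable.
(can-edit-semantic-type? {:name "email"})
;; => true

;; Semantic type already present and no original in the metadata: not editable.
;; The result is nil (falsey) rather than false, because `and` returns nil here.
(can-edit-semantic-type? {:name "email" :semantic_type :type/Category})
;; => nil

;; Semantic type was set during this analysis run: the original snapshot in the
;; metadata had none, so a later classifier may still overwrite it.
(can-edit-semantic-type?
 (with-meta {:name "email" :semantic_type :type/Email}
   {:sync.classify/original {:name "email" :semantic_type nil}}))
;; => true
```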
(mu/defn infer-semantic-type :- [:maybe analyze.schema/Field]
  "Do classification for `:type/Text` Fields with a valid `TextFingerprint`.
  Currently this only checks the various recorded percentages, but this is subject to change in the future."
  [field :- analyze.schema/Field
   fingerprint :- [:maybe fingerprint.schema/Fingerprint]]
  (when (and (isa? (:base_type field) :type/Text)
             (can-edit-semantic-type? field))
    (when-let [text-fingerprint (get-in fingerprint [:type :type/Text])]
      (when-let [inferred-semantic-type (infer-semantic-type-for-text-fingerprint text-fingerprint)]
        (log/debugf "Based on the fingerprint of %s, we're marking it as %s."
                    (sync-util/name-for-logging field) inferred-semantic-type)
        (assoc field
               :semantic_type inferred-semantic-type))))) | |