Analysis sub-step that takes a sample of values for a Field and saving a non-identifying fingerprint used for classification. This fingerprint is saved as a column on the Field it belongs to. | (ns metabase.sync.analyze.fingerprint (:require [clojure.set :as set] [honey.sql.helpers :as sql.helpers] [metabase.analyze :as analyze] [metabase.db.metadata-queries :as metadata-queries] [metabase.db.query :as mdb.query] [metabase.driver :as driver] [metabase.driver.util :as driver.u] [metabase.models.field :as field :refer [Field]] [metabase.models.table :as table] [metabase.query-processor.store :as qp.store] [metabase.sync.interface :as i] [metabase.sync.util :as sync-util] [metabase.util :as u] [metabase.util.log :as log] [metabase.util.malli :as mu] [metabase.util.malli.registry :as mr] [metabase.util.malli.schema :as ms] [redux.core :as redux] [toucan2.core :as t2])) |
(comment metadata-queries/keep-me-for-default-table-row-sample) | |
Key-value pairs corresponding to the state of Fields that have the latest fingerprint, but have not yet
completed analysis. All Fields who get new fingerprints should get marked as having the latest fingerprint
version, but we'll clear their values for | (defn incomplete-analysis-kvs [] {:fingerprint_version i/*latest-fingerprint-version* :last_analyzed nil}) |
(mu/defn- save-fingerprint! [field :- i/FieldInstance fingerprint :- [:maybe analyze/Fingerprint]] (log/debugf "Saving fingerprint for %s" (sync-util/name-for-logging field)) (t2/update! Field (u/the-id field) (merge (incomplete-analysis-kvs) {:fingerprint fingerprint}))) | |
(mr/def ::FingerprintStats [:map [:no-data-fingerprints ms/IntGreaterThanOrEqualToZero] [:failed-fingerprints ms/IntGreaterThanOrEqualToZero] [:updated-fingerprints ms/IntGreaterThanOrEqualToZero] [:fingerprints-attempted ms/IntGreaterThanOrEqualToZero]]) | |
(mu/defn empty-stats-map :- ::FingerprintStats "The default stats before any fingerprints happen" [fields-count :- ms/IntGreaterThanOrEqualToZero] {:no-data-fingerprints 0 :failed-fingerprints 0 :updated-fingerprints 0 :fingerprints-attempted fields-count}) | |
The maximum size of :type/Text to be selected from the database in | (def ^:private ^:dynamic *truncation-size* 1234) |
(mu/defn- fingerprint-table! [table :- i/TableInstance fields :- [:maybe [:sequential i/FieldInstance]]] (let [rff (fn [_metadata] (redux/post-complete (analyze/fingerprint-fields fields) (fn [fingerprints] (reduce (fn [count-info [field fingerprint]] (cond (instance? Throwable fingerprint) (update count-info :failed-fingerprints inc) (some-> fingerprint :global :distinct-count zero?) (update count-info :no-data-fingerprints inc) :else (do (save-fingerprint! field fingerprint) (update count-info :updated-fingerprints inc)))) (empty-stats-map (count fingerprints)) (map vector fields fingerprints))))) driver (driver.u/database->driver (table/database table)) opts {:truncation-size *truncation-size*}] (driver/table-rows-sample driver table fields rff opts))) | |
+----------------------------------------------------------------------------------------------------------------+ | WHICH FIELDS NEED UPDATED FINGERPRINTS? | +----------------------------------------------------------------------------------------------------------------+ | |
Logic for building the somewhat-complicated query we use to determine which Fields need new Fingerprints This ends up giving us a SQL query that looks something like: SELECT * FROM metabase_field WHERE active = true AND (semantictype NOT IN ('type/PK') OR semantictype IS NULL) AND preview_display = true AND visibility_type <> 'retired' AND table_id = 1 AND ((fingerprint_version < 1 AND base_type IN ("type/Longitude", "type/Latitude", "type/Integer")) OR (fingerprint_version < 2 AND base_type IN ("type/Text", "type/SerializedJSON"))) | |
(mu/defn- base-types->descendants :- [:maybe [:set ms/FieldTypeKeywordOrString]] "Given a set of `base-types` return an expanded set that includes those base types as well as all of their descendants. These types are converted to strings so HoneySQL doesn't confuse them for columns." [base-types :- [:set ms/FieldType]] (into #{} (comp (mapcat (fn [base-type] (cons base-type (descendants base-type)))) (map u/qualified-name)) base-types)) | |
It's even cooler if we could generate efficient SQL that looks at what types have already been marked for upgrade so we don't need to generate overly-complicated queries. e.g. instead of doing: WHERE ((version < 2 AND base_type IN ("type/Integer", "type/BigInteger", "type/Text")) OR (version < 1 AND base_type IN ("type/Boolean", "type/Integer", "type/BigInteger"))) we could do: WHERE ((version < 2 AND base_type IN ("type/Integer", "type/BigInteger", "type/Text")) OR (version < 1 AND base_type IN ("type/Boolean"))) (In the example above, something that is a This way we can also completely omit adding clauses for versions that have been "eclipsed" by others. This would keep the SQL query from growing boundlessly as new fingerprint versions are added | (mu/defn- versions-clauses :- [:maybe [:sequential :any]] [] ;; keep track of all the base types (including descendants) for each version, starting from most recent (let [versions+base-types (reverse (sort-by first (seq i/*fingerprint-version->types-that-should-be-re-fingerprinted*))) already-seen (atom #{})] (for [[version base-types] versions+base-types :let [descendants (base-types->descendants base-types) not-yet-seen (set/difference descendants @already-seen)] ;; if all the descendants of any given version have already been seen, we can skip this clause altogether :when (seq not-yet-seen)] ;; otherwise record the newly seen types and generate an appropriate clause (do (swap! already-seen set/union not-yet-seen) [:and [:< :fingerprint_version version] [:in :base_type not-yet-seen]])))) |
Base clause to get fields for fingerprinting. When refingerprinting, run as is. When fingerprinting in analysis, only look for fields without a fingerprint or whose version can be updated. This clauses is added on by [[versions-clauses]]. | (def ^:private fields-to-fingerprint-base-clause [:and [:= :active true] [:or [:not (mdb.query/isa :semantic_type :type/PK)] [:= :semantic_type nil]] [:not-in :visibility_type ["retired" "sensitive"]] [:not-in :base_type (conj (mdb.query/type-keyword->descendants :type/fingerprint-unsupported) (u/qualified-name :type/*))]]) |
Whether we are refingerprinting or doing the normal fingerprinting. Refingerprinting should get fields that already are analyzed and have fingerprints. | (def ^:dynamic *refingerprint?* false) |
(mu/defn- honeysql-for-fields-that-need-fingerprint-updating :- [:map [:where :any]] "Return appropriate WHERE clause for all the Fields whose Fingerprint needs to be re-calculated." ([] {:where (cond-> fields-to-fingerprint-base-clause (not *refingerprint?*) (conj (cons :or (versions-clauses))))}) ([table :- i/TableInstance] (sql.helpers/where (honeysql-for-fields-that-need-fingerprint-updating) [:= :table_id (u/the-id table)]))) | |
+----------------------------------------------------------------------------------------------------------------+ | FINGERPRINTING ALL FIELDS IN A TABLE | +----------------------------------------------------------------------------------------------------------------+ | |
(mu/defn- fields-to-fingerprint :- [:maybe [:sequential i/FieldInstance]] "Return a sequences of Fields belonging to `table` for which we should generate (and save) fingerprints. This should include NEW fields that are active and visible." [table :- i/TableInstance] (seq (t2/select Field (honeysql-for-fields-that-need-fingerprint-updating table)))) | |
Generate and save fingerprints for all the Fields in TODO - | (mu/defn fingerprint-fields! [table :- i/TableInstance] (if-let [fields (fields-to-fingerprint table)] (do (log/infof "Fingerprinting %s fields in table %s" (count fields) (sync-util/name-for-logging table)) (let [stats (sync-util/with-error-handling (format "Error fingerprinting %s" (sync-util/name-for-logging table)) (fingerprint-table! table fields))] (if (instance? Exception stats) (empty-stats-map 0) stats))) (empty-stats-map 0))) |
(def ^:private LogProgressFn [:=> [:cat :string [:schema i/TableInstance]] :any]) | |
Invokes | (mu/defn- fingerprint-fields-for-db!* ([database :- i/DatabaseInstance log-progress-fn :- LogProgressFn] (fingerprint-fields-for-db!* database log-progress-fn (constantly true))) ([database :- i/DatabaseInstance log-progress-fn :- LogProgressFn continue? :- [:=> [:cat ::FingerprintStats] :any]] (qp.store/with-metadata-provider (u/the-id database) (let [tables (if *refingerprint?* (sync-util/refingerprint-reducible-sync-tables database) (sync-util/reducible-sync-tables database))] (reduce (fn [acc table] (log-progress-fn (if *refingerprint?* "refingerprint-fields" "fingerprint-fields") table) (let [new-acc (merge-with + acc (fingerprint-fields! table))] (if (continue? new-acc) new-acc (reduced new-acc)))) (empty-stats-map 0) tables))))) |
Invokes [[fingerprint-fields!]] on every table in | (mu/defn fingerprint-fields-for-db! [database :- i/DatabaseInstance log-progress-fn :- LogProgressFn] (if (driver.u/supports? (:engine database) :fingerprint database) (fingerprint-fields-for-db!* database log-progress-fn) (empty-stats-map 0))) |
Maximum number of fields to refingerprint. Balance updating our fingerprinting values while not spending too much time in the db. | (def ^:private max-refingerprint-field-count 1000) |
Invokes [[fingeprint-fields!]] on every table in | (mu/defn refingerprint-fields-for-db! [database :- i/DatabaseInstance log-progress-fn :- LogProgressFn] (binding [*refingerprint?* true] (fingerprint-fields-for-db!* database log-progress-fn (fn [stats-acc] (< (:fingerprints-attempted stats-acc) max-refingerprint-field-count))))) |
Refingerprint a field | (mu/defn refingerprint-field [field :- i/FieldInstance] (let [table (field/table field)] (fingerprint-table! table [field]))) |