Welcome to Sanjin's Blog!


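- [Junior/Mid] Interviewer: How do you guarantee the overall stability of your front-end projects? How is the core of an error and performance monitoring SDK implemented?
- [Mid/Senior] Interviewer: Do you understand the full pipeline of a monitoring platform, from metrics to visualization? How would you design a large-scale real-time data-stream logging system?
- [Expert] Interviewer: Walk through the full-stack design and practice of your team's front-end monitoring platform in detail, including how you handle massive data traffic and storage.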

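What most separates junior, senior, and expert front-end interviews is how deeply they probe engineering practice, build tooling, and architecture. If you do not sharpen these areas, it is hard to stand out from other candidates. Many developers work on projects that are simply too simple, and many feel that front-end work never touches the business and that the job offers little chance to build business understanding, so over time their core competitiveness erodes. One key interview skill is "borrowing momentum": make a third-rate project sound second-rate, and lift the architecture and best practices of strong products on the market for your own use. Take this project, a monitoring platform: many developers are unclear even on the basic implementation of a front-end SDK, let alone the end-to-end design. That is a weakness, and closing it gives your project real depth and difficulty to talk about.

- SDK design for collecting performance and error metrics and for manual user-behavior tracking
- Full-stack architecture design and best practices in a monorepo
- Classic architecture for real-time data streams and logging systems: Kafka, ClickHouse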

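[Junior/Mid] How do you guarantee the overall stability of your front-end projects? How is the core of an error and performance monitoring SDK implemented?

We previously covered the concepts behind performance metrics in detail. This time we tie them to how monitoring platforms at large companies work in practice, so that the definition, calculation, and reporting logic of each metric is clear. Front-end performance monitoring is the process of collecting and analyzing performance data from the user's side in order to measure and optimize key indicators such as page load speed and interaction response time.

Why monitor performance?
- Better user experience: shorter load and response times improve user retention.
- Finding and fixing bottlenecks: factors that hurt performance are detected early and optimized.
- Data-driven decisions: optimization is based on real data rather than intuition.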

Performance metrics and collection

Common performance metrics

Core performance metrics

Loading performance

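FP (First Paint)
- Definition: the time until the browser first renders any pixel of the page.
- How to measure: Chrome DevTools Performance panel.
FCP (First Contentful Paint)
- Definition: the time until the page first renders any text, image, or SVG.
- How to measure: Performance panel, web-vitals, or Lighthouse.
LCP (Largest Contentful Paint)
- Definition: the time until the largest text block or image on the page is rendered.
- Ideal value: < 2.5 s.
TTFB (Time to First Byte)
- Definition: the time from the user issuing the request until the first byte of the server response arrives.
- How to measure: Network panel.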

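Interaction performance

INP (Interaction to Next Paint)

Definition: the time from a user interaction (such as a button click) until the next frame showing the response is painted.
Ideal value: < 200 ms.

TBT (Total Blocking Time)

Definition: between FCP and TTI (Time to Interactive), the sum of the time the main thread was blocked, counting the part of each long task beyond 50 ms.
Ideal value: < 200 ms.

CLS (Cumulative Layout Shift)

Definition: a score for unexpected layout movement on the page, which hurts the user experience.
Ideal value: < 0.1.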
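To make the "ideal value" figures concrete, here is a minimal sketch that rates a measured value against two thresholds. The `rateMetric` helper and the `THRESHOLDS` table are hypothetical names introduced for illustration. The FCP and LCP pairs match the `FCPThresholds` and `LCPThresholds` constants in the web-vitals source quoted later in this post; the INP and CLS upper bounds (500 ms and 0.25) are the commonly published Core Web Vitals "poor" limits.

```javascript
// [good, poor] upper bounds. FCP, LCP and INP are in milliseconds,
// CLS is a unitless score.
const THRESHOLDS = {
  FCP: [1800, 3000],
  LCP: [2500, 4000],
  INP: [200, 500],
  CLS: [0.1, 0.25],
};

// rateMetric is a hypothetical helper, not part of any library.
function rateMetric(name, value) {
  const bounds = THRESHOLDS[name];
  if (!bounds) throw new Error(`Unknown metric: ${name}`);
  const [good, poor] = bounds;
  if (value <= good) return 'good';
  if (value <= poor) return 'needs-improvement';
  return 'poor';
}

console.log(rateMetric('LCP', 2100)); // good
console.log(rateMetric('CLS', 0.3)); // poor
```

Rating on the client keeps the dashboard queries simple, at the cost of having to re-rate old data if the thresholds ever change.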

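Supplementary performance metrics

DNS lookup time

Definition: the time from issuing the request until the DNS lookup completes.
How to measure: Network panel.

Resource load time

Definition: how long all static resources (images, CSS, JS) take to download.

Long task

Definition: a task that occupies the main thread for more than 50 ms.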

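Defining and obtaining performance metrics

Development-stage tools

Chrome DevTools: the Performance panel captures FP, FCP, LCP, and similar metrics.
Lighthouse: generates a page performance report with optimization suggestions.
web-vitals: monitors the Core Web Vitals such as FCP, LCP, and CLS.

Production monitoring

Front-end performance monitoring tools:
- Google Analytics: configure custom events to record metrics.
- Web Performance API: read performance data directly from the browser.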

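// performance.timing is deprecated; read the Navigation Timing Level 2 entry instead.
const [nav] = performance.getEntriesByType('navigation');
console.log('TTFB:', nav.responseStart - nav.requestStart);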

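Performance monitoring platforms:
- Use an open-source stack (such as Prometheus + Grafana) or a third-party service (such as New Relic or Datadog).

User behavior data

Use tracking points to record key performance indicators such as page load time and interaction latency, and combine them with user-behavior analysis to decide where to optimize.

Computing raw metrics (client-side collection)

Collect the core metrics with the Performance API and the web-vitals library: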

// web-vitals v3 and later expose onLCP/onCLS/onINP; the older get* names are deprecated.
import { onLCP, onCLS, onINP } from 'web-vitals';
// Collect LCP
onLCP((metric) => {
  console.log('LCP:', metric.value);
  reportMetricToServer('lcp', metric.value);
});
// Collect CLS
onCLS((metric) => {
  console.log('CLS:', metric.value);
  reportMetricToServer('cls', metric.value);
});
// Collect INP
onINP((metric) => {
  console.log('INP:', metric.value);
  reportMetricToServer('inp', metric.value);
});
// Report a metric to the server
function reportMetricToServer(name, value) {
  fetch('/api/report-metric', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ name, value, timestamp: Date.now() }),
  });
}

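In addition, PerformanceObserver can capture finer-grained data, such as how LCP and CLS change over time.

First-screen performance

TTFB (Time to First Byte)

Get the time to first byte:

// performance.timing is deprecated; read the Navigation Timing Level 2 entry instead.
const [nav] = performance.getEntriesByType('navigation');
const ttfb = nav.responseStart - nav.requestStart;
console.log('TTFB:', ttfb, 'ms');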
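The same navigation entry also carries the DNS and connection timestamps mentioned under the supplementary metrics. The sketch below splits one entry into phases. `navigationPhases` is a hypothetical helper name; the field names are standard `PerformanceNavigationTiming` attributes. TTFB here follows this article's definition (request sent to first byte); some tools instead measure from navigation start.

```javascript
// Break a PerformanceNavigationTiming-like object into phase durations (ms).
function navigationPhases(nav) {
  return {
    dns: nav.domainLookupEnd - nav.domainLookupStart,
    tcp: nav.connectEnd - nav.connectStart,
    ttfb: nav.responseStart - nav.requestStart,
    download: nav.responseEnd - nav.responseStart,
  };
}

// In a browser, feed it the real navigation entry if one exists.
if (typeof performance !== 'undefined' && typeof performance.getEntriesByType === 'function') {
  const [entry] = performance.getEntriesByType('navigation');
  if (entry) console.log(navigationPhases(entry));
}
```

Taking the entry as a parameter keeps the calculation testable outside the browser with a plain object.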

FP (First Paint) and FCP (First Contentful Paint)

Obtained through PerformancePaintTiming:

const paintEntries = performance.getEntriesByType('paint');
paintEntries.forEach((entry) => {
  console.log(`${entry.name}: ${entry.startTime} ms`);
});
// Output:
// first-paint: xxx ms
// first-contentful-paint: xxx ms

LCP (Largest Contentful Paint)

Obtain LCP with PerformanceObserver:

const observer = new PerformanceObserver((list) => {
  const entries = list.getEntries();
  // Each entry is a new LCP candidate; the last one reported is the final LCP.
  entries.forEach((entry) => {
    console.log('LCP:', entry.startTime, 'ms');
  });
});
observer.observe({ type: 'largest-contentful-paint', buffered: true });

CLS (Cumulative Layout Shift)

Listen for CLS with PerformanceObserver:

let clsValue = 0;
const observer = new PerformanceObserver((list) => {
  list.getEntries().forEach((entry) => {
    if (!entry.hadRecentInput) {
      clsValue += entry.value;
    }
  });
});
observer.observe({ type: 'layout-shift', buffered: true });
window.addEventListener('beforeunload', () => {
  console.log('CLS:', clsValue);
});
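The listener above adds up every layout shift over the whole life of the page. The CLS value that Chrome and web-vitals report is instead the largest "session window": consecutive shifts less than 1 s apart, with each window capped at 5 s. A minimal sketch of that grouping, assuming entries arrive in time order and expose `startTime`, `value`, and `hadRecentInput` as `layout-shift` entries do (`computeCLS` is a hypothetical name):

```javascript
// Return the largest session-window sum of layout-shift values.
function computeCLS(entries) {
  let maxSession = 0;
  let sessionValue = 0;
  let firstTime = 0;
  let lastTime = 0;
  let open = false;

  for (const entry of entries) {
    // Shifts right after user input are expected and do not count.
    if (entry.hadRecentInput) continue;

    const sameSession =
      open &&
      entry.startTime - lastTime < 1000 &&
      entry.startTime - firstTime < 5000;

    if (sameSession) {
      sessionValue += entry.value;
    } else {
      // Start a new session window with this shift.
      sessionValue = entry.value;
      firstTime = entry.startTime;
      open = true;
    }
    lastTime = entry.startTime;
    maxSession = Math.max(maxSession, sessionValue);
  }
  return maxSession;
}
```

On long-lived pages the two approaches can differ a lot, because the simple running total keeps growing while the windowed score does not.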

Interaction performance

TBT (Total Blocking Time)

Capture long tasks:

const observer = new PerformanceObserver((list) => {
  const longTasks = list.getEntries();
  longTasks.forEach((task) => {
    const blockingTime = task.duration - 50; // only the part beyond 50 ms counts as blocking
    if (blockingTime > 0) {
      console.log('Blocking time:', blockingTime, 'ms');
    }
  });
});
observer.observe({ type: 'longtask', buffered: true });
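The observer above logs blocking time per task. To get a single TBT number, the per-task values still have to be summed over the FCP to TTI window from the definition. A minimal sketch, assuming the caller already knows `fcp` and `tti` in milliseconds and that a task belongs to the window when it starts inside it (a simplification; `totalBlockingTime` is a hypothetical name):

```javascript
// Sum the blocking portion (duration beyond 50 ms) of long tasks
// that start between FCP and TTI.
function totalBlockingTime(longTasks, fcp, tti) {
  return longTasks
    .filter((task) => task.startTime >= fcp && task.startTime < tti)
    .reduce((sum, task) => sum + Math.max(task.duration - 50, 0), 0);
}

const tasks = [
  { startTime: 1000, duration: 120 }, // contributes 70 ms
  { startTime: 2000, duration: 60 },  // contributes 10 ms
  { startTime: 5000, duration: 300 }, // starts after TTI, ignored
];
console.log(totalBlockingTime(tasks, 800, 4000)); // 80
```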

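INP (Interaction to Next Paint)

Listen to user interactions and compute the delay. Note that processingStart - startTime is only the input delay; the full interaction latency that INP is based on is entry.duration:

const observer = new PerformanceObserver((list) => {
  list.getEntries().forEach((entry) => {
    console.log('Input delay:', entry.processingStart - entry.startTime, 'ms');
    console.log('Interaction duration:', entry.duration, 'ms');
  });
});
observer.observe({ type: 'event', buffered: true });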
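Turning many interaction durations into one INP value needs a selection rule. As generally described for INP, the page's score is its worst interaction, except that one of the highest values is skipped for every 50 interactions so a single outlier does not dominate. The sketch below applies that rule to a plain list of durations. It assumes one duration per interaction, whereas a real implementation first groups event entries by `interactionId`; `estimateINP` is a hypothetical name.

```javascript
// Pick the INP candidate from a list of interaction durations (ms).
function estimateINP(durations) {
  if (durations.length === 0) return undefined;
  const sorted = [...durations].sort((a, b) => b - a); // worst first
  // Skip one outlier per 50 interactions, but never run past the list.
  const skip = Math.min(sorted.length - 1, Math.floor(durations.length / 50));
  return sorted[skip];
}

console.log(estimateINP([40, 300, 120])); // 300
```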

Extension: how web-vitals collects metrics

FCP

/*
 * Copyright 2020 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     https://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import { onBFCacheRestore } from './lib/bfcache.js'
import { bindReporter } from './lib/bindReporter.js'
import { doubleRAF } from './lib/doubleRAF.js'
import { getActivationStart } from './lib/getActivationStart.js'
import { getVisibilityWatcher } from './lib/getVisibilityWatcher.js'
import { initMetric } from './lib/initMetric.js'
import { observe } from './lib/observe.js'
import { whenActivated } from './lib/whenActivated.js'
import { FCPMetric, MetricRatingThresholds, ReportOpts } from './types.js'

/** Thresholds for FCP. See https://web.dev/articles/fcp#what_is_a_good_fcp_score */
export const FCPThresholds: MetricRatingThresholds = [1800, 3000]

/**
 * Calculates the [FCP](https://web.dev/articles/fcp) value for the current page and
 * calls the `callback` function once the value is ready, along with the
 * relevant `paint` performance entry used to determine the value. The reported
 * value is a `DOMHighResTimeStamp`.
 */
export const onFCP = (onReport: (metric: FCPMetric) => void, opts?: ReportOpts) => {
    // Set defaults
    opts = opts || {}

    whenActivated(() => {
        const visibilityWatcher = getVisibilityWatcher()
        let metric = initMetric('FCP')
        let report: ReturnType<typeof bindReporter>

        const handleEntries = (entries: FCPMetric['entries']) => {
            entries.forEach(entry => {
                if (entry.name === 'first-contentful-paint') {
                    po!.disconnect()

                    // Only report if the page wasn't hidden prior to the first paint.
                    if (entry.startTime < visibilityWatcher.firstHiddenTime) {
                        // The activationStart reference is used because FCP should be
                        // relative to page activation rather than navigation start if the
                        // page was prerendered. But in cases where `activationStart` occurs
                        // after the FCP, this time should be clamped at 0.
                        metric.value = Math.max(entry.startTime - getActivationStart(), 0)
                        metric.entries.push(entry)
                        report(true)
                    }
                }
            })
        }

        const po = observe('paint', handleEntries)

        if (po) {
            report = bindReporter(onReport, metric, FCPThresholds, opts!.reportAllChanges)

            // Only report after a bfcache restore if the `PerformanceObserver`
            // successfully registered or the `paint` entry exists.
            onBFCacheRestore(event => {
                metric = initMetric('FCP')
                report = bindReporter(onReport, metric, FCPThresholds, opts!.reportAllChanges)

                doubleRAF(() => {
                    metric.value = performance.now() - event.timeStamp
                    report(true)
                })
            })
        }
    })
}

LCP

/*
 * Copyright 2020 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     https://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import { onBFCacheRestore } from './lib/bfcache.js'
import { bindReporter } from './lib/bindReporter.js'
import { doubleRAF } from './lib/doubleRAF.js'
import { getActivationStart } from './lib/getActivationStart.js'
import { getVisibilityWatcher } from './lib/getVisibilityWatcher.js'
import { initMetric } from './lib/initMetric.js'
import { observe } from './lib/observe.js'
import { onHidden } from './lib/onHidden.js'
import { runOnce } from './lib/runOnce.js'
import { whenActivated } from './lib/whenActivated.js'
import { whenIdle } from './lib/whenIdle.js'
import { LCPMetric, MetricRatingThresholds, ReportOpts } from './types.js'

/** Thresholds for LCP. See https://web.dev/articles/lcp#what_is_a_good_lcp_score */
export const LCPThresholds: MetricRatingThresholds = [2500, 4000]

const reportedMetricIDs: Record<string, boolean> = {}

/**
 * Calculates the [LCP](https://web.dev/articles/lcp) value for the current page and
 * calls the `callback` function once the value is ready (along with the
 * relevant `largest-contentful-paint` performance entry used to determine the
 * value). The reported value is a `DOMHighResTimeStamp`.
 *
 * If the `reportAllChanges` configuration option is set to `true`, the
 * `callback` function will be called any time a new `largest-contentful-paint`
 * performance entry is dispatched, or once the final value of the metric has
 * been determined.
 */
export const onLCP = (onReport: (metric: LCPMetric) => void, opts?: ReportOpts) => {
    // Set defaults
    opts = opts || {}

    whenActivated(() => {
        const visibilityWatcher = getVisibilityWatcher()
        let metric = initMetric('LCP')
        let report: ReturnType<typeof bindReporter>

        const handleEntries = (entries: LCPMetric['entries']) => {
            // If reportAllChanges is set then call this function for each entry,
            // otherwise only consider the last one.
            if (!opts!.reportAllChanges) {
                entries = entries.slice(-1)
            }

            entries.forEach(entry => {
                // Only report if the page wasn't hidden prior to LCP.
                if (entry.startTime < visibilityWatcher.firstHiddenTime) {
                    // The startTime attribute returns the value of the renderTime if it is
                    // not 0, and the value of the loadTime otherwise. The activationStart
                    // reference is used because LCP should be relative to page activation
                    // rather than navigation start if the page was prerendered. But in cases
                    // where `activationStart` occurs after the LCP, this time should be
                    // clamped at 0.
                    metric.value = Math.max(entry.startTime - getActivationStart(), 0)
                    metric.entries = [entry]
                    report()
                }
            })
        }

        const po = observe('largest-contentful-paint', handleEntries)

        if (po) {
            report = bindReporter(onReport, metric, LCPThresholds, opts!.reportAllChanges)

            const stopListening = runOnce(() => {
                if (!reportedMetricIDs[metric.id]) {
                    handleEntries(po!.takeRecords() as LCPMetric['entries'])
                    po!.disconnect()
                    reportedMetricIDs[metric.id] = true
                    report(true)
                }
            })

            // Stop listening after input. Note: while scrolling is an input that
            // stops LCP observation, it's unreliable since it can be programmatically
            // generated. See: https://github.com/GoogleChrome/web-vitals/issues/75
            ;['keydown', 'click'].forEach(type => {
                // Wrap in a setTimeout so the callback is run in a separate task
                // to avoid extending the keyboard/click handler to reduce INP impact
                // https://github.com/GoogleChrome/web-vitals/issues/383
                addEventListener(type, () => whenIdle(stopListening), true)
            })

            onHidden(stopListening)

            // Only report after a bfcache restore if the `PerformanceObserver`
            // successfully registered.
            onBFCacheRestore(event => {
                metric = initMetric('LCP')
                report = bindReporter(onReport, metric, LCPThresholds, opts!.reportAllChanges)

                doubleRAF(() => {
                    metric.value = performance.now() - event.timeStamp
                    reportedMetricIDs[metric.id] = true
                    report(true)
                })
            })
        }
    })
}

Aggregating the metrics

export { onCLS, CLSThresholds } from './onCLS.js'
export { onFCP, FCPThresholds } from './onFCP.js'
export { onINP, INPThresholds } from './onINP.js'
export { onLCP, LCPThresholds } from './onLCP.js'
export { onTTFB, TTFBThresholds } from './onTTFB.js'

export * from './deprecated.js'
export * from './types.js'

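Front-end error monitoring

Front-end error monitoring means capturing and reporting errors or exceptions that occur on the user's side, so that developers can find and fix problems promptly. The classic example is Sentry; if you are interested, its source code is well worth studying.

JavaScript errors (JS Errors)

Definition: JavaScript errors thrown at runtime.
How to capture: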

window.onerror = function (message, source, lineno, colno, error) {
    console.log(`Error: ${message}, Source: ${source}, Line: ${lineno}, Column: ${colno}, Error Object: ${error}`);
};

Unhandled Promise rejections (Unhandled Promise Rejection)

Definition: a Promise rejection that no handler catches.
How to capture:

window.addEventListener('unhandledrejection', function (event) {
    console.log(`Unhandled Rejection: ${event.reason}`);
});

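Resource load errors

Definition: a static resource fails to load.
How to capture:

// Resource errors do not bubble, so listen in the capture phase (third argument true).
window.addEventListener('error', function (event) {
    if (event.target !== window) {
        console.log(`Resource Load Error: ${event.target.src || event.target.href}`);
    }
}, true);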
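Because the capture-phase `error` listener receives both script errors and resource failures, an SDK usually normalizes the event into one report shape before sending it. A minimal sketch, assuming the report fields shown here (`type`, `url`, `tag`, `message`, and so on), which are hypothetical and not taken from any particular SDK. The event fields read (`message`, `filename`, `lineno`, `colno`, `error`) are standard `ErrorEvent` properties.

```javascript
// Convert an 'error' event into a plain report object.
// `win` is the window object, passed in so the function is testable.
function toErrorReport(event, win) {
  const target = event.target;
  if (target && target !== win && (target.src || target.href)) {
    return {
      type: 'resource',
      url: target.src || target.href,
      tag: String(target.tagName || '').toLowerCase(),
    };
  }
  return {
    type: 'js',
    message: event.message,
    source: event.filename,
    line: event.lineno,
    column: event.colno,
    stack: event.error ? event.error.stack : undefined,
  };
}

// Browser wiring; skipped where there is no window (for example in Node).
if (typeof window !== 'undefined') {
  window.addEventListener('error', (event) => {
    console.log(toErrorReport(event, window));
  }, true);
}
```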

API request failures (API Failure)

Definition: an API request fails or times out.
How to capture:

fetch(url)
   .then(response => {
        if (!response.ok) {
            console.error(`API Failure: ${response.status} ${response.statusText}`);
        }
    })
   .catch(error => console.error(`Fetch Error: ${error}`));
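The snippet above catches HTTP and network failures but not timeouts, which the definition also mentions. One way to cover all three is a thin wrapper around `fetch` that aborts after a deadline. This is a sketch under assumptions: `monitoredFetch`, its `timeoutMs`, `fetchImpl`, and `report` options, and the report fields are all hypothetical names, not part of any SDK.

```javascript
// Wrap fetch so HTTP errors, network errors and timeouts are all reported.
async function monitoredFetch(url, options = {}) {
  const {
    timeoutMs = 5000,
    fetchImpl = globalThis.fetch, // injectable for tests
    report = console.error,       // where failure records go
    ...init
  } = options;

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const startedAt = Date.now();

  try {
    const response = await fetchImpl(url, { ...init, signal: controller.signal });
    if (!response.ok) {
      report({ type: 'api', kind: 'http', url, status: response.status, duration: Date.now() - startedAt });
    }
    return response;
  } catch (error) {
    // An aborted request surfaces as an error named 'AbortError'.
    const kind = error && error.name === 'AbortError' ? 'timeout' : 'network';
    report({ type: 'api', kind, url, duration: Date.now() - startedAt, message: String(error) });
    throw error;
  } finally {
    clearTimeout(timer);
  }
}
```

Callers use it like `fetch`, for example `monitoredFetch('/api/user', { timeoutMs: 3000 })`, and still receive the response or the thrown error as before.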

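Data reporting

Designing the reporting strategy

Real-time reporting: key metrics (such as LCP and CLS) are reported in real time so they can be monitored promptly.
Batch reporting: non-critical metrics are reported in batches to reduce network overhead.

Example: batch reporting

const metricsBuffer = [];
function addMetricToBuffer(name, value) {
    metricsBuffer.push({ name, value, timestamp: Date.now() });
    if (metricsBuffer.length >= 10) {
        flushMetrics();
    }
}
function flushMetrics() {
    if (metricsBuffer.length === 0) return; // nothing to send
    fetch('/api/report-metrics', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(metricsBuffer),
    });
    metricsBuffer.length = 0; // clear the buffer
}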
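A buffer that only flushes on size can hold a few metrics indefinitely, and anything still buffered when the user leaves is lost. The sketch below adds a timer-based flush and, in the browser, a final flush through `navigator.sendBeacon` when the page is hidden. `createBatcher` and its `send`, `maxSize`, and `intervalMs` options are hypothetical names; the endpoint is the one used in the example above.

```javascript
// Buffer metrics and flush when the batch is full or the timer fires.
function createBatcher({ send, maxSize = 10, intervalMs = 5000 }) {
  let buffer = [];
  let timer = null;

  function flush() {
    if (timer) {
      clearTimeout(timer);
      timer = null;
    }
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = []; // start a fresh buffer before handing the batch off
    send(batch);
  }

  function add(name, value) {
    buffer.push({ name, value, timestamp: Date.now() });
    if (buffer.length >= maxSize) {
      flush();
    } else if (!timer) {
      timer = setTimeout(flush, intervalMs);
    }
  }

  return { add, flush };
}

// Browser wiring: sendBeacon survives page unload, unlike a plain fetch.
if (typeof document !== 'undefined' && typeof navigator !== 'undefined' && navigator.sendBeacon) {
  const batcher = createBatcher({
    send: (batch) => navigator.sendBeacon('/api/report-metrics', JSON.stringify(batch)),
  });
  document.addEventListener('visibilitychange', () => {
    if (document.visibilityState === 'hidden') batcher.flush();
  });
}
```

Passing `send` in as a function keeps the batching logic independent of the transport, so the same batcher can sit in front of `fetch`, `sendBeacon`, or a test double.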

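Reporting channels

HTTP: report through a RESTful API.
Log-based tracking: write performance metrics to logs and process them later in a data-analysis pipeline.
Kafka: in large distributed systems, use Kafka to raise message-processing throughput.

Data cleaning and aggregation

Raw performance data may contain outliers or redundant records, so data cleaning is a necessary step.

Data cleaning

Use Flink for streaming data cleaning:

Filter outliers

Drop values outside a reasonable range (for example LCP > 10 s).
Drop incomplete log records.

Deduplicate

Deduplicate repeated reports from the same user.

Example: Flink filtering and cleaning code

DataStream<String> rawStream = env.addSource(new FlinkKafkaConsumer<>("metrics", new SimpleStringSchema(), properties));
DataStream<Metric> cleanStream = rawStream
   .map(json -> parseMetric(json)) // convert to a Metric object
   .filter(metric -> metric.getValue() > 0 && metric.getValue() < 10000) // drop outliers
   .keyBy(Metric::getUserId)
   // The DataStream API has no built-in distinct(); deduplication needs keyed state.
   // DedupFunction is a hypothetical user-defined KeyedProcessFunction, not a Flink class.
   .process(new DedupFunction());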
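The three cleaning rules (drop incomplete records, drop out-of-range values, deduplicate per user) are easy to prototype outside Flink before committing to a streaming job. The sketch below applies them to an in-memory array. The record shape (`userId`, `name`, `value`, `timestamp`) and the choice of deduplication key are assumptions for illustration; the value range mirrors the filter in the Flink example.

```javascript
// Apply the cleaning rules to an array of raw metric records.
function cleanMetrics(records) {
  const seen = new Set();
  const cleaned = [];
  for (const r of records) {
    // Rule 1: drop incomplete records.
    const complete =
      r &&
      typeof r.userId === 'string' &&
      typeof r.name === 'string' &&
      typeof r.value === 'number' &&
      typeof r.timestamp === 'number';
    if (!complete) continue;

    // Rule 2: drop values outside the accepted range.
    if (!(r.value > 0 && r.value < 10000)) continue;

    // Rule 3: drop repeats of the same user, metric and timestamp.
    const key = `${r.userId}|${r.name}|${r.timestamp}`;
    if (seen.has(key)) continue;
    seen.add(key);

    cleaned.push(r);
  }
  return cleaned;
}
```

Note that a single range for every metric is crude: a CLS of exactly 0 is a valid measurement but fails `value > 0`, so a production pipeline would more likely keep per-metric bounds.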

Data aggregation and storage

Store the cleaned data in ClickHouse for efficient querying and statistics:

Table design

CREATE TABLE performance_metrics (
    timestamp DateTime,
    user_id String,
    metric_name String,
    metric_value Float32
) ENGINE = MergeTree()
PARTITION BY toDate(timestamp)
ORDER BY (metric_name, timestamp);

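Writing data

Flink writes the data to ClickHouse over JDBC:

// In flink-connector-jdbc the sink is built with the JdbcSink.sink(...) factory.
// The JDBC URL and driver class below are assumptions; adjust them to your deployment.
cleanStream.addSink(JdbcSink.sink(
    "INSERT INTO performance_metrics (timestamp, user_id, metric_name, metric_value) VALUES (?,?,?,?)",
    (ps, metric) -> {
        ps.setTimestamp(1, metric.getTimestamp());
        ps.setString(2, metric.getUserId());
        ps.setString(3, metric.getMetricName());
        ps.setFloat(4, metric.getMetricValue());
    },
    new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
        .withUrl("jdbc:clickhouse://localhost:8123/default")
        .withDriverName("com.clickhouse.jdbc.ClickHouseDriver")
        .build()
));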

Data analysis and feedback

Statistics

Use ClickHouse's high-performance queries to compute the distribution of key metrics:

-- P90 and P95 of LCP
SELECT
    quantile(0.90)(metric_value) AS P90,
    quantile(0.95)(metric_value) AS P95
FROM performance_metrics
WHERE metric_name = 'lcp';
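To build intuition for what P90 and P95 mean, here is a small nearest-rank percentile over an in-memory array. It is only an illustration: ClickHouse's `quantile` is an approximate aggregate and may interpolate, so its results on the same data can differ slightly. `percentile` is a hypothetical helper name.

```javascript
// Nearest-rank percentile: the smallest value such that at least
// p * 100 percent of the samples are less than or equal to it.
function percentile(values, p) {
  if (values.length === 0) return undefined;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(Math.ceil(p * sorted.length), 1);
  return sorted[rank - 1];
}

const lcpSamples = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000];
console.log(percentile(lcpSamples, 0.9)); // 900
```

Percentiles are preferred over averages for metrics like LCP because a few very slow page loads would otherwise hide behind many fast ones.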

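[Mid/Senior] Do you understand the full pipeline of a monitoring platform, from metrics to visualization? How would you design a large-scale real-time data-stream logging system?

Environment setup

For any environment preparation work, I recommend reaching for Docker first. We need Kafka and ClickHouse services, so we orchestrate them with Docker Compose, declaring the images to use and their configuration.

Docker installation: first make sure Docker and Docker Compose are installed.
- Docker: install Docker
- Docker Compose: install Docker Compose
Environment preparation:
- In production or development, Docker makes services easy to deploy and manage. For heavyweight services such as Kafka and ClickHouse in particular, containerized management and orchestration avoids tedious installation and configuration steps.
- Docker Compose is a tool for defining and running multi-container Docker applications; the required services are declared in a docker-compose.yml file.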

Walking through docker-compose.yml

version: '3'
services:
  miaoma-monitor-kafka:
    image: bitnami/kafka:3.9.0
    container_name: miaoma-monitor-kafka
    ports:
      - '9093:9093'
      - '9094:9094'
    environment:
      - KAFKA_CFG_NODE_ID=1
      - KAFKA_CFG_PROCESS_ROLES=broker,controller
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=1@192.168.31.16:9093
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093,EXTERNAL://0.0.0.0:9094
      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://miaoma-monitor-kafka:9092,EXTERNAL://localhost:9094
      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,EXTERNAL:PLAINTEXT,PLAINTEXT:PLAINTEXT
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
      - KAFKA_CFG_INTER_BROKER_LISTENER_NAME=PLAINTEXT
      - ALLOW_PLAINTEXT_LISTENER=yes
  miaoma-monitor-clickhouse:
    image: bitnami/clickhouse:25.3.1
    container_name: miaoma-monitor-clickhouse
    ports:
      - '8123:8123'
      - '9000:9000'
    environment:
      - CLICKHOUSE_USER=default
      - CLICKHOUSE_PASSWORD=heyiclickhouse
      - CLICKHOUSE_DATABASE=default

networks:
  default:
    name: miaoma-monitor-kafka-clickhouse-network
    driver: bridge

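Service breakdown

Kafka service (miaoma-monitor-kafka):
- Uses the Bitnami Kafka image bitnami/kafka:3.9.0.
- Defines three listeners: PLAINTEXT on 9092, CONTROLLER on 9093, and EXTERNAL on 9094; ports 9093 and 9094 are published to the host.
- KAFKA_CFG_CONTROLLER_QUORUM_VOTERS and KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP determine how the Kafka broker and the controller communicate.
- KAFKA_CFG_ADVERTISED_LISTENERS makes Kafka advertise its own addresses so that external services can connect.
ClickHouse service (miaoma-monitor-clickhouse):
- Uses the Bitnami ClickHouse image bitnami/clickhouse:25.3.1.
- Exposes the HTTP interface (port 8123) and the native client interface (port 9000).
- Environment variables set the default user, password, and database.

Network configuration

All services communicate over the miaoma-monitor-kafka-clickhouse-network network, which uses the bridge driver. With this setup the services can discover each other and communicate within the network.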

export function init(options: {
    dsn: string;
    integrations?: Integration[]
}) {
    const monitoring = new Monitoring({
        dsn: options.dsn,
        integrations: options.integrations,
    })

    const transport = new BrowserTransport(options.dsn)
    monitoring.init(transport)
    new Errors(transport).init()  // Errors class: /packages/browser/src/index.ts
    new Metrics(transport).init() // Metrics class: /packages/browser-utils/src/index.ts
}
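The `init` function above wires a transport into the core and into each collector. The article does not show those classes, so the following is only a guess at the pattern, with every name (`MemoryTransport`, the simplified `Monitoring`, `helloIntegration`) and the DSN value hypothetical: integrations receive the transport on `init` and push payloads through its `send` method.

```javascript
// A transport that keeps payloads in memory instead of sending them.
class MemoryTransport {
  constructor(dsn) {
    this.dsn = dsn;
    this.sent = [];
  }
  send(payload) {
    this.sent.push({ ...payload, dsn: this.dsn });
  }
}

// Simplified stand-in for the SDK core: it only initializes integrations.
class Monitoring {
  constructor({ dsn, integrations = [] }) {
    this.dsn = dsn;
    this.integrations = integrations;
  }
  init(transport) {
    this.transport = transport;
    for (const integration of this.integrations) {
      integration.init(transport);
    }
  }
}

// An integration is any object with an init(transport) method.
const helloIntegration = {
  name: 'hello',
  init(transport) {
    transport.send({ type: 'integration-ready', name: this.name });
  },
};

const transport = new MemoryTransport('http://localhost:3000/report');
new Monitoring({ dsn: transport.dsn, integrations: [helloIntegration] }).init(transport);
console.log(transport.sent.length); // 1
```

Keeping the transport behind a single `send` method is what lets the same collectors run against a browser transport in production and an in-memory one in tests.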

Preview: