Serving a growing number of data scientists and development teams who are productionizing their ML models and Big Data concepts raises concerns about efficiency, reuse, and security as scale increases. In that setting, component-level multi-tenancy can offer benefits over the whole-stack-deployment-per-tenant approach offered by cloud providers.
In the series of articles "Note Web - Multi-tenant Big Data Lake" we present interlinked notes on building such an agile service environment for development teams and individual researchers. The "Note Web" part of the title reflects the nature of the series, which is a set of architect's notes. It is also the name of an Active Content Management & Development Platform offered as SaaS at NoteInWeb.com, currently in beta testing.
Arguably, multi-tenancy, metering and monitoring, as well as authentication, authorization, and audit, are cornerstones of any SaaS/PaaS offering. That is why this series starts with the first two and covers the implementation of the others in subsequent articles. There is another architectural cornerstone, present implicitly in any SaaS/PaaS, that Note Web makes explicit and puts under the control of end users and tenants. That is the topic of a different series of articles, currently in development.
Multi-tenancy traditionally poses architectural challenges for distributed systems with heterogeneous storage. This cross-cutting concern needs to be implemented holistically across all participating system components and services to deliver a coherent and manageable customer experience.
Considering the compute-storage relationship (and thus the architectural view) in each service or component, the following types of multi-tenancy can be defined:
Although the number of open source Big Data products offering multi-tenancy support out of the box is increasing (Apache Atlas and Knox, to name a few), resource usage metering and policy enforcement remain among the most interesting concerns for anyone who wants to productionize Big Data applications without investing heavily in boilerplate such as authentication, authorization, audit, data governance, and monitoring.
Cloud providers meter compute and storage resources to bill each cloud account as part of their business model. This provides multi-tenancy of type 3 together with metering. So if your budget allows spinning up a separate AWS EMR cluster and tasking a DevOps colleague with setting up a big data stack for every small data science research endeavor, then this article may be merely an architectural exercise for you, and not necessarily a technical guide for your immediate needs.
In this article we review the usage metering support offered by modern big data containers and processors, preferably those with multi-tenancy support of types 1 or 2. The practical part of the article covers building a simple technical POC: defining Note Web business metrics, setting up metrics collection, and rendering status on the application's own dashboards and on Grafana dashboards. This also includes lower-level metrics for the following components:
At a high level, the metering solution requirements can be grouped into:
The Note Web (NW) Tenant Board should display resource and content usage statistics at the tenant account level. The User Board shows the same for an individual user account. CRUD and business event streams are produced by each Note Web microservice and component, routed via Apache Kafka, and processed by Apache Spark Streaming. Event stream statistics are aggregated at the Note Web tag level and charted on the Note Web and Grafana boards. Both push and pull approaches to metrics collection are to be supported.
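To make the aggregation levels concrete, here is a minimal sketch in plain Python. It stands in for the streaming job and does not use Kafka or Spark. The event shape (`UsageEvent` and its fields), the function names, and the sample tenants and tags are all assumptions made for illustration, not the actual Note Web schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class UsageEvent:
    """Hypothetical shape of one CRUD/business event (all fields are assumptions)."""
    tenant: str
    user: str
    action: str              # e.g. "create", "read", "update", "delete"
    tags: Tuple[str, ...]    # tags of the note the event refers to


def aggregate_by_tag(events: Iterable[UsageEvent]) -> Counter:
    """Count events per (tenant, tag, action).

    An event is counted once for every tag assigned to its note.
    """
    counts: Counter = Counter()
    for event in events:
        for tag in event.tags:
            counts[(event.tenant, tag, event.action)] += 1
    return counts


def tenant_totals(events: Iterable[UsageEvent]) -> Counter:
    """Count events per tenant (the level a Tenant Board would show)."""
    return Counter(event.tenant for event in events)


def user_totals(events: Iterable[UsageEvent]) -> Counter:
    """Count events per (tenant, user) (the level a User Board would show)."""
    return Counter((event.tenant, event.user) for event in events)


# Made-up sample data, only to exercise the functions above.
events = [
    UsageEvent("acme", "alice", "create", ("project", "todo")),
    UsageEvent("acme", "bob", "read", ("project",)),
    UsageEvent("globex", "carol", "create", ("todo",)),
    UsageEvent("acme", "alice", "create", ("project",)),
]

tag_counts = aggregate_by_tag(events)
print(tag_counts[("acme", "project", "create")])
print(tenant_totals(events)["acme"])
print(user_totals(events)[("acme", "alice")])
```

In a real pipeline the same grouping keys would be applied to a windowed stream instead of an in-memory list. The list only keeps the sketch self-contained.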
In Note Web, every business entity starts as a note. A note's purpose and behavior, including its data structure and available actions, evolve as tags are assigned to it.
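One way to picture a note that gains structure and behavior through tags is the sketch below. It is an illustrative model only: the `Tag` and `Note` classes, the base field `text`, the base actions `edit` and `delete`, and the sample `TODO` tag are hypothetical and are not taken from the Note Web implementation.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Tag:
    """Hypothetical tag: contributes extra fields and actions to a note."""
    name: str
    fields: FrozenSet[str] = frozenset()
    actions: FrozenSet[str] = frozenset()


@dataclass
class Note:
    """Hypothetical note: starts with plain text and grows as tags are assigned."""
    text: str
    tags: List[Tag] = field(default_factory=list)

    def assign(self, tag: Tag) -> "Note":
        """Assign a tag once; repeated assignment has no effect."""
        if tag not in self.tags:
            self.tags.append(tag)
        return self

    def available_fields(self) -> Set[str]:
        """Base field plus every field contributed by assigned tags."""
        result = {"text"}
        for tag in self.tags:
            result |= tag.fields
        return result

    def available_actions(self) -> Set[str]:
        """Base actions (assumed) plus every action contributed by assigned tags."""
        result = {"edit", "delete"}
        for tag in self.tags:
            result |= tag.actions
        return result


# Made-up tag used only for this example.
TODO = Tag("todo", fields=frozenset({"due_date"}), actions=frozenset({"complete"}))

note = Note("Buy milk")
print(sorted(note.available_actions()))
note.assign(TODO)
print(sorted(note.available_actions()))
print(sorted(note.available_fields()))
```

Composing fields and actions from tags, and not from a fixed class hierarchy, keeps the note itself generic, which matches the idea that a note's purpose is decided after it is created.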
Tags:
Authentication and authorization services are provided by Keycloak for non-Hadoop components and by Apache Ranger for the Hadoop parts of the stack. Hadoop data governance is covered by Apache Atlas. Tenant and user accounts are stored in Note Web's LDAP server. Apache Kafka carries the event streams. Apache Spark Streaming aggregates the event streams at the required levels and pushes the data to the monitoring dashboards. Kubernetes namespaces provide segregation for tenant-specific services.
The main data flow cases are shown in the UML sequence diagram below: