How to Implement Multi-Cloud File System Abstraction in Linkis Using FsPath, HDFSFileSystem, and S3FileSystem
Linkis provides a pluggable file system abstraction layer through the Fs interface, FsPath wrapper, and factory classes like BuildHDFSFileSystem and BuildS3FileSystem, enabling seamless multi-cloud storage operations without changing business logic.
Apache Linkis delivers a unified storage abstraction that decouples your application code from underlying storage implementations. By leveraging the Linkis file system abstraction, developers can write once and deploy across HDFS, Amazon S3, and other cloud storage systems using the same API surface. This architecture relies on a minimal interface contract, a lightweight path wrapper, and runtime factories that inject cross-cutting concerns like auditing and permission checks.
Core Components of the Linkis File System Abstraction
The Fs Interface Contract
At the foundation of the abstraction lies the Fs interface defined in org.apache.linkis.common.io.Fs. This minimal contract specifies the essential operations every storage backend must implement, including fsName(), read(), write(), list(), and delete(). By programming against this interface rather than concrete implementations, your code remains agnostic to whether data resides on HDFS, S3, or local disk.
The interface resides in linkis-commons/linkis-common/src/main/java/org/apache/linkis/common/io/Fs.java and serves as the entry point for all file system operations within the Linkis ecosystem.
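To make the idea of programming against a minimal contract concrete, here is a rough sketch. It does not use the real Fs signatures: MiniFs, InMemoryFs, and MiniFsDemo are hypothetical names invented for illustration, and paths are plain strings rather than FsPath objects. The point is only that the caller in roundTrip never sees the concrete backend.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified stand-in for the Fs contract (not the real signatures).
interface MiniFs {
    String fsName();
    InputStream read(String path) throws IOException;
    OutputStream write(String path, boolean overwrite) throws IOException;
    List<String> list(String dir) throws IOException;
    boolean delete(String path) throws IOException;
}

// In-memory backend, used only to show that callers depend on the interface.
final class InMemoryFs implements MiniFs {
    private final Map<String, byte[]> files = new TreeMap<>();

    @Override public String fsName() { return "memory"; }

    @Override public InputStream read(String path) throws IOException {
        byte[] data = files.get(path);
        if (data == null) throw new IOException("No such file: " + path);
        return new ByteArrayInputStream(data);
    }

    @Override public OutputStream write(String path, boolean overwrite) throws IOException {
        if (!overwrite && files.containsKey(path)) throw new IOException("Already exists: " + path);
        // The bytes become visible in the store when the stream is closed.
        return new ByteArrayOutputStream() {
            @Override public void close() throws IOException {
                files.put(path, toByteArray());
                super.close();
            }
        };
    }

    @Override public List<String> list(String dir) {
        String prefix = dir.endsWith("/") ? dir : dir + "/";
        List<String> result = new ArrayList<>();
        for (String key : files.keySet()) {
            if (key.startsWith(prefix)) result.add(key);
        }
        return result;
    }

    @Override public boolean delete(String path) { return files.remove(path) != null; }
}

final class MiniFsDemo {
    // Business logic sees only MiniFs, never the concrete backend.
    static String roundTrip(MiniFs fs, String path, String text) {
        try {
            try (OutputStream out = fs.write(path, true)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            try (InputStream in = fs.read(path)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new InMemoryFs(), "/tmp/a.txt", "Hello Linkis"));
    }
}
```

Swapping InMemoryFs for another MiniFs implementation would leave roundTrip untouched, which is the property the Fs interface is meant to provide.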
FileSystem Abstract Base Class
The FileSystem abstract class in org.apache.linkis.storage.fs.FileSystem implements most of the Fs contract while adding common utilities for permission handling, ownership validation, and path manipulation. Located at linkis-commons/linkis-storage/src/main/java/org/apache/linkis/storage/fs/FileSystem.java, this class reduces boilerplate for concrete implementations by providing default implementations of canRead(), canWrite(), and canExecute() based on POSIX-style permission strings.
Concrete backends only need to override low-level operations like list(), mkdir(), and renameTo(), while inheriting consistent security semantics from the base class.
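As an illustration of what a check based on POSIX-style permission strings can look like, the sketch below picks the owner, group, or other triplet and tests a single bit. PosixPermissionCheck and its allowed method are hypothetical; the actual FileSystem class may structure this logic differently.

```java
import java.util.Set;

// Hypothetical helper, not the Linkis implementation.
final class PosixPermissionCheck {
    enum Access { READ, WRITE, EXECUTE }

    // permission is a 9-character string such as "rwxr-x---".
    static boolean allowed(String permission, String owner, String group,
                           String user, Set<String> userGroups, Access access) {
        if (permission == null || permission.length() != 9) {
            throw new IllegalArgumentException("Expected 9 permission characters, got: " + permission);
        }
        // Select the owner, group, or other triplet for this user.
        int base;
        if (user.equals(owner)) {
            base = 0;
        } else if (userGroups.contains(group)) {
            base = 3;
        } else {
            base = 6;
        }
        // Within a triplet: r at +0, w at +1, x at +2. A '-' means the bit is not set.
        return permission.charAt(base + access.ordinal()) != '-';
    }

    public static void main(String[] args) {
        // bob is not the owner but belongs to the file's group, which has r-x.
        boolean canWrite = allowed("rwxr-x---", "alice", "analysts",
                "bob", Set.of("analysts"), Access.WRITE);
        System.out.println("bob can write: " + canWrite);
    }
}
```

Because the check needs only the permission string, owner, and group, any backend that can populate those three attributes could reuse it.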
FsPath as the Universal Path Wrapper
FsPath is a lightweight wrapper around string paths that carries essential metadata including owner, group, permissions, and timestamps. Defined in linkis-commons/linkis-common/src/main/java/org/apache/linkis/common/io/FsPath.java, this class ensures that all storage backends operate on the same data structure, eliminating the need for path format conversion when switching between HDFS and S3.
Unlike raw strings, FsPath objects preserve context about the file's security attributes, enabling the permission checks implemented in the abstract FileSystem class to function uniformly across disparate storage systems.
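The following sketch shows the general shape of such a wrapper. PathWithMeta is a hypothetical class written for this article, and its accessor names are assumptions rather than the FsPath API; it only demonstrates how a path string and its security attributes can travel together.

```java
import java.util.Objects;

// Hypothetical stand-in for FsPath: one path string plus the metadata listed above.
final class PathWithMeta {
    private final String path;
    private String owner = "";
    private String group = "";
    private String permission = "---------"; // POSIX-style, e.g. "rw-r--r--"
    private long modifiedAtMillis;

    PathWithMeta(String path) {
        this.path = Objects.requireNonNull(path, "path");
    }

    // Fluent setters so a backend can fill in what it learns from a stat or list call.
    PathWithMeta owner(String owner) { this.owner = owner; return this; }
    PathWithMeta group(String group) { this.group = group; return this; }
    PathWithMeta permission(String permission) { this.permission = permission; return this; }
    PathWithMeta modifiedAt(long millis) { this.modifiedAtMillis = millis; return this; }

    String getPath() { return path; }
    String getOwner() { return owner; }
    String getGroup() { return group; }
    String getPermission() { return permission; }
    long getModifiedAtMillis() { return modifiedAtMillis; }

    // Parent directory, or null when the path is the root or contains no separator.
    String parent() {
        int idx = path.lastIndexOf('/');
        if (path.equals("/") || idx < 0) return null;
        return idx == 0 ? "/" : path.substring(0, idx);
    }

    @Override public String toString() {
        return permission + " " + owner + ":" + group + " " + path;
    }

    public static void main(String[] args) {
        PathWithMeta p = new PathWithMeta("/user/alice/input.txt")
                .owner("alice").group("analysts").permission("rw-r-----");
        System.out.println(p);
        System.out.println(p.parent());
    }
}
```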
Concrete Implementations for Multi-Cloud Storage
HDFSFileSystem for Hadoop Clusters
The HDFSFileSystem class in linkis-commons/linkis-storage/src/main/java/org/apache/linkis/storage/fs/impl/HDFSFileSystem.java wraps the native Hadoop FileSystem API to provide full HDFS integration. This implementation respects HDFS Access Control Lists (ACLs) and integrates with Kerberos authentication when configured.
When you invoke canWrite() on an HDFSFileSystem instance, the implementation consults the NameNode to verify actual HDFS permissions against the current user and group context.
S3FileSystem for AWS Object Storage
For Amazon S3 compatibility, Linkis provides S3FileSystem in linkis-commons/linkis-storage/src/main/java/org/apache/linkis/storage/fs/impl/S3FileSystem.java. This implementation wraps the AWS S3 SDK and emulates directory semantics by treating zero-length marker files as folder indicators.
Because S3 does not enforce POSIX permissions, the canRead() and canWrite() methods in this implementation typically return true, delegating access control to AWS IAM policies configured at the bucket level. The path format remains compatible with FsPath, though the underlying implementation translates logical paths into S3 object keys.
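The directory emulation described above can be modeled without any AWS dependency. The sketch below uses an in-memory sorted map as a stand-in for a bucket; FlatStoreDirs and its methods are hypothetical and make no AWS SDK calls. It shows the path-to-key translation, the zero-length marker convention, and how a prefix scan with a "/" delimiter yields the immediate children of a directory.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeMap;

// In-memory model of a flat object store; an illustration, not S3FileSystem itself.
final class FlatStoreDirs {
    private final TreeMap<String, byte[]> objects = new TreeMap<>();

    // Translate a logical path like "/datasets/a.csv" into the key "datasets/a.csv".
    static String toKey(String logicalPath) {
        return logicalPath.startsWith("/") ? logicalPath.substring(1) : logicalPath;
    }

    private static String toDirKey(String logicalDir) {
        String key = toKey(logicalDir);
        return key.isEmpty() || key.endsWith("/") ? key : key + "/";
    }

    void put(String logicalPath, byte[] data) {
        objects.put(toKey(logicalPath), data);
    }

    // A "directory" is a zero-length object whose key ends with "/".
    void mkdir(String logicalDir) {
        objects.put(toDirKey(logicalDir), new byte[0]);
    }

    boolean isDirectory(String logicalDir) {
        byte[] marker = objects.get(toDirKey(logicalDir));
        return marker != null && marker.length == 0;
    }

    // Immediate children of a directory, found by prefix scan and "/" delimiter.
    Set<String> list(String logicalDir) {
        String prefix = toDirKey(logicalDir);
        Set<String> children = new LinkedHashSet<>();
        for (String key : objects.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) break;   // sorted keys: the prefix range has ended
            String rest = key.substring(prefix.length());
            if (rest.isEmpty()) continue;         // the directory's own marker object
            int slash = rest.indexOf('/');
            // Collapse deeper keys into a single virtual sub-folder entry.
            children.add("/" + prefix + (slash < 0 ? rest : rest.substring(0, slash + 1)));
        }
        return children;
    }

    public static void main(String[] args) {
        FlatStoreDirs store = new FlatStoreDirs();
        store.mkdir("/datasets");
        store.put("/datasets/sample.csv", new byte[] {1});
        store.put("/datasets/raw/part-0", new byte[] {1});
        store.list("/datasets").forEach(System.out::println);
    }
}
```

Note that the sub-folder /datasets/raw/ appears in the listing even though no marker was ever created for it, which is why listings on object stores can show folders that mkdir never produced.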
Factory Pattern and Runtime Selection
BuildHDFSFileSystem and BuildS3FileSystem
Linkis uses factory classes to instantiate the appropriate file system implementation at runtime. The BuildHDFSFileSystem and BuildS3FileSystem classes in linkis-commons/linkis-storage/src/main/java/org/apache/linkis/storage/factory/impl/ handle construction and configuration of their respective backends.
These factories create CGLIB proxies around the concrete FileSystem instances, injecting Linkis IO method interceptors that enable transparent auditing, metrics collection, and permission validation. You obtain a file system instance by calling getFs(String user, String proxyUser) on the appropriate builder.
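The interception idea can be sketched with the JDK's built-in java.lang.reflect.Proxy, used here as a stand-in for CGLIB so the example needs no extra dependency. SimpleFs, PlainFs, and AuditingProxyDemo are hypothetical names, and the audit format is invented; the Linkis interceptors are not reproduced here.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical one-argument file system interface for the demo.
interface SimpleFs {
    String read(String path);
    boolean delete(String path);
}

final class PlainFs implements SimpleFs {
    @Override public String read(String path) { return "contents of " + path; }
    @Override public boolean delete(String path) { return true; }
}

final class AuditingProxyDemo {
    // Wraps target so that every SimpleFs call is recorded before it is delegated.
    static SimpleFs withAudit(SimpleFs target, String user, List<String> auditLog) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Only audit interface methods; toString/hashCode/equals pass straight through.
            if (method.getDeclaringClass() == SimpleFs.class) {
                auditLog.add(user + " -> " + method.getName() + "(" + args[0] + ")");
            }
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the target's own exception to the caller
            }
        };
        return (SimpleFs) Proxy.newProxyInstance(
                SimpleFs.class.getClassLoader(), new Class<?>[] {SimpleFs.class}, handler);
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        SimpleFs fs = withAudit(new PlainFs(), "alice", log);
        fs.read("/user/alice/input.txt");
        fs.delete("/user/alice/old.txt");
        log.forEach(System.out::println);
    }
}
```

One design difference worth keeping in mind: JDK proxies can only wrap interfaces, whereas CGLIB generates a subclass and can therefore proxy a concrete class such as a FileSystem implementation.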
BuildFactory for Label-Based Selection
The BuildFactory interface in linkis-commons/linkis-storage/src/main/java/org/apache/linkis/storage/factory/BuildFactory.java provides a higher-level abstraction for selecting storage backends. The static method BuildFactory.getFactory(String label) maps string labels like "hdfs" or "s3" to their corresponding factory implementations, returning either BuildHDFSFileSystem or BuildS3FileSystem as appropriate.
This label-based resolution enables configuration-driven storage selection, allowing operations teams to change the underlying storage system for a deployment without modifying application code.
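A label-to-factory lookup of this kind can be approximated with a small registry. Everything in the sketch below is hypothetical: FsBuilder, HdfsBuilder, S3Builder, BuilderRegistry, and the demo.storage.label system property are invented for illustration and are not Linkis classes or settings.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical builder contract; the real BuildFactory has its own methods.
interface FsBuilder {
    String fsName();
}

final class HdfsBuilder implements FsBuilder {
    @Override public String fsName() { return "hdfs"; }
}

final class S3Builder implements FsBuilder {
    @Override public String fsName() { return "s3"; }
}

final class BuilderRegistry {
    private static final Map<String, Supplier<FsBuilder>> BUILDERS = new HashMap<>();

    static {
        BUILDERS.put("hdfs", HdfsBuilder::new);
        BUILDERS.put("s3", S3Builder::new);
    }

    // Resolves a label (case-insensitive) to a fresh builder, failing fast on unknown labels.
    static FsBuilder forLabel(String label) {
        Supplier<FsBuilder> supplier = BUILDERS.get(label.toLowerCase(Locale.ROOT));
        if (supplier == null) {
            throw new IllegalArgumentException("Unknown storage label: " + label);
        }
        return supplier.get();
    }

    public static void main(String[] args) {
        // The label comes from configuration, so switching backends needs no code change.
        String label = System.getProperty("demo.storage.label", "hdfs");
        System.out.println("Selected backend: " + forLabel(label).fsName());
    }
}
```

Running the demo with -Ddemo.storage.label=s3 selects the S3 builder, which mirrors how a deployment could change backends through configuration alone.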
Practical Implementation Examples
Reading and Writing to HDFS
The following example demonstrates basic file operations using the HDFS implementation:
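import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import org.apache.linkis.common.io.Fs;
import org.apache.linkis.common.io.FsPath;
import org.apache.linkis.storage.factory.impl.BuildHDFSFileSystem;

// 1. Build an HDFS-backed Fs (proxy mode will be used if the node has HDFS config)
Fs fs = new BuildHDFSFileSystem().getFs("alice", "proxyAlice");
// 2. Wrap the target path in an FsPath
FsPath path = new FsPath("/user/alice/input.txt");
// 3. Write data (overwrite = true)
try (OutputStream out = fs.write(path, true)) {
    out.write("Hello Linkis".getBytes(StandardCharsets.UTF_8));
}
// 4. Read the data back, decoding with the same charset used for writing
try (InputStream in = fs.read(path)) {
    String content = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
        .lines().collect(Collectors.joining("\n"));
    System.out.println(content); // → Hello Linkis
}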
Key classes used: BuildHDFSFileSystem, Fs, FsPath, HDFSFileSystem
Switching to S3 Without Code Changes
To run the same logic against S3, swap the factory implementation; the read, write, and list calls stay the same:
// Obtain an S3-backed Fs (the label "s3" can be used to pick the right factory)
Fs s3Fs = new BuildS3FileSystem().getFs("bob", "proxyBob");
// S3 uses the same FsPath abstraction – the "bucket" is configured in the
// StorageConfiguration, the path is logical without a scheme.
FsPath s3Path = new FsPath("/datasets/sample.csv");
// Write a CSV file to S3
try (OutputStream out = s3Fs.write(s3Path, false)) {
    out.write("id,value\n1,foo\n2,bar".getBytes(StandardCharsets.UTF_8));
}
// List objects under a directory (S3 treats "/" as a virtual folder)
List<FsPath> files = s3Fs.list(new FsPath("/datasets"));
files.forEach(fp -> System.out.println(fp.getPath()));
Key classes used: BuildS3FileSystem, S3FileSystem, FsPath
Using BuildFactory for Storage-Agnostic Code
For maximum portability, use the generic factory to hide concrete implementations:
// BuildFactory decides the concrete implementation based on the label
BuildFactory factory = BuildFactory.getFactory("s3"); // returns BuildS3FileSystem
Fs fs = factory.getFs("carol", "proxyCarol");
// From here the code is identical to the HDFS example
FsPath path = new FsPath("/logs/2024/03/01.log");
fs.mkdir(new FsPath("/logs/2024/03")); // creates virtual "directory" in S3
Handling Permissions Across Storage Types
Always verify permissions before sensitive operations, noting that semantics vary by backend:
FsPath dir = new FsPath("/secure/data");
if (fs.canWrite(dir)) {
    fs.create(new FsPath("/secure/data/new.txt"));
} else {
    throw new IOException("Current user lacks write permission on " + dir.getPath());
}
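The canWrite implementation in HDFSFileSystem consults HDFS ACLs, while S3FileSystem typically returns true because S3 does not enforce POSIX permissions. On S3, a passing check therefore does not guarantee that the write will succeed; the IAM policy on the bucket still decides.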
Summary
- Linkis file system abstraction relies on the Fs interface in linkis-common and the FileSystem abstract class in linkis-storage to provide a unified API across storage backends.
- FsPath serves as the universal path wrapper carrying metadata (owner, group, permissions) for all file systems, located in org.apache.linkis.common.io.FsPath.
- Concrete implementations like HDFSFileSystem and S3FileSystem handle protocol-specific operations while inheriting common utilities from the base class.
- Factory classes (BuildHDFSFileSystem, BuildS3FileSystem, and BuildFactory) instantiate proxied file system instances at runtime, enabling label-based storage selection and transparent interceptor injection.
- Permission semantics differ by backend: HDFS enforces POSIX-style ACLs while S3 delegates to IAM, though the API remains consistent through the abstraction layer.
Frequently Asked Questions
How does Linkis handle directory creation differently between HDFS and S3?
In HDFSFileSystem, the mkdir() operation creates physical directories in the Hadoop namespace with proper inode allocation and permission bits. Conversely, S3FileSystem emulates directories by creating zero-length marker files with trailing slash keys, since S3 is a flat object store without native directory concepts. Both implementations expose the same mkdir(FsPath) signature, so callers use identical code regardless of the underlying storage architecture.
Can I implement a custom file system for another cloud provider using Linkis?
Yes, you can extend the FileSystem abstract class and implement the required abstract methods such as list(), read(), write(), and exists(). Place your implementation in the org.apache.linkis.storage.fs package or a custom package, then create a corresponding factory class extending the factory pattern used by BuildHDFSFileSystem. Your custom factory should return a CGLIB-proxied instance if you require Linkis interceptors for auditing or security.
What is the purpose of the CGLIB proxy created by BuildHDFSFileSystem and BuildS3FileSystem?
The CGLIB proxy wraps the concrete FileSystem implementation to inject Linkis IO method interceptors at runtime. These interceptors enable cross-cutting concerns such as operation auditing, performance metrics collection, and additional permission validation without cluttering the core file system logic. The proxy is created transparently when you call getFs() on the factory, requiring no changes to client code.
How does StorageUtils determine which file system to instantiate?
StorageUtils provides utility methods like isHDFSNode() and constants such as HDFS() and S3() that inspect the runtime environment and configuration properties. The factory classes consult these utilities to determine whether HDFS configuration is present on the node or whether S3 credentials are configured, defaulting to the appropriate implementation. This allows Linkis deployments to automatically adapt to their infrastructure without explicit configuration of the storage backend in application code.