Virastar is a Persian text cleaner.
A Persian text-editing library for PHP.
This repository is a PHP port of brothersincode/virastar.
Official website and Persian usage guide
composer require alirezasedghi/virastar
// Require Composer's autoloader.
require 'vendor/autoload.php';
// Using Virastar namespace.
use Alirezasedghi\Virastar\Virastar;
$virastar = new Virastar();
$text = "فارسي را كمی درست تر می نويسيم";
$cleaned = $virastar->cleanup($text);
echo $cleaned; // Outputs: "فارسی را کمی درست‌تر می‌نویسیم"
Type: array
$virastar = new Virastar([
    "fix_english_numbers" => false,
    "cleanup_line_breaks" => false
]);
Virastar comes with a list of options to control its behavior.
default: true
- replaces windows end of lines with unix eol (\n)
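As a rough illustration of what this normalization amounts to, here is a minimal sketch in plain PHP. The helper name `normalizeEol` is hypothetical and this is not Virastar's actual implementation.

```php
<?php
// Hypothetical helper for illustration only; not part of Virastar's API.
function normalizeEol(string $text): string
{
    // Replace every Windows CRLF pair with a single Unix LF.
    return str_replace("\r\n", "\n", $text);
}

// json_encode makes the line-break characters visible when printed.
echo json_encode(normalizeEol("first\r\nsecond\r\n")), PHP_EOL;
```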
default: true
- converts numeric and selected html character entities into their original characters
default: true
- replaces triple dashes with mdash
- replaces double dashes with ndash
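The order of the two replacements matters: triple dashes have to be handled before double dashes. A minimal sketch, assuming a hypothetical helper `fixDashes` (not Virastar's own code):

```php
<?php
// Hypothetical helper for illustration only; not part of Virastar's API.
function fixDashes(string $text): string
{
    // Handle triple dashes first, otherwise "---" would be consumed as "--" plus "-".
    $text = str_replace('---', "\u{2014}", $text); // U+2014 is the em dash
    // Then turn the remaining double dashes into en dashes (U+2013).
    return str_replace('--', "\u{2013}", $text);
}

echo fixDashes('a---b -- c'), PHP_EOL;
```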
default: true
- removes spaces between dots
- replaces three dots with ellipsis character
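The two steps above can be sketched with two regular expressions. The helper name `fixThreeDots` is hypothetical, and treating runs longer than three dots as a single ellipsis is an assumption of this sketch, not a documented behavior of the library.

```php
<?php
// Hypothetical helper for illustration only; not part of Virastar's API.
function fixThreeDots(string $text): string
{
    // Remove spaces that sit between two dots: ". . ." becomes "...".
    $text = preg_replace('/\. +(?=\.)/u', '.', $text);
    // Replace a run of three or more dots with one ellipsis character (U+2026).
    // Assumption: longer runs also collapse into a single ellipsis.
    return preg_replace('/\.{3,}/u', "\u{2026}", $text);
}

echo fixThreeDots('wait. . . what...'), PHP_EOL;
```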
default: true
- replaces more than one ellipsis with one
- replaces (space|tab|zwnj) after ellipsis with one space
default: true
- re-orders date parts with slash as delimiter
default: true
- replaces english quote pairs (“”) with their persian equivalent («»)
default: true
- replaces english quote marks with their persian equivalent
default: true
- replaces ه followed by (space|ZWNJ|lrm) followed by ی with هٔ
- replaces ه followed by (space|ZWNJ|lrm|nothing) followed by ء with هٔ
- replaces هٓ or single-character ۀ with the standard هٔ
default: false
- converts arabic hamzeh ة to هٔ
default: true
- converts Right-to-left marks followed by persian characters to zero-width non-joiners (ZWNJ)
default: true
- converts all soft hyphens (U+00AD) into zwnj
- removes more than one zwnj
- cleans zwnj after characters that don't connect to the next
- cleans zwnj before and after numbers, english words, spaces and punctuations
- removes unnecessary zwnj on start/end of each line
default: true
- replaces arabic numbers with their persian equivalent
default: true
- replaces english numbers with their persian equivalent
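Digit conversion is a one-to-one character mapping. A minimal sketch, assuming a hypothetical helper `toPersianDigits` (the library's real implementation may differ, for example in how it skips preserved spans such as URIs):

```php
<?php
// Hypothetical helper for illustration only; not part of Virastar's API.
function toPersianDigits(string $text): string
{
    $english = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
    // Persian (Extended Arabic-Indic) digits, U+06F0 through U+06F9.
    $persian = ['۰', '۱', '۲', '۳', '۴', '۵', '۶', '۷', '۸', '۹'];
    return str_replace($english, $persian, $text);
}

echo toPersianDigits('سال 2024'), PHP_EOL;
```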
default: true
- replaces english percent signs with their persian equivalent (U+066A)
- replaces dots between numbers with the decimal separator (U+066B)
- replaces commas between numbers with the thousands separator (U+066C)
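The separator rules only apply when the dot or comma sits between two digits, which a lookahead expresses neatly. The sketch below assumes the digits have already been converted to Persian digits; `fixNumeralSymbols` is a hypothetical name and this is not the library's own code.

```php
<?php
// Hypothetical helper for illustration only; not part of Virastar's API.
// Assumption: digits were already converted to Persian digits (U+06F0..U+06F9).
function fixNumeralSymbols(string $text): string
{
    $digit = '[\x{06F0}-\x{06F9}]';
    // Percent sign to the Arabic percent sign (U+066A).
    $text = str_replace('%', "\u{066A}", $text);
    // Dot between two digits to the decimal separator (U+066B).
    $text = preg_replace("/($digit)\\.(?=$digit)/u", '$1' . "\u{066B}", $text);
    // Comma between two digits to the thousands separator (U+066C).
    return preg_replace("/($digit),(?=$digit)/u", '$1' . "\u{066C}", $text);
}

echo fixNumeralSymbols('۱,۲۳۴.۵%'), PHP_EOL;
```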
default: true
- replaces arabic normal/swash kaf with its persian equivalent
- replaces arabic/urdu/pushtu/uyghur yeh with its persian equivalent
- replaces kurdish he with its persian equivalent
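To make the character replacement concrete, here is a partial sketch that covers only the two most common cases, the Arabic kaf (U+0643) and the Arabic yeh (U+064A). The helper name `fixArabicLetters` is hypothetical, and the swash kaf, other yeh variants and the Kurdish he are deliberately left out.

```php
<?php
// Hypothetical helper for illustration only; not part of Virastar's API.
// Covers just two of the characters listed above.
function fixArabicLetters(string $text): string
{
    return str_replace(
        ["\u{0643}", "\u{064A}"], // Arabic kaf, Arabic yeh
        ["\u{06A9}", "\u{06CC}"], // Persian kaf (keheh), Persian yeh
        $text
    );
}

echo fixArabicLetters('فارسي را كمی'), PHP_EOL;
```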
default: true
- replaces , and ; with their persian equivalents
default: true
- replaces question marks with their persian equivalent
default: true
- puts zwnj between the word and the prefix: mi*, nemi*, bi*
default: true
- puts zwnj between the word and the suffix:
  - *ha, *haye
  - *am, *at, *ash, *ei, *eid, *eem, *and, *man, *tan, *shan
  - *tar, *tari, *tarin
  - *hayee, *hayam, *hayat, *hayash, *hayetan, *hayeman, *hayeshan
default: true
- replaces ه followed by ئ or ی, and then by ی, with ه‌ای
default: true
- removes spaces inside, and more than one space outside, (), [], {}, “” and «»
default: true
- removes space before punctuations
- removes more than one space after punctuations, except followed by new-lines
- removes space after colon that separates time parts
- removes space after dots in numbers
- removes space before some common domain tlds
- removes space between question and exclamation marks
- removes space between same marks
default: true
- cleans zwnj before diacritic characters
- cleans more than one diacritic character
- cleans spaces before diacritic characters
default: false
- removes all diacritic characters
default: true
- converts incorrect persian glyphs to standard characters
default: true
- removes space before parentheses on misc cases
- removes space before braces containing numbers
default: true
- replaces more than one space with just a single one
- cleans whitespace/zwnj between new-lines
default: true
- cleans more than two contiguous line breaks
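A minimal sketch of this rule, assuming it means that any run of three or more line breaks is reduced to two (one blank line). The helper name `cleanupLineBreaks` is hypothetical and this is not the library's implementation.

```php
<?php
// Hypothetical helper for illustration only; not part of Virastar's API.
function cleanupLineBreaks(string $text): string
{
    // Collapse three or more consecutive line breaks into exactly two.
    return preg_replace('/\n{3,}/', "\n\n", $text);
}

// json_encode makes the line-break characters visible when printed.
echo json_encode(cleanupLineBreaks("a\n\n\n\nb")), PHP_EOL;
```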
default: true
- removes space/tab/zwnj/nbsp from the beginning of the new-lines
- removes spaces, tabs, zwnj, direction marks and new lines from the beginning and end of text
default: true
- removes spaces between [] and () ([text] (link) into [text](link))
- removes space between ! and opening brace (! [alt](src) into ![alt](src))
- removes spaces inside double (), [], {} ([[ text ]] into [[text]])
- removes spaces between double (), [], {} ([[text] ] into [[text]])
default: true
- removes extra lines between two items on a markdown list beginning with -, * or #
default: false
- skips converting english numbers of ordered lists in markdown
default: true
- replaces more than one exclamation mark with just one
- replaces more than one english or persian question mark with just one
- re-orders consecutive marks: ?! into !?
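The first two rules above can be sketched with two regular expressions. The helper name `cleanupExtraMarks` is hypothetical, the re-ordering rule is intentionally not covered, and keeping the last mark of a mixed run of question marks is an assumption of this sketch.

```php
<?php
// Hypothetical helper for illustration only; not part of Virastar's API.
// Covers the collapsing rules only, not the re-ordering of consecutive marks.
function cleanupExtraMarks(string $text): string
{
    // Collapse repeated exclamation marks into one.
    $text = preg_replace('/!{2,}/u', '!', $text);
    // Collapse repeated English or Persian (U+061F) question marks into one.
    // Assumption: in a mixed run, the last mark of the run is kept.
    return preg_replace('/(\?|\x{061F}){2,}/u', '$1', $text);
}

echo cleanupExtraMarks('چرا؟؟؟ عالی!!!'), PHP_EOL;
```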
default: true
- replaces kashidas used as parenthetic dashes with ndash
default: true
- converts kashida between numbers to ndash
- removes all kashidas between non-whitespace characters
default: true
- preserves front matter data in the text
default: true
- preserves all html tags in the text
default: true
- preserves all html comments in the text
default: true
- preserves all html entities in the text
default: true
- preserves all uri strings in the text
default: false
- preserves strings inside square brackets ([])
default: false
- preserves strings inside curly braces ({})
default: true
- preserves all no-break space entities in the text
This software is licensed under the MIT License. View the license.